Regression model

Lab Manual: Data Analysis and Regression Modeling with Pandas

Objective

The objective of this lab is to apply data analysis techniques with pandas and to train linear regression models that predict the calorie burnage of players from their training session data.

Requirements

Python 3.x
pandas
numpy
matplotlib
seaborn
scikit-learn
statsmodels

Dataset

The dataset, data.csv, contains health records of players in training. It has the following columns:

Player_ID: Unique identifier for each player.
Duration: Training session duration in minutes.
Average_Pulse: Average heart rate during training.
Max_Pulse: Maximum heart rate during training.
Calorie_Burnage: Calories burned during training.

The first row contains headers, and values are separated by commas.
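
To make the file layout concrete, the sketch below parses a tiny in-memory sample with the same header row and comma separators. The three data rows are made-up values for illustration, not records from data.csv, and the names sample_csv, sample, and expected_columns are hypothetical.

```python
import io

import pandas as pd

# Made-up sample in the same layout as data.csv (header row, comma-separated).
sample_csv = io.StringIO(
    "Player_ID,Duration,Average_Pulse,Max_Pulse,Calorie_Burnage\n"
    "1,60,110,130,409.1\n"
    "2,45,117,145,406.0\n"
    "3,30,103,135,240.0\n"
)

# A header in the first row and a comma separator match read_csv's defaults.
sample = pd.read_csv(sample_csv)

# Check that every column described in the Dataset section is present.
expected_columns = ["Player_ID", "Duration", "Average_Pulse", "Max_Pulse", "Calorie_Burnage"]
missing = [c for c in expected_columns if c not in sample.columns]

print("Missing columns:", missing)
print(sample.dtypes)
```

The same column check can be run on the DataFrame loaded from the real file before any cleaning is attempted.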

Step 1: Load the Dataset

import pandas as pd

# Load the CSV; the first row supplies the column names
df = pd.read_csv("data.csv")
# Preview the first five rows
print(df.head())

Step 2: Data Cleaning

Invalid entries should be removed to maintain data integrity.

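# Drop rows with missing values
df_cleaned = df.dropna()
# Drop rows where 'Average_Pulse' or 'Calorie_Burnage' is non-numeric
df_cleaned = df_cleaned[pd.to_numeric(df_cleaned['Average_Pulse'], errors='coerce').notna()]
# .copy() makes df_cleaned an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning
df_cleaned = df_cleaned[pd.to_numeric(df_cleaned['Calorie_Burnage'], errors='coerce').notna()].copy()
# info() prints its report directly and returns None, so it is not wrapped in print()
df_cleaned.info()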
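
The filters above rely on pd.to_numeric with errors='coerce', which turns any value it cannot parse into NaN instead of raising an error. A minimal sketch on a hypothetical five-element column (the values and the names raw, as_numbers, and valid_mask are invented for illustration):

```python
import pandas as pd

# Hypothetical raw column with one unparseable string and one missing value.
raw = pd.Series(["110", "abc", "98.5", None, "120"])

# Unparseable or missing entries become NaN; everything else becomes a float.
as_numbers = pd.to_numeric(raw, errors="coerce")

# True where the original entry could be read as a number.
valid_mask = as_numbers.notna()

print(as_numbers.tolist())
# Filtering with the mask keeps the surviving values in their original form (strings here).
print(raw[valid_mask].tolist())
```

Because the mask only selects rows, the kept values are still strings in this sketch, which is why an explicit type conversion follows the filtering step.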

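Step 3: Convert Data Types of Average_Pulse and Calorie_Burnage

The filter in Step 2 removes invalid rows but leaves the remaining values in their original type, so both filtered columns are converted explicitly before modeling.

df_cleaned['Average_Pulse'] = df_cleaned['Average_Pulse'].astype('float64')
df_cleaned['Calorie_Burnage'] = df_cleaned['Calorie_Burnage'].astype('float64')
print(df_cleaned.dtypes)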

Step 4: Train a Linear Regression Model

We train a linear regression model to predict Calorie_Burnage from Average_Pulse.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# scikit-learn expects a 2-D feature matrix, hence the double brackets
X = df_cleaned[['Average_Pulse']]
y = df_cleaned['Calorie_Burnage']

# Hold out 20% of the rows for testing; random_state fixes the split for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print("Model Coefficients:", model.coef_)
print("Model Intercept:", model.intercept_)
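
Step 4 creates X_test and y_test but does not use them. One possible way to check the fit on the held-out rows is sketched below with r2_score and mean_absolute_error from sklearn.metrics. The data here is synthetic: the slope of 3.0, intercept of 20.0, and noise level are assumptions chosen for the demonstration, not values from data.csv.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_cleaned; slope, intercept, and noise are assumed values.
rng = np.random.default_rng(42)
pulse = rng.uniform(80, 160, size=200)
calories = 3.0 * pulse + 20.0 + rng.normal(0, 15, size=200)
demo = pd.DataFrame({"Average_Pulse": pulse, "Calorie_Burnage": calories})

X = demo[["Average_Pulse"]]
y = demo["Calorie_Burnage"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# Predict on rows the model has not seen and compare with the true values.
y_pred = model.predict(X_test)
print("R^2 on test set:", r2_score(y_test, y_pred))
print("Mean absolute error:", mean_absolute_error(y_test, y_pred))
```

On the real data, the same two metric calls can be applied directly to the model, X_test, and y_test from Step 4.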

Step 5: Scatter Plot Between Calorie_Burnage and Average_Pulse

import matplotlib.pyplot as plt
import seaborn as sns

# Passing the DataFrame via data= lets seaborn look the columns up by name
sns.scatterplot(data=df_cleaned, x='Average_Pulse', y='Calorie_Burnage')
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.title("Scatter Plot of Calorie Burnage vs Average Pulse")
plt.show()

Step 6: Train an OLS Model Using Average_Pulse and Duration

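import statsmodels.api as sm

# Two predictors: average pulse and session duration
X_ols = df_cleaned[['Average_Pulse', 'Duration']]
X_ols = sm.add_constant(X_ols)  # Adds a column of ones so the model fits an intercept
y_ols = df_cleaned['Calorie_Burnage']

ols_model = sm.OLS(y_ols, X_ols).fit()
# The summary reports coefficients, standard errors, p-values, and R-squared
print(ols_model.summary())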
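
To illustrate what the constant column and the least-squares fit do, the sketch below reproduces the same kind of estimate with numpy.linalg.lstsq instead of statsmodels. This is a swapped-in technique for illustration only. The data is synthetic and noise-free, and the true coefficients (10.0, 2.0, 5.0) are assumptions, so the solver should recover them almost exactly.

```python
import numpy as np

# Synthetic, noise-free data; the coefficients below are assumed for the demo.
rng = np.random.default_rng(0)
pulse = rng.uniform(80, 160, size=50)
duration = rng.uniform(15, 120, size=50)
calories = 10.0 + 2.0 * pulse + 5.0 * duration

# A leading column of ones plays the role of the added constant:
# its coefficient is the intercept.
design = np.column_stack([np.ones_like(pulse), pulse, duration])

# Ordinary least squares: minimize the squared error of design @ coeffs - calories.
coeffs, *_ = np.linalg.lstsq(design, calories, rcond=None)
intercept, b_pulse, b_duration = coeffs

print("Intercept:", intercept)
print("Average_Pulse coefficient:", b_pulse)
print("Duration coefficient:", b_duration)
```

Without the column of ones, the fitted plane would be forced through the origin, which is why the lab adds the constant before fitting.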

Conclusion

In this lab, we applied pandas to clean and preprocess data, trained a linear regression model to predict Calorie_Burnage using Average_Pulse, and visualized the relationship using a scatter plot. We also used an OLS model to evaluate the impact of both Average_Pulse and Duration on Calorie_Burnage.
