The objective of this lab is to apply data analysis techniques with pandas and to train a linear regression model that predicts players' calorie burnage from their training data.
The lab requires the following:
Python 3.x
pandas
numpy
matplotlib
seaborn
scikit-learn
statsmodels
The dataset, data.csv, contains health records of players in training. It has the following columns:
Player_ID: Unique identifier for each player.
Duration: Training session duration in minutes.
Average_Pulse: Average heart rate during training.
Max_Pulse: Maximum heart rate during training.
Calorie_Burnage: Calories burned during training.
The first row contains headers, and values are separated by commas.
import pandas as pd
df = pd.read_csv("data.csv")
print(df.head())
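Before cleaning, it can help to confirm that the loaded frame matches the column list above. The following is a minimal sketch, not part of the original lab: it reads a tiny in-memory CSV whose values are made up for illustration (the real data.csv is not reproduced here), and the names sample_csv, df_sample, expected_columns, and missing_columns are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv; every value below is invented.
sample_csv = io.StringIO(
    "Player_ID,Duration,Average_Pulse,Max_Pulse,Calorie_Burnage\n"
    "1,60,110,130,409.1\n"
    "2,45,117,145,\n"
    "3,30,abc,120,240.0\n"
)
df_sample = pd.read_csv(sample_csv)

# Columns documented in the dataset description.
expected_columns = [
    "Player_ID", "Duration", "Average_Pulse", "Max_Pulse", "Calorie_Burnage",
]
missing_columns = [c for c in expected_columns if c not in df_sample.columns]

print("Missing columns:", missing_columns)
print(df_sample.dtypes)        # a non-numeric dtype hints at invalid entries
print(df_sample.isna().sum())  # count of missing values per column
```

A column that should be numeric but loads with a text dtype, as Average_Pulse does in this sample, is a sign that the cleaning step has work to do.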
Invalid entries should be removed to maintain data integrity.
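# Drop rows with missing values; copy so later assignments modify df_cleaned itself
df_cleaned = df.dropna().copy()
# Convert the modeling columns to numbers; non-numeric entries become NaN
# (Duration is included because the OLS model below uses it)
numeric_cols = ['Duration', 'Average_Pulse', 'Calorie_Burnage']
for col in numeric_cols:
    df_cleaned[col] = pd.to_numeric(df_cleaned[col], errors='coerce')
# Drop the rows where conversion failed
df_cleaned = df_cleaned.dropna(subset=numeric_cols).copy()
df_cleaned.info()  # info() prints its report directly and returns None
df_cleaned['Average_Pulse'] = df_cleaned['Average_Pulse'].astype('float64')
print(df_cleaned.dtypes)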
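The same cleaning idea can be tried on a small frame to see exactly which rows are dropped. This is an illustrative sketch under assumptions: the rows are made up, and the names raw, cleaned, and numeric_cols are hypothetical rather than taken from the lab. It coerces every numeric column in one pass and then drops any row where the coercion produced NaN.

```python
import pandas as pd

# Made-up rows standing in for data.csv; row 2 has a missing value and
# row 3 has a non-numeric pulse reading.
raw = pd.DataFrame({
    "Player_ID": [1, 2, 3, 4],
    "Duration": [60, 45, 30, 60],
    "Average_Pulse": ["110", "117", "abc", "103"],
    "Max_Pulse": [130, 145, 120, 135],
    "Calorie_Burnage": ["409.1", None, "240.0", "282.4"],
})

numeric_cols = ["Duration", "Average_Pulse", "Max_Pulse", "Calorie_Burnage"]

cleaned = raw.copy()
# errors="coerce" turns anything that cannot be parsed into NaN.
cleaned[numeric_cols] = cleaned[numeric_cols].apply(pd.to_numeric, errors="coerce")
# Drop rows where any numeric column is NaN (missing or failed to parse).
cleaned = cleaned.dropna(subset=numeric_cols).reset_index(drop=True)

print(cleaned.dtypes)
print(len(raw) - len(cleaned), "rows dropped")
```

Converting and dropping in this order means missing values and unparseable values are handled by the same dropna call.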
We train a linear regression model to predict Calorie_Burnage from Average_Pulse.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X = df_cleaned[['Average_Pulse']]
y = df_cleaned['Calorie_Burnage']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print("Model Coefficients:", model.coef_)
print("Model Intercept:", model.intercept_)
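The coefficient and intercept describe the fitted line but not how well it predicts unseen sessions. One way to check, sketched below, is to score the model on the held-out 20% split. The lab itself does not report any metrics, so this is an assumption about how evaluation might be done: the data is synthetic (an invented linear relationship plus noise), and the names demo, demo_model, r2, and mae are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_cleaned; the slope, intercept, and noise level
# are invented for illustration only.
rng = np.random.default_rng(0)
pulse = rng.uniform(80, 160, size=200)
calories = 3.0 * pulse + 50.0 + rng.normal(0, 20, size=200)
demo = pd.DataFrame({"Average_Pulse": pulse, "Calorie_Burnage": calories})

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo[["Average_Pulse"]], demo["Calorie_Burnage"],
    test_size=0.2, random_state=42,
)
demo_model = LinearRegression().fit(Xd_train, yd_train)

# Score on rows the model did not see during fitting.
yd_pred = demo_model.predict(Xd_test)
r2 = r2_score(yd_test, yd_pred)
mae = mean_absolute_error(yd_test, yd_pred)

print("Test R^2:", round(r2, 3))
print("Test MAE:", round(mae, 2))
```

R^2 summarizes the share of variance explained, while the mean absolute error is expressed in the same units as Calorie_Burnage, which makes it easier to interpret.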
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x=df_cleaned['Average_Pulse'], y=df_cleaned['Calorie_Burnage'])
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.title("Scatter Plot of Calorie Burnage vs Average Pulse")
plt.show()
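A scatter plot is easier to read against the model when a fitted line is drawn over it. The sketch below is one possible way to do that and is not the lab's own plot: it uses plain matplotlib with a least-squares line from np.polyfit (swapped in here for the scikit-learn model) on synthetic data, and the names demo_pulse, demo_calories, slope, and intercept are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic points; the underlying relationship is invented for illustration.
rng = np.random.default_rng(0)
demo_pulse = rng.uniform(80, 160, size=200)
demo_calories = 3.0 * demo_pulse + 50.0 + rng.normal(0, 20, size=200)

# Degree-1 least-squares fit; polyfit returns the highest power first.
slope, intercept = np.polyfit(demo_pulse, demo_calories, 1)

fig, ax = plt.subplots()
ax.scatter(demo_pulse, demo_calories, s=10, alpha=0.6, label="Sessions")

# Draw the fitted line across the observed pulse range.
line_x = np.array([demo_pulse.min(), demo_pulse.max()])
ax.plot(line_x, slope * line_x + intercept, color="red", label="Least-squares fit")

ax.set_xlabel("Average Pulse")
ax.set_ylabel("Calorie Burnage")
ax.set_title("Calorie Burnage vs Average Pulse with fitted line")
ax.legend()
# In an interactive session, plt.show() would display the figure.
```

Points scattered widely around the line suggest that Average_Pulse alone explains only part of the variation.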
import statsmodels.api as sm
X_ols = df_cleaned[['Average_Pulse', 'Duration']]
X_ols = sm.add_constant(X_ols) # Adds intercept term
y_ols = df_cleaned['Calorie_Burnage']
ols_model = sm.OLS(y_ols, X_ols).fit()
print(ols_model.summary())
In this lab, we applied pandas to clean and preprocess data, trained a linear regression model to predict Calorie_Burnage using Average_Pulse, and visualized the relationship using a scatter plot. We also used an OLS model to evaluate the impact of both Average_Pulse and Duration on Calorie_Burnage.