Exploratory Data Analysis and Prediction Modeling in Python
Chapter 1: Introduction to Exploratory Data Analysis
This article delves into the concept of data storytelling through exploratory data analysis (EDA). When confronted with a large dataset, understanding its implications can be challenging at first glance. It requires thorough effort and analysis to derive valuable insights from the data.
In this article, we will work through a dataset using popular Python libraries such as NumPy, Pandas, Matplotlib, and Seaborn to extract meaningful insights. We will conclude by implementing a prediction model with the Scikit-learn library.
As a data scientist or analyst, you might encounter datasets in unfamiliar domains, such as medical records. Although many columns may contain medical terminology, this should not deter you. With the right tools and techniques, exploration is achievable.
This article will address:
- Approaches to extract meaningful insights from datasets.
- Utilizing machine learning models for prediction.
Section 1.1: Dataset Overview
The dataset in focus is the "heart failure clinical records" dataset. You can download it from Kaggle to follow along.
We will begin by importing the necessary libraries and loading the dataset into the Jupyter Notebook environment:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # needed for the pairplot, KDE, box, count, and violin plots below
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")
The dataset comprises 299 rows with the following columns:
df.columns
Output:
Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
'ejection_fraction', 'high_blood_pressure', 'platelets',
'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
'DEATH_EVENT', 'sex1', 'death'],
dtype='object')
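Here, 'age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', and 'time' are continuous, while 'anaemia', 'diabetes', 'high_blood_pressure', 'sex', and 'smoking' are categorical. The categorical variables are binary (0 or 1): for example, 'sex' encodes female/male and 'high_blood_pressure' encodes the absence or presence of the condition. The target, 'DEATH_EVENT', is also binary. The last two columns in the output, 'sex1' and 'death', are not part of the original file; they are relabeled copies of 'sex' and 'DEATH_EVENT' that we create later in this article.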
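The split into binary and continuous columns can also be checked programmatically rather than by reading the column names. The sketch below is only an illustration: it uses a hypothetical miniature frame named toy with made-up values, and it assumes that "exactly two distinct values" is a good enough definition of a binary column.

```python
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0, 60.0],
    "anaemia": [0, 0, 1, 1, 0],
    "ejection_fraction": [20, 38, 25, 30, 45],
    "smoking": [0, 1, 1, 0, 0],
})

# Count the distinct values in each column.
n_unique = toy.nunique()

# Assumption: a column with exactly two distinct values is binary.
binary_cols = n_unique[n_unique == 2].index.tolist()
continuous_cols = n_unique[n_unique > 2].index.tolist()

print("binary:", binary_cols)
print("continuous:", continuous_cols)
```

On the real dataset, the same two lines applied to df would be expected to reproduce the grouping described above, although integer-valued columns with few distinct values may need a manual check.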
To initiate our EDA, we will examine the distribution of the continuous variables:
df[['age', 'creatinine_phosphokinase',
    'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium']].hist(bins=20, figsize=(15, 15))
plt.show()
These histograms give an overview of how each continuous variable is distributed across the patients in the dataset.
To deepen our understanding, we will calculate descriptive statistics such as mean, median, maximum, minimum, standard deviation, and quartiles:
continuous_var = ['age', 'creatinine_phosphokinase',
'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium']
df[continuous_var].describe()
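One way to read the describe() output is to compare the mean with the median (the 50% row): a mean far above the median suggests a right-skewed variable. The following is a minimal sketch on made-up numbers, not on the heart failure data; the series name toy_measurement is hypothetical.

```python
import pandas as pd

# Made-up, right-skewed values standing in for a lab measurement.
values = pd.Series([60, 80, 100, 120, 150, 250, 580, 2000], name="toy_measurement")

stats = values.describe()
mean = stats["mean"]
median = stats["50%"]      # describe() reports the median as the 50% quantile
skewness = values.skew()   # positive for a long right tail

print(f"mean={mean:.1f}, median={median:.1f}, skewness={skewness:.2f}")
```

A few very large values pull the mean well above the median here, which is the pattern to look for in the real summary table.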
Having reviewed the distributions and summary statistics, we'll analyze how these variables relate to the 'DEATH_EVENT' column. For readability, we first create a relabeled copy of it, 'death', with the values 'yes' and 'no'. Similarly, we create 'sex1' from the 'sex' column with the values 'Male' and 'Female':
df['sex1'] = df['sex'].replace({1: "Male", 0: "Female"})
df['death'] = df['DEATH_EVENT'].replace({1: "yes", 0: "no"})
After these modifications, we'll visualize the relationships between the continuous variables and the death events using a pairplot that distinguishes between 'yes' and 'no' based on death events:
sns.pairplot(df[["creatinine_phosphokinase", "ejection_fraction",
                 "platelets", "serum_creatinine",
                 "serum_sodium", "time", "death"]], hue="death",
             diag_kind='kde', kind='scatter', palette='husl')
plt.show()
In this visualization, red represents death events while green indicates survival. The combination of scatter and density plots effectively illustrates the data distinctions based on death events.
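To further clarify our insights, we will pair a kernel density plot with a boxplot for each continuous variable:
continuous_var = ['age', 'creatinine_phosphokinase',
                  'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium']
# 6 variables x 2 panels each (KDE + boxplot) = 12 panels in a 3 x 4 grid
plt.figure(figsize=(16, 12))
for i, col in enumerate(continuous_var):
    # Left panel: density of the variable for survivors and non-survivors
    plt.subplot(3, 4, i*2+1)
    plt.subplots_adjust(hspace=.25, wspace=.3)
    plt.grid(True)
    plt.title(col)
    # fill=True replaces the deprecated shade=True; the old kernel argument is no longer supported
    sns.kdeplot(df.loc[df["death"]=='no', col], label="alive", color="green", fill=True, cut=0)
    sns.kdeplot(df.loc[df["death"]=='yes', col], label="dead", color="red", fill=True, cut=0)
    plt.legend()
    # Right panel: boxplot of the same variable; order pins 'no' to green and 'yes' to red
    plt.subplot(3, 4, i*2+2)
    sns.boxplot(y=col, data=df, x="death", order=["no", "yes"], palette=["green", "red"])
plt.show()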
To support our findings, we will compute the mean and median for each continuous variable based on death events. This quantitative analysis will offer valuable insights into the dataset's characteristics:
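# Double brackets select a list of columns; passing a bare tuple of names fails in recent pandas versions
summary = df.groupby("death")[["creatinine_phosphokinase", "ejection_fraction", "platelets", "serum_creatinine", "serum_sodium", "time"]].agg(["mean", "median"])
summary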
This analysis reveals clear differences between the two groups, particularly in the 'time' variable.
Next, we will evaluate the impact of high blood pressure on mortality rates across genders:
df.groupby(['sex1', 'high_blood_pressure', 'death']).size().unstack().fillna(0).apply(lambda x: x/x.sum(), axis=1)
Each row of the result gives the share of 'no' and 'yes' outcomes for one combination of sex and blood-pressure status. The proportions differ more between the blood-pressure groups than between the sexes.
Section 1.2: Analyzing Categorical Variables
Besides the 'death' variable, there are five additional categorical variables worth investigating. We will employ a countplot to visualize their relationships with the 'death' variable:
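binary_var = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex1', 'smoking']
plt.figure(figsize=(13, 9))
for i, var in enumerate(binary_var):
    plt.subplot(2, 3, i+1)
    # hue_order pins 'no' to gray and 'yes' to coral regardless of row order
    sns.countplot(data=df, x=var, hue="death", hue_order=["no", "yes"], palette=['gray', "coral"])
    # Set titles and labels after plotting so that countplot does not overwrite them
    plt.title(var, fontsize=14)
    plt.xlabel(var, fontsize=12)
    plt.ylabel("Count", fontsize=12)
    plt.subplots_adjust(hspace=0.4, wspace=0.3)
plt.show()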
The plot above clearly illustrates discrepancies in death events across different variables, such as gender and health conditions. However, it also highlights an imbalance in the dataset regarding smoking habits and diabetes prevalence.
To gain a clearer understanding, we can utilize crosstab analysis for the 'sex1' variable:
x = pd.crosstab(df["sex1"], df['death'])
Following this, we will calculate proportions for better insights:
x.apply(lambda z: z/z.sum(), axis=1)
This reveals that death rates among males and females are approximately equal at around 32%. Similar analyses for the other categorical variables yield the following observations:
- Anemia vs Death: The death rate is higher among those with anemia.
- Diabetes vs Death: No difference in death rates between those with and without diabetes.
- Smoking vs Death: Death rates are nearly identical for smokers and non-smokers.
- High Blood Pressure vs Death: Individuals with high blood pressure exhibit a higher death rate.
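The row-wise proportions used for these comparisons can also be produced in one step with the normalize="index" option of pd.crosstab, looped over several variables. The sketch below runs on a hypothetical eight-row frame with made-up values, so the rates it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "anaemia": [0, 0, 0, 0, 1, 1, 1, 1],
    "smoking": [0, 1, 0, 1, 0, 1, 0, 1],
    "death":   ["no", "no", "no", "yes", "no", "yes", "yes", "yes"],
})

death_rates = {}
for var in ["anaemia", "smoking"]:
    # normalize="index" divides each row by its own total,
    # so every row sums to 1 and the "yes" column is the death rate.
    table = pd.crosstab(toy[var], toy["death"], normalize="index")
    death_rates[var] = table["yes"]
    print(table, "\n")
```

Applied to df with the list of binary columns, the same loop would give one small table per variable instead of repeating the crosstab and apply steps by hand.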
Having examined how each categorical variable relates to death, we now turn to interactions between pairs of variables.
To visualize the distribution of 'time' based on smoking status and gender, we will create a violin plot:
plt.figure(figsize=(8, 6))
# Recent seaborn versions require x, y, and hue to be passed as keyword arguments
a = sns.violinplot(data=df, x="smoking", y="time", hue="sex1", split=True)
plt.title("Smoking vs Time Segregated by Gender", fontsize=14)
plt.xlabel("Smoking", fontsize=12)
plt.ylabel("Time", fontsize=12)
plt.show()
This plot indicates similar distributions for males and females among non-smokers, while notable differences emerge within the smoking population.
Next, we will analyze the relationship between 'ejection_fraction' and 'time' categorized by death status:
sns.lmplot(x="ejection_fraction", y="time",
hue="death", data=df, scatter_kws=dict(s=40, linewidths=0.7, edgecolors='black'))
plt.xlabel("Ejection Fraction", fontsize=12)
plt.ylabel("Time", fontsize=12)
plt.title("Ejection Fraction vs Time Segregated by Death", fontsize=14)
plt.show()
Although this plot provides limited information, it shows a separate regression line for each death status along with its confidence band. The band reflects the uncertainty of the fitted line, not the density of the data.
Next, we will analyze how 'time' varies with 'age', distinguishing between genders:
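# lmplot is a figure-level function that creates its own figure,
# so its size is set with height and aspect rather than plt.figure
g = sns.lmplot(x='age', y='time',
               data=df,
               robust=True,  # robust regression; requires the statsmodels package
               hue="sex1", palette="Set1", col="sex1",  # hue makes the palette take effect
               height=6, aspect=1.2,
               scatter_kws=dict(s=60, linewidths=0.7, edgecolors="black"))
for ax in g.axes.flat:
    ax.set_title(ax.get_title(), fontsize='x-large')
    ax.set_ylabel(ax.get_ylabel(), fontsize='x-large')
    ax.set_xlabel(ax.get_xlabel(), fontsize='x-large')
plt.show()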
Notice how the regression line is steeper for males, indicating a more pronounced decline in 'time' with age.
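A visual impression of "steeper" can be backed up by fitting a line per group and comparing the slopes. The sketch below uses np.polyfit, which performs an ordinary least-squares fit; that is an assumption that differs from the robust fit drawn by lmplot, so the numbers would not match the plotted lines exactly. The frame toy and its values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "age":  [50, 60, 70, 80, 50, 60, 70, 80],
    "time": [200, 180, 160, 140, 200, 160, 120, 80],
    "sex1": ["Female"] * 4 + ["Male"] * 4,
})

slopes = {}
for sex, group in toy.groupby("sex1"):
    # Degree-1 polynomial fit: returns (slope, intercept) of time as a function of age.
    slope, intercept = np.polyfit(group["age"].to_numpy(), group["time"].to_numpy(), deg=1)
    slopes[sex] = slope

for sex, slope in slopes.items():
    print(f"{sex}: {slope:.2f} units of time per year of age")
```

A more negative slope means that 'time' falls faster with age in that group.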
To summarize relationships between all variables, a heatmap can be useful:
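plt.figure(figsize=(10, 10))
# numeric_only=True skips the text columns 'sex1' and 'death', which cannot be correlated
sns.heatmap(df.corr(numeric_only=True), annot=True, linewidths=0.5, cmap="crest")
plt.show()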
This heatmap will assist in the next section regarding death prediction.
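Reading the target's row of a heatmap by eye can be error-prone, so it may help to extract that single column of the correlation matrix and sort it by absolute value. This is a minimal sketch on a hypothetical frame with made-up values; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "time":             [10, 20, 30, 100, 150, 200, 250, 280],
    "serum_creatinine": [2.5, 1.9, 2.1, 1.0, 1.2, 0.9, 1.1, 0.8],
    "smoking":          [0, 1, 0, 1, 0, 1, 0, 1],
    "DEATH_EVENT":      [1, 1, 1, 0, 0, 0, 0, 0],
})

# Correlation of every feature with the target, excluding the target itself.
target_corr = toy.corr()["DEATH_EVENT"].drop("DEATH_EVENT")

# Rank by strength, ignoring the sign.
ranked = target_corr.abs().sort_values(ascending=False)

print(target_corr.round(2))
print("weakest feature:", ranked.index[-1])
```

Pearson correlation only captures linear association, so a weakly correlated feature is a candidate for removal rather than proof that the feature is uninformative.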
Chapter 2: Predicting Death Using Machine Learning
Using the variables from the dataset, we can develop a machine learning model for predicting death outcomes. We will import a decision tree classifier from the Scikit-learn library for this purpose.
Data Preparation
Initially, we need to clean the dataset by removing non-numeric columns created earlier:
df = df.drop(columns=['sex1', 'death'])
Next, we will remove any rows containing null values, ensuring the dataset is clean:
df = df.dropna()
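Before dropping rows, it can be worth counting how many values are actually missing, since dropna() silently discards every row with at least one null. The sketch below illustrates the check on a hypothetical four-row frame with made-up gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with two made-up missing values.
toy = pd.DataFrame({
    "age": [60.0, np.nan, 70.0, 55.0],
    "platelets": [250000.0, 300000.0, np.nan, 210000.0],
    "DEATH_EVENT": [0, 1, 0, 1],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

rows_before = len(toy)
cleaned = toy.dropna()            # removes every row that contains a null
rows_dropped = rows_before - len(cleaned)

print(missing_per_column)
print("rows dropped:", rows_dropped)
```

If the count is zero for every column, the dropna() step is a harmless safeguard; if it is large, imputing values may be preferable to losing rows.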
The correlation heatmap from earlier will guide us in selecting relevant features for the model. Notably, the correlation between 'DEATH_EVENT' and variables such as 'anaemia', 'diabetes', 'sex', and 'smoking' is minimal, prompting us to exclude these columns. The code below also drops 'creatinine_phosphokinase':
df = df.drop(columns=['anaemia', 'diabetes', 'sex', 'smoking',
                      'creatinine_phosphokinase'])
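To put the continuous variables on a comparable scale, we divide each one by its maximum value, so that every scaled value lies between 0 and 1. We work on a copy so that df itself is left unchanged:
df2 = df.copy()  # df2 = df would only create a second name for the same DataFrame
continuous_var = ['age', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium']
for i in continuous_var:
    df2[i] = df2[i] / df2[i].max()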
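Dividing by the maximum of the whole dataset lets information from rows that later end up in the test set influence the scaling. A common alternative, sketched here as an assumption rather than as the approach used in this article, is to learn the maxima from the training rows only with Scikit-learn's MaxAbsScaler, which divides each column by its largest absolute value. The arrays X_train_toy and X_test_toy are made up.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Made-up positive measurements: rows are patients, columns are two features.
X_train_toy = np.array([[40.0, 1.0],
                        [60.0, 2.0],
                        [80.0, 4.0]])
X_test_toy = np.array([[100.0, 3.0]])

scaler = MaxAbsScaler()
# fit_transform learns each column's maximum absolute value from the training rows.
X_train_scaled = scaler.fit_transform(X_train_toy)
# transform reuses the training maxima, so test values may exceed 1.
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled)
print(X_test_scaled)
```

Decision trees split on thresholds and are generally insensitive to this kind of rescaling, so the step matters more if the same features are later fed to a distance-based or gradient-based model.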
The output variable for our model will be 'DEATH_EVENT':
y = df2['DEATH_EVENT']
The remaining variables will serve as input features:
X = df2.drop(columns=['DEATH_EVENT'])
Separation of Training and Test Data
Before training the model, we typically reserve part of the dataset as a test set to evaluate performance. We will use the train_test_split function from Scikit-learn:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=22)
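With a target as imbalanced as roughly one death in three, a purely random split can leave the test set with a noticeably different class ratio. The stratify argument of train_test_split keeps the ratio the same in both parts. The sketch below demonstrates this on made-up labels; the names X_toy and y_toy are hypothetical, and the 32-in-100 ratio is chosen only to resemble the death rate quoted earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 32 positives out of 100 samples.
y_toy = np.array([1] * 32 + [0] * 68)
X_toy = np.arange(100).reshape(-1, 1)   # placeholder feature column

# stratify=y_toy preserves the share of positives in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=22, stratify=y_toy
)

print("positives in train:", y_tr.sum(), "of", len(y_tr))
print("positives in test: ", y_te.sum(), "of", len(y_te))
```

In the article's own split, this would amount to adding stratify=y to the existing call.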
Using the Decision Tree Classifier
For our model, we will employ a decision tree classifier:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
clf_tree = DecisionTreeClassifier(random_state=21, max_depth=7, max_leaf_nodes=6).fit(X_train, y_train)
y_pred = clf_tree.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred, pos_label=0))
print("Recall:", metrics.recall_score(y_test, y_pred, pos_label=0))
print("F Score:", metrics.f1_score(y_test, y_pred, pos_label=0))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred))
Output:
Accuracy: 0.84
Precision: 0.9215686274509803
Recall: 0.8545454545454545
F Score: 0.8867924528301887
Confusion Matrix:
[[47 8]
[ 4 16]]
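The model's accuracy on the test set is 84%. Because pos_label=0, the precision, recall, and F score above describe the survival class (F score 0.89). For the death class, the confusion matrix gives a precision of 16/24 ≈ 0.67 and a recall of 16/20 = 0.80. In other words, given 'age', 'ejection_fraction', 'high_blood_pressure', 'platelets', 'serum_creatinine', 'serum_sodium', and 'time', the model predicts an individual's death status reasonably well on this 75-patient test set.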
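The reported metrics can be reproduced by hand from the confusion matrix, which also makes it easy to see the per-class figures. The sketch below copies the matrix from the output above and assumes Scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true class (0 = survived, 1 = died); columns: predicted class.
cm = np.array([[47, 8],
               [4, 16]])

accuracy = np.trace(cm) / cm.sum()

# Per-class precision: correct predictions divided by the column total.
precision = np.diag(cm) / cm.sum(axis=0)
# Per-class recall: correct predictions divided by the row total.
recall = np.diag(cm) / cm.sum(axis=1)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy: {accuracy:.2f}")
for label in (0, 1):
    print(f"class {label}: precision={precision[label]:.3f}, "
          f"recall={recall[label]:.3f}, f1={f1[label]:.3f}")
```

The class 0 values match the printed precision, recall, and F score, while the class 1 values show how the model performs on the death events themselves.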
Conclusion
In this demonstration, we explored various techniques for understanding a dataset and implementing a predictive model. There are many approaches to EDA, and the methods presented here are just one way to conduct such an analysis.