Exploratory Data Analysis and Prediction Modeling in Python

Chapter 1: Introduction to Exploratory Data Analysis

This article delves into the concept of data storytelling through exploratory data analysis (EDA). When confronted with a large dataset, understanding its implications can be challenging at first glance. It requires thorough effort and analysis to derive valuable insights from the data.

In this article, we will utilize a dataset while leveraging popular Python libraries such as Numpy, Pandas, Matplotlib, and Seaborn to extract meaningful insights. We will conclude by implementing a prediction model using the Scikit-learn library.

As a data scientist or analyst, you might encounter datasets in unfamiliar domains, such as medical records. Although many columns may contain medical terminology, this should not deter you. With the right tools and techniques, exploration is achievable.

This article will address:

  • Approaches to extract meaningful insights from datasets.
  • Utilizing machine learning models for prediction.

Section 1.1: Dataset Overview

The dataset in focus is the "heart failure clinical records" dataset. You can download it from Kaggle to follow along.

We will begin by importing the necessary libraries and loading the dataset into the Jupyter Notebook environment:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("heart_failure_clinical_records_dataset.csv")
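As a quick sanity check, we can confirm the size of the loaded table (the expected shape assumes the original Kaggle file):

df.shape   # should be (299, 13) for the original download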

The dataset comprises 299 rows with the following columns:

df.columns

Output:

Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
       'DEATH_EVENT', 'sex1', 'death'],
      dtype='object')

(The 'sex1' and 'death' columns are readable relabelings of 'sex' and 'DEATH_EVENT' that we create later in this article; the raw CSV does not include them.)

Here, 'age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', and 'time' are continuous variables, while 'anaemia', 'diabetes', 'high_blood_pressure', 'sex', and 'smoking' are categorical. The categorical variables are binary, taking the values 0 or 1 (for example, female/male for 'sex', or absence/presence of high blood pressure).
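As a quick sanity check (the helper list name below is ours), we can confirm that the categorical columns really are binary:

categorical_var = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking']
print(df[categorical_var].nunique())      # each column should report 2 distinct values
df[categorical_var].agg(['min', 'max'])   # ...and those values should be 0 and 1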

To initiate our EDA, we will examine the distribution of the continuous variables:

df[['age', 'creatinine_phosphokinase', 'ejection_fraction',
    'platelets', 'serum_creatinine', 'serum_sodium']].hist(bins=20, figsize=(15, 15))
plt.show()

This histogram provides an overview of the population distribution and the characteristics of the dataset.

To deepen our understanding, we will calculate descriptive statistics such as mean, median, maximum, minimum, standard deviation, and quartiles:

continuous_var = ['age', 'creatinine_phosphokinase', 'ejection_fraction',
                  'platelets', 'serum_creatinine', 'serum_sodium']
df[continuous_var].describe()

Having reviewed the distributions and statistical parameters, we'll analyze how these variables relate to the 'DEATH_EVENT' column. For readability, we will add a 'death' column labeled 'yes'/'no' and a 'sex1' column labeled 'Male'/'Female':

df['sex1'] = df['sex'].replace({1: "Male", 0: "Female"})
df['death'] = df['DEATH_EVENT'].replace({1: "yes", 0: "no"})

After these modifications, we'll visualize the relationships between the continuous variables and the death events using a pairplot that distinguishes between 'yes' and 'no' based on death events:

sns.pairplot(df[["creatinine_phosphokinase", "ejection_fraction",
                 "platelets", "serum_creatinine",
                 "serum_sodium", "time", "death"]],
             hue="death", diag_kind='kde', kind='scatter', palette='husl')
plt.show()

In this visualization, red represents death events while green indicates survival. The combination of scatter and density plots effectively illustrates the data distinctions based on death events.

To sharpen these insights, we will pair a density plot with a boxplot for each continuous variable, split by death event:

continuous_var = ['age', 'creatinine_phosphokinase', 'ejection_fraction',
                  'platelets', 'serum_creatinine', 'serum_sodium']
plt.figure(figsize=(16, 25))
for i, col in enumerate(continuous_var):
    # density plot for this variable, split by death event
    plt.subplot(6, 4, i*2 + 1)
    plt.subplots_adjust(hspace=.25, wspace=.3)
    plt.grid(True)
    plt.title(col)
    # fill=True replaces the deprecated shade=True / kernel arguments
    sns.kdeplot(df.loc[df["death"] == 'no', col], label="alive", color="green", fill=True, cut=0)
    sns.kdeplot(df.loc[df["death"] == 'yes', col], label="dead", color="red", fill=True, cut=0)
    # boxplot for the same variable in the adjacent panel
    plt.subplot(6, 4, i*2 + 2)
    sns.boxplot(y=col, data=df, x="death", palette=["green", "red"])
plt.show()

To support our findings, we will compute the mean and median for each continuous variable based on death events. This quantitative analysis will offer valuable insights into the dataset's characteristics:

y = df.groupby("death")[["creatinine_phosphokinase", "ejection_fraction", "platelets",
                         "serum_creatinine", "serum_sodium", "time"]].agg(["mean", "median"])
y

This analysis reveals significant differences, particularly in the 'time' variable concerning death events.

Next, we will evaluate the impact of high blood pressure on mortality rates across genders:

df.groupby(['sex1', 'high_blood_pressure', 'death']).size().unstack().fillna(0).apply(lambda x: x/x.sum(), axis=1)

The resulting proportions show that death rates differ more by high blood pressure status than by gender.

Section 1.2: Analyzing Categorical Variables

Besides the 'death' variable, there are five additional categorical variables worth investigating. We will employ a countplot to visualize their relationships with the 'death' variable:

binary_var = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex1', 'smoking']
plt.figure(figsize=(13, 9))
for i, var in enumerate(binary_var):
    plt.subplot(2, 3, i+1)
    plt.title(var, fontsize=14)
    plt.xlabel(var, fontsize=12)
    plt.ylabel("Count", fontsize=12)
    plt.subplots_adjust(hspace=0.4, wspace=0.3)
    sns.countplot(data=df, x=var, hue="death", palette=['gray', "coral"])

The plot above shows how death events break down across each of these binary variables, such as gender and the various health conditions. It also highlights that the dataset itself is somewhat imbalanced with respect to smoking and diabetes; a quick numeric check follows.
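To put rough numbers on that imbalance, we can check the normalized value counts for those two columns (a quick check, not part of the original walkthrough):

for var in ['smoking', 'diabetes']:
    print(df[var].value_counts(normalize=True), "\n")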

To gain a clearer understanding, we can utilize crosstab analysis for the 'sex1' variable:

x = pd.crosstab(df["sex1"], df['death'])

Following this, we will calculate proportions for better insights:

x.apply(lambda z: z/z.sum(), axis=1)

This reveals that death rates among males and females are approximately equal, at around 32%. Similar analyses for the other categorical variables (reproduced in the sketch after the list below) yield the following observations:

  • Anemia vs Death: The death rate is higher among those with anemia.
  • Diabetes vs Death: No difference in death rates between those with and without diabetes.
  • Smoking vs Death: Death rates are nearly identical for smokers and non-smokers.
  • High Blood Pressure vs Death: Individuals with high blood pressure exhibit a higher death rate.
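The sketch below reproduces those checks by applying the same crosstab-and-normalize pattern used above for 'sex1' to each binary variable (the loop itself is our addition):

for var in ['anaemia', 'diabetes', 'high_blood_pressure', 'sex1', 'smoking']:
    ct = pd.crosstab(df[var], df['death'])
    print(ct.apply(lambda z: z / z.sum(), axis=1), "\n")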

Having examined the categorical variables on their own, we can now look at how they interact with the continuous variables.

To visualize the distribution of 'time' based on smoking status and gender, we will create a violin plot:

plt.figure(figsize=(8, 6))
a = sns.violinplot(data=df, x="smoking", y="time", hue="sex1", split=True)
plt.title("Smoking vs Time Segregated by Gender", fontsize=14)
plt.xlabel("Smoking", fontsize=12)
plt.ylabel("Time", fontsize=12)
plt.show()

This plot indicates similar distributions for males and females among non-smokers, while notable differences emerge within the smoking population.

Next, we will analyze the relationship between 'ejection_fraction' and 'time' categorized by death status:

sns.lmplot(x="ejection_fraction", y="time", hue="death", data=df,
           scatter_kws=dict(s=40, linewidths=0.7, edgecolors='black'))
plt.xlabel("Ejection Fraction", fontsize=12)
plt.ylabel("Time", fontsize=12)
plt.title("Ejection Fraction vs Time Segregated by Death", fontsize=14)
plt.show()

This plot conveys less than the earlier ones, but the regression lines and their confidence bands still indicate the overall trend and where the data are concentrated.

Next, we will analyze how 'time' varies with 'age', distinguishing between genders:

g = sns.lmplot(x='age', y='time', data=df,
               robust=True,  # robust regression downweights outliers (requires statsmodels)
               palette="Set1", col="sex1",
               scatter_kws=dict(s=60, linewidths=0.7, edgecolors="black"))
for ax in g.axes.flat:
    ax.set_title(ax.get_title(), fontsize='x-large')
    ax.set_ylabel(ax.get_ylabel(), fontsize='x-large')
    ax.set_xlabel(ax.get_xlabel(), fontsize='x-large')
plt.show()

Notice how the regression line is steeper for males, indicating a more pronounced decline in 'time' with age.

To summarize relationships between all variables, a heatmap can be useful:

plt.figure(figsize=(10, 10))
# numeric_only=True skips the text columns 'sex1' and 'death' added earlier
sns.heatmap(df.corr(numeric_only=True), annot=True, linewidths=0.5, cmap="crest")
plt.show()

This heatmap will assist in the next section regarding death prediction.

Chapter 2: Predicting Death Using Machine Learning

Using the variables from the dataset, we can develop a machine learning model for predicting death outcomes. We will import a decision tree classifier from the Scikit-learn library for this purpose.

Data Preparation

Initially, we need to clean the dataset by removing non-numeric columns created earlier:

df = df.drop(columns=['sex1', 'death'])

Next, we will remove any rows containing null values, ensuring the dataset is clean:

df = df.dropna()
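Before deciding which columns to keep, it can also help to read the correlations with 'DEATH_EVENT' as numbers rather than off the heatmap; a quick check (the numeric_only flag simply guards against any remaining non-numeric columns):

df.corr(numeric_only=True)['DEATH_EVENT'].drop('DEATH_EVENT').sort_values()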

The correlation heatmap from earlier will guide us in selecting relevant features for the model. Notably, the correlation between 'DEATH_EVENT' and variables such as 'anaemia', 'diabetes', 'sex', and 'smoking' is minimal, prompting us to exclude these columns:

df = df.drop(columns=['anaemia', 'diabetes', 'sex', 'smoking',
                      'creatinine_phosphokinase'])

To bring the continuous variables onto a comparable scale, we will divide each one by its maximum value:

df2 = df.copy()  # work on a copy so the original DataFrame is left untouched
continuous_var = ['age', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium']
for col in continuous_var:
    df2[col] = df2[col] / df2[col].max()
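Dividing by the column maximum keeps the approach simple; a common alternative, shown here only as a sketch and not used in the rest of the article (the df2_scaled name is ours), is Scikit-learn's MinMaxScaler, which rescales each column to the full [0, 1] range using both its minimum and maximum:

from sklearn.preprocessing import MinMaxScaler

df2_scaled = df.copy()
df2_scaled[continuous_var] = MinMaxScaler().fit_transform(df2_scaled[continuous_var])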

The output variable for our model will be 'DEATH_EVENT':

y = df2['DEATH_EVENT']

The remaining variables will serve as input features:

X = df2.drop(columns=['DEATH_EVENT'])

Separation of Training and Test Data

Before training the model, we typically reserve part of the dataset as a test set to evaluate performance. We will use the train_test_split function from Scikit-learn:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=22)

Using the Decision Tree Classifier

For our model, we will employ a decision tree classifier:

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

clf_tree = DecisionTreeClassifier(random_state=21, max_depth=7, max_leaf_nodes=6).fit(X_train, y_train)
y_pred = clf_tree.predict(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred, pos_label=0))
print("Recall:", metrics.recall_score(y_test, y_pred, pos_label=0))
print("F Score:", metrics.f1_score(y_test, y_pred, pos_label=0))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred))

Output:

Accuracy: 0.84
Precision: 0.9215686274509803
Recall: 0.8545454545454545
F Score: 0.8867924528301887
Confusion Matrix:
 [[47  8]
  [ 4 16]]

The model reaches an accuracy of 84% and an F score of 0.89 (note that precision, recall, and the F score above treat survival, DEATH_EVENT = 0, as the positive class). This suggests that, given 'age', 'ejection_fraction', 'high_blood_pressure', 'platelets', 'serum_creatinine', 'serum_sodium', and 'time', we can predict an individual's death status with a reasonably high degree of accuracy.
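To see which of those inputs the fitted tree actually relies on, we can inspect its feature importances (a short follow-up sketch; the exact ranking will depend on the train/test split and the tree settings):

importances = pd.Series(clf_tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))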

Conclusion

In this demonstration, we explored various techniques for understanding a dataset and implementing a predictive model. There are many approaches to EDA, and the methods presented here are just one way to conduct such an analysis.

For further engagement, feel free to connect with me on Twitter and like my Facebook page for more insights and discussions.

More Reading:

This video titled "Exploratory Data Analysis With Pandas || Python Machine Learning PT.1" provides an excellent introduction to EDA using Pandas in Python.

In this video "Python Data Science: Automating EDA: Univariate Statistics and Visualizations," viewers can learn about automating exploratory data analysis processes and visualizations in Python.
