Predicting Medical Insurance Costs Using Machine Learning
Understanding Health Insurance Policies
Health insurance policies are designed to cover or reduce the costs associated with healthcare services. Several factors can impact the price of these policies, including:
- Age - Premiums can be significantly higher for older individuals, potentially up to three times more than for younger people.
- Location - The area where one resides affects premiums due to variations in competition, regulations, and living costs.
- Tobacco Use - Insurance companies may charge tobacco users up to 50% more than non-tobacco users.
- Individual vs. Family Plans - Policies covering multiple individuals, such as families, typically have higher premiums.
- Plan Category - There are several categories (Bronze, Silver, Gold, Platinum, and Catastrophic) that determine how costs are shared between the insurer and policyholder. Bronze plans usually entail lower monthly premiums with higher out-of-pocket expenses, while Platinum plans have the opposite.
According to the Kaiser Family Foundation, the average annual premium for employer-sponsored health insurance in the U.S. in 2020 was $7,470 for single coverage and $21,342 for family coverage.
Health Insurance Claims Overview
A health insurance claim refers to a request for payment or reimbursement for medical services received by an insured individual. Claims are submitted by the insured or their healthcare provider to their insurer for benefits or payments.
A study by Change Healthcare revealed that hospitals in the U.S. submitted $3 trillion in medical claims in 2016, with $262 billion of these claims being denied initially.
Predicting Healthcare Costs
Creating models that accurately predict individual healthcare expenses can provide significant advantages to insurance companies, healthcare providers, and insured individuals. Reliable cost forecasts can help insurers and providers plan for future needs and allocate resources efficiently. Furthermore, insured individuals can better understand their potential future expenses, aiding them in selecting appropriate insurance plans with suitable deductibles and premiums.
The aim of this article is to accurately forecast insurance costs based on the characteristics of policyholders and identify the key factors influencing these costs. These insights can assist insurers in adjusting annual premiums based on expected treatment costs.
Roadmap for Implementation
To achieve this, we will follow these steps using Python and machine learning techniques:
- Import necessary software libraries.
- Load the dataset.
- Conduct data analysis and exploration.
- Normalize the data.
- Encode and select relevant features.
- Divide the data into training and test sets.
- Train the model using the training data.
- Generate predictions on the test data.
- Evaluate the model's performance.
- Draw conclusions from the evaluations.
The Program
Objective: Predict medical insurance costs based on policyholder characteristics.
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
Loading and Importing the Dataset
# Load data on Google Colab
from google.colab import files
uploaded = files.upload()
# Read the dataset into a DataFrame
insurance = pd.read_csv('insurance.csv')
Data Analysis and Exploration
# Display the first five rows of the dataset
insurance.head()
This dataset, sourced from Kaggle, includes records based on Miri Choi's Medical Cost Personal Datasets. The columns are described as follows:
- age: Age of the policyholder.
- sex: Gender of the policyholder (female, male).
- bmi: Body Mass Index, a measure of body weight relative to height (kg/m²), ideally between 18.5 and 25.
- children: Number of dependents covered by the policy.
- smoker: Smoking status of the policyholder (non-smoker = no, smoker = yes).
- region: Geographic area of the policyholder in the U.S. (northeast, northwest, southeast, southwest).
- charges: Total medical costs billed by health insurance (in USD).
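To make the bmi column concrete, here is a minimal sketch of the standard BMI formula (weight in kilograms divided by height in metres squared). The function names `body_mass_index` and `in_ideal_range` are hypothetical helpers for illustration, and treating 25 as an exclusive upper bound is an assumption.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI = weight (kg) / height (m) squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(bmi: float) -> bool:
    """True when BMI falls in the 18.5-25 band (upper bound exclusive, an assumption)."""
    return 18.5 <= bmi < 25


bmi = body_mass_index(70.0, 1.75)
print(round(bmi, 1), in_ideal_range(bmi))
```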
# Show data columns and types
insurance.info()
Summary of the Data
- Total records: 1,338
- Variables: 7
- No missing values present.
- Data types include both categorical and numerical.
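The summary above can be checked programmatically. The sketch below uses a tiny made-up DataFrame named `sample` with the same seven columns, since the real file is not bundled here; on the actual data you would run the same calls on `insurance`.

```python
import pandas as pd

# Tiny stand-in with the same seven columns as insurance.csv (values are made up).
sample = pd.DataFrame({
    "age": [19, 45, 33],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 30.1, 22.7],
    "children": [0, 2, 1],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "northwest"],
    "charges": [16884.92, 8240.59, 4449.46],
})

# Count missing values per column, then split columns by type.
missing_per_column = sample.isnull().sum()
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(sample.shape)              # (rows, columns)
print(missing_per_column.sum())  # total number of missing cells
print(categorical_cols)
print(numerical_cols)
```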
Statistical Overview
insurance.describe()
Visualizing Relationships
To better understand the data, we visualize the relationships between age, gender, BMI, number of children, smoking status, and region against claim charges.
# Example visualization for age vs claim charges
fig = plt.figure(figsize=(10, 6))
sns.barplot(y='charges', x='age', data=insurance, errorbar=None)  # on seaborn < 0.12 use ci=None instead
plt.xlabel("Age", size=12)
plt.ylabel("Claim Charges ($)", size=12)
Key Observations
- Claim charges typically increase with age.
- Males tend to have slightly higher claim charges.
- Higher BMI correlates with increased claim costs, except for the 40–45 BMI group, where costs dip slightly.
- Claims are notably higher when policies cover two or three children.
- Smokers face significantly higher claim charges, while regional variations are minimal, with the Northwest showing the highest charges.
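Observations like these can also be backed with group averages rather than read off a chart. A minimal sketch, using made-up rows in a DataFrame named `sample` (the numbers are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up rows that only mimic the structure of the dataset.
sample = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "charges": [30000.0, 5000.0, 7000.0, 34000.0, 4000.0, 8000.0],
})

# Average charges per group, and how many times higher the smoker average is.
mean_by_smoker = sample.groupby("smoker")["charges"].mean()
smoker_ratio = mean_by_smoker["yes"] / mean_by_smoker["no"]

print(mean_by_smoker)
print(round(smoker_ratio, 2))
```

The same `groupby` call works for `sex`, `children`, or `region` on the real `insurance` DataFrame.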
Data Normalization
# Normalize charges to a range between 0 and 1
column = 'charges'
insurance[column] = (insurance[column] - insurance[column].min()) / (insurance[column].max() - insurance[column].min())
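Because charges are rescaled to the range 0 to 1, any error metric computed later is in normalized units rather than dollars. The sketch below shows the same min-max formula together with its inverse, so predictions can be mapped back to dollars; `min_max_scale` and `min_max_invert` are hypothetical helper names, and the charge values are made up.

```python
import numpy as np


def min_max_scale(values):
    """Scale values linearly to [0, 1]; also return the (min, max) bounds used."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        raise ValueError("cannot scale a constant column")
    return (values - lo) / (hi - lo), (lo, hi)


def min_max_invert(scaled, bounds):
    """Map scaled values back to the original units."""
    lo, hi = bounds
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo


charges = np.array([1000.0, 5000.0, 9000.0, 21000.0])  # made-up dollar amounts
scaled, bounds = min_max_scale(charges)
restored = min_max_invert(scaled, bounds)

print(scaled)
print(restored)
```

With these bounds, a normalized error of 0.05 corresponds to 0.05 × (21000 − 1000) = 1000 dollars.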
Feature Selection
Categorical data needs to be converted into numerical format. Here each category is mapped to an integer using label encoding.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
insurance["sex"] = le.fit_transform(insurance["sex"])
insurance["smoker"] = le.fit_transform(insurance["smoker"])
insurance["region"] = le.fit_transform(insurance["region"])
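LabelEncoder assigns each category an integer, which implies an ordering that a column like region does not really have. One common alternative, shown here as a sketch rather than the approach used in this article, is one-hot encoding with `pd.get_dummies`. The DataFrame `sample` holds made-up rows.

```python
import pandas as pd

# Made-up rows with two of the categorical columns.
sample = pd.DataFrame({
    "region": ["southwest", "northeast", "northwest", "southeast"],
    "smoker": ["yes", "no", "no", "yes"],
})

# One indicator column per region; no artificial ordering between regions.
encoded = pd.get_dummies(sample, columns=["region"], dtype=int)

# A binary column can simply be mapped to 0/1.
encoded["smoker"] = (encoded["smoker"] == "yes").astype(int)

print(encoded.columns.tolist())
print(encoded)
```

For linear models, passing `drop_first=True` avoids perfectly collinear indicator columns; tree-based models are usually less sensitive to the choice.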
We will generate a correlation matrix to evaluate the relationships between features and the target variable (charges).
plt.figure(figsize=(8,6))
sns.heatmap(insurance.corr(), annot=True, fmt='.0%', cmap='Blues')
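A heatmap is easy to scan, but a sorted list of correlations with the target is often quicker to read. A minimal sketch on made-up, already-encoded data (the DataFrame `sample` and its values are illustrative only):

```python
import pandas as pd

# Made-up, already-encoded rows: smoker is 0/1.
sample = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 25],
    "smoker": [0, 0, 1, 0, 1, 1],
    "charges": [2000.0, 3000.0, 25000.0, 6000.0, 30000.0, 22000.0],
})

# Pearson correlation of every feature with charges, strongest first.
corr_with_target = (
    sample.corr()["charges"]
    .drop("charges")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear relationships, so a low value does not by itself mean a feature is useless.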
Data Splitting
We will separate the dataset into features (X) and target variable (Y), then split it into training and testing sets.
x_data = insurance.drop('charges', axis=1)
y_data = insurance['charges']
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x_data, y_data, test_size=0.25, random_state=42)
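To see what `test_size=0.25` and `random_state=42` do in practice, the sketch below splits placeholder arrays with the same number of rows as the dataset (1,338). The arrays `X` and `y` are dummy values, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 1,338 rows, matching the dataset size.
X = np.arange(1338 * 2).reshape(1338, 2)
y = np.arange(1338)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)

# The same random_state reproduces exactly the same split.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.25, random_state=42)
print(np.array_equal(X_test, X_test_again))
```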
Model Training and Evaluation
We will train three different machine learning models: Linear Regression, Random Forest, and Support Vector Machines, evaluating their performance using R² score and RMSE.
Linear Regression Model
# Linear regression model training
model_lr = LinearRegression()
model_lr.fit(x_training_data, y_training_data)
predictions_lr = model_lr.predict(x_test_data)
Calculate R² Score and RMSE
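from sklearn.metrics import r2_score
r2_lr = r2_score(y_test_data, predictions_lr)
# RMSE is in normalized units because charges were scaled to the range 0-1
rmse_lr = np.sqrt(np.mean((predictions_lr - y_test_data) ** 2))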
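For reference, both metrics can be written out by hand. The sketch below implements the usual definitions, R² = 1 − SS_res / SS_tot and RMSE = sqrt(mean squared error), on made-up values; `r_squared` and `rmse` are hypothetical helper names.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return float(1 - ss_res / ss_tot)


y_true = [0.1, 0.2, 0.4, 0.7]  # made-up normalized charges
y_pred = [0.1, 0.3, 0.3, 0.7]

print(round(r_squared(y_true, y_pred), 4))
print(round(rmse(y_true, y_pred), 4))
```

An R² of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible for a model that does worse than that.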
Random Forest Model
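# Random forest with 10 trees; random_state makes the result reproducible
model_rf = RandomForestRegressor(n_estimators=10, random_state=0)
model_rf.fit(x_training_data, y_training_data)
predictions_rf = model_rf.predict(x_test_data)
r2_rf = model_rf.score(x_test_data, y_test_data)  # score() returns R² for regressors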
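A fitted random forest also exposes `feature_importances_`, which is one way to identify the key factors behind the predictions. The sketch below uses synthetic data in which the smoker flag is built to dominate the target, so it illustrates the mechanics only and makes no claim about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (made up): charges depend strongly on smoker, mildly on age.
rng = np.random.default_rng(0)
n = 300
features = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "smoker": rng.integers(0, 2, size=n),
})
charges = 250 * features["age"] + 20000 * features["smoker"] + rng.normal(0, 500, size=n)

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(features, charges)

# Importances sum to 1; a larger value means the feature contributed more
# to reducing prediction error across the trees' splits.
importances = pd.Series(model.feature_importances_, index=features.columns)
ranked = importances.sort_values(ascending=False)
print(ranked)
```

On the real data, the same two lines applied to `model_rf` and `x_training_data.columns` would give the ranking for this article's model.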
Support Vector Machine Model
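# Support vector regression with default settings (RBF kernel)
model_sv = SVR()
model_sv.fit(x_training_data, y_training_data)
predictions_sv = model_sv.predict(x_test_data)
r2_sv = model_sv.score(x_test_data, y_test_data)  # R² on the test set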
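Support vector regression is sensitive to feature scale, because its RBF kernel works on distances between rows. A common remedy, sketched here as an assumption rather than as part of the original workflow, is to standardize the features inside a pipeline so the scaler is fitted on the training data only. The data below is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (made up): features on very different scales,
# target already in a roughly 0-1 range like the normalized charges.
rng = np.random.default_rng(0)
n = 200
age = rng.integers(18, 65, size=n).astype(float)
bmi = rng.normal(30, 6, size=n)
smoker = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([age, bmi, smoker])
y = 0.6 * smoker + 0.3 * (age - 18) / 46 + rng.normal(0, 0.02, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Same model, with and without standardizing the features first.
unscaled = SVR().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), SVR()).fit(X_train, y_train)

unscaled_r2 = unscaled.score(X_test, y_test)
scaled_r2 = scaled.score(X_test, y_test)
print(f"unscaled R2: {unscaled_r2:.3f}, scaled R2: {scaled_r2:.3f}")
```

In this synthetic setup the 0/1 smoker column is swamped by age and bmi when left unscaled, which is why the scaled pipeline is expected to score higher.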
Display Results
# Compare the test-set R² scores of the three models
models = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest', 'Support Vector Machine'],
    'R² Score': [r2_lr, r2_rf, r2_sv]
})
sns.barplot(x='R² Score', y='Model', data=models)
Conclusion
All three models predicted insurance claim charges from policyholder characteristics, and the Random Forest model performed best, with an R² score of 0.82 on the test set. Future work could focus on expanding the dataset and tuning model hyperparameters for improved predictions.