import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
Heart Disease Prediction Random Forest Classifer
Project Description
In this project we explore a comprehensive dataset on Cardiovascular disease. We’ll apply a Random Forest classifier to develop a system to predict heart attacks in an effective manner.
Importing Libraries
Import Data
= pd.read_csv("datasets/cep1_dataset.csv")
cardio_data cardio_data.shape
(303, 14)
Preliminary Data Analysis
cardio_data.head()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
# Statistical summary
= cardio_data.describe()
summary summary
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 |
mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.399340 | 0.729373 | 2.313531 | 0.544554 |
std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.525860 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
min | 29.000000 | 0.000000 | 0.000000 | 94.000000 | 126.000000 | 0.000000 | 0.000000 | 71.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 47.500000 | 0.000000 | 0.000000 | 120.000000 | 211.000000 | 0.000000 | 0.000000 | 133.500000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 2.000000 | 0.000000 |
50% | 55.000000 | 1.000000 | 1.000000 | 130.000000 | 240.000000 | 0.000000 | 1.000000 | 153.000000 | 0.000000 | 0.800000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 |
75% | 61.000000 | 1.000000 | 2.000000 | 140.000000 | 274.500000 | 0.000000 | 1.000000 | 166.000000 | 1.000000 | 1.600000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 |
max | 77.000000 | 1.000000 | 3.000000 | 200.000000 | 564.000000 | 1.000000 | 2.000000 | 202.000000 | 1.000000 | 6.200000 | 2.000000 | 4.000000 | 3.000000 | 1.000000 |
cardio_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 303 non-null int64
1 sex 303 non-null int64
2 cp 303 non-null int64
3 trestbps 303 non-null int64
4 chol 303 non-null int64
5 fbs 303 non-null int64
6 restecg 303 non-null int64
7 thalach 303 non-null int64
8 exang 303 non-null int64
9 oldpeak 303 non-null float64
10 slope 303 non-null int64
11 ca 303 non-null int64
12 thal 303 non-null int64
13 target 303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
cardio_data.target.value_counts()
target
1 165
0 138
Name: count, dtype: int64
Now, let us look at whether the dataset has null values or not.
sum() cardio_data.isna().
age 0
sex 0
cp 0
trestbps 0
chol 0
fbs 0
restecg 0
thalach 0
exang 0
oldpeak 0
slope 0
ca 0
thal 0
target 0
dtype: int64
From this output, our data does not contain null values and duplicates. So, the data is good which will be further analyzed.
Heart Disease Distribution Report
# Count the occurrences of "disease" and "no disease"
= cardio_data['target'].value_counts()
disease_counts
# Length of disease and no disease
= len(cardio_data[cardio_data['target'] == 1])
disease = len(cardio_data[cardio_data['target'] == 0])
no_disease = 'Heart Disease', 'No Disease'
labels
# Create a pie chart
=(8, 5))
plt.figure(figsize=labels, autopct='%1.1f%%', startangle=90)
plt.pie(disease_counts, labels'Distribution of Heart Disease vs No Disease')
plt.title('equal') # Equal aspect ratio ensures a circular pie
plt.axis(
plt.show()
# Print the length of disease and no disease
print(f"Length of 'disease': {disease}")
print(f"Length of 'no disease': {no_disease}")
Length of 'disease': 165
Length of 'no disease': 138
Let’s identify the categorical variables and explore them using appropriate tools like count plots.
Exploring Categorical Variables
# Identify categorical variables
= ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
categorical_vars
for var in categorical_vars:
=(8, 4))
plt.figure(figsize=var, data=cardio_data)
sns.countplot(xf'Count Plot of {var}')
plt.title( plt.show()
Factors Determining the Occurrence of CVD
We’ll list how other factors in the dataset determine the occurrence of cardiovascular diseases.
=(15,10))
plt.figure(figsize=.01,annot=True,cmap="summer")
sns.heatmap(cardio_data.corr(),linewidth'Correlation Heatmap')
plt.title(
plt.show()
'correlationfigure') plt.savefig(
<Figure size 672x480 with 0 Axes>
From the above heatmap, we can understand that Chest pain(cp)
and target have a positive correlation. It means that whose has a large risk of chest pain results in a greater chance to have heart disease. In addition to chest pain
, thalach
, slope
, and resting
have a positive correlation with the target.
Then, exercise-induced angina(exan) and the target have a negative correlation which means when we exercise, the heart requires more blood, but narrowed arteries slow down the blood flow. In addition to ca, old peak, thal have a negative correlation with the targetet.
Let us see the relation between each features distribution with help of histogram
=(12,12))
cardio_data.hist(figsize'featuresplot') plt.savefig(
Occurrence of CVD Across Age Categories
Let’s study the occurrence of cardiovascular diseases across different age categories with a histogram distribution.
print(f"The youngest patient is {min(cardio_data.age)}")
print(f"The oldest patient is {max(cardio_data.age)}")
print(f"The mean age is {cardio_data.age.mean()}")
print("\n")
# Set the style for Seaborn
set(style='whitegrid')
sns.
# Create the histogram plot using Seaborn
=(8, 4))
plt.figure(figsize=cardio_data, x='age', bins=20, kde=False)
sns.histplot(data'Occurrence of CVD across Age Categories')
plt.title('Age')
plt.xlabel('Frequency')
plt.ylabel( plt.show()
The youngest patient is 29
The oldest patient is 77
The mean age is 54.366336633663366
Composition of Patients by Sex
We’ll study the composition of all patients with respect to the sex category.
# Set the style for Seaborn
set(style='whitegrid')
sns.
# Create the count plot for gender distribution according to the target variable
= plt.subplots(figsize=(8, 5))
fig, ax = cardio_data['sex']
gender = sns.countplot(x='sex', hue='target', data=cardio_data, palette='muted')
ax
# Set the title and x-axis labels
"Gender Distribution according to Target", fontsize=13, weight='bold')
ax.set_title('Gender')
ax.set_xlabel('Count')
ax.set_ylabel('Female', 'Male']) # Explicitly set x-axis labels
ax.set_xticklabels([
# Add percentage annotations
= []
totals for i in ax.patches:
totals.append(i.get_height())= sum(totals)
total for i in ax.patches:
+ 0.05, i.get_height() - 15,
ax.text(i.get_x() f"{round((i.get_height() / total) * 100, 2)}%", fontsize=14,
='white', weight='bold')
color
plt.tight_layout() plt.show()
/tmp/ipykernel_73277/1354869852.py:13: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
ax.set_xticklabels(['Female', 'Male']) # Explicitly set x-axis labels
From the bar graph, we can observe that among disease patients, males are higher than females.
Detecting Heart Attacks based on Resting Blood Pressure
We’ll study if heart attacks can be detected based on anomalies in resting blood pressure (trestbps).
=(15, 5))
plt.figure(figsize
# plotting subplot #1 - Box plot of resting blood pressure by target variable
1,2,1)
plt.subplot(='target', y='trestbps', data=cardio_data)
sns.boxplot(x'Resting Blood Pressure (trestbps) vs. Target Variable')
plt.title(
# plotting subplot #2 - Scatter plot of resting blood pressure by target variable
1,2,2)
plt.subplot(=cardio_data.age, x=cardio_data.trestbps, hue= cardio_data.target)
sns.scatterplot(y
plt.show()
We observe that people who develop CVD have lower resting bp than people who don’t develop CVD
Relationship between Cholesterol Levels and Target Variable
Let’s describe the relationship between cholesterol levels (chol) and the target variable.
=(15, 5))
plt.figure(figsize
# plotting subplot #1 - Box plot of cholesterol levels by target variable
1,2,1)
plt.subplot(='target', y='chol', data=cardio_data)
sns.boxplot(x'Cholesterol Levels (chol) vs. Target Variable')
plt.title(
# plotting subplot #2 - Scatter plot of cholesterol levels by target variable
1,2,2)
plt.subplot(=cardio_data.chol, y=cardio_data.target, hue=cardio_data.target)
sns.scatterplot(x='center left')
plt.legend(loc
plt.show()
We can infer that people who develop CVD have much higher level of cholestrol and concentrated within the range between 200 - 300.
Relationship between Peak Exercising and Heart Attack Occurrence
We’ll explore the relationship between peak exercising (oldpeak) and the occurrence of a heart attack.
=(15, 5))
plt.figure(figsize
# plotting subplot #1 - Box plot of peak exercising (oldpeak) by target variable
1,2,1)
plt.subplot(='target', y='oldpeak', data=cardio_data)
sns.boxplot(x'Peak Exercising (oldpeak) vs. Target Variable')
plt.title(
# plotting subplot #2 - Bar plot of peak exercising (oldpeak) by target variable
1,2,2)
plt.subplot(=cardio_data.oldpeak, x=cardio_data.slope, hue=cardio_data.target)
sns.barplot(y
plt.show()
We can infer that oldpeak is lower in people who develop CVD.
Checking if Thalassemia is a Major Cause of CVD
Let’s check if thalassemia is a major cause of cardiovascular diseases.
# Count plot of thalassemia by target variable
=(15, 5))
plt.figure(figsize='thal', hue='target', data=cardio_data)
sns.countplot(x'Thalassemia vs. Target Variable')
plt.title( plt.show()
Thalassemia can be a major cause for CVD as it is more associated in 2.
Pair Plot to Understand Variable Relationships
Lastly, we’ll use a pair plot to understand the relationship between all the given variables.
# Pair plot
='target')
sns.pairplot(cardio_data, hue'Pair Plot')
plt.title( plt.show()
Automated EDA using ydata-profiling report
In the provided code, we are using the ydata_profiling
library to create a profiling report for the “cardio_data” dataset. The ProfileReport
function takes the “cardio_data” dataset as input and generates a comprehensive profiling report. This report will include valuable insights such as data statistics, missing values, data types, unique values, correlation, and visualizations for better data understanding. The title of the report is set as “Heart Disease Profiling Report.” This analysis will help users to gain a deeper understanding of the dataset, identify potential issues, and make informed decisions during the data exploration and preprocessing stages.
from ydata_profiling import ProfileReport
= ProfileReport(cardio_data, title="Heart Disease Profiling Report") profile_cardio_data
Improve your data and profiling with ydata-sdk, featuring data quality scoring, redundancy detection, outlier identification, text validation, and synthetic data generation.
profile_cardio_data.to_notebook_iframe()
0%| | 0/14 [00:00<?, ?it/s]100%|██████████| 14/14 [00:00<00:00, 182929.15it/s]
# Exporting the report to HTML.
"Heart Disease Profiling Report.html") profile_cardio_data.to_file(
Model Building and Implementation
Train-Test Split
We split the whole dataset into trainset and testset which contains 75% train
and 25% test
.
We can include this train set into classifiers to train our model and the test set is useful for predicting the performance of the model by different classifiers.
# Modeling
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint
# Tree and Data Visualisation
import graphviz
from sklearn.tree import export_graphviz
# Features
= cardio_data.drop('target', axis=1)
X # Target Variable
= cardio_data['target']
y
print('shape of X and y respectively :', X.shape, y.shape)
shape of X and y respectively : (303, 13) (303,)
= train_test_split(X,y,test_size=0.25,random_state=40)
X_train, X_test, y_train, y_test
print(f'X_train: {X_train.shape}')
print(f'X_test: {X_test.shape}')
print(f'y_train: {y_train.shape}')
print(f'y_test: {y_test.shape}')
X_train: (227, 13)
X_test: (76, 13)
y_train: (227,)
y_test: (76,)
Logistic Regression with Statsmodel
import statsmodels.api as sm
= sm.Logit(y_train, X_train).fit() logreg_stats
Optimization terminated successfully.
Current function value: 0.371141
Iterations 7
print(logreg_stats.summary())
Logit Regression Results
==============================================================================
Dep. Variable: target No. Observations: 227
Model: Logit Df Residuals: 214
Method: MLE Df Model: 12
Date: Thu, 24 Apr 2025 Pseudo R-squ.: 0.4618
Time: 07:01:40 Log-Likelihood: -84.249
converged: True LL-Null: -156.55
Covariance Type: nonrobust LLR p-value: 7.046e-25
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
age 0.0228 0.023 0.997 0.319 -0.022 0.068
sex -1.6290 0.519 -3.138 0.002 -2.647 -0.612
cp 0.9382 0.220 4.265 0.000 0.507 1.369
trestbps -0.0252 0.012 -2.170 0.030 -0.048 -0.002
chol -0.0032 0.004 -0.755 0.450 -0.011 0.005
fbs -0.2011 0.608 -0.331 0.741 -1.392 0.990
restecg 0.8217 0.390 2.109 0.035 0.058 1.585
thalach 0.0346 0.010 3.498 0.000 0.015 0.054
exang -0.6767 0.479 -1.414 0.157 -1.615 0.261
oldpeak -0.6029 0.250 -2.411 0.016 -1.093 -0.113
slope 0.4173 0.397 1.051 0.293 -0.361 1.195
ca -0.6518 0.214 -3.044 0.002 -1.072 -0.232
thal -0.7534 0.317 -2.378 0.017 -1.374 -0.132
==============================================================================
= logreg_stats.predict(X_test)
logreg_prediction = list(map(round, logreg_prediction))
predictions print(predictions)
[0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
= accuracy_score(y_test, predictions)
accuracy = recall_score(y_test, predictions)
recall = precision_score(y_test, predictions)
precision = f1_score(y_test, predictions)
f1
# Displaying the scores with their corresponding percentages using f-strings
print(f'Accuracy score: {accuracy} -> {accuracy * 100:.2f}%')
print(f'Recall score: {recall} -> {recall * 100:.2f}%')
print(f'Precision score: {precision} -> {precision * 100:.2f}%')
print(f'f1 score: {f1} -> {f1 * 100:.2f}%')
= confusion_matrix(y_test,predictions)
logreg_stats_cm =True) sns.heatmap(logreg_stats_cm,annot
Accuracy score: 0.881578947368421 -> 88.16%
Recall score: 0.9285714285714286 -> 92.86%
Precision score: 0.8666666666666667 -> 86.67%
f1 score: 0.896551724137931 -> 89.66%
Logistic Regression with scikit-learn
from sklearn.linear_model import LogisticRegression
# initialize the model
= LogisticRegression(max_iter=1000)
logreg_model # fit the model
logreg_model.fit(X_train, y_train)# prediction
= logreg_model.predict(X_test) Y_prediction
= logreg_model.score(X_train, y_train)
score print(f'Training Score: {score} -> {score * 100:.2f}%')
= logreg_model.score(X_test, y_test)
score print(f'Testing Score: {score} -> {score * 100:.2f}%')
print("\n")
# Heart-Disease yes or no? 1/0
= pd.DataFrame({'Predicted':Y_prediction})
output print(output.head())
= output.loc[output.Predicted == 1]["Predicted"]
people
= 0
rate_people if len(people) > 0 :
= len(people)/len(output)
rate_people
print(f'% of people predicted with heart-disease: {rate_people} -> {rate_people * 100:.2f}%')
= score
score_logreg = output out_logreg
Training Score: 0.8414096916299559 -> 84.14%
Testing Score: 0.881578947368421 -> 88.16%
Predicted
0 0
1 1
2 1
3 0
4 1
% of people predicted with heart-disease: 0.5921052631578947 -> 59.21%
Now let’s evaluate a logistic regression model’s performance on the test dataset using a classification report, which includes precision, recall, F1-score, and support metrics for each class. Additionally, we will generate a heatmap of the confusion matrix to visualize the true positive, false positive, true negative, and false negative predictions. This assessment allows a quick and comprehensive understanding of how well the model performs in predicting the target variable for different classes, aiding in the evaluation and interpretation of its predictive capabilities.
from sklearn.metrics import confusion_matrix
print(classification_report(y_test,Y_prediction))
= confusion_matrix(y_test,Y_prediction)
logreg_confusion_matrix = [0,1]
class_names = plt.subplots()
fig,ax = np.arange(len(class_names))
tick_marks
plt.xticks(tick_marks,class_names)
plt.yticks(tick_marks,class_names)= True, cmap = 'Greens', fmt = 'g')
sns.heatmap(pd.DataFrame(logreg_confusion_matrix), annot 'top')
ax.xaxis.set_label_position(
plt.tight_layout()'Confusion matrix for logistic regression')
plt.title('Actual label')
plt.ylabel('Predicted label')
plt.xlabel( plt.show()
precision recall f1-score support
0 0.90 0.82 0.86 34
1 0.87 0.93 0.90 42
accuracy 0.88 76
macro avg 0.88 0.88 0.88 76
weighted avg 0.88 0.88 0.88 76
The Receiver Operating Characteristic (ROC) curve calculates the probabilities of the positive class (heart disease present) using predict_proba
, and then computes the false positive rate (FPR), true positive rate (TPR), and corresponding thresholds using roc_curve
from sklearn.metrics
. The resulting curve is plotted using matplotlib, with the green line representing the ROC curve and a dashed line indicating the baseline (random guessing) performance. The plot also includes vertical and horizontal lines at 0 and 1, respectively, to mark the extreme points. The area under the ROC curve (AUC) is displayed on the plot, providing a single-value summary of the model’s discrimination performance.
A higher AUC indicates better model performance in distinguishing between positive and negative cases. This visualization helps to assess the model’s overall performance and choose an appropriate probability threshold for classification based on the trade-off between false positives and true positives.
# ROC Curve
= logreg_model.predict_proba(X_test)[:,1]
y_probabilities = roc_curve(y_test,y_probabilities)
false_positive_rate_knn, true_positive_rate_knn, threshold_knn
=(10,6))
plt.figure(figsize'ROC for logistic regression')
plt.title(=5, color='green')
plt.plot(false_positive_rate_knn, true_positive_rate_knn, linewidth0,1],ls='--',linewidth=5)
plt.plot([0,0],[1,0],c='.5')
plt.plot([1,1],c='.5')
plt.plot([0.2,0.6,'AUC: {:.2f}'.format(roc_auc_score(y_test,y_probabilities)),size= 16)
plt.text('False Positive Rate')
plt.xlabel('True Positive Rate')
plt.ylabel( plt.show()
Random Forest Classifier
Let’s first fit and the evaluate a base model
from sklearn.ensemble import RandomForestClassifier
# initialize the model
= RandomForestClassifier(n_estimators=100) # , max_depth=5, random_state=1
rfc_model # fit the model
rfc_model.fit(X_train, y_train)# prediction
= rfc_model.predict(X_test) Y_prediction
= rfc_model.score(X_train, y_train)
score print(f'Training Score: {score} -> {score * 100:.2f}%')
= rfc_model.score(X_test, y_test)
score print(f'Testing Score: {score} -> {score * 100:.2f}%')
= accuracy_score(y_test, Y_prediction)
accuracy print(f'Accuracy score: {accuracy} -> {accuracy * 100:.2f}%')
print("\n")
# Heart-Disease yes or no? 1/0
= pd.DataFrame({'Predicted':Y_prediction})
output print(output.head())
= output.loc[output.Predicted == 1]["Predicted"]
people
= 0
rate_people if len(people) > 0 :
= len(people)/len(output)
rate_people
print(f'% of people predicted with heart-disease: {rate_people} -> {rate_people * 100:.2f}%')
= score
score_rfc = output out_rfc
Training Score: 1.0 -> 100.00%
Testing Score: 0.8157894736842105 -> 81.58%
Accuracy score: 0.8157894736842105 -> 81.58%
Predicted
0 1
1 0
2 1
3 0
4 1
% of people predicted with heart-disease: 0.5526315789473685 -> 55.26%
Visualizing the Results
Let’s display the first three decision trees from the random forest classifier (rfc_model). The loop iterates three times, and for each iteration, it retrieves one decision tree from the random forest using rfc_model.estimators_[i]
. It then uses export_graphviz
from sklearn.tree
to generate a Graphviz dot data representation of the decision tree. The visualization options include displaying feature names, filling tree nodes with colors based on class proportions, and limiting the tree depth to 2 levels (max_depth=2
). The graphviz.Source
function is used to create a graph from the dot data, and the display
function shows the decision tree as an image in the notebook for visual inspection.
This export and display process allows for a closer examination of individual trees within the random forest ensemble, helping to understand the different decision paths and feature importance in the model’s decision-making process.
# Export the first three decision trees from the forest
for i in range(3):
= rfc_model.estimators_[i]
tree = export_graphviz(tree,
dot_data =X_train.columns,
feature_names=True,
filled=2,
max_depth=False,
impurity=True)
proportion= graphviz.Source(dot_data)
graph display(graph)
Hyperparameter Tuning
The code below uses Scikit-Learn’s RandomizedSearchCV, which will randomly search parameters within a range per hyperparameter. We define the hyperparameters to use and their ranges in the param_dist dictionary. In our case, we are using:
- n_estimators: the number of decision trees in the forest. ncreasing this hyperparameter generally improves the performance of the model but also increases the computational cost of training and predicting.
- max_depth: the maximum depth of each decision tree in the forest. Setting a higher value for max_depth can lead to overfitting while setting it too low can lead to underfitting.
= {'n_estimators': randint(50,500),
param_dist 'max_depth': randint(1,20)}
# Use random search to find the best hyperparameters
= RandomizedSearchCV(rfc_model,
rand_search = param_dist,
param_distributions =5,
n_iter=5)
cv
# Fit the random search object to the data
rand_search.fit(X_train, y_train)
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=5, param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x7fc009652960>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x7fc00460e1e0>})In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=5, param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x7fc009652960>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x7fc00460e1e0>})
RandomForestClassifier(max_depth=5, n_estimators=469)
RandomForestClassifier(max_depth=5, n_estimators=469)
# Create a variable for the best model
= rand_search.best_estimator_
best_rf
# Print the best hyperparameters
print('Best hyperparameters:', rand_search.best_params_)
Best hyperparameters: {'max_depth': 5, 'n_estimators': 469}
print(classification_report(y_test,Y_prediction))
# Generate predictions with the best model
= best_rf.predict(X_test)
y_pred
# Create the confusion matrix
= confusion_matrix(y_test, y_pred)
rfc_confusion_matrix
= [0,1]
class_names = plt.subplots()
fig,ax = np.arange(len(class_names))
tick_marks
plt.xticks(tick_marks,class_names)
plt.yticks(tick_marks,class_names)= True, cmap = 'Greens', fmt = 'g')
sns.heatmap(pd.DataFrame(rfc_confusion_matrix), annot 'top')
ax.xaxis.set_label_position(
plt.tight_layout()'Confusion matrix for random forest')
plt.title('Actual label')
plt.ylabel('Best Predicted label')
plt.xlabel( plt.show()
precision recall f1-score support
0 0.79 0.79 0.79 34
1 0.83 0.83 0.83 42
accuracy 0.82 76
macro avg 0.81 0.81 0.81 76
weighted avg 0.82 0.82 0.82 76
We should also evaluate the best model with accuracy, precision, and recall (note your results may differ due to randomization)
= best_rf.score(X_train, y_train)
score print(f'Training Score: {score} -> {score * 100:.2f}%')
= best_rf.score(X_test, y_test)
score print(f'Testing Score: {score} -> {score * 100:.2f}%')
= accuracy_score(y_test, y_pred)
accuracy = precision_score(y_test, y_pred)
precision = recall_score(y_test, y_pred)
recall
# Displaying the scores with their corresponding percentages using f-strings
print(f'Accuracy score: {accuracy} -> {accuracy * 100:.2f}%')
print(f'Recall score: {recall} -> {recall * 100:.2f}%')
print(f'Precision score: {precision} -> {precision * 100:.2f}%')
= score
score_best_rfc = output out_best_rfc
Training Score: 0.9515418502202643 -> 95.15%
Testing Score: 0.8552631578947368 -> 85.53%
Accuracy score: 0.8552631578947368 -> 85.53%
Recall score: 0.8809523809523809 -> 88.10%
Precision score: 0.8604651162790697 -> 86.05%
The below code plots the importance of each feature, using the model’s internal score to find the best way to split the data within each decision tree.
# Create a series containing feature importances from the model and feature names from the training data
= pd.Series(best_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
feature_importances
# Plot a simple bar chart
; feature_importances.plot.bar()
Let’s also evalaute and display the ROC curve for the best Random Forest model
= best_rf.predict_proba(X_test)[:,1]
y_probabilities = roc_curve(y_test,y_probabilities)
false_positive_rate, true_positive_rate, threshold_knn =(10,6))
plt.figure(figsize'ROC for best random forest')
plt.title(=5, color='green')
plt.plot(false_positive_rate, true_positive_rate, linewidth0,1],ls='--',linewidth=5)
plt.plot([0,0],[1,0],c='.5')
plt.plot([1,1],c='.5')
plt.plot([0.2,0.6,'AUC: {:.2f}'.format(roc_auc_score(y_test,y_probabilities)),size= 16)
plt.text('False Positive Rate')
plt.xlabel('True Positive Rate')
plt.ylabel( plt.show()
Conclusion
To wrap up the model performance comparison, let’s create a new DataFrame called results
to store the evaluation scores of different models. It adds the scores of three models: ‘Logistic Regression’, ‘Base Model Random Forest Classifier’, and ‘Best Model Random Forest Classifier’.
Next, the code sorts the DataFrame in descending order based on the ‘score’ column using sort_values
and applies a background gradient with green colors (‘Greens’ colormap) to the ‘score’ column using style.background_gradient
. This styling enhances the visual representation of the scores, with higher scores appearing with darker shades of green. This tabular presentation allows for a quick comparison of the model performances, helping to identify the best-performing model at a glance.
=pd.DataFrame(columns=['score'])
results'Logistic Regression']=[score_logreg]
results.loc['Base Model Random Forest Classifier']=[score_rfc]
results.loc['Best Model Random Forest Classifier']=[score_best_rfc]
results.loc[
'score',ascending=False).style.background_gradient(cmap='Greens',subset=['score']) results.sort_values(
score | |
---|---|
Logistic Regression | 0.881579 |
Best Model Random Forest Classifier | 0.855263 |
Base Model Random Forest Classifier | 0.815789 |