In this project, we build a machine learning pipeline to predict loan default risk from financial and demographic data. We perform exploratory data analysis (EDA) and preprocessing, train Logistic Regression and Random Forest models, and evaluate them with accuracy, precision, recall, and ROC-AUC.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
df = pd.read_csv("loan_default.csv")
print("Shape of dataset:", df.shape)
df.head()
Shape of dataset: (255347, 18)
| | LoanID | Age | Income | LoanAmount | CreditScore | MonthsEmployed | NumCreditLines | InterestRate | LoanTerm | DTIRatio | Education | EmploymentType | MaritalStatus | HasMortgage | HasDependents | LoanPurpose | HasCoSigner | Default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I38PQUQS96 | 56 | 85994 | 50587 | 520 | 80 | 4 | 15.23 | 36 | 0.44 | Bachelor's | Full-time | Divorced | Yes | Yes | Other | Yes | 0 |
| 1 | HPSK72WA7R | 69 | 50432 | 124440 | 458 | 15 | 1 | 4.81 | 60 | 0.68 | Master's | Full-time | Married | No | No | Other | Yes | 0 |
| 2 | C1OZ6DPJ8Y | 46 | 84208 | 129188 | 451 | 26 | 3 | 21.17 | 24 | 0.31 | Master's | Unemployed | Divorced | Yes | Yes | Auto | No | 1 |
| 3 | V2KKSFM3UN | 32 | 31713 | 44799 | 743 | 0 | 3 | 7.07 | 24 | 0.23 | High School | Full-time | Married | No | No | Business | No | 0 |
| 4 | EY08JDHTZP | 60 | 20437 | 9139 | 633 | 8 | 4 | 6.51 | 48 | 0.73 | Bachelor's | Unemployed | Divorced | No | Yes | Auto | No | 0 |
df['Default'].value_counts()
sns.countplot(x='Default', data=df)
plt.title("Class Distribution (Default vs Non-Default)")
plt.show()
numeric_cols = ['Age', 'Income', 'LoanAmount', 'CreditScore', 'MonthsEmployed', 'NumCreditLines', 'InterestRate', 'LoanTerm', 'DTIRatio']
df[numeric_cols].hist(figsize=(15,10), bins=30)
plt.suptitle("Numeric Feature Distributions")
plt.show()
sns.boxplot(x='Default', y='Income', data=df)
plt.title("Income vs Default")
plt.show()
sns.boxplot(x='Default', y='CreditScore', data=df)
plt.title("Credit Score vs Default")
plt.show()
categorical_cols = ['Education', 'EmploymentType', 'MaritalStatus', 'HasMortgage', 'HasDependents', 'LoanPurpose', 'HasCoSigner']
for col in categorical_cols:
    plt.figure(figsize=(6,4))
    sns.countplot(x=col, hue='Default', data=df)
    plt.title(f"{col} vs Default")
    plt.xticks(rotation=45)
    plt.show()
We prepare the dataset by:

- dropping the `LoanID` identifier, which carries no predictive signal
- separating the features `X` from the target `y` (`Default`)
- standard-scaling the numeric columns and one-hot encoding the categorical columns via a `ColumnTransformer`
- splitting into train and test sets (80/20), stratified on `Default` to preserve the class ratio
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
df = df.drop(columns=['LoanID'])
X = df.drop('Default', axis=1)
y = df['Default']
categorical_cols = X.select_dtypes(include=['object']).columns
numeric_cols = X.select_dtypes(exclude=['object']).columns
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
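As an optional sanity check, the fitted preprocessor can list the expanded column names after scaling and one-hot encoding. A minimal sketch, assuming scikit-learn >= 1.0 for `get_feature_names_out`:
# Fit the preprocessor on the training split and peek at the first few output columns.
preprocessor.fit(X_train)
print(preprocessor.get_feature_names_out()[:10])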
Logistic Regression serves as a simple baseline. We add `class_weight="balanced"` to handle the imbalance between defaults and non-defaults.
log_reg = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced'))])
log_reg.fit(X_train, y_train)
y_pred_lr = log_reg.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))
Logistic Regression Accuracy: 0.6764440963383591
              precision    recall  f1-score   support

           0       0.94      0.67      0.79     45139
           1       0.22      0.70      0.33      5931

    accuracy                           0.68     51070
   macro avg       0.58      0.69      0.56     51070
weighted avg       0.86      0.68      0.73     51070
Random Forest is an ensemble method that often performs well on tabular data. We again add `class_weight="balanced"` to address class imbalance.
rf_model = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'))])
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
Random Forest Accuracy: 0.8847464264734678
              precision    recall  f1-score   support

           0       0.89      1.00      0.94     45139
           1       0.70      0.01      0.03      5931

    accuracy                           0.88     51070
   macro avg       0.79      0.51      0.48     51070
weighted avg       0.86      0.88      0.83     51070
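The headline accuracy hides what happens on the minority class; the confusion matrix (using the `confusion_matrix` imported earlier) makes this explicit, since with a recall of 0.01 nearly all true defaults land in the off-diagonal cell:
# Rows = true class (0 = no default, 1 = default); columns = predicted class.
print(confusion_matrix(y_test, y_pred_rf))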
We plot the ROC curve and compute the AUC for the Random Forest to evaluate its ability to discriminate between defaults and non-defaults.
y_pred_proba = rf_model.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_pred_proba):.2f}")
plt.plot([0,1],[0,1],'--',color='gray')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Random Forest")
plt.legend()
plt.show()
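For context, the same score can be computed for the logistic regression baseline, reusing the fitted `log_reg` pipeline from above:
# Compare AUC across the two fitted pipelines.
y_pred_proba_lr = log_reg.predict_proba(X_test)[:, 1]
print("Logistic Regression AUC:", roc_auc_score(y_test, y_pred_proba_lr))
print("Random Forest AUC:      ", roc_auc_score(y_test, y_pred_proba))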
The comparison illustrates a key point: although the Random Forest reports higher accuracy (0.88 vs 0.68), it recalls only 1% of actual defaults, while the balanced Logistic Regression catches 70% of them. In real-world credit risk analysis, recall on the default class matters more than overall accuracy, since an undetected default is far more costly than a false alarm.
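One practical lever that needs no retraining is lowering the decision threshold on the predicted default probability so the model flags more borrowers. A minimal sketch reusing the Random Forest probabilities from the ROC section; the 0.3 cutoff is an illustrative assumption, not a tuned value:
from sklearn.metrics import precision_score, recall_score

threshold = 0.3  # assumed cutoff; predict() implicitly uses 0.5
y_pred_custom = (y_pred_proba >= threshold).astype(int)
print("Recall (defaults):   ", recall_score(y_test, y_pred_custom))
print("Precision (defaults):", precision_score(y_test, y_pred_custom))
Lowering the threshold trades precision for recall; the right operating point depends on the relative cost of a missed default versus a rejected good loan.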
Future improvements could include SMOTE oversampling (sketched below), feature selection, and hyperparameter tuning.
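As one concrete direction, imbalanced-learn's SMOTE sampler can be dropped into the pipeline between preprocessing and the classifier. A minimal sketch, assuming the `imblearn` package is installed; note that plain SMOTE treats the one-hot columns as continuous, so its SMOTENC variant may be worth considering for mixed data:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline  # imblearn's Pipeline allows sampler steps

smote_rf = ImbPipeline(steps=[
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),  # synthesize minority-class samples on the training data only
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
smote_rf.fit(X_train, y_train)
print(classification_report(y_test, smote_rf.predict(X_test)))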