Loan Default Risk Prediction¶

1. Introduction¶

In this project, we build a machine learning pipeline to predict loan default risk from financial and demographic data. We perform exploratory data analysis (EDA) and preprocessing, train Logistic Regression and Random Forest models, and evaluate them using accuracy, precision, recall, and ROC-AUC.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
In [2]:
df = pd.read_csv("loan_default.csv")
print("Shape of dataset:", df.shape)
df.head()
Shape of dataset: (255347, 18)
Out[2]:
LoanID Age Income LoanAmount CreditScore MonthsEmployed NumCreditLines InterestRate LoanTerm DTIRatio Education EmploymentType MaritalStatus HasMortgage HasDependents LoanPurpose HasCoSigner Default
0 I38PQUQS96 56 85994 50587 520 80 4 15.23 36 0.44 Bachelor's Full-time Divorced Yes Yes Other Yes 0
1 HPSK72WA7R 69 50432 124440 458 15 1 4.81 60 0.68 Master's Full-time Married No No Other Yes 0
2 C1OZ6DPJ8Y 46 84208 129188 451 26 3 21.17 24 0.31 Master's Unemployed Divorced Yes Yes Auto No 1
3 V2KKSFM3UN 32 31713 44799 743 0 3 7.07 24 0.23 High School Full-time Married No No Business No 0
4 EY08JDHTZP 60 20437 9139 633 8 4 6.51 48 0.73 Bachelor's Unemployed Divorced No Yes Auto No 0

2. Exploratory Data Analysis (EDA)¶

2.1 Target Variable Distribution (Default)¶

In [3]:
print(df['Default'].value_counts())
sns.countplot(x='Default', data=df)
plt.title("Class Distribution (Default vs Non-Default)")
plt.show()
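
Since class imbalance will matter for modelling later, it is worth quantifying it here as well. A small addition (not in the original notebook) that prints the share of each class:

In [ ]:
# Relative class frequencies: the default rate is the share of Default == 1
print(df['Default'].value_counts(normalize=True))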

2.2 Numeric Feature Distributions¶

In [4]:
numeric_cols = ['Age', 'Income', 'LoanAmount', 'CreditScore', 'MonthsEmployed', 'NumCreditLines', 'InterestRate', 'LoanTerm', 'DTIRatio']

df[numeric_cols].hist(figsize=(15,10), bins=30)
plt.suptitle("Numeric Feature Distributions")
plt.show()

2.3 Boxplots (Feature vs Default)¶

In [5]:
sns.boxplot(x='Default', y='Income', data=df)
plt.title("Income vs Default")
plt.show()

sns.boxplot(x='Default', y='CreditScore', data=df)
plt.title("Credit Score vs Default")
plt.show()

2.4 Categorical Features vs Default¶

In [6]:
categorical_cols = ['Education', 'EmploymentType', 'MaritalStatus', 'HasMortgage', 'HasDependents', 'LoanPurpose', 'HasCoSigner']

for col in categorical_cols:
    plt.figure(figsize=(6,4))
    sns.countplot(x=col, hue='Default', data=df)
    plt.title(f"{col} vs Default")
    plt.xticks(rotation=45)
    plt.show()
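
Raw counts are dominated by the size of each category, so differences in risk are easier to see as default rates. A small complementary check (an addition, not part of the original analysis):

In [ ]:
# Default rate within each level of every categorical feature
for col in categorical_cols:
    rates = pd.crosstab(df[col], df['Default'], normalize='index')[1]
    print(f"\nDefault rate by {col}:")
    print(rates.sort_values(ascending=False).round(3))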

3. Preprocessing¶

We prepare the dataset by:

  • Splitting into features (X) and target (y)
  • Scaling numerical variables
  • One-hot encoding categorical variables
  • Creating a preprocessing pipeline
In [7]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
In [8]:
df = df.drop(columns=['LoanID'])
In [9]:
X = df.drop('Default', axis=1)
y = df['Default']
In [10]:
categorical_cols = X.select_dtypes(include=['object']).columns
numeric_cols = X.select_dtypes(exclude=['object']).columns
In [11]:
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
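
As a quick sanity check on the transformer (an addition; get_feature_names_out assumes scikit-learn >= 1.0), we can fit it on the features alone and see how many columns the scaling and one-hot encoding produce. The modelling pipelines below still refit it on the training split only.

In [ ]:
# Inspect the transformed feature space produced by the ColumnTransformer
preprocessor.fit(X)
feature_names = preprocessor.get_feature_names_out()
print("Number of transformed features:", len(feature_names))
print(feature_names[:10])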

4. Train-Test Split¶

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

5. Logistic Regression (with Class Weight Balancing)¶

Logistic Regression serves as a simple baseline.
We add class_weight="balanced" to handle the imbalance between defaults and non-defaults.
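
Under class_weight="balanced", scikit-learn weights each class by n_samples / (n_classes * n_samples_in_class), so the rarer default class receives a proportionally larger weight. A short sketch (an addition, not in the original notebook) of the weights this produces on our training labels:

In [ ]:
from sklearn.utils.class_weight import compute_class_weight

# Balanced weights: n_samples / (n_classes * np.bincount(y))
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
print(dict(zip(classes, weights)))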

In [13]:
log_reg = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced'))])

log_reg.fit(X_train, y_train)

y_pred_lr = log_reg.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))
Logistic Regression Accuracy: 0.6764440963383591
              precision    recall  f1-score   support

           0       0.94      0.67      0.79     45139
           1       0.22      0.70      0.33      5931

    accuracy                           0.68     51070
   macro avg       0.58      0.69      0.56     51070
weighted avg       0.86      0.68      0.73     51070

6. Random Forest (with Class Weight Balancing)¶

Random Forest is an ensemble method that often performs well on tabular data.
We again add class_weight="balanced" to address class imbalance.

In [14]:
rf_model = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'))])

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
Random Forest Accuracy: 0.8847464264734678
              precision    recall  f1-score   support

           0       0.89      1.00      0.94     45139
           1       0.70      0.01      0.03      5931

    accuracy                           0.88     51070
   macro avg       0.79      0.51      0.48     51070
weighted avg       0.86      0.88      0.83     51070
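
The report above shows that, even with class weighting, the forest flags almost no defaulters at the default 0.5 decision threshold. A hedged sketch of trading precision for recall by classifying on predicted probabilities directly (an addition; the 0.3 threshold is illustrative, not tuned):

In [ ]:
# Lower the decision threshold to flag more potential defaulters
# (0.3 is an illustrative value; in practice it should be tuned on a validation set)
proba_rf = rf_model.predict_proba(X_test)[:, 1]
y_pred_rf_tuned = (proba_rf >= 0.3).astype(int)
print(classification_report(y_test, y_pred_rf_tuned))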

7. ROC Curve & AUC¶

We plot the ROC curve and calculate the AUC to evaluate the model's ability to discriminate between defaults and non-defaults.

In [15]:
y_pred_proba = rf_model.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_pred_proba):.2f}")
plt.plot([0,1],[0,1],'--',color='gray')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Random Forest")
plt.legend()
plt.show()
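
For context, the logistic regression pipeline can be overlaid on the same axes (an extra comparison, not in the original notebook; no results are assumed here):

In [ ]:
# Overlay the logistic regression ROC curve on the Random Forest one
proba_lr = log_reg.predict_proba(X_test)[:, 1]
fpr_lr, tpr_lr, _ = roc_curve(y_test, proba_lr)

plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, label=f"Random Forest (AUC = {roc_auc_score(y_test, y_pred_proba):.2f})")
plt.plot(fpr_lr, tpr_lr, label=f"Logistic Regression (AUC = {roc_auc_score(y_test, proba_lr):.2f})")
plt.plot([0,1],[0,1],'--',color='gray')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves - Model Comparison")
plt.legend()
plt.show()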

8. Conclusion¶

  • Logistic Regression (with class weighting) achieved much higher recall for defaults (70%), making it better at flagging risky loans.
  • Random Forest achieved higher overall accuracy (88%) but recalled only about 1% of defaulters, since the class imbalance still dominates its default decision threshold.
  • An AUC of 0.73 indicates the Random Forest nevertheless has reasonable discriminative power.

In real-world credit risk analysis, recall for defaults is more important than overall accuracy, since missing a default is costly.
Future improvements could include SMOTE oversampling, feature selection, and hyperparameter tuning.
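
As a sketch of the SMOTE option, oversampling can be inserted between the preprocessor and the classifier using the pipeline from the imbalanced-learn package (an assumption: imbalanced-learn is installed; this cell is not part of the original notebook):

In [ ]:
# Requires the imbalanced-learn package (pip install imbalanced-learn)
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# SMOTE oversamples the minority class in the encoded feature space; inside the
# pipeline it is applied only when fitting, never to the test data.
# (SMOTENC is a categorical-aware alternative if oversampling before encoding.)
smote_rf = ImbPipeline(steps=[('preprocessor', preprocessor),
                              ('smote', SMOTE(random_state=42)),
                              ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))])

smote_rf.fit(X_train, y_train)
print(classification_report(y_test, smote_rf.predict(X_test)))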