Credit Risk Modeling and Credit Scores using Logistic Regression with Python

Credit risk modeling is an essential process for financial institutions. It allows lenders to assess the probability that a borrower might default on their obligations. A credit scorecard, which is often built from such models, transforms these probabilities into a simple score that can be used for decision-making. In this guide, we’ll walk through the theory, mathematical concepts, and practical steps with Python code examples.


1. Introduction

What is Credit Risk Modeling?

Credit risk modeling uses historical data to predict the likelihood of default or non-payment. It involves preparing and cleaning historical loan data, fitting a statistical model (here, logistic regression) that estimates the probability of default, and validating that model so its output can support lending decisions.

What is a Credit Scorecard?

A credit scorecard converts the output of a predictive model into a score that is easy to interpret. Typically, it takes the odds (or probabilities) generated by the model and applies a scaling formula to convert them into points. Two key concepts here are the Points to Double the Odds (PDO), which fixes how many points a doubling of the odds is worth, and the offset, which anchors the scale to a chosen baseline score; both are developed in Section 6 below.


2. Data Preparation and Feature Engineering

Before modeling, it’s essential to clean and transform your dataset. Below is a snippet that demonstrates reading a dataset, cleaning up columns, and creating a target variable:

import pandas as pd
import numpy as np

# Load the dataset
loan_data = pd.read_csv('loan_data_2007_2014.csv')

# Drop columns where more than 80% of the values are missing
# (i.e. keep only columns with at least 20% non-missing values)
na_threshold = int(loan_data.shape[0] * 0.2)
loan_data.dropna(thresh=na_threshold, axis=1, inplace=True)

# Drop identifier, free-text, and post-outcome columns that are irrelevant for modeling
loan_data.drop(columns=['id', 'member_id', 'sub_grade', 'emp_title', 'url', 'desc', 'title', 'zip_code', 
                          'next_pymnt_d', 'recoveries', 'collection_recovery_fee', 'total_rec_prncp', 
                          'total_rec_late_fee'], inplace=True)

# Create a binary target variable based on loan_status
loan_data['good_bad'] = np.where(loan_data['loan_status'].isin([
    'Charged Off', 'Default', 'Late (31-120 days)', 
    'Does not meet the credit policy. Status:Charged Off'
]), 0, 1)

# Drop the original loan_status column
loan_data.drop(columns=['loan_status'], inplace=True)

Explanation: We first drop columns that are mostly empty or carry no predictive information (identifiers, free-text fields), along with fields that are only observed after the outcome (for example, recoveries). We then derive a binary target, good_bad, which flags charged-off, defaulted, or seriously late loans as 0 (bad) and every other status as 1 (good), and finally drop the original loan_status column.
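As a quick, optional sanity check, the share of missing values in the remaining columns can be inspected:

# Columns with the highest remaining share of missing values
print(loan_data.isnull().mean().sort_values(ascending=False).head())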


3. Exploratory Data Analysis (EDA)

Exploratory analysis helps us understand our data’s structure and distribution. For example, visualizing the distribution of the target variable:

import seaborn as sns
import matplotlib.pyplot as plt

# Visualize the target variable distribution
ax = sns.countplot(x='good_bad', data=loan_data)
ax.set_title('Distribution of Good vs. Bad Loans')
plt.xlabel("Loan Outcome (1: Good, 0: Bad)")
plt.ylabel("Number of Records")
plt.show()

Figure: Count plot of good vs. bad loans.

Explanation: The plot shows a pronounced class imbalance: good loans far outnumber bad loans, which is typical of lending portfolios. This imbalance is worth keeping in mind when interpreting accuracy-style metrics later on.
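To put a number on this imbalance, the class shares can be computed directly:

# Proportion of good (1) and bad (0) loans
print(loan_data['good_bad'].value_counts(normalize=True))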


4. Logistic Regression for Credit Risk Modeling

The Logistic Regression Model

Logistic regression is popular in credit risk because it models the probability of default in a way that is easy to interpret. The model is defined as:

\[P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_kx_k)}}\]

Where:

  * P(Y=1|X) is the modeled probability that the target equals 1 (here, a good loan), so the probability of default is 1 - P(Y=1|X).

  * β0 is the intercept.

  * β1, …, βk are the coefficients applied to the borrower features x1, …, xk.

Code Example: Splitting Data and Training Logistic Regression

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix

# Split the data into features and target.
# For simplicity, keep numeric columns only and fill remaining missing values with zero;
# in practice, categorical variables would be encoded (e.g. via weight-of-evidence binning).
X = loan_data.drop('good_bad', axis=1).select_dtypes(include='number').fillna(0)
y = loan_data['good_bad']

# Create a train/test split, stratified to maintain the distribution of the target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Instantiate and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict probabilities on the test set
y_pred_prob = model.predict_proba(X_test)[:, 1]

# Evaluate the model using ROC-AUC
roc_auc = roc_auc_score(y_test, y_pred_prob)
print(f"ROC-AUC Score: {roc_auc:.2f}")

# Confusion Matrix
y_pred = model.predict(X_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Confusion Matrix:
[[ 8418  1776]
 [   10 83053]]

Explanation: We hold out 20% of the data as a test set, stratifying on the target so both splits keep the same good/bad mix. The model is evaluated on its predicted probabilities with ROC-AUC, and the confusion matrix (at the default 0.5 threshold) shows that most errors are bad loans predicted as good (1,776 cases), while only 10 good loans are flagged as bad.
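To connect this back to the formula in Section 4, the predicted probabilities can be reproduced by hand from the fitted intercept and coefficients; a minimal sketch:

# Linear predictor z = beta_0 + beta_1*x_1 + ... + beta_k*x_k for every test row
z = X_test.values @ model.coef_[0] + model.intercept_[0]

# Apply the logistic (sigmoid) function to obtain P(Y=1|X)
manual_prob = 1 / (1 + np.exp(-z))

# Should print True: the manual calculation matches sklearn's predict_proba
print(np.allclose(manual_prob, model.predict_proba(X_test)[:, 1]))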


5. Model Evaluation Metrics

ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve is a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. The Area Under the Curve (AUC) is a measure of the model’s ability to distinguish between classes.

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC Curve
plt.figure()
plt.plot(fpr, tpr, label=f"ROC curve (area = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend(loc="lower right")
plt.show()

Figure: ROC curve for the logistic regression model.

Explanation: The closer the curve bows toward the top-left corner, the better the model separates good from bad loans; the dashed diagonal represents a random classifier (AUC = 0.5). The AUC summarizes the curve in a single number and can be read as the probability that a randomly chosen good loan receives a higher predicted probability than a randomly chosen bad loan.
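The thresholds returned by roc_curve can also be used to choose an operating point other than the default 0.5 cut-off. As one illustrative option, Youden's J statistic (TPR minus FPR) picks the threshold that best balances the two rates:

# Threshold that maximises TPR - FPR (Youden's J statistic)
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)
print(f"Threshold: {thresholds[best_idx]:.3f}, "
      f"TPR: {tpr[best_idx]:.2f}, FPR: {fpr[best_idx]:.2f}")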


6. Developing a Credit Scorecard

Credit scorecards convert the logistic regression output into a score. The score is often calculated using the following formula:

\[\text{Score} = \text{Offset} - (\text{Factor} \times \ln(\text{Odds}))\]

Where:

  * Odds is the ratio of the probability of default to the probability of non-default.

  * Factor determines how many points a doubling of the odds is worth; it is derived from the PDO (Points to Double the Odds).

  * Offset anchors the scale so that a chosen baseline odds value maps to a chosen baseline score.

Step-by-Step Example

a. Calculate the Factor
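The value of the factor follows from the definition of the PDO: moving from odds O to odds 2O changes the score by

\[\text{Score}(O) - \text{Score}(2O) = \text{Factor} \times \big(\ln(2O) - \ln(O)\big) = \text{Factor} \times \ln(2)\]

so setting this difference equal to the PDO gives Factor = PDO / ln(2).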

Suppose the PDO is 20:

import numpy as np

PDO = 20  # Points to Double the Odds
factor = PDO / np.log(2)
print(f"Factor: {factor:.2f}")

b. Calculate the Offset

Let’s assume a baseline score of 600 corresponds to baseline odds of 1:50, i.e. one default for every 50 non-defaults:

baseline_score = 600
baseline_odds = 1/50  # odds of default: one default for every 50 non-defaults
offset = baseline_score + (factor * np.log(baseline_odds))
print(f"Offset: {offset:.2f}")

Factor: 28.85
Offset: 487.12

c. Transforming Model Predictions to Scores

Convert the predicted probabilities to scores using the logistic regression outputs. For each applicant, take the model’s predicted probability P for class 1 (a good loan), as produced by predict_proba above, and calculate:

  1. The odds:

    \[\text{Odds} = \frac{P}{1-P}\]

  2. The score:

    \[\text{Score} = \text{Offset} - (\text{Factor} \times \ln(\text{Odds}))\]

# Odds implied by the model's class-1 probability (i.e. good:bad odds)
odds = y_pred_prob / (1 - y_pred_prob)
# Calculate score for each applicant
scores = offset - (factor * np.log(odds))

# Show summary of the credit scores
import pandas as pd
score_summary = pd.DataFrame({'Predicted_Probability': y_pred_prob, 'Credit_Score': scores})
print(score_summary.describe())

       Predicted_Probability  Credit_Score
count           93257.000000  93257.000000
mean                0.887445    339.019409
std                 0.278539    199.812793
min                 0.000001     88.491534
25%                 0.949164    157.521232
50%                 0.984298    367.720839
75%                 0.999989    402.668335
max                 0.999999    885.754219

Figure: Summary of the computed credit scores.

Explanation: The summary statistics show how the predicted probabilities map onto the score scale. One caveat: y_pred_prob is the probability that a loan is good, so the odds used above are good:bad odds rather than the default odds used to calibrate the offset. With the formula above, this means the scores run opposite to the usual convention (the safest applicants receive the lowest scores). To obtain a scorecard where a higher score means lower risk, use the odds of default, (1 - P) / P, as sketched below.
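A minimal sketch of that conventional mapping, wrapped in a small helper (prob_to_score is a name introduced here for illustration, not a library function), reusing the factor and offset computed above:

def prob_to_score(p_good, factor=factor, offset=offset):
    # Illustrative helper: map the model's probability of a good loan to a
    # score where a higher score means lower risk.
    p_default = 1 - p_good                 # probability of default
    odds_default = p_default / p_good      # odds of default
    return offset - factor * np.log(odds_default)

# Sanity checks: the baseline odds of 1:50 should map to the baseline score of 600,
# and doubling the default odds should lower the score by PDO = 20 points.
print(prob_to_score(50 / 51))   # approximately 600
print(prob_to_score(50 / 52))   # approximately 580 (default odds doubled to 2:50)

# Applied to the test-set predictions
conventional_scores = prob_to_score(y_pred_prob)

Under this convention, safer applicants receive higher scores, consistent with the PDO and baseline defined earlier.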


7. Final Thoughts

Credit risk modeling and scorecard development involve:

  1. Data preprocessing: Cleaning, handling missing values, and engineering features.

  2. Modeling: Using logistic regression to predict default probabilities.

  3. Evaluation: Leveraging metrics like ROC-AUC to gauge model performance.

  4. Scorecard Construction: Transforming model outputs into actionable scores using scaling parameters such as the PDO and offset, often alongside weight-of-evidence binning of the features.

The process not only provides insights into creditworthiness but also standardizes decisions in lending. While our examples use logistic regression for its simplicity and interpretability, other techniques (like decision trees, ensemble methods, or neural networks) can also be applied depending on the complexity and requirements of the task.