Loan Approval Prediction with Python: A Step-by-Step Guide for INT303 Big Data Analysis

Learn how to build a machine learning pipeline for loan approval prediction using Python. This guide covers EDA, feature engineering, model selection, and evaluation with practical code examples.

Introduction

In the world of finance, accurate and efficient loan approval decisions are paramount. Banks and financial institutions rely on robust data analysis and predictive models to assess applicant creditworthiness, mitigate risks, and optimize their lending portfolios. This guide walks you through the essential steps of building a loan approval prediction model using Python, similar to the INT303 Big Data Analysis project. By the end, you'll be able to perform exploratory data analysis, preprocess data, engineer features, and evaluate multiple machine learning models.

Understanding the Dataset

The dataset contains applicant information such as number of dependents, education, self-employment status, annual income, loan amount, loan term, CIBIL score, and asset values. The target variable is loan_status with values 'Approved' or 'Rejected'. We'll use this data to predict whether a loan application should be approved.

Exploratory Data Analysis (EDA)

Loading and Initial Inspection

First, load the dataset using pandas and inspect the first few rows, data types, and missing values. Use df.info() and df.describe() to get an overview.

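import pandas as pd

df = pd.read_csv('loan_approval_dataset.csv')
print(df.head())          # first five rows
df.info()                 # column dtypes and non-null counts
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # missing values per column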
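
Before moving on, it can help to tidy the raw text and check how balanced the target is. The sketch below assumes your copy of the CSV may carry stray spaces in its headers or string values (an assumption about the file, not something this guide states) and uses a tiny made-up sample in place of the real data.

```python
import pandas as pd

# Tiny made-up sample standing in for loan_approval_dataset.csv
df = pd.DataFrame({
    " income_annum": [9600000, 4100000, 5700000, 8200000],
    " cibil_score": [778, 417, 506, 467],
    " loan_status": [" Approved", " Rejected", " Rejected", " Approved"],
})

# Strip stray whitespace from column names
df.columns = df.columns.str.strip()

# Strip stray whitespace from every string-valued column
for col in df.columns:
    if pd.api.types.is_string_dtype(df[col]):
        df[col] = df[col].str.strip()

# Share of each class in the target; a strong imbalance affects metric choice later
class_share = df["loan_status"].value_counts(normalize=True)
print(class_share)
```

If the class shares are far from even, accuracy alone becomes a weak metric and stratified splitting is worth using.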

Univariate Analysis

Analyze each feature individually. For numerical features like income_annum and cibil_score, create histograms and box plots to understand distributions and detect outliers. For categorical features like education and self_employed, use bar plots to see frequency counts.

import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(df['income_annum'], bins=30)
plt.show()
sns.boxplot(x=df['cibil_score'])
plt.show()
df['education'].value_counts().plot(kind='bar')
plt.show()

Bivariate Analysis

Explore relationships between features and the target variable. Use stacked bar plots for categorical features vs. loan_status, and box plots or violin plots for numerical features vs. loan_status. A heatmap of correlations can also reveal important relationships.

pd.crosstab(df['education'], df['loan_status']).plot(kind='bar', stacked=True)
plt.show()
sns.boxplot(x='loan_status', y='cibil_score', data=df)
plt.show()
corr = df.corr(numeric_only=True)  # restrict to numeric columns; string columns would raise an error
sns.heatmap(corr, annot=True)
plt.show()
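
The plots can be backed up with a few numbers. As a rough sketch on made-up rows (the values are illustrative, not taken from the dataset), the approval rate per category and the median CIBIL score per outcome can be computed with groupby:

```python
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    "education": ["Graduate", "Graduate", "Not Graduate",
                  "Not Graduate", "Graduate", "Not Graduate"],
    "cibil_score": [780, 420, 650, 390, 710, 560],
    "loan_status": ["Approved", "Rejected", "Approved",
                    "Rejected", "Approved", "Rejected"],
})

# Share of approved applications within each education level
approved = df["loan_status"] == "Approved"
approval_rate = approved.groupby(df["education"]).mean()

# Median CIBIL score for approved vs. rejected applications
median_cibil = df.groupby("loan_status")["cibil_score"].median()

print(approval_rate)
print(median_cibil)
```

A large gap in either summary suggests the feature carries signal for the target.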

Data Preprocessing

Handling Missing Values

Check for missing values using df.isnull().sum(). Decide on a strategy: for numerical features, you might use median imputation; for categorical, use mode. Document your choices.

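df['column'] = df['column'].fillna(df['column'].median())  # numerical: assign back instead of inplace=True
df['cat_column'] = df['cat_column'].fillna(df['cat_column'].mode()[0])  # categorical: most frequent value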
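
The same strategy can be applied to every column at once. The helper below, impute_missing, is a hypothetical name used for illustration; it is a minimal sketch that fills numeric columns with the median and all other columns with the mode, shown on made-up data.

```python
import pandas as pd

def impute_missing(frame):
    """Return a copy with numeric gaps filled by the median and other gaps by the mode."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isnull().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Made-up sample with one gap per column
df = pd.DataFrame({
    "income_annum": [4000000.0, None, 6000000.0, 9000000.0],
    "education": ["Graduate", None, "Graduate", "Not Graduate"],
})
df_imputed = impute_missing(df)
print(df_imputed)
```

Working on a copy keeps the raw data available, which makes it easier to document and justify the imputation choice.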

Outlier Treatment

Use box plots or Z-scores to identify outliers. Consider capping or transformation, but be careful not to lose important information. For loan approval prediction, outliers in income or asset values might be legitimate.
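
One common way to cap outliers is the interquartile-range (IQR) rule, which matches the whiskers of a box plot. The function name cap_outliers_iqr and the multiplier k=1.5 are illustrative choices, not requirements of the project, and the values are made up.

```python
import pandas as pd

def cap_outliers_iqr(series, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up values with one extreme entry
income = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0], name="income_annum")
capped = cap_outliers_iqr(income)

# Count how many values were changed before committing to the capped column
n_capped = int((capped != income).sum())
print(capped.tolist(), n_capped)
```

Counting the affected rows first makes it easier to judge whether capping is removing noise or legitimate high-income applicants.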

Feature Engineering

Create at least two new features that could improve model performance. For example:

  • Debt-to-Income Ratio: approximated here as loan_amount / income_annum.
  • Total Assets: sum of residential, commercial, luxury, and bank assets.

df['debt_to_income'] = df['loan_amount'] / df['income_annum']
df['total_assets'] = df['residential_assets_value'] + df['commercial_assets_value'] + df['luxury_assets_value'] + df['bank_asset_value']

Categorical Encoding

Convert categorical features to numerical using one-hot encoding or label encoding. For binary categories like 'education', label encoding is fine. For others, use one-hot.

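from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Encode the binary education column, then drop the raw string version
df['education_encoded'] = le.fit_transform(df['education'])
df = df.drop(columns=['education'])
# drop_first=True keeps a single indicator column for the yes/no feature
df = pd.get_dummies(df, columns=['self_employed'], drop_first=True)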

Feature Scaling

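Scale numerical features to have zero mean and unit variance using StandardScaler, or normalize them with MinMaxScaler. This matters for distance-based and gradient-based models such as SVM, KNN, and regularized Logistic Regression; tree-based models such as Random Forest and XGBoost are largely insensitive to feature scale. The snippet below scales the full DataFrame for simplicity; in a stricter workflow, fit the scaler on the training split only so that test-set statistics do not leak into training.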

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numerical_cols = ['income_annum', 'loan_amount', 'cibil_score', 'debt_to_income']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
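
To fit the scaler on training rows only, one option is to wrap the scaler and the model in a scikit-learn Pipeline. The sketch below runs on synthetic data with a made-up approval rule, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "income_annum": rng.uniform(2e5, 1e7, n),
    "loan_amount": rng.uniform(3e5, 4e7, n),
    "cibil_score": rng.integers(300, 900, n),
})
# Made-up rule: approve when the CIBIL score is at least 550
y = (X["cibil_score"] >= 550).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# fit() learns the scaling statistics from X_train only;
# score() reuses those statistics to transform X_test
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
train_means = pipe.named_steps["standardscaler"].mean_
print(test_accuracy)
```

The same pipeline object can be passed to cross-validation or grid search, which then refits the scaler inside every fold.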

Model Development and Evaluation

Data Splitting

Split the processed data into training (70%) and testing (30%) sets. Use train_test_split from sklearn.

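from sklearn.model_selection import train_test_split
# Keep numeric and boolean columns only; leftover string columns would break the models
X = df.drop(columns=['loan_status']).select_dtypes(include=['number', 'bool'])
# Encode the target as 1 = Approved, 0 = Rejected (strip() guards against stray whitespace)
y = (df['loan_status'].str.strip() == 'Approved').astype(int)
# stratify=y preserves the Approved/Rejected ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)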

Model Selection

Choose at least three classification algorithms. Good candidates include Logistic Regression, Random Forest, and Gradient Boosting (e.g., XGBoost).

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),  # higher iteration cap helps the solver converge
    'Random Forest': RandomForestClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42)
}
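
A quick way to get a baseline for each candidate is k-fold cross-validation on the training set. This sketch uses synthetic data from make_classification and swaps in scikit-learn's GradientBoostingClassifier for XGBoost so that it runs without the xgboost package; treat the scores as illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training split
X_train, y_train = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Mean 5-fold F1 score per model
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean F1 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Baseline scores like these indicate which models are worth the extra cost of hyperparameter tuning.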

Hyperparameter Tuning

Use GridSearchCV to find optimal hyperparameters. For example, for Random Forest, tune n_estimators and max_depth.

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None]
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)  # fixed seed for reproducible results
grid.fit(X_train, y_train)
print(grid.best_params_)

Model Evaluation

Evaluate each tuned model on the test set using accuracy, precision, recall, F1-score, ROC AUC, and confusion matrix. Provide a comparative analysis.

from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, confusion_matrix

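y_pred = grid.predict(X_test)
# ROC AUC needs scores, not hard labels: use the probability of the second class in grid.classes_
y_score = grid.predict_proba(X_test)[:, 1]
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print('ROC AUC:', roc_auc_score(y_test, y_score))
print(confusion_matrix(y_test, y_pred))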
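
For the comparative analysis, it may be convenient to collect the metrics of every model in one table. The sketch below uses synthetic data and two untuned scikit-learn models as placeholders; in the project you would pass your own tuned estimators and test split.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed loan data
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Placeholder models; replace with the tuned estimators
fitted = {
    "Logistic Regression": LogisticRegression(max_iter=1000).fit(X_train, y_train),
    "Random Forest": RandomForestClassifier(random_state=42).fit(X_train, y_train),
}

rows = []
for name, model in fitted.items():
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

A single table makes trade-offs visible, for example a model with higher recall but lower precision on approvals.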

Conclusion

Building a loan approval prediction model involves a systematic pipeline from data exploration to model evaluation. By following this guide, you can develop a robust model that helps financial institutions make informed decisions. Remember to document your process and justify your choices, as this is key to a successful project submission.

Further Reading

Explore topics like feature importance, model interpretability with SHAP, and handling imbalanced datasets to enhance your model. The skills you gain here are applicable to many real-world classification problems in finance and beyond.
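
As a starting point for two of these topics, the sketch below reads the built-in feature importances of a Random Forest and uses class_weight="balanced" as one simple response to class imbalance. The data is synthetic and the feature names are placeholders borrowed from this guide, so the ranking itself is meaningless.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names attached to synthetic columns
feature_names = ["income_annum", "loan_amount", "cibil_score", "total_assets"]
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency
model = RandomForestClassifier(
    n_estimators=100, random_state=0, class_weight="balanced"
).fit(X, y)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features, which is one reason tools such as SHAP are worth exploring next.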