Predicting Mortgage Approvals from Government Data

GOAL - To predict whether a mortgage application was accepted (meaning the loan was originated) or denied according to the given dataset, which is adapted from the Federal Financial Institutions Examination Council (FFIEC).

Executive Summary

Predicting whether a mortgage application within the USA will be accepted or denied is a valuable capability. This report describes the process taken to do exactly that; notably, the prediction is accomplished without industry standard features such as credit score, debt-to-income ratio, and loan-to-value ratio. The datasets provided for this project were adapted from the Federal Financial Institutions Examination Council, better known as the FFIEC. The training dataset included 500,000 observations across 23 features, with a unique row identifier and a known acceptance outcome. This data was used to train a binary classification model using supervised machine learning (SML). A second dataset with 500,000 additional observations, lacking a known acceptance outcome, was then used to ‘test’ the model by creating predictions of the acceptance outcome.

The key analytical tasks for this workflow were broken down into four basic steps: view and clean, explore and transform, train and optimize, and finally test/predict. Each of these steps is described in further detail below.

[Figure: workflow diagram of the four analytical steps]

A simple binary classification model was created using the CatBoost package from Yandex, which uses gradient-boosted decision trees. The best model predicted results for the unknown acceptance outcomes, achieving 0.7275 accuracy using only 22 significant feature columns and three hyper-parameters. This score earned a public rank of #7 out of 432 participants in this capstone as of the final submission date of April 25, 2019. If further accuracy is ever needed, a future recommendation would be to explore some of the 60+ additional hyper-parameters available within the CatBoost package. Also, the process of cleaning and transforming the features used here is one of hundreds of possible combinations for this dataset; one could do more, or even less, to the training set and still achieve a higher accuracy metric.


This report concludes that predicting whether mortgage loans are accepted or denied can be done efficiently, without the industry standard key features, with ~73% accuracy, using readily available data from the census, the loan application and its applicant, and the geographical location of the loan. Some of the significant findings from this analysis are listed directly below; the full list of 22 final features for the predictive model can be found at the end of this report.


Geographical location – the acceptance outcome correlates with the geographical location of the originating loan, as well as with the FFIEC median family income census information.
Loan application – the loan amount correlates with the acceptance outcome and with whether a co-applicant was included, while the purpose of the loan correlates with preapproval requests.
Personal applicant income – correlates with the acceptance outcome, the loan amount, and the geographical location of the originating loan.


VIEW AND CLEAN | 1

MISSING DATA | 1A

Once the training dataset was loaded, 8 numerical float columns, 1 categorical Boolean column, and 14 categorical integer columns were identified (including the row identifier and the known label). Of those, 7 of the numerical feature columns were missing data, appearing as empty cells within the comma-separated values (CSV) file. It was also noted that for 3 of the categorical columns, missing values were indicated by a ‘-1’ within the cell.


Msa_md – (Cat-int) Metropolitan Statistical Area/ Metropolitan Division (-1 as missing value)
State_code – (Cat-int) Indicates the US state (-1 as missing value)
County_code – (Cat-int) Indicates the county (-1 as missing value)
Lender – (Cat-int) The lender that approved or denied the loan
Loan_amount – (Num-float) Size of requested loan in thousands of US dollars
Loan_type – (Cat-int) Indicates if loan was: conventional, FHA-insured, VA-guaranteed, or FSA/RHS
Property_type – (Cat-int) Indicates if loan application was for: 1-4 family, manufactured, or multifamily
Loan_purpose – (Cat-int) Indicates if loan application was for: home purchase, home improvement, or refinancing
Occupancy – (Cat-int) Indicates if application property was: owner-occupied as principal dwelling, not owner-occupied, or not applicable
Preapproval – (Cat-int) Indicates if: preapproval requested, preapproval not requested, or not applicable
Applicant_income – (Num-float) Size of income in thousands of US dollars
Applicant_ethnicity – (Cat-int) Ethnicity of applicant indicating: Hispanic or Latino, Not Hispanic or Latino, info not provided, or not applicable
Applicant_race – (Cat-int) Race of applicant indicating: American Indian or Alaska Native, Asian, Black or African American, Native Hawaiian or Other Pacific Islander, White, info not provided, or not applicable
Applicant_sex – (Cat-int) Sex of applicant indicating: male, female, info not provided, or not applicable
Co_applicant – (Boolean) Indicates whether there is a co-applicant or not
Population – (Num-float) Total population in tract
Minority_population_pct – (Num-float) Percentage of minority population to total population for tract
Ffiecmedian_family_income – (Num-float) FFIEC median family income in dollars for MSA/MD in which the tract is located (adjusted annually by FFIEC)
Tract_to_msa_md_income_pct – (Num-float) Percentage of tract median family income compared to MSA/MD median family income
Number_of_owner-occupied_units – (Num-float) Number of dwellings, including individual condominiums, that are lived in by the owner
Number_of_1_to_4_family_units – (Num-float) Dwellings that are built to house fewer than 5 families
Row_id – (Cat-int) A unique identifier with no intrinsic meaning
Accepted – (Cat-int) Indicates whether the mortgage application was accepted (successfully originated) with a value of 1 or denied with a value of 0


STATISTICS | 1B

A simple count of missing data within each feature was tabulated, and it was decided that the small amount of missing data, under 15.5%, was not large enough to justify removing the affected features entirely. Instead, missing values were imputed using the median value of each numerical feature column. The categorical feature columns with missing values indicated by ‘-1’ were left alone and not replaced, since the ‘-1’ designation could have meaningful predictive qualities.
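This imputation step can be sketched in a few lines of pandas (the frame and column names below are illustrative, not the actual project data):

```python
import pandas as pd

# Toy frame standing in for the training data; NaN marks the empty CSV cells.
df = pd.DataFrame({
    "applicant_income": [50.0, None, 74.0, 120.0],
    "loan_amount": [100.0, 150.0, None, 200.0],
    "state_code": [6, -1, 48, 12],  # -1 left as-is: it may carry signal
})

# Impute each numerical column with its own median; leave the -1
# categorical markers untouched.
numeric_cols = ["applicant_income", "loan_amount"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
```

Median imputation is robust to the skewed distributions noted above, which is why it was preferred over the mean here.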

Summary statistics were examined for all feature columns. Many of the features had mean values that differed from their median values (50th percentile), indicating that the feature distributions were skewed. For example, in the figure below, ‘applicant_income’ had a mean value of approximately 100 but a median of 74, indicating that its distribution was right skewed. Count, mean, standard deviation, minimum, 25% quartile, 50% quartile (median), 75% quartile, and maximum were included within the summary statistics.

[Figure: summary statistics of the numerical feature columns]

EXPLORE AND TRANSFORM | 2

VISUAL EXPLORATION AND TRANSFORMATION | 2A

Using information from the statistics, some features with large numbers of unique categories were binned into fewer categories, both for processing efficiency and to help create distinctive frequency distributions between the acceptance outcomes. The only criterion for the binning was to start with equal-sized bins over the range and then check whether the training model improved; if no improvement was observed, re-binning with different criteria could be attempted. County code went from 324 unique values to 5 equally sized bins. Lender had 6,508 unique values and was binned into 5 equally sized bins. Applicant race had 7 unique values and was binned into 5 by joining ‘not applicable’ with ‘information not provided’ into one group and joining ‘American Indian or Alaska Native’ with ‘Native Hawaiian or Other Pacific Islander’ into another. Frequency distributions of all three are shown below.
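Both kinds of binning can be sketched with pandas (the toy data and the exact code-to-group mapping below are illustrative; only the grouping logic mirrors the text):

```python
import pandas as pd

# Toy columns standing in for county_code and applicant_race.
df = pd.DataFrame({
    "county_code": range(100),
    "applicant_race": [1, 2, 3, 4, 5, 6, 7] * 14 + [1, 2],
})

# Equal-sized (quantile) bins, as used for county_code and lender.
df["county_bin"] = pd.qcut(df["county_code"], q=5, labels=False)

# Merge 7 race codes into 5 groups. The numeric codes here are assumed:
# 6/7 = info not provided / not applicable; 1/4 = American Indian or
# Alaska Native / Native Hawaiian or Other Pacific Islander.
race_group = {1: 0, 4: 0, 2: 1, 3: 2, 5: 3, 6: 4, 7: 4}
df["RaceGroup"] = df["applicant_race"].map(race_group)
```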

[Figure: frequency distributions of the binned county_code, lender, and applicant_race features]

From the histograms of the numerical feature columns, it was easy to see visually that 7 of the 8 features were right skewed, having a long tail to the right, as shown in the example below. To check the symmetry of the distributions, the skewness of the data was also tabulated. One of the 8 features was left skewed, having a long tail to the left. The right-skewed features with quantified skewness greater than +0.5 were transformed by applying log(x+1), and the left-skewed feature with quantified skewness less than -0.5 was transformed by raising its values to the 10th power. Both transformations were applied to better approximate symmetric, normally distributed (Gaussian) data. Though neither transformation worked perfectly, the transformed distributions were much closer to normality in appearance than the originals. Both the transformed and untransformed data were carried forward to see which were more powerful predictors in the model.
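The skewness check and log(x+1) transform for the right-skewed case can be sketched as follows (the synthetic lognormal data stands in for a feature like ‘applicant_income’):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Right-skewed toy data (lognormal), mimicking a skewed income feature.
income = pd.Series(rng.lognormal(mean=4.3, sigma=0.6, size=10_000))

skew_before = income.skew()
# Apply log(x+1) only when skewness exceeds the +0.5 threshold.
transformed = np.log1p(income) if skew_before > 0.5 else income
skew_after = pd.Series(transformed).skew()
```

The left-skewed feature would instead be handled with `values ** 10`, per the text.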

[Figure: example of a right-skewed feature distribution]

A new numerical feature, Loan to Income (‘LTI’), was created by dividing the ‘loan_amount’ value by the ‘applicant_income’ value for each row of data. This feature was generated to somewhat mimic the more traditional industry standard debt-to-income (DTI) ratio, for which a lower value is associated with a higher probability of being accepted for a home loan.
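The feature is a single vectorized division (toy values shown):

```python
import pandas as pd

df = pd.DataFrame({
    "loan_amount": [100.0, 250.0, 80.0],      # thousands of US dollars
    "applicant_income": [50.0, 100.0, 40.0],  # thousands of US dollars
})

# Loan-to-income ratio: a rough stand-in for the industry DTI metric.
df["LTI"] = df["loan_amount"] / df["applicant_income"]
```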

Categorical variables were left in their current state as integers and fed into the training model as categorical objects.

POTENTIAL RELATIONSHIPS AND CORRELATIONS | 2B

Using the heat map below, correlations among the features and with the acceptance outcome were observed.

[Figure: correlation heat map of the features and the acceptance outcome]

The label ‘accepted’ did not appear to be highly correlated with any of the original features, but was modestly correlated with ‘loan_amount’ and ‘applicant_income’, and slightly with ‘msa_md’, ‘state_code’, ‘county_code’, and ‘ffiecmedian_family_income’. When the acceptance outcome was compared to the transformed numerical features, ‘accepted’ seemed modestly correlated with ‘log(x+1)_loan_amount’ and ‘log(x+1)_applicant_income’ and less correlated with ‘log(x+1)_ffiecmedian_family_income’. This was not a surprise, since they were similarly correlated in the untransformed state.

When comparing the features against each other, there seemed to be several noteworthy relationships. The ‘number_of_owner-occupied_units’ and ‘number_of_1_to_4_family_units’ both correlate with ‘population’. ‘Preapproval’ had a relationship with ‘loan_purpose’. Smaller relationships were observed between ‘ffiecmedian_family_income’ and the three features ‘msa_md’, ‘state_code’, and ‘applicant_income’. Also, ‘co_applicant’ had a smaller correlation with both ‘loan_amount’ and ‘applicant_income’.
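The correlation matrix behind such a heat map is one pandas call (toy values shown; in the report it would be plotted, e.g. with seaborn's heatmap):

```python
import pandas as pd

# Toy slice of the data with an obvious loan_amount/applicant_income link.
df = pd.DataFrame({
    "loan_amount": [100, 200, 150, 300, 250],
    "applicant_income": [50, 95, 70, 160, 120],
    "accepted": [0, 1, 0, 1, 1],
})

corr = df.corr()  # pairwise Pearson correlations, including the label
```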

TRAIN AND OPTIMIZE | 3


TRAINING MODEL | 3A

A binary classification model was needed to predict the accepted or denied outcome for the mortgage applications. CatBoost was chosen for this challenge, using the ‘accuracy’ metric to help optimize the model’s hyper-parameters. CatBoost is an open-source gradient boosting on decision trees library from Yandex. This package was more efficient at producing predictions and produced less overfitting on categorically-heavy datasets when compared to other available models, such as XGBoost, Random Forest, and AdaBoost.

A base model for CatBoost was first run with default values to ensure that the model was working properly. Then the ‘training’ dataset was used with all of the categorical features, the 3 new binned groups, the new LTI numerical feature, the skewed numerical features in both untransformed and transformed versions, and the known acceptance outcome. This ‘training’ dataset was split 70% for training the model and the remaining 30% (150,000 observations) for validation. Optimal hyper-parameters affecting the model performance were found to be depth = 3, l2 leaf reg[ularization] = NA, and learning rate = 0.1. The resulting area under the receiver operating characteristic curve (AUC-ROC, often shortened to AUC) was 0.735, with an internal validation accuracy of 0.671.

OPTIMIZING MODEL | 3B

Learning rates between 0.2 and 1.0 to reduce the gradient step, tree depths up to 7, and l2 leaf reg[ularization] between 1 and 10 were the three hyper-parameters further optimized on the ‘training’ dataset. Each set of hyper-parameters had 100 iterations (maximum number of trees), with 50 different random combinations of the hyper-parameters searched using Bayesian methods. The improved optimal parameters for the model were found to be depth = 6, l2 leaf reg[ularization] = 6.43, and learning rate = 0.469. This provided an AUC-ROC of 0.806 and an accuracy of 0.72.
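The search over those ranges can be sketched as follows. Plain random sampling is shown for illustration (the actual run used Bayesian search), and the scoring function is a stand-in: in the real workflow each candidate trains a 100-iteration model and is scored on the 30% validation split.

```python
import random

random.seed(0)

def sample_params():
    # Ranges from the text: learning rate 0.2-1.0, depth up to 7,
    # l2 leaf regularization 1-10.
    return {
        "learning_rate": random.uniform(0.2, 1.0),
        "depth": random.randint(1, 7),
        "l2_leaf_reg": random.uniform(1.0, 10.0),
    }

candidates = [sample_params() for _ in range(50)]  # 50 combinations

def score(p):
    # Stand-in for validation accuracy of a trained model.
    return -abs(p["depth"] - 6) - abs(p["learning_rate"] - 0.469)

best = max(candidates, key=score)
```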


During the ‘training’ analysis, the feature importances were observed from the resulting model. The top features with importance ratings greater than an arbitrary threshold of 0.5 were used as a reduced dataset containing only 22 of the 35 features. A new model was trained with tuning of the same hyper-parameters and the same number of iterations and combinations. The optimal parameters were found to be depth = 6, l2 leaf reg[ularization] = 8.28, and learning rate = 0.373. This resulted in an AUC-ROC of 0.807 and an accuracy of 0.729. Since the reduced feature set was more efficient and the resulting model performed similarly to the previous one, the reduced-feature model was chosen as the final model to predict the unknown application outcomes.
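The thresholding step amounts to a simple filter over the model's importance scores (the names and values below are made up for illustration; CatBoost exposes the real scores via `get_feature_importance`):

```python
# Illustrative importance scores keyed by feature name.
importances = {
    "lender": 18.2, "loan_amount": 12.7, "msa_md": 4.1,
    "occupancy": 0.6, "population": 0.3, "loan_type": 0.1,
}

# Keep only features at or above the 0.5 importance threshold.
reduced = [name for name, imp in importances.items() if imp >= 0.5]
```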

TEST/ PREDICT | 4

TESTING AND PREDICTING MODEL | 4A

The ‘testing’ dataset was prepared in the same manner as the ‘training’ dataset. Missing numerical values were imputed with their respective median feature values, and the missing categorical values identified by ‘-1’ were left alone and fed into the model as categorical objects. The ‘testing’ feature columns were reduced to the 22 features with importance ratings of 0.5 and above from the initial training model, listed below.


Msa_md – (Cat-int) Metropolitan Statistical Area/ Metropolitan Division (-1 as missing value)
State_code – (Cat-int) Indicates the US state (-1 as missing value)
County_code – (Cat-int) Indicates the county (-1 as missing value)
Lender – (Cat-int) The lender that approved or denied the loan
Loan_amount – (Num-float) Size of requested loan in thousands of US dollars
Loan_type – (Cat-int) Indicates if loan was: conventional, FHA-insured, VA-guaranteed, or FSA/RHS
Property_type – (Cat-int) Indicates if loan application was for: 1-4 family, manufactured, or multifamily
Loan_purpose – (Cat-int) Indicates if loan application was for: home purchase, home improvement, or refinancing
Occupancy – (Cat-int) Indicates if application property was: owner-occupied as principal dwelling, not owner-occupied, or not applicable
Preapproval – (Cat-int) Indicates if: preapproval requested, preapproval not requested, or not applicable
Applicant_income – (Num-float) Size of income in thousands of US dollars
Applicant_ethnicity – (Cat-int) Ethnicity of applicant indicating: Hispanic or Latino, Not Hispanic or Latino, info not provided, or not applicable
Applicant_race – (Cat-int) Race of applicant indicating: American Indian or Alaska Native, Asian, Black or African American, Native Hawaiian or Other Pacific Islander, White, info not provided, or not applicable
Applicant_sex – (Cat-int) Sex of applicant indicating: male, female, info not provided, or not applicable
Co_applicant – (Boolean) Indicates whether there is a co-applicant or not
Ffiecmedian_family_income – (Num-float) FFIEC median family income in dollars for MSA/MD in which the tract is located (adjusted annually by FFIEC)
Log(Applicant_income+1) – (Num-float) Log(x+1) of the size of income in thousands of US dollars
Log(loan_amount+1) – (Num-float) Log(x+1) of the size of requested loan in thousands of US dollars
Log(minority_population_pct+1) – (Num-float) Log(x+1) of the percentage of minority population to total population for tract
LTI – (Num-float) The ratio of loan_amount divided by applicant_income
Log(LTI+1) – (Num-float) Log(x+1) of the ratio loan_amount divided by the applicant_income
RaceGroup – (Cat-int) Applicant_race binned into 5 categories: Asian, Black or African American, White, info not provided/not applicable, and American Indian/Alaska Native/Native Hawaiian/Other Pacific Islander


Only three new features were created from the ‘test’ data to match the reduced-feature ‘train’ dataset: the two numerical features LTI and its transformed version log(LTI+1), and the categorical RaceGroup, which binned the applicant races from seven categories into five as previously described. At this point the ‘testing’ data was formatted as a mirror of the ‘training’ data, and the model was applied to create the predictions. The final predicted values for the mortgage acceptance outcome were submitted to DrivenData and received a final score of 0.7275 accuracy.

The CatBoost package from Yandex proved to be very efficient at predicting categorically-heavy data on CPU-only computers. The best model provided strong accuracy while utilizing only a reduced set of 22 significant features and three hyper-parameters. Accuracy could be further increased by tuning the 60+ additional hyper-parameters available within the CatBoost package.

In conclusion, the top features shown to be important in determining whether a mortgage application will be accepted were:

1. Geographical in nature, for example the state, county, and Metropolitan Statistical Area/Metropolitan Division codes for the property tract;
2. Information from the loan application itself, corresponding to the lender, the loan's purpose, type, and amount, whether preapproval was requested, whether a co-applicant was included, and the type of property and whether it will be owner-occupied as a principal dwelling;
3. Personal applicant information, such as income, race, ethnicity, sex, and loan-to-income ratio;
4. Census information, such as the percentage of minorities in the population for the tract and the FFIEC median family income for the MSA/MD in which the tract is located.

The findings reported above indicate that mortgage loan approvals can be predicted with ~73% accuracy without key industry standard features such as credit score, debt-to-income ratio, and the appraised value of the property for a loan-to-value ratio. Instead, mortgage loan outcomes can be predicted with commonly found pieces of data from the loan application, the applicant, census information, and geographical data.

References:

  1. Competition Site- https://www.datasciencecapstone.org/competitions/14/mortgage-approvals-from-government-data/
  2. Competition Leaderboard- https://www.datasciencecapstone.org/competitions/14/mortgage-approvals-from-government-data/leaderboard/

To find the data for this analysis & forecast please visit my GitHub repository at https://github.com/SamanthaDSpivey/Mortgage-Approvals-Python