Improving the Performance of Loan Risk Prediction based on Machine Learning via Applying Deep Neural Networks

Abstract: Deep learning algorithms can be applied to data collected regularly from consumers' financial activity to build predictive tools that anticipate credit card defaults. In this study, Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) were used to improve the performance of loan risk prediction and thereby lower the risk to financial institutions. Both models were trained and evaluated on the same dataset so that their performance measures could be compared directly. Deep learning predictions are useful in many areas of study because the learned representations and abstractions help make sense of images, sound, and text. By applying deep learning algorithms to credit card default prediction, the study lays groundwork for filling the gap left by outdated banking tools and inadequate computing technology.


I. INTRODUCTION
Lenders can offer consumers different interest rates by using machine learning models for credit scoring [1]. These models take into account the real risk that a person poses when taking out a loan, so that low-risk borrowers receive lower interest rates and high-risk borrowers receive higher ones. Choosing the most effective machine learning algorithm or model yields the best accuracy in predicting credit loan defaults [2]. With the increasing volume of data, different types of quantitative models have been proposed for improved credit scoring and consumer loan decisions [3]. Machine learning algorithms fall into two primary groups: supervised learning and unsupervised learning [3].
Deep learning, a subfield of machine learning, has made remarkable progress in image and text processing. In this paper, convolutional and recurrent neural networks are used because of their strong potential for mining nonlinear characteristics in data. Recent studies have shown that deep learning algorithms can achieve higher accuracy than individual and ensemble machine learning models, although there is still room for improvement. For credit default prediction, various studies have employed convolutional and recurrent neural network models to predict as accurately as possible whether a consumer will default on credit card payments. This study compares the two specified models to determine whether higher accuracy can be achieved. The results are then compared with those of the conventional machine learning models discussed in previous work [4].

A. Convolutional Neural Networks
A CNN is a type of deep learning neural network that is especially well suited to processing structured arrays of input, such as photographs [5]. As the current gold standard for many visual applications such as image classification, CNNs have also achieved success in natural language processing for text categorization. For time series classification, deep neural networks built on CNNs have recently attracted considerable interest, and time series classification problems may benefit from such methods [6]. Many of these studies were inspired by the success of CNNs in other areas, as the researchers themselves often note. For example, citing the success of CNNs in computer vision, one study argues that the same principle can be used to classify data from five mixed-gas time series acquired from an array of eight MOX gas sensors [5], [6].
CNNs have excelled at image recognition, where they are widely employed because of their strength in extracting hierarchical features. Recurrence Plots (RP) have been used to convert 1D time series signals into 2D texture images so that they can be classified with CNNs. The image representation of a time series thus enables a novel approach, allowing the extraction of features that are unavailable in the conventional 1D signal. Consequently, time series classification (TSC) problems can also be treated as texture image recognition tasks [7], [8].
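As a rough illustration of how a recurrence plot turns a 1D signal into a 2D image, the following minimal sketch marks every pair of time points whose values lie within a threshold of each other. It is a simplified, thresholded form without phase-space embedding; the function name `recurrence_plot`, the threshold `eps`, and the toy signal are assumptions made for illustration and are not taken from the cited studies.

```python
def recurrence_plot(series, eps):
    """Return a binary matrix R with R[i][j] = 1 when |series[i] - series[j]| <= eps.

    Simplified sketch: each time point is treated as a scalar state
    (no phase-space embedding), which is an assumption of this example.
    """
    n = len(series)
    return [
        [1 if abs(series[i] - series[j]) <= eps else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy signal that alternates between two levels.
signal = [0.0, 1.0, 0.0, 1.0]
for row in recurrence_plot(signal, eps=0.1):
    print(row)
```

The resulting matrix is symmetric with ones on the diagonal, and the periodic signal produces a checkerboard texture, which is the kind of 2D pattern a CNN can then classify.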

B. Recurrent Neural Networks
An RNN is another form of neural network model, one that can be trained to remember sequences of data. Because time series datasets are sequential in structure, RNNs have found widespread use in this area of study. Their capacity to learn, train, and self-adapt makes RNNs an effective alternative to traditional approaches for forecasting time series with irregular behavior [9]. RNNs can handle problems that standard models are unable to solve. They have also been shown to be capable of learning sophisticated nonlinear dynamic mappings that take successive states of the system into account [10].
RNNs are useful for time series problems. However, variants of these networks typically outperform the original architecture. In one comparison, Deep Belief Networks (DBN) were shown to be superior to RNNs [11]. Another investigation found exploding-error and vanishing-gradient problems, indicating that an RNN model without Long Short-Term Memory (LSTM) cells is not sufficient for this purpose. Problems of this kind are why RNN variants such as LSTM networks have been developed [9].
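The vanishing-gradient problem mentioned above can be illustrated numerically. The sketch below assumes a scalar RNN of the form h_t = tanh(w * h_{t-1} + x_t) and multiplies the per-step derivatives along the chain; the recurrent weight, the inputs, and the function name are hypothetical values chosen only to show the effect.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return abs(grad)


# Hypothetical recurrent weight and constant input.
grad_5 = gradient_through_time(0.5, [0.1] * 5)
grad_50 = gradient_through_time(0.5, [0.1] * 50)
print(f"after 5 steps:  {grad_5:.3e}")
print(f"after 50 steps: {grad_50:.3e}")
```

Because each factor has magnitude below one in this setting, the influence of the initial state shrinks geometrically with sequence length, which is the behavior that LSTM cells were designed to mitigate.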

C. Risk Management
Credit risk arises when a borrower fails to repay and defaults on a loan. By the end of 2019, the local household debt was about $12.3 trillion. With the Covid-19 outbreak, these numbers were reported to be likely to increase worldwide [12]. To survive in such an economic climate, banks are trying to decrease default rates and to improve accuracy in issuing credit cards and loans to applicants, ideally approaching an accuracy rate of 100%. One way to do so is data analytics. Other banks plan to use machine learning tools to assess credit risk. Such tools will help credit risk managers assess and reduce credit and liquidity risks, which are the most common and most significant risks in financial institutions [12], [13]. Moreover, banks use data analytics to analyze the activity patterns of their customers and of the financial industry in order to forecast a potential default, whether on a large scale or at the level of a personal account. The data analytics operation focuses on detecting cross-channel schemes across numerous financial systems, both within the institution and in industry-wide databases. Cross-channel detection allows banks to identify unusual activity that can lead to default, based on indicators such as payment records and the total amount owed [14].

III. PROBLEM STATEMENT, HYPOTHESIS STATEMENTS, AND RESEARCH QUESTIONS

A. Problem Statement
No comprehensive research has been reported on whether deep learning algorithms can predict credit card applicants' risk more accurately than machine learning algorithms. Data analysts require input modifications to develop a credit-qualifying model that serves major banks in their risk management operations; developing such a model is the goal of this correlational quantitative research study.

B. Hypothesis Statement
H0: Among all examined deep learning algorithms, none performs better than machine learning algorithms in predicting credit card applicants' risk.
H1: Among all examined deep learning algorithms, at least one performs better than machine learning algorithms in predicting credit card applicants' risk.

C. Research Question
How accurately does a deep learning algorithm predict credit card applicants' risk compared with conventional machine learning algorithms?
IV. METHODOLOGY

A. Method
In this study, deep learning models are examined to identify the one that produces higher accuracy than the machine learning models examined in previous work [1]. I will analyze two deep learning algorithms and evaluate and compare the performance of each base model. Various strategies for selecting base models according to their correlation or accuracy will be applied and compared.
Because the goal of this study was to evaluate and compare deep learning applications that might predict whether a consumer defaults on a credit payment more accurately than machine learning models, qualitative research methods were not used. The hypotheses tested whether, among all examined machine learning algorithms, one could be found with the highest accuracy in predicting credit card applicants' risk.
Understanding the confusion matrix (also known as the error matrix) is important because it represents the model predictions against the ground-truth labels in tabular form. The rows and columns of a confusion matrix represent the occurrences in a predicted class and in the actual class [15], [16]. Classification accuracy is one of the most straightforward metrics: it is defined as the number of correct predictions divided by the total number of predictions, multiplied by 100.
The 2 × 2 confusion matrix, as seen in Fig. 1, has four possible outcomes: true positive (TP), where the model correctly predicts the positive class; false positive (FP), where the model predicts the positive class for an instance that is actually negative; false negative (FN), where the model predicts the negative class for an instance that is actually positive; and true negative (TN), where the model correctly predicts the negative class. The metrics derived from the confusion matrix, namely Accuracy, Precision, Recall, and F1-score, are the information gathered for this study after splitting the dataset into training and testing sets.
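The four cells of the confusion matrix and the accuracy definition given above can be sketched in a few lines. This is a minimal illustration that assumes binary labels with 1 as the positive (default) class; the function names and the toy labels are hypothetical and do not come from the study's data.

```python
def confusion_counts(y_true, y_pred):
    """Count (TP, FP, FN, TN) for binary labels, with 1 as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # correctly predicted positive
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted positive, actually negative
        elif predicted == 0 and actual == 1:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # correctly predicted negative
    return tp, fp, fn, tn


def accuracy_percent(tp, fp, fn, tn):
    """Correct predictions divided by all predictions, multiplied by 100."""
    return 100.0 * (tp + tn) / (tp + fp + fn + tn)


# Hypothetical labels: 1 = default, 0 = no default.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)                     # 2 1 2 3
print(accuracy_percent(tp, fp, fn, tn))   # 62.5
```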

B. Population and Sample
The population in this study consisted of consumers in Taiwan. The study used credit card default data stored in the public Machine Learning Repository of the University of California, Irvine (UCI) in 2016 [17]. The data consist of 30,000 instances and 25 attributes with real and integer values. The data were divided into training and testing samples to evaluate the different machine learning models using Python tools.
Each of the machine learning algorithms was run on a set of credit card default indicators such as limit balance, gender, education, marital status, and age, in addition to other indicators such as payments, bill amount, and payment amount over a six-month period. The idea was to find a relationship between these indicators and the outcome so that the applied algorithm can predict whether the consumer will default on the upcoming month's payment. Because the actual value of the dependent variable is known, the predicted values were compared with the target values in a confusion matrix.

C. Evaluation of the Performance of Deep Learning
Some of the key performance metrics used to evaluate machine learning classification algorithms are Accuracy, Precision, Recall, and F1-score [16]. Accuracy is the proportion of correct predictions among all predictions. Precision is the ratio of true positive predictions to all positive predictions. Recall is the ratio of true positive predictions to all actual positive instances. The F-measure is a testing metric computed from Precision and Recall; the F1-score is their harmonic mean [17]. For the specified classification model metrics, the F1-score is computed as
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
This study examines the best model for predicting personal credit default using various machine learning algorithms. The approach could increase accuracy and stability while also reducing artificial impact. Various researchers have examined the credit default prediction problem in depth to ensure the long-term, stable, and healthy growth of the credit business.
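The Precision, Recall, and F1-score definitions in this subsection can be written directly in terms of confusion-matrix counts. The sketch below uses hypothetical counts, not results from this study, to show how a classifier can have acceptable Precision while a very low Recall pulls the F1-score toward zero.

```python
def precision(tp, fp):
    """True positives divided by all positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """True positives divided by all actual positive instances."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of Precision and Recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for an imbalanced test set of 5,000 accounts.
tp, fp, fn, tn = 10, 5, 990, 3995
p, r = precision(tp, fp), recall(tp, fn)
print(f"accuracy  = {(tp + tn) / (tp + fp + fn + tn):.3f}")
print(f"precision = {p:.3f}")
print(f"recall    = {r:.3f}")
print(f"f1        = {f1_score(p, r):.3f}")
```

In this hypothetical case the accuracy is about 0.80 and the Precision about 0.67, yet only 1% of actual defaulters are found, so the F1-score is roughly 0.02.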

V. EXPERIMENT AND RESULTS
The study used the archived default credit card data that is freely available in the public Machine Learning Repository of UCI. The data consist of 30,000 instances and 25 attributes with real and integer values. The data were divided into training and testing samples to evaluate the different models using Python. A sampling technique was used to select data samples from the study population. The study considered different sampling scenarios, including 80:20, 70:30, 60:40, 50:50, 40:60, 30:70, and 20:80 ratios of training to testing sets for the prediction analysis. Of the 30,000 observations, 5529 observations (22.12%) were cardholders with default payments. Because an existing dataset with known outcomes is used, different supervised machine learning approaches were applied to the dataset in a previous study [4] to determine the best-fitting model.
This study emphasizes the model parameters and implementation details of each deep learning method, whose performance can be very sensitive to parameter selection. The experiments included CNN and RNN. Unless otherwise specified, the default settings in TensorFlow were used for all algorithms. Different layers were then used to calculate performance metrics for the various training/testing sets. The selected deep learning algorithms were run on the default credit card dataset to evaluate and select the best model in terms of the specified evaluation metrics: test accuracy, train accuracy, Precision, Recall, and F1-score. The two models were run on the seven different training/testing splits.
The number of false negatives should be as low as possible so that as few consumers as possible who will default on next month's payment go undetected. In this case, results should show high Recall, which is the case in most of the experiment results. When comparing models, it is difficult to decide which is better when one has high Precision and low Recall and the other the reverse. Therefore, a metric that combines both is needed, and the testing accuracy values will be used for the comparison.
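The seven training/testing scenarios can be sketched as a shuffled index split. This is only an illustration under stated assumptions: the helper `split_indices`, the fixed seed, and the plain random shuffle (without stratification by class) are choices made for this example, since the text does not specify how the splits were drawn.

```python
import random


def split_indices(n_samples, test_fraction, seed=0):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# The seven testing fractions that correspond to the 80:20 ... 20:80 scenarios.
for test_fraction in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    train_idx, test_idx = split_indices(30000, test_fraction)
    print(f"test={test_fraction:.0%}: train={len(train_idx)}, test={len(test_idx)}")
```

With an imbalanced target, a stratified split that preserves the default rate in both parts is a common alternative; that variation is not shown here.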

A. Results
A CNN is a deep learning model for processing data with a grid pattern, such as images. It is a mathematical structure that generally consists of convolutional layers, pooling layers, and fully connected layers. The first two, the convolution and pooling layers, are responsible for feature extraction, while the third, the fully connected layer, maps the extracted features to the final output, such as a classification [5], [6]. A CNN uses a series of layers to perform a variety of mathematical operations, including convolution, a specific kind of linear operation. In a digital image, the pixel values are stored in a two-dimensional (2D) grid, or array of numbers, and a feature can appear anywhere in that grid. The architecture of the CNN model used in this study consists of the Input Layer, a Conv2d Layer (6 channels and ReLU activation function), a Conv2d Layer (12 channels and ReLU activation function), a MaxPool Layer (size of 2), and a Flatten Layer (64 nodes).
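To make the feature-extraction stage concrete, the following sketch runs one convolution, a ReLU activation, a 2 × 2 max pooling, and a flatten step on a small grid. It illustrates the operations only and is not the study's model: the single channel, the 5 × 5 input, and the 2 × 2 kernel are hypothetical, and the convolution is implemented as cross-correlation without kernel flipping, as is usual in deep learning libraries.

```python
def conv2d_valid(image, kernel):
    """2D cross-correlation with no padding and stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Replace negative activations with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [
        [
            max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


def flatten(feature_map):
    """Turn a 2D feature map into a 1D vector for the dense layers."""
    return [v for row in feature_map for v in row]


# Hypothetical 5 x 5 input grid and 2 x 2 kernel.
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 0],
    [2, 1, 0, 1, 2],
    [0, 3, 1, 0, 1],
]
kernel = [[1, -1], [-1, 1]]

features = flatten(max_pool(relu(conv2d_valid(image, kernel))))
print(features)  # [3, 0, 4, 3]
```

The 5 × 5 grid becomes a 4 × 4 feature map after convolution, a 2 × 2 map after pooling, and a vector of four values after flattening.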
Based on the results in Table I, the accuracy values were promising, at close to 90.50% for all seven testing sizes. This is good news for financial institutions, since a model with an accuracy of 90% or more can detect many defaulters accurately.
RNNs are used in multiple fields [9]. In Natural Language Processing (NLP), for instance, RNNs have been employed for tasks including handwriting generation, machine translation, and speech recognition. RNNs also have many uses outside language processing, including prediction in banks and financial institutions, and in Computer Vision they are commonly employed for tasks such as image captioning and image question answering. An RNN operates like a chain: at each time step, the computation depends on the computation at the preceding time step. The strength of deep learning training lies in the hidden depths of the network [10], [11].
The architecture of the RNN model used in this study has five layers: the Input Layer, an LSTM layer with ten blocks, another LSTM layer with 50 blocks, a Dense Layer with ReLU activation function, and the Output Layer (1 node and Sigmoid activation function).
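For readers unfamiliar with what a single LSTM block computes, the sketch below implements one time step of the standard LSTM cell for scalar inputs and states. It is not the study's TensorFlow model: the scalar simplification, the parameter dictionary layout, and all weight values are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps each gate name to a tuple (w_x, w_h, b); this layout is
    a hypothetical convention for this example.
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))        # how much new content to write
    f = sigmoid(pre_activation("forget"))       # how much old cell state to keep
    o = sigmoid(pre_activation("output"))       # how much cell state to expose
    g = math.tanh(pre_activation("candidate"))  # proposed new content
    c = f * c_prev + i * g                      # updated cell state
    h = o * math.tanh(c)                        # updated hidden state
    return h, c


# Hypothetical weights; a six-step sequence mirrors the six monthly records.
params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (0.8, 0.3, 0.0),
}
h, c = 0.0, 0.0
for x in [0.2, -0.1, 0.4, 0.0, 0.3, -0.2]:
    h, c = lstm_step(x, h, c, params)
print(f"final hidden state: {h:.4f}, final cell state: {c:.4f}")
```

The additive update of the cell state is what lets information persist across many steps, in contrast to a plain RNN where the state is rewritten at every step.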
The results in Table II show that the accuracy values were promising, at close to 91.40% for all seven testing sizes. Similar to the CNN model, the RNN model shows strong results that can be used to detect many defaulters accurately.

B. Summary
The study reported the results of the two deep learning models. Both CNN and RNN performed much better than the machine learning models in a previous study [4]. The CNN shows an accuracy of 90.5%, while the RNN shows the best accuracy, 91.40%.
Both models performed well on the 80% testing set (Fig. 3). The accuracy of the CNN model was around 90%, and the accuracy of the RNN was slightly better, at around 91%. The Precision value can also be considered; however, Precision alone could not characterize the classifiers' performance, because the Recall value is close to zero.
When the testing set was reduced to 70% (Fig. 4), the results did not change much. Both models still performed well, with accuracies of 90.20% and 91.10%, respectively. There were also no changes in the values of Precision and Recall, which kept the F1-score very low.
With a 60% testing set (Fig. 5), the accuracy of both models improved slightly, to 90.37% and 91.17%, respectively. The notable change was in the Precision of the CNN model, which increased to more than 50%. However, Recall did not improve because of the high FN value and very low TP value.
With a 50% testing set (Fig. 6), the accuracy of the CNN and RNN models changed only marginally, to 90.4% and 91.25%, respectively. The notable change was in the Precision of both models, which increased to more than 50% and 60%, respectively. However, Recall did not improve because of the high FN value and very low TP value.
With a 40% testing set (Fig. 7), not much changed. In this experiment, the Precision of the RNN dropped significantly, to around 40%, and Recall again did not improve because of the high FN value and very low TP value.
With a 30% testing set, the accuracy of both neural network models improved. The accuracy of the CNN and RNN models rose to 90.43% and 91.28%, respectively, as shown in Fig. 8 and Fig. 9. There was a slight indication that decreasing the share of testing data and increasing the share of training data improved the fit. The best results were obtained with a 20% or 30% testing set. Still, the results are very close across splits, and this performance makes the two five-layer models worth considering.
As described in previous research [4], the RF model is not thought to be susceptible to overfitting. According to the data analyzed, this is indeed the case with Default Credit Card Payments. That may be due, at least in part, to the fact that the dataset is sufficiently sizable to permit model fitting. The model's accuracy rises from 81.91% to 82.13% when the depth is increased to a maximum of 10. The AdaBoost model is expected to outperform the RF model since it incorporates additional rules and changes throughout the training process. However, its accuracy is slightly lower than that of the RF model, at 81.46%. In summary, the AdaBoost and RF models perform about the same, with an accuracy of 81 to 82%.
Although CNNs and RNNs are recommended for tasks such as image classification because of their ability to process large datasets, the use of deep learning techniques is still in its infancy. This research therefore also investigates the effectiveness of deep learning techniques for credit card default classification. The CNN model employed has the following components: Input Layer, Conv2d Layer (6 channels and ReLU activation function), Conv2d Layer (12 channels and ReLU activation function), MaxPool Layer (size of 2), and Flatten Layer (64 nodes). The RNN model that I used in this study has only three layers: an Input Layer, a Dense Layer (with 50 LSTM blocks and a ReLU activation function), and an Output Layer (1 node and Sigmoid activation function). Both types of deep learning models used here demonstrate an increase in accuracy, reaching 90% to 91%.
Because financial data are sensitive and protected to safeguard users' privacy, publicly available datasets are not very common, which limits the scope of research in this domain, where machine learning algorithms rely entirely on historical data. This means that any given investigation of the topic can make use of a single dataset at best. In machine learning, a model's performance may vary greatly across datasets. Additionally, the number of default instances is low compared with the number of normal transactions made with a credit card, creating a class imbalance problem for algorithms used to identify credit card defaults. Another goal of this research is to examine how different sampling methods affect model behavior when applied to the problem of class imbalance. Conventional techniques perform poorly on modern, large-scale datasets.
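One simple sampling method for class imbalance is random oversampling of the minority class. The text does not state which sampling methods were examined, so the sketch below is offered only as a generic example; the function name, the seed, and the toy data are hypothetical.

```python
import random


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class has the same count."""
    rng = random.Random(seed)
    rows_by_class = {}
    for row, label in zip(features, labels):
        rows_by_class.setdefault(label, []).append(row)

    target = max(len(rows) for rows in rows_by_class.values())
    balanced_x, balanced_y = [], []
    for label, rows in rows_by_class.items():
        # Draw extra rows with replacement from the same class.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            balanced_x.append(row)
            balanced_y.append(label)
    return balanced_x, balanced_y


# Hypothetical toy data: 7 non-default rows (0) and 2 default rows (1).
x = [[i] for i in range(9)]
y = [0] * 7 + [1] * 2
bx, by = random_oversample(x, y)
print(by.count(0), by.count(1))  # 7 7
```

Oversampling is normally applied to the training split only, so that the testing set keeps the original class proportions.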
In conclusion, the RF model introduced in a previous study had the best performance among the conventional models, but its accuracy was not higher than 90%. Choosing a deep learning model such as CNN or RNN is therefore a better choice, as these models show an accuracy of at least 90%. Consequently, lending institutions will be able to use the final model to bring down the high delinquency rate by accurately separating credit card defaulters from non-defaulters.

Fig. 3. Performance of the Models based on 80% Testing Size.

Fig. 4. Performance of the Models based on 70% Testing Size.

Fig. 5. Performance of the Models based on 60% Testing Size.

Fig. 6. Performance of the Models based on 50% Testing Size.

Fig. 7. Performance of the Models based on 40% Testing Size.

Fig. 8. Performance of the Models based on 30% Testing Size.

Fig. 9. Performance of the Models based on 20% Testing Size.

TABLE I: CNN PERFORMANCE RESULTS BASED ON VARIOUS TESTING DATA SIZES