

Improving the Predictive Power of Machine Learning Methods through Exploratory Data Analysis and Feature Engineering – Taking Credit Loan Data as an Example

Advisor: 何宗武

Abstract


The goal of this study is to use exploratory data analysis to transform variables and create new ones, partition the data into groups with statistical methods, and then compare how well different models predict each group, in order to show that this approach reduces the overall heterogeneity of the data. Models are trained separately on each group, and their predictive performance is measured with several evaluation indicators to achieve the goal of improving prediction.

We use the bank credit-loan dataset "Give Me Some Credit" from Kaggle as our research data. It contains borrowers' basic personal information and personal financing variables, and its target variable is whether the borrower was ever 90 or more days past due on a loan. In the analysis, a chi-squared test shows that the number of open credit lines and loans and the debt ratio are not mutually independent. Before partitioning the data, we use a difference (t) test to locate the partition boundary: for borrowers with 0-12 open credit lines and loans, the debt ratio differs significantly between defaulting and non-defaulting borrowers, whereas for borrowers with 13-56 open credit lines and loans there is no significant difference between the two classes. We therefore split the data into two groups at a boundary of 12 open credit lines and loans, and train logistic regression, decision tree, K-nearest neighbors, and random forest models with cross-validation. Predictive performance is evaluated with five indicators: accuracy, recall, precision, F1 score, and the AUC (area under the curve) of the receiver operating characteristic (ROC) curve. These five indicators are used to assess each model's performance on the two groups and to compare whether the differences are significant.

The results show that K-nearest neighbors improves most clearly on the data with no missing values: Groups I, II, and III all improve by roughly 10%-15% on every indicator except recall, and recall improves most in Group II, by about 7%. Comparing predictive performance across groups, Group I outperforms Group II by about 4%-10% under decision tree, K-nearest neighbors, and random forest, indicating that partitioning does change predictive performance. The datasets with "only monthly income missing" and with "both monthly income and number of dependents missing" were not partitioned, possibly because key variables are absent or the number of records is insufficient, so their improvement in predictive performance is less satisfactory than that of the data with no missing values.
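The boundary-finding procedure described above (a chi-squared independence test, then two-sample t tests on either side of a candidate cut point) can be sketched as follows. This is a minimal illustration on synthetic data, not the actual Kaggle fields: the column names (`open_lines`, `debt_ratio`, `default`) and the injected 0.15 shift for low-line defaulters are assumptions made so the two tests behave the way the abstract describes.

```python
# Sketch of the partition-boundary search: chi-squared test of independence,
# then Welch t tests on debt ratio for each side of the candidate boundary.
# Synthetic data; column names and the 0.15 shift are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2000
open_lines = rng.integers(0, 57, size=n)   # stands in for NumberOfOpenCreditLinesAndLoans
default = rng.integers(0, 2, size=n)       # target: 90+ days past due (1) or not (0)
debt_ratio = rng.normal(0.35, 0.1, size=n)
# Inject a difference only for defaulters with few open lines, mirroring the finding
debt_ratio[(open_lines <= 12) & (default == 1)] += 0.15

# Chi-squared test: are binned open-lines and binned debt ratio independent?
lines_bin = (open_lines > 12).astype(int)
debt_bin = (debt_ratio > np.median(debt_ratio)).astype(int)
table = np.zeros((2, 2))
for i, j in zip(lines_bin, debt_bin):
    table[i, j] += 1
chi2, p_indep, _, _ = stats.chi2_contingency(table)

# Two-sample (Welch) t tests on debt ratio, default vs. non-default,
# separately for 0-12 and 13-56 open credit lines and loans
low = open_lines <= 12
t_low = stats.ttest_ind(debt_ratio[low & (default == 1)],
                        debt_ratio[low & (default == 0)], equal_var=False)
t_high = stats.ttest_ind(debt_ratio[~low & (default == 1)],
                         debt_ratio[~low & (default == 0)], equal_var=False)
print(f"independence p={p_indep:.4g}, "
      f"t-test p (0-12)={t_low.pvalue:.4g}, t-test p (13-56)={t_high.pvalue:.4g}")
```

On this synthetic data the 0-12 group shows a significant difference while the 13-56 group does not, which is exactly the pattern used in the study to justify 12 as the split boundary.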

Abstract (English)


The purpose of this study is to explore the relationships between variables through exploratory data analysis, transforming variables and creating new interaction-term variables, and to use statistical methods to split the data into two datasets in order to reduce the heterogeneity of the original dataset. After splitting, we train different models on each dataset separately and measure the effectiveness of their predictions to achieve the purpose of improving prediction. In this study, we use the credit-loan dataset "Give Me Some Credit" from Kaggle as our research data; its target variable is whether a person experienced delinquency of 90 days past due or worse. In the analysis, a chi-squared test shows that Debt Ratio and Number Of Open Credit Lines And Loans are not independent. Before partitioning the data, we use t tests to find the boundary for segmenting it: there is a significant difference in Debt Ratio between default and non-default borrowers with 0-12 open credit lines and loans, whereas borrowers with 13-56 open credit lines and loans show no significant difference. We therefore set 12 open credit lines and loans as the boundary, divide the data into two groups, and train logistic regression, decision tree, K-nearest neighbor, and random forest models with cross-validation. To evaluate predictive performance, we choose Accuracy, Recall, Precision, F1 Score, and AUC as our indicators; these five indicators are used to assess the predictive performance of the different models on the two groups of data and to compare whether the differences are significant. The results show that K-nearest neighbor yields the most obvious improvement on the "No Missing Value" data: except for recall, Groups I-III improve by about 10%-15%, and Group II shows the largest recall improvement, about 7%.
Comparing prediction performance among groups, Group I is about 4%-10% better than Group II under decision tree, K-nearest neighbor, and random forest, indicating that prediction performance differs after grouping. The "Only Monthly Income is Missing" and "Monthly Income and Number of Dependents are Missing" datasets were not grouped, possibly due to the lack of key variables or an insufficient number of records, so their improvement in prediction is not as good as that of the "No Missing Value" data.
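The split-then-train pipeline can be sketched as below: divide the data at 12 open credit lines and loans, then run the four model families through cross-validation on each group with the five evaluation indicators. This is a minimal sketch on synthetic features; the hyperparameters, feature count, and random data are illustrative assumptions, and the EDA-derived interaction features used in the actual study are omitted.

```python
# Sketch of the split-then-train pipeline: two groups split at 12 open lines,
# four classifiers, five cross-validated metrics. Synthetic data; the
# hyperparameters and feature construction are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 3000
X = rng.normal(size=(n, 5))                # stand-in borrower features
open_lines = rng.integers(0, 57, size=n)   # splitting variable
# Binary target driven by two of the features plus noise
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

scoring = ["accuracy", "recall", "precision", "f1", "roc_auc"]
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=15),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Train and evaluate each model separately on the two groups split at 12
results = {}
for name, model in models.items():
    for label, mask in [("group 1 (0-12)", open_lines <= 12),
                        ("group 2 (13-56)", open_lines > 12)]:
        cv = cross_validate(model, X[mask], y[mask], cv=5, scoring=scoring)
        results[(name, label)] = {m: cv[f"test_{m}"].mean() for m in scoring}
        summary = ", ".join(f"{m}={results[(name, label)][m]:.3f}" for m in scoring)
        print(f"{name:14s} {label}: {summary}")
```

Comparing the per-group metric tables produced this way is how the study judges whether splitting the data improved prediction for a given model family.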

