Application of Support Vector Machines to Imbalanced Class Problems

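According to statistics on the website of the Securities and Futures Investors Protection Center (SFIPC), more than seventy corporate financial litigation cases have been filed since 1995 (ROC year 84), on charges including fraudulent financial statements, stock price manipulation, and insider trading. These cases have had far-reaching effects on the securities market, investors, accountants, and even government agencies. In recent years, many studies on financial topics such as financial fraud prediction, credit card fraud, and financial statement fraud have applied data mining techniques, including Artificial Neural Networks, Support Vector Machines, and Logistic Regression, to financial prediction, with good results. Imbalanced class problems are common in practice, in fields such as medicine, crime, and financial fraud, and they make classification harder. Current research follows two broad strategies. The first improves the algorithm, aiming at better generalization ability or faster execution. The second uses sampling techniques to change the class distribution of the data before a classifier is applied for classification or prediction. The commonly used sampling techniques either reduce the majority class (under-sampling) or use an algorithm to add minority class data (over-sampling), so that the numbers of majority and minority instances are more balanced and classification improves; many studies also report that using under-sampling and over-sampling together effectively improves classification results. This study examines enterprises that face legal risk because of financial problems. A data preprocessing step first balances the originally extremely imbalanced dataset: the bootstrap is used to sample the majority class, turning the large sample into a small one, and the Synthetic Minority Over-sampling Technique (SMOTE) is then used to add minority class data. After feature selection, a Support Vector Machine with optimized parameters predicts enterprise legal risk, and finally Rough Sets Theory (RST) is used to extract rules. The experimental results show that the combined sampling techniques improve classification accuracy, and the extracted rule table can serve as a reference on enterprise legal risk for corporate financial consultants, auditors, and others.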
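The rebalancing step described above can be illustrated with a minimal sketch. It is not the study's implementation: the sample sizes, the neighbour count `k`, the toy data, and the function names `bootstrap_undersample` and `smote` are all assumptions made for illustration, and the abstract does not state the settings actually used.

```python
import math
import random


def bootstrap_undersample(majority, n_keep, rng):
    """Draw n_keep rows from the majority class with replacement (bootstrap)."""
    return [rng.choice(majority) for _ in range(n_keep)]


def smote(minority, n_new, k, rng):
    """Create n_new synthetic minority rows by interpolating between a minority
    row and one of its k nearest minority neighbours (the core SMOTE idea)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority rows
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: math.dist(base, row))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> target
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return synthetic


rng = random.Random(0)
# Hypothetical toy data: 200 majority rows, 10 minority rows, 2 features
majority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
minority = [[rng.gauss(3, 0.5), rng.gauss(3, 0.5)] for _ in range(10)]

# Shrink the majority class, then grow the minority class to the same size
majority_small = bootstrap_undersample(majority, n_keep=40, rng=rng)
minority_big = minority + smote(minority, n_new=30, k=5, rng=rng)

X = majority_small + minority_big
y = [0] * len(majority_small) + [1] * len(minority_big)
```

In this sketch both classes end up with 40 rows, and `X` and `y` would then be passed on to feature selection and the classifier.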

Keywords

support vector machine; rough set theory; imbalanced class

Parallel Abstract

A major financial fraud case in 2005 drew greater public attention to financial fraud. According to statistics on the website of the Securities and Futures Investors Protection Center (SFIPC), there were more than seventy financial litigation cases from 1995 to 2012. These cases, which include fraudulent financial statements, stock price manipulation, and insider trading, have hurt the securities market, investors, accountants, and even the authorities. Many recent studies on financial topics, such as credit card fraud detection, financial statement fraud, and financial fraud prediction, have used data mining techniques such as Artificial Neural Networks (ANN), Support Vector Machines (SVM), and Logistic Regression (LR), and have reported good results. Data with imbalanced classes are common in fields such as medicine, crime, and finance, and they make classification more difficult. Two strategies have been proposed to address this problem: algorithm-based approaches and sampling approaches. Algorithm-based approaches focus on improving the generalization ability or the speed of classifiers, whereas sampling approaches use sampling techniques to rebalance the distribution of the data before classification or prediction. The most common sampling techniques over-sample the minority class or under-sample the majority class. The literature reports that combining over-sampling and under-sampling can greatly improve classification performance. This research examines enterprises that have been litigated because of their financial problems. Sampling techniques are used to rebalance the imbalanced dataset in the data preprocessing step; the methods used are the bootstrap and the Synthetic Minority Over-sampling Technique (SMOTE). After features are selected with the RELIEF algorithm, a Support Vector Machine (SVM) with optimized parameters is used to predict the litigated enterprises. Finally, Rough Sets Theory (RST) is applied to extract rules from the SVM. The experimental results show that combining under-sampling and over-sampling yields a higher accuracy rate, and the rules obtained from RST can provide a useful reference for financial consultants and auditors of enterprises.
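As a rough illustration of the RELIEF feature selection step, the sketch below computes Relief-style feature weights for a two-class problem. The abstract does not say which RELIEF variant or settings were used, so the unsquared, range-scaled difference, the iteration count, the toy data, and the function name `relief_weights` are assumptions.

```python
import random


def relief_weights(X, y, n_iter, rng):
    """Relief-style feature weights for a two-class problem with numeric features."""
    n_features = len(X[0])
    # Per-feature range, used to scale differences into [0, 1]
    spans = []
    for f in range(n_features):
        column = [row[f] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def distance(a, b):
        return sum(diff(f, a, b) for f in range(n_features))

    weights = [0.0] * n_features
    for _ in range(n_iter):
        i = rng.randrange(len(X))
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        near_hit = min(hits, key=lambda j: distance(X[i], X[j]))
        near_miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(n_features):
            # Reward features that differ from the nearest other-class row,
            # penalise features that differ from the nearest same-class row
            gain = diff(f, X[i], X[near_miss]) - diff(f, X[i], X[near_hit])
            weights[f] += gain / n_iter
    return weights


rng = random.Random(0)
# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise
X, y = [], []
for label in (0, 1):
    for _ in range(30):
        X.append([label * 4.0 + rng.gauss(0, 0.5), rng.gauss(0, 1.0)])
        y.append(label)

weights = relief_weights(X, y, n_iter=60, rng=rng)
# Feature indices ordered from most to least relevant
ranked = sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)
```

Features with the highest weights would be kept and passed to the SVM; how many to keep is a separate choice that the abstract does not specify.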

Parallel Keywords

support vector machine; rough set theory; imbalanced class

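Cited by

葉泰均 (2013). Using artificial neural networks to explore health examination items related to colorectal abnormalities and their prediction models [Master's thesis, Chaoyang University of Technology]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0078-2712201314041915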
