  • Thesis

Combining Mahalanobis Distance with the Attributive Selection Mechanism for Solving Problems of Data Classification

(Original title: 結合馬氏距離和屬性挑選機制解決資料分類上之問題)

Advisor: 江瑞清

Abstract


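With the explosive growth of information, data processing is receiving more and more attention. Transforming and analyzing large volumes of information to extract useful knowledge has therefore become an important research topic. In binary classification data, the majority class and the minority class often differ in size. Such imbalanced data give a model high accuracy on the majority class but reduce its sensitivity to the minority class. In practice, however, misclassifying the minority class causes more serious losses than misclassifying the majority class, for example when a defective product is judged to be normal during inspection. This thesis therefore proposes a Mahalanobis Distance-Attributive Selection Mechanism (MD-ASM) algorithm. It accounts for the correlations among variables in multivariate data analysis. The computed Mahalanobis distance is used to construct the system's measurement scale, and the class of an observation is determined by how far it lies from the reference point. For feature screening, an attribute selection mechanism is added to identify the important variables and reduce the dimensionality of the system. The method is simple in both concept and computation, and it removes the interference of redundant attributes that do not contribute to classification. The algorithm thus integrates Mahalanobis distance with attribute selection for classification and for the selection and design of system parameters.

For empirical validation, three data sets from the UCI repository and one data set on mobile phone power amplifier modules are used as case studies. In the case study, the 17 inspection variables that distinguish non-defective from defective products are reduced to 3. The resulting class prediction model can serve as a basis and reference for future product inspection processes, improving product quality and market competitiveness. Finally, the experimental results show that the MD-ASM algorithm classifies imbalanced data sets accurately and handles the data robustly and well.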

Parallel Abstract


Given the rapid growth of information, data processing is receiving increasing attention, and analyzing a huge body of information to discover useful knowledge has become a major research topic. In binary classification problems, imbalanced and skewed data sets often occur in real applications: majority instances far outnumber minority instances. For a predictive model or classifier, this imbalance typically yields high accuracy on the majority class but poor sensitivity to the minority class. In practice, prediction errors on the minority class cause more severe losses than those on the majority class, for example when a defective product is identified as a qualified one. Motivated by these problems, we propose a Mahalanobis Distance-Attributive Selection Mechanism (MD-ASM) algorithm, which considers the correlations among variables in the analysis of multivariate data. The calculated Mahalanobis distance (MD) is used to create the measurement scale of the system, and the class is determined by the distance to the reference point. ASM is applied to the candidate feature variables to identify the significant ones and reduce system dimensionality. The method is simple in concept and calculation, and it eliminates interference from redundant attributes that do not contribute to classification. The proposed algorithm thus integrates MD and ASM for classification and for the selection and design of system parameters.

For demonstration, three data sets from the UCI repository and one data set on mobile phone power amplifier modules are used as case studies. In the case study, the 17 testing variables that distinguish non-defective from defective products are screened down to three, and the resulting class prediction model can provide criteria and references for future product testing processes, improving product quality and market competitiveness. Experimental results show that the MD-ASM algorithm classifies imbalanced data sets accurately and exhibits robust, strong performance.
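The distance-based part of this idea can be illustrated with a minimal sketch. It is not the thesis's MD-ASM implementation: it only computes a squared Mahalanobis distance to the mean of a reference group and compares it with a cutoff. The function names, the toy data, and the threshold of 10.0 are all assumptions made for illustration, and the attribute selection step is not shown because the abstract does not describe its procedure.

```python
# Hypothetical sketch: Mahalanobis distance to a reference group as a scale.
# Names, data, and the threshold are illustrative assumptions, not the thesis's.

def fit_reference(samples):
    """Return the mean vector and sample covariance matrix of the reference group."""
    n, k = len(samples), len(samples[0])
    mean = [sum(row[j] for row in samples) / n for j in range(k)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in samples) / (n - 1)
            for b in range(k)] for a in range(k)]
    return mean, cov

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    k = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance d^T S^-1 d from x to the reference mean."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    weighted = solve(cov, diff)  # S^-1 d, without forming the inverse explicitly
    return sum(d * w for d, w in zip(diff, weighted))

def classify(x, mean, cov, threshold):
    """Label x 'abnormal' when its distance exceeds the (assumed) threshold."""
    return "abnormal" if mahalanobis_sq(x, mean, cov) > threshold else "normal"

# Toy reference group with two strongly correlated variables (made-up numbers).
reference = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
mean, cov = fit_reference(reference)

for point in ([5.0, 10.0], [3.0, 9.0]):
    print(point, mahalanobis_sq(point, mean, cov), classify(point, mean, cov, 10.0))
```

In this toy data the point [5.0, 10.0] lies farther from the mean in Euclidean terms than [3.0, 9.0], yet it follows the correlation between the two variables and so receives the smaller Mahalanobis distance. That is the sense in which the distance takes the correlations among variables into account.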

