  • Degree thesis


Using of Optimal Cluster Analysis and Variability Analysis for Multi-class Classification Problems

Advisor: 林泓毅


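Multi-class classification problems have more complex data patterns and decision models than binary classification problems, and therefore require more effective data analysis techniques and data mining methods. With the high-dimensional attributes found in today's large databases, data classification is becoming increasingly difficult. This study applies fuzzy cluster analysis to cluster continuous data and uses the cluster validity index PBMF-Index to optimize the number of clusters for each attribute. Beyond simplifying the data, this attribute preprocessing aims to highlight the discrimination power of the attributes. In addition, building on entropy-based attribute evaluation criteria and incorporating data variance, we propose a new attribute evaluation method, aggregation gain, in the hope of evaluating the discrimination power of attributes more precisely. Information redundancy is common among attributes, so combining a group of attributes that each have good discrimination power does not necessarily give the best classification result, because several attributes may repeatedly perform similar classification work. To ensure that the characterizing attributes involved in classification each play their own role and achieve the greatest classification effect, this study proposes a heuristic attribute selection algorithm that applies principal component analysis (PCA) to variability measurement in order to select a suitable group of attributes and complete multi-class classification efficiently. Finally, we use five datasets to validate the proposed design and methods: the selected attributes are used to build five common classifiers in order to observe the effects of cluster analysis, the new evaluation method, and the heuristic algorithm on classification accuracy and discriminating ability (ROC area).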


Multi-class classification problems involve more intricate data patterns and decision models than binary classification problems, and therefore call for more effective data analysis and data mining techniques. Today's large-scale datasets with high-dimensional attributes make classification increasingly difficult. In this study, fuzzy cluster analysis is used to preprocess continuous attributes, and the number of clusters for each attribute is optimized with the PBMF-Index. Beyond simplifying the data, the goal of this preprocessing is to enhance the discrimination power of the attributes. We also propose an entropy-based attribute evaluation criterion called Aggregation Gain, which incorporates data variation into the Information Gain criterion so that the discrimination power of attributes can be assessed more precisely. Dependencies are common among attributes, so a collection of individually discriminative attributes does not necessarily yield good classification quality: some attributes may have similar classification effects and thus contribute redundant results. To ensure that each selected characterizing attribute plays a distinct role in the classification task, we propose a heuristic algorithm for selecting a compact attribute subset. In this algorithm, each new characterizing attribute is selected through variability analysis based on Principal Component Analysis (PCA). The experimental results show that, in terms of classification performance, the new attribute selection scheme successfully produces a compact subset of characterizing attributes that works across various classifiers.
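
The abstract builds Aggregation Gain on the Information Gain criterion but does not state its formula. As a point of reference only, the following minimal Python sketch computes standard Information Gain for an attribute that has already been discretized into cluster labels. The function names and the toy data are illustrative assumptions, and the variance term that distinguishes Aggregation Gain is not reproduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(attribute_values, labels):
    """Reduction in class entropy after partitioning by a discrete attribute."""
    total = len(labels)
    # Group the class labels by the attribute value (e.g. a cluster id).
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    # Weighted average entropy of the partitions.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data: three classes, attributes given as cluster ids.
labels = ["a", "a", "b", "b", "c", "c"]
separating = [0, 0, 1, 1, 2, 2]      # clusters line up with the classes
uninformative = [0, 1, 0, 1, 0, 1]   # clusters carry no class information

gain_separating = information_gain(separating, labels)
gain_uninformative = information_gain(uninformative, labels)
```

In this toy case the attribute whose clusters coincide with the classes attains the full class entropy as its gain, while the attribute whose clusters mix all classes equally gains nothing, which is the kind of difference in discrimination power an entropy-based criterion is meant to expose.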


