Applying Optimized Cluster Analysis and Variability Analysis to Multi-Class Classification Problems

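Multi-class classification problems involve more complex data patterns and decision models than binary classification problems, and therefore call for more effective data analysis techniques and data mining methods. With the high-dimensional attributes found in today's large databases, classification is becoming increasingly difficult. This study uses fuzzy cluster analysis to partition continuous data into clusters and uses the cluster validity index PBMF-Index to optimize the number of clusters for each attribute. Beyond simplifying the data, the purpose of this attribute preprocessing is to bring out each attribute's discrimination power. In addition, we take an entropy-based attribute evaluation criterion, incorporate the data variance, and propose a new attribute evaluation method, aggregation gain, with the aim of evaluating attributes' discrimination power more precisely. Information redundancy is common among attributes, so combining a group of attributes that each have good discrimination power does not necessarily yield the best classification result, because several attributes may repeat similar classification work. To ensure that the characterizing attributes taking part in classification each play a distinct role and achieve the greatest classification effect, this study proposes a heuristic attribute selection algorithm that applies principal component analysis (PCA) to variability measurement and selects a suitable set of attributes to complete multi-class classification efficiently. Finally, we use five datasets to validate the proposed design and methods: the selected attributes are used to build five common classifiers, so as to observe how cluster analysis, the new evaluation method, and the heuristic algorithm affect classification accuracy and discriminating ability (ROC area).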
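As an illustration of the preprocessing step, the sketch below clusters a single continuous attribute with fuzzy c-means and scores each candidate cluster count with the PBMF index. It is a minimal sketch under stated assumptions, not the study's implementation: the index is written in its commonly cited form, PBMF(K) = ((1/K) * (E1 / J_m) * D_K)^2, and the function names, the quantile-based initialization, the fuzzifier m = 2, and the synthetic data are all choices made here for illustration.

```python
import numpy as np

def memberships(x, centers, m=2.0):
    # Fuzzy membership of every point in every cluster; each column sums to 1.
    dist = np.abs(x[None, :] - centers[:, None]) + 1e-12
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=0)

def fuzzy_c_means_1d(x, k, m=2.0, iters=100):
    # Assumption: centers start at evenly spaced quantiles so runs are deterministic.
    x = np.asarray(x, dtype=float)
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        w = memberships(x, centers, m) ** m
        centers = (w @ x) / w.sum(axis=1)
    return centers, memberships(x, centers, m)

def pbmf_index(x, centers, u, m=2.0):
    # Commonly cited form of the index; larger values indicate a better partition.
    x = np.asarray(x, dtype=float)
    k = len(centers)
    e1 = np.abs(x - x.mean()).sum()                          # spread with one cluster
    dist = np.abs(x[None, :] - centers[:, None])
    j_m = ((u ** m) * dist).sum()                            # fuzzy within-cluster spread
    d_k = np.abs(centers[:, None] - centers[None, :]).max()  # widest center separation
    return ((e1 / j_m) * d_k / k) ** 2

# Synthetic attribute with three well-separated groups (illustrative data only).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(mu, 0.3, 50) for mu in (0.0, 5.0, 10.0)])

scores = {}
for k in range(2, 6):
    centers, u = fuzzy_c_means_1d(x, k)
    scores[k] = pbmf_index(x, centers, u)
    print(k, round(scores[k], 2))
```

Under this scheme one would keep, for each attribute, the cluster count with the largest index value and replace the continuous values by their cluster assignments.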

Keywords

cluster analysis; variability analysis; multi-class classification problems; cluster validity index; discrimination power

Parallel Abstract

Multi-class classification problems involve more intricate decision models and data patterns than binary classification problems do, and therefore demand more capable data mining techniques. Today's large-scale datasets with high-dimensional attributes make classification tasks harder still and call for more efficient and effective handling. In this paper, fuzzy cluster analysis is employed to preprocess continuous attributes, and the number of clusters is optimized with the PBMF-Index. Besides simplifying the data, the goal of this preprocessing is to enhance the discrimination power of the attributes. We also propose an entropy-based attribute evaluation criterion, Aggregation Gain, which incorporates data variation into the Information Gain criterion so that the discrimination power of attributes can be assessed more precisely. Dependencies are common among attributes, so a collection of individually discriminative attributes does not necessarily lead to good classification quality: some attributes may have similar classification effects and therefore contribute redundant results. To ensure that each selected characterizing attribute carries a distinct share of the classification task, this paper proposes a heuristic algorithm for selecting a compact attribute subset. In this algorithm, the selection of each new characterizing attribute relies on variability analysis using Principal Component Analysis (PCA). In terms of classification performance, the experimental results show that the new attribute selection scheme successfully produces a compact subset of characterizing attributes for various classifiers.
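For reference, the standard Information Gain criterion that the abstract takes as its starting point can be sketched as follows. This is only the textbook entropy-based baseline for a discretized attribute. The abstract does not give the Aggregation Gain formula, so no variance term is included, and the function names and toy data are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    # Class entropy minus the weighted entropy left after splitting on the attribute.
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy three-class example (illustrative data only).
labels = [0, 0, 1, 1, 2, 2]
informative = ["a", "a", "b", "b", "c", "c"]   # cluster ids that separate the classes
uninformative = ["a", "a", "a", "a", "a", "a"]  # a constant attribute

print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

An attribute whose discretized values line up with the classes removes all class uncertainty, while a constant attribute removes none; a criterion such as the one proposed here would adjust this kind of score using the data variation.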

Parallel Keywords

cluster analysis; variability analysis; multi-class classification problems; cluster validity index; discrimination power; attribute evaluation

