Construction and Application of Multivariate Classification Trees

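A classification tree is a classification method widely used in data mining. It selects an appropriate attribute at each step and splits the data on it, repeating until the observations are classified. Repeated splitting, however, shrinks the sample size rapidly, so estimates in the lower levels of the tree become unreliable. Moreover, when the response is linearly related to several attributes, a traditional classification tree cannot provide an effective classification. Another classification method common in multivariate analysis is Fisher's linear discriminant, which seeks the best linear combination of attributes for separating the classes, but it is not suited to data with nonlinear relationships. To address the shortcomings of both methods, this study proposes a new classification method, the multivariate classification tree, which chooses the form of split to suit the structure of the data. When the data are linearly related, it splits on a linear combination of a set of attributes, which both describes the data more precisely and avoids the sharp loss of sample size caused by repeated splitting in the traditional method. When the data are not linearly related, it falls back to the traditional split on a single attribute. The proposed multivariate classification tree includes a method for selecting appropriate attributes and a procedure for evaluating and comparing single-attribute and multiple-attribute splits. The study also incorporates the ideas of Fisher's linear discriminant and the Mahalanobis distance, taking into account the distributions of both the response and the attributes in order to select the most appropriate decision condition (conditional clause). To validate the proposed multivariate classification tree, it is compared with traditional classification methods on simulated data. The results show that the method handles data of various structures effectively and yields accurate results.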
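
For reference, the two standard quantities named in the abstract can be written as follows. This is the textbook two-class form, not necessarily the exact formulation used in the thesis; the symbols `\mu_1`, `\mu_2`, `S_W`, `\Sigma` and `w` are notation assumed here.

```latex
% Fisher's linear discriminant, two classes with means \mu_1, \mu_2
% and pooled within-class scatter matrix S_W:
J(w) = \frac{\bigl(w^{\top}(\mu_1 - \mu_2)\bigr)^2}{w^{\top} S_W\, w},
\qquad
w^{*} \propto S_W^{-1}(\mu_1 - \mu_2)

% Mahalanobis distance of an observation x from a class with mean \mu
% and covariance matrix \Sigma:
d_M(x, \mu) = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)}
```

The direction `w^{*}` maximizes the ratio of between-class separation to within-class spread, which is why a single split on `w^{\top} x` can replace several successive single-attribute splits when the class boundary is linear.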

Keywords

Classification tree; Fisher's linear discriminant; Mahalanobis distance; Multivariate; Attribute selection

Parallel Abstract

A classification tree is a very common technique in data mining. It is built by selecting an appropriate attribute and sequentially splitting the sample into subsets. However, the sample size shrinks sharply after a few levels of splitting, which results in unreliable predictions. In addition, a classification tree cannot efficiently provide accurate results for data with a multivariate structure. Therefore, we propose a multivariate classification tree method to deal with different kinds of data structures. The objective is to choose the conditional clause that best captures the character of the data. Where needed, the proposed tree employs a linear combination of multiple attributes to avoid unnecessary reduction of the sample size and to obtain a more accurate tree model. To build the multivariate tree, we propose a systematic methodology to select the relevant attributes and to evaluate, compare, and choose between the univariate model and the multivariate model. In addition, we incorporate the ideas of Fisher's linear discriminant and the Mahalanobis distance so that the conditional clause takes into account the distributions of both the response and the attributes. To validate the proposed method, we compare it with other classification methods on simulated data and real cases. The results show that the new method can capture different data structures with acceptable accuracy.
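
The choice between a single-attribute split and a linear-combination split can be illustrated with a minimal sketch. It assumes two classes labelled 0 and 1, uses weighted Gini impurity as the comparison criterion, and takes the Fisher direction as the only multivariate candidate. The abstract does not state the thesis's actual selection criterion, so the function names (`gini`, `best_threshold`, `fisher_direction`, `choose_split`) and the criterion itself are assumptions made for illustration.

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary (0/1) label vector."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 2.0 * p * (1.0 - p)

def best_threshold(z, y):
    """Scan all cut points on a 1-D score z; return (impurity, threshold)."""
    order = np.argsort(z)
    z_sorted, y_sorted = z[order], y[order]
    n = len(y)
    best_imp, best_t = np.inf, None
    for i in range(1, n):
        if z_sorted[i] == z_sorted[i - 1]:
            continue  # no cut point between identical values
        left, right = y_sorted[:i], y_sorted[i:]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / n
        if imp < best_imp:
            best_imp = imp
            best_t = 0.5 * (z_sorted[i] + z_sorted[i - 1])
    return best_imp, best_t

def fisher_direction(X, y):
    """Two-class Fisher direction: S_W^{-1} (mu_1 - mu_0)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class scatter matrix
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    return np.linalg.solve(Sw, mu1 - mu0)

def choose_split(X, y):
    """Return (impurity, kind, attribute index or weight vector, threshold)."""
    candidates = []
    # Univariate candidates: one threshold per attribute
    for j in range(X.shape[1]):
        imp, t = best_threshold(X[:, j], y)
        candidates.append((imp, "univariate", j, t))
    # Multivariate candidate: threshold on the Fisher projection
    w = fisher_direction(X, y)
    imp, t = best_threshold(X @ w, y)
    candidates.append((imp, "multivariate", w, t))
    # Ties go to the earlier (univariate) candidate
    return min(candidates, key=lambda c: c[0])
```

On data whose class boundary is an oblique line, the projected split should reach a much lower impurity in one step than any single attribute can, which is the sample-size argument made in the abstract. On data separated along one axis, the univariate candidate is retained.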

Parallel Keywords

Classification tree; Mahalanobis distance; Multivariate; Attribute selection

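Cited By

賴淑俐 (2010). Extension of the theory and methods of multilayer discriminant analysis and its application to tumor diagnosis [Master's thesis, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2010.02182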
