  • Degree thesis

A Study of Classification Tree Performance Using Youden's Index and Receiver Operating Characteristic Curve Indices

Performance Study of CART and Multi-Layer Classifier based on Attribute Selection Criterion Using Youden's Index and ROC Curve

Advisor: 陳正剛

Abstract


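The most common classification tree is CART, in which every parent node has two child nodes, each of which can be split further. Another classification tree is multi-layer discriminant analysis, whose structure differs from that of a traditional classification tree: each parent node has two or three child nodes, one of which holds the still-unclassified data while the others hold classified data, and the unclassified child node is then split on other attributes to form a new layer. Both methods usually use the Gini index as the criterion for deciding whether to split and which attribute to select. Because of the characteristics of the CART algorithm, it is not the most efficient classification method for some data types, and it may even overfit. Although multi-layer discriminant analysis using the Gini index makes up for the shortcomings of CART using the Gini index, it is still limited in some situations. Both methods remain deficient and do not necessarily produce the most efficient model. This study takes the specific two-class, two-attribute data types proposed by 賴淑利 as benchmark data types and examines the classification performance of CART and multi-layer discriminant analysis when different criteria, namely Youden's index and the area under the receiver operating characteristic curve (AUROC), are used for deciding whether to split and which attribute to select. The performance study of the classification trees on the benchmark data rests on the assumption that the benchmark sample size is extremely large, so it does not consider the effect of sample size on each tree's performance. This study therefore also verifies the theoretical results with simulated data and examines how each classification tree performs on the benchmark data at different sample sizes. The theoretical analysis shows that using Youden's index together with the receiver operating characteristic curve as the splitting and attribute selection criteria overcomes the shortcomings of multi-layer discriminant analysis, while using Youden's index alone as the criterion remedies some shortcomings of the traditional CART tree. When multi-layer discriminant analysis uses these criteria, its classification performance is also indirectly affected by sample size.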
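The two kinds of splitting criterion named in the abstract can be compared on a single attribute with a short sketch. This is only an illustration under stated assumptions, not the thesis's algorithm: the function names (`youden_index`, `gini_decrease`, `best_cutoff`) are hypothetical, labels are assumed to be 0/1 with both classes present, the rule is assumed to be "predict class 1 when the attribute exceeds the cutoff", and candidate cutoffs are taken as midpoints between consecutive distinct values.

```python
import numpy as np

def youden_index(x, y, cutoff):
    """Youden's J = sensitivity + specificity - 1 for the rule 'predict 1 if x > cutoff'.

    Assumes y holds 0/1 labels and that both classes are present.
    """
    pred = x > cutoff
    sensitivity = np.mean(pred[y == 1])   # true positive rate
    specificity = np.mean(~pred[y == 0])  # true negative rate
    return sensitivity + specificity - 1.0

def gini_impurity(y):
    """Gini impurity of a set of 0/1 labels: 2 p (1 - p)."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 2.0 * p * (1.0 - p)

def gini_decrease(x, y, cutoff):
    """Decrease in Gini impurity from splitting at 'x <= cutoff' versus 'x > cutoff'."""
    left, right = y[x <= cutoff], y[x > cutoff]
    weighted = (len(left) * gini_impurity(left)
                + len(right) * gini_impurity(right)) / len(y)
    return gini_impurity(y) - weighted

def best_cutoff(x, y, score):
    """Return (cutoff, score) maximizing `score` over midpoints of distinct x values."""
    xs = np.unique(x)
    candidates = (xs[:-1] + xs[1:]) / 2.0
    scores = [score(x, y, c) for c in candidates]
    i = int(np.argmax(scores))
    return candidates[i], scores[i]
```

Calling `best_cutoff(x, y, youden_index)` and `best_cutoff(x, y, gini_decrease)` on the same attribute shows whether the two criteria agree on the cutoff. The sketch fixes the direction of the rule, so an attribute whose class 1 values are lower would need its sign flipped first.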

Parallel Abstract


The Classification and Regression Tree (CART) is the most commonly used classification tree and consists of a hierarchy of decision nodes. Each decision node in CART can be split into only two child nodes. An alternative tree, the Multi-Layer Classifier (MLC), differs from this traditional structure in that its decision nodes are split into two or three child nodes. Among the child nodes in each layer of the MLC, one is an undetermined node and up to two others are classified nodes. The tree is then grown further by splitting the undetermined node into a new layer of two or three nodes until a stopping criterion is reached. Both CART and MLC use the Gini index as the criterion for selecting the attribute and cutoff point. It is generally believed that traditional classification trees such as CART can effectively classify certain types of data distributions. However, because of the criterion the algorithm uses to select the cutoff point and attribute, CART does not always classify data efficiently. In addition, overfitting is a major disadvantage of CART. Although the MLC based on the Gini index is more effective than CART, both Gini-based classifiers are limited in how effectively they classify certain data types. In this research, we introduce new criteria for attribute and cutoff point selection based on Youden's index and the ROC curve. We then discuss the performance of the classifiers built on the new criteria against the benchmark testing data (Lei, 2010). The theoretical discussion finds that MLC performs well when Youden's index and ROC-based indices are used as the criteria. Moreover, CART using Youden's index as the criterion performs much better than CART using the Gini index. The theoretical discussion assumes an infinite or very large sample size. However, the sample size is observed to affect attribute selection and thus classifier performance.
In this research, Monte Carlo simulation is also performed to test how the sample size affects classifier performance. Finally, some concluding remarks and recommendations are offered to users of classification trees.
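The effect of sample size on attribute selection can be illustrated with a small Monte Carlo sketch. It does not reproduce the thesis's benchmark data or its simulation design: the two normally distributed attributes, their class shifts of 1.0 and 0.7, and the function names `auroc` and `selection_rate` are all assumptions made for illustration. The AUROC is computed with the Mann-Whitney statistic, and each trial checks whether the attribute with the larger true separation also has the larger estimated AUROC.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the Mann-Whitney statistic: P(pos > neg) + 0.5 * P(pos == neg)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]  # all positive-versus-negative pairs
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

def selection_rate(n_per_class, n_trials=500, seed=0):
    """Fraction of trials in which the truly better attribute has the higher AUROC."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n_per_class)
    hits = 0
    for _ in range(n_trials):
        # Hypothetical data: attribute A shifts class 1 by 1.0, attribute B by 0.7.
        a = rng.normal(loc=1.0 * labels, scale=1.0)
        b = rng.normal(loc=0.7 * labels, scale=1.0)
        if auroc(a, labels) > auroc(b, labels):
            hits += 1
    return hits / n_trials

if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, selection_rate(n))
```

Under these assumptions, the rate at which the better attribute is selected should rise with the per-class sample size, which is the kind of sample-size dependence the abstract describes.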

References


Altman, D. G., Bland, J. M. (1994). Diagnostic tests. 1: Sensitivity and specificity. BMJ: British Medical Journal, 308(6943), 1552.
Breiman, L., Friedman, J., Olshen, R. A., Stone, C. J. (1984). Classification and regression trees. CRC Press.
Dodd, L. E., Pepe, M. S. (2003). Partial AUC estimation and regression. Biometrics, 59(3), 614-623.
Hanley, J. A., McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29-36.
McClish, D. K. (1989). Analyzing a portion of the ROC curve. Medical Decision Making, 9(3), 190-195.
