  • Thesis

多層混合分類樹研究及其腫瘤診斷之應用

Study of Multi-layer Hybrid Classification Tree with Applications to Cancer Diagnosis

Advisor: 陳正剛

Abstract


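Classification trees are widely used in data mining to classify data of interest and are applied to machine learning in fields such as medicine and engineering. They fall into two main categories: classification and regression trees (CART) and multivariate classification trees. CART is commonly used to build binary classification trees and generally uses the Gini index as its splitting criterion. Multi-layer discriminant analysis differs from CART in that each node to be split in a layer is divided into two or three nodes, one of which may hold unclassified data. The unclassified node can be split further on other attributes to open a new layer, while nodes whose class has been determined are not split again. Because classification tree models combined with Fisher linear discriminant analysis (FLD) do not necessarily improve classification performance in medical data mining (such as tumor diagnosis), this thesis attempts to construct a more effective algorithm and validates it on real cases.

In constructing the model, this study first introduces a parameter to adjust the proportion of Fisher linear combination attribute schemes. In addition, the theoretical discussion by Lai (賴淑俐, 2010) found that multi-layer discriminant analysis and CART can compensate for each other's weaknesses, so this study introduces a second parameter to adjust the relative weight of multi-layer discriminant analysis and CART. When a node enters the algorithm, the first parameter and the multi-layer combined attribute scheme determine whether the Fisher linear combination attribute scheme is needed and, if so, the number of features. The second parameter and the non-parametric receiver operating characteristic (NP-ROC) then determine the node and its splitting scheme, that is, whether the node is split into two CART nodes or into two or three multi-layer discriminant analysis nodes.

To validate the model, this study tests it on 366 breast tumor cases provided by National Taiwan University Hospital (臺大醫院), of which 266 serve as training samples for selecting and training the parameters and 100 are fixed as an independent test sample. The multi-layer hybrid classification tree is compared with single classification trees built by CART, multi-layer discriminant analysis, and enhanced multi-layer discriminant analysis in terms of discrimination results, and with the adaptive multi-phase ensemble (莊曙詮, 2012) in terms of BI-RADS grading results, to verify the model's performance.

The case validation results show that the new algorithm's classification performance is indeed better than that of the other methods, and that it significantly increases the number of benign cases in BI-RADS 3 of the adaptive multi-phase ensemble while keeping the malignant proportion within an acceptable range.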
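For readers unfamiliar with the Gini index used by CART as a splitting criterion, the following is a minimal sketch of how a single binary split on one numeric attribute can be chosen. It is not the thesis's implementation; the function names `gini` and `best_threshold` and the toy data in the usage note are illustrative assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a label list: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Search one numeric attribute for the binary split (x <= t) that
    minimizes the size-weighted Gini impurity of the two child nodes.

    Returns (threshold, weighted_gini). The threshold is None when no
    candidate split improves on the impurity of the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        # Only midpoints between distinct consecutive values are candidates.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best
```

With hypothetical values `[1, 2, 3, 4]` and labels `["benign", "benign", "malignant", "malignant"]`, the search selects the midpoint between 2 and 3, where both child nodes are pure.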

Parallel Abstract


The classification tree is a widely used classification tool in data mining and machine learning, with applications in medicine and engineering. There are two main types of classification trees: CART and multivariate classification trees. CART is commonly used and builds a hierarchical tree of decision nodes. The Multi-layer Classifier, proposed by Wu (2009), differs from CART in that each layer consists of two or three nodes, of which only the node holding unclassified data is split further into the next layer, while the remaining nodes contain completely classified data. Tree construction continues until a stopping criterion is reached. However, combining the Multi-layer Classifier or CART with Fisher Linear Discriminant analysis (FLD) may not improve classification performance in medical applications such as tumor diagnosis. Hence, this thesis aims to construct a more effective Multi-layer Hybrid Classification Tree and uses empirical data to validate its performance. In modeling the tree structure, this study first introduces a parameter that adjusts the proportion of nodes constructed by FLD. In addition, according to the theoretical discussion by Lai (2010), the Multi-layer Classifier and CART can compensate for each other's weaknesses. Therefore, this study introduces a second parameter that adjusts the likelihood that the data in each tree layer are classified by the Multi-layer decision or by the CART decision. When a node is to be split, the algorithm first decides whether to apply FLD based on the value of the first parameter. It then decides, based on the value of the second parameter, whether to split the node into two nodes with the CART decision or into three (or two) nodes with the Multi-layer decision.
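As background for the FLD step mentioned above, here is a minimal sketch of the classical two-class Fisher direction, w proportional to S_w^{-1}(m1 - m0), restricted to two features so that it needs only the standard library. It illustrates the general technique, not the thesis's algorithm; all function names and the sample points in the usage note are assumptions made for illustration.

```python
def mean2(rows):
    """Mean vector of a list of 2-feature rows."""
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)


def scatter2(rows, m):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix sum (x - m)(x - m)^T."""
    sxx = sum((r[0] - m[0]) ** 2 for r in rows)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    syy = sum((r[1] - m[1]) ** 2 for r in rows)
    return sxx, sxy, syy


def fld_direction(class0, class1):
    """Fisher direction w = S_w^{-1} (m1 - m0) for exactly two features."""
    m0, m1 = mean2(class0), mean2(class1)
    a0, b0, d0 = scatter2(class0, m0)
    a1, b1, d1 = scatter2(class1, m1)
    # Within-class scatter S_w = [[a, b], [b, d]] pooled over both classes.
    a, b, d = a0 + a1, b0 + b1, d0 + d1
    det = a * d - b * b
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # The inverse of [[a, b], [b, d]] is (1 / det) * [[d, -b], [-b, a]].
    return ((d * dx - b * dy) / det, (-b * dx + a * dy) / det)


def project(w, row):
    """Combined attribute: projection of one row onto the direction w."""
    return w[0] * row[0] + w[1] * row[1]
```

For two hypothetical clusters such as `[(1, 2), (2, 1), (2, 3), (3, 2)]` and the same points shifted by 4 along the first feature, the direction points along the first feature, and the projected values of the two classes do not overlap. A tree node could then be split on the projected value instead of on a single raw attribute.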
To verify the performance of the proposed model, this study tests the proposed tree on 366 breast cancer cases provided by National Taiwan University Hospital (NTUH), 266 of which are used as training samples for selecting and training the parameters, while the other 100 are held out as an independent test sample. We compare the proposed Multi-layer Hybrid Classifier with CART, the Multi-layer Classification Tree (ML-ROC), and the Enhanced Multi-layer Classification Tree (Enhanced-ML-ROC) proposed by Lai (2010), based on single-tree performance and on the BI-RADS results generated by the Adaptive Multi-phase Ensemble (Chuang, 2012). The verification results show that the classification performance of the proposed algorithm is indeed superior to that of the other methods. The BI-RADS results show that it not only increases the number of benign cases in BI-RADS 3 by an observable margin, but also keeps the number of malignant cases in BI-RADS 3 within an acceptable range.
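The ROC-based tree names above (ML-ROC, Enhanced-ML-ROC) rest on ROC analysis. As a hedged illustration only, the sketch below computes the nonparametric (empirical) area under the ROC curve as the Mann-Whitney statistic. How the thesis applies ROC analysis inside the tree is not stated in this abstract, and the function name `empirical_auc` and the scores in the usage note are assumptions.

```python
def empirical_auc(benign_scores, malignant_scores):
    """Nonparametric AUC: the fraction of (malignant, benign) score pairs in
    which the malignant case scores higher, with ties counted as one half."""
    if not benign_scores or not malignant_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for m in malignant_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(benign_scores) * len(malignant_scores))
```

For hypothetical benign scores `[0.1, 0.4, 0.35]` and malignant scores `[0.8, 0.35, 0.9]`, 7.5 of the 9 pairs favor the malignant case, giving an AUC of 7.5 / 9. The pairwise loop is quadratic in the number of cases, which is acceptable for a sample of a few hundred cases such as the 366 described above.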

References


Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. CRC Press.
Budescu, D. V. (1993). Dominance analysis: A new approach to the problem of relative importance of predictors in multiple regression. Psychological Bulletin, 114(3), 542.
Chang, K. J., Chen, W. H., Chen, A., Chen, C. N., Ho, M. C., Tai, H. C., ... Wu, H. J. (2013). U.S. Patent No. 8,572,006. Washington, DC: U.S. Patent and Trademark Office.
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 837-845.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188. Also in Contributions to Mathematical Statistics (1950).
