
A Modified Heuristic Method to Construct the Binary Decision Tree of Nominal Attributes

Abstract


With the rapid advance of information technology, the scale at which data are stored and processed far exceeds that of the past. How to extract useful information from massive amounts of data for decision makers has long been a central concern of data mining. Because decision trees are easy to compute and produce clear rules, they have become one of the most commonly used classification techniques in data mining. However, when the data set is large and a nominal attribute has a great many distinct values, letting every value form its own branch produces so many branches that the extracted rules become too complex to interpret, and processing efficiency also suffers. This paper develops a method for simplifying decision trees that performs a binary partition of the nominal attributes in a database, splitting the data into two branches to eliminate excessive and unnecessary branches of the tree. The proposed approach uses the first principal component of principal component analysis, which accounts for most of the variance, and takes the mean of the standardized component scores on that component as the threshold for the binary partition of the attribute values, removing excessive value branches so that the explicit knowledge of the tree is easy to interpret. Finally, four data sets from the UCI repository are used as test samples; the results show that the proposed method performs well in both tree simplification and classification accuracy.
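To make the partitioning step concrete, the following is a minimal sketch rather than the authors' implementation: it assumes each attribute value is summarized by its relative class frequencies, uses scikit-learn's PCA and StandardScaler, and cuts the standardized first-component scores at their mean. The function name `binary_partition` and the profile construction via `pd.crosstab` are illustrative choices, not details taken from the paper.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def binary_partition(df, attribute, target):
    """Return two disjoint sets of values of a nominal attribute (left/right branch)."""
    # One row per distinct attribute value: its relative class frequencies.
    profiles = pd.crosstab(df[attribute], df[target], normalize="index")

    # The first principal component of these value profiles captures most of
    # the variation among the attribute values.
    scores = PCA(n_components=1).fit_transform(profiles.values)

    # Standardize the component scores; their mean (zero after standardizing)
    # serves as the cut point for the binary partition of the values.
    z = StandardScaler().fit_transform(scores).ravel()
    threshold = z.mean()

    left = set(profiles.index[z <= threshold])
    right = set(profiles.index[z > threshold])
    return left, right
```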

Parallel Abstract


The ability to extract useful information from a large-scale database to aid decision-making is critical in data mining. Classification is an important problem in data mining and has been studied extensively as a possible solution to knowledge acquisition. The decision tree has become one of the most commonly used techniques for classifying data because the algorithm for generating a decision tree can be easily implemented. However, when a nominal attribute has too many distinct values at a node, the branches of the tree become numerous and complicated, and the effectiveness of processing a large data set may be compromised. This paper proposes a heuristic method that simplifies the decision tree by splitting each nominal attribute into two branches. We adopt principal component analysis to derive a partition strategy that reduces unnecessary branches of the decision tree. Since the first principal component represents most of the variance, its standardized component scores for the attribute values are used as the thresholds for splitting examples. The decision tree is thus simplified to a binary tree, so that the explicit knowledge of the tree can be easily extracted. We also compare the proposed method against other heuristic methods and analyze experimental results on four UCI data sets.
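To connect the partition to ordinary top-down tree induction, a hedged sketch of how the resulting two-way split might be scored is shown below. The information-gain criterion and the helper names (`entropy`, `binary_split_gain`) are assumptions made for illustration only, since the abstract does not state which split measure the paper uses.

```python
import numpy as np
import pandas as pd

def entropy(y):
    """Shannon entropy (in bits) of a pandas Series of class labels."""
    p = y.value_counts(normalize=True).values
    return float(-np.sum(p * np.log2(p)))

def binary_split_gain(df, attribute, target, left_values):
    """Information gain of splitting `attribute` into left_values vs. the rest."""
    mask = df[attribute].isin(left_values)
    left, right = df[mask], df[~mask]
    weighted = (len(left) * entropy(left[target])
                + len(right) * entropy(right[target])) / len(df)
    return entropy(df[target]) - weighted

# Hypothetical usage, together with the binary_partition sketch above:
#   left, right = binary_partition(data, "native-country", "class")
#   gain = binary_split_gain(data, "native-country", "class", left)
```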


Cited By


高棋楠 (2012). 資料探勘技術建構公司財務預警模式之研究 [A study of constructing corporate financial early-warning models using data mining techniques] (Master's thesis, National Chung Cheng University). Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0033-2110201613500091
