A Parallelized CHAID Decision Tree for Big Data

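With advances in technology, the era of big data has arrived. As data volumes grow rapidly, improving computing speed has become an important area of development. Shortening the time needed to process and analyze data allows predictions and decisions to be made earlier, and parallel processing is one way to reduce analysis time. This study combines the decision tree, a method widely used in data mining, with parallel computation. We rewrote the merging and variable-selection steps of the CHAID decision tree to use multi-core computation, which shortens the time needed to build the tree. Simulation results show that when more than one CPU core is used, the computation time of the CHAID decision tree is markedly shorter than in the single-core case. The time savings become more pronounced as the data volume grows.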

Keywords

data mining; classifier; CHAID decision tree; parallelization

Parallel Abstract

As technology advances, the era of big data has arrived. As the amount of data increases, improving computing speed becomes an important area of development. If training and analysis time is reduced, predictions or decisions can be made earlier than would otherwise be possible. Parallel computation is one method for reducing analysis time. In this paper, we rewrite the CHAID decision tree algorithm for parallel computation and big-data capability. Our simulation results show that when the CPU has more than one core, the computation time of our improved CHAID tree is significantly reduced. With larger amounts of data, the difference in computation time is even more pronounced.

Parallel Keywords

data mining; classifiers; parallel; CHAID


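Cited By

陳毓倫 (2016). Analyzing and Predicting the Relationship Between Online Buzz and Sales Volume of Mobile Phones Using Big Data Techniques [Master's thesis, Tamkang University]. Airiti Library. https://doi.org/10.6846/TKU.2016.01012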
