Design of a Parallel Binary Classification Algorithm Based on a Hadoop MapReduce Cluster

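Now that a single-machine environment can no longer analyze large volumes of data efficiently, the storage and analysis capabilities of the Hadoop computing platform are clearly important. The application of data mining algorithms is a key part of large-scale data analysis. To address the excessive time complexity of the binary classification algorithm SVM, this study improves a binary classification algorithm so that data can be filtered and classified faster within a distributed parallel computing framework. The algorithm is implemented using the parallel processing capabilities of the MapReduce programming framework and runs successfully on the Hadoop computing platform. When the same data set is used for training and analysis, execution time is reduced substantially.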

Keywords

Data mining; classification; SVM; binary classification; Hadoop; MapReduce

Parallel Abstract

As the amount of data grows, it is hard to analyze large data sets efficiently on a single computer. A Hadoop cluster is therefore important because it can both store and analyze large data sets. Data mining plays an important role in data analysis. Because the time complexity of the binary-class SVM classification algorithm is a major issue, we design a parallel binary SVM algorithm to solve this problem and to speed up the filtering and classification of data. Leveraging the parallel processing of MapReduce, we implement a multi-layer binary SVM in the MapReduce framework and run it successfully on a Hadoop cluster. Experiments with different Hadoop cluster parameter settings, using the same data set for training and analysis, show that the new algorithm reduces computation time significantly.

Parallel Keywords

Data Mining; Classification; SVM; Binary-class classification; Hadoop; MapReduce
