A Hybrid Fuzzy Clustering Algorithm Based on the Spark System

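The rise of big data has made data mining a popular field of scientific research. Extracting useful information from large and complex data sets depends on data mining techniques, which currently develop mainly around association rule analysis, cluster analysis, neural networks, decision trees, and regression analysis. The Fuzzy C-Means (FCM) clustering algorithm applies the notion of fuzzy membership from fuzzy theory to data points: each data point receives a membership degree for every cluster, and clustering then proceeds from these membership values and other characteristics. FCM is also more robust in its handling of outliers. Many studies have extended FCM, for example by adding new weight calculations to accommodate mixed-type data, or by using sampling to select random samples from big data when computing the initial cluster centers, although the results of the latter approach are easily affected by the random samples. Few existing studies handle big data and mixed-type data sets at the same time. This study extends FCM to handle mixed-type data and uses parallel computation to increase the algorithm's computational capacity. We rely on the parallel, distributed computing features of the Spark system to process the data in a distributed manner and keep it in memory for fast clustering, which improves performance and reduces computation time. Taking a distributed-parallel perspective, we rewrite and redesign FCM so that it performs better on mixed-type data. Experimental results show that the SPMIX-FCM fuzzy clustering algorithm proposed in this thesis outperforms traditional FCM on large data sets and mixed-type data.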

Keywords

Data mining; Spark RDD; MapReduce; parallelization; clustering algorithms; mixed-type data mining

Parallel Abstract

The emergence of big data has made data mining a popular area of scientific research. To extract useful information from large volumes of complex data, data mining techniques are essential. Existing techniques center on association rule analysis, cluster analysis, neural networks, decision trees, and regression analysis. The FCM (Fuzzy C-Means) fuzzy clustering algorithm draws on fuzzy theory to assign fuzzy membership degrees to data points, so that each data point has its own membership value for each cluster; the points are then grouped according to these values and other characteristics, and FCM is more robust to outliers. Many earlier studies have extended the FCM fuzzy clustering algorithm, for example by adding new weight calculation methods to adapt to mixed data, or by randomly sampling from big data to compute the initial cluster centers, but the results of such sampling are susceptible to the choice of random samples. Little current research can cope with big data and mixed data sets at the same time. This study extends the FCM fuzzy clustering algorithm to handle mixed data and uses parallel computing to improve its computing power. We use the parallel, distributed operation of the Spark system to process the data in a distributed manner and hold it in memory for fast clustering, thereby improving the performance of the algorithm and reducing the computation time. Our research improves the computational efficiency of the FCM fuzzy clustering algorithm from the perspective of distributed parallelism: we rewrite and redesign the algorithm to improve its performance on mixed data. The experimental results show that the SPMIX-FCM fuzzy clustering algorithm proposed in this thesis outperforms the traditional FCM fuzzy clustering algorithm on large data sets and mixed data.

Parallel Keywords

Data mining; Spark RDD; MapReduce; Parallelization; Clustering algorithms; Hybrid data mining
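The abstract gives no implementation details for SPMIX-FCM, so the following is only a minimal sketch of the standard FCM update for numeric data, arranged in the map/reduce style the abstract alludes to. It is not the thesis's algorithm and does not handle categorical attributes. Plain Python lists stand in for Spark partitions, and every function name (`memberships`, `partition_stats`, `merge_stats`, `fcm`) as well as the fuzzifier default `m=2.0` is an assumption made for illustration.

```python
from functools import reduce


def memberships(x, centers, m=2.0):
    """Standard FCM membership of point x in each cluster (values sum to 1)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    zeros = [i for i, d in enumerate(d2) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    power = 1.0 / (m - 1.0)
    inv = [(1.0 / d) ** power for d in d2]
    total = sum(inv)
    return [v / total for v in inv]


def partition_stats(points, centers, m=2.0):
    """Map step: weighted sums for one partition (hypothetical helper)."""
    dim = len(centers[0])
    num = [[0.0] * dim for _ in centers]  # sum of u^m * x per cluster
    den = [0.0] * len(centers)            # sum of u^m per cluster
    for x in points:
        for j, u in enumerate(memberships(x, centers, m)):
            w = u ** m
            den[j] += w
            for k in range(dim):
                num[j][k] += w * x[k]
    return num, den


def merge_stats(a, b):
    """Reduce step: element-wise addition of two partial results."""
    num = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [p + q for p, q in zip(a[1], b[1])]
    return num, den


def fcm(partitions, centers, m=2.0, iterations=20):
    """Repeat the map/reduce center update a fixed number of times."""
    for _ in range(iterations):
        partials = (partition_stats(p, centers, m) for p in partitions)
        num, den = reduce(merge_stats, partials)
        # Keep the old center if a cluster received no weight at all.
        centers = [[v / d for v in row] if d > 0.0 else old
                   for row, d, old in zip(num, den, centers)]
    return centers


if __name__ == "__main__":
    # Two toy "partitions" with two well-separated groups of points.
    parts = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
    print(fcm(parts, [[1.0, 1.0], [9.0, 9.0]]))
```

Because each partition only contributes per-cluster sums, the partial results can be combined in any order. That property is what would let the same two steps map onto a distributed aggregation in Spark, under the assumption that the centers are shared with every worker on each iteration.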

