Combining MapReduce and Fuzzy Clustering for Big Data Analysis

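Big data has become one of the most popular topics in recent years. In this era of information explosion, data sets are often very large and difficult to compute on, and the diversity of the data itself makes analysis harder still. As technology advances, more and more methods are being proposed for studying big data, all with the aim of uncovering the information hidden behind it. Google's publication of its papers on MapReduce made big data analysis less difficult, since users can apply MapReduce to analyse big data. Cluster analysis is regarded as a good tool for extracting information from data. Its idea is to partition the data into several clusters according to the similarity of their features, so that data within the same cluster are highly similar while different clusters show clear differences in their features. When faced with massive data, cluster analysis may need much more processing time and may even exceed the computing capacity of a single machine. In that case, if clustering can be combined with the parallel computation of MapReduce, the range of data that cluster analysis can handle becomes much broader. This article discusses the use of MapReduce combined with fuzzy clustering for big data analysis.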

Keywords

Big data; MapReduce; fuzzy cluster analysis

Parallel Abstract

In this era of information explosion, data are collected easily and in many forms, so data sets become large and difficult to handle. Moreover, these data are so diverse that they cannot be analysed easily. In recent years, big data has become a popular topic. With the development of science and technology, more and more methods have been proposed for mining big data, with the goal of uncovering the information hidden behind it. Google's publication of its papers on MapReduce has made big data analysis less difficult, since users can take advantage of MapReduce to analyse big data. Cluster analysis is regarded as a good tool for obtaining information from data. It partitions a data set into clusters according to the similarity of the data's characteristics, so that data in the same cluster are highly similar while different clusters differ significantly in their characteristics. Although cluster analysis is a good tool for data analysis, it may take much longer to run, and may even exceed the computing capacity of a single machine, when it faces big data. In this case, if clustering can be combined with MapReduce, a parallel computing method, then clustering methods can handle such big data more easily. This article focuses on the use of MapReduce combined with fuzzy clustering as an analysis tool for big data.
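To make the idea of combining MapReduce with fuzzy clustering concrete, the following is a minimal sketch and not necessarily the method developed in this article. It assumes the standard fuzzy c-means algorithm and expresses one iteration as a map step (each data partition computes per-cluster partial sums independently) and a reduce step (partial sums are added and divided to give new cluster centers). The function names, the fuzzifier m = 2, and the use of in-memory lists in place of distributed partitions are all assumptions made for illustration.

```python
from functools import reduce


def memberships(x, centers, m=2.0):
    """Fuzzy c-means membership of point x in each cluster (sums to 1)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in d2):
        hits = [1.0 if d == 0.0 else 0.0 for d in d2]
        return [h / sum(hits) for h in hits]
    inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
    total = sum(inv)
    return [v / total for v in inv]


def map_partition(points, centers, m=2.0):
    """Map step: per-cluster partial sums for one partition of the data."""
    dim = len(centers[0])
    num = [[0.0] * dim for _ in centers]  # sum of u^m * x, per cluster
    den = [0.0] * len(centers)            # sum of u^m, per cluster
    for x in points:
        for j, u in enumerate(memberships(x, centers, m)):
            w = u ** m
            den[j] += w
            for k in range(dim):
                num[j][k] += w * x[k]
    return num, den


def combine(a, b):
    """Reduce step: add two partial results element-wise."""
    num = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [p + q for p, q in zip(a[1], b[1])]
    return num, den


def fcm_iteration(partitions, centers, m=2.0):
    """One fuzzy c-means center update written as map followed by reduce."""
    partials = map(lambda part: map_partition(part, centers, m), partitions)
    num, den = reduce(combine, partials)
    return [[v / d for v in row] for row, d in zip(num, den)]


def fuzzy_c_means(partitions, centers, m=2.0, iters=50, tol=1e-9):
    """Repeat the update until the centers stop moving or iters is reached."""
    for _ in range(iters):
        new = fcm_iteration(partitions, centers, m)
        shift = max(abs(a - b)
                    for r, s in zip(new, centers) for a, b in zip(r, s))
        centers = new
        if shift < tol:
            break
    return centers


if __name__ == "__main__":
    # Two partitions stand in for data blocks held on different machines.
    data = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 1.0), (10.0, 11.0)]]
    print(fuzzy_c_means(data, [[1.0, 1.0], [9.0, 9.0]]))
```

Because the map step returns only a few sums per cluster, the amount of data passed to the reduce step does not grow with the size of a partition. In a real MapReduce deployment each iteration would typically run as a separate job, with the current centers distributed to every mapper.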