
Improving the Performance of a Hybrid Fuzzy Clustering Algorithm Using a Weighted Method in a Parallel System

Advisor: 張翠蘋

Abstract


Big-data analysis has emerged with the rapid growth of information exchange in modern society and brings enormous business opportunities and diverse applications. Big data is mostly presented as hybrid datasets, and the meanings expressed across datasets are not necessarily consistent. A hybrid dataset [1] is a mixture of continuous and discrete data, usually composed of both non-numeric and numeric attributes. How to process such hybrid datasets promptly, extract useful and effective information from them, and thereby strengthen market competitiveness has become an important issue.

In current clustering research, hybrid data are mostly handled by processing the numeric and categorical parts of a dataset separately, which requires two or more computation and similarity-comparison methods and increases overall complexity. Adding weights is another way to improve clustering performance, but the weight values are usually supplied by experts in the data's domain, so every clustering task must first have a domain expert collect and organize the data. Another approach computes weights for every field, but because numeric and categorical data follow different similarity standards in a hybrid dataset, a single weighting scheme cannot handle both; the weights of categorical data should therefore be computed independently to genuinely improve clustering performance. As for comparing similarity between data, traditional clustering mostly computes the distance of each field of a data point and sums these into a total distance, yet the difference between categorical values is not a purely numeric calculation but involves comparing the similarity of non-numeric categorical data.

In light of the above, this thesis uses a Categorical Transform Dictionary (CTD) to convert categorical data into numeric data, and takes the standard deviation of the per-field distances between a data point and a cluster center as the measure of dissimilarity. Moreover, in today's big-data environment, computing the distance between every cluster center and every data point becomes increasingly expensive as the data grow, so many recent studies use parallelization to greatly reduce the time cost of their algorithms. This thesis proposes performing parallel fuzzy clustering of hybrid big data on the Apache Spark cluster-computing framework. First, the information corresponding to text-type data is stored as a dictionary that maps non-numeric data to numeric data. High-performance parallel preprocessing then converts the raw data into numeric form according to the dictionary and computes the field weights in advance, while the traditional similarity comparison between data is replaced by the standard deviation of the distances between data as the measure of difference. Spark's parallel and distributed computing environment greatly reduces the time needed to build the dictionary and compute the weights, and the standard-deviation-based comparison is applied to hybrid datasets in the clustering experiments, as sketched below.

The experimental comparison shows that, with the CTD conversion of categorical data into numeric form, the weighting of categorical data, and the improved dissimilarity comparison, the proposed hybrid clustering method is both faster and better than traditional clustering algorithms. This is mainly because the conversion and clustering computations are parallelized and the whole procedure operates on numeric values, which lowers the overall time cost, and because the distance computation between data departs from the traditional summation of distances: this thesis uses the standard deviation of the per-field distances instead.
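The abstract does not spell out the exact CTD mapping or weighting formula, so the following is a minimal illustrative sketch in Python. It assumes the CTD simply assigns each distinct categorical value in a column an integer code, and that dissimilarity is the standard deviation of the weighted per-field distances between a point and a cluster center; all names, the sample data, and the weight values are hypothetical.

```python
import numpy as np

def build_ctd(records, categorical_cols):
    """Build a Categorical Transform Dictionary (CTD): for each categorical
    column, map every distinct value to a numeric code (illustrative scheme)."""
    ctd = {}
    for col in categorical_cols:
        values = sorted({r[col] for r in records})
        ctd[col] = {v: float(i) for i, v in enumerate(values)}
    return ctd

def to_numeric(record, ctd):
    """Convert one hybrid record into a purely numeric vector using the CTD."""
    return np.array([ctd[i][v] if i in ctd else float(v)
                     for i, v in enumerate(record)])

def dissimilarity(x, center, weights):
    """Dissimilarity in the spirit of the thesis: the standard deviation of the
    weighted per-field distances between a data point and a cluster center,
    instead of the traditional sum of distances."""
    field_dist = weights * np.abs(x - center)
    return float(np.std(field_dist))

# Tiny illustrative hybrid dataset: (age, color, size)
records = [(25, "red", "L"), (31, "blue", "M"), (28, "red", "S")]
ctd = build_ctd(records, categorical_cols=[1, 2])
data = np.array([to_numeric(r, ctd) for r in records])

weights = np.array([1.0, 2.0, 2.0])   # heavier weights on categorical fields (assumed)
center = data.mean(axis=0)            # a provisional cluster center
print([dissimilarity(x, center, weights) for x in data])
```

Treating the categorical weights separately from the numeric ones, as the abstract argues, would only change how the `weights` vector is computed; the standard-deviation comparison itself stays the same.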

Abstract (English)


Due to the rapid development of information technology, the amount of digital data in business applications is growing quickly, and big-data analysis has therefore become an important issue. Big-data analysis offers various business benefits, including new revenue opportunities and more effective marketing strategies. However, most big data in business applications is presented as hybrid datasets that contain different data types, i.e., numeric and text data. Hybrid datasets demand extra effort from big-data analysts, because a hybrid dataset is a mixture of continuous and discrete data composed of both non-numeric and numeric attributes. How to quickly obtain useful and effective information from such data is an important issue. Current clustering studies handle hybrid datasets by separating them into numeric and non-numeric parts, so two or more computation methods are needed to compare similarity within the numeric and non-numeric parts separately. These additional similarity computations increase the time complexity of the resulting hybrid-data clustering algorithms. On the other hand, using weights is one way to improve the performance of clustering analysis. In general, to obtain useful information, weight values are added to the data by domain experts before clustering to emphasize its importance. However, numeric and non-numeric data require different weight values, which leads to complex similarity-evaluation standards during clustering analysis. This paper proposes an Apache Spark clustering framework for hybrid big-data analysis. First, non-numeric data is transformed into numeric data by analyzing the features of the non-numeric values; parallel processing is used in this stage to speed up the conversion. Second, weight values are applied in the proposed framework to obtain good clustering results for hybrid datasets. Finally, experiments demonstrate the high performance of the proposed framework.
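To give a concrete, purely illustrative idea of the parallel step, the sketch below distributes CTD-converted numeric records as a Spark RDD and computes fuzzy c-means memberships against broadcast cluster centers in parallel, reusing the standard-deviation dissimilarity from the previous sketch. The fuzzifier m, the field weights, the random data, and the overall structure are assumptions for illustration and not the thesis's actual implementation.

```python
import numpy as np
from pyspark import SparkContext

def dissimilarity(x, center, weights):
    # Standard deviation of weighted per-field distances (see earlier sketch).
    return float(np.std(weights * np.abs(x - center)))

def memberships(x, centers, weights, m=2.0):
    # Classic fuzzy c-means membership formula, with the SD-based dissimilarity
    # plugged in as the distance measure (an assumption for illustration).
    d = np.array([max(dissimilarity(x, c, weights), 1e-12) for c in centers])
    ratio = (d[:, None] / d[None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)

if __name__ == "__main__":
    sc = SparkContext("local[*]", "hybrid-fuzzy-clustering-sketch")

    # Already CTD-converted, numeric records (illustrative random values).
    data = np.random.rand(1000, 3)
    weights = np.array([1.0, 2.0, 2.0])   # assumed field weights
    centers = data[np.random.choice(len(data), 2, replace=False)]

    rdd = sc.parallelize(data.tolist()).map(np.array).cache()
    b_centers = sc.broadcast(centers)
    b_weights = sc.broadcast(weights)

    # Parallel membership computation: each worker handles its partition's points.
    u = rdd.map(lambda x: memberships(x, b_centers.value, b_weights.value)).collect()
    print(np.array(u)[:5])
    sc.stop()
```

Broadcasting the centers and weights keeps them read-only on every worker, so only the per-point membership computation is distributed, which matches the abstract's point that the point-to-center distance computations dominate the cost as the data grow.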

References


English References
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 29-43, 2003.
[3] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proc. 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010, pp. 1-10.
[4] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[5] Apache Hadoop. Available: http://hadoop.apache.org
