基於Spark系統之排容原理關聯式規則演算法

大量資料的產生，使得資料探勘成為熱門的研究議題之一，其目的是將資料進行挖掘，從資料中提取出有用的資訊，而關聯式規則就是資料探勘的一種重要技術，它可以生成關聯式規則來標識出交易資料庫中的項目之間的關聯性。隨著科技進步，網際網路蓬勃發展，資料量也跟著不斷的增加，提高關聯式規則演算法運算能力以及效能也就有顯著的重要性。本論文是針對PIETM(Principle of Inclusion-Exclusion and Transaction Mapping)關聯式規則演算法進行改善以及利用平行化方式以提升演算法運算能力，並且基於Spark RDD環境下完成實作。PIETM演算法結合了Apriori演算法自下而上(Bottom up)簡單搜尋方式以及FP-Growth演算法在探勘過程對資料庫進行掃描只需兩次等優點，讓探勘過程中，提高其演算法的運算能力。過去有學者基於MapReduce環境下將PIETM演算法進行實作並且平行化(簡稱BMR-PIETM)，不過仍有部分程序可以加以改善，例如：減少交易區間列表中的區間數量、平行化方式獲得二項頻繁項目集，省略二項候選項目集以排容原理計算支持度、平行化方式生成三項之後的候選項目集等。本研究是使用Spark RDD環境將PIETM演算法實作並且平行化，所以取名為Based on Spark-PIETM(簡稱BS-PIETM)。實驗結果顯示，本論文所提出BS-PIETM演算法，有利於提升整體演算法運算能力，我們將在Spark RDD環境下BS-PIETM演算法與BMR-PIETM演算法進行比較，其結果也顯示出，BS-PIETM演算法在實驗方法下，其效能確實優於BMR-PIETM演算法。根據實驗結果顯示，BS-PIETM演算法在為了生成其聯集值進行所合併交易區間列表的階段，可能導致其執行時間花費較長，因此未來的研究就是改善此階段。關鍵詞：關聯式規則、資料探勘、Spark RDD、MapReduce、平行化

關鍵字

關聯式規則；資料探勘； Spark RDD ； MapReduce ；平行化

並列摘要

With the development of big data, data mining becomes an important technique to acquire the useful information among them. An association rule mining algorithm is an effective data mining technique to be used to obtain the relationships among data items. In this study, we propose a parallel association rules mining algorithm, BS-PIETM, based on Spark system to enhance the performance of BMR-PIETM. BMR-PIETM is a parallel mining algorithm based on Hadoop MapReduce framework. It combines the advantages of the simplicity of Apriori and efficiency of FP-growth algorithms to improve their mining performance. However, BMR-PIETM still needs a lot of time to classify and process big data for obtaining the useful information. There are three contributions in the study. First, a mechanism is designed to reduce the number of data items in the interval list and efficiently generate the frequent 2-itemset and candidate 3-itemset. Second, the mechanism is designed based on Spark system and can be used parallel implemented on a cluster of machines. Finally, a comprehensive environment that facilitates an experimental evaluation of BMR-PIETM and our BS-PIETM. The experimental results also show the higher performance of BS-PIETM. Furthermore, according to our experimental results, merging the interval lists may lead to BS-PIETM having the longer execution time. In the future work, improving the stage will reduce the execution time and result in higher parallel mining performance. Keywords: association rule, data mining, spark RDD, MapReduce, parallel mining algorithms

並列關鍵字

association rule ； data mining ； spark RDD ； MapReduce ； parallel mining algorithms

參考文獻

[1] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in ACM SIGOPS operating systems review, 2003, pp. 29-43.

Google Scholar

[2] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The hadoop distributed file system," in Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium on, 2010, pp. 1-10.