基於Hadoop叢集提升雲端運算之關聯式規則資料探勘技術

關聯式資料探勘讓人們從不起眼的數據獲得有用的資訊。然而，現今數據的產量急速增加，對於探勘的演算法也需要有更好的資料處理能力。本論文針對PIETM (Principle of Inclusion-Exclusion and Transaction Mapping)關聯式資料探勘演算法提出完全平行化的方法來提升演算法運算能力，並且基於Hadoop MapReduce框架完成實作。PIETM演算法結合了Apriori與FP-growth兩演算法的優點，像是由下而上(Bottom-Up)的簡單搜尋方式以及整體探勘過程只需掃描兩次資料庫等優點。過去，有學者基於Hadoop MapReduce框架將PIETM平行化(簡稱BMR-PIETM)，不過仍有部分處理程序需要改進，例如：產生二階候選項目集、建立交易樹、降低交易區間表占用記憶體的空間。本研究以MapReduce框架及資料本身的結構特性提出了解決上述之問題的方法，特別是透過「間隔標籤(ITag)」將建立交易樹的過程平行化。另外也透過「間隔標籤(ITag)」將交易區間表的大小縮小。由於本研究完全的使用MapReduce框架將PIETM完全平行化，所以將此平行化之PIETM稱為Parallel-PIETM(簡稱P-PIETM)。本研究之解決方法，除了適用在Apriori演算法也適用在FP-growth演算法外，實驗結果顯示，P-PIETM演算法的效能可以比BMR-PIETM演算法以及Apriori演算法更好，效能提升最高可達66.17%。

關鍵字

關聯式規則；資料探勘； MapReduce ；平行化

並列摘要

We can get useful and valuable information from insignificant data through data mining and gain huge benefit from professional analysis. However, it is important to improve the performance of data mining for Big Data processing. The purpose of this study is to improve the performance of parallel association-rule mining algorithm of PIETM (Principle of Inclusion- Exclusion and Transaction Mapping) under the MapReduce framework. PIETM is arranging transaction data in database into a tree structure which is called Transaction tree (T-tree), and then transform T-tree into Transaction Interval tree (TI-tree). And use principle of Inclusion- Exclusion according to TI-tree to calculate all frequent itemsets. PIETM combines the benefits of Apriori and FP-growth algorithms and only needs to scan the database twice in data mining. However, we still need to improve some procedures, for example, construct a transaction tree and generate candidate k-item sets. For the two problems described above, we provide a solution respectively. These two solutions adopted the FP-growth and Apriori algorithms respectively.

並列關鍵字

association rule ； data mining ； MapReduce ； parallel

參考文獻

[2] S. Y. Chen, J. H. Li, K. C. Lin, H. M. Chen and T. S. Chen, "Using MapReduce Framework for Mining Association Rules".

[4] S. Ghemawat, H. Gobioff. and S. T. Leung, “The Google File System,” in Proceedings of the nineteenth ACM symposium on Operating systems principles, Pages 29-43, 2003.

[6] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach," in Data Mining Knowledge Discovery, Volume 8, Issue 1,Pages 53-83, 2004.

[7] K. F. Jea, M. Y. Chang, and K. C. Lin, " An Efficient and Flexible Algorithm for Online Mining of Large Itemset," in Information Processing Letters, Volume 92, Issue 6, Pages 311-316, 2004.

[8] K. C. Lin, I. E. Liao, S. F. Lin, and T. P. Chang, "A Frequent Itemset Mining Algorithm based on the Principle of Inclusion-Exclusion and Transaction Mapping," in Information Sciences, revised.

國際替代計量

基於Hadoop叢集提升雲端運算之關聯式規則資料探勘技術

全文下載

主題瀏覽