A Study of a Distributed Sequential Pattern Mining Method Based on the MapReduce Programming Framework

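Sequential pattern mining is a data mining method for obtaining frequent sequential patterns from massive sequence databases. Common sequential pattern mining methods fall into two categories: candidate-generation methods and pattern-growth methods. These algorithms mainly run in a single-machine environment, which leads to drawbacks such as long scanning times on massive data, limited scalability, and low efficiency on massive datasets. To improve the performance of sequential pattern mining and address the scalability problem, this study proposes a sequential pattern mining method based on the Hadoop platform and the MapReduce software framework. The mining task is decomposed into many distributed tasks: the Map function mines all sequential patterns in the dataset, and the Reduce function then merges all the patterns that were found. This reduces the search space and yields higher mining performance. In this study we also examine more closely the influence of the user-specified minimum support. Our experiments show that the minimum support should be set differently in the Map and Reduce phases of the mining process; otherwise frequent patterns may be lost.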

Keywords

Hadoop ； MapReduce ； Sequential Pattern ； Data Mining

Parallel Abstract

Sequential pattern mining is a data mining method for obtaining frequent sequential patterns from a large sequence database. Conventional sequential pattern mining methods can be divided into two categories: Apriori-like methods and pattern-growth methods. These algorithms are mainly executed in a standalone environment, which has disadvantages such as long database scanning times, limited scalability, and low efficiency on massive datasets. To improve the performance of sequential pattern mining and to address the scalability issues, this study presents a distributed sequential pattern mining method based on the Hadoop platform and the MapReduce programming model. The mining task is decomposed into many distributed tasks: the Map function mines the sequential patterns in a subset of the database, and the Reduce function then merges all of the identified patterns. This reduces the search space and achieves higher mining efficiency. In this study, we further discuss how the user-specified minimum support threshold influences the distributed mining process. Our experiments show that the threshold should be set differently in the Map and Reduce phases to prevent the loss of some frequent patterns.
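The Map/Reduce split and the threshold effect described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the method of the study: it runs in a single process rather than on Hadoop, each sequence is simplified to a list of single items, and the Map step uses brute-force subsequence counting where the study would use an actual mining algorithm. The function names `local_patterns`, `merge_patterns`, and `mine`, the `max_len` limit, and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_patterns(partition, min_count, max_len=2):
    """Map step (sketch): count subsequences of up to max_len items in one
    partition and emit those whose local support count reaches min_count."""
    counts = Counter()
    for seq in partition:
        seen = set()
        for k in range(1, max_len + 1):
            # combinations() preserves item order, so each tuple is a subsequence
            seen.update(combinations(seq, k))
        counts.update(seen)  # a sequence supports a pattern at most once
    return {p: c for p, c in counts.items() if c >= min_count}


def merge_patterns(local_results, min_count):
    """Reduce step (sketch): sum the emitted local counts per pattern and
    keep the patterns whose summed count reaches min_count."""
    total = Counter()
    for result in local_results:
        total.update(result)  # adds counts key by key
    return {p: c for p, c in total.items() if c >= min_count}


def mine(partitions, map_min_count, reduce_min_count):
    """Run the Map step on every partition, then one Reduce step."""
    local = [local_patterns(p, map_min_count) for p in partitions]
    return merge_patterns(local, reduce_min_count)


# Toy database of six sequences split into two partitions (assumed data).
partitions = [
    [["a", "b", "c"], ["a", "b"], ["a", "c"]],
    [["a", "b"], ["b", "c"], ["c", "a"]],
]

# The pattern <a, b> occurs in 3 of the 6 sequences overall.
# Applying a strict threshold in the Map step hides its single occurrence
# in the second partition, so the Reduce step sees a count of only 2.
strict = mine(partitions, map_min_count=2, reduce_min_count=3)

# A lower Map threshold lets every local count reach the Reduce step.
relaxed = mine(partitions, map_min_count=1, reduce_min_count=3)

print(("a", "b") in strict)
print(relaxed[("a", "b")])
```

In this toy run the pattern `("a", "b")` is dropped when the Map step uses the stricter count and is kept once the Map threshold is lowered, which is one way a frequent pattern can be lost when the two phases are not configured separately.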

Parallel Keywords

Hadoop ； MapReduce ； Sequential Pattern ； Data Mining

