多維度資料庫中頻繁數值樣式之分散式探勘

已有許多方法能從符號式資料庫中探勘頻繁樣式，對於數值資料的處理，則先利用分界點將數值資料轉換成符號式的資料，然後再利用之前所提出的方法，從轉換後的符號式資料中探勘頻繁樣式。然而分界點選定的不同，常會造成樣式的不同。因此，最好發展一個方法直接從數值資料庫中探勘數值樣式。從資料庫中探勘數值樣式，是個需要大量運算的工作；而雲端運算的架構正好適合此類運算密集的應用。就我們所知，目前尚未有人提出在雲端架構上探勘數值樣式的方法。因此，在本篇論文中，我們提出一個有效率的演算法，利用MapReduce架構從多維度的數值資料庫中，探勘頻繁數值樣式。我們所提出的方法，共包含三個MapReduce的工作。首先，每個運算實體分頭掃描資料庫，以找出所有長度為一的頻繁樣式。接著，每個運算實體掃描每個長度為一的頻繁樣式的投影資料庫，以取得有助於負載平衡規劃的資訊，並將探勘的工作平均分配給各個運算實體。最後，每個運算實體以深度優先的方式遞迴產生所有頻繁樣式。在探勘過程中，我們利用兩個加速策略來產生相近份額的諸多任務，以便得到平衡的負載，使得各個運算實體能夠平行而互不干擾的進行探勘工作。因此，我們所提出的方法能夠有效率地從多維度資料庫中探勘出頻繁數值樣式。根據實驗結果顯示，我們所提出的方法較改良式的Partition 演算法，在執行速度與擴充性上皆有較佳的表現。

關鍵字

雲端運算；數值樣式；頻繁樣式；多維度資料庫；資料探勘

並列摘要

Most previously proposed methods mine frequent patterns from symbolic databases, where numerical data may be transformed numerical data into a symbolic representation by a list of breakpoints. However, using different breakpoints to transform numerical data into a symbolic representation can result in different patterns. Thus, it is better to directly mine numerical patterns without any transformation. Mining numerical patterns from databases is one of the most computation-intensive applications, which can be dealt with by using the cloud computing infrastructure. To the best of our knowledge, there is no algorithm dedicated to mining frequent numerical patterns on the cloud. Therefore, in this thesis, we propose an efficient algorithm for mining frequent numerical patterns in a multi-dimensional database on the MapReduce framework, where every transaction in the database contains multiple numerical values. The proposed algorithm is composed of three MapReduce jobs. The first job is to mine frequent patterns of length one (1-patterns) in parallel. The second job gathers the information about frequent 1-patterns mined, and utilizes the information to divide the mining process into nearly-equally sized tasks. The third job distributes these tasks to different worker instances, each of which recursively mines frequent patterns in a depth-first search (DFS) manner until no more frequent patterns can be found. During the mining process, we employ two effective speedup strategies to form tasks of nearly-equal size and balance the workload, and an approach to divide a multi-dimensional database into independent partitions so that the mining tasks can be performed independently and parallelly. Therefore, the proposed method can efficiently mine frequent numerical patterns in a multi-dimensional database. The experimental results show that the DNM algorithm outperforms the modified Partition-Apriori algorithm in orders of magnitude.

並列關鍵字

Cloud computing ； numerical pattern ； frequent pattern ； multi-dimensional database ； data mining

參考文獻

[1] R. C. Agarwal, C.C. Aggarwal, V.V.V. Prasad, A tree projection algorithm for generation of frequent itemsets, Journal of Parallel Distributed Computing, Vol. 61, No. 3, 2001, pp. 350-371.

[5] A. Baratloo, M. Karaul, Z. Kedem, P. Wyckoff, Charlotte: Metacomputing on the web, Future Generation Computer Systems, Vol. 15, No. 5-6, 1999, pp. 559-570.

[7] D Burdick, M Calimlim, J Flannick, MAFIA: A maximal frequent itemset algorithm, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 11, 2005, pp. 1490-1504.

[8] G. E. Blelloch, Scans as primitive parallel operations, IEEE Transactions on Computers, Vol. 38, No. 11, 1989, pp. 1526-1538.

[10] L. Di-Jorio, A. Laurent, M. Teisseire, Mining frequent gradual itemsets from large databases, Lecture Notes in Computer Science, Vol. 5772, 2009, pp. 297-308.

國際替代計量

多維度資料庫中頻繁數值樣式之分散式探勘

全文下載

主題瀏覽