具興趣時段感知之頻繁與循序樣式資料探勘

在此論文中，我們討論了具興趣時段感知之頻繁與循序樣式資料探勘的問題。我們發現使用者對較新的資料比過去的資料更感興趣。若是能考慮使用者的興趣時段，我們便能夠獲得在交易資料庫中最有趣的頻繁樣式或在序列資料庫中的循序樣式。我們探討了在時間資料庫中發掘相關性的通用模型，在這模型中資料的生存週期可以允許有所不同。為了解決這個問題，我們提出了一種有效的演算法Twain，以便找出頻繁樣式更為精確的頻繁時段。Twain不僅能產生頻繁模式更精確的頻繁時段，也發現了更多有趣的頻繁模式。另外，我們提出了一個循序樣式探勘中的通用模型，處理的資料庫為漸進式的資料庫，而資料庫中的資料可能是靜態的、可被新增的或可被刪除的。此外，我們也提出了一個漸進式的演算法Pisa，逐步在使用者的興趣時段中找尋循序樣式。Pisa採用了一個漸進式循序樹，能夠有效地保留最新的資料序列，並產生最新且完整的循序樣式，同時刪除過時的資料和相對應的循序樣式。最後，我們討論了在前述通用模型中必定存在的可擴展性問題。當資料庫擁有越來越多的序列或使用者的興趣時段加大時，用來處理漸進式循序樣式的時間和空間會急劇增加。由於在單一處理器上計算能力和工作空間有限，通常很難不停地擴大。因此，我們設計了一個分散式的演算法DPSP以處理大量的資料。在每一個時間點，DPSP能夠刪除過時的資料、更新目前的循序樣式和產生在最新的興趣時段中頻繁出現的循序樣式。

關鍵字

頻繁樣式；循序樣式；興趣時段；資料探勘；分散式

並列摘要

In this dissertation, we addressed the frequent and sequential pattern mining problem with period of interest awareness. It is noted that users are usually more interested in recent data than old ones. Taking the period of interest into consideration, we are able to derive most interesting frequent patterns in time domain in a transaction database or sequential patterns in a sequence database. We first explored the general model of mining associations in a temporal database, where exhibition periods of items are allowed to be different from one to another. To address this issue, we proposed an efficient algorithm Twain, standing for TWo end AssocIation miNer to give more precise frequent exhibition periods of frequent temporal itemsets. Twain not only generates frequent patterns with more precise frequent exhibition periods, but also discovers more interesting frequent patterns. We also proposed a general model of sequential pattern mining with a progressive database while the data in the database may be static, inserted or deleted. In addition, we presented a progressive algorithm Pisa, standing for Progressive mIning of Sequential pAtterns, to progressively discover sequential patterns in a defined period of interest. Pisa utilizes a progressive sequential tree to efficiently maintain the latest data sequences, discover the complete set of up-to-date sequential patterns, and delete obsolete data and patterns accordingly. In addition, we examined the intrinsic scalability problem of mining progressive sequential patterns. When the number of sequences grows and the POI becomes larger, the time and space used to conduct progressive sequential patterns increases dramatically. Due to the limited computing power and working space, single processors usually struggle to scale up. Therefore, we designed a distributed algorithm DPSP, standing for Distributed Progressive Sequential Pattern mining algorithm, to deal with large amounts of data. At each timestamp, DPSP is able to delete obsolete itemsets, update current candidate sequential patterns and report up-to-date frequent sequential patterns within the current POI.

並列關鍵字

frequent pattern ； sequential pattern ； period of interest ； data mining ； distributed algorithm

參考文獻

[3] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. Proceedings of the

[74] B. G. J. Muhonen and H. Toivonen. Mining Non-Derivable Association Rules. Proceedings of

[89] A. Savasere, E. Omiecinski, and S. Navathe. An Efficient Algorithm for Mining Association

[91] J. Soo, M.-S. Chen, and P. S. Yu. Efficient Parallel Data Mining for Association Rules. Proceedings

[1] R. Agarwal, C. Aggarwal, and V. Prasad. A Tree Projection Algorithm for Generation of Frequent

國際替代計量

具興趣時段感知之頻繁與循序樣式資料探勘

全文下載

主題瀏覽