資料流最近常見項目集變動探勘之研究

本論文針對資料流滑動視窗的模型提出一個探勘狀態變動項目集的方法，稱為CV-SCD(Cross-Verify Status Change Detection)演算法。本方法主要利用兩棵稱為Base-Tree及Delta-Tree的相同字首樹之樹狀結構，儲存在任一時間點t時滑動視窗中所有交易資料，以及從t到t+1之間新增及過時的交易資料，並利用Base-Tree及Delta-Tree的資訊判斷出狀態變動項目集，再同時對兩棵樹遞迴建立包含特定項目的條件樹，以探勘出更長的狀態變動項目集。本論文對固定區間長度探勘出的狀態變動項目集儲存成狀態變動資料項集快照，並採用金字塔式時間框架的結構來儲存快照，提供可由使用者指定特定時間區間對其中各狀態變動項目集的變動情形進行相對特性分析。實驗結果顯示，當新增及過時的交易資料相對於滑動視窗資料為少量，或是資料集中包含之項目種類較多，或是在支持度小的情況下，CV-SCD演算法相較於以FP-growth探勘出常見項目集後再進行狀態變動項目集比對可顯著增進執行效率。

關鍵字

資料流；滑動視窗；資料探勘

並列摘要

In this thesis, a method for discovering recently status-changed itemsets over data streams is proposed, which is named the CV-SCD (Cross-Verify Status Change Detection) algorithm. In this algorithm, two prefix tree structures, which are named Base-Tree and Delta-Tree, respectively, are constructed for maintaining the transaction data in a sliding window at time t, and the newly inserted and removed transactions at time t+1. The data stored in the Base-Tree and Delta-Tree is used to discover the status-changed itemsets at t+1 with respect to t. By performing cross-verification on Delta-Tree and Base-Tree, then by constructing conditional Delta-Tree and conditional Base-Tree recursively, all the status-changed itemsets are discovered in depth-first search. The discovered status-changed itemsets within a fixed time interval are stored in a snapshot of status-changed itemsets. Accordingly, given a specific period of time interval, the changing characteristics of itemsets in this interval can be compared relatively. The experiment results show that FP-growth to discover all the frequent itemsets at each time point and then performing itemsets matching to get status-changed itemsets, the CV-SCD algorithm provides significant improvement on execution time when the added and expired transactions are few, or there are many different items in transaction data, or the support is small.

並列關鍵字

data stream ； sliding window ； data mining

參考文獻

[19] J. H. Chang and W. S. Lee, “A Sliding Window Method for Finding Recently Frequent Itemsets over Online Data Streams,” In Journal of Information Science and Engineering, Vol. 20, No. 4, July 2004.

[1] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation, ” in Proc. of ACM SIGMOD Int. Conf. on Management of data, 2000.

[2] Y. Chi, H. Wang , P. S. Yu ,and R. R. Muntz,” Moment: Maintaining closed frequent itemsets over a stream sliding window,” In Proc. of Int. Conf. on Data Mining and Knowledge Discovery, 2004.

[4] H. J. Woo, W. S. Lee, “estMax: Tracing maximal frequent itemsets over online data streams,” in Proc. of the 2007 Seventh IEEE Int. Conf. on Data Mining, 2007.

[5] B. Mozafari, H. Thakkar, C. Zaniolo, “Verifying and Mining Frequent Patterns from Large Windows over Data Streams,” in Proc. of the 25th Int. Conf. on Data Engineering, 2008.

國際替代計量

資料流最近常見項目集變動探勘之研究

主題瀏覽