在資料串流環境中以時間性滑動視窗探勘封閉頻繁項目集

探勘頻繁項目集向來是資料探勘領域中的主流，在資料串流上有許多新穎的應用，舉凡電腦、電話以及交通網路上的流量資料，沒有流量限制且僅能檢視一次資料。以往研究提出處理固定交易量的滑動視窗，近年研究更提出以時間為單位的時間性滑動視窗，將資料串流切成數個連續的時間區段，各區段資料量取決於單位時間流量的變化。有鑑於以時間為單位更能符合使用者需求，本研究提出在資料串流上探勘時間性滑動視窗中封閉頻繁項目集的方法，以封閉頻繁項目集可還原所有頻繁項目集的特性，讓系統僅須記錄極少量項目集資訊，以滿足資料串流上必須有效運用系統資源的需求。本論文並提出新穎的刪除條件，觀察最近時間區段內項目集出現次數的變化趨勢，藉此預估未來該項目集是否可能成為頻繁項目集，提高整體探勘的準確度。實驗結果顯示採用新刪除條件的探勘結果，正確率、回收率以及計數誤差皆優於過去方法，其中正確率可提高至80%以上；此外，比較封閉頻繁項目集與所有頻繁項目集的數量，在緊密資料分佈下探勘結果的壓縮效率可達90%以上，不僅減少系統資源需求，執行時間也優於探勘所有頻繁項目集的方法。

關鍵字

滑動視窗；頻繁項目集；資料探勘；資料串流

並列摘要

Mining frequent itemsets has been the mainstream of the data mining field for years. There are many new applications on data streams such as data flows through computer, telephone and traffic networks. Such data are unbounded in amount and can only be read once. Most of previous studies adopt a sliding window to extract a fixed number of transactions from a data stream at every moment. Recently, a time-sensitive sliding window is proposed to divide the data stream into blocks by time. The number of transactions in a block depends on the data arrival rate and therefore can be variable. We propose a novel approach for mining closed frequent itemsets with a time-sensitive sliding window over data streams. Since closed frequent itemsets is the compression of all frequent itemsets, both the space requirement and process time can be reduced. Moreover, we devise a new method of itemset deletion to avoid false dismissals. Our method keeps watching the counts of an itemset in recent blocks to predict whether the itemset can be frequent in the near future. Experimental results show that our method performs better than the previous work in the precision, recall and count error rate. In addition, compared with all frequent itemsets, our method using closed frequent itemsets achieves better space utilization and time efficiency.

並列關鍵字

frequent itemsets ； sliding window ； data streams ； data mining

參考文獻

[4] J.H. Chang, W.S. Lee “Finding recent frequent itemsets adaptively over online data streams,” KDD, pp. 487-492, 2003.

[1] R. Agrawal, T. Imielinski, A. N. Swami, “Mining Association Rules between Sets of Items in Large Databases,” ACM SIGMOD, pp. 207-216, 1993.

[2] B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom, “Models and Issues in Data Stream Systems,” PODS, pp. 1-16, 2002.

[3] B Babcock., M. Datar, R. Motwani, L. O'Callaghan, “Maintaining variance and k-medians over data stream windows,” PODS, pp. 234-243, 2003.

[5] M. Charikar, K. Chen, M. Farach-Colton, “Finding Frequent Items in Data Streams,” ICALP, pp. 693-703, 2002.

國際替代計量

在資料串流環境中以時間性滑動視窗探勘封閉頻繁項目集

全文下載

主題瀏覽