在資料串流環境中使用封閉子樹作頻繁子樹探勘

樹狀結構被使用在許多應用上比如說生物科技、XML資料庫、和網站紀錄資料庫。常見樹狀結構的探勘則在查詢最佳化、結構分類、網站推薦上有其價值。然而，常見樹狀結構探勘的其中一個挑戰就是候選子樹的樹木會隨著樹的大小成指數性的成長。為了處理這樣大量的資訊，其中一個可能的解決方法就是封閉子樹。在此篇論文當中，我們提出了一個叫CSTMer的方法利用先找出封閉子樹來找出常見頻繁子樹。在這方法中，我們主要介紹一個壓縮的資料結構-封閉字首樹來保存用來找出封閉子樹所相關的資訊。封閉子樹的數量通常比所有子樹的數量還少，因此用來探勘頻繁子樹所需的記憶體需求減少了。此外，我們和一個探勘頻繁子樹的演算法STMer做比較。實驗的結果顯示出我們的方法大量減少了記憶體需求。

關鍵字

資料串流；探勘；常見

並列摘要

Tree structures are commonly adopted in many applications, such as bioinformatics, XML database, and web log databases. Mining frequent tree patterns can be valuable for query optimization, pattern classification, and web recommendation. However, one of the challenges for mining frequent subtrees is that the number of the candidate subtrees may exponentially grow with the tree size. In order to deal with such massive information, the closed subtrees is one of the promising solutions. In this paper, we propose an approach, named CSTMer (Closed Stream Tree Miner), for mining frequent subtrees over data streams by discovering the closed subtrees first. In the approach, we mainly introduce a compact structure, named closed global prefix tree, which maintains the associated information needed for deriving the closed subtrees. The size of the set of closed subtrees is usually smaller than the size of the set of all frequent subtrees. Therefore, the memory consumption needed for mining frequent subtrees can be reduced. In addition, we compare the proposed approach with STMer which discovers all frequent subtrees. The experiment result shows that our approach greatly reduces the memory consumption.

並列關鍵字

data stream ； mining ； frequent

參考文獻

[6] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. of Intl. Conf. on Very Large Data Bases (VLDB’02), 2002.

[7] J. X. Yu, Z. Chong, H. Lu, and A. Zhou. False positive or false negative: mining frequent itemsets from high speed transactional data streams. In Proc. of Intl. Conf. on Very Large Data Bases (VLDB’04), 2004.

[8] Y. Xiao, J. F. Yao, Z. Li, and Margaret H. Dunham. Efficient Data Mining for Maximal Frequent Subtrees. In Proc. of IEEE International Conference on Data Mining (ICDM,03), 2003

[9] Y. Chi, Y. Xia, Y. Yang, and R. R. Muntz. Mining Closed and Maximal Frequent Subtrees from Databases of Labeled Rooted Trees. In Proc. of IEEE Transaction on Knowledge and Data Engineering (TKDE’05), 2005

[10] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window. In Proc. of Intl. Conf. on Data Mining (ICDM’04), 2004.

國際替代計量

在資料串流環境中使用封閉子樹作頻繁子樹探勘

主題瀏覽