一個能快速探勘前k高效益物品集的演算法

在資料探勘中，探勘前k高效益物品集是一項新興的議題，其中k代表的是使用者想要的高效益物品集的個數。在探勘前k高效益物品集這一議題之前，已有探勘高效益物品集的議題，而探勘高效益物品集則是發掘那些效益值高於使用者定義的效益門檻值的物品集。但對於使用者來說，設定一個適當的效益門檻值是一個困難的問題，如果門檻值設定得太低，太多的高效益物品集會被產生，而這可能導致探勘的演算法變得效率低下，或更甚至耗盡記憶體。另一方面，如果門檻值設定得太高，則可能找不到高效益物品集。因此，探勘前k高效益物品集更適合那些想要找出特定個數的高效益物品集的使用者。為了探勘前k高效益物品集，現有的演算法首先會先高估物品集的效益值去產生候選物品集，緊接著再計算候選物品集的真正效益值去獲得真正的前k高效益物品集。但現有的演算法會有產生太多候選物品集的問題，而大部分的候選物品集在經過計算真正的效益值之後，會發現它們並非為前k高效益物品集。在本篇論文中，我們提出了一個能快速探勘前k高效益物品集的演算法，其名稱為TKU+。通過避免產生候選物品集的方式，TKU+可以有效的探勘前k高效益物品集。在不同的資料集上，我們將TKU+和目前最先進的演算法相比，實驗結果顯示TKU+不論在執行時間及記憶體消耗上，均優於目前最先進的演算法。

關鍵字

資料探勘；效益探勘；高效益物品集；前k物品集探勘

並列摘要

Mining top-k high utility itemsets is an emerging topic in data mining, where k is the desired number of high utility itemsets to be mined. Although several studies have been carried out on mining high utility itemsets, which refers to the discovery of itemsets with utilities higher than a user-specified minimum utility threshold min_util. But setting an appropriate minimum utility threshold is a difficult problem for users. If min_util is set too low, too many high utility itemsets will be generated, which may cause the mining algorithms to become inefficient or even run out of memory. On the other hand, if min_util is set too high, no high utility itemset will be found. Therefore, mining top¬-k high utility itemsets is more suitable for users who want to find a certain number of high utility itemsets. To identify top-k high utility itemsets, the existing algorithms first generate candidate itemsets by overestimating their utilities, and subsequently compute the exact utilities of these candidates. These algorithms incur the problem that a very large number of candidates are generated, but most of the candidates are found out to be not top-k high utility itemsets after their exact utilities are computed. In this thesis, we propose a fast algorithm, named TKU+, for mining top-k high utility itemsets. By avoiding the costly generation and utility computation of numerous candidate itemsets, TKU+ can efficiently mine top-k high utility itemsets. We compared TKU+ with the state-of-the-art algorithms on various datasets, and experimental results show that TKU+ outperforms these algorithms in terms of both running time and memory consumption.

並列關鍵字

data mining ； utility mining ； high utility itemset ； top-k itemset mining

參考文獻

[6] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without candidate generation: a frequent-pattern tree approach," Data Mining and Knowledge Discovery, vol. 8, pp. 53-87, 2004.

[11] Y.-L. Cheung and A. W.-C. Fu, "Mining frequent itemsets without support threshold: with and without item constraints," IEEE Transactions on Knowledge and Data Engineering, vol. 16, pp. 1052-1069, 2004.

[12] W. Jianyong, H. Jiawei, L. Ying, and P. Tzvetkov, "TFP: an efficient algorithm for mining top-k frequent closed itemsets," IEEE Transactions on Knowledge and Data Engineering, vol. 17, pp. 652-663, 2005.

[13] S.-C. Ngan, T. Lam, R. C.-W. Wong, and A. W.-C. Fu, "Mining N-most interesting itemsets without support threshold by the COFI-tree," International Journal of Business Intelligence and Data Mining, vol. 1, pp. 88-106, 2005.

[14] T. Quang, S. Oyanagi, and K. Yamazaki, "ExMiner: an efficient algorithm for mining top-k frequent patterns," Advanced Data Mining and Applications, vol. 4093, pp. 436-447, 2006.

國際替代計量

一個能快速探勘前k高效益物品集的演算法

全文下載

主題瀏覽