植基於可拓理論以探勘大型資料庫中頻繁項目之方法

傳統的資料探勘技術，總是使用單一門檻值作為探勘的依據。而探勘時間也隨著單一門檻值的選擇起伏不定。因此如果選用較低的門檻值，將造成較長的時間需求，如此便進而拖延決策的取捨。本論文提出一種不同以往的探勘方法：讓使用者能夠選取一段範圍門檻值的頻繁項目集作為探勘的對象，並且利用可拓集合的概念，在門檻值以下選取一段範圍之非頻繁項目集，最後再利用FP-增長執行資料探勘。原本非頻繁項目集中之商品項目被購買之頻率均低於門檻值，但是經由某些促銷活動後，這些項目集將有可能成為頻繁項目集。因為網路的盛行，那些屬於可拓集合中之項目，皆有可能因為網路無遠弗屆的通路，成為一時暢銷的商品。探勘一段範圍門檻值之頻繁項目集，能夠大量的減少探勘的時間，超越門檻上限之項目集，因為其支持度足夠高，所以並不在探勘的範圍內。經過驗證的結果顯示，當只有選取某一段範圍進行探勘密集資料庫時，約僅需要2秒至100秒之間。反觀傳統單一門檻值之探勘方法卻需要500秒以上的探勘時間。如此證實了本論文提出之方法確實能夠降低資料探勘的時間。

關鍵字

可拓集合； FP-增長；頻繁項目集；關聯法則；資料探勘

並列摘要

Traditional data mining technologies always use one single threshold as the basis for mining frequent patterns. The required mining time varies with the levels of thresholds. Thus, if the users choose a lower threshold then it takes more time to mine frequent patterns which will in turn delay the decision making. A new mining strategy is proposed in this thesis: we allow users to select a range of threshold for mining frequent itemsets. The concept of extension set is also exploited to select a range of infrequent itemsets. Finally, we use FP-growth algorithm to process data mining job. Infrequent itemsets consist of patterns with supports lower than the threshold and are normally removed from consideration in finding association rules. But if some promotions are planned, the infrequent patterns may have chances to become frequent. For example, some patterns may not overpass the given threshold at the time of extracting frequent patterns. However, if a store plans a kind of sale promotion, it is possible that the infrequent items may become frequent in a short time. Since Internet is being in vogue and wide route, the items belonging to extension set may become popular products. Mining only the frequent patterns in an extended range will save lots of time. We are not interested in the frequent patterns with supports higher than the threshold because they are too frequent to provide rich information in data mining. When only the interesting extended range of supports is considered to find the frequent patterns, only 2 to 100 seconds mining time are needed in our model as compared with more than 500 seconds in conventional model. This verified that the proposed model is efficient for data mining applications.

並列關鍵字

Extension set ； FP-growth ； Data mining ； Large itemset ； Association rules

參考文獻

[3] G. Grahne and J. Zhu, “Fast algorithm for frequent itemset mining using FP-trees,” IEEE Trans. on Knowledge and Data Engineering, vol. 17, issue 10, pp.1347-1362, Oct. 2005.

[6] C. Lucchese, S. Orlando and R. Perego, “Fast and memory efficient mining of frequent closed itemsets,” IEEE Trans. on Knowledge and Data Engineering, vol. 18, no. 1, pp.21-36, Jan. 2006.

[7] J. Wang, J. Han and C. Li, “Frequent closed sequence mining without candidate maintenance,” IEEE Trans. on Knowledge and Data Engineering, vol. 19, no. 8, pp.1042-1056, Aug. 2007.

[8] M. J. Zaki, “Efficiently mining frequent trees in a forest: algorithms and applications,” IEEE Trans. on Knowledge and Data Engineering, vol. 17, no. 8, pp.1021-1035, Aug. 2005.

[13] M. Song and S. Rajasekaran, “A transaction mapping algorithm for frequent itemsets mining,” IEEE Trans. on Knowledge and Data Engineering, vol. 18, no. 4, pp.472-481, Apr. 2006.

國際替代計量

植基於可拓理論以探勘大型資料庫中頻繁項目之方法

未授權

主題瀏覽