實用性導向頻繁集勘測

自從早期頻繁集探勘演算法Apriori 被提出之後，相關頻繁集探勘技術已被廣泛地提出及討論。然而，即使這些技術可以滿足某些現實生活中的應用要求，我們發現，如何提供可實用之頻繁集探勘技術，達到容易使用，低複雜度，高效率等實用課題，依然是一個極富挑戰性問題。有鑑於此，我們在本論文中提出一可直接尋找出top-k (closed) 頻繁集之探勘技術，同時考慮到其所需之執行記憶體上限。不同於以往只考慮如何快速得到正確的頻繁集之技術，我們探討如何限定所需記憶體的上限。同時在這樣的要求下，所提出之MTK 及MTK_Close 演算法可以不用設定使用者難以理解的最小支持度參數。相反的，使用者只需設定其容易理解的參數，也就是所需之頻繁集個數。實驗上證明，MTK 跟MTK_Close 可以同時達成高執行效率及考慮到記憶體上限。此外，本論文亦提出一在滑動視窗模型下，可依序產生高品質樣本之資料取樣技術，稱之為FPS 演算法。在此我們考慮之高品質樣本，指的是每一屬性值之母體比率及樣本比率在一滑動視窗下之相似程度。FPS 演算法有幾個好處：(1)它可以連續地從一個隨時間變化的資料源中產生樣本；(2)FPS 的執行時間相對資料的大小是線性成長的；(3)對大多數屬性值來說，其相對比例誤差可被保證在一定的錯誤上限之下，而剩餘的屬性值的相對比例誤差亦可被極盡的縮小，可保證所產生之樣本是高品質之樣本；(4)產生樣本之取樣比率可以接近使用者所指定的比率，無需增加樣本大小來達成高品質樣本的需求；(5) FPS 亦能優異地保持多維度統計量的比率；(6) FPS 不為特定應用而設計，因此可以使用於不同之探勘應用。在本論文中，我們也觀察了一個實際資料中存在的重要特徵，稱之為itemset supportdistribution，用以提供對實際資料特性更好的模型分析。重要地，從觀察不同的實際資料後，我們確認，Power-Law 的現象的確出現於itemset support distribution 中，並且我們可以用Zipf distribution 來描繪此現象。然後由於要尋找出這個特徵是十分費時的，我們更進一步地，提出一有效率之演算法，用於快速正確的抓取出itemset support distribution 的數學模型特徵。此外，我們也充分利用這個實際資料中發現的特性，提出新穎的機制解決二個重要的探勘問題：(1)自動決定一個在資料串流環境中頻繁集探勘演算法所需之參數；(2)決定頻繁集探勘演算法所需之最低樣本大小。在本論文中，我們也試圖回答一個重要的探勘技術問題：『那些子集將在未來是頻繁集？』這樣的探勘樣式，我們稱之為預期中的頻繁集(Prospective Frequent Patterns)，對使用者是十分有意義的探勘結果。因為許多交互推銷的策略在真實應用中的成功與否，是依靠在頻繁集的出現能否被精確地預言。由於以往的技術不能用來有效地得到預期中的頻繁集，我們在本論文中亦提出一架構，可用來精確地預測預期中的頻繁集，同時也預測他們可能的支持度。

關鍵字

資料探勘；頻繁集；資料取樣；記憶體

並列摘要

Since the early work in algorithm Apriori, a broad spectrum of topics in mining frequent patterns has been studied. While those proposed techniques are important results toward the integration of mining association mining and other real-life requirements, how to provide feasibility-oriented models for mining frequent patterns, to enable easy-use, low-cost, high-efficiency, and realistic mining applications, still remains as a challenging issue. In view of this, we explore in this dissertation a novel algorithm of mining top-k (closed) itemsets in the presence of the memory constraint. As opposed to most previous works that concentrate on improving the mining efficiency or on reducing the memory size by best effort, we first attempt to specify the available upper memory size that can be utilized by mining frequent itemsets. While complying with the requirement of the memory constraint, two efficient algorithms, called MTK and MTK_Close, were thus devised for mining frequent itemsets and closed itemsets, respectively, without specifying the subtle minimum support. Instead, users only need to give a more human-understandable parameter, namely the desired number of frequent (closed) itemsets k. Furthermore, a sampling model, called feature preserved sampling (FPS) that sequentially generates a high-quality sample over sliding windows, is developed. The sampling quality we consider refers to the degree of consistency between the sample proportion and the population proportion of each attribute value in a window. FPS has several advantages: (1) it sequentially generates a sample from a time-variant data source over sliding windows; (2) the execution time of FPS is linear with respect to the database size; (3) the relative proportional differences between the sample proportions and population proportions of most distinct attribute values are guaranteed to be below a specified error threshold, ε, while the relative proportion differences of the remaining attribute values are as close to ε as possible, which ensures that the generated sample is of high-quality; (4) the sample rate is close to the user specified rate so that a high-quality sampling result can be obtained without increasing the sample size; (5) FPS can excellently preserve the population proportion of multivariate statistics in the sample; and (6) FPS can be applied to infinite streams and finite datasets equally, and the generated samples can be used for various applications. We next investigate an important characteristic in real datasets, named the itemset support distribution, to provide better understanding on real datasets. The itemset support distribution refers to the distribution of the count of itemsets versus the itemset support. Importantly, from observations on various retail datasets and as validated by our empirical studies later, we find that the power-law relationship indeed appears in the itemset support distribution and we can characterize that as a Zipf distribution. Since it is prohibitively expensive to retrieve lots of itemsets before we identify the characteristics of the itemset support distribution in targeted data, we also propose a valid and cost-effective algorithm, called algorithm PPL, to extract characteristics of the itemset support distribution. Furthermore, to fully explore the advantages of our discovery, we also propose novel mechanisms with the help of PPL to solve two important problems: (1) determining a subtle parameter for mining approximate frequent itemsets over data streams; and (2) determining the sufficient sample size for mining frequent patterns. In this dissertation, we also attempt to answer an important question: "What patterns will be frequent in the future?" Such a kind of patterns, referred to as prospective frequent patterns, is very informative to end-users, because many cross-selling strategies in real cases rely on the precise prediction of frequent patterns that will appear. Since any naive extension of previous works cannot effectively obtain the desired result, we proposed the framework of PFP, to precisely predict prospective frequent patterns while also predicting their supports.

並列關鍵字

Data Mining ； Frequent Patterns ； Data Sampling ； Memory

參考文獻

[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of VLDB,

[1] F. Afrati, A. Gionis, and H. Mannila. Approximating a Collection of Frequent Sets. In Proc. of

ACM SIGKDD, 2004.

[2] R. Agrawal, T. Imielinski, andA. Swami.Mining association rules between sets of items in large

Wesley, 1986.

國際替代計量

實用性導向頻繁集勘測

全文下載

主題瀏覽