透過您的圖書館登入
IP:44.200.39.110
  • 學位論文

資料串流環境中之頻繁時間樣式探勘

Mining of Frequent Temporal Patterns on Data Streams

指導教授 : 陳銘憲

摘要


近年來,許多資料查詢與資料探勘的相關議題被引進資料串流的環境中。在眾多資料探勘議題中,以購物交易資料進行頻繁樣式的探索已被公認為極具重要性之研究方向。在此論文中,主要的研究課題有三:一、以靜態的交易資料庫作為出發點來探討頻繁樣式的探勘模型,進而推廣其觀念以應用於線上環境所產生之交易資料串流;二、研究資料串流環境中有限資源之有效利用方式;三、研發追蹤線上資料串流時的品質保證機制。其明確之相關研究內容簡述如下: 由探索頻繁樣式進而推導關連性法則的流程,我們發展了替代性法則此一新的資料探勘技術。所謂的替代性是指當購買行為發生時,顧客對於商品間的選擇與取捨關係。此種替代性法則的發掘,可以有效提供關於購物預測、顧客行為分析與決策支援等各方面的寶貴知識。具體而言,藉由建立替代性法則的理論基礎,我們更歸納商品組合間的出現頻率關係,來增進計算負面商品組合時的效率,並進而獲取深具統計意義的結果。 為了在資料串流環境中探索頻繁時間樣式,我們首先發展出一適用於各種時間樣式之頻率計數架構,並設計一具有兩大主要特色的演算法:首先是僅對每筆記錄進行單次掃瞄來線上蒐集各統計量值,另一特色則是利用迴歸理論產生簡潔的樣式表示式。藉此,線上的交易資料串流可即時地轉化為各種可能的頻繁樣式,各樣式的頻率變化亦可以多線段的迴歸方法來追蹤,而更利用了區段微調 (segment tuning) 與區段緩和(segment relaxation) 的技巧來確保記憶體的使用容量。結合這些特性後,此演算法不僅能對可變的時間區間進行資料探勘工作,並能有效地進行趨勢偵測。 在資料串流環境中另一應被重視的課題為:如何妥善利用記憶體空間與運算能力等有限的資源來產生準確的預估模型。針對追蹤線上時間序列時不同的精細度考量,系統資源可確保被利用於使用者較為重視的部分,例如:時間精細度是指隨著時間的變化,人們對越新發生的事件較感興趣,這意味著較多的系統資源應被用於仔細地探索較新的資料;此外,當進行頻繁時間樣式之資料探勘工作時,較多的資源也應用於處理所謂的邊緣樣式(borderline patterns)—發生頻率非常接近臨界值者,藉此能有效地辨別正確的頻繁樣式。有鑑於此,我們發展了以小波理論為基礎的演算法,來實現此一具資源感知性的樣式探勘工作,而藉由動態調配記憶體空間的使用方式,多個動態資料串流所產生之時間樣式亦可被正確發掘。 為追蹤由感測器收集得來或由資料探勘演算法產生之時間序列,我們利用小波轉換中的能量守恆特性來推導L1 與L2 誤差的理論關係,在藉由捨去較不重要的小波係數以節省寶貴系統資源時,可提供還原原始序列後的誤差保證。此外,為了處理無限長的線上資料串流,我們提出一個較佳的資料結構以適用於動態的資料摘要保存方式。經實驗證明,此種具解析度可適性之漸進拆解法,在保留時間序列重要特徵上所花的記憶體空間十分小,而當進行漸進式資料更新時,可以得到近似最佳解。

並列摘要


In recent years, several query problems and mining capabilities have been explored for a data stream environment. Among various data mining capabilities, the one receiving a significant amount of research attention is on mining frequent patterns over market basket data. In this dissertation, we first explore the model of frequent itemsets from static transaction databases and generalize relevant concepts to discovering of temporal relationship from online transaction flows. Then, we investigate the resource utilization issues in a data stream environment. Finally, we study the problem of quality guarantees when tracking online data streams. For the problem of mining frequent itemsets to derive association rules, a new mining capability, called mining of substitution rules, is first developed by extending the concepts of mining of association rules. Substitution refers to the choice made by a customer to replace the purchase of some items with that of others. The discovery of substitution rules, same as that of association rules, will lead to very valuable knowledge in various aspects, including market prediction, user behavior analysis and decision support. Specifically, we first derive theoretical properties for the model of substitution rule mining and devise a technique on the induction of positive itemset supports to improve the efficiency of support counting for negative itemsets. Then, in light of these properties, algorithm SRM (standing for substitution rule mining) is designed and implemented to discover the substitution rules efficiently while attaining good statistical significance. To mine frequent temporal patterns on data streams, a regression-based algorithm, called algorithm FTP-DS (Frequent Temporal Patterns of Data Streams) is devised. While providing a general framework of pattern frequency counting, algorithm FTP-DS has two major features, namely one data scan for online statistics collection and regression-based compact pattern representation. To attain the feature of one data scan, the data segmentation and the pattern growth scenarios are explored for the frequency counting purpose. Algorithm FTP-DS scans online transaction flows and generates candidate frequent patterns in real time. The second important feature of algorithm FTP-DS is on the regression-based compact pattern representation. In addition, we develop the techniques of the segmentation tuning and segment relaxation to enhance the functions of FTP-DS. With these features, algorithm FTP-DS is able to not only conduct mining with variable time intervals but also perform trend detection effectively. The fundamental problem that how the limited resources, e.g., memory space and computation power, can be well utilized to produce accurate estimates in a data stream environment is also addressed. Two important features for tracking mined patterns with properly utilized resources are examined. The first issue is temporal granularity which refers to the phenomenon that as time advances, people are more interested in recent events, meaning that more resources can be utilized to explore more recent data with finer granularities. Second, with the mining task of discovering frequent temporal patterns, more resources are expected to be allocated to the processing of those borderline patterns whose statistics, e.g., occurrence frequencies, are close to the specified threshold so as to have proper frequent itemset identification. This feature is called mining with support count granularity. Consequently, a wavelet-based algorithm, called algorithm RAM-DS (Resource-Aware Mining for Data Streams) is devised to perform general pattern mining tasks for data streams by exploring both temporal and support count granularities. Algorithm RAM-DS is designed to not only reduce the memory required for data storage but also retain good approximation of target time series. In addition, algorithm RAM-DS can support a varying number of data streams by allocating memory space adaptively when tracking patterns generated from online transactions. For tracking online time series data which is directly collected from sensors or is generated by stream mining algorithms, we explore the energy preservation property of wavelet-based transform. The commonly used L1- and L2-error metrics are theoretically guaranteed when insignificant coefficients are discarded for saving precious resources in our framework. In addition, to handle infinite online data flows, an enhanced data structure RAID-tree which is based on the error tree is proposed for dynamic synopses maintenance over data streams. Specifically, an algorithm RAID with the resolution adaptability for incremental decomposition is developed. Experimental results have shown that the memory required for storing significant features of time series data is very small and the quality of approximation is stable when performing incremental data updates.

參考文獻


[39] J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series
[81] A. Savasere, E. Omiecinski, and S. Navathe. An Efficient Algorithm for Mining Association
[91] K. Wang, S. Q. Zhou, and S. C. Liew. Building Hierarchical Classifiers Using Class Proximity.
[25] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and
Generation of Frequent Itemsets. Journal of Parallel and Distributed Computing (Special

延伸閱讀