A Breadth-First Data Mining Method for Sequential Patterns

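Since the GSP algorithm was proposed, many related algorithms have followed, most of them focused on finding the complete set of sequential patterns. The CloSpan algorithm was the first to propose mining the closed set instead. The closed set is more compact than the full set yet has the same expressive power. CloSpan builds on the PrefixSpan algorithm and adds two pruning techniques, which it calls backward sub-pattern and backward super-pattern, to find the closed set efficiently. We therefore propose a new algorithm for mining the closed set. Unlike earlier algorithms, most of which adopt a depth-first strategy, our algorithm is breadth-first. In addition, earlier algorithms rarely make explicit use of item ordering to speed up pattern discovery. We use a positional data list to preserve item ordering. These data guide pattern generation, and building on them we propose two pruning techniques: the backward super-pattern condition and the same positional data condition. To keep the lattice that stores the final result correct and compact, we also handle several special cases. Experimental results show that our algorithm outperforms CloSpan on medium-to-large databases and at small support thresholds.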

Keywords

sequential pattern; closed set; data mining

Parallel Abstract

Since the GSP algorithm was proposed for mining sequential patterns in sequence databases, many methods have followed, most of them focused on mining the complete set of frequent patterns. The CloSpan algorithm first suggested that the closed set of sequential patterns is more compact than the full set while retaining the same expressive power. Building on the PrefixSpan algorithm, CloSpan added two pruning techniques, backward sub-pattern and backward super-pattern, to mine the closed set efficiently. In this thesis, we propose a new sequential pattern mining algorithm for mining closed sequences. Instead of the depth-first search used in many previous methods, we adopt a breadth-first approach. Moreover, previous methods seldom exploit item ordering to improve efficiency. We use a list of positional data to preserve item-ordering information. Using these positional data, we developed two main pruning techniques, the backward super-pattern condition and the same positional data condition. To keep the resulting lattice correct and compact, we also handle several special cases. Experimental results show that our algorithm outperforms CloSpan on moderately large datasets and at low support thresholds.
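The claim that the closed set is more compact than the full set can be made concrete with a small sketch. The code below is not the algorithm proposed in this thesis: it uses neither the positional data list nor the two pruning conditions, and it assumes every sequence element is a single item. It simply enumerates frequent patterns level by level (breadth-first) and then keeps only the closed ones, that is, those with no proper super-sequence of equal support. The function names `mine_frequent` and `closed_only` are illustrative.

```python
from typing import Dict, List, Tuple

Sequence = Tuple[str, ...]


def is_subsequence(pattern: Sequence, sequence: Sequence) -> bool:
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is respected.
    return all(item in it for item in pattern)


def support(pattern: Sequence, database: List[Sequence]) -> int:
    """Number of database sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database)


def mine_frequent(database: List[Sequence], min_support: int) -> Dict[Sequence, int]:
    """Naive level-wise (breadth-first) enumeration of frequent patterns.

    Assumes min_support >= 1, so the search stops once patterns grow
    longer than the longest sequence in the database.
    """
    items = sorted({item for seq in database for item in seq})
    frequent: Dict[Sequence, int] = {}
    level: List[Sequence] = [(item,) for item in items]
    while level:
        next_level: List[Sequence] = []
        for candidate in level:
            count = support(candidate, database)
            if count >= min_support:
                frequent[candidate] = count
                # Only frequent patterns are extended to the next level.
                next_level.extend(candidate + (item,) for item in items)
        level = next_level
    return frequent


def closed_only(frequent: Dict[Sequence, int]) -> Dict[Sequence, int]:
    """Keep patterns with no proper super-sequence of equal support."""
    closed: Dict[Sequence, int] = {}
    for p, sup in frequent.items():
        absorbed = any(
            len(q) > len(p) and sq == sup and is_subsequence(p, q)
            for q, sq in frequent.items()
        )
        if not absorbed:
            closed[p] = sup
    return closed


if __name__ == "__main__":
    db = [("a", "b", "c"), ("a", "b", "c"), ("a", "c")]
    freq = mine_frequent(db, min_support=2)
    print(len(freq), "frequent patterns")
    print(closed_only(freq))
```

On this toy database with a minimum support of 2, seven frequent patterns reduce to two closed ones, `("a", "c")` with support 3 and `("a", "b", "c")` with support 2. The support of every other frequent pattern can be read off from the closed pattern that contains it, which is what "same expressive power" means here.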

Parallel Keywords

sequential pattern; closed set; data mining

References

[4] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, “PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth”, In Proc. Int. Conf. Data Engineering (ICDE ’01), Heidelberg, Germany, April 2001, pp. 215-224.

[7] M. J. Zaki, “SPADE: An efficient algorithm for mining frequent sequences”, Machine Learning, vol. 42, no. 1-2, 2001, pp. 31-60.

[11] R. Agrawal and R. Srikant, “Mining sequential patterns”, In Proc. Int. Conf. Data Engineering (ICDE’95), Taipei, Taiwan, March 1995, pp. 3-14.

[13] X. Yan, J. Han, and R. Afshar, “CloSpan: Mining Closed Sequential Patterns in Large Datasets”, In Proc. SIAM Int. Conf. on Data Mining (SDM'03), San Francisco, CA, May 2003.

[1] F. Masseglia, F. Cathala, and P. Poncelet, “The PSP approach for mining sequential patterns”, In Proc. 1998 European Symp. Principle of Data Mining and Knowledge Discovery (PKDD’98), Nantes, France, September 1998, pp. 176-184.
