On Frequent Sequence Mining and Online Classification Techniques

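In recent years, as data volumes have grown rapidly, the amount of information hidden in the data has grown as well, and research on data mining has become correspondingly important. Many data mining techniques have been developed to extract this hidden information efficiently and accurately. In this thesis, we work on improving two of them: frequent sequence mining and classification. The main difficulty in finding frequent sequences is the high cost of processing large amounts of data. To reduce this cost, we propose a new strategy for finding frequent sequences that avoids computing the support counts of non-frequent sequences. Moreover, previous studies use shorter frequent sequences to prune candidate sequences, whereas our strategy prunes candidates using sequences of the same length. Our strategy can therefore be combined with previous results to speed up mining. We examine three commonly used strategies from earlier work and combine them with our proposed strategy to design a new frequent sequence mining algorithm. By dynamically applying different strategies, the algorithm performs better than earlier algorithms. Experimental results show that our algorithm outperforms previous algorithms under a variety of parameter settings.

The accuracy of multimedia data retrieval can be improved through data classification and user feedback mechanisms. However, building a classifier in a high-dimensional space is quite time-consuming. To support user feedback without making users wait too long, we study how to construct a classifier efficiently. Our main idea is to use an index structure to accelerate classifier construction. To this end, we choose the RCE-network as the classifier to improve, mainly because it has high classification accuracy and because its construction relies on simple geometric concepts, which makes it well suited to acceleration with an index structure. We propose a new RCE-network construction algorithm that avoids the drawbacks of earlier construction algorithms. In addition, we propose a dimension pruning technique to speed up classifier construction in high-dimensional spaces. Compared with several existing classifier construction methods, experimental results show that our method significantly improves the speed of classifier construction.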
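To make the terms "support count" and "pruning with shorter frequent sequences" concrete, the following is a minimal level-wise sketch of the kind of baseline the abstract attributes to previous work. It is not the algorithm proposed in the thesis, whose same-length pruning strategy is not specified here. The sketch assumes each sequence is a tuple of single items and that a pattern is supported by a sequence when it occurs in order with gaps allowed; the function names and the example data are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """Return True if pattern occurs in sequence in order (gaps allowed)."""
    it = iter(sequence)
    # "item in it" advances the iterator, so matches must appear in order.
    return all(item in it for item in pattern)


def support(pattern, database):
    """Count the sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def mine_frequent_sequences(database, min_support):
    """Level-wise mining; candidates are pruned using shorter frequent sequences."""
    items = sorted({item for seq in database for item in seq})
    frequent = {}
    level = {}
    for item in items:
        count = support((item,), database)
        if count >= min_support:
            level[(item,)] = count
    while level:
        frequent.update(level)
        prev = set(level)
        next_level = {}
        for p in prev:
            for q in prev:
                # Join step: p without its first item equals q without its last.
                if p[1:] != q[:-1]:
                    continue
                candidate = p + q[-1:]
                # Prune step: dropping any one item must leave a frequent sequence.
                # A pruned candidate never has its support counted.
                if any(candidate[:i] + candidate[i + 1:] not in prev
                       for i in range(len(candidate))):
                    continue
                count = support(candidate, database)
                if count >= min_support:
                    next_level[candidate] = count
        level = next_level
    return frequent


# Hypothetical toy database of four sequences.
database = [("a", "b", "c"), ("a", "c"), ("a", "b", "c", "d"), ("b", "c")]
result = mine_frequent_sequences(database, min_support=3)
```

In this sketch every surviving candidate still has its support counted even when it turns out to be non-frequent, which is the cost the thesis's strategy is described as avoiding.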

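Keywords

data mining; frequent sequences; sequence matching; strategy switching; classification; high-dimensional space; indexing methods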

Parallel Abstract

In recent years, data mining has become increasingly important because the growth of data brings with it a huge amount of hidden knowledge. Many data mining techniques have been proposed to extract this knowledge efficiently and accurately. In this thesis, we focus on two of them: frequent sequence mining and classification. The main challenge in mining frequent sequences is the high processing cost caused by the large amount of data. We propose a novel strategy that finds all frequent sequences without computing the support counts of non-frequent sequences. Previous works prune candidate sequences based on shorter frequent sequences, whereas our strategy prunes candidate sequences according to non-frequent sequences of the same length. As a result, our strategy can be combined with the previous works to achieve better performance. We then identify three major strategies used in previous works and combine them with our strategy into an efficient algorithm. The novelty of our algorithm lies in its ability to switch dynamically from one of these strategies to our new strategy during mining for better performance. Experimental results show that our algorithm outperforms the previous ones under various parameter settings.

The accuracy of multimedia data retrieval can be enhanced by data classification and a user feedback mechanism. Constructing a classifier for multimedia data in a high-dimensional feature space is known to be time-consuming. To support immediate user feedback, we study how to construct the classifier efficiently. Our main idea is to speed up classifier construction by employing an indexing strategy. The RCE-network classifier suits this purpose because of its high accuracy and simple construction process. We propose a new RCE-network construction algorithm that overcomes the defects of the existing algorithms. Moreover, we use a pruning method with dimension-independent pruning ability to construct the classifier efficiently in the high-dimensional feature space. Compared with several existing classification methods, the experimental results show that our method significantly improves the construction efficiency of the classifier for online use.
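For readers unfamiliar with RCE networks, the sketch below shows a textbook-style RCE (restricted Coulomb energy) training loop, in which each prototype is a hypersphere with a class label. It is an illustration under stated assumptions, not the construction algorithm proposed in the thesis: it uses Euclidean distance, brute-force scans over all prototypes, and hypothetical names (`train_rce`, `classify`, `max_radius`, `max_epochs`). The repeated distance scans are the kind of work an index structure would be expected to reduce.

```python
import math


def train_rce(samples, max_radius, max_epochs=100):
    """Build hypersphere prototypes; each prototype is [center, radius, label]."""
    prototypes = []
    for _ in range(max_epochs):
        changed = False
        for point, label in samples:
            covered = False
            for proto in prototypes:
                d = math.dist(point, proto[0])
                if d < proto[1]:
                    if proto[2] == label:
                        covered = True
                    else:
                        # Shrink a wrong-class sphere until it excludes the point.
                        proto[1] = d
                        changed = True
            if not covered:
                # New sphere: as large as allowed without reaching the
                # nearest center of another class.
                rivals = [math.dist(point, center)
                          for center, _, other in prototypes if other != label]
                prototypes.append([point, min([max_radius] + rivals), label])
                changed = True
        if not changed:
            break
    return prototypes


def classify(prototypes, point):
    """Return the label of the nearest covering prototype, or None if uncovered."""
    best = None
    for center, radius, label in prototypes:
        d = math.dist(point, center)
        if d < radius and (best is None or d < best[0]):
            best = (d, label)
    return None if best is None else best[1]


# Hypothetical two-class toy data in two dimensions.
samples = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"),
           ((5.0, 5.0), "b"), ((5.0, 6.0), "b")]
prototypes = train_rce(samples, max_radius=3.0)
```

The `max_epochs` bound is a safeguard added for this sketch, since identical points with conflicting labels would otherwise keep producing zero-radius prototypes.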


