帳號:guest(3.138.141.202)          離開系統
字體大小: 字級放大   字級縮小   預設字形  

詳目顯示

以作者查詢圖書館館藏以作者查詢臺灣博碩士論文系統以作者查詢全國書目
作者(中文):邱鼎穎
論文名稱(中文):On Frequent Sequence Mining and online Classification Techniques
論文名稱(外文):頻繁序列探勘和線上分類技術之研究
指導教授(中文):陳良弼
學位類別:博士
校院名稱:國立清華大學
系所名稱:資訊工程學系
學號:908312
出版年(民國):97
畢業學年度:97
語文別:英文
論文頁數:96
中文關鍵詞:資料探勘頻繁序列序列比對策略轉換分類高維空間索引方法
相關次數:
  • 推薦推薦:0
  • 點閱點閱:242
  • 評分評分:*****
  • 下載下載:11
  • 收藏收藏:0
近幾年來,隨著資料大量的成長,隱藏在資料中的資訊也越來越多,因此,資料探勘的相關研究也隨之重要起來。為了能有效率且精確的擷取隱藏在資料中的資訊,許多資料探勘的技術被發展出來。在本論文中,我們致力於改進兩項技術:頻繁序列探勘技術和分類技術。
找尋頻繁序列的主要困難來自於需要花費很高的代價來處理大量的資料。因此,為了節省處理代價,在本論文中,我們提出一個新的找尋頻繁序列的策略,其可以避免計算非頻繁序列的支持記數。此外,先前的研究利用較短的頻繁序列來刪除候選序列,但我們的策略則是用相同長度的序列來刪除候選序列。因此,我們的策略可以跟先前的研究結果相互配合,以達到加速的目的。我們探討了過去三個常用的策略,並且將這些策略與我們所提出的策略做結合,設計出一個新的探勘頻繁序列的演算法。此演算法藉著動態利用不同的策略,來達到比過去其他演算法更好的效果。實驗結果顯示我們的演算法在各種參數設定下都比過去的演算法還好。
多媒體資料擷取的精確度可以藉由資料分類和使用者回饋的機制來提升。然而,在高維空間中建構一個分類器是一件相當耗時的工作。為了支援使用者回饋的機制,避免使用者等待時間太久,在本論文中,我們研究如何有效率的建構一個分類器。我們的主要想法是在分類器的建構過程中,利用索引結構來加速建構的過程。為此,我們選擇RCE-network來作為要改進的分類器,主要是因為其具有相當高的分類精確度,此外,其建構過程是利用簡單的幾何概念來達成,因此合適於利用索引結構來加速。我們提出了一個新的RCE-network建構演算法,避免過去建構演算法的缺點。此外,我們提出維度刪剪技術來加速在高維空間中建構分類器的過程。與數個現有的分類器建構方法相比,實驗結果顯示我們的方法顯著的提升了分類器建構的速度。
In recent years, the field of data mining is getting more important. The reason is that the growth of data brings a huge amount of hidden knowledge. For efficiently and accurately extracting the knowledge, many data mining techniques are proposed. In the thesis, we focus on two techniques: frequent sequence mining and classification.
The main challenge of mining frequent sequences is the high processing cost due to the large amount of data. In this thesis, we propose a novel strategy to find all the frequent sequences without having to compute the support counts of non-frequent sequences. The previous works prune candidate sequences based on the frequent sequences with shorter lengths, while our strategy prunes candidate sequences according to the non-frequent sequences with the same lengths. As a result, our strategy can cooperate with the previous works to achieve a better performance. We then identify three major strategies used in the previous works and combine them with our strategy into an efficient algorithm. The novelty of our algorithm lies in its ability to dynamically switch from a strategy to our new strategy in the mining process for a better performance. Experiment results show that our algorithm outperforms the previous ones under various parameter settings.
The accuracy of multimedia data retrieval can be enhanced by a data classification and feedback mechanism. It is known that constructing a classifier for the multimedia data in high dimensional feature space is time-consuming. For supporting user feedbacks immediately, in this thesis we study how to efficiently construct the classifier. Our main idea is to speed up the classifier construction process by employing an indexing strategy. The RCE-network classifier is good for this purpose due to its high accuracy and simple construction process. A new RCE-network construction algorithm which overcomes the defects of the existing algorithms was proposed. Moreover, a pruning method with dimension-independent pruning ability was used to efficiently construct the classifier in the high dimensional feature space. Compared with several existing classification methods, the experiment results show that our method significantly promotes the construction efficiency of the classifier for its online uses.
1 Introduction
1.1 Frequent Sequence Mining
1.2 Classification
2 Related Works
2.1 Related Works on Frequent Sequence Mining
2.2 Related Works on Classification
3 Efficient Frequent Sequence Mining by a Dynamic Strategy Switching Algorithm
3.1 The DISC Strategy
3.1.1 Sequence Comparison
3.1.2 Finding the Minimum K-subsequence
3.1.3 Mining Frequent K-sequences
3.1.4 Finding the Conditional Minimum K-subsequence
3.1.5 Strength of the DISC Strategy
3.2 The DISC Strategy Correctness Analyses of MKS-gen and CMKS-gen
3.2.1 Correctness Analysis of MKS-gen
3.2.2 Correctness Analysis of CMKS-gen
3.3 The Dynamic DISC-all Algorithm
3.3.1 Dynamic Stage Transition
3.3.2 Locative AVL-tree
3.4 Experiments
3.4.1 Comparison with PrefixSpan
3.4.2 Comparison with SPAM
3.4.3 Analysis of Reduction Rate
4 Enhancing the Accuracy of Multimedia Data Retrieval by an Online Classifier
4.1 Classifier construction
4.1.1 The RCE-network and its construction algorithms
4.1.1.1 The structure of the RCE-network
4.1.1.2 Two existing construction algorithms of the RCE-network
4.1.2 Our construction algorithm of the RCE-network
4.1.3 The classification problem of the RCE-network
4.2 Finding a suitable index method
4.2.1 Index selection
4.2.2 The MRP method and its drawbacks
4.2.2.1 The MRP method
4.2.2.2 The drawback of the MRP method
4.2.3 Our pruning method
4.2.3.1 Finding the approximate nearest neighbor by multiple reference points
4.2.3.2 The dimension pruning
4.3 Experiments
4.3.1 The experiments on synthetic data
4.3.2 The experiments on real data
5 Future Works
[1] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu, “A Framework for On-Demand Classification of Evolving Data Streams,” IEEE Transactions on Knowledge and Data Engineering, 18(5): 577-589, 2006.
[2] R. Agrawal and R. Srikant, “Fast Algorithm for Mining Association Rules,” Proc. of International Conf. on Very Large Data Bases, pp. 487-499, 1994.
[3] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. of IEEE International Conf. on Data Engineering, pp. 3-14, 1995.
[4] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu “Sequential Pattern Mining using A Bitmap Representation,” Proc. of ACM Conf. on Knowledge Discovery and Data Mining, 2002.
[5] J. Barros, J. French, W. Martin, P. Kelly, and M. Cannon, “Using the Triangle Inequality to Reduce the Number of Comparisons Required for Similarity-based Retrieval,” International Conference on Storage and Retrieval for Image and Video Databases, pp. 392-403, 1996.
[6] A. Berman and L. Shapiro, “Efficient Image Retrieval With Multiple Distance Measures,” International Conference on Storage and Retrieval for Image and Video Databases, pp. 12-21, 1997.
[7] J. K. Bonfield and R. Staden, “ZTR: A New Format for DNA Sequence Trace Data,” Bioinformatics, 18(1): 3-10, 2002.
[8] C.J.C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,” Data Mining and Knowledge Discovery, 2(2): 121-167, 1998.
[9] G.H. Cha, X. Zhu, D. Petkovic, and C.W. Chung, “An Efficient Indexing Method for Nearest Neighbor Searches in High-Dimensional Image Database,” IEEE Transactions on Multimedia, 4(1):76-87, 2002.
[10] D. Y. Chiu, Y. H. Wu, A. L. P. Chen, “An Efficient Algorithm for Mining Frequent Sequences by a New Strategy without Support Counting,” Proc. of IEEE International Conf. on Data Engineering, pp. 375-386, 2004.
[11] P. Ciaccia, M. Patella, and P. Zezula, “M-tree: An efficient access method for similarity search in metric spaces,” Proc. of International Conf. on Very Large Data Bases, pp. 426-435, 1997.
[12] S. Cong, J. Han, and D. Padua, “Parallel Mining of Closed Sequential Patterns,” Proc. of ACM International Conf. on Knowledge Discovery in data mining, pp. 562-567, 2005.
[13] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, 20(3): 273-197, 1995.
[14] B. Cui, B.C. Ooi, J. Su, and K.L. Tan, “Indexing High-Dimensional Data for Efficient In-Memory Similarity Search,” IEEE Transactions on Knowledge and Data Engineering, 17(3): 339-353, 2005.
[15] A. Durg, W.V. Stoecker, J.P. Cookson, S.E. Umbaugh, and R.H. Moss, “Identification of Variegated Coloring in Skin Tumors: Neural Network vs. Rule-based Induction Methods,” IEEE Engineering in Medicine and Biology Magazine, 12(3): 71-74, 98, 1993.
[16] T. Evgeniou, M. Pontil, C. Papageorgiou, and T. Poggio, “Image Representation and Feature Selection for Mulitmedia Database Search,” IEEE Transactions on Knowledge and Data Engineering, 15(4): 911-920, 2003.
[17] V. Gaede and O. Günther, “Multidimensional Access Methods,” ACM Computing Surveys, 30(2): 170-231, 1998.
[18] M. N. Garofalakis, R. Rastogi, and K. Shim, “Mining Sequential Patterns with Regular Expression Constraints,” IEEE Transactions on Knowledge and Data Engineering, 14(3): 530-552, 2002.
[19] K. Goh, E.Y. Chang, and K.T. Cheng, “SVM Binary Classifier Ensembles for Image Classification,” ACM Conference on Information and Knowledge Management, pp. 395-402, 2001.
[20] J. W. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. C. Hsu, “FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining,” Proc. of ACM International Conf. on Knowledge Discovery and Data Mining, pp. 355-359, 2000.
[21] C. C. Ho, H. F. Li, F. F. Kuo and S. Y. Lee, “Incremental Mining of Sequential Patterns over a Stream Sliding Window,” Proc. of IEEE International Conf. on Data Mining Workshops, pp. 677-681, 2006.
[22] J. L. Hsu, C. C. Liu, and A. L. P. Chen “Discovering Nontrivial Repeating Patterns in Music Data,” IEEE Transactions on Multimedia, 3(3): 311-325, 2001.
[23] D.H. Kim and C.W. Chung, “Qcluster: Relevance Feedback Using Adaptive Clustering for Content-Based Image Retrieval,” ACM SIGMOD Conference, pp. 599-610, 2003.
[24] J. K. Kruschke and J. R. Movellan, “Benefits of Gain: Speed Learning and Minimal Hidden Layers in Back-propagation Networks”, IEEE Transaction on systems, Man and Cybernetics, 21(1): 273-280, 1991.
[25] N. Lesh, M. J. Zaki, and M. Ogihara, “Mining Features for Sequence Classification,” Proc. of ACM International Conf. on Knowledge Discovery and Data Mining, pp. 342-346, 1999.
[26] Y. Lu, H. Zhang, L. Wenyin, and C. Hu, “Joint Semantics and Feature Based Image Retrieval Using Relevance Feedback,” IEEE Transactions on Multimedia, 5(3): 339-347, 2003.
[27] T.M. Mitchell, “Artificial Neural Networks,” Machine Learning, pp. 81-127, McGraw Hill, 1997.
[28] X. Mu, M. Artiklar, M.H. Hassoun, and P. Watta, “An RCE-based Associative Memory with Application to Human Face Recognition,” INNS-IEEE International Joint Conference on Neural Networks, 4: 2552-2557, 2003.
[29] N. Panda and E.Y. Chang, “KDX: An Indexer for Support Vector Machines,” IEEE Transactions on Knowledge and Data Engineering, 18(6): 748-763, 2006.
[30] N. Panda, E.Y. Chang, and G. Wu, “Concept Boundary Detection for Speeding up SVMs,” International Conference on Machine Learning, pp. 681-688, 2006.
[31] J. Pei, J. W. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. C. Hsu, “PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth,” Proc. of IEEE International Conf. on Data Engineering, pp. 215-224, 2001.
[32] J. Pei, J. W. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. C. Hsu, “Mining Sequential Patterns by Pattern Growth: The PrefixSpan Approach,” IEEE Transactions on Knowledge and Data Engineering, 16(11): 1424- 1440, 2004.
[33] J. Pei, J. W. Han, and W. Wang, “Mining Sequential Patterns with Constraints in Large Databases,” Proc. of ACM Conf. on Information and Knowledge Management, 2002.
[34] H. Pinto, J. W. Han, J. Pei, K. Wang, Q. Chen, and U. Dayal, “Multi-Dimensional Sequential Pattern Mining,” Proc. of ACM International Conf. Information and Knowledge Management, pp. 81-88, 2001.
[35] O. Procopiuc, P.K. Agarwal, L. Arge, and J.S. Vitter, “Bkd-tree: A Dynamic Scalable Kd-tree,” International Symposium on Spatial and Temporal Databases, pp. 46-65, 2003.
[36] F. Qian, M. Li, H.J. Zhang, W.Y. Ma, and B. Zhang, “Alternating Feature Spaces in Relevance Feedback,” Multimedia Tools and Applications, 21(1): 35-54, 2003.
[37] C. Raissi, P. Poncelet, and M. Teisseire, “SPEED: Mining Maximal Sequential Patterns over Data Streams,” Proc. of IEEE International Conf. on Intelligent Systems, pp 546-552, 2006.
[38] V. Rajan, J. Ying, S. Chakrabarty, and K. Pattipati, “Machine Learning Algorithms for Fault Diagnosis in Analog Circuits,” IEEE Conference on Systems, 2: 1874-1879, 1998.
[39] J.T. Robinson, “Physical Storage Structures: The K-D-B-tree: A Search Structure for Large Multidimensional Dynamic Indexes,” ACM SIGMOD Conference, pp. 10-18, 1981.
[40] P. Y. Rolland, “FlExPat: Flexible Extraction of Sequential Patterns,” Proc. of IEEE International Conf. on Data Mining, pp. 481-488, 2001.
[41] C. She, J. Tang, L. Li, H. Wang and Z. Fan, “An Improved Parallel Algorithm for Sequence Mining,” Proc. of the IEEE International Conf. on Mechatronics and Automation, pp. 1692-1696, 2005.
[42] R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and Performance Improvements,” Proc. of International Conf. on Extending Database Technology, 1996.
[43] S. Tong and E. Chang, “Support Vector Machine Active Learning for Image Retrieval,” ACM Multimedia conference, pp. 107-118, 2001.
[44] A. P. Vries, N. Mamoulis, N. Nes, and M. Kersten, “Efficient k-NN Search on Vertically Decomposed Data,” ACM SIGMOD Conference, pp. 322-333, 2002.
[45] R. Weber, H.J. Schek, and S. Blott, “A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Space,” Proc. of International Conf. on Very Large Data Bases, 1998.
[46] M. A. Weiss, Data Structures and Algorithm Analysis in C–2nd ed., Addison-Wesley, 1997.
[47] J. J. Wesselink, B. Iglesia, et al. “Determining a Unique Defining DNA Sequence for Yeast Species Using Hashing Techniques,” Bioinformatics, 18(7): 1004-1010, 2002.
[48] D.A. White and R. Jain, “Similarity Indexing with the SS-tree,” IEEE Conference on Data Engineering, pp. 516-523, 1996.
[49] B. M. Wilamowski, S. Iplikci, O. Kaynak, and M. Önder Efe, “An Algorithm for Fast Convergence in Training Neural,” INNS-IEEE International Joint Conference on Neural Networks, pp.1778-1782, 2001.
[50] Y. H. Wu and A. L. P. Chen, “Prediction of Web Page Accesses by Proxy Server Log,” World Wide Web: Internet and Web Information Systems, 5(1): 67-88, 2002.
[51] J. Yang, W. Wang, P. S. Yu, and J. W. Han, “Mining Long Sequential Patterns in a Noisy Environment,” Proc. of ACM International Conf. on Management of Data, 2002.
[52] C. Yu, B.C. Ooi, K.L. Tan, and H.V. Jagadish, “Indexing the Distance: An Efficient Method to KNN Processing,” Proc. of International Conf. on Very Large Data Bases, pp. 421-430, 2001.
[53] H. Yu, J. Yang, J. Han, and X. Li, “Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing,” Data Mining and Knowledge Discovery, 11(3): 295-321, 2005.
[54] D. Yu and A. Zhang, “ClusterTree: Integration of Cluster Representation and Nearest-neighbor Search for Large Data Sets with High Dimensions,” IEEE Transactions on Knowledge and Data Engineering, 15(5): 1316-1337, 2003.
[55] M. J. Zaki, “Efficient Enumeration of Frequent Sequences,” Proc. of ACM International Conf. on Information and Knowledge Management, pp. 68-75, 1998.
[56] M. J. Zaki, “SPADE: An Efficient Algorithm for Mining Frequent Sequences,” Machine Learning, 42(1): 31-60, 2001.
 
 
 
 
第一頁 上一頁 下一頁 最後一頁 top
* *