An Efficient Incremental Data Mining Algorithm: ICI

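As information technology advances and computers become ubiquitous, collecting data has become easier, faster, and more convenient. Over time, however, databases accumulate large volumes of data that contain hidden knowledge. How to mine this hidden knowledge correctly and efficiently has therefore become an important issue, and data mining techniques emerged in response. Among them, association rule mining is the most widely used. Association rule mining studies how to find frequent itemsets in a large database and then derive useful knowledge from them. The method most commonly used for association rules is the Apriori algorithm. Although Apriori can find association rules, it has two major drawbacks: first, it generates a large number of candidate itemsets while searching for frequent itemsets; second, it must scan the entire database repeatedly, which degrades performance. Many subsequent studies have tried to address these drawbacks, but none departs from the overall Apriori framework, so their gains in efficiency have been limited. The ICI algorithm proposed in this study departs from the Apriori framework: it needs to scan the database only once when generating frequent (large) itemsets, which effectively reduces I/O access time and finds association rules quickly, making mining more efficient. In addition, the ICI algorithm can serve as an on-line incremental data mining algorithm without any modification.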
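The two Apriori drawbacks described above can be made concrete with a small sketch. The code below is a minimal, textbook-style level-wise Apriori written for illustration only; the function name `apriori`, the counters `scans` and `candidates_counted`, and the sample transactions are assumptions and do not come from the paper. It reports how many passes over the data and how many candidate itemsets a run needs.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise Apriori sketch.

    Returns (frequent itemsets with support counts,
             number of counting passes over the database,
             total number of candidate itemsets counted).
    """
    transactions = [frozenset(t) for t in transactions]
    frequent = {}
    scans = 0
    candidates_counted = 0

    # Level-1 candidates: every distinct item (collected up front for simplicity).
    candidates = {frozenset([item]) for t in transactions for item in t}
    k = 1
    while candidates:
        # Drawback 2: one full pass over the database for every level.
        scans += 1
        # Drawback 1: every candidate must be counted, frequent or not.
        candidates_counted += len(candidates)
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)

        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)

    return frequent, scans, candidates_counted


if __name__ == "__main__":
    # Hypothetical sample data, with min_support given as an absolute count.
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    itemsets, passes, total_candidates = apriori(data, min_support=3)
    print(passes, total_candidates, len(itemsets))
```

Even on this tiny input, the number of passes grows with the size of the longest candidate itemset, which is the I/O cost the abstract refers to.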

Keywords

Data Mining; Association Rule; Apriori Algorithm; Frequent Itemsets; Incremental Data Mining

Parallel Abstract

With advances in information technology and the spread of computers, collecting information has become easier, faster, and more convenient than before. Over time, databases accumulate large volumes of data that contain hidden knowledge. How to uncover this hidden knowledge correctly and efficiently has therefore become an important issue, and data mining is one of the solutions. Among data mining techniques, association rule mining is one of the most popular. It studies how to extract frequent itemsets from a large database and derive the knowledge implicit in them. The Apriori algorithm is one of the most frequently used algorithms for this task. Although Apriori can successfully derive association rules from a database, it has two major defects. First, it produces a large number of candidate itemsets while extracting frequent itemsets. Second, it scans the whole database many times, which leads to poor performance. Many studies have tried to improve the performance of Apriori, but they remain within its framework and achieve only small improvements. In this paper we propose ICI (Incremental Combination Itemsets), which departs from the Apriori framework and needs to scan the whole database only once to extract the frequent itemsets. The ICI algorithm therefore reduces I/O time, extracts frequent itemsets quickly, and makes data mining more efficient. Moreover, ICI does not need to rescan the database or reconstruct its data structure when the database is updated or the minimum support is changed. It can therefore be applied to online incremental mining applications without any modification.
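The abstract does not describe how ICI works internally, so the sketch below is not the ICI algorithm. It only illustrates, under stated assumptions, the general property claimed above: if support counts are maintained for itemset combinations as each transaction is read once, then new transactions and a changed minimum support can both be handled without rescanning. The class `CombinationCounter`, its `max_size` parameter, and the sample data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


class CombinationCounter:
    """Hypothetical helper: keeps a support count for every itemset seen so far."""

    def __init__(self, max_size=None):
        self.counts = Counter()
        # Optional cap on itemset size (an assumption, added because the
        # number of combinations grows exponentially with transaction length).
        self.max_size = max_size

    def add(self, transaction):
        """Fold one transaction into the counts; it is read exactly once."""
        items = sorted(set(transaction))
        limit = len(items) if self.max_size is None else min(self.max_size, len(items))
        for k in range(1, limit + 1):
            for combo in combinations(items, k):
                self.counts[combo] += 1

    def frequent(self, min_support):
        """Answer a query for any threshold from stored counts, with no rescan."""
        return {s: n for s, n in self.counts.items() if n >= min_support}


if __name__ == "__main__":
    counter = CombinationCounter()
    for t in [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]:
        counter.add(t)
    print(counter.frequent(3))

    # Incremental update: a new transaction arrives, earlier data is not reread.
    counter.add({"a", "b", "c"})
    print(counter.frequent(3))

    # Changing the threshold is only a new query over the stored counts.
    print(counter.frequent(2))
```

The trade-off in this naive version is memory: a transaction with n items contributes up to 2^n - 1 combinations, so a practical single-scan method needs a more compact structure than a flat counter.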

Parallel Keywords

Data Mining; Association Rule; Frequent Itemsets; Incremental Mining
