排序對關聯演算法速度的影響

摘要近年各企業全面電腦化的普及程度幾乎到了沒有企業不使用電腦的地步，而全方位的電腦化，使得企業累積了大量資料，也加快了資料累積的速度，而這些資料是企業經營的第一手資料，內含大量豐富的潛在資訊，可是以往都將這些資料做為財務報表的資料來源而己，鮮少於在其中發現有價值的隱含資訊，更不要說其中蘊含的知識. 但企業競爭的加劇，全球化市場的需求，商業由原來的守勢轉變為攻勢，掌握機先則成為生存的要務，在需求孔急眾緣齊備的狀況，資料探勘成為當紅炸子雞，我們希望尋找資料挖掘中關聯規則(Association Rules) 的新架構。因古典的Apriori 演算法，利用猜測瀘除的方法，找出交易中品項間的關係，使用Apriori 演算法耗費時間，多次的產生龐大的備選集，再以掃描資料庫過瀘支持度低的備選集，需要大量的記憶體並且執行速度緩慢。本研究利用壓縮陣列將交易資料的關聯頻次紀錄在記憶體中，而再由完整的關聯頻次紀錄找出關聯規則，由於全部作業皆在記憶體中運作，直接歸納，不做猜測瀘除的步驟，故效率提昇顯著，而因為壓縮率較FP-Tree為高故在FP-Tree不適運作的狀況亦能穩定作業，且FP-Tree在實作時常使用指標連結，故較難serialize，也較難儲存結果，而SortCompress使用壓縮陣列，緊密且小，便於儲存結果，且直接循序讀出即可和新的壓縮陣列合併，故在實用上提供了更多的應用方式，適合做為基底的關聯規則演算法，提供其它變化型的關聯應用．關鍵字:資料挖掘、關聯式規則、交易物項、高頻物項集合、非高頻物項集合

關鍵字

關聯式規則；交易物項；頻物項集合；非高頻物項集合；資料挖掘

並列摘要

Abstract In recent years many business use computer to keep their transaction data and control their business process. Until now all Corp. has their own computers keep a lot’s of their process digital data. Faster and faster data generates. Those data conclude very important information and some useful knowledge , but we had no idea to use and discover it , but now the information science has a big improves, the faster computer, the larger size disk, the more popular database application, those let’s can find the knowledge hides in database. The way to find knowledge in data is called “Data mining”. One data mining method is called “Association Rules”. A classical method “Apriori” has use on it for many years, It’s a way to look for Association Rules but it’s really not a good way, Apriori use the Psychology “Guess and filter” so spend many efforts on something no need to care about, so It’s need many main memory and compute time. Someone discovered a smart way to do it. Use a compressed tree to record the pattern with counts , so can in two pass database read to find whole all association rules. The compressed tree we call it “FP-Tree”. It’s a pretty nice method to find association rules. It’s brought a reduced complexity order and lower efforts. But it’s not made by the god. It’s need much memory to construct the FP-Tree. So it can hardly use on very large database, if it need to swap the tree between memory and disk. It’s may be a nightmare. The master idea of FP-Tree is compress transaction database into a limited main memory. Main memory faster than disk for ten thousand times. So use it can reduce lot’s of efforts on swap and scan. Now we find an new way with large compress ratio on transfer transaction data into to main memory can stored and can easy to use it to extracts association rules. We named this logarithm “SortCompress”. We use a compress array to keep the transaction association rules and use sort to help the compress process faster. It slower than FP-Tree a little, but it just need half size memory to stored it, so we can process more complexity transaction data use just an normal computer, and do it well. Now we will introduce that how we do it. Keywords: datamining, association rules, apriori, fp-tree, sortcompress,

並列關鍵字

sortcompress ； fp-tree ； apriori ； association rules ； datamining

參考文獻

[1] A. Savasere, E. Omiecinski, and S. Navathe, “An Efficient Algorithm for Mining Association Rules in Large Databases”, In Proceedings of 21st VLDB, pp.432-444,1995

[2] D. Lin and Z. M. Kedem, “Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set”, In Proceedings of VI Intl. Conference on Extending Database Technology, 1998.

[6] J.S.Park,M.S.Chen,and P.S.Ui,”An Effective Hash Based Algorithm for Mining Association Rules”, Proceedings of the ACM SIGMOD, pp.1750186,1995

[7] M.Houtsma and A. Swami, “Set-Oriented Mining of Association Rules in Relational Databases,” 11th int’l Conference on Data Engineer, 1995

參考文獻

Google Scholar

國際替代計量

排序對關聯演算法速度的影響

未授權

主題瀏覽