A Heuristic Sampling Approach to Data Preparation for Data Mining on Big Data

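Big data has in recent years become a very popular topic in both industry and academia. Such data is not only enormous in volume and heterogeneous in origin but also keeps growing as new records arrive, and these characteristics make it harder to analyze than earlier data. Existing data mining methods are very likely to prove inapplicable to big data; in particular, their running time may be too long to produce analysis results promptly, and when the number of objects or attributes is too large they may fail to produce any result at all. In this study we use classification based on association rules as the data mining classification method and, without modifying the existing data mining method, address these big data problems through data selection and preprocessing together with evaluation and integration of the resulting classifiers. The proposed method has two parts. The first applies purposeful heuristic sampling to the data in its initial state so that the sampled data adequately represents the whole big data population, then computes the discriminative power and importance of each attribute and selects the important attributes for subsequent data mining. Depending on how the data is distributed, the sampling rate can be adjusted with an appropriate method as needed, so that corresponding classification rules are available for particular rare classes. The second part deals specifically with data growth. The initial-state procedure is first used to sample the existing data and the newly arrived data separately and to build a classifier for each; then, by integrating new and existing data, the old and new classifiers are merged, the rules in the merged classifier are re-verified, unnecessary rules are deleted, and the remaining rules are re-ranked to form the final adjusted classifier, which can then be applied to the data. Experimental results show that, when applied to data mining on big data, the proposed method achieves accuracy similar to that obtained using all of the data while effectively reducing the required running time, so that analysis results can be produced quickly and applied to other data.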
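The abstract does not spell out the sampling heuristic, so the following is only a minimal sketch of one way to adjust the sampling rate per class so that rare classes stay represented. The function name `stratified_sample`, the `rate` and `min_per_class` parameters, and the per-class floor itself are assumptions made for illustration, not the method of this study.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, rate, min_per_class=1, seed=0):
    """Sample every class at `rate`, keeping at least `min_per_class`
    records of each class (hypothetical floor for rare classes)."""
    rng = random.Random(seed)

    # Group the records by their class label.
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)

    sample = []
    for group in by_class.values():
        # Proportional size, raised to the floor, capped at the class size.
        k = max(min_per_class, round(len(group) * rate))
        k = min(k, len(group))
        sample.extend(rng.sample(group, k))
    return sample

# Usage: 1000 common records and 10 rare ones, sampled at 10%.
records = [("common", i) for i in range(1000)] + [("rare", i) for i in range(10)]
sample = stratified_sample(records, label_of=lambda r: r[0], rate=0.1, min_per_class=5)
```

With a plain 10% rate the rare class would contribute a single record; the floor raises its effective sampling rate, which is the kind of adjustment the abstract describes for rare classes.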

Keywords

Big data; incremental data; data mining; data classification; data preparation; data sampling; attribute selection

Parallel Abstract

Big data has become a highly popular topic in both industry and academia. It is characterized by extreme scale, diverse data sources, and incremental growth. These characteristics make big data harder to analyze, and classic data mining techniques are likely to become infeasible. In particular, the processing time may be too long to produce analytic results in time. Furthermore, mining may fail to produce any result at all when the number of objects or attributes is too large. In this study, we use classification based on association rules as our data mining technique. Without changing the existing data mining method, we address the problems of big data through data preparation, integration, and evaluation. The proposed algorithm consists of two parts. The first part is a heuristic sampling method for the initial phase: it samples data that is representative of the big data population and then selects the attributes that are important and discriminative. The sampling result can then be passed to subsequent data mining techniques. To handle different class distributions, an undersampling method can be applied so that corresponding rules are generated for specific rare classes. The second part deals with the incremental problem. Using the initial-phase procedure, we sample the preliminary data and the incremental data and build a classifier for each; we then merge the sampled data and use it to verify the combined classifier. After pruning invalid rules and ranking the remaining rules, we obtain the final modified classifier, which can be applied to other data in the population. When the proposed algorithm is applied to data mining in a big data environment, it produces results comparable to those obtained with the whole dataset. Moreover, the processing time is significantly reduced, so analytic results can be obtained in time for further applications.
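The merge step for incremental data can be sketched as follows. This is a hedged illustration only: the `Rule` type, the `evaluate` and `merge_classifiers` helpers, the `min_conf` threshold, and the ordering by confidence, then support, then rule length are all assumptions; the abstract says only that rules are re-verified, pruned, and re-ranked.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    antecedent: frozenset  # set of (attribute, value) pairs
    label: str             # predicted class

def evaluate(rule, data):
    """Return (support, confidence) of `rule` on `data`,
    where `data` is a list of (attribute_dict, label) pairs."""
    covered = [lab for attrs, lab in data
               if rule.antecedent <= set(attrs.items())]
    if not covered:
        return 0.0, 0.0
    correct = sum(1 for lab in covered if lab == rule.label)
    return correct / len(data), correct / len(covered)

def merge_classifiers(old_rules, new_rules, data, min_conf=0.5):
    """Union two rule lists, re-verify each rule on the combined sample,
    drop rules that no longer hold, and re-rank the rest."""
    scored = []
    for rule in set(old_rules) | set(new_rules):
        support, confidence = evaluate(rule, data)
        if support > 0 and confidence >= min_conf:
            scored.append((confidence, support, rule))
    # Higher confidence first, then higher support, then shorter rules.
    scored.sort(key=lambda t: (-t[0], -t[1], len(t[2].antecedent)))
    return [rule for _, _, rule in scored]

# Usage: combined sample of old and new data, plus one rule list from each.
data = [({"a": 1, "b": 0}, "x"), ({"a": 1, "b": 1}, "x"),
        ({"a": 1, "b": 0}, "x"), ({"a": 0, "b": 1}, "y"),
        ({"a": 0, "b": 0}, "y")]
old_rules = [Rule(frozenset({("a", 1)}), "x")]
new_rules = [Rule(frozenset({("b", 1)}), "x"), Rule(frozenset({("a", 0)}), "y")]
merged = merge_classifiers(old_rules, new_rules, data, min_conf=0.6)
```

Re-verifying on the combined sample rather than on either part alone is what lets a rule that held only in the old data be pruned once the new data contradicts it.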

Parallel Keywords

Big Data; Incremental Data; Data Mining; Data Classification; Data Preparation; Data Sampling; Attribute Selection
