A Study on Classification Mining of Changing Data Streams Using Data-Pattern Clustering

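Unlike the historical data kept in a static database, data in a data stream arrives continuously, at high speed, and without bound, and its underlying concept and distribution may change over time. Classification in a data stream environment must therefore balance speed and accuracy, detect concept changes in a timely manner, and adjust the existing classification model so that it fits the most recent concept and distribution of the data. This thesis first uses the nearest neighbor algorithm to cluster the training data and then selects one representative pattern from each cluster to generate a classification rule, which reduces the size of the classification model. In addition, because recent data usually reflects the current trend better, a sliding window is used during classification to observe changes in the classification accuracy within the window and to record the data points in the window that cannot be predicted properly. When the classification accuracy falls below a preset threshold, the recorded data points are used to adjust the original classification model. The number of classification rules can be tuned through the distance threshold set during clustering, which makes the space occupied by the classification model flexible. The method can also effectively detect concept changes and dynamically adjust the classification model to maintain accuracy when predicting the classes of incoming data.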
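The clustering and rule-construction step described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the thesis's implementation: it assumes numeric feature vectors, Euclidean distance, clusters that hold a single class label, and a representative pattern chosen as the cluster member closest to the cluster centroid. The names `Cluster`, `cluster_training_data`, `build_rules`, and `predict` are hypothetical.

```python
import math
from dataclasses import dataclass, field

Point = tuple[float, ...]


@dataclass
class Cluster:
    label: str
    members: list[Point] = field(default_factory=list)

    def centroid(self) -> Point:
        n = len(self.members)
        return tuple(sum(coords) / n for coords in zip(*self.members))

    def representative(self) -> Point:
        # Assumption: the representative pattern is the member nearest the centroid.
        c = self.centroid()
        return min(self.members, key=lambda p: math.dist(p, c))


def cluster_training_data(data, distance_threshold):
    """Threshold-based nearest neighbor clustering of (point, label) pairs.

    A point joins the nearest cluster of its own class if that cluster's
    centroid lies within distance_threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for x, y in data:
        same_class = [c for c in clusters if c.label == y]
        nearest = min(same_class,
                      key=lambda c: math.dist(x, c.centroid()),
                      default=None)
        if nearest is not None and math.dist(x, nearest.centroid()) <= distance_threshold:
            nearest.members.append(x)
        else:
            clusters.append(Cluster(y, [x]))
    return clusters


def build_rules(clusters):
    # One rule per cluster: (representative pattern, class label).
    return [(c.representative(), c.label) for c in clusters]


def predict(rules, x):
    # Assumption: a new point takes the label of the nearest representative.
    return min(rules, key=lambda rule: math.dist(x, rule[0]))[1]


training = [((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((5.0, 5.0), "b"),
            ((5.2, 5.0), "b"), ((0.0, 1.0), "a")]
tight_rules = build_rules(cluster_training_data(training, 0.5))
loose_rules = build_rules(cluster_training_data(training, 2.0))
```

In this sketch a larger distance threshold merges more points into each cluster and so yields fewer rules, which is the trade-off between model size and granularity that the abstract describes.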

Keywords

classification; clustering; changing data streams; concept change

Parallel Abstract

Unlike a static database that stores historical data, a data stream is produced continuously, at high speed, and without bound. Moreover, the underlying concept and the distribution of the data may change over time. Accordingly, a classification model must not only predict correctly and efficiently, but also detect concept changes and adjust its classification rules in time to follow recent trends. In this thesis, a clustering-based classification method is proposed to reduce the number of classification rules. First, the nearest neighbor algorithm is adopted to cluster the training data. Then a representative pattern is chosen from each cluster to construct a classification rule. During prediction, the exception data in a sliding window (the data points that cannot be predicted properly) is recorded and the accuracy within the window is monitored. When the accuracy falls below a user-defined threshold, the recorded data is used to adjust the classifier. The proposed method controls the number of classification rules through the distance threshold set when clustering the data, so the storage required by the constructed classifier is adaptable. Furthermore, the constructed classifier can be adjusted when concept changes occur, so that predictions remain accurate.
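The sliding-window monitoring described above might look roughly like the following sketch. It is illustrative only and rests on several assumptions that the abstract does not state: the true label of each point becomes available after it is predicted, rules are (representative pattern, label) pairs matched by Euclidean nearest neighbor, and the adjustment step simply adds the recorded exception points as new rules. The abstract does not specify the actual adjustment procedure, and the names `predict` and `classify_stream` are hypothetical.

```python
import math
from collections import deque


def predict(rules, x):
    # Assumption: a point takes the label of the nearest representative pattern.
    return min(rules, key=lambda rule: math.dist(x, rule[0]))[1]


def classify_stream(stream, rules, window_size=4, min_accuracy=0.5):
    """Monitor accuracy over a sliding window and adjust rules when it drops.

    Returns the number of adjustments performed. The rules list is modified
    in place.
    """
    window = deque(maxlen=window_size)  # entries: (point, true_label, correct)
    adjustments = 0
    for x, true_label in stream:
        correct = predict(rules, x) == true_label
        window.append((x, true_label, correct))
        if len(window) == window_size:
            accuracy = sum(ok for _, _, ok in window) / window_size
            if accuracy < min_accuracy:
                # Placeholder adjustment (an assumption): turn the recorded
                # exception points in the window into new rules.
                rules.extend((p, y) for p, y, ok in window if not ok)
                window.clear()
                adjustments += 1
    return adjustments


rules = [((0.0, 0.0), "a"), ((5.0, 5.0), "b")]
stream = [
    ((0.1, 0.0), "a"), ((5.0, 5.1), "b"), ((0.0, 0.2), "a"), ((4.9, 5.0), "b"),
    # Simulated concept change: points near the origin now belong to class "b".
    ((0.1, 0.1), "b"), ((0.2, 0.0), "b"), ((0.0, 0.1), "b"),
]
adjustments = classify_stream(stream, rules)
```

Only the most recent `window_size` points influence the decision to adjust, so older data that reflects an outdated concept drops out of consideration automatically.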


