Classification Techniques for Data Streams with Concept Drift

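Effectively managing data streams, which are massive, fine-grained, and rapidly accumulating, is a new challenge for data mining, which has focused mainly on analyzing static data. Traditional classifiers are designed to extract stationary predictive patterns from data, so on concept-drifting data streams they cannot effectively capture and learn the new concepts that emerge after a drift. In this thesis, we propose an efficient classifier for concept-drifting data streams based on SODA (Speedy cOncept-drift Detection Algorithm). SODA is an online incremental classifier whose main advantage is that it analyzes incoming data in constant time while learning the new concepts that follow a drift. The thesis makes three main contributions. First, unlike prior work, we define concept drift in a data stream as "a significant change in the data distribution of the attribute that is most discriminative with respect to the target class." With a fast drift detection method built on this definition and on statistical testing, SODA captures concept drifts in data streams quickly and effectively. Second, by integrating a decision-tree pruning check function based on information gain and classification accuracy, the depth of the decision tree is kept at its optimal size, so that both detection efficiency and classifier accuracy are maintained. Finally, through an effective alternative-tree selection strategy, SODA checks whether the current decision tree is still up to date and selects the best alternative tree. A series of experiments shows that SODA improves on the algorithms proposed in prior work in drift detection effectiveness, execution efficiency, classification accuracy, and memory usage.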
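The drift definition above can be illustrated with a small sketch. It is not the SODA implementation: the abstract does not name the statistical test or the discriminativeness measure, so this example assumes information gain for picking the most discriminative attribute and a chi-square homogeneity statistic for comparing an older window with a recent one. All function names, the window layout, and the `critical_value` parameter are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction from splitting on the categorical attribute `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def most_discriminative(rows, labels, attrs):
    """Attribute with the highest information gain (an assumed measure)."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))


def chi_square(old_values, new_values):
    """Chi-square homogeneity statistic between two samples of one attribute."""
    old_counts, new_counts = Counter(old_values), Counter(new_values)
    n_old, n_new = len(old_values), len(new_values)
    stat = 0.0
    for value in set(old_counts) | set(new_counts):
        # Share of this value when both windows are pooled together.
        pooled = (old_counts[value] + new_counts[value]) / (n_old + n_new)
        for observed, n in ((old_counts[value], n_old), (new_counts[value], n_new)):
            expected = pooled * n
            stat += (observed - expected) ** 2 / expected
    return stat


def drift_detected(old_rows, old_labels, new_rows, attrs, critical_value):
    """Flag drift when the key attribute's distribution shifts significantly.

    `critical_value` is supplied by the caller, e.g. 3.841 for one degree
    of freedom at the 0.05 significance level.
    """
    key = most_discriminative(old_rows, old_labels, attrs)
    stat = chi_square([r[key] for r in old_rows], [r[key] for r in new_rows])
    return key, stat > critical_value
```

Only the single most discriminative attribute is tested, which keeps the check cheap compared with monitoring every attribute; this sketch recomputes counts per window for clarity and makes no attempt at the constant-time per-example processing described in the abstract.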

Keywords

Data streams; concept drift; data mining; decision trees

Parallel Abstract

In this thesis we devise a concept-drift-driven classification algorithm, called SODA (Speedy Concept-Drift Detection Algorithm), to mine data streams with concept drift. SODA is an online incremental learning algorithm that keeps its model consistent with new concepts and processes each example in constant time. The contributions of SODA are manifold. We address the problem of detecting concept drifts by inspecting the distribution of the one attribute that is most discriminative with respect to the target class. SODA captures concept drifts in data streams efficiently while maintaining both execution performance and classifier accuracy. The empirical studies in Section 4 show that, by applying the efficient split-checking method, concept drift detection with statistical analysis, and the effective alternative tree selection strategy, SODA outperforms prior work in execution efficiency, drift detection performance, and memory usage. Thus, the concepts in data streams can be captured and learned efficiently, and SODA strikes a balance between memory usage and classifier accuracy on data streams.

Parallel Keywords

Data Streams; Concept Drift; Classification; Data Mining
