發展二元變數與不平衡資料限制下之自適應加權飄移模型

On the Development of Concept Drifting Model with Adaptive Weights under the Constraints of Binary Variables and Imbalanced Data

Advisor: 藍俊宏

Abstract


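Acquiring and analyzing data streams is increasingly common. However, data streams often suffer from class imbalance (the proportions of defective and non-defective units differ greatly) and concept drift (the data distribution changes over time), and when the two occur together and interact, most machine learning models struggle to complete the target task. Moreover, when all variables are categorical, defining a distance measure that accommodates both class imbalance and concept drift makes analysis and prediction even more challenging. This study proposes an online learning classification framework, ARF-WRE (Adaptive Random Forest with Weighted REsampling), which uses the ARF model as its base framework, takes into account the variable importance of categorical data, and resamples the data before performing online classification. ARF-WRE first changes the structure of the data distribution through a dynamically determined weighted Hamming distance to cope with continually drifting data, then handles class imbalance with different resampling techniques, and finally incorporates an early-warning retraining mechanism to improve the predictive performance of the online classification model. This study uses process event and alarm data collected by a Taiwanese TFT-LCD manufacturer to predict the results of its pre-shipment quality inspection. Because the yield of mature processes is extremely high, the imbalance ratio can reach 1:2000; in addition, because events and alarms do not occur regularly, the data distribution changes dynamically. The experimental results show that the proposed ARF-WRE produces effective predictions on a data set with extreme class imbalance and concept drift. Furthermore, a comparison of different model architectures shows that the resampling foundation of ARF-WRE lets the model maintain good predictive performance while greatly improving training efficiency, and that, supplemented by the weighted Hamming distance and the early-warning retraining mechanism, it achieves the best model performance in settings with both class imbalance and concept drift.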
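To make the weighted Hamming distance concrete, the following is a minimal sketch for binary feature vectors. The abstract does not give the exact formula or say how the weights are derived, so the function name `weighted_hamming`, the example vectors, and the weight values are all assumptions for illustration; in ARF-WRE the weights would presumably come from the dynamically updated variable importance.

```python
from typing import Sequence


def weighted_hamming(x: Sequence[int], y: Sequence[int],
                     weights: Sequence[float]) -> float:
    """Sum the weights of the positions where two binary vectors disagree."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


# Hypothetical example: three binary event/alarm indicators.
a = [1, 0, 1]
b = [1, 1, 0]
importance = [0.5, 0.25, 0.25]  # assumed variable-importance weights
d = weighted_hamming(a, b, importance)  # positions 1 and 2 differ
```

With all weights equal to 1 this reduces to the ordinary Hamming distance, so a mismatch on an important variable moves two samples further apart than a mismatch on an unimportant one.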

Parallel Abstract


Data stream acquisition and analysis are increasingly common, but data quality and consistency are critical to analytical performance. Common issues, such as class imbalance (non-defective units far outnumber defective ones) and concept drift (the data distribution changes over time), jeopardize the quality of machine learning models. Moreover, when the data are mostly categorical, finding a proper distance metric that accommodates both class imbalance and concept drift is essential and challenging. This thesis proposes an online learning classification architecture, ARF-WRE (Adaptive Random Forest with Weighted REsampling), which takes ARF as its base model. ARF-WRE combines the importance of binary variables with resampling techniques. It first changes the data distribution through a weighted Hamming distance based on dynamic variable importance to cope with constantly drifting data. Different resampling techniques are then used to tackle class imbalance. Finally, an early-warning retraining mechanism is proposed to improve online classification performance. This research uses process event logs and alarm data, provided by a Taiwanese TFT-LCD manufacturer, to predict pre-shipment inspection results. Because the yield of mature processes is very high, the imbalance ratio can reach 1:2000. Moreover, the irregular occurrence of events and alarms makes the data distribution change dynamically. The experimental results show that the proposed ARF-WRE improves the prediction results significantly. A comparison of different model designs shows that ARF-WRE further improves training efficiency through data resampling.
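As an illustration of how resampling can shrink a batch with a 1:2000 imbalance, here is a minimal sketch of random undersampling of the majority class. The abstract does not state which resampling techniques ARF-WRE uses, so random undersampling, the function name `undersample_majority`, the `ratio` parameter, and the synthetic batch are all assumptions chosen for illustration.

```python
import random
from typing import List, Tuple

Sample = Tuple[List[int], int]  # (binary features, label); label 1 = defect


def undersample_majority(batch: List[Sample], ratio: float = 1.0,
                         seed: int = 0) -> List[Sample]:
    """Keep every defect and at most ratio * (number of defects) good units."""
    minority = [s for s in batch if s[1] == 1]
    majority = [s for s in batch if s[1] == 0]
    keep = min(len(majority), int(round(ratio * len(minority))))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return minority + rng.sample(majority, keep)


# Synthetic batch at the 1:2000 imbalance mentioned in the abstract.
batch = [([1, 0, 1], 1)] + [([0, 0, 0], 0) for _ in range(2000)]
balanced = undersample_majority(batch, ratio=10)
```

Here the 2001-sample batch shrinks to 11 samples (1 defect and 10 good units), which suggests why training on resampled data can be much cheaper. A batch with no defects yields an empty result under this sketch, so a real pipeline would need a policy for that case.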
