
A Modified Likelihood-Based Approach for Semi-Supervised Learning for Normal Mixture Data with Multiple Components Per Class

A Semi-Supervised Learning Method for Classes with Multiple Subgroups

Abstract


In semi-supervised learning, we have sample data of features from different classes, and only a small part of the data has class labels. To predict class labels for the unlabelled data, one approach is to model the data in each class with a normal mixture distribution, estimate the parameters by maximum likelihood via the EM algorithm, and predict class labels from the estimated probabilities. To implement the maximum likelihood approach, we provide an algorithm for determining initial values for the EM algorithm and refer to this method as the ML-EM method. We conducted simulation experiments to compare the prediction performance of the ML-EM method with that of six other methods. According to our simulation results, the ML-EM method outperforms the six other methods in some challenging cases, as measured by average ARI values. However, in a simple case where data points from different classes are well separated, the ML-EM method is sometimes outperformed by some of the six methods. To improve the prediction performance of the ML-EM method, we propose the ML-EM II method, a modified version of the ML-EM method. Simulation results show that the ML-EM II method performs better than the ML-EM method.
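The comparison above is based on ARI (adjusted Rand index) values. As a rough illustration of how such a score can be computed from true and predicted class labels, here is a minimal sketch using the standard pair-counting formula. The function name `adjusted_rand_index` is an illustrative choice and is not taken from the paper; the paper's own evaluation code is not shown in this record.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement

    # Contingency counts: cell (i, j) holds the number of items with
    # true label i and predicted label j.
    cells = Counter(zip(true_labels, pred_labels))
    rows = Counter(true_labels)
    cols = Counter(pred_labels)

    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings have one cluster
    return (sum_cells - expected) / (max_index - expected)
```

An ARI of 1 means the two labelings agree up to a renaming of the labels, while values near 0 indicate agreement no better than chance.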

Parallel Abstract


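In semi-supervised learning, we observe sample data from different classes, but only a small part of the data has class labels. To predict the classes of the unlabelled data, one approach is to assume that the data come from a normal mixture distribution, estimate the parameters by maximum likelihood, and predict the classes from the estimated probabilities. The model considered in this study allows the data in each class to come from several different normal distributions; that is, the distribution of each class may have multiple centres corresponding to multiple subgroups. To obtain the maximum likelihood estimates via the EM algorithm, we propose a way of determining the initial values for EM and call the resulting semi-supervised learning method the ML-EM method. We conducted simulation experiments to examine the prediction performance of the ML-EM method and six other methods. Comparing the simulation results, we find that the ML-EM method outperforms the other six methods when classifying some of the more challenging data. However, in the simple case where data from different classes are completely separated, the ML-EM method is sometimes outperformed by some of the six methods. To improve the prediction performance of the ML-EM method, we further propose an improved version, called the ML-EM II method. The simulation study shows that the ML-EM II method performs better than the ML-EM method.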
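The general idea of fitting a normal mixture with several components per class by EM, using the labels where they are available, can be sketched as follows. This is a simplified one-dimensional illustration under stated assumptions, not the paper's ML-EM or ML-EM II method: the initialisation below (initial means drawn from the labelled points of each class) is a placeholder assumption, since the paper's initial-value algorithm is not described in the abstract, and all function and parameter names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_semi_supervised_mixture(xs, labels, comps_per_class, n_iter=100, seed=0):
    """EM for a 1-D normal mixture with comps_per_class[c] components in class c.

    labels[i] is a class index, or None when point i is unlabelled.
    Assumes every class has at least comps_per_class[c] labelled points.
    """
    rng = random.Random(seed)
    # owner[k] is the class that mixture component k belongs to.
    owner = [c for c, k in enumerate(comps_per_class) for _ in range(k)]
    n_comp = len(owner)

    # Placeholder initialisation (an assumption, not the paper's algorithm).
    means = []
    for c, k in enumerate(comps_per_class):
        pool = [x for x, y in zip(xs, labels) if y == c]
        means.extend(rng.sample(pool, k))
    centre = sum(xs) / len(xs)
    spread = sum((x - centre) ** 2 for x in xs) / len(xs)
    variances = [spread] * n_comp
    weights = [1.0 / n_comp] * n_comp

    for _ in range(n_iter):
        # E-step: a labelled point may only be assigned to components of
        # its own class; an unlabelled point may use any component.
        resp = []
        for x, y in zip(xs, labels):
            row = [
                weights[k] * normal_pdf(x, means[k], variances[k])
                if y is None or owner[k] == y else 0.0
                for k in range(n_comp)
            ]
            total = sum(row) or 1e-300  # guard against division by zero
            resp.append([r / total for r in row])

        # M-step: weighted updates of weights, means, and variances.
        for k in range(n_comp):
            nk = sum(r[k] for r in resp) + 1e-12
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsed component

    return weights, means, variances, owner


def predict_class(x, weights, means, variances, owner, n_classes):
    """Predict the class whose components give x the largest total weighted density."""
    score = [0.0] * n_classes
    for k, c in enumerate(owner):
        score[c] += weights[k] * normal_pdf(x, means[k], variances[k])
    return max(range(n_classes), key=lambda c: score[c])
```

For example, with `comps_per_class=[2, 1]`, class 0 is modelled by two normal components and class 1 by one, so class 0 can cover two separate subgroups while still being predicted as a single class.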

