Application of the Learning-Based EM Algorithm to t-Distribution Mixture Models

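Abstract

Data describe the properties and characteristics of natural phenomena, biological species, results of scientific experiments, and dynamic mechanical systems. We rely on data to understand the kinds of matter and phenomena we observe; data are the foundation of that understanding, since we analyze them, draw inferences from them, and base decisions on them. One of the most important tasks in data analysis is clustering, which is an important tool in data mining. Much of data analysis consists of classifying or clustering data into groups or clusters according to the properties or features of the data. Clustering is valuable because it can identify major patterns or trends without requiring all the information to be known in advance. The EM algorithm is one framework that treats unsupervised learning as a problem of probability density estimation. However, the EM algorithm for Gaussian mixture models is quite sensitive to initial values and requires the prior probabilities and the number of clusters to be specified. Yang et al. [14] proposed a robust EM clustering algorithm for Gaussian mixture models that solves the initialization problem and automatically obtains the optimal number of clusters, but their algorithm is not robust to outliers. In fact, some distributions, such as the t-distribution and the Pearson type VII distribution, are more robust than the Gaussian distribution. In this thesis, we therefore study the t-distribution and combine it with the robust EM clustering algorithm for Gaussian mixture models so as to obtain the advantages of both. We adopt the t-distribution in the hope that it resolves the outlier problem of the Gaussian distribution. Accordingly, we study a mixture model algorithm based on the t-distribution, which we call the learning-based EM algorithm for t-distribution mixture models (Learning-Based EMT algorithm).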

Keywords

data analysis; cluster analysis; EM algorithm

Parallel Abstract

One of the most important tasks in data analysis is to classify or group data into a set of categories or clusters, such that data objects in the same cluster show similar properties according to certain criteria. Data clustering provides a basis for further analysis, reasoning, understanding of phenomena, and decision-making. It is an important tool in data mining, mainly because it can identify patterns or trends without any supervisory information such as data labels. Clustering can be broadly defined as partitioning a set of objects into clusters, each of which represents a meaningful subgroup. The objects can be graphs, text, images, or any other items characterized by a set of database records that describe the relationships among a collection of nodes. Gaussian mixture models are commonly fitted to data sets, and the Expectation-Maximization (EM) algorithm is the most widely used algorithm for this purpose. However, the EM algorithm for Gaussian mixture models is quite sensitive to initial values, and the number of mixture components must be given a priori. Yang et al. [14] proposed a robust EM clustering algorithm for Gaussian mixture models. It offers a new way to solve the initialization problem and includes a scheme that automatically obtains an optimal number of clusters. In their conclusion, however, they state: “Gaussian distribution is not robust for outliers. Some distributions, such as t-distribution and Pearson type Ⅶ distribution, are more robust to outliers than Gaussian distribution.” In this paper, we therefore study the t-distribution together with the robust EM algorithm. We combine their advantages by replacing the Gaussian distribution with the t-distribution in order to address the outlier problem. The result is a learning-based EM algorithm for mixture models with t-distributions, called the learning-based EMT algorithm.

Parallel Keywords

data analysis; EM algorithm; clustering analysis

References

[1] A. A. Lubischew, On the use of discriminant functions in taxonomy, Biometrics, vol. 18, pp. 455-477, 1962.

[3] C. B. Do and S. Batzoglou, What is the expectation maximization algorithm?, Nature Biotechnology (Computational Biology Primer), pp. 897-900, 2008.

[5] C. Bishop, Neural networks for pattern recognition, Oxford University Press, 1995.

[6] D. Peel and G. J. McLachlan, Robust mixture modelling using the t distribution, Statistics and Computing, pp. 339-348, 2000.

[8] Gaussian Mixture Models, MathWorks.
