
核分組費雪區別法於文件分類之應用

Applications of Kernel-Partition FLD to Document Classification

Advisor: 陳正剛

Abstract


Attribute interpretation is essential for classification methods, but nonlinear classifiers generally cannot provide it. The objective of this thesis is therefore to develop a methodology that allows two nonlinear kernel classifiers, the Kernel Fisher Discriminant (KFD) and the Kernel Minimum Squared Error (KMSE), to provide attribute interpretation as well. Both methods project instances from the original attribute space into a high-dimensional feature space, where pattern recognition and feature extraction can be performed effectively, but the meaning of the original attributes is lost in the process. To recover it, we use the kernel discriminant score to partition the instances into several groups, each with its own linear structure, and apply the Fisher Linear Discriminant (FLD) within each group to obtain group-wise attribute interpretation. The proposed method is called Kernel-Partition FLD. In addition, we provide attribute interpretation for the kernel partition itself. We show that the kernel-partition classifier not only retains the discriminating power of nonlinear classifiers but also provides attribute interpretation for each group, identifying which attributes are important within it. We also apply Kernel-Partition FLD to document classification. The high dimensionality of document data makes computation expensive and may introduce noise that degrades classification results, so we further propose a dimension-reduction method and show that it lowers the computational cost while minimizing the information lost in the reduction. For Kernel-Partition FLD, these attribute interpretations can even serve as classification rules for the kernel classifier: instances are first grouped in the feature space by an appropriate discriminant score, and discriminant functions are then constructed to maximize prediction accuracy. Finally, simulated cases and real-world document-classification examples confirm that Kernel-Partition FLD indeed combines the advantages of linear and nonlinear classifiers.
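
The partition-then-FLD procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under assumed choices, not the thesis's implementation: it assumes a two-class problem, an RBF kernel, and a simple quantile split of the kernel discriminant scores, and the names kfd_scores, kernel_partition_fld, N_REGULARIZATION, and N_GROUPS are illustrative.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics.pairwise import rbf_kernel

N_REGULARIZATION = 1e-3   # ridge term added to the within-class kernel scatter
N_GROUPS = 3              # number of partitions along the discriminant score


def kfd_scores(X, y, gamma=1.0):
    # Two-class Kernel Fisher Discriminant scores (Mika-style formulation).
    K = rbf_kernel(X, X, gamma=gamma)
    n = len(y)
    class_means = []                  # per-class mean kernel columns
    N = np.zeros((n, n))              # within-class scatter expressed via the kernel
    for c in np.unique(y):
        Kc = K[:, y == c]
        nc = Kc.shape[1]
        class_means.append(Kc.mean(axis=1))
        N += Kc @ (np.eye(nc) - np.full((nc, nc), 1.0 / nc)) @ Kc.T
    N += N_REGULARIZATION * np.eye(n)
    alpha = np.linalg.solve(N, class_means[0] - class_means[1])
    return K @ alpha                  # kernel discriminant score of each instance


def kernel_partition_fld(X, y, gamma=1.0):
    # Partition instances by their kernel score, then fit one FLD per group.
    scores = kfd_scores(X, y, gamma=gamma)
    edges = np.quantile(scores, np.linspace(0.0, 1.0, N_GROUPS + 1))
    groups = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, N_GROUPS - 1)
    models = {}
    for g in range(N_GROUPS):
        mask = groups == g
        if len(np.unique(y[mask])) < 2:   # a one-class group yields no FLD
            continue
        # LinearDiscriminantAnalysis.coef_ gives per-attribute weights, i.e. the
        # group-wise attribute interpretation discussed in the abstract.
        models[g] = LinearDiscriminantAnalysis().fit(X[mask], y[mask])
    return scores, groups, models

Here kernel_partition_fld expects X and y as NumPy arrays; the returned models map each group index to its fitted FLD, whose coef_ weights play the role of the group-wise attribute interpretation, while the actual score and partitioning scheme used in the thesis may differ.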

Abstract (English)


Since attribute interpretation is important in classification but is not provided by nonlinear classifiers, the objective of this research is to develop a methodology that allows the nonlinear classification methods KFD (Kernel Fisher Discriminant) and KMSE (Kernel Minimum Squared Error) to provide attribute interpretation. The proposed methodology is called Kernel-Partition FLD. KFD and KMSE are both kernel-based classifiers that transform instances from the original attribute space into a feature space. The feature space is efficient for feature extraction and pattern recognition but loses the meanings of the original attributes. For attribute interpretation, we partition the instances with nonlinear structure into several groups, each of which has its own linear structure, and then apply FLD (Fisher Linear Discriminant) to each group to provide attribute interpretation. In addition, we also attempt to provide attribute interpretation for the kernel partition itself in this study. We then apply the methodology to document classification. Classifying a large number of documents containing a great number of terms is a challenge for all learning algorithms and is the focus of this research. A novel approach is needed so that the text dataset can be better classified through nonlinear classification, with knowledge of which terms (attributes) are more important for classifying certain types of documents. Moreover, the high computation cost and the sparsity of document vectors are also issues addressed in this research. Thus, a dimension-reduction methodology is developed to diminish the computation requirement and reduce the dimensions without losing much information. With Kernel-Partition FLD, the attribute interpretation can be further developed into classification rules for kernel-based classification approaches. The proposed methodologies are shown to successfully combine the advantages of both linear and nonlinear classifiers through simulated cases and real-world cases on text datasets.
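
The abstract does not spell out the proposed dimension-reduction method, so the sketch below uses a common stand-in, a TF-IDF representation followed by truncated SVD (latent semantic analysis), only to illustrate how the sparse, high-dimensional term space can be compressed before kernel-based classification; the toy corpus and the number of components are illustrative, not part of the thesis.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus; a real document-classification task would have thousands of
# documents and tens of thousands of terms.
corpus = [
    "kernel methods map documents into a high dimensional feature space",
    "the fisher linear discriminant separates two classes of documents",
    "text classification with many terms is computationally expensive",
    "linear discriminants give a weight for every term in the vocabulary",
]

# TF-IDF produces sparse term vectors; TruncatedSVD (latent semantic analysis)
# projects them onto a small dense subspace, cutting computation cost while
# keeping most of the variance.
reducer = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, stop_words="english"),
    TruncatedSVD(n_components=2),   # tiny only because the toy corpus is tiny
)
X_reduced = reducer.fit_transform(corpus)
print(X_reduced.shape)              # (n_documents, n_components)

The reduced matrix X_reduced, together with the document labels, could then be handed to a kernel classifier such as the kernel_partition_fld sketch above.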
