Document classification has been a typical problem in several fields for many years. However, most previous work assumes that the corpora are explicitly labeled and well classified. In this work, we concentrate on the automatic acquisition of high-quality training data. We propose mining approaches that collect training data from a given unlabeled corpus or from the web. The proposed approaches are fully automatic: the only human effort required is to define the classes in advance. In our work, the concept behind a class name is captured by comparing the class with the other classes, which yields the common concept among classes. Moreover, we iteratively discover discriminative concepts within each class. By finding both common concepts and discriminative concepts, we can acquire training data of high quality. The experimental evaluation gives empirical evidence that the classifiers trained on these data achieve promising accuracy. In summary, the automatic acquisition of high-quality training data by our proposed methods is the primary contribution of this work.