Automatic Training Corpora Acquisition for Document Classification

Advisor: 鄭卜壬 (Pu-Jen Cheng)

Keywords

document classification; training data

Abstract

Document classification has been a standard problem in several fields for many years. Most previous work, however, assumes that the corpora are explicitly labeled and well classified. This work concentrates instead on automatically acquiring good-quality training data. We propose mining approaches that collect training data from a given unlabeled corpus or from the web. The approaches are fully automatic; the only human input required is to define the classes in advance. The concept behind a class name is captured by comparing the class with the other classes, which yields the common concept among classes. We then iteratively discover discriminative concepts within each class. By finding both common and discriminative concepts, we can acquire training data of high quality. The evaluation gives empirical evidence that classifiers trained on this data achieve promising accuracy. The primary contribution of this work is the automatic acquisition of good-quality training data by the proposed methods.
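The distinction between common and discriminative concepts can be illustrated with a small sketch. This is not the thesis's algorithm, which the abstract does not specify. It assumes one simple reading: a term is "common" if it occurs in every class and "discriminative" for a class if it occurs in that class only. The function names, the whitespace tokenizer, and the toy snippets are all hypothetical, and the iterative refinement mentioned in the abstract is omitted.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def document_frequency(docs):
    """Count, for each term, the number of documents it appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def split_concepts(docs_by_class):
    """Return (common_terms, discriminative_terms_by_class).

    Assumption for illustration only: a term is 'common' if it occurs
    in every class, and 'discriminative' for a class if it occurs in
    that class and in no other.
    """
    df_by_class = {c: document_frequency(d) for c, d in docs_by_class.items()}
    vocab_by_class = {c: set(df) for c, df in df_by_class.items()}
    common = set.intersection(*vocab_by_class.values())
    discriminative = {}
    for c, vocab in vocab_by_class.items():
        others = set().union(*(v for k, v in vocab_by_class.items() if k != c))
        only = vocab - others
        # Rank class-only terms by document frequency, then alphabetically.
        discriminative[c] = sorted(only, key=lambda t: (-df_by_class[c][t], t))
    return common, discriminative


# Hypothetical toy snippets standing in for text gathered per class name.
docs_by_class = {
    "sports": ["the team won the match", "the coach praised the team"],
    "finance": ["the bank raised the rate", "the market watched the bank"],
}
common, discriminative = split_concepts(docs_by_class)
print(common)
print(discriminative)
```

In a pipeline of the kind the abstract describes, the top-ranked class-only terms might serve as additional queries for gathering more candidate training documents per class; that use is an assumption here, not a claim about the thesis.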

