Building an Automated Text Categorization Model Based on Genetic Algorithms

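Digital information is growing rapidly, and how to classify and manage documents effectively has become an important research issue. Text Categorization (TC) research is therefore growing in importance. Most current data mining research on TC compares different individual classifiers and picks the one with the highest accuracy. Relying on a single classifier, however, risks fitting only particular datasets, so that it performs well on some datasets and not on others. This study therefore combines several individual classifiers into a multiple classifier model (ensemble) and aggregates the opinions of multiple experts (classifiers) before classifying, which alleviates this problem. TC may also face excessively high document feature dimensionality. We therefore use a Genetic Algorithm (GA) to select feature terms from the documents and assign them to the different classifiers for training. Depending on the GA encoding, this study proposes two methods: (1) Selection of Disjoint Feature Subsets (SDFS), in which each feature is assigned to only one individual classifier in the ensemble; (2) Selection of Possibly Overlapping Feature Subsets (SPOFS), in which each feature may be assigned freely to any of the individual classifiers in the ensemble. Through the proposed GA-based feature selection, we attempt to let each individual classifier in the ensemble learn toward its own optimum, thereby improving overall classification performance, with the expectation of building a TC model that is both effective and stable. In the experiments, we use the Reuters-21578 news document collection, split into training and test sets according to the Modified Apte split, to evaluate the TC models built with SDFS and SPOFS, to compare them with the original approach that does not use GA (TOTAL), and to verify whether the ensemble outperforms the best individual classifier as expected. The experimental results show that SDFS performs below expectations, possibly because its restriction on how the GA selects features is too strict, which lowers the accuracy of the individual classifiers in the ensemble. SPOFS, in contrast, clearly outperforms both SDFS and TOTAL: classification accuracy improves markedly for both the individual classifiers and the ensemble. The results confirm that the ensemble's accuracy exceeds that of the best individual classifier, and that the proposed GA-based TC model improves classification performance and behaves more stably.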
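The difference between the two GA encodings can be illustrated with a minimal Python sketch. The abstract does not give the actual chromosome layout, so everything here is an assumption made for illustration: the function names `decode_sdfs` and `decode_spofs` are hypothetical, SDFS is modeled as one gene per feature holding the index of its single owning classifier, and SPOFS is modeled as one bit per (classifier, feature) pair.

```python
def decode_sdfs(chromosome, n_classifiers):
    """Assumed SDFS layout: gene i is the index of the one classifier
    that receives feature i, so the resulting subsets are disjoint."""
    subsets = [[] for _ in range(n_classifiers)]
    for feature, owner in enumerate(chromosome):
        subsets[owner].append(feature)
    return subsets


def decode_spofs(chromosome, n_features, n_classifiers):
    """Assumed SPOFS layout: n_classifiers blocks of n_features bits;
    a 1 bit means that classifier trains on that feature, so the
    resulting subsets may overlap."""
    subsets = []
    for c in range(n_classifiers):
        bits = chromosome[c * n_features:(c + 1) * n_features]
        subsets.append([f for f, bit in enumerate(bits) if bit])
    return subsets


# Five features shared among three classifiers (toy values).
sdfs_chromosome = [0, 2, 1, 0, 2]
spofs_chromosome = [1, 0, 1, 0, 0,   # classifier 0
                    0, 0, 1, 1, 0,   # classifier 1
                    1, 1, 0, 0, 1]   # classifier 2

sdfs_subsets = decode_sdfs(sdfs_chromosome, 3)
spofs_subsets = decode_spofs(spofs_chromosome, 5, 3)
print(sdfs_subsets)
print(spofs_subsets)
```

Under these assumptions, SDFS forces every feature into exactly one subset, while SPOFS lets a feature appear in several subsets or in none. This looser constraint is consistent with the abstract's remark that the SDFS restriction may be too strict, although the encoding used in the thesis itself may differ.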

Keywords

Genetic Algorithm; Multiple Classifiers; Text Categorization

Parallel Abstract

The rapid accumulation of digital information makes searching for information more difficult, so managing documents effectively has become an important task. Text Categorization (TC) research is therefore growing in importance. Most TC studies try to find, among different classifiers, the single classifier with the highest accuracy and use it as the TC model. However, an individual classifier often gives good results only on data that suits it. Our research therefore integrates various individual classifiers into an ensemble and combines the opinions of the different experts (classifiers) to make the final decision. This addresses the problem that an individual classifier may fit only particular document datasets. TC is also likely to face the problem of excessive document feature dimensionality. We therefore use a Genetic Algorithm (GA) to optimize classifier training, so that the classifiers have diverse features, mutual independence, and better predictive ability, which further improves overall classification performance. We propose two GA encoding methods: (1) Selection of Disjoint Feature Subsets (SDFS), in which each feature can be used to train only one classifier; (2) Selection of Possibly Overlapping Feature Subsets (SPOFS), in which each feature can be used to train more than one classifier. For the experimental evaluation, we use the real-world Reuters-21578 news article collection with the Modified Apte Split. The experimental results show that our method improves document classification accuracy for both the individual classifiers and the ensemble, and that the ensemble document classification model achieves good and stable classification results.
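As a rough illustration of combining the opinions of several classifiers, the sketch below uses plain majority voting. The abstract does not state which combination rule the ensemble uses, so majority voting, the function name `majority_vote`, the first-vote tie-break, and the example labels are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.
    Ties go to the tied label that appears first in `predictions`
    (an arbitrary rule chosen for this sketch)."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of three classifiers for one news document.
votes = ["earn", "acq", "earn"]
decision = majority_vote(votes)
print(decision)
```

Other combination rules, such as weighting each classifier by its validation accuracy, would fit the same interface.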

Parallel Keywords

Genetic Algorithm; Ensemble; Text Categorization

