A Poly-Lingual Text Categorization Technique Built with Cost-Sensitive Classification Analysis

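With the rise of the Internet and the trend toward globalization, documents of all kinds have become easy to obtain, and they are written in many different languages, so a growing number of organizations and individuals must be able to handle poly-lingual documents. If these organizations and individuals hold a large collection of already classified poly-lingual documents, they can use it to build an automatic poly-lingual categorization system that assigns newly acquired poly-lingual documents to appropriate categories. To date, however, such systems are rare, and the classification accuracy of the existing ones leaves room for improvement. This study therefore proposes a poly-lingual text categorization technique built with cost-sensitive classification analysis. Documents in different languages are translated with a statistical bilingual thesaurus, and each translated document is assigned a translation cost that reflects its translation quality; this is the cost incurred when that document is misclassified. The goal is to obtain the best classification accuracy while keeping the total misclassification cost as low as possible. Using the feature-reinforcement-based poly-lingual text categorization technique as the benchmark, our experiments show that the proposed technique outperforms the benchmark on both the Chinese and the English corpora.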

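Keywords

Text mining; Text categorization; Poly-lingual text categorization; Document translation; Cost-sensitive learning; Statistical bilingual thesaurus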

Parallel Abstract

Because of the trend of globalization, organizations and individuals often generate, acquire, and then archive documents written in different languages (i.e., poly-lingual documents). If organizations or individuals have already organized poly-lingual documents into categories and would like to use this set of preclassified documents as training data for constructing text categorization models that can classify newly arrived poly-lingual documents into appropriate categories, they face the poly-lingual text categorization (PLTC) problem. PLTC refers to the automatic learning of one or more text categorization models from a set of preclassified training documents written in different languages and the subsequent assignment of unclassified poly-lingual documents to predefined categories on the basis of the induced model(s). Many text categorization techniques have been proposed in the literature; however, most of them deal with monolingual documents. In this study, we propose a cost-sensitive poly-lingual text categorization (CS-PLTC) technique that includes translated documents to expand the training set for PLTC and uses cost-sensitive learning to reflect the differing quality of the training documents. Using the existing feature-reinforcement-based PLTC (FR-PLTC) technique as the performance benchmark, our empirical evaluation shows that the proposed CS-PLTC technique outperforms the benchmark on both the English and the Chinese corpora.
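The core idea of the abstract, adding translated documents to the training set while letting their translation quality set their misclassification cost, can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not name a classifier, so the weighted Naive Bayes learner below, the function names `train_weighted_nb` and `classify`, the toy documents, and the quality scores are all assumptions made for illustration. The cost is applied as a per-document weight on the training counts, one simple way to make a learner cost-sensitive.

```python
import math
from collections import defaultdict


def train_weighted_nb(docs):
    """Train a multinomial Naive Bayes model from (tokens, label, cost) triples.

    `cost` is the hypothetical misclassification cost of a document:
    1.0 for an original document, a translation-quality score in (0, 1]
    for a translated one. Each document contributes its cost, rather
    than a count of 1, to the class and term statistics.
    """
    class_weight = defaultdict(float)
    term_weight = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for tokens, label, cost in docs:
        class_weight[label] += cost
        for tok in tokens:
            term_weight[label][tok] += cost
            vocab.add(tok)
    return class_weight, term_weight, vocab


def classify(model, tokens):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    class_weight, term_weight, vocab = model
    total = sum(class_weight.values())
    best_label, best_score = None, -math.inf
    for label in class_weight:
        score = math.log(class_weight[label] / total)
        denom = sum(term_weight[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # ignore terms never seen in training
                score += math.log((term_weight[label].get(tok, 0.0) + 1.0) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set (assumed data): originals carry cost 1.0,
# translated documents carry an assumed translation-quality score.
training_docs = [
    (["stock", "market", "bank"], "finance", 1.0),      # original
    (["match", "team", "goal"], "sports", 1.0),         # original
    (["bank", "loan", "interest"], "finance", 0.6),     # translated
    (["team", "coach", "league"], "sports", 0.3),       # translated, poorer quality
]

model = train_weighted_nb(training_docs)
print(classify(model, ["bank", "interest"]))
print(classify(model, ["coach", "goal"]))
```

In this sketch a poorly translated document still enlarges the vocabulary and the training set, but it shifts the class statistics less than an original document does, which is the trade-off the abstract describes.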

Parallel Keywords

Text mining; Text categorization; Poly-lingual text categorization; Document translation; Cost-sensitive learning; Statistical-based bilingual thesaurus

