The Effect of Classification Inconsistency on Automatic Document Classification

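This article examines how classification inconsistency affects the performance of automatic classification. Through automatic detection of near-duplicate documents and comparative experiments with two classification methods on two test collections, it finds that inconsistency in the labeling of training data, even when as high as 34%, has almost no effect on classifier performance. An important implication is that the conclusions of past studies remain valid even if their experiments used test collections of low consistency. Of course, data with high classification inconsistency yields lower classification performance after training, regardless of how good the classifier is. In addition to these findings, the article introduces a Chinese test collection for classification that is freely available for research use. The author also proposes a reliable method for detecting duplicate or similar documents; compared with earlier methods, it can detect similar documents that those methods miss.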

Keywords

Document classification; Consistency; Classification test collection; Subject analysis; Duplicate detection

Parallel Abstract

This article discusses the effect of inconsistency in training data on the performance of text classifiers. Our experiments show that inconsistency, even at a level as high as 34%, hardly affects the effectiveness of the classifiers. Better classifiers perform better regardless of duplicates and label inconsistency. The implication is that past experiments (especially those on the Reuters-21578 collection) remain valid. In the course of the experiments, the author proposes a duplicate detection technique that is far more effective than previous ones. A new Chinese test collection for text categorization is also introduced and made available for free download.

Parallel Keywords

Document classification; Consistency; Test collection for categorization; Subject analysis; Duplicate detection
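
The abstract ties classification inconsistency to near-duplicate documents that carry different labels. The sketch below illustrates that idea only; it is not the author's detection method, which the abstract does not describe. It swaps in a generic technique, word shingling with Jaccard similarity, and the function names, the shingle size `k`, the `threshold` value, and the toy documents are all assumptions made for illustration.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicate_pairs(docs, threshold=0.8, k=3):
    """Return index pairs (i, j), i < j, whose shingle similarity reaches the threshold.

    docs is a list of (text, label_set) tuples. The comparison is pairwise,
    so the cost grows quadratically with the number of documents.
    """
    sets = [shingles(text, k) for text, _ in docs]
    return [(i, j) for i, j in combinations(range(len(docs)), 2)
            if jaccard(sets[i], sets[j]) >= threshold]


def inconsistency_rate(docs, pairs):
    """Fraction of near-duplicate pairs whose label sets differ."""
    if not pairs:
        return 0.0
    differing = sum(1 for i, j in pairs if docs[i][1] != docs[j][1])
    return differing / len(pairs)


# Hypothetical toy data: two duplicated texts, one pair labeled inconsistently.
toy_docs = [
    ("oil prices rose sharply on monday after the opec meeting", {"crude"}),
    ("oil prices rose sharply on monday after the opec meeting", {"money-fx"}),
    ("wheat exports fell in the last quarter of the year", {"grain"}),
    ("wheat exports fell in the last quarter of the year", {"grain"}),
]
toy_pairs = near_duplicate_pairs(toy_docs)
print(toy_pairs, inconsistency_rate(toy_docs, toy_pairs))
```

A rate computed this way is one possible reading of a figure such as the 34% quoted in the abstract; the article itself may define inconsistency differently.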
