Multi-label Text Categorization Using a Chi-Square Based Method

Advisor: 陸承志


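This study proposes a method based on the inverse chi-square classifier. The method comprises a procedure that selects feature terms for each category and a term-category weight matrix from which a test document's feature weights with respect to each category are looked up. The inverse chi-square classifier then computes an indicator value for the document in each category, and these values serve as the basis for classification. Three thresholds, DF (Document Frequency), CC (Correlated Coefficient), and ICF (Inverted Conformity Frequency), are used to select a different set of feature terms for each category. Experiments on the 10 categories with the most documents in the Reuters 21578 dataset show that the method reaches about 87% Precision, 98% Recall, and 92% F1-measure, a performance comparable to that of Boostexter, a well-known method in multi-label classification research.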


This study presents a chi-square based method for multi-label text categorization that relies on a term-category weight matrix. The method uses an inverse chi-square classifier to calculate an indicator value for each category under consideration, based on the test document's feature weights, which are represented by correlation coefficients. We use three thresholds, DF (Document Frequency), CC (Correlated Coefficient), and ICF (Inverted Conformity Frequency), to extract the relevant terms of each category. Finally, we conduct experiments on the top 10 categories of Reuters 21578. The experimental results show that Precision, Recall, and F1-measure reach about 87%, 98%, and 92%, respectively, which is comparable to the well-known multi-label method Boostexter.
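As a rough illustration of how an inverse chi-square indicator value might be computed, the following minimal sketch combines per-term weights with Fisher's method, in the style popularized by inverse chi-square spam filters. This is an assumption, not the procedure of this study: the abstract does not give the formula, and the names `chi2q`, `indicator`, and `f1`, as well as the requirement that weights lie strictly between 0 and 1, are hypothetical. The `f1` helper only checks that the reported Precision and Recall are consistent with the reported F1-measure.

```python
import math


def chi2q(x, dof):
    """Upper-tail probability of a chi-square variable, for even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    m = x / 2.0
    term = math.exp(-m)
    total = term
    # Closed form for even dof: exp(-m) * sum_{i < dof/2} m**i / i!
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def indicator(weights):
    """Combine per-term weights in (0, 1) into one value in [0, 1].

    Assumed convention: weights near 1 mean strong evidence for the
    category, and the returned value is then close to 1.
    """
    eps = 1e-12  # keep the logarithms finite
    w = [min(max(x, eps), 1.0 - eps) for x in weights]
    n = len(w)
    h = chi2q(-2.0 * sum(math.log(x) for x in w), 2 * n)
    s = chi2q(-2.0 * sum(math.log(1.0 - x) for x in w), 2 * n)
    return (1.0 + h - s) / 2.0


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)
```

Under this sketch, a document would be assigned every category whose indicator value exceeds a chosen cutoff, which is what allows more than one label per document. With the reported figures, `f1(0.87, 0.98)` evaluates to roughly 0.92, which agrees with the reported F1-measure.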


