Classification Evaluation of TFIDF and the Entropy Method on Support Vector Machines: A Case Study of Statistics Exam Questions

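Whether advancing through school, earning a professional certification, or taking a job-seeking test, people sit countless examinations. As fields become more specialized, students must consult internationally representative books, most of which are written in English. According to statistics from an English test that China Electronics News (中華電子報) gave to university students, domestic English proficiency is generally low, so students spend a great deal of time reviewing exam papers and locating the chapter each question comes from. In recent years, machine learning algorithms applied to document classification have achieved good results, and this study uses a support vector machine as an automatic classifier for exam questions. Term frequency-inverse document frequency (TFIDF) is a widely used weighting method that assigns each keyword a weight based on how often it appears within a document and across documents. The entropy weighting method, which passed from thermodynamics into information theory, assigns weights according to how often a keyword appears in each category of the training documents. Earlier research has used the entropy method to improve the original TFIDF weights, and two such improved weightings are considered here, TFIDF_entr and TFIDF_entr∆. With 1177 training questions, the TFIDF_entr∆ weighting reaches a classification accuracy of 88.2465%, the TFIDF_entr weighting reaches 86.6109%, and the unmodified TFIDF still reaches 84.5188%.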

Keywords

Term frequency-inverse document frequency; entropy; support vector machine

Parallel Abstract

People face countless exams in their studies, professional certifications, and job-seeking tests. As fields become more specialized, it is necessary to consult internationally representative books, most of which are written in English. According to statistics from an English test that China Electronics News gave to university students, the domestic level of English is generally low. This not only weakens students' ability to comprehend the original books and questions in English, but also costs them a lot of time reviewing examination papers and looking for the source section. In recent years, machine learning algorithms applied to document classification have produced good results. In this study we use a support vector machine as the automatic classifier. TFIDF is a widely used weighting method that assigns weights based on how frequently keywords appear within a document and across documents. The entropy weighting method, which was carried from thermodynamics into information theory, assigns weights according to how frequently keywords appear in each category of the training documents. Existing research has used entropy to improve the original TFIDF weights, with two improved methods, TFIDF_entr and TFIDF_entr∆. With 1177 training samples, the TFIDF_entr∆ weighting reaches a classification accuracy of 88.2465%, the TFIDF_entr weighting reaches 86.6109%, and the unimproved TFIDF reaches 84.5188%.

Parallel Keywords

TFIDF; entropy; SVM
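The two weighting ideas named in the abstract can be illustrated with a minimal Python sketch. It assumes a common textbook form of TFIDF (relative term frequency times the log of inverse document frequency) and a normalized Shannon entropy over class counts. It does not reproduce the TFIDF_entr or TFIDF_entr∆ formulas, which the abstract does not state, and the function names, toy questions, and class labels are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf * idf} dict per document.

    Assumed form (not taken from the thesis): tf is the term's relative
    frequency in the document, idf is log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


def class_entropy(term, docs, labels):
    """Normalized entropy (0 to 1) of a term's occurrences across classes.

    A value near 0 means the term is concentrated in one class; a value
    near 1 means it is spread evenly over all classes.
    """
    counts = Counter()
    for doc, label in zip(docs, labels):
        counts[label] += doc.count(term)
    total = sum(counts.values())
    n_classes = len(set(labels))
    if total == 0 or n_classes < 2:
        return 0.0
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(n_classes)


# Hypothetical toy corpus: four tokenized questions in two categories.
docs = [
    ["mean", "variance", "sample"],
    ["regression", "slope", "sample"],
    ["mean", "median", "sample"],
    ["regression", "residual", "sample"],
]
labels = ["descriptive", "regression", "descriptive", "regression"]

weights = tfidf_weights(docs)
print(weights[0]["mean"], weights[0]["sample"])
print(class_entropy("mean", docs, labels), class_entropy("sample", docs, labels))
```

In this toy corpus "sample" appears in every question, so its TFIDF weight is zero and its class entropy is maximal, while "mean" appears only in one category and has zero class entropy. That contrast is the kind of class-level information an entropy-based adjustment can add to a purely document-level TFIDF weight.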
