Selection and Performance Evaluation of Feature Term Sets for Inappropriate Web Content Classification

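When a statistical classifier is used to classify inappropriate content, it must be paired with a suitable feature selection method that extracts a set of inappropriate-content feature terms from the training web pages, so that test pages can be classified accurately. The quality of the feature term set therefore determines how accurate the classification is. This study proposes a feature term selection mechanism based on the chi-square correlation coefficient and information entropy. Without human intervention and with a relatively short training time, the mechanism extracts an effective feature term set from inappropriate training content and achieves better accuracy in classifying inappropriate web content. After cross-language and cross-domain k-fold cross validation experiments, we found that the combination of CCcon_tendency with Chebyshev on CCcon_tendency not only selects feature terms that are highly correlated with the category, but also selects terms with high conformity while ensuring that every selected term has an extreme tendency value. This combination therefore yields the highest accuracy and requires less training time. The experimental results show that the method reaches an F-measure of 0.985 or higher in each of the inappropriate categories tested, pornography (Chinese, English, Japanese) and gambling (English), which also validates the general training procedure we propose. When the classifier is tested on web pages it has not been trained on, accuracy drops slightly, but retraining on the misclassified pages restores accuracy to its original level. This shows that the terms selected by the proposed method are highly stable and accurate.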

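Keywords

Chi-square correlation coefficient; information entropy; conformity; feature selection; inappropriate web content classification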

Parallel Abstract

This study proposes a feature selection mechanism based on the Correlated Coefficient (CC) and Conformity (an entropy measure) to select feature terms for web content classification. The mechanism first calculates a CC value for each term obtained from the training data set, and then uses each term's conformity to adjust its CC value, yielding a CCcon_tendency value. Finally, the Chebyshev inequality is used to select outliers as feature terms. The experimental results show that the average F-measure can reach 0.985 when the selected feature terms are used to classify pornographic content (in Chinese, English, and Japanese) and English gambling content. The results also confirm that the general training procedure we propose applies to cross-domain content, and that the resulting feature terms achieve superior classification performance even without human intervention. Moreover, testing web pages that were not used in training may reduce accuracy slightly, but retraining on the errors restores it quickly. This indicates that the proposed mechanism produces feature terms that can be incrementally updated to maintain superior performance.
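The three steps described in the abstract (a CC value per term, an adjustment by conformity, and outlier selection via the Chebyshev inequality) can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything here is an assumption: `cc` uses the common signed square root of the chi-square statistic over a 2x2 contingency table, `conformity` is taken to be the normalized entropy of a term's frequency distribution across the target-class documents, `cccon_tendency` is taken to be their product, and the cutoff `k` is a hypothetical parameter. The paper's actual definitions may differ.

```python
import math
from statistics import mean, pstdev


def cc(a, b, c, d):
    """Signed square root of chi-square for one term (assumed CC formula).

    a: target-class docs containing the term, b: other docs containing it,
    c: target-class docs without it,          d: other docs without it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return math.sqrt(n) * (a * d - c * b) / math.sqrt(denom)


def conformity(counts):
    """Normalized entropy of a term's counts over target-class docs.

    Assumed definition: 1.0 when the term is spread evenly over all
    documents, 0.0 when it occurs in a single document only.
    """
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    h = -sum((x / total) * math.log(x / total) for x in counts if x > 0)
    return h / math.log(len(counts))


def cccon_tendency(cc_value, conformity_value):
    """Assumed adjustment: scale the CC value by the term's conformity."""
    return cc_value * conformity_value


def select_outliers(scores, k=2.0):
    """Keep terms whose score lies more than k standard deviations from the mean.

    By the Chebyshev inequality, at most 1/k**2 of the terms can pass this
    test, whatever the score distribution looks like.
    """
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    return {term for term, s in scores.items() if abs(s - mu) > k * sigma}


if __name__ == "__main__":
    # Toy data: (a, b, c, d) contingency counts and per-document counts.
    stats = {
        "casino": ((9, 1, 1, 9), [2, 3, 2, 3, 2, 3, 2, 3, 2, 0]),
        "page": ((5, 5, 5, 5), [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]),
    }
    scores = {
        term: cccon_tendency(cc(*table), conformity(counts))
        for term, (table, counts) in stats.items()
    }
    print(scores)
```

Because the Chebyshev bound holds without any assumption about how the scores are distributed, a cutoff of this kind can be applied without manual tuning per language or domain, which is consistent with the abstract's claim that no human intervention is needed.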

Parallel Keywords

Feature Selection; Web Content Classification; Correlated Coefficient; Entropy; Conformity; Chebyshev Inequality
