Classifying Pornographic Web Pages Using a Chi-Square Based Statistics Method


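With the spread of the Internet, information propagates very quickly, and the Web is filled with content of uneven quality. Inappropriate material, such as pornographic fiction, images, and abusive language, is increasingly common. In the absence of a sound mechanism for managing web content, a user need only enter relevant keywords into a search engine and follow the hyperlinks in the search results to reach such sites easily, so web content management has become an urgent issue. This study addresses the pornographic category of inappropriate material and proposes collecting a blacklist by classifying pornographic web pages. For the textual portion of a site's content, the method derives the porn tendency of each individual word and computes an indicator value through the chi-square distribution, which is used to sort web pages into three classes: Porn, Unsure, and Non-Porn. The URLs of pages in the Porn class form the blacklist, which can serve as the basis for filtering online pornography. A system was implemented for Chinese-language and English-language web pages, and experimental results show that the proposed method achieves high precision and a fairly low false positive rate.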

Parallel abstracts

With the rapid growth of Internet usage, inappropriate materials (e.g., pornography, drugs, and violence) have flooded the Web. The open nature of the Web allows users to access almost any such material, with various negative effects on users, particularly children. Web content rating and filtering is therefore a worthwhile and pressing issue. This study proposes a chi-square based statistical method for classifying pornographic materials. Given a web page, its textual content is first split into a list of tokens, each with a porn tendency weight. The proposed method then calculates an indicator value (I-value) for the web page by combining the tokens' porn tendency weights through properties of the chi-square distribution. The resulting I-value is used to classify the web page into one of three categories: Porn, Unsure, and Non-Porn. The web pages in the Porn category are finally collected into a blacklist. Currently, the proposed method can classify English and Chinese web pages. Experimental results indicate that the proposed method detects pornographic web content with a high precision rate and a very low false positive rate.
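The abstract does not spell out how the token weights are combined, so the following is only a minimal sketch of one plausible reading. It assumes that each weight is a probability-like value in (0, 1) and that the weights are combined with Fisher's method, a chi-square combination that is common in text filtering. The function names (`chi2_sf_even`, `indicator_value`, `classify`) and the two thresholds are hypothetical and are not taken from the study.

```python
import math


def chi2_sf_even(x, dof):
    """Upper-tail probability of a chi-square variable with even dof.

    Uses the closed form exp(-m) * sum_{i < dof/2} m**i / i!, with m = x / 2.
    For very large x the exp() term underflows to 0, so this sketch is only
    suitable for modest token counts.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def indicator_value(weights):
    """Combine per-token weights into one value in [0, 1] (assumed scheme)."""
    if not weights:
        return 0.5  # no evidence either way
    # Clamp so that log() never receives 0.
    ws = [min(max(w, 1e-6), 1.0 - 1e-6) for w in weights]
    dof = 2 * len(ws)
    # Evidence for "porn": large when the weights are close to 1.
    porn = 1.0 - chi2_sf_even(-2.0 * sum(math.log(1.0 - w) for w in ws), dof)
    # Evidence for "non-porn": large when the weights are close to 0.
    clean = 1.0 - chi2_sf_even(-2.0 * sum(math.log(w) for w in ws), dof)
    return (porn - clean + 1.0) / 2.0


def classify(weights, low=0.1, high=0.9):
    """Map the indicator value to a category; thresholds are illustrative."""
    value = indicator_value(weights)
    if value >= high:
        return "Porn"
    if value <= low:
        return "Non-Porn"
    return "Unsure"
```

Under these assumptions, a page whose tokens all carry weights near 1 lands in Porn, a page whose tokens all carry weights near 0 lands in Non-Porn, and pages with neutral or conflicting evidence fall into Unsure, which mirrors the three-way split described above.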



