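Because the Internet is widespread and open, users need only enter relevant keywords into a search engine to reach inappropriate website content from the search results, so classifying and managing web content has become an urgent issue. This study targets the pornographic category of inappropriate information, and proposes and implements a pornography classification method based on the chi-square distribution for collecting a pornographic blacklist. Working on the textual content of pornographic websites, the method first derives the porn tendency of each individual token and then uses the chi-square distribution to compute an indicator value, which places each web page into one of three classes: Porn, Unsure, or Non-Porn. The URLs of pages in the Porn class are recorded as a blacklist that can serve as the basis for filtering online pornography. We also design an incremental update mechanism that uses the list of pornographic hubs already collected to gather newly added pornographic pages efficiently. The implemented system can classify Chinese and English web pages and has so far collected more than 600,000 pornographic pages. Experimental results show that our system not only achieves higher precision than other systems but also takes markedly less training time.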
The proliferation of the Web has given users access to a growing amount of inappropriate material (e.g., pornography, drugs, violence) on the Internet. Web content rating and filtering have therefore received intensive attention. This study proposes and implements a chi-square-based method for classifying web pages in order to generate a pornographic blacklist. For each web page under investigation, an indicator value of pornography is calculated by applying a chi-square combining scheme to the porn tendencies of the tokens contained in that page. According to its indicator value, a web page is classified into one of three categories: Porn, Unsure, or Non-Porn. The web pages in the Porn category are put on a blacklist. An incremental update mechanism is also created for collecting newly added pornographic sites by recursively crawling pornographic hubs. The present implementation can classify English and Chinese web pages and has collected more than 600,000 pornographic URLs. Experimental results indicate that the present implementation achieves a higher precision rate in detecting pornographic web pages than related work, while requiring less training time.
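
The abstract does not give the formula behind the indicator value, so the following is only a minimal sketch of one common chi-square combining scheme (Fisher's method, as used in statistical spam filters) and may differ from the scheme implemented in this study. The function names, the clamping constant, and the two thresholds (0.9 and 0.1) are assumptions made for illustration. The input is a list of per-token porn tendencies, each taken to be a value between 0 and 1.

```python
import math


def chi2_sf_even(x, dof):
    """Survival function P(X >= x) of a chi-square variable with even dof."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    m = x / 2.0
    term = math.exp(-m)
    total = term
    # Closed-form series that is valid for even degrees of freedom.
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def indicator_value(tendencies):
    """Combine per-token porn tendencies (0..1) into one indicator in [0, 1]."""
    n = len(tendencies)
    if n == 0:
        return 0.5  # no evidence either way (assumed convention)
    # Clamp to avoid log(0); the constant is an illustrative assumption.
    ps = [min(max(p, 1e-6), 1.0 - 1e-6) for p in tendencies]
    porn_evidence = 1.0 - chi2_sf_even(
        -2.0 * sum(math.log(1.0 - p) for p in ps), 2 * n)
    clean_evidence = 1.0 - chi2_sf_even(
        -2.0 * sum(math.log(p) for p in ps), 2 * n)
    return (1.0 + porn_evidence - clean_evidence) / 2.0


def classify(indicator, porn_threshold=0.9, non_porn_threshold=0.1):
    """Map an indicator value to Porn / Unsure / Non-Porn (assumed thresholds)."""
    if indicator >= porn_threshold:
        return "Porn"
    if indicator <= non_porn_threshold:
        return "Non-Porn"
    return "Unsure"


if __name__ == "__main__":
    for tokens in ([0.99] * 5, [0.5] * 4, [0.01] * 5):
        value = indicator_value(tokens)
        print(tokens, round(value, 4), classify(value))
```

In this sketch, pages whose tokens agree strongly in either direction land near 1 or 0, while pages with weak or conflicting evidence stay near 0.5 and fall into the Unsure class.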