
A Web Content Classification System for Pornographic Blacklist Generation (一個蒐集色情黑名單的網路內容分類機制之研究)

Advisor: 陸承志

Abstract


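Because the Internet is widespread and open, users can easily reach inappropriate website content simply by entering relevant keywords into a search engine, so web content classification and management has become an urgent issue. This study addresses the pornographic category of inappropriate information, proposing and implementing a classification method based on the chi-square distribution for collecting a pornographic blacklist. Working on the textual content of pornographic websites, the method first estimates the pornographic tendency of each individual token and then uses the chi-square distribution to compute an Indicator Value, which assigns a web page to one of three classes: Porn, Unsure, or Non-Porn. The URLs of pages in the Porn class are recorded in a blacklist that can serve as the basis for filtering online pornography. We also design an incremental update mechanism that uses the list of pornographic hubs already collected to gather newly added pornographic pages efficiently. The implemented system can classify both Chinese and English web pages and has so far collected more than 600,000 pornographic web pages. Experimental results show that our system not only achieves higher precision than other systems but also requires markedly less training time.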

Parallel Abstract


The proliferation of the Web has given users access to a growing amount of inappropriate material (e.g., pornography, drugs, violence) on the Internet. Web content rating and filtering have therefore received intensive attention. This study proposes and implements a chi-square based method for classifying web pages in order to generate a pornographic blacklist. For each web page under investigation, an indicator value of pornography is calculated by applying a chi-square combining scheme to the porn tendencies of the tokens contained in that page. According to its indicator value, a web page is classified into one of three categories: Porn, Unsure, or Non-Porn. Web pages in the Porn category are put on a blacklist. An incremental update mechanism is also provided for collecting newly added pornographic sites by recursively crawling pornographic hubs. The present implementation can classify English and Chinese web pages and has collected more than 600,000 pornographic URLs. Experimental results indicate that the present implementation achieves a higher precision rate in detecting pornographic web pages than related work while requiring less training time.
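
The abstract does not state the exact formula behind the indicator value, so the following is only a minimal sketch of one well-known chi-square combining scheme (Fisher's method, as popularized in text-based spam filters), not necessarily the procedure used in the thesis. The function names, the clamping range, and the thresholds 0.9 and 0.1 are illustrative assumptions.

```python
import math


def chi2_survival(x2, dof):
    """P(X >= x2) for a chi-square variable with an even number of degrees of freedom."""
    if dof % 2 != 0:
        raise ValueError("dof must be even")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    # Closed-form series that is valid when dof is even.
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def indicator_value(tendencies):
    """Combine per-token porn tendencies (each in [0, 1]) into one score in [0, 1].

    Assumption: Fisher-style combining; the thesis may use a different scheme.
    """
    # Clamp to avoid log(0); the 0.01 / 0.99 bounds are an arbitrary choice.
    probs = [min(max(p, 0.01), 0.99) for p in tendencies]
    n = len(probs)
    if n == 0:
        return 0.5  # no textual evidence either way
    # Evidence that the page is pornographic.
    porn = 1.0 - chi2_survival(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Evidence that the page is not pornographic.
    non_porn = 1.0 - chi2_survival(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (1.0 + porn - non_porn) / 2.0


def classify(tendencies, porn_cutoff=0.9, non_porn_cutoff=0.1):
    """Map an indicator value to Porn / Unsure / Non-Porn (hypothetical thresholds)."""
    score = indicator_value(tendencies)
    if score >= porn_cutoff:
        return "Porn"
    if score <= non_porn_cutoff:
        return "Non-Porn"
    return "Unsure"


if __name__ == "__main__":
    # Made-up token tendencies for three imaginary pages.
    for page in ([0.95, 0.9, 0.99, 0.85], [0.05, 0.1, 0.01, 0.15], [0.9, 0.1]):
        print(page, round(indicator_value(page), 4), classify(page))
```

With these hypothetical thresholds, the three sample pages fall into Porn, Non-Porn, and Unsure respectively. Keeping a middle Unsure band means that pages with conflicting or weak token evidence are not added to the blacklist outright.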


Cited By


孫宗業 (2006). A Spam Filter Based on Mail Header Analysis (一個植基於郵件標頭分析的垃圾郵件過濾器) [Master's thesis, Yuan Ze University]. Airiti Library. https://doi.org/10.6838/YZU.2006.00230
