Title

運用以卡方為基礎的統計方法於色情網頁分類之研究

Translated Titles

Classifying Pornographic Web Pages Using a Chi-Square Based Statistics Method

DOI

10.6382/JIM.200704.0225

Authors

李龍豪(Lung-Hao Lee);陸承志(Cheng-Jye Luh)

Key Words

網路內容分類 ; 色情黑名單 ; 不當資訊過濾 ; 卡方分配 ; Web Content Rating ; Pornographic Black List ; Inappropriate Web Content Filtering ; Chi-Square Distribution

Publication Name

資訊管理學報 (Journal of Information Management)

Volume or Term/Year and Month of Publication

Vol. 14, No. 2 (2007/04/01)

Page #

225 - 246

Content Language

Traditional Chinese

Chinese Abstract

由於網際網路的普及,資訊的散佈非常迅速,網路上充斥著各種良莠不齊的資訊,越來越多的不當資訊,例如色情小說、圖片與粗暴文字等,在缺乏完善的網路內容管理機制之下,使用者只要透過搜尋引擎輸入相關的關鍵字,就可以從搜尋結果藉由超連結輕易存取網站內容,因此網路內容管理已成為刻不容緩的議題。本研究針對不當資訊中的色情範疇,提出一個以色情網頁分類,來蒐集黑名單的方式,對色情網站內容中文字的部份,求出個別字詞(Word)的色情傾向(Porn Tendency),透過卡方分配計算出色情指標值(Indicator Value),將網頁分成色情(Porn)、未確定(Unsure)與非色情(Non-Porn)三類。色情類網頁的網址即為所謂的黑名單,可做為網路色情過濾的依據。本研究針對中文與英文語系網頁實作一個系統,實驗結果顯示,本提議方法具有高度的精確率與相當低的正誤判率。

English Abstract

With the rapid growth of Internet usage, inappropriate materials (e.g., pornography, drugs, and violence) have flooded the Web. The open nature of the Web allows users to access almost any type of such material, with various negative effects on users, particularly children. Web content rating and filtering is therefore a pressing issue. This study proposes a chi-square-based statistical method for classifying pornographic material. Given a web page, its textual content is first split into a list of tokens, each with a porn tendency weight. The proposed method then calculates an indicator value (I-value) for the page by combining the tokens' porn tendency weights through properties of the chi-square distribution. The resulting I-value is used to classify the page into one of three categories: Porn, Unsure, or Non-Porn. The web pages in the Porn category are finally collected into a blacklist. Currently, the proposed method can classify English and Chinese web pages. Experimental results indicate that the proposed method detects pornographic web content with high precision and a very low false positive rate.
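
The abstract does not give the formula for the I-value, so the following is only a minimal sketch of how per-token weights might be combined through the chi-square distribution. It assumes the Fisher-style combining used in Robinson's spam-filtering approach (items 18 and 19 in the reference list), which may differ from the authors' actual computation. The function names, the whitespace tokenizer, the demo weight table, and the 0.9 / 0.1 cutoffs are all hypothetical.

```python
import math


def chi2_sf_even(x, dof):
    """P(X > x) for a chi-square variable with an even number of degrees of freedom."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    half = x / 2.0
    term = math.exp(-half)  # i = 0 term of the closed-form series
    total = term
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1


def indicator_value(weights):
    """Combine porn tendency weights in (0, 1) into one score in [0, 1].

    Assumption: Robinson-style combining, not necessarily the paper's formula.
    """
    n = len(weights)
    if n == 0:
        return 0.5  # no evidence either way
    ws = [min(max(w, 0.01), 0.99) for w in weights]  # keep log() finite
    porn = 1.0 - chi2_sf_even(-2.0 * sum(math.log(1.0 - w) for w in ws), 2 * n)
    clean = 1.0 - chi2_sf_even(-2.0 * sum(math.log(w) for w in ws), 2 * n)
    return (1.0 + porn - clean) / 2.0


def classify_page(text, tendency, porn_cutoff=0.9, non_porn_cutoff=0.1):
    """Label a page Porn, Unsure, or Non-Porn (cutoffs are hypothetical)."""
    # Whitespace splitting is a simplification; Chinese text would need
    # a word segmenter instead.
    tokens = text.lower().split()
    weights = [tendency[t] for t in tokens if t in tendency]
    score = indicator_value(weights)
    if score >= porn_cutoff:
        return "Porn"
    if score <= non_porn_cutoff:
        return "Non-Porn"
    return "Unsure"


# Hypothetical weight table, for illustration only.
DEMO_TENDENCY = {"xxx": 0.95, "adult": 0.90, "weather": 0.05, "news": 0.10}
```

Calling `classify_page("xxx adult xxx adult", DEMO_TENDENCY)` places the page in the Porn category under these assumed weights, while a page whose tokens pull in both directions lands in Unsure. The middle band reflects the three-way split described in the abstract.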

Topic Category

Basic and Applied Sciences > Information Science
Social Sciences > Management
Reference
  1. Arentz, W. A.,Olstad, B.(2004).Computer Vision and Image Understanding.
  2. Baeza-Yates, R.,Ribeiro-Neto, B.(1999).Modern Information Retrieval.
  3. Balkin, J. M.,Noveck, B. S.,Roosevelt, K.(1999).Filtering the Internet: A Best Practices Model.Information Society Project at Yale Law School,1-38.
  4. Bosson, A.,Cawley, G. C.,Chan, Y.,Harvey, R.(2002).Non-retrieval: Blocking Pornographic Images.Proceedings of the International Conference on Image and Video Retrieval,50-60.
  5. Casella, G.,Berger, R. L.(2001).Statistical Inference.
  6. Chan, Y.,Harvey, R.,Smith, D.(1999).Building Systems to Block Pornography.Challenge of Image Retrieval,1-9.
  7. Duan, L.,Cui, G.,Gao, W.,Zhang, H.(2002).Adult Image Detection Method Base-On Skin Color Model and Support Vector Machine.The fifth Asian Conference on Computer Vision (ACCV),797-780.
  8. Etzioni, O.(1996).The World-Wide Web: Quagmire or Gold Mine?.Communications of the ACM,39(11),65-68.
  9. Goodwin, S.,Vidgen, R.(2002).Content, Content, Everywhere Time to Stop and Think? The Process of Web Content Management.Computing and Control Engineering Journal,13(2),66-70.
  10. Hammami, M.,Chahir, Y.,Chen, L.(2003).WebGuard: Web Based Adult Content Detection and Filtering System.IEEE/WIC International Conference on Web Intelligence,574-578.
  11. Jiao, F.,Gao, W.,Duan, L.,Cui, G.(2001).Detecting Adult Images Using Multiple Features.Info-tech and Info-net2001 Proceedings (ICII),378-383.
  12. Jicheng, W.,Yuan, H.,Gangshen, W.,Fuyan, Z.(1999).Web Mining: Knowledge Discovery on the Web.IEEE International Conference on Systems, Man, and Cybernetics,137-141.
  13. Kolari, P.,Joshi, A.(2004).Web Mining: Research and Practice.IEEE Computational Science and Engineering (web Engineering),6(4),49-53.
  14. Kosala, R.,Blockeel, H.(2000).Web Mining Research: A Survey.ACM SIGKDD Explorations,2(1),1-15.
  15. Lee, P. Y.,Hui, S. C.,Fong, A. C. M.(2003).A Structural and Content-Based Analysis for Web Filtering.Internet Research: Electronic Networking Applications and Policy,13(1),27-37.
  16. Lee, P. Y.,Hui, S. C.,Fong, A. C. M.(2002).Neural Networks for Web Content Filtering.IEEE Intelligent Systems,17(5),48-57.
  17. Liu, L.,Chen, J.,Song, H.(2002).The Research of Web Mining.Proceedings of the Fourth World Congress on Intelligent Control and Automation,2333-2337.
  18. Meyer, T. A.,Whateley, B.(2004).SpamBayes: Effective Open-source, Bayesian Based, Email Classification System.First Conference on Email and Anti-Spam (CEAS),1-8.
  19. Robinson, G.(2003).A Statistical Approach to the Spam Problem.Linux Journal.
  20. Ross, S. M.(2004).Introduction to Probability and Statistics for Engineers and Scientists.
  21. Schettini, R.,Brambilla, C.,Cusano, C.,Ciocca, G.(2003).On the Detection of Pornographic Digital Images.Proceedings of SPIE, Visual Communications and Image Processing,2105-2113.
  22. Smith, D.,Harvey, R.,Chen, Y.,Bangham, A.(1999).Classifying Web Pages by Content.IEE European Workshop on Distributed Imaging,99(109),8-1.
  23. Srivastava, J.,Desikan, P.,Kumar, V.(2002).Web Mining Accomplishments and Future Directions.Proceedings US. National Science Foundation Workshop on Next-Generation Data Mining,51-70.
  24. Torres, L.,Vila, J.(2002).Automatic Face Recognition for Video Indexing Application.Pattern Recognition,35(3),615-625.
  25. 王鐵雄、陳思翰、蔡顯明、林俊男、李新林(2004)。從眾行為在不當資訊防制上的應用。2004年台灣網際網路研討會
  26. 李龍豪、陸承志、黃威穎(2005)。參數調校模擬於高效率的色情網頁分類機制之應用。2005年台灣網際網路研討會
  27. 林宜隆、李璘昱、劉金和、莊育秀、許盛凱(2003)。不當資訊防制政策與管理策略之初探。2003年台灣網際網路研討會
  28. 邱志傑、王明習、賴溪松(2003)。TANet不當資訊尋與分析。2003年台灣網際網路研討會
  29. 邱志傑、王明習、賴溪松(2004)。不當資訊防制分析。2004年台灣網際網路研討會
  30. 邱忠俊(1999)。碩士論文。中央警察大學資訊管理研究所。
  31. 邱建明(2004)。碩士論文。國立中央大學資訊工程研究所。
  32. 郭永明(2001)。碩士論文。國立成功大學電機工程學系。
  33. 楊良吉(2001)。碩士論文。國立台灣大學資訊工程學研究所。
Times Cited
  1. 呂靜婷(2008)。一個以卡方為基礎的文件多重分類方法。元智大學資訊管理學系學位論文。2008。1-53。