
通過機器學習模型大規模分析維基百科的引文品質

Analyzing Wikipedia Citation Quality at Scale via Productionization of Machine Learning Models

Advisor: 彭文志

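Abstract


Machine learning models that aim to improve citation quality in Wikipedia, such as text-based detection of sentences that need citations (Citation Need models), have attracted broad attention from both the scientific community and the Wikipedia community. However, because these models are highly technical, their accessibility is limited, and their use is generally confined to machine learning researchers and practitioners rather than Wikipedia's volunteer editor and tool-developer communities. To bridge this gap, we developed Citation Detective, a system designed to periodically run Citation Need models on a large number of English Wikipedia articles and to publish public, usable monthly data dumps that expose sentences classified as missing citations. By making Citation Need models available to the general public, Citation Detective opens up new possibilities for research and applications. As an example of a research direction supported by Citation Detective, we conduct a large-scale analysis of citation quality in Wikipedia and show that citation quality is positively correlated with article quality, with the size of the active editor community, and with contribution inequality among editors. In addition, articles in the Biology category have the richest source coverage in English Wikipedia. Women's biographies and Africa-related articles also have good source coverage, which suggests that the growing number of initiatives seeking to eliminate geographical and gender bias are effectively improving citation quality. On the other hand, articles in English Wikipedia related to Western European countries have limited citation quality, because sources written in other languages are less available. We open-source the Citation Detective data and source code and integrate them with Wikipedia community tools such as Citation Hunt, to help Wikipedia improve the verifiability and reliability of its articles.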

Parallel Abstract


Machine learning models designed to improve citation quality in Wikipedia, such as text-based classifiers that detect sentences needing citations ("Citation Need" models), have received considerable attention from both the scientific and the Wikimedia communities. However, because these models are highly technical, their accessibility is limited, and their use is generally restricted to machine learning researchers and practitioners. To fill this gap, we present Citation Detective, a system designed to periodically run Citation Need models on a large number of articles in English Wikipedia and to release public, usable, monthly data dumps exposing sentences classified as missing citations. By making Citation Need models usable by the broader public, Citation Detective opens up new opportunities for research and applications. As an example of a research direction enabled by Citation Detective, we conduct a large-scale analysis of citation quality in Wikipedia and show that citation quality is positively correlated with article quality, with the size of the active editor community, and with contribution inequality among editors. In addition, Biology articles are the best sourced in English Wikipedia. Women's biographies and Africa-related articles are also well sourced, which suggests that the growing number of initiatives seeking to eliminate geographical and gender bias are effectively improving citation quality. On the other hand, articles related to Western European countries are poorly sourced in English Wikipedia, possibly because sources written in other languages are less available. The Citation Detective data and source code will be made publicly available and are being integrated with community tools for citation improvement such as Citation Hunt.

