

Cross-Language Encyclopedia Article Linking

Advisor: 項潔
Co-advisor: 蔡宗翰 (Richard Tzong-Han Tsai)


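Online encyclopedias such as Wikipedia have become one of the most important content services on the Internet. Linking articles written in different languages across online encyclopedias is an important problem in building and integrating multilingual knowledge bases. Most prior work has focused on creating cross-language links between the different language editions of Wikipedia; however, the number of articles Wikipedia covers differs markedly from language to language. To address this, linking the articles of several major monolingual online encyclopedias in different languages, in order to build a cross-language online encyclopedia, has become an important research topic. In this thesis, we define the research problem of cross-language online encyclopedia linking and propose an SVM-based linking method that uses a bilingual topic model and translation-related content as features, linking corresponding articles in English Wikipedia and Chinese Baidu Baike. To verify the effectiveness of the proposed method, we collected a number of corresponding articles from Chinese Baidu Baike and English Wikipedia and built several experimental datasets from them. The experimental results show that our method reaches 0.8252 in mean reciprocal rank (MRR), 0.1745 (+26.82%) higher than the baseline system, which indicates that the method is effective for building English-Chinese cross-language encyclopedia links. Because the method does not depend heavily on language-specific characteristics, it can readily be extended to link online encyclopedia articles in other languages.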


Online encyclopedias such as Wikipedia are among the most widely used internet services in the world. Although Wikipedia has many language editions, their coverage is imbalanced relative to the number of speakers of each language, both online and offline. Furthermore, large alternative online encyclopedias exist for some languages, such as Baidu Baike for Chinese. Access to the knowledge in these sources could be improved by integrating multiple online encyclopedias into large multilingual knowledge bases. The main task in such a project is creating links between articles in different encyclopedias written in different languages. Most research to date has focused on linking articles across the language editions of Wikipedia, and little work has been done on linking encyclopedias on other platforms. In this thesis, we develop a method for cross-language encyclopedia article linking (CLEAL) between encyclopedias on different platforms, namely English Wikipedia and Chinese Baidu Baike. We use an SVM model with bilingual topic model features and translation features to link articles between these two encyclopedias. To evaluate our approach, we compile datasets from Baidu Baike articles and their corresponding English Wikipedia articles. The evaluation results show that our approach achieves a mean reciprocal rank (MRR) of 0.8252, outperforming the baseline system by 0.1745 (+26.82%). Because our method does not depend heavily on specific platform formats or linguistic characteristics, it could be easily extended to generate cross-language article links among other online encyclopedias in other languages and on other platforms.
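
For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how MRR is commonly computed, together with a check of the relative improvement quoted above. It is not the thesis's evaluation code: the function name `mean_reciprocal_rank`, the toy article titles, and the convention that a query whose correct article is missing from the candidate list scores 0 are all assumptions made for illustration.

```python
def mean_reciprocal_rank(ranked_candidates, gold):
    """Average of 1/rank of the correct article over all queries.

    ranked_candidates: dict mapping a source article to its candidate
        target articles, ordered best first (hypothetical structure).
    gold: dict mapping a source article to its single correct target.
    A query whose correct target is not among the candidates contributes 0
    (an assumed convention).
    """
    total = 0.0
    for source, candidates in ranked_candidates.items():
        target = gold[source]
        if target in candidates:
            total += 1.0 / (candidates.index(target) + 1)
    return total / len(ranked_candidates)


# Hypothetical toy data: Baidu Baike titles with ranked English Wikipedia candidates.
ranked = {
    "北京": ["Beijing", "Peking University", "Nanjing"],      # correct at rank 1
    "苹果": ["Apple Inc.", "Apple", "Pear"],                  # correct at rank 2
    "长城": ["Great Wall Motors", "Wall", "Fortification"],   # correct target missing
}
gold = {"北京": "Beijing", "苹果": "Apple", "长城": "Great Wall of China"}

mrr = mean_reciprocal_rank(ranked, gold)  # (1 + 1/2 + 0) / 3

# Relative improvement implied by the figures reported in the abstract.
reported_mrr = 0.8252
absolute_gain = 0.1745
baseline_mrr = reported_mrr - absolute_gain      # about 0.6507
relative_gain = absolute_gain / baseline_mrr     # about 0.2682, i.e. +26.82%
```

Under these assumptions, the reported gain of 0.1745 corresponds to a baseline MRR of about 0.6507, which is consistent with the stated +26.82% relative improvement.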



