
基於網路語料之專有名詞翻譯方法於中日韓跨語言資訊檢索之應用

Web-based Named Entity Translation Method for Korean-Chinese and Japanese-Chinese Cross-language Information Retrieval

Advisor: 顏嗣鈞
Co-advisor: 許聞廉 (Wen-Lian Hsu)

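Abstract


Named entity translation plays an important role in many areas of natural language processing research, such as information retrieval and machine translation. In this thesis, we focus on translating Korean and Japanese named entities into Chinese in order to improve the performance of Korean-Chinese and Japanese-Chinese cross-language information retrieval. The characters used to write Chinese are ideographic, and a single syllable can correspond to several different characters, which makes named entity translation difficult. We propose a hybrid named entity translation method. The first component integrates several online corpora to extend the coverage of the bilingual dictionary. We use the inter-language links among the Chinese, English, Japanese, and Korean editions of Wikipedia as a translation tool. In addition, we use the people search engine provided by Naver.com to look up the Chinese or English translations of personal names. The second component is a translation pattern method: our system automatically learns Korean-Chinese, Korean-English, Japanese-Chinese, Japanese-English, and English-Chinese translation patterns from web corpora. These patterns are then used to extract the corresponding Chinese translations from the web page snippets returned by the Google search engine. According to our experiments, adding our named entity translation method to a cross-language information retrieval system yields a Mean Average Precision (MAP) five times higher than that of the method using only the bilingual dictionary. MAP reaches 0.3385, and recall reaches 0.7578. Our method can handle the translation of both CJK and non-CJK named entities and effectively improves the performance of cross-language information retrieval systems.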
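The two evaluation figures quoted above, MAP and recall, can be illustrated with a minimal sketch. The function names, the data layout (a ranked list of document IDs per query plus a set of relevant IDs per query), and the choice of micro-averaged recall are assumptions made for illustration; the abstract does not state how the thesis computed these scores.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision for one query's ranked list of document IDs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-query average precision over all queries in `runs`."""
    if not runs:
        return 0.0
    total = sum(average_precision(ranked, qrels.get(query, set()))
                for query, ranked in runs.items())
    return total / len(runs)


def recall(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """Micro-averaged recall (an assumption): retrieved relevant / all relevant."""
    total_relevant = sum(len(qrels.get(query, set())) for query in runs)
    found = sum(len(set(ranked) & qrels.get(query, set()))
                for query, ranked in runs.items())
    return found / total_relevant if total_relevant else 0.0
```

Here `runs` would hold the retrieval output for each translated query and `qrels` the relevance judgments of the test collection; both are hypothetical inputs.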

Parallel Abstract


Named entity (NE) translation plays an important role in many applications, such as information retrieval and machine translation. In this thesis, we focus on translating NEs from Korean and Japanese into Chinese in order to improve Korean-Chinese and Japanese-Chinese cross-language information retrieval (CLIR). The ideographic nature of Chinese makes NE translation difficult because one syllable may map to several Chinese characters. We propose a hybrid NE translation system. First, we integrate two online databases to extend the coverage of our bilingual dictionaries. We use Wikipedia as a translation tool, relying on the inter-language links between the Korean or Japanese edition and the Chinese or English editions. We also use Naver.com's people search engine to find a query name's Chinese or English translation. The second component of our system learns Korean-Chinese (K-C), Korean-English (K-E), and English-Chinese (E-C) translation patterns from the web. These patterns are used to extract K-C, K-E, and E-C pairs from Google snippets. For Japanese NEs, the system likewise learns Japanese-Chinese (J-C) and Japanese-English (J-E) translation patterns. CLIR performance with this hybrid configuration was over five times better than that of a configuration using only the bilingual dictionary. Mean average precision (MAP) was as high as 0.3385, and recall reached 0.7578. Our method can handle Chinese, Japanese, Korean, and non-CJK NE translation and substantially improves CLIR performance.
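The pattern component described above can be sketched roughly as follows. This is an illustrative approximation, not the thesis's implementation: the function names, the `<SRC>`/`<TGT>` placeholders, the short-gap heuristic, and the vote counting are all assumptions, and the snippets are plain strings passed in by the caller rather than live search results.

```python
import re
from collections import Counter
from typing import List, Tuple

SRC, TGT = "<SRC>", "<TGT>"      # hypothetical placeholder tokens
HAN = r"[\u4e00-\u9fff]+"        # a run of CJK unified ideographs


def learn_patterns(seed_pairs: List[Tuple[str, str]],
                   snippets: List[str],
                   max_gap: int = 5) -> List[str]:
    """Collect the short contexts that separate a known source NE from its
    known translation, most frequent first."""
    counts: Counter = Counter()
    for src, tgt in seed_pairs:
        for text in snippets:
            i = text.find(src)
            if i < 0:
                continue
            j = text.find(tgt, i + len(src))
            if j < 0:
                continue
            between = text[i + len(src):j]
            # Assumption: only short, non-empty gaps make useful patterns.
            if 1 <= len(between) <= max_gap:
                counts[SRC + between + TGT] += 1
    return [pattern for pattern, _ in counts.most_common()]


def extract_translations(src: str,
                         patterns: List[str],
                         snippets: List[str]) -> List[Tuple[str, int]]:
    """Apply learned patterns to snippets for a new source NE and rank the
    captured Chinese candidates by how often they were matched."""
    votes: Counter = Counter()
    for pattern in patterns:
        between = pattern[len(SRC):len(pattern) - len(TGT)]
        regex = re.escape(src) + re.escape(between) + "(" + HAN + ")"
        for text in snippets:
            for match in re.finditer(regex, text):
                votes[match.group(1)] += 1
    return votes.most_common()
```

For instance, learning from the seed pair `("서울", "首爾")` and the snippet `"서울 (首爾) 是韓國的首都"` yields the pattern `<SRC> (<TGT>`, which then extracts `金大中` for the query `김대중` from a snippet such as `"김대중 (金大中) 前總統"`. A fuller system would also need candidate filtering, since the greedy character-run capture can overshoot the true translation.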
