  • Thesis

命名實體的中英文音譯

English to Chinese Transliteration of Named Entity

Advisor: 吳世弘

Abstract


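On today's Web, in applications such as information retrieval and machine translation, users often need transliterations of named entities, which help them find the information they need. The translations of many foreign named entities cannot be found in a dictionary; this is commonly called the out-of-vocabulary (OOV) problem. Transliteration, however, often yields different results because of differences in pronunciation or regional vocabulary, so in practice a single named entity frequently has several transliterations. For example, "Bush" is transliterated as "布希" or "布什".

In this thesis, we examine two problems that arise in English-to-Chinese cross-language information retrieval: transliteration and named entity disambiguation. We apply machine learning algorithms to the multiple-transliteration and OOV problems, so that our system offers users several suitable transliteration candidates and helps them cope with the multiple transliterations found on the Web.

Our experiments use data from the Named Entity Workshop Shared Task 2009 (NEWS 2009) and from Wikipedia. NEWS provides a corpus transliterated by Xinhua News Agency, which follows relatively strict rules and a controlled choice of Chinese characters; in the training data, each English name corresponds to one Chinese transliteration. Wikipedia is an open online encyclopedia that any user can edit, with 266 language versions. Because it is edited online by users worldwide, it follows no translation rules, and the pronunciations and characters chosen vary. We therefore extract transliterations from Wikipedia entries that exist in both Chinese and English and test how a model trained on formal transliterations handles informal ones. The experimental results show that a transliteration model trained on formal transliterations performs very differently on formal and informal test data.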

Parallel Abstract


Cross-language information retrieval (CLIR) and machine translation are important applications, and transliteration of named entities is an essential technology in both, because it helps users find the right information. When a translation cannot be found in a dictionary, the term is out-of-vocabulary (OOV). Moreover, the same named entity may have several transliterations; for instance, “Bush” can be transliterated as “布希” or “布什”. There is no open training or test set for named entity disambiguation, so we create a corpus for this purpose from Wikipedia. In this paper, we discuss the problem of transliteration in cross-language information retrieval and use a machine learning algorithm to address it. We provide several transliteration candidates to address the OOV problem and to help users cope with a named entity that has several transliterations. We use the data set provided by the Named Entity Workshop Shared Task 2009 (NEWS 2009) and data from Wikipedia. NEWS provides a pure transliteration corpus from Xinhua News Agency; in its training set, each source name corresponds to one target name. Wikipedia is an openly editable encyclopedia that includes 266 language versions and is edited by users across the Internet. We extract transliteration pairs from Wikipedia's multi-language article pairs. We trained a CRF model on the NEWS training data, tested it on the Wikipedia transliteration pairs, and compared the results.
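As an illustration of why multiple transliterations matter when scoring candidate lists, the following minimal Python sketch computes a top-n hit rate against either a single reference or a set of accepted references per name. It is not the thesis's evaluation code: the function name `top_n_accuracy`, the metric itself, and the ranked candidate lists are assumptions made for this example. Only the "Bush" pair comes from the abstract.

```python
from typing import Dict, List, Set


def top_n_accuracy(
    predictions: Dict[str, List[str]],
    references: Dict[str, Set[str]],
    n: int = 1,
) -> float:
    """Fraction of source names whose first n ranked candidates
    contain at least one accepted transliteration."""
    if not references:
        return 0.0
    hits = 0
    for name, accepted in references.items():
        candidates = predictions.get(name, [])[:n]
        if any(candidate in accepted for candidate in candidates):
            hits += 1
    return hits / len(references)


# Hypothetical ranked system output (illustrative, not real results).
predictions = {
    "Bush": ["布什", "布希"],
    "Obama": ["奧巴馬", "歐巴馬"],
}

# One reference per name, as in a one-to-one training corpus.
single_reference = {
    "Bush": {"布希"},
    "Obama": {"歐巴馬"},
}

# Several accepted transliterations per name.
multi_reference = {
    "Bush": {"布希", "布什"},
    "Obama": {"歐巴馬", "奧巴馬"},
}

strict_top1 = top_n_accuracy(predictions, single_reference, n=1)
strict_top2 = top_n_accuracy(predictions, single_reference, n=2)
lenient_top1 = top_n_accuracy(predictions, multi_reference, n=1)
```

In this toy setup the same candidate lists miss every single-reference answer at rank 1 but match once either more candidates or more accepted references are allowed, which is the situation the abstract describes for names with several transliterations.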

