Aligning Simplified-Traditional Chinese Vocabularies with a Chinese-to-Chinese Statistical Machine Translation Model

Advisor: Jing-Shin Chang (張景新)

Abstract

This paper proposes a monolingual Chinese-to-Chinese statistical machine translation (SMT) model, together with a global optimization strategy, for extracting and aligning equivalent terms (such as “雷射” and “激光” for “laser”) used across different Chinese-speaking communities. In a preliminary evaluation on a small test set, the synonymous Traditional Chinese (TC) term for a Simplified Chinese (SC) term is identified with 84% accuracy; in the opposite direction, TC-to-SC term translation reaches 87% accuracy. Previous work shows that structural tags in web documents can be used to identify candidate SC-TC term pairs. Unfortunately, such strong hints are usually available only for certain important terms, such as the names of famous companies; ordinary terms are generally not tagged, so this structural information cannot be used for them. Quantitative analysis further indicates that only a small fraction of SC-specific or TC-specific terms are likely to be tagged, so identifying SC-TC term pairs from structural tags alone is clearly limited. The main idea behind the proposed model is therefore to build “parallel left/right contexts” from non-parallel web pages and to use them as “pseudo tags” for term alignment. Because a monolingual SMT model by its nature finds translation equivalents within a single language, it may also be useful for finding synonym sets (synsets) in a general monolingual lexicon, and adapting such a model to mine large-scale synsets from non-parallel corpora appears promising.
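The “parallel left/right contexts” idea can be illustrated with a minimal sketch. This is not the thesis's SMT model or its global optimization strategy. It is a simple greedy co-occurrence count, and it assumes that both corpora are already word-segmented and that the SC text has been converted to the same character set as the TC text, so that identical context words match. The function names `context_index` and `align_terms` and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict


def context_index(sentences):
    """Map each (left word, right word) context to a Counter of the
    tokens observed between them. Sentences are lists of tokens."""
    index = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for i in range(1, len(padded) - 1):
            index[(padded[i - 1], padded[i + 1])][padded[i]] += 1
    return index


def align_terms(sc_sentences, tc_sentences):
    """For each SC-side term, pick the TC-side term that shares the most
    left/right contexts with it (greedy; an assumption of this sketch)."""
    sc_index = context_index(sc_sentences)
    tc_index = context_index(tc_sentences)
    scores = defaultdict(Counter)
    # Only contexts seen in both corpora act as "pseudo tags".
    for ctx in sc_index.keys() & tc_index.keys():
        for sc_term, sc_n in sc_index[ctx].items():
            for tc_term, tc_n in tc_index[ctx].items():
                if sc_term != tc_term:  # skip terms shared by both sides
                    scores[sc_term][tc_term] += min(sc_n, tc_n)
    return {sc: cands.most_common(1)[0][0] for sc, cands in scores.items()}


# Toy, hand-made data: non-parallel sentences that happen to share contexts.
sc_sentences = [["使用", "激光", "技術"], ["新", "激光", "設備"]]
tc_sentences = [["使用", "雷射", "技術"], ["新", "雷射", "設備"]]
pairs = align_terms(sc_sentences, tc_sentences)
print(pairs)
```

On real web text, raw shared-context counts would be noisy. A probabilistic translation model with a language model, as the abstract describes, is one way to weigh such evidence more carefully.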
