  • Thesis


Semi-Automatic Identification of Chinese Resultative Verb Compounds and Their English Translation Equivalents

Advisor: 高照明


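Chinese-to-English translation is complicated by the large differences between Chinese and English sentence structure and grammar. One difficulty is the compound verb formed by a verb and a resultative complement, known as the resultative verb compound (RVC). Most RVCs are two-character combinations in which the first character is a verb expressing some action or manner and the second expresses a result, direction, or degree (e.g., 吵醒, 跌下, 讀熟). The historical formation of Chinese RVCs and their function in modern Chinese grammar have been studied extensively, but there is very little related research in corpus linguistics or translation studies. This study focuses on the identification and translation of RVCs. As our corpus we chose the novel《狼圖騰》by the mainland Chinese author Jiang Rong (姜戎) and its English translation, Wolf Totem, by the sinologist Howard Goldblatt. The Chinese original and the English translation were manually aligned at the paragraph level to form a parallel corpus, and the RVC identification experiments were carried out on the first 18 chapters of the original. We adopted a semi-supervised machine learning method and the CRF++ toolkit. We first manually extracted all RVCs in the first chapter of the original and used these tags as seeds for CRF++, then attached certain key features (including part of speech and the position of each character within its word) to every character in the first 18 chapters to create the training file. We found that the "child" (次類) tags of the NLPIR part-of-speech tagging system gave higher accuracy than its "parent" (主類) tags. After identifying the RVCs, we built a program interface and used the multilingual word aligner Anymalign to automatically find their English translations. Because of the literary nature of the corpus, the program could not find many of the RVCs, but it lets translation researchers and translators find the various translations of RVCs in different contexts from the paragraph-aligned Chinese-English parallel corpus.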


Drastic differences in sentence structure and grammar complicate Chinese-to-English translation, and one particularly inconspicuous grammatical feature of Mandarin Chinese significantly hinders accurate English translation: the resultative verb compound (RVC). An RVC is a combination of characters (often, but not always, a pair) in which the first character is an action or manner verb and the second expresses a result, direction, or extent (e.g., 打斷, 坐下, 讀熟). A vast amount of research has been conducted on the history of RVC formation and on the function of RVCs in modern Chinese grammar, but little to no serious research has been carried out on RVCs in corpus linguistics or translation studies. This study therefore focuses on the identification and subsequent translation of RVCs, based on the Chinese novel《狼圖騰》by Jiang Rong and its English translation, Wolf Totem, by Howard Goldblatt, which we manually aligned at the paragraph level to form a parallel corpus. To identify RVCs within the first 18 chapters of the novel, we adopted a semi-supervised machine learning method using the CRF++ toolkit. We first manually tagged the RVCs in the first chapter of the text to act as seeds. We then created a training file by affixing certain key features to each character in the first 18 chapters: NLPIR and NAER part-of-speech tags; the corresponding B, I, and E tags marking a character's position at the beginning, middle, or end of a word; and the RVC seeds. With this file we generated thousands of predicted RVCs in two separate experiments (Experiment 1 used NLPIR “parent” tags, and Experiment 2 used NLPIR “child” tags). We found that the NLPIR “child” tags produced more accurate results than the NLPIR “parent” tags. After identifying the RVCs, we created an interface for finding their English translations using Anymalign, a multilingual word aligner.
Although the program was unable to find many of the RVCs because of their low frequency and the literary nature of the text, the interface allows translation researchers and working translators to manually identify translation equivalents of Mandarin Chinese RVCs and to study their different translations in the previously aligned parallel corpus.
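
The character-level training file described in the abstract can be illustrated with a minimal sketch. It assumes the text has already been segmented and part-of-speech tagged, and it writes one character per row in the tab-separated column layout that CRF++ reads, with the label in the last column. The function names, the "S" tag for single-character words, the "RVC"/"O" label scheme, and the placeholder part-of-speech tags are assumptions made for illustration only; the abstract does not specify them, and this is not the thesis's actual feature set.

```python
def position_tags(word):
    """Return one position tag per character of a word.

    B, I, and E mark the beginning, middle, and end of a word.
    "S" for single-character words is an assumption of this sketch.
    """
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["I"] * (len(word) - 2) + ["E"]


def to_crf_rows(tagged_words, seed_rvcs):
    """Build tab-separated rows: character, POS tag, position tag, label.

    tagged_words: list of (word, pos) pairs from a segmenter/tagger.
    seed_rvcs: set of manually identified RVCs used as seeds.
    The "RVC"/"O" label scheme is hypothetical.
    """
    rows = []
    for word, pos in tagged_words:
        label = "RVC" if word in seed_rvcs else "O"
        for char, position in zip(word, position_tags(word)):
            rows.append("\t".join([char, pos, position, label]))
    return rows


# Placeholder POS tags; a real run would use the tagger's own output.
sentence = [("他", "r"), ("坐下", "v"), ("了", "u")]
rows = to_crf_rows(sentence, {"坐下"})
print("\n".join(rows))
```

In a full training file, sentences would be separated by a blank line, and a CRF++ feature template would select which columns and neighboring rows contribute features.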


