
機器譯本與人工譯本的差異:基於Coh-Metrix 3.0與詞性標記的定量分析

Differences between Human and Machine Translations: A Quantitative Analysis Based on Coh-Metrix 3.0 and CLAWS Tagger

Advisor: 高照明

Abstract


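Machine translation systems have long been valued for their speed, low cost, and terminological consistency, but the quality of their output remains controversial and cannot match that of human translation. Understanding the differences between machine and human translations is therefore especially important: comparison shows where machine translation falls short, which in turn can inform improvements to machine translation systems. This study investigates the differences between Chinese-to-English machine and human translation, identifies indices that concretely reflect significant differences between the two, and offers feasible suggestions. The data are a Chinese-to-English comparable corpus comprising human and machine translations, together with original English texts collected as a reference corpus. The genres cover eight domains: contracts, patents, government, literature, law, science and technology, finance, and environment. The human and machine translations are first analyzed quantitatively for textual features with the text analysis tool Coh-Metrix 3.0. The results are then examined in the statistical tool StatPlus, covering in turn the descriptive, cohesion, syntactic, lexical diversity, and readability indices, with t-tests applied. Finally, the CLAWS part-of-speech tagger, AntConc, and a log-likelihood calculator are used to analyze the results and identify their causes. The results show that machine and human translations differ significantly in word count, sentence length, syntactic similarity, number of prepositional phrases, passive voice, and readability. These differences are closely related to how machine translation systems handle punctuation and to the number of prepositions and prepositional phrases they produce. The study also finds that machine translation quality varies with genre, and that lexical diversity is closely related to textual cohesion.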
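The t-test step can be illustrated with a minimal Python sketch. It assumes Welch's unequal-variance form, which is an assumption: the abstract does not say which t-test variant was run in StatPlus. The function name `welch_t` and the sample values are hypothetical and are not data from the thesis.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test.

    Each sample is a list of per-text values of one index, for example
    the mean sentence length of each text in a corpus.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical per-text values for one index (illustration only).
human = [1.0, 2.0, 3.0, 4.0, 5.0]
machine = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(human, machine)
print(f"t = {t_stat:.3f}, df = {dof:.3f}")
```

The sketch stops at the statistic and degrees of freedom; turning them into a p-value needs a t-distribution, which a statistics package would supply.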

Parallel Abstract


Machine translation (MT) has advanced significantly in recent years. It is fast to run, easy to operate, and has become more human-like as techniques improve. However, its quality remains a concern and is far from perfect. This research therefore takes a corpus-based approach and compares human and machine Chinese-to-English translations statistically at a deep and comprehensive textual level, covering lexical diversity, syntax, cohesion, and other aspects. The aim is to identify which textual features significantly distinguish the human corpus from the machine corpus, and to find explanations that might help improve MT output in the future. Such multilevel comparisons reveal the shortcomings of MT in detail and offer insights into how MT systems can be improved. Coh-Metrix 3.0, an automated text analysis tool, plays the major role in generating textual features for the four corpora: an original Chinese corpus, a human corpus, a machine corpus, and a reference corpus. The corpora cover eight domains: contracts, patents, government documents, literature, law, finance, environment, and science and technology. The results obtained from Coh-Metrix 3.0 are compared using t-tests in StatPlus, and are further processed and analyzed with the CLAWS part-of-speech tagger, AntConc, and a log-likelihood calculator.
The findings show that human and machine translations differ significantly in basic textual features, readability, and syntax. Long, nonsensical sentences are common in machine translations because machine translation systems cannot insert punctuation marks into sentences on the basis of semantics. Prepositions and prepositional phrases mainly account for the unexpected result that human translations contain significantly more words than machine translations. The performance of machine translation systems varies with text type. In addition, lexical diversity was found to be associated with textual cohesion. This thesis elaborates on these findings and offers feasible suggestions for future research.
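The log-likelihood comparison mentioned in the abstract can be sketched as follows. This is a minimal illustration of the two-corpus log-likelihood statistic in the form commonly used for corpus frequency comparison. It is assumed, not confirmed by the abstract, that the calculator used in the thesis implements this formula. The function name `log_likelihood` and the counts are hypothetical.

```python
from math import log


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood (G2) for one item's frequency in two corpora.

    freq_a, freq_b: occurrences of the item (e.g. a POS tag) in each corpus.
    size_a, size_b: total token counts of each corpus.
    """
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    # Expected frequencies if the item were equally likely in both corpora.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    g2 = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_a > 0:
        g2 += freq_a * log(freq_a / expected_a)
    if freq_b > 0:
        g2 += freq_b * log(freq_b / expected_b)
    return 2.0 * g2


# Hypothetical counts of one tag in two corpora of 1000 tokens each
# (illustration only, not data from the thesis).
print(round(log_likelihood(20, 0, 1000, 1000), 4))
```

A larger value means the item's frequency differs more between the two corpora than equal usage would predict. With one degree of freedom, values above about 3.84 correspond to p < 0.05.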

