Terms, such as compound nouns, named entities, acronyms, and other noun phrases, make up a large part of most documents. “Term translation” therefore plays an important role in lexicon construction, machine translation (MT), cross-language information retrieval (CLIR), and other natural language processing applications. A term can be translated either by rendering its meaning in another language (semantic translation) or by approximating its source-language pronunciation (transliteration). However, with globalization and rapid technological change, new terms are coined faster than dictionaries can record them, which gives rise to the out-of-vocabulary (OOV) problem. In addition, the translation of a term often varies from one domain to another. Term translation is therefore difficult to handle by dictionary lookup alone, and it presents a serious problem for tasks such as MT and CLIR. In this thesis, we present novel methods for learning to find translations of a given term on the Web. The methods involve two stages. In the training stage, we use a bilingual term list to learn source-target surface patterns, morpheme relations, and domain-specific keywords, which supply effective query expansion terms for various types of terms and support translation extraction; the expanded queries allow us to collect, through a search engine, more mixed-code data that contain relevant translations. At run time, the methods automatically transform the given term into a set of expanded queries, with the aim of increasing the probability that the snippets returned by a Web search engine over a very large collection of mixed-code documents contain appropriate translations, including transliterations and domain-specific translations. The methods then extract translation candidates from the returned snippets and rank them. We implemented the methods in a prototype system, TermMine. Experimental evaluation shows that the proposed methods achieve high precision and recall, and that they outperform existing state-of-the-art machine translation systems on term translation.
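The run-time stage described above (expand the query, collect snippets, extract candidates, rank them) can be illustrated with a minimal sketch. It is not the TermMine implementation: the function names `expand_query`, `extract_candidates`, and `rank_candidates` are hypothetical, the expansion terms and snippets are hard-coded rather than learned from a bilingual term list or fetched from a search engine, and plain n-gram frequency stands in for whatever ranking the thesis actually uses.

```python
import re
from collections import Counter

# Matches runs of CJK unified ideographs inside a mixed-code snippet.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def expand_query(term, expansion_terms):
    """Build one expanded query string per expansion term.

    The expansion terms are assumed to be given; in the thesis they
    would be learned during the training stage.
    """
    return [f'"{term}" {extra}' for extra in expansion_terms]


def extract_candidates(term, snippets, max_len=4):
    """Count CJK n-grams (2..max_len characters) in snippets that mention the term."""
    counts = Counter()
    for snippet in snippets:
        # Skip snippets that do not contain the source term at all.
        if term.lower() not in snippet.lower():
            continue
        for run in CJK_RUN.findall(snippet):
            for n in range(2, max_len + 1):
                for i in range(len(run) - n + 1):
                    counts[run[i:i + n]] += 1
    return counts


def rank_candidates(counts, top_k=3):
    """Rank by frequency, preferring longer candidates on ties."""
    return sorted(counts, key=lambda c: (-counts[c], -len(c), c))[:top_k]


# Hard-coded stand-ins for snippets a search engine might return.
snippets = [
    "柏拉圖 (Plato) 是古希臘哲學家",
    "古希臘的柏拉圖(Plato)",
    "Plato 柏拉圖 著作",
    "unrelated 文字內容",
]

queries = expand_query("Plato", ["中文", "翻譯"])
counts = extract_candidates("Plato", snippets)
ranked = rank_candidates(counts)
print(queries)
print(ranked)
```

Frequency counting is the simplest choice for a sketch like this. It ignores the learned surface patterns and morpheme relations, which is why substrings of the correct translation also rank highly here.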