Mining Wikipedia Comparable Corpora to Improve Domain-Specific Machine Translation

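In recent years, researchers have continued to develop resources that benefit natural language processing, such as bilingual and multilingual dictionaries, which allow documents, sentences, or words in one language to be converted into another and thereby support cross-lingual processing tasks. Keeping these resources trustworthy, that is, of high quality, requires continual manual maintenance, updating, and extension, which is both time-consuming and costly. Researchers have therefore worked on techniques for automatically constructing parallel corpora, aiming to build large parallel corpora in support of many natural language processing tasks such as cross-lingual information retrieval, machine translation, and text classification. In practice, however, the great majority of available text can be regarded as comparable corpora: news, textbooks, topical magazines and books, and even web pages. Texts of this kind are not bound to the same content; they discuss related material under a given topic, so two articles on the same topic may still differ in content. For this reason, comparable corpora cover a broader range and contain more information than parallel corpora. But because it is unknown in advance how much of a comparable text can actually be aligned, extracting parallel text from comparable corpora remains a task with room for improvement.

Our system extracts parallel text in two stages. In the first stage, to avoid a high proportion of non-parallel text in the comparable corpus, we compute sentence-level frequencies based on STF-IDTF, a variant of TF-IDF adapted by Ma, and apply a threshold to decide whether a sentence is likely to be parallel; we call the selected sentences candidate aligned sentences. We experimented with thresholds from 0.1 to 1.0 and chose the most suitable range as the basis for selecting sentences in subsequent experiments. In the second stage, the candidate sentences are aligned with Champollion, which considers ten alignment types (1-0, 0-1, 1-1, 1-2, 2-1, 2-2, 1-3, 3-1, 1-4, and 4-1), computes sentence similarity scores with a dynamic programming algorithm, and selects the best alignment path.

The goal of this paper is to improve domain-specific machine translation. We use sentence alignment to extract domain-specific Chinese-English parallel text from Wikipedia, obtain word alignments with GIZA++, and train a machine translation system with NiuTrans. Using the data set of the NTCIR-9 Patent Machine Translation task, we then compare, in terms of BLEU and NIST, systems trained with and without the Wikipedia data, Google Translate, and a baseline system, to assess the feasibility of using Wikipedia for domain-specific machine translation. The experimental results show that the parallel text extracted from Wikipedia is of sufficient quality to help domain-specific machine translation with a small amount of data, and that the resulting system outperforms a mainstream online translation system. Moreover, Wikipedia offers topic entries in more than 300 categories, such as mathematics, geography, and drama, and is continually updated and corrected by online users, so its growing content lets the system keep updating its parallel corpus for use in cross-lingual processing tasks.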
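
The second stage, dynamic programming over the ten alignment types, can be illustrated with a minimal Python sketch. It is not Champollion's implementation: the function names `toy_similarity` and `align`, the skip penalty of -0.1, and the token-overlap score are all assumptions made for illustration, and the sketch further assumes that source sentences have already been mapped into target-language tokens (for example through a bilingual lexicon). Only the set of alignment types and the idea of choosing the best-scoring path come from the abstract.

```python
from typing import Callable, List, Sequence, Tuple

# The ten alignment types named in the abstract, as
# (number of source sentences, number of target sentences).
BEAD_TYPES = [(1, 0), (0, 1), (1, 1), (1, 2), (2, 1),
              (2, 2), (1, 3), (3, 1), (1, 4), (4, 1)]

# One aligned unit: ((src_start, src_end), (tgt_start, tgt_end)), end exclusive.
Bead = Tuple[Tuple[int, int], Tuple[int, int]]


def toy_similarity(src: Sequence[str], tgt: Sequence[str]) -> float:
    """Hypothetical score: Dice overlap of lowercased whitespace tokens.

    An empty side (a 1-0 or 0-1 type) gets an assumed penalty of -0.1.
    """
    if not src or not tgt:
        return -0.1
    a = set(" ".join(src).lower().split())
    b = set(" ".join(tgt).lower().split())
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def align(src: Sequence[str], tgt: Sequence[str],
          score: Callable[[Sequence[str], Sequence[str]], float] = toy_similarity
          ) -> List[Bead]:
    """Return the alignment path with the highest total score."""
    n, m = len(src), len(tgt)
    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    # Row-major order guarantees best[i][j] is final before it is extended.
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == neg_inf:
                continue
            for di, dj in BEAD_TYPES:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cand = best[i][j] + score(src[i:ni], tgt[j:nj])
                if cand > best[ni][nj]:
                    best[ni][nj] = cand
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end to recover the best path.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    beads.reverse()
    return beads


src = ["the cat sat", "dogs bark loudly at night"]
tgt = ["the cat sat", "dogs bark loudly", "at night"]
print(align(src, tgt))  # [((0, 1), (0, 1)), ((1, 2), (1, 3))]
```

In this toy input the second source sentence is split across two target sentences, so the best path is a 1-1 alignment followed by a 1-2 alignment. Summing raw similarity scores is the simplest possible objective; a real aligner would typically weight or normalise scores by alignment type and sentence length.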

Keywords

comparable corpora; sentence alignment; Wikipedia; parallel corpora; machine translation

Parallel Abstract

Comparable corpora are very useful for various natural language processing (NLP) applications such as machine translation (MT) and cross-lingual information retrieval (CLIR). Comparable corpora in various domains can be collected from news, textbooks, or websites. To our knowledge, Wikipedia is the largest free multilingual website on the Internet. For this reason, we extracted sentence pairs from Wikipedia to build comparable corpora for different domains. This paper reports that such comparable corpora can be used to improve machine translation in specific domains. In our approach, we used the sentence alignment system Champollion to extract Chinese-English sentence pairs from Wikipedia. To test the quality of the extracted data, we used it in a machine translation task and observed whether it helps. We tested machine translation in several specific domains. The experimental results show that the parallel data extracted from Wikipedia improves the quality of the machine translation system with only a small amount of additional data.

Parallel Keywords

sentence alignment; machine translation; comparable corpora; Wikipedia; parallel corpora

