A Preliminary Study of Phrase Alignment for Mandarin-English Parallel Sentences

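For Taiwanese-Mandarin parallel sentences written in LangGeh form in a parallel corpus, Lin (2009) and Yang (2010) used the longest common subsequence (LCS) method to help align the sentences. "LangGeh writing" (讓格書寫) is a newly proposed orthography that writes text in units of simple phrases. This paper continues the study of parallel sentence alignment, with the target changed to English and Mandarin. With the help of chunking, we first convert each English sentence into a sequence of simple short phrases, so that Mandarin-English sentences expressed as short-phrase sequences can be aligned with the same LCS method. LCS alignment of parallel sentences requires a gain function. Because the basic unit is now the simple short phrase, we apply a similar LCS method a second time to compute an LCS score for each Mandarin-English phrase pair, and then use these scores to carry out the sentence alignment. We also use this method to help annotate the Mandarin-English alignment of a news corpus.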

Keywords

LangGeh writing; simple short phrase; parallel corpus; longest common subsequence; parallel sentence alignment; phrase LCS score

Parallel Abstract

In previous studies, Lin (2009) and Yang (2010) used the longest common subsequence (LCS) method to help align parallel sentences in a Taiwanese-Mandarin parallel corpus. In contrast to traditional writing, which has no spaces inside a sentence, the sentences in that corpus are written in the so-called "LangGeh" (讓格) orthography, which uses the simple short phrase (SSP) as its unit and separates SSPs with spaces. This paper continues the alignment study with Mandarin-English parallel sentences. With the help of chunking, we first segment each English sentence into a sequence of SSPs, and then align the Mandarin-English parallel sentences at the SSP level using the same LCS method. Sentence alignment with LCS requires a gain function between SSPs, so we apply LCS again to compute a score for each pair of SSPs. The method is used to help align a Mandarin-English parallel news corpus.
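The two-level use of LCS described above can be illustrated with a minimal sketch. It is not the authors' implementation. The toy lexicon `TOY_LEXICON`, the function names, the normalisation of the phrase score by the longer phrase length, and the use of a word lexicon to decide whether a Mandarin word matches an English word are all assumptions made for illustration. Sentences are assumed to be already chunked into short phrases, each phrase being a list of tokens.

```python
from typing import Callable, List, Sequence, Tuple

Phrase = Sequence[str]

# Hypothetical toy lexicon (an assumption, not from the paper):
# Mandarin word -> set of acceptable English words.
TOY_LEXICON = {
    "我": {"i"},
    "喜歡": {"like", "likes"},
    "紅": {"red"},
    "蘋果": {"apple", "apples"},
}


def lcs_length(a: Sequence[str], b: Sequence[str],
               match: Callable[[str, str], bool]) -> int:
    """Classic LCS length; match(x, y) decides whether two elements agree."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if match(x, y):
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def phrase_score(zh: Phrase, en: Phrase) -> float:
    """LCS score of one phrase pair, scaled to [0, 1] (assumed normalisation)."""
    if not zh or not en:
        return 0.0

    def match(zh_word: str, en_word: str) -> bool:
        return en_word.lower() in TOY_LEXICON.get(zh_word, set())

    return lcs_length(zh, en, match) / max(len(zh), len(en))


def align(zh_sent: List[Phrase], en_sent: List[Phrase],
          gain: Callable[[Phrase, Phrase], float]
          ) -> Tuple[float, List[Tuple[int, int]]]:
    """Weighted LCS: order-preserving phrase pairs with maximal total gain."""
    n, m = len(zh_sent), len(en_sent)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1] + gain(zh_sent[i - 1], en_sent[j - 1]),
            )
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        g = gain(zh_sent[i - 1], en_sent[j - 1])
        if g > 0 and best[i][j] == best[i - 1][j - 1] + g:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


zh_sentence = [["我"], ["喜歡"], ["紅", "蘋果"]]
en_sentence = [["I"], ["really"], ["like"], ["red", "apples"]]
total_gain, aligned_pairs = align(zh_sentence, en_sentence, phrase_score)
print(total_gain, aligned_pairs)
```

Here the inner LCS supplies the gain function for the outer LCS, mirroring the structure described in the abstract. Note that an LCS-style alignment keeps phrase order, so phrases that are reordered between the two languages cannot both be paired in this sketch.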

Parallel Keywords

LangGeh; simple short phrase; parallel corpus; longest common subsequence; parallel sentence alignment; LCS score

References

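[5] 楊哲瑋 (Yang) (2010). 台華平行讓格語料的自動對齊 [Automatic alignment of Taiwanese-Mandarin parallel LangGeh corpora]. Master's thesis, Institute of Statistics, National Tsing Hua University, Hsinchu.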

[3] Steven Bird, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media.

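[4] 林淑卿 (Lin) (2009). 從台華平行語料庫擷取對應詞組典 [Extracting a dictionary of corresponding phrases from a Taiwanese-Mandarin parallel corpus]. Master's thesis, Institute of Statistics, National Tsing Hua University, Hsinchu.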

[1] Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin (1990). A Statistical Approach to Machine Translation. Computational Linguistics, 16(2), June 1990.
