
Aligning Parallel Bilingual Corpora Statistically with Punctuation Criteria

Abstract


We present a new punctuation-based approach to aligning sentences in bilingual parallel corpora, with a focus on English and Chinese. Although the length-based approach achieves high sentence-alignment accuracy on clean parallel corpora written in two Western languages, such as French-English or German-English, it does not work as well on corpora that are noisy or written in two disparate languages, such as Chinese-English. Cognates can be used on top of the length-based approach to increase alignment accuracy. However, cognates do not exist between two disparate languages, which limits the applicability of the cognate-based approach. In this paper, we examine the feasibility of exploiting the statistically ordered matching of punctuation marks in two languages to achieve high-accuracy sentence alignment. We have experimented with an implementation of the proposed method on parallel corpora, namely the Chinese-English Sinorama Magazine Corpus and Scientific American Magazine articles, with satisfactory results. In our experiments, the proposed method exhibits better precision rates than the length-based method. Highly promising improvement was observed when both the punctuation-based and length-based methods were adopted within a common statistical framework. We also demonstrate that the method can be applied to other language pairs, such as English-Japanese, with minimal additional effort.
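The idea of ordered punctuation matching can be illustrated with a minimal sketch. It is not the paper's implementation: the paper describes a statistical model, whereas this sketch substitutes a fixed, hypothetical mapping table (`ZH_TO_EN`) and a longest-common-subsequence score. All function names here are assumptions made for illustration.

```python
# Hypothetical one-to-one mapping from Chinese (full-width) punctuation to
# English counterparts. The paper estimates punctuation correspondences
# statistically; this fixed table is an assumption for illustration only.
ZH_TO_EN = {
    ",": ",", "、": ",", "。": ".", ";": ";", ":": ":",
    "?": "?", "!": "!", "「": '"', "」": '"', "(": "(", ")": ")",
}
EN_PUNCT = set(',.;:?!"()')


def punct_seq_en(text):
    """Return the ordered punctuation marks found in an English text."""
    return [ch for ch in text if ch in EN_PUNCT]


def punct_seq_zh(text):
    """Return the Chinese punctuation marks, mapped to English symbols."""
    return [ZH_TO_EN[ch] for ch in text if ch in ZH_TO_EN]


def lcs_len(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def punct_match_score(en_text, zh_text):
    """Score in [0, 1]: share of punctuation marks that match in order."""
    a, b = punct_seq_en(en_text), punct_seq_zh(zh_text)
    if not a and not b:
        # No punctuation on either side: treated here as a neutral match.
        return 1.0
    return 2 * lcs_len(a, b) / (len(a) + len(b))
```

A candidate sentence pair whose punctuation marks line up in the same order receives a score near 1, while a pair with unrelated punctuation scores near 0. In an aligner, such a score could be combined with a length-based score, which is the kind of combination the abstract reports as most promising.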

