
Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses

Abstract


In this paper, we describe an algorithm that employs syntactic and statistical analysis to extract bilingual collocations from a parallel corpus. Collocations are pervasive in all types of writing and can be found in phrases, chunks, proper names, idioms, and terminology. Automatic extraction of monolingual and bilingual collocations is therefore important for many applications, including natural language generation, word sense disambiguation, machine translation, lexicography, and cross-language information retrieval. Collocations can be classified as lexical or grammatical. A lexical collocation holds between content words, while a grammatical collocation holds between a content word and function words or a syntactic structure. In addition, bilingual collocations can be rigid or flexible in either language. In a rigid collocation, the words must appear next to each other; in a flexible (elastic) collocation, they need not. In this paper, we focus on extracting rigid lexical bilingual collocations. In our method, the preferred syntactic patterns are obtained from idioms and collocations in a machine-readable dictionary. Collocations matching the patterns are extracted from aligned sentences in a parallel corpus. For sentence alignment, we use a new method based on punctuation statistics. The punctuation-based approach is found to outperform the length-based approach, with precision rates approaching 98%. The extracted collocations are then matched up based on cross-linguistic statistical association. Statistical association between whole collocations, as well as between the words within them, is used to link a collocation with its counterpart in the other language. We implemented the proposed method on a very large Chinese-English parallel corpus and obtained satisfactory results.
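To make the linking step more concrete, the following is a minimal sketch of matching collocations across languages by statistical association over aligned sentence pairs. The abstract does not say which association measure is used; the log-likelihood ratio (G²) here is an assumption, as are the function names `log_likelihood_ratio` and `link_collocations` and the toy data. The sketch scores whole collocations only and omits the word-level association that the abstract also mentions.

```python
import math
from collections import Counter
from itertools import product


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of co-occurrence counts."""
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i, j in product(range(2), range(2)):
        o = observed[i][j]
        if o > 0:  # empty cells contribute nothing to the sum
            expected = row_sums[i] * col_sums[j] / total
            g2 += o * math.log(o / expected)
    return 2.0 * g2


def link_collocations(aligned_pairs):
    """Link each source collocation to its most strongly associated target.

    aligned_pairs: one (source_collocations, target_collocations) tuple
    per aligned sentence pair (hypothetical input format).
    Returns {source_collocation: (best_target_collocation, score)}.
    """
    n = len(aligned_pairs)
    src_count, tgt_count, joint_count = Counter(), Counter(), Counter()
    for src, tgt in aligned_pairs:
        src, tgt = set(src), set(tgt)
        src_count.update(src)
        tgt_count.update(tgt)
        joint_count.update(product(src, tgt))

    best = {}
    for (s, t), k11 in joint_count.items():
        # Skip pairs that co-occur no more often than chance predicts,
        # because G^2 is also large for negative association.
        if k11 * n <= src_count[s] * tgt_count[t]:
            continue
        k12 = src_count[s] - k11      # s appears without t
        k21 = tgt_count[t] - k11      # t appears without s
        k22 = n - k11 - k12 - k21     # neither appears
        score = log_likelihood_ratio(k11, k12, k21, k22)
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best


# Toy aligned data (invented for illustration, not from the paper's corpus).
toy_pairs = [
    (["strong tea"], ["濃茶"]),
    (["strong tea", "heavy rain"], ["濃茶", "大雨"]),
    (["heavy rain"], ["大雨"]),
    (["heavy rain", "make a decision"], ["大雨", "做決定"]),
    (["make a decision"], ["做決定"]),
]
links = link_collocations(toy_pairs)
```

Counting each collocation at most once per sentence pair keeps the contingency table consistent with the number of aligned pairs, which is why the inputs are converted to sets before counting.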



Cited by


李慶清 (2014). 對於學習中文的一個輔助專家系統 [master's thesis, National Formosa University]. Airiti Library. https://doi.org/10.6827/NFU.2014.00018
李啟維 (2017). 基於隱藏式馬可夫模型的中文改錯 [master's thesis, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU201701112
Chiu, H. W. (2014). Chinese Spell Checking Based on Noisy Channel Model [master's thesis, National Tsing Hua University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0016-2912201413553062
Yeh, M. C. (2017). 重述語的自動生成與改錯 [master's thesis, National Tsing Hua University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0016-0401201816121234
