Automatic Generation of Mandarin-English Code-Switching Sentences Using Generative Adversarial Networks

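Code-switching is the alternating use of two or more languages within a single utterance or passage of text. Code-switching style and characteristics can vary with the speaker, the topic of conversation, and the language pair. Although code-switching occurs frequently in natural language, code-switching corpora are scarce compared with monolingual corpora. This thesis aims to develop an unsupervised technique for automatically generating code-switching text, and validates it experimentally on two code-switching datasets in which Mandarin is the host language and English is the guest language. The method uses a generative adversarial network together with the policy gradient algorithm to predict suitable code-switching positions in a monolingual (host-language) sentence. The words at those positions are translated into the guest language to produce intra-sentential code-switching sentences, which are then used as augmented training data for a language model. The results show that the proposed approach slightly improves the language model and slightly reduces the guest-language error rate of a speech recognition system.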
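The pipeline described above (predict switch positions, translate the selected words, update the predictor from a reward) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the toy lexicon `LEXICON`, the per-word weight table standing in for the generator network, and `toy_reward`, a fixed function standing in for the discriminator's score. The update is a plain REINFORCE-style policy gradient step on independent Bernoulli switch decisions.

```python
import math
import random

# Hypothetical toy lexicon (Mandarin word -> English word); not from the thesis.
LEXICON = {"會議": "meeting", "報告": "report", "電腦": "computer"}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def switch_probs(tokens, weights):
    """Per-token switch probability; words without a translation get 0."""
    return [sigmoid(weights.get(t, 0.0)) if t in LEXICON else 0.0
            for t in tokens]


def sample_mask(probs, rng):
    """Sample one binary switch decision per token."""
    return [1 if rng.random() < p else 0 for p in probs]


def apply_switch(tokens, mask):
    """Translate the words at the selected positions into the guest language."""
    return [LEXICON[t] if m else t for t, m in zip(tokens, mask)]


def toy_reward(sentence):
    """Stand-in for a discriminator score (assumption): favors 'meeting'."""
    return 1.0 if "meeting" in sentence else 0.0


def reinforce_step(tokens, weights, reward_fn, rng, lr=0.5):
    """One policy gradient update on the per-word switch weights."""
    probs = switch_probs(tokens, weights)
    mask = sample_mask(probs, rng)
    sentence = apply_switch(tokens, mask)
    reward = reward_fn(sentence)
    for t, p, m in zip(tokens, probs, mask):
        if t in LEXICON:
            # Gradient of log Bernoulli(m; sigmoid(w)) with respect to w is (m - p).
            weights[t] = weights.get(t, 0.0) + lr * reward * (m - p)
    return sentence, reward


if __name__ == "__main__":
    rng = random.Random(0)
    weights = {}
    tokens = ["我", "明天", "有", "會議"]
    for _ in range(200):
        reinforce_step(tokens, weights, toy_reward, rng)
    print(switch_probs(tokens, weights))
```

In this sketch the switch probability for 會議 rises over training because only sentences containing "meeting" are rewarded. In an adversarial setup the reward would instead come from a discriminator trained to tell generated sentences from real code-switching text.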

Keywords

code-switching; text generation; data augmentation; language modeling; generative adversarial networks

