應用序列到序列生成模型於雙向文本改寫之研究

Using the Sequence to Sequence Generative Model for Bidirectional Text Rewriting

Advisor: 魏世杰

Abstract


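The ability to understand and master a language certainly varies from person to person, but it is also shaped by historical change. Classical Chinese, the written language of earlier eras, differs markedly from the Vernacular Chinese that most people use in daily life today, so many modern readers have difficulty understanding it. To bridge the comprehension gap between these two writing styles, this study takes bidirectional text rewriting between Classical Chinese and Vernacular Chinese as its topic. The corpus is processed with natural language processing (NLP) techniques, and a sequence-to-sequence (Seq2Seq) model is trained under a deep learning architecture to generate sentences in the target writing style. In addition, two independent sets of word vectors, one for Classical Chinese and one for Vernacular Chinese, are trained on monolingual corpora to capture the semantic relatedness between words within each writing style. Starting from the correspondence between Classical Chinese and Vernacular Chinese, the study extracts word alignment relations from the parallel corpus of the two styles and uses them to implement a bidirectional neural machine translation (NMT) system. Finally, the generated sentences are evaluated with the BLEU (Bilingual Evaluation Understudy) metric. On the test set, the word-level BLEU scores are higher for rewriting Vernacular Chinese into Classical Chinese, whereas the character-level BLEU scores are higher for rewriting Classical Chinese into Vernacular Chinese. In both directions, character-level rewriting scores clearly higher in BLEU than word-level rewriting. The bidirectional text rewriting approach adopted here thus offers a direction worth exploring for applying natural language technology to the study of Vernacular and Classical Chinese writing styles.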

Parallel Abstract


Although the ability to understand and master a language varies from person to person, it is also affected by the evolution of the language itself. In particular, Classical Chinese, a written language of the past, differs markedly from the Vernacular Chinese used in modern society. As a consequence, many Chinese readers today find Classical Chinese texts hard to understand. To bridge the gap in understanding between these two writing styles, this work takes bidirectional text rewriting between Classical Chinese (CC) and Vernacular Chinese (VC) as its topic. A parallel corpus is collected and processed with natural language processing techniques, then used to train a sequence-to-sequence model under a deep learning architecture. The model generates sentences in the desired writing style. In addition, two separate monolingual corpora are used to train two independent sets of word vectors, one for CC and one for VC, in order to capture the semantic relatedness between words within each writing style. Starting from the correspondence between CC and VC, this work extracts word alignments from the parallel corpus and uses them to build a bidirectional neural machine translation system. Finally, the BLEU metric is used to evaluate the generated sentences. On the test dataset, the word-level model rewrites VC to CC better than CC to VC. In contrast, the character-level model rewrites CC to VC better than VC to CC. Overall, the character-level model outperforms the word-level model in both directions. These results suggest that the bidirectional text rewriting method used in this work offers a promising direction for applying natural language technologies to the study of Vernacular and Classical Chinese writing styles.
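The abstract reports BLEU at two granularities, word level and character level. As a rough illustration of how tokenization granularity alone can change the score for one sentence pair, the following is a minimal sketch of single-reference, unsmoothed sentence-level BLEU. It is not the evaluation code used in the thesis: the function names, the toy sentence pair, and its hand-made word segmentation are all assumptions introduced here for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference.

    Uses uniform weights over 1..max_n-gram clipped precisions and the
    standard brevity penalty. Returns 0.0 if any precision is zero.
    """
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precision_sum += math.log(clipped / total)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision_sum / max_n)


# Hypothetical toy pair, pre-segmented by hand (not taken from the thesis corpus).
ref_words = ["學習", "並且", "時常", "溫習", "它"]
cand_words = ["學習", "並且", "經常", "溫習", "它"]

# Character-level tokens: split the same sentences into single characters.
ref_chars = list("".join(ref_words))
cand_chars = list("".join(cand_words))

word_bleu = bleu(cand_words, ref_words)
char_bleu = bleu(cand_chars, ref_chars)
print("word-level BLEU-4:", word_bleu)
print("char-level BLEU-4:", char_bleu)
```

In this toy pair a single substituted word breaks every word-level 4-gram, while several character-level 4-grams still match, so the two granularities give different scores for the same output. Established implementations, which also offer smoothing and corpus-level aggregation, would normally be preferred for a real evaluation.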

