
Correction of Homophones Using the Neural Network Language Model

Advisor: 魏世杰

Abstract


As technology advances, people write by hand less and less, and typos may be the biggest nuisance when they type text to communicate. Typo correction is an important and challenging task; a satisfactory solution often requires human-level language understanding. Because the phonetic (Zhuyin) input method is popular among users of traditional Chinese, homophone typos are especially common, so this study focuses on the detection and correction of homophone characters. In natural language processing (NLP), many pre-trained language models have been released in recent years, greatly reducing the computational resources once needed to train models from scratch. Among them, the Bidirectional Encoder Representations from Transformers (BERT) pre-trained language models have gained wide attention. This study uses a variant of the pre-trained BERT model for typo correction. Based on the 4,808 commonly used characters compiled by the Ministry of Education and the phonetic attributes from the Master Ideographs Seeker (全字庫) codebase, a confusion set of homophones is constructed for each commonly used character. Combined with sentences from Chinese Wikipedia, a complete homophone dataset for training and testing typo-correction models is built. In the experiments, the Soft-Masked BERT model trained on this dataset achieved a sentence-level F1-score of 0.885 for homophone typo detection and 0.806 for homophone typo correction, showing that it is approaching a level where it can assist humans in correcting homophone typos.
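To make the data-construction step concrete, the sketch below (a minimal illustration, not the thesis code) groups characters by pronunciation to form each character's homophone confusion set, then injects random homophone substitutions into clean corpus sentences to create training pairs. The TSV file format, the function names, and the 10% substitution rate are assumptions for illustration.

```python
import random
from collections import defaultdict

def load_confusion_sets(pronunciation_file):
    """Build a homophone confusion set per character. Assumes a hypothetical
    TSV of "<char>\t<zhuyin>" rows derived from the Master Ideographs Seeker
    attributes, restricted to the 4,808 common characters."""
    by_sound = defaultdict(set)
    with open(pronunciation_file, encoding="utf-8") as f:
        for line in f:
            char, sound = line.rstrip("\n").split("\t")
            by_sound[sound].add(char)
    confusion = defaultdict(set)
    for chars in by_sound.values():
        for c in chars:
            # Polyphonic characters get the union over all their readings.
            confusion[c] |= chars - {c}
    return confusion

def inject_typos(sentence, confusion, rate=0.1):
    """Corrupt a clean sentence by swapping characters for random homophones,
    yielding a (typo sentence, correct sentence) training pair."""
    chars = list(sentence)
    for i, c in enumerate(chars):
        if confusion.get(c) and random.random() < rate:
            chars[i] = random.choice(sorted(confusion[c]))
    return "".join(chars), sentence
```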
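Soft-Masked BERT (Zhang et al., 2020) couples a lightweight Bi-GRU detection network with a BERT correction network: the detector's per-character error probability softly interpolates each input embedding toward the [MASK] embedding before BERT predicts the correct character. The PyTorch sketch below illustrates that mechanism under stated assumptions (`bert-base-chinese` via Hugging Face `transformers`, illustrative detector size); it is not the author's implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class SoftMaskedBert(nn.Module):
    """Sketch of the Soft-Masked BERT architecture: a Bi-GRU detector
    predicts per-character error probabilities that softly mask each
    input embedding before the BERT correction network."""

    def __init__(self, bert_name="bert-base-chinese", mask_token_id=103):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.mask_token_id = mask_token_id          # [MASK] id in BERT vocab
        self.detector = nn.GRU(hidden, hidden // 2,
                               batch_first=True, bidirectional=True)
        self.det_head = nn.Linear(hidden, 1)
        self.cor_head = nn.Linear(hidden, self.bert.config.vocab_size)

    def forward(self, input_ids, attention_mask):
        word_emb = self.bert.embeddings.word_embeddings(input_ids)
        det_out, _ = self.detector(word_emb)
        p_err = torch.sigmoid(self.det_head(det_out))          # (B, L, 1)
        mask_emb = self.bert.embeddings.word_embeddings.weight[self.mask_token_id]
        # Soft masking: blend each embedding toward [MASK] by its error prob.
        soft_emb = p_err * mask_emb + (1.0 - p_err) * word_emb
        enc = self.bert(inputs_embeds=soft_emb, attention_mask=attention_mask)
        # Residual connection from the input embeddings, as in the paper.
        logits = self.cor_head(enc.last_hidden_state + word_emb)
        return p_err.squeeze(-1), logits
```

During training, the original paper optimizes a weighted sum of a binary cross-entropy detection loss on the error probabilities and a cross-entropy correction loss on the vocabulary logits.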
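The reported metrics treat the whole sentence as the unit: a sentence counts as correctly handled only if every character in it is right. A minimal sketch of sentence-level correction F1 under that convention, assuming aligned source/gold/predicted character strings:

```python
def sentence_level_f1(sources, golds, preds):
    """Sentence-level precision/recall/F1 for typo correction: a prediction
    counts only when the entire corrected sentence matches the gold."""
    tp = fp = fn = 0
    for src, gold, pred in zip(sources, golds, preds):
        has_error = src != gold    # gold says this sentence contains typos
        flagged = src != pred      # the model changed something
        if flagged and pred == gold:
            tp += 1
        elif flagged:
            fp += 1
        elif has_error:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Detection F1 is computed the same way, except a prediction counts as correct when the set of flagged positions {i : pred[i] != src[i]} equals the gold set {i : gold[i] != src[i]}, regardless of whether the substituted characters themselves match.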

Keywords

BERT, Confusion Set, Homophones, Deep Learning, NLP
