Research and Implementation of an Automatic Article Fluency Correction Technique Based on Natural-Language Deep Learning

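Today, most people writing in Chinese, whether essays, theses, or press releases, rely on commercially available systems such as Microsoft Word or Google Document. These systems offer a more convenient writing environment and flag misspelled characters so that users can correct them. Although such typos can be corrected, when users' proficiency levels vary widely, errors in Chinese sentence patterns and word choice remain, and more severe typos are difficult to correct, so the resulting text does not read fluently. To improve fluency, researchers have begun to study methods for minimizing writing errors. Many have proposed deep learning approaches to disfluent English writing. For Chinese sentence fluency, most existing work first performs sentence and word segmentation and then uses an algorithm to find, for each word, the closest known word, replacing erroneous words with correct ones. Grammatical problems, however, have not yet been addressed by existing research. To address these problems, this thesis designs an automatic fluency correction system for Chinese articles that corrects typos and checks and corrects grammatical errors in sentences entered by users. First, articles are downloaded with a web crawler and processed with word and sentence segmentation to build a lexicon of information-technology terms; fuzzy matching against this lexicon corrects word-usage errors in the information technology domain. Next, sentences from these articles feed an algorithm that automatically generates a Chinese error set, producing a corpus of disfluent sentences, which is used with BERT to train a fluency classifier that judges whether a sentence is fluent. Finally, an algorithm built on BERT's existing Mask LM capability addresses missing characters. Experimental results show that the proposed approach corrects sentences effectively and accurately, resolves grammatical errors, and handles typos in domain-specific terms, making writing more convenient for users and helping readers understand what the text is meant to convey.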
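The fuzzy-matching step against a domain lexicon can be illustrated with a minimal sketch. The abstract does not say which similarity measure the thesis uses, so this example assumes the standard library's `difflib` ratio as a stand-in; the lexicon contents, the `cutoff` value, and the function names `correct_term` and `correct_tokens` are hypothetical, and the input is assumed to be already segmented into words.

```python
from difflib import get_close_matches

# Hypothetical domain lexicon; the thesis builds its own from crawled articles.
DOMAIN_LEXICON = ["深度學習", "自然語言處理", "演算法", "資料庫", "神經網路"]


def correct_term(word, lexicon=DOMAIN_LEXICON, cutoff=0.6):
    """Return the closest lexicon entry, or the word unchanged if none is close enough."""
    if word in lexicon:
        return word
    # get_close_matches ranks candidates by SequenceMatcher ratio (an assumption
    # of this sketch, not necessarily the measure used in the thesis).
    matches = get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_tokens(tokens, lexicon=DOMAIN_LEXICON, cutoff=0.6):
    """Apply term correction to an already-segmented sentence."""
    return [correct_term(token, lexicon, cutoff) for token in tokens]


if __name__ == "__main__":
    print(correct_tokens(["今天", "深度學息"]))  # ['今天', '深度學習']
```

The cutoff is a trade-off: a lower value catches more distorted typos but risks rewriting ordinary words that merely resemble a lexicon entry.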

Keywords

Natural language; deep learning; article fluency correction; error correction

Parallel Abstract

Most people who write in Chinese today use existing commercial systems such as Microsoft Word or Google Document. These systems provide a more convenient writing environment and identify typos in the text so that users can correct them. Although typos can be corrected with these systems, when users' proficiency levels differ widely, mistakes in Chinese sentence patterns and wording remain, and more severe typos are difficult to correct, which makes the article difficult to read. To improve the smoothness of articles, many scholars have begun to research methods that minimize the errors that occur in writing. In current research, many scholars have proposed using deep learning to address poor fluency in English writing. For the fluency of Chinese sentences, most existing methods first apply sentence segmentation and word segmentation and then use an algorithm to check which existing word each word is closest to, replacing wrong words with correct ones. However, the problem of grammar has not been solved by current research. Based on the problems described above, this thesis designs an automatic smoothness correction system for Chinese articles that automatically corrects typos and checks and corrects grammatical errors in sentences input by users. First, a crawler program downloads articles, and word segmentation and sentence segmentation are used to build an information-domain thesaurus, which is then used to correct wording errors in the information field by means of fuzzy matching. Next, the sentences of these articles are used by an algorithm that automatically generates Chinese error sets; the resulting error dataset forms a corpus of unsmooth sentences, on which a smoothness classifier is trained with BERT to judge whether a sentence is smooth. Finally, an algorithm based on the existing Mask LM function of the BERT model solves the problem of missing words. According to the experimental data, the method developed in this thesis can effectively and accurately correct sentences, solve grammatical errors, and address typos in words from professional fields. The system makes it easier for users to write articles and helps readers understand what the text is trying to convey.
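The automatic generation of an error set from correct sentences might look roughly like the sketch below. The abstract does not describe which corruptions the thesis applies, so the three operations here (deleting, swapping, and duplicating a character) are assumptions chosen for illustration, as are the function names and the labeling convention of 1 for the original sentence and 0 for the corrupted one.

```python
import random


def delete_char(sentence, rng):
    """Drop one character to mimic a missing-character error."""
    i = rng.randrange(len(sentence))
    return sentence[:i] + sentence[i + 1:]


def swap_adjacent(sentence, rng):
    """Swap two neighbouring characters to mimic a word-order error."""
    i = rng.randrange(len(sentence) - 1)
    chars = list(sentence)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def duplicate_char(sentence, rng):
    """Repeat one character to mimic a redundant-character error."""
    i = rng.randrange(len(sentence))
    return sentence[:i] + sentence[i] + sentence[i:]


# Hypothetical corruption set; the thesis may use different error types.
CORRUPTIONS = (delete_char, swap_adjacent, duplicate_char)


def build_dataset(sentences, seed=0):
    """Pair each sentence (label 1) with one corrupted variant (label 0)."""
    rng = random.Random(seed)
    data = []
    for sentence in sentences:
        if len(sentence) < 2:
            continue  # too short to corrupt meaningfully
        data.append((sentence, 1))
        noisy = rng.choice(CORRUPTIONS)(sentence, rng)
        if noisy != sentence:  # swapping identical characters changes nothing
            data.append((noisy, 0))
    return data
```

Pairs produced this way could serve as training data for a binary sentence classifier such as the BERT-based one the abstract mentions; fixing the seed keeps the generated set reproducible.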

Parallel Keywords

Natural Language Processing; deep learning; article smoothness correction; error correction
