An Enhanced Chinese Question Answering System Based on a Fine-Tuned BERT Model

Advisor: 衛信文

Abstract


Natural language processing (NLP) is a major focus of modern computer science and artificial intelligence. Thanks to the improvement of hardware computing power, deep learning networks can now be trained on far larger amounts of data than before, allowing computers to process and analyze the enormous volume of human language data and making natural language processing methods and techniques increasingly mature. Common applications in this field include speech recognition, text summarization, machine translation, natural language generation, and sentiment analysis. Among natural language tasks, question answering (QA) is the task type that carries the richest semantic information in text; a well-trained question answering model can therefore be applied in many domains and is of great help to the development of natural language processing.

The mainstream architecture for natural language processing today is BERT (Bidirectional Encoder Representations from Transformers), released by Google in 2018. Google adopted two-stage training (pre-training followed by fine-tuning) and modified the input and output representations so that a single BERT model can handle most natural language tasks; at the time of its release it achieved the best results on eleven tasks. Since then, improving the efficiency of the BERT architecture has been a major focus of natural language research.

Because the BERT architecture is very large, with the standard 12-layer encoder already containing about 110 million parameters, analyzing the model is itself a challenge. Previous research found that BERT's encoder layers can be divided into three stages, shallow, middle, and deep, and that each stage contributes differently to a task: the shallow stage mainly encodes surface features, the middle stage encodes syntactic features, and the deep stage encodes semantic features. That work also showed that the shallow and deep layers emphasize different aspects, while different encoding stages can still perform similar work. Motivated by this interesting property, this thesis proposes a method suited to the Chinese language that improves the model's training effectiveness.

This thesis fine-tunes two BERT models on different data sets and then combines them to achieve better performance on the question answering task. The approach has two parts. First, two basic BERT models are fine-tuned separately with two different types of data sets, the DRCD-master data set and the MSRA data set. The model fine-tuned on DRCD-master matches the main direction of the question answering task, while the model fine-tuned on the MSRA data set is better at determining the part of speech of words. Then, on the premise of not harming the question answering model, encoder layers are exchanged between the two models so that the Chinese BERT model achieves better results.

The experimental results show that the proposed method, which trains a sequence tagging model and a question answering model and exchanges specific encoder layers between them, does not require more powerful hardware than the original approach and can improve the language model's capability within limited hardware resources.
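To make the layer-exchange step concrete, the following is a minimal sketch in Python using the Hugging Face Transformers library. The checkpoint paths ("./drcd_bert", "./msra_bert") and the particular layers being copied are illustrative assumptions, not the settings used in the thesis.

# A minimal sketch of the encoder-layer exchange, using Hugging Face Transformers.
# The checkpoint paths and the layers chosen for the swap are assumptions for
# illustration, not the thesis's exact configuration.
from transformers import BertForQuestionAnswering, BertForTokenClassification

# BERT fine-tuned on DRCD for extractive QA: the main stem of the final model.
qa_model = BertForQuestionAnswering.from_pretrained("./drcd_bert")
# BERT fine-tuned on MSRA for sequence tagging: captures word/part-of-speech cues.
tag_model = BertForTokenClassification.from_pretrained("./msra_bert")

# Copy selected encoder layers from the MSRA-tuned model into the DRCD-tuned
# model. Both models share the bert-base architecture, so the layer shapes
# match and only the weights change.
layers_to_swap = [0, 1, 2, 3]  # e.g. the "shallow" stage; chosen for illustration
for i in layers_to_swap:
    qa_model.bert.encoder.layer[i].load_state_dict(
        tag_model.bert.encoder.layer[i].state_dict()
    )

# The hybrid model is then evaluated on the DRCD development set as usual.
qa_model.save_pretrained("./drcd_bert_swapped")

Because only encoder weights are copied and the QA output head is left untouched, the swap does not change the model size or the hardware needed for inference.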

Abstract (English)


Natural language processing (NLP) is an important field of computer science and artificial intelligence. Thanks to the improvement of hardware computing power, deep learning networks can process more data than ever before, which allows computers to process and analyze large amounts of human language data and increases the capability of natural language processing. Common applications in the field include speech recognition, automatic summarization, machine translation, natural language generation, and sentiment analysis. The question answering (QA) task is one of the most important tasks in natural language processing, since it involves semantic understanding, semantic inference, and so on. A well-trained question answering model can be applied in many fields and is of great help to the development of natural language processing.

The mainstream architecture for natural language processing today is BERT (Bidirectional Encoder Representations from Transformers), released by Google in 2018. Google uses two-stage training (pre-training followed by fine-tuning) and modifies the representation of the input and output, so that the BERT model can handle most natural language tasks; at its release it obtained the best results on eleven tasks. Since then, how to improve the efficiency of BERT has been a major focus of natural language research. However, analyzing the BERT model is a difficult challenge because of its huge architecture: the standard 12-layer encoder already contains about 110 million parameters. Previous research found that the encoder layers of BERT can be divided into three stages, shallow, middle, and deep, and that each stage contributes differently to a task: the shallow stage is mainly responsible for surface feature encoding, the middle stage for syntactic feature encoding, and the deep stage for semantic feature encoding. Following that study, we observe that the shallow and deep layers emphasize different aspects and that different encoding stages can perform similar work. Based on these interesting properties, this thesis proposes a method to improve the training performance of the BERT model for Chinese.

This thesis fine-tunes two BERT models with two kinds of data sets and then combines them to achieve better results on the question answering task. The first step of the proposed method is to fine-tune a basic BERT model on each of two different types of data sets, the DRCD-master data set and the MSRA data set. The model fine-tuned on DRCD-master, called the DRCD-BERT model, is taken as the main stem of the question answering model. The model fine-tuned on the MSRA data set, called the MSRA-BERT model, can better determine the part of speech of a word. Then, on the premise of not affecting the question answering model, some encoder layers of the MSRA-BERT model replace the corresponding encoder layers of the DRCD-BERT model to achieve better performance on Chinese question answering.

The experimental results show that the proposed method, which trains a sequence tagging model and a question answering model and exchanges specific encoder layers between them, does not need more computing power and can improve the processing ability of the fine-tuned model with limited hardware resources.
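As a complement to the sketch above, the following illustrates the two fine-tuning setups assumed by the method: the same pre-trained Chinese BERT equipped with an extractive-QA head (for DRCD) and with a token-classification head (for MSRA sequence tagging). The label count, example inputs, and all other details are placeholders for illustration, not the thesis's actual configuration.

# A minimal sketch of the two model heads assumed in the proposed method,
# again using Hugging Face Transformers. num_labels and the example inputs
# are placeholders, not the thesis's actual settings.
import torch
from transformers import (
    BertTokenizerFast,
    BertForQuestionAnswering,
    BertForTokenClassification,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

# Extractive QA head: predicts answer start/end positions within the passage.
qa_model = BertForQuestionAnswering.from_pretrained("bert-base-chinese")

# Sequence-tagging head: one label per token (e.g. BIO tags over the MSRA
# entity types; 7 labels is an assumption for illustration). This model would
# be fine-tuned on the MSRA data set.
tag_model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=7
)

# QA inputs pack the question and the passage into one sequence separated by
# [SEP], which is exactly BERT's two-segment input format.
inputs = tokenizer("BERT是哪一年發布的?", "BERT由Google於2018年發布。",
                   return_tensors="pt")
with torch.no_grad():
    outputs = qa_model(**inputs)

# Before fine-tuning on DRCD the randomly initialized head gives meaningless
# spans; after fine-tuning, the argmax over the logits yields the answer span.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))

Both heads share the same 12-layer encoder, which is what makes the layer exchange between the two fine-tuned models possible without any change to the architecture.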

Keywords

BERT, NLP, Chinese Question Answering System
