
Low-Resource Chinese Machine Reading Comprehension Using A Transfer Learning Approach

Advisor: 魏志平

Abstract


With huge advances in language model pre-training technology, machine reading comprehension (MRC) has improved significantly and attracted considerable research attention in recent years. Most MRC research has focused on general-domain contexts in English, where training resources are relatively abundant. MRC research in low-resource domains remains underdeveloped, yet is in high demand in many specific domains, such as healthcare and biomedical applications. In addition, research on Chinese MRC is still at an early stage. The purpose of this research is to develop a feasible solution for low-resource machine reading comprehension. We employ a transfer learning approach based on a pre-training/fine-tuning language modeling structure and develop four domain adaptation methods: joint, cascade, adversarial, and multitask learning. These methods improve the effectiveness of MRC with the support of training data from other, resource-rich domains. Various experiments evaluate these methods under different conditions, including different scales of support from the auxiliary tasks, limited sizes of target data, and tests on different Chinese MRC benchmarks. According to our evaluation results, the effectiveness of MRC in specific domains with scarce training resources can be improved through the support of data from resource-rich domains, and multitask fine-tuning generally maximizes the support of the auxiliary data sets for low-resource MRC tasks, achieving the best MRC effectiveness.
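
As a rough illustration of the multitask learning variant (the method reported to perform best above), the sketch below shares one pre-trained Chinese encoder between the low-resource target MRC task and a high-resource auxiliary MRC task, giving each task its own span-prediction head and interleaving batches from the two domains. This is a minimal sketch, not the thesis implementation: the encoder name "bert-base-chinese", the batch field names, and the loader names in the usage comment are assumptions.

# Minimal multitask fine-tuning sketch (PyTorch + HuggingFace Transformers).
# A single pre-trained encoder is shared across the low-resource target MRC
# task and a high-resource auxiliary MRC task; auxiliary batches regularize
# the shared encoder. Loader and batch-field names are hypothetical.
import torch
from torch import nn
from transformers import AutoModel

class MultitaskMRC(nn.Module):
    def __init__(self, encoder_name="bert-base-chinese"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # shared parameters
        hidden = self.encoder.config.hidden_size
        # One span-prediction head (start/end logits) per task.
        self.heads = nn.ModuleDict({
            "target": nn.Linear(hidden, 2),
            "auxiliary": nn.Linear(hidden, 2),
        })

    def forward(self, task, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        logits = self.heads[task](out.last_hidden_state)  # (batch, seq, 2)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

def span_loss(start_logits, end_logits, start_pos, end_pos):
    # Standard extractive-MRC objective: cross-entropy over answer-span
    # start and end positions, averaged.
    ce = nn.CrossEntropyLoss()
    return (ce(start_logits, start_pos) + ce(end_logits, end_pos)) / 2

def train_step(model, optimizer, batch, task):
    model.train()
    start_logits, end_logits = model(task, batch["input_ids"], batch["attention_mask"])
    loss = span_loss(start_logits, end_logits,
                     batch["start_positions"], batch["end_positions"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (hypothetical loaders): alternate one target batch and one auxiliary
# batch per step, so gradients from both domains update the shared encoder.
#   for tgt_batch, aux_batch in zip(target_loader, auxiliary_loader):
#       train_step(model, optimizer, tgt_batch, task="target")
#       train_step(model, optimizer, aux_batch, task="auxiliary")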

