
Simulating Human Judgment on Dialogue System for the Automatic Evaluation of GPT-2 Emotional Dialogue Chatbot based on BERT Language Model

Advisor: 吳世弘

Abstract


With the advancement and popularization of artificial intelligence, chatbot systems built for conversation have been appearing in rapid succession. Most current research on dialogue generation focuses on the generation technology itself; few studies address the evaluation of the generated dialogue text. We argue that how to evaluate the text produced by a chat dialogue system is the more important issue.

After participating in the NTCIR-14 STC-3 CECG Chinese emotional dialogue generation shared task, we observed that the organizers' scoring procedure relied heavily on large-scale human evaluation. Although the organizers had a detailed set of scoring rules, grading every team's dialogue generation system still cost an enormous amount of time and labor. This motivated us to focus on the automatic evaluation of generated text.

Generally speaking, in natural language processing it is difficult to build an automatic evaluation system that can effectively assess the text produced by a dialogue generation system. At present, dialogue evaluation still depends heavily on human judgment, and the resulting scores cannot be reused to evaluate other generative models. We propose to decide which of two dialogue generation systems is better by comparing them in the style of an A/B test, and to automate this comparison with machine learning. In this thesis, we present a machine learning method that learns human judgments comparing two dialogue systems, thereby reducing the workload of manual evaluation. Trained on a small number of human-labeled results, the evaluation model learns under which conditions which generation model performs better. It can therefore be used during system development, letting developers decide which model is better and fine-tune it without manual evaluation, or letting contest organizers decide which dialogue generation system wins.

In our experiments, we used BERT as the automatic evaluation model and found that the system achieves good automatic evaluation accuracy even with a small training set. We also found that, after learning human judgments, the model can be applied repeatedly to evaluate different models, achieving automatic evaluation accuracy above 50% each time. The experiments evaluate four dialogue generation models: two Seq2Seq GRU emotional dialogue generation models and two GPT-2 dialogue generation models.
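As a concrete illustration of the evaluation model described above, the following Python sketch shows how BERT could be set up as a pairwise preference classifier that, given a post and the replies of two dialogue systems, predicts which reply a human judge would prefer. This is a minimal sketch using the Hugging Face transformers API; the checkpoint name, input packing, and helper names are illustrative assumptions, not the thesis's exact configuration.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

MODEL_NAME = "bert-base-chinese"  # assumption: a Chinese BERT checkpoint
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def encode_pair(post: str, reply_a: str, reply_b: str):
    # Pack the post as segment 1 and the two candidate replies, joined by
    # [SEP], as segment 2. Label 0 means system A wins, 1 means system B wins.
    # This input format is an assumption, not the thesis's exact packing.
    return tokenizer(post, reply_a + " [SEP] " + reply_b,
                     truncation=True, max_length=256,
                     padding="max_length", return_tensors="pt")

@torch.no_grad()
def predict_winner(post: str, reply_a: str, reply_b: str) -> str:
    # Run the (fine-tuned) classifier and return the predicted winning system.
    model.eval()
    logits = model(**encode_pair(post, reply_a, reply_b)).logits
    return "A" if logits.argmax(dim=-1).item() == 0 else "B"

In this setup, fine-tuning on a small set of human A/B labels is an ordinary sequence-classification task, which matches the abstract's claim that only a small amount of labeled data is needed.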

Parallel Abstract


With the advancement and popularization of artificial intelligence, chatbot systems for conversation have sprung up throughout society. Most current research on dialogue generation focuses on the generation technology itself, and few studies address the evaluation of the generated dialogue text; we believe that how to evaluate this text is the more important issue. Generally speaking, in natural language processing it is difficult to build an automatic evaluation system that can effectively assess the text produced by a dialogue generation system. At present, dialogue evaluation still relies heavily on human judgment, and its results cannot be extended to evaluate other generative models. We propose to compare two dialogue generation systems using the concept of an A/B test to determine which system is better, and to use machine learning to automate this evaluation. In this paper, we present a machine learning method that learns human judgments comparing two dialogue systems in order to reduce the workload of manual evaluation. Trained on a small number of human-labeled results, the evaluation model can learn which generation model performs better under which conditions. It can therefore be used in system development, allowing users to decide which model is better and fine-tune it without manual evaluation, or allowing contest organizers to decide which dialogue system wins. In our experiments, we used BERT as the automatic evaluation model and found that the system achieves good automatic evaluation accuracy with a small training set; after learning human judgments, the model can evaluate different systems repeatedly with an accuracy of more than 50% each time. The experiments are conducted on four dialogue generation models: two Seq2Seq GRU emotional dialogue generation models and two GPT-2 dialogue generation models.
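The automatic evaluation accuracy reported above can be read as agreement with human A/B labels on held-out data. The sketch below shows one plausible way to compute it, reusing the hypothetical predict_winner helper from the previous sketch; the tuple format and the example data are assumptions for illustration only.

def ab_test_accuracy(examples):
    # examples: iterable of (post, reply_a, reply_b, human_label) tuples,
    # where human_label is "A" or "B" as judged by a human annotator.
    correct = total = 0
    for post, reply_a, reply_b, human_label in examples:
        if predict_winner(post, reply_a, reply_b) == human_label:
            correct += 1
        total += 1
    return correct / total if total else 0.0

# Hypothetical usage: system A is a GPT-2 chatbot, system B a Seq2Seq GRU chatbot.
held_out = [
    ("今天心情不好", "抱抱，發生什麼事了？", "我不知道", "A"),
]
print(f"Agreement with human judges: {ab_test_accuracy(held_out):.1%}")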
