
聊天機器人對話中的情緒表現之評估方法初探

A Study on the Evaluation Method of the Emotional Expression in Chatbot Dialogue

Advisor: Shih-Hung Wu (吳世弘)

Abstract


With the advancement and spread of technology, instant messaging software has changed the way people communicate, and chatbots continue to evolve. Most current research on chatbots is technically oriented, and studies on evaluation are comparatively rare. How to assess the degree to which a chatbot's replies succeed along different aspects is therefore an important issue.

There are many evaluation methods, including automatic scoring and human scoring. Although automatic scoring can quickly assess the quality of a dialogue system, in many cases properties of the data cause automatic scores to be unreliable. Human scoring avoids this problem, so incorporating human judgement makes the evaluation more effective. This study adopts human scoring to evaluate chatbot dialogue systems.

We observe that in the Short Text Conversation (STC) task of NTCIR, the official evaluation method uses generic rules for scoring replies, so it cannot assess how well a reply performs along particular aspects. Designing an evaluation that can effectively assess a dialogue system is the core of this study. We designed a new questionnaire-design process, analyzed the official evaluation method, identified multi-aspect factors from the task data, and added different multi-aspect evaluation questions for different question-and-answer types. We use consistency (agreement) scores to judge whether the designed evaluation questions are sound.

This study focuses on dialogues of the emotional-venting type, filtered from the task data for analysis and processing. We found that replies to such utterances involve several aspects, including sympathy, comfort, and ridicule, so evaluating reply content should take these aspects into account. We combined the official evaluation questions with these aspects into an evaluation questionnaire and conducted a survey to verify whether each aspect question easily reaches consistent opinions. The survey results show that the evaluation questions we designed obtain higher agreement. The proposed questionnaire-design process can therefore produce a questionnaire that evaluates a dialogue system along multiple aspects, and the process is also applicable to other corpora.
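The abstract does not specify which consistency measure is used to judge the evaluation questions; Fleiss' kappa is one standard choice for chance-corrected agreement among multiple annotators. A minimal sketch, assuming each annotator assigns exactly one category (e.g. sympathy / comfort / ridicule) to each reply — the example data below is hypothetical:

```python
# Fleiss' kappa: chance-corrected agreement among m raters who each
# assign one of k categories to every item.
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who put item i in category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])          # raters per item (assumed constant)
    n_cats = len(ratings[0])

    # Proportion of all assignments falling in each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_cats)]

    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [(sum(c * c for c in row) - n_raters)
              / (n_raters * (n_raters - 1)) for row in ratings]

    p_bar = sum(p_item) / n_items        # observed agreement
    p_exp = sum(p * p for p in p_cat)    # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)

# Hypothetical example: 4 annotators label 5 chatbot replies with one of
# the aspect categories [sympathy, comfort, ridicule].
counts = [
    [4, 0, 0],
    [3, 1, 0],
    [0, 4, 0],
    [0, 3, 1],
    [1, 1, 2],
]
print(round(fleiss_kappa(counts), 3))  # ≈ 0.404
```

Kappa near 1 indicates the annotators easily reach consistent opinions on a question; kappa near 0 means agreement is no better than chance, suggesting the question is ambiguous.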

English Abstract


With the advancement and growing popularity of technology, instant messaging has changed the way people communicate, and chatbot technology is on the rise. Most studies of chatbots focus on the engineering aspect, that is, how to build a better chatbot; research on evaluation is scarce. However, it is equally important to evaluate how well the responses produced by a chatbot perform along different aspects. There are two major evaluation approaches: automatic evaluation and human judgement. Although automatic evaluation can score a dialogue system quickly, in many cases it correlates poorly with human judgement; human evaluation avoids this problem and yields more reliable results. We therefore adopt human judgement in this research. The evaluation rules of the STC task at NTCIR are designed as a generic reply-scoring method, so they cannot measure the degree of expression of each aspect of a response. Designing an evaluation method that can effectively judge a dialogue system is our research objective. By observing and analyzing the evaluation method of the NTCIR STC task, we designed a questionnaire-design process: we identify factors corresponding to different aspects, construct aspect-specific evaluation questions for different question-and-answer types, and use consistency to judge the quality of the evaluation questions. In the experiments, we selected and analyzed dialogue data of the emotional-expression type, in which the speaker is typically venting emotions. Because people may react with different attitudes, the responses contain aspects such as sympathy, comfort, and ridicule, and an evaluation method needs to take these aspects into account. We designed a new evaluation questionnaire covering these aspects and verified whether each aspect easily achieves consistency.
The questionnaire survey results show that the aspect-specific questions achieve consistency easily. Therefore, the proposed questionnaire-design process can produce a new questionnaire that evaluates a dialogue system along multiple aspects, and it is also applicable to other corpora.
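As background to the automatic-versus-human point above, word-overlap metrics (the family BLEU [3] belongs to, whose weak correlation with human judgement is the subject of [4]) score a response by its n-gram overlap with a reference reply. A tiny illustration, not part of the thesis, using a hypothetical unigram-F1 scorer: a perfectly acceptable response phrased in different words gets a near-zero score.

```python
# Unigram F1: a simplified word-overlap metric in the spirit of BLEU.
# It rewards sharing words with the reference reply, so an apt response
# phrased differently is penalized -- one reason automatic scores can
# correlate poorly with human judgement of dialogue.
def unigram_f1(response, reference):
    resp, ref = set(response.split()), set(reference.split())
    overlap = len(resp & ref)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(resp), overlap / len(ref)
    return 2 * prec * rec / (prec + rec)

reference = "i am so sorry to hear that"
good_but_different = "that sounds really tough hang in there"  # apt reply
near_copy = "so sorry to hear about that"                      # echoes reference

print(unigram_f1(good_but_different, reference))  # low, despite being apt
print(unigram_f1(near_copy, reference))           # high
```

A human judge would rate both replies as appropriate for a venting-type utterance, but the overlap metric ranks them very differently, which motivates the human, aspect-based evaluation adopted in this study.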

References


[1] Lifeng Shang, Tetsuya Sakai, Zhengdong Lu, Hang Li, Ryuichiro Higashinaka, and Yusuke Miyao, "Overview of the NTCIR-12 Short Text Conversation Task," in Proceedings of NTCIR-12, 2016, pp. 473-484.
[2] Lifeng Shang, Tetsuya Sakai, Hang Li, Ryuichiro Higashinaka, Yusuke Miyao, Yuki Arase, and Masako Nomoto, "Overview of the NTCIR-13 Short Text Conversation Task," in Proceedings of NTCIR-13, 2017, pp. 194-210.
[3] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, "BLEU: a Method for Automatic Evaluation of Machine Translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002, pp. 311-318.
[4] Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau, "How NOT to Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation," arXiv preprint arXiv:1603.08023, 2016.
[5] Shih-Hung Wu, Wen-Feng Shih, Liang-Pu Chen, and Ping-Che Yang, "CYUT Short Text Conversation System for NTCIR-12 STC," in Proceedings of NTCIR-12, June 7-10, 2016, Tokyo, Japan, pp. 541-546.
