Frequently asked questions (FAQ) are the questions customers most often ask in business settings; this thesis builds a chatbot that answers such FAQs effectively. First, because the answers to questions often change over time, and to keep the corpus stable and the model accurate, we recast FAQ answering as retrieving the best-matching standard question from a set of candidates. We initially used term frequency-inverse document frequency (TF-IDF) as the chatbot's retrieval criterion, but found that TF-IDF cannot recognize the different queries that customers produce for the same standard question. We therefore proposed using BERT to improve the model's grasp of question semantics, and explored fine-tuning BERT under different comparison modes; our results surpass the conventional approach of using BERT to classify queries directly. We also compared text classification with BERT, cross-encoder BERT, and Siamese BERT: on a small dataset such as the company FAQ, accuracy rose from 74.20% (text classification with BERT) and 74.50% (Siamese BERT) to 81.00% (cross-encoder BERT), whereas on a large dataset such as Yahoo! Answers, text classification with BERT achieved the highest accuracy. Finally, we applied different data augmentation methods; both reverse pair and Traditional-to-Simplified Chinese conversion improved the accuracy of cross-encoder BERT.
Frequently asked questions (FAQ) are the questions customers ask most frequently in business scenarios, and this paper builds a chatbot that can answer them effectively. First of all, the answers to questions often change over time; for the stability of the corpus and the accuracy of model prediction, we framed FAQ answering as retrieving the best-matching standard question from a set of candidates. We first used term frequency-inverse document frequency (TF-IDF) as the basis for the chatbot to retrieve matching candidates, but found that TF-IDF cannot identify the different test questions (queries) that customers generate for the same standard question. We therefore proposed using BERT to improve the model's ability to capture question semantics, and explored fine-tuning BERT under different comparison modes; the results surpass the conventional approach of using BERT for query text classification. Comparing text classification with BERT, cross-encoder BERT, and Siamese BERT, on a small dataset such as the company's FAQ set the accuracy increased from 74.20% for text classification with BERT and 74.50% for Siamese BERT to 81.00% for cross-encoder BERT, whereas on a large dataset such as Yahoo! Answers, text classification with BERT achieved the highest accuracy. In addition, we applied different data augmentation methods; both reverse pair and Traditional-to-Simplified Chinese conversion improved the accuracy of cross-encoder BERT.
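The TF-IDF retrieval step described above can be sketched as follows: each candidate FAQ question is turned into a TF-IDF vector, the query is vectorised with the same IDF weights, and the candidate with the highest cosine similarity is returned. This is a minimal pure-Python sketch, not the thesis implementation; the tokenisation (whitespace, lowercase), the smoothed IDF formula, and the example FAQ strings are all illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build TF-IDF vectors (sparse dicts) for whitespace-tokenised documents."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter()                        # document frequency per term
    for toks in tokenised:
        df.update(set(toks))
    n = len(docs)
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}   # smoothed IDF (assumed variant)
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenised], idf


def cosine(u, v):
    """Cosine similarity between two sparse term->weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve(query, faq_questions):
    """Return the index of the FAQ question most similar to the query."""
    vectors, idf = tfidf_vectors(faq_questions)
    tf = Counter(query.lower().split())
    qvec = {t: tf[t] * idf[t] for t in tf if t in idf}  # terms unseen in the corpus are dropped
    scores = [cosine(qvec, v) for v in vectors]
    return max(range(len(scores)), key=scores.__getitem__)


# Illustrative candidate set, not the company FAQ corpus.
faqs = [
    "how do i reset my password",
    "what are the shipping fees",
    "how can i cancel my order",
]
print(retrieve("i forgot my password how to reset it", faqs))  # → 0
```

The sketch also shows the weakness the abstract reports: a paraphrase sharing no terms with the standard question (e.g. "my login code no longer works") scores zero against every candidate, which is what motivates the move to BERT-based semantic matching.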
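The reverse-pair augmentation mentioned above can be sketched as: for each labelled (query, standard question) sentence pair fed to cross-encoder BERT, the same pair in swapped order is added as an extra training example, doubling the pair data. A minimal sketch; the helper name, tuple layout, and example pairs are illustrative assumptions, not the thesis code.

```python
def augment_reverse_pairs(pairs):
    """Reverse-pair augmentation: for each (sentence_a, sentence_b, label)
    training example, also emit (sentence_b, sentence_a, label)."""
    return pairs + [(b, a, y) for (a, b, y) in pairs]


# Illustrative training pairs: 1 = query matches the standard question, 0 = it does not.
train = [
    ("forgot my password", "how do i reset my password", 1),
    ("forgot my password", "what are the shipping fees", 0),
]
for example in augment_reverse_pairs(train):
    print(example)
```

Because a cross-encoder reads both sentences in one input sequence, segment order matters to the model, so the swapped copies expose it to both orderings of the same semantic pair.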