Author (English): Chiu, Ching-Ya
Thesis title (English): Comparison of Applying Deep Learning Methods on Misinformation Detection
English keywords: Natural Language Processing (NLP); BERT; CKIP-BERT; RoBERTa; Chinese text classification; Online rumor identification
With the rise of new media and the rapid growth of the Internet, information circulates faster than ever, but so do online rumors laced with misinformation, which spread quickly across social media. The general public, and the elderly in particular, find it difficult to tell whether a rumor is true, and during the COVID-19 pandemic, believing false rumors may cause public panic and serious consequences. Many official and private fact-checking platforms now exist, such as the Taiwan Centers for Disease Control Clarification Zone, Cofacts, and the Taiwan FactCheck Center, which publish verification results for suspicious messages on their websites so that the public can check their authenticity. However, purely manual verification consumes a great deal of manpower and time, and debunking cannot keep up with the speed at which online rumors are forwarded between chat groups.
This study therefore uses the Cofacts open-source database as its Chinese text corpus and fine-tunes the Google BERT, CKIP-BERT, and RoBERTa pre-trained models to classify online rumors as "true information" or "false information". According to the model evaluation metrics, all three models reach an average accuracy of 85%, that is, they correctly judge the authenticity of 85% of messages, with RoBERTa showing the best classification performance. These results indicate that the Google BERT, CKIP-BERT, and RoBERTa pre-trained models classify the online rumor dataset collected in this study effectively.
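As an illustration of what the 85% accuracy figure measures, the following minimal sketch computes accuracy as the fraction of messages whose predicted label matches the reference label. The function name `accuracy`, the label encoding (1 for true information, 0 for false information), and the example labels are all assumptions made for illustration; they are not data or code from the thesis.

```python
def accuracy(y_true, y_pred):
    """Fraction of messages whose predicted label matches the reference label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels (not from the thesis): 1 = true information, 0 = false information.
reference = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# 17 of the 20 hypothetical messages are classified correctly.
score = accuracy(reference, predicted)
print(f"accuracy = {score:.2f}")
```

Accuracy alone can hide class imbalance, so a binary classifier of this kind is often also described with precision, recall, and F1; whether the thesis reports these is not stated in the abstract.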
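Table of Contents
List of Tables
List of Figures
Chapter 1. Introduction
Section 1. Research Background and Motivation
Section 2. Research Objectives
Section 3. Research Process
Chapter 2. Literature Review
Section 1. Natural Language Processing
Section 2. Attention Mechanism
Section 3. The Transformer Model
Section 4. Google BERT
Section 5. RoBERTa
Section 6. Literature Review of Chinese Text Classification
Chapter 3. Research Methods
Section 1. Research Framework
Section 2. Data Collection and Preprocessing
Section 3. Analysis Models
Chapter 4. Empirical Analysis
Section 1. Data Filtering Process
Section 2. Model Performance Analysis
Chapter 5. Conclusions and Recommendations
Section 1. Conclusions
Section 2. Future Research Directions and Recommendations
References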
