
基於機器學習與深度學習技術檢測虛假資訊

Detecting Misinformation Using Machine Learning and Deep Learning Techniques

Advisor: 洪智力
The full text will be available for download on 2026/07/23.

Abstract


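The rapid spread of misinformation has become a hard problem on today's social media and news platforms. This study classifies text with three machine learning methods, Random Forest (RF), the Naive Bayes classifier (NB), and eXtreme Gradient Boosting (XGBoost), and three deep learning methods, the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and the Gated Recurrent Unit (GRU). It attempts to detect misinformation in two stages. The first stage uses Conditional Probability (CP) and Pointwise Mutual Information (PMI) to collect useful features, textual features, and latent relationships between true and false information, which together form a knowledge base. The second stage combines the knowledge base with features computed from the original dataset, such as article statistics and part-of-speech analysis, and feeds them all into model training. Finally, 5-fold cross-validation is used to generalize the models, which are evaluated by Accuracy, Precision, Recall, and AUC (Area Under Curve).

On the YelpCHI dataset, the best machine learning result comes from the Word2Vec Skip-gram (All Feature) experiment combined with the XGBoost model, with an AUC of 92.50%; the best deep learning result comes from the Word2Vec CBOW (All Feature) experiment combined with the GRU model, with an AUC of 92.02%. On the Kdnuggets real or fake news dataset, the best machine learning result comes from the fastText Skip-gram (All Feature) experiment combined with the XGBoost algorithm, with an AUC of 99.95%; the best deep learning result comes from the Word2Vec CBOW (100+POS+PMI) experiment combined with the GRU model, with an AUC of 99.83%. Compared with the baseline, the CP and PMI methods used in this study yield a significant improvement.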
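The abstract does not say which events CP and PMI are computed over. As a minimal sketch, assuming they relate a word to a class label and are counted at the document level, the first-stage statistics might look like the following. The function name `word_class_stats` and the toy data are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def word_class_stats(docs, labels):
    """Return {(word, label): (cp, pmi)} from tokenized documents.

    Assumption: a word counts at most once per document, so
    cp  = P(label | word) and
    pmi = log2( P(word, label) / (P(word) * P(label)) ).
    Only (word, label) pairs that actually co-occur are returned.
    """
    n_docs = len(docs)
    label_count = Counter(labels)
    word_count = Counter()
    joint_count = Counter()
    for tokens, label in zip(docs, labels):
        for word in set(tokens):  # document-level presence, not term frequency
            word_count[word] += 1
            joint_count[(word, label)] += 1

    stats = {}
    for (word, label), n_joint in joint_count.items():
        cp = n_joint / word_count[word]
        p_joint = n_joint / n_docs
        p_word = word_count[word] / n_docs
        p_label = label_count[label] / n_docs
        stats[(word, label)] = (cp, math.log2(p_joint / (p_word * p_label)))
    return stats


# Toy corpus, for illustration only.
docs = [["great", "food"], ["great", "scam"], ["scam", "fake"], ["food", "nice"]]
labels = ["real", "fake", "fake", "real"]
stats = word_class_stats(docs, labels)
```

In this toy corpus "scam" appears only in documents labeled fake, so its conditional probability for that label is 1 and its PMI is positive, while "great" is split evenly between the labels and has a PMI of 0. Scores of this kind could serve as entries in a knowledge base such as the one the abstract describes.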

Parallel Abstract


The rapid spread of false information has become a significant problem on social media and news platforms. In this study, we classify text with three machine learning methods, Random Forest (RF), the Naive Bayes classifier (NB), and eXtreme Gradient Boosting (XGBoost), and three deep learning methods, the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and the Gated Recurrent Unit (GRU). Our approach detects false information in two stages. In the first stage, we use Conditional Probability (CP) and Pointwise Mutual Information (PMI) to collect beneficial features, textual characteristics, and potential relationships between true and false information, which together form a knowledge base. In the second stage, we combine this knowledge base with statistical features of the original dataset, such as article features and part-of-speech analysis, and use them together as input for model training. Finally, we generalize the models using 5-fold cross-validation and evaluate them with Accuracy, Precision, Recall, and Area Under the Curve (AUC).

On the YelpCHI dataset, the best machine learning result was achieved by the Word2Vec Skip-gram (All Feature) experiment combined with the XGBoost model, with an AUC of 92.50%. The best deep learning result was achieved by the Word2Vec CBOW (All Feature) experiment combined with the GRU model, with an AUC of 92.02%. On the Kdnuggets real or fake news dataset, the best machine learning result was achieved by the fastText Skip-gram (All Feature) experiment combined with the XGBoost algorithm, with an AUC of 99.95%. The best deep learning result was achieved by the Word2Vec CBOW (100+POS+PMI) experiment combined with the GRU model, with an AUC of 99.83%. The experimental results show that using CP and PMI yields a significant improvement over the baseline.
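The evaluation protocol named above (5-fold cross-validation scored by Accuracy, Precision, Recall, and AUC) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the thesis's pipeline: labels are binary with 1 as the positive class, folds are contiguous with no shuffling or stratification, and `train_fn` is a hypothetical callable that returns a scoring function. In practice a library such as scikit-learn would typically handle these steps.

```python
def k_fold_indices(n, k=5):
    """Yield (train_idx, test_idx) for k contiguous folds over n samples."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


def accuracy_precision_recall(y_true, y_pred):
    """Binary metrics with 1 as the positive class; empty denominators give 0.0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def auc(y_true, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_validate(X, y, train_fn, k=5, threshold=0.5):
    """Average the four metrics over k folds.

    train_fn(X_train, y_train) is a hypothetical hook that must return
    a function mapping one sample to a score in [0, 1].
    """
    totals = {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "auc": 0.0}
    for train_idx, test_idx in k_fold_indices(len(X), k):
        score_fn = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        y_test = [y[i] for i in test_idx]
        scores = [score_fn(X[i]) for i in test_idx]
        preds = [1 if s >= threshold else 0 for s in scores]
        acc, prec, rec = accuracy_precision_recall(y_test, preds)
        totals["accuracy"] += acc
        totals["precision"] += prec
        totals["recall"] += rec
        totals["auc"] += auc(y_test, scores)
    return {name: total / k for name, total in totals.items()}
```

Because the folds are contiguous, each test fold must contain both classes for AUC to be defined. A shuffled or stratified split would avoid that constraint on real data.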
