Detecting False Information Using Machine Learning and Deep Learning Techniques

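The rapid spread of false information has become a serious problem on today's social media and news platforms. This study classifies text using three machine learning methods, Random Forest (RF), the Naive Bayes classifier (NB), and eXtreme Gradient Boosting (XGBoost), and three deep learning methods, the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and the Gated Recurrent Unit (GRU). False information is detected in two stages. In the first stage, Conditional Probability (CP) and Pointwise Mutual Information (PMI) are used to collect informative features, textual features, and the latent relationships between true and false information, which together form a knowledge base. In the second stage, the knowledge base is combined with article-level statistics and part-of-speech features computed from the original dataset, and all of these are fed into model training. Finally, 5-fold cross-validation is used to assess how well the models generalize, and the models are evaluated by accuracy, precision, recall, and area under the curve (AUC). On the YelpCHI dataset, the best machine learning result comes from the Word2Vec Skip-gram (All Feature) setting combined with XGBoost, with an AUC of 92.50%; the best deep learning result comes from the Word2Vec CBOW (All Feature) setting combined with GRU, with an AUC of 92.02%. On the Kdnuggets real or fake news dataset, the best machine learning result comes from the fastText Skip-gram (All Feature) setting combined with XGBoost, with an AUC of 99.95%; the best deep learning result comes from the Word2Vec CBOW (100+POS+PMI) setting combined with GRU, with an AUC of 99.83%. Compared with the baseline, the CP and PMI approach used in this study yields a significant improvement.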

Keywords

False information detection; conditional probability; pointwise mutual information; machine learning; deep learning

Parallel Abstract

The rapid spread of false information has become a significant problem on social media and news platforms. In this study, we classify text with three machine learning methods, Random Forest (RF), the Naive Bayes classifier (NB), and eXtreme Gradient Boosting (XGBoost), and three deep learning methods, the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and the Gated Recurrent Unit (GRU). We detect false information through a two-stage process. In the first stage, we use Conditional Probability (CP) and Pointwise Mutual Information (PMI) to collect informative features, textual characteristics, and potential relationships between true and false information, which together form a knowledge base. In the second stage, we combine this knowledge base with features computed from the original dataset, such as article statistics and part-of-speech analysis, and use them together to train the models. Finally, we apply 5-fold cross-validation to assess how well the models generalize and evaluate them using accuracy, precision, recall, and Area Under the Curve (AUC). On the YelpCHI dataset, the best machine learning result was achieved by the Word2Vec Skip-gram (All Feature) experiment combined with the XGBoost model, with an AUC of 92.50%; the best deep learning result was achieved by the Word2Vec CBOW (All Feature) experiment combined with the GRU model, with an AUC of 92.02%. On the Kdnuggets real or fake news dataset, the best machine learning result was achieved by the fastText Skip-gram (All Feature) experiment combined with the XGBoost algorithm, with an AUC of 99.95%; the best deep learning result was achieved by the Word2Vec CBOW (100+POS+PMI) experiment combined with the GRU model, with an AUC of 99.83%. Compared with the baseline, the CP and PMI methods used in this study yield a significant improvement.
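
The first-stage CP and PMI scores can be illustrated with a minimal sketch. The abstract does not give the exact definitions used in the study, so the following assumes document-level counts, where CP is P(class | token) and PMI is log2 of P(token, class) / (P(token) P(class)). The function name `cp_and_pmi`, the toy documents, and the labels "fake" and "real" are hypothetical and are not taken from the YelpCHI or Kdnuggets data.

```python
import math
from collections import Counter

def cp_and_pmi(docs, labels, target="fake"):
    """Return {token: (CP, PMI)} for the target class.

    Assumption: probabilities are estimated from document-level counts,
    i.e. a token is counted once per document that contains it.
    """
    n_docs = len(docs)
    n_target = sum(1 for y in labels if y == target)
    doc_freq = Counter()    # documents containing the token
    joint_freq = Counter()  # target-class documents containing the token
    for tokens, y in zip(docs, labels):
        for tok in set(tokens):
            doc_freq[tok] += 1
            if y == target:
                joint_freq[tok] += 1

    p_class = n_target / n_docs
    scores = {}
    for tok, df in doc_freq.items():
        cp = joint_freq[tok] / df  # P(class | token)
        # PMI = log2(P(token, class) / (P(token) * P(class)))
        #     = log2(P(class | token) / P(class))
        pmi = math.log2(cp / p_class) if cp > 0 else float("-inf")
        scores[tok] = (cp, pmi)
    return scores

# Toy corpus (hypothetical): two "fake" and two "real" documents.
docs = [
    ["great", "deal", "free"],
    ["free", "prize"],
    ["great", "food"],
    ["food", "service"],
]
labels = ["fake", "fake", "real", "real"]
scores = cp_and_pmi(docs, labels, target="fake")
for tok in sorted(scores):
    print(tok, scores[tok])
```

In this toy corpus, "free" appears only in documents labeled "fake", so it receives a CP of 1.0 and a positive PMI, while "great" appears equally in both classes and receives a PMI of 0. Scores of this kind could be stored per token as one possible form of the knowledge base that the second stage combines with the other features.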