運用意見探勘技術於類別不平衡資料集

Using opinion mining techniques for class imbalanced datasets

Advisor: 洪智力

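Abstract


As information technology has flourished, a growing number of machine learning applications train models on large volumes of data. The class imbalance problem has been observed in the medical field, where imbalanced structured data prevents trained models from classifying correctly, and sampling techniques have been used to address it. This study examines how well sampling techniques improve results on imbalanced unstructured data. Opinion mining techniques are applied to a movie review dataset, from which imbalanced datasets of different ratios are constructed. Documents are represented with a TF-IDF vector model, and the imbalance is mitigated with the oversampling techniques SMOTE, Borderline SMOTE, and ADASYN and the undersampling techniques RandomUnderSampler, ClusterCentroids, and NearMiss. SVM is used as the classification algorithm, and the improvement is assessed with AUC, precision, recall, and F1-score. The experiments show that all six sampling techniques improve results on the imbalanced datasets at every ratio tested. Borderline SMOTE performs best, with a highest AUC of 0.97; SMOTE and ADASYN reach 0.95; and the three undersampling techniques fall between 0.73 and 0.90. Although the sampling techniques reduce the effect of the imbalance, the resulting models are no better than a model trained on the balanced dataset. For opinion mining, these findings suggest that the cost of collecting minority-class samples can be reduced, and the approach could be applied in the future to unstructured datasets in other industries.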
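
To make the oversampling idea concrete, the following is a minimal sketch of the interpolation step at the core of SMOTE, written with only the Python standard library. It is not the implementation used in the thesis. The function name `smote`, its parameters, and the toy two-dimensional vectors (standing in for TF-IDF rows) are assumptions chosen for illustration, and the sketch assumes at least two minority samples.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating between
    a minority sample and one of its k nearest minority neighbours.

    Hypothetical helper for illustration; requires len(minority) >= 2.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy dense vectors standing in for TF-IDF rows of the minority class.
minority = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
new_points = smote(minority, n_new=6, k=2)
```

Each synthetic point lies on a line segment between two existing minority samples, so the minority class is enlarged without duplicating reviews exactly. Borderline SMOTE and ADASYN differ mainly in which base samples they pick, which this sketch does not model.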

Keywords

opinion mining; class imbalance; TF-IDF; SMOTE; SVM; AUC

Parallel Abstract


With the rapid development of information technology, more and more machine learning applications use large amounts of data to train models. The class imbalance problem has been observed in the medical field, where imbalanced structured data causes trained models to classify incorrectly; sampling techniques are used to address this problem. This study uses sampling techniques to explore how much they improve results on imbalanced unstructured data, applying opinion mining techniques to a movie review dataset to create imbalanced datasets of different ratios. A TF-IDF vector model is combined with the oversampling techniques SMOTE, Borderline SMOTE, and ADASYN and the undersampling techniques RandomUnderSampler, ClusterCentroids, and NearMiss to reduce the data imbalance. SVM is used as the classification algorithm to train the model, and the improvement is evaluated with validation metrics including AUC, precision, recall, and F1-score. The experimental results show that all six sampling techniques improve performance on the imbalanced datasets of different ratios. Among them, Borderline SMOTE performs best, with a highest AUC of 0.97; SMOTE and ADASYN reach 0.95; and the three undersampling techniques range from 0.73 to 0.90. Although the imbalance is mitigated, the trained models are no better than the model trained on the balanced dataset. For the field of opinion mining, this study finds that the cost of collecting minority-class samples can be reduced, and the approach can be applied in the future to unstructured datasets in other industries.
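
The validation metrics named above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the evaluation code from the study. The function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions. AUC is computed here through its rank interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as half.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def auc(y_true, scores, positive=1):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Assumes both classes are present in y_true.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced example: 2 minority (positive) and 6 majority samples.
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1, 0.3, 0.2, 0.7, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc_value = auc(y_true, scores)
```

Because precision, recall, and F1 are computed for the minority class and AUC does not depend on a single threshold, these metrics are less easily inflated by a majority-class-only classifier than plain accuracy, which is one reason they suit imbalanced data.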

Parallel Keywords

opinion mining; class imbalance; TF-IDF; SMOTE; SVM; AUC

