As information technology develops rapidly, a growing number of machine learning applications train models on large amounts of data. Class imbalance has been observed in the medical field, where imbalanced structured data prevents trained models from classifying correctly, and sampling techniques have been used to address the problem. This study examines how much sampling techniques improve results on imbalanced unstructured data. Applying opinion mining to a movie review dataset, we construct imbalanced datasets at several class ratios. The reviews are represented with a TF-IDF vector model, and the imbalance is treated with three oversampling techniques (SMOTE, Borderline SMOTE, ADASYN) and three undersampling techniques (RandomUnderSampler, ClusterCentroids, NearMiss). An SVM classifier is trained on the resampled data, and the improvement is evaluated with AUC, precision, recall, and F1-score. The experimental results show that all six sampling techniques improve performance on the imbalanced datasets at every ratio tested. Borderline SMOTE performs best, with a highest AUC of 0.97; SMOTE and ADASYN reach 0.95, and the three undersampling techniques fall between 0.73 and 0.90. Although sampling reduces the effect of imbalance, the resulting models are not better than a model trained on the balanced dataset. For the field of opinion mining, these findings suggest that the cost of collecting minority-class samples can be reduced, and the approach can be applied in the future to unstructured datasets in other industries.
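To make the oversampling idea concrete, the following is a minimal sketch of SMOTE-style interpolation: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. It is not the implementation used in this study. The function name `smote_oversample`, the parameter values, and the two-dimensional toy vectors (standing in for TF-IDF rows) are all assumptions made for illustration.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority vectors by SMOTE-style interpolation.

    Hypothetical helper for illustration; requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy feature vectors standing in for TF-IDF rows (hypothetical data).
majority = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.7, 0.3],
            [0.95, 0.05], [0.75, 0.25], [0.8, 0.1], [0.9, 0.2]]
minority = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.7], [0.3, 0.85]]

# Generate enough synthetic samples to match the majority class size.
new_points = smote_oversample(minority, len(majority) - len(minority))
balanced_minority = minority + new_points
```

After resampling, both classes have the same number of samples, and every synthetic point lies between existing minority points rather than duplicating them. Borderline SMOTE and ADASYN follow the same interpolation idea but choose the base samples differently, concentrating on minority samples near the class boundary or in regions that are harder to learn.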