A Study on Combining Feature Selection and Instance Selection Techniques for Sentiment Analysis

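Information technology has advanced rapidly in recent years, and many innovative products have been released, most visibly smartphones and tablets. The rapid development and spread of handheld devices have changed how people communicate, increased the time and frequency people spend online, and given rise to new forms of online community. Sentiment analysis has many applications in this setting: by determining the sentiment expressed in text, it infers the mood people intend to convey, which makes it possible to track the emotional state of people online and use it as a reference for subsequent analysis. Sentiment analysis has become a very active research topic, and many journal papers have proposed different methods to improve classification accuracy. Among these, this study combines feature selection with instance selection. We collected three publicly available Twitter datasets of different sizes. Feature selection picks specific terms as representatives to reduce the attribute dimensionality, and instance selection picks the more representative samples to reduce the number of instances. Using 10-fold cross-validation, we built prediction models with three WEKA classifiers (J48, Naïve Bayes, and simple logistic regression) and compared the results with the traditional approach. The experimental results show that adding feature selection and instance selection yields significantly higher sentiment classification accuracy than the traditional approach. Two factors affect accuracy: the amount of sample data and the instance selection algorithm. Using the DROP3 instance selection algorithm on large datasets gives better results.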

Keywords

feature selection; instance selection; sentiment analysis

Parallel Abstract

In recent years, advances in new technology products (e.g., smartphones and tablets) have made mobile devices increasingly popular and have created a new community model that affects everyone's life. Sentiment analysis examines the text people write on the web to infer the emotion they expressed at the time, and it can be applied in many fields of research. Sentiment analysis is a popular research topic today, and many researchers have proposed ways to increase classification accuracy. This study uses feature selection and instance selection in sentiment analysis. Three public Twitter datasets of different sizes were collected. We used feature selection to choose representative attributes and reduce the dimensionality, and we added instance selection to choose representative instances and reduce the number of instances. Finally, we used J48, Naïve Bayes, and simple logistic methods to build prediction models and compared them with the baseline. The results show that adding feature selection and instance selection gives better sentiment classification accuracy than traditional methods. We also found two factors that affect the results: the size of the dataset and the instance selection algorithm. In particular, the DROP3 instance selection algorithm gives better results on large datasets.
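The two reduction steps named in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the thesis: the abstract does not say which feature selection criterion was used, so a chi-square score over term presence stands in for it; the instance selection step is Wilson's edited nearest neighbor (ENN) noise filter, which corresponds only to the initial noise-filtering pass of DROP3 and not to the full algorithm; and the seven short texts, their labels, and all function names are invented for illustration. The WEKA classifiers and the 10-fold cross-validation are not reproduced here.

```python
from collections import Counter

# Toy labelled texts (invented for illustration): 1 = positive, 0 = negative.
# The last text is deliberately mislabelled to act as a noisy instance.
TEXTS = [
    ("love this phone great", 1),
    ("great battery love it", 1),
    ("love the great screen", 1),
    ("hate this phone awful", 0),
    ("awful battery hate it", 0),
    ("hate the awful screen", 0),
    ("love this great phone", 0),
]

def chi_square_scores(docs, labels):
    """Chi-square score of each term's presence against a binary label."""
    n = len(docs)
    pos_total = sum(1 for y in labels if y == 1)
    neg_total = n - pos_total
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            (df_pos if y == 1 else df_neg)[term] += 1
    scores = {}
    for term in set(df_pos) | set(df_neg):
        a, b = df_pos[term], df_neg[term]      # docs containing the term
        c, d = pos_total - a, neg_total - b    # docs lacking the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

def select_features(docs, labels, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    scores = chi_square_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]

def vectorize(docs, vocabulary):
    """Binary bag-of-words vectors restricted to the selected terms."""
    return [[1 if term in set(tokens) else 0 for term in vocabulary]
            for tokens in docs]

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

def enn_filter(vectors, labels, k=3):
    """Return indices of instances whose k nearest neighbours agree with them.

    This is the ENN noise filter only, a simplified stand-in for DROP3.
    """
    keep = []
    for i, (x, y) in enumerate(zip(vectors, labels)):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: hamming(x, vectors[j]))
        votes = Counter(labels[j] for j in others[:k])
        if votes.most_common(1)[0][0] == y:
            keep.append(i)
    return keep

docs = [text.lower().split() for text, _ in TEXTS]
labels = [label for _, label in TEXTS]

selected = select_features(docs, labels, k=4)   # step 1: fewer attributes
vectors = vectorize(docs, selected)
kept = enn_filter(vectors, labels, k=3)         # step 2: fewer instances

print("selected terms:", selected)
print("kept instance indices:", kept)
```

In this toy run the reduced vocabulary and the filtered index list would then be passed to a classifier; the order shown (features first, then instances) is one possible arrangement and is not confirmed by the abstract.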

Parallel Keywords

feature selection; instance selection; sentiment analysis
