
實證研究:使用詞向量做文本分類

Word Embedding for Text Classification: An Empirical Study

Advisors: 柯士文, 鍾斌賢

Abstract


Text classification (TC) has long been an important research topic in natural language processing (NLP). To mine valuable information from large volumes of text, researchers convert text into mathematical representations, and this shift has changed how documents are processed and analyzed. With the growth of online information, text data can readily be collected from social networking sites, e-mail, newspapers, and magazines. This thesis studies four data sets: 20NewsGroups, R52, Amazon Fine Food Reviews, and SMS Spam Collection, applying machine learning (ML) with different vector representations to analyze and predict the data. Two text representations are compared, Bag-of-Words (BoW) and Word2Vec, with the further aim of assessing how suitable Word2Vec is for text classification. The first experiment analyzes and discusses the training algorithms of the Word2Vec model, and classifies the documents with two classifiers: a support vector machine (SVM) and a neural network (NN). The second experiment compares the Bag-of-Words representation against Word2Vec through a series of observations, and additionally introduces a deep learning algorithm, Long Short-Term Memory (LSTM), to examine the differences between the two representations. The results show that, on average, Bag-of-Words achieves higher prediction accuracy than Word2Vec in these experiments; we also observed that Word2Vec's word vectors are affected by the quality and quantity of the training corpus. Nevertheless, Word2Vec is not entirely unsuitable for text classification: on the 20NewsGroups and R52 data sets under the deep learning architecture, Bag-of-Words performed worse than Word2Vec. In future work we hope to study the Word2Vec representation in more depth and pair it with a well-designed deep learning architecture to improve the predictive performance of text classification.
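The two representations compared above can be made concrete in a few lines of Python. The following is a minimal sketch assuming scikit-learn and gensim; the two example sentences, the vector size of 50, and all other parameters are illustrative assumptions, not the configuration used in the thesis.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from gensim.models import Word2Vec

    # Two hypothetical documents (illustrative only).
    docs = ["free lottery prize claim your prize now",
            "the meeting is moved to tomorrow morning"]
    tokenized = [d.split() for d in docs]

    # Bag-of-Words: one sparse term-count vector per document.
    bow = CountVectorizer().fit_transform(docs)

    # Word2Vec: learn word vectors, then average them into one
    # fixed-length vector per document.
    w2v = Word2Vec(tokenized, vector_size=50, min_count=1, seed=1)
    doc_vecs = np.array([w2v.wv[tokens].mean(axis=0)
                         for tokens in tokenized])

    print(bow.shape)       # (2, vocabulary size)
    print(doc_vecs.shape)  # (2, 50)

Note the structural difference this exposes: the BoW vector grows with the vocabulary and stays sparse, while the averaged Word2Vec vector has a fixed, dense dimensionality determined by the embedding size.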

Parallel Abstract


Text classification (TC) has always been a highly regarded research topic in the field of natural language processing (NLP). In order to extract valuable information from large amounts of text data, researchers convert text into mathematical representations; this digitization has changed the way we process and analyze information. With the development of online information, text data is easily obtained from community websites, emails, newspapers, and magazines. In this thesis, four data sets are used: 20NewsGroups, R52, Amazon Fine Food Reviews, and SMS Spam Collection, and machine learning (ML) is applied with different word representation models for data analysis and prediction. The vector representation of the text is compared using Bag-of-Words and Word2Vec; further, we also want to assess the suitability of Word2Vec for text classification. In the first experiment, we analyze and discuss the training algorithms of the Word2Vec model, and then classify the data sets with two classifiers: a support vector machine (SVM) and an artificial neural network (ANN). In the second experiment, we use Bag-of-Words and Word2Vec for a series of comparisons and observations; in addition, a deep learning algorithm, Long Short-Term Memory (LSTM), is added to evaluate and observe the difference between them. The results show that the prediction accuracy of Bag-of-Words in these experiments is, on average, better than that of Word2Vec. We also observed that the word representations of Word2Vec are affected by the quality and quantity of the training corpus. However, Word2Vec is not completely unsuitable for text classification: on 20NewsGroups and R52 under the deep learning architecture, Bag-of-Words is less effective than Word2Vec. Therefore, in future work we hope to examine the word vectors of Word2Vec in more depth and, with a sophisticated deep learning architecture, improve the predictive performance of text classification.
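As a rough illustration of one experimental cell described above, Bag-of-Words features with an SVM on the 20NewsGroups data set, the sketch below uses scikit-learn. The two-category subset, the linear SVM, and the default parameters are assumptions chosen for brevity and do not reproduce the thesis's reported setup, which also evaluates a neural network and an LSTM.

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.svm import LinearSVC

    # Hypothetical two-category subset of 20NewsGroups; the thesis's
    # category and parameter choices are not specified here.
    cats = ["rec.autos", "sci.space"]
    train = fetch_20newsgroups(subset="train", categories=cats)
    test = fetch_20newsgroups(subset="test", categories=cats)

    vec = CountVectorizer()                       # Bag-of-Words counts
    X_train = vec.fit_transform(train.data)
    X_test = vec.transform(test.data)             # reuse training vocabulary

    clf = LinearSVC().fit(X_train, train.target)  # linear SVM classifier
    print(accuracy_score(test.target, clf.predict(X_test)))

Swapping the CountVectorizer features for the averaged Word2Vec vectors from the earlier sketch yields the other side of the comparison.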

