
文字情感分析: 利用病徵分析病患自撰之日誌

Sentiment Analysis for Patient-Author Text: Using Word2Vec and Symptoms

Advisor: 柯士文

Abstract


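Sentiment analysis (SA) has recently grown in popularity. Most previous work applies machine learning techniques to product reviews to predict sentiment polarity, focusing on how to build statistical language models or how to extract semantic feature patterns from text. In this study, we apply SA techniques to patient-authored text from an online medical community. Our dataset consists of patient-authored text (PAT) from the well-known medical website patientslikeme.com (PLM). On PLM, patients can share mood phrases, symptom severity, treatments, and quality of life. PAT reads more like a diary reflecting the patient, and is therefore more personal. Another feature unique to the PLM dataset is its discussion of symptoms and diseases, so we examine the relationship between sentiment polarity and symptoms. Many studies use bag-of-words to represent text features, but some studies have shown that bag-of-words loses part of the meaning. In our study, we explore the possibility of using “word vectors” to represent documents. Word2Vec is a tool that not only trains word vectors but can also find similar words and capture multiple levels of meaning. In the first experiment, we use Word2Vec to generate word vectors and apply six different methods to build sentence vectors; we then use two classifiers, support vector machine (SVM) and k-nearest neighbors (k-NN) with cosine similarity, to classify the sentiment polarity of the PATs. In the second experiment, we prepare two corpora to examine whether corpus quality or volume is more helpful for classification. In the third experiment, we observe that “PATs that reference symptoms” have a large effect on classification in past studies. Our observations show that negative polarity and references to symptoms are highly correlated. We therefore build another training model and evaluate the results based on this observation. The results show that the non-normalization method is best at identifying positive polarity and the sentiment method is best at identifying negative polarity. We also find that the normalization method produces worse classification results than the non-normalization method. In the second experiment, we use two different types of classifiers. All results show that the Word2Vec model trained on medical corpora yields better classification performance than the model trained on the Wikipedia corpus.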

Parallel Abstract


Sentiment analysis (SA) has recently been gaining popularity. Most previous work studied product reviews with machine learning techniques to predict sentiment polarity, focusing on how to build statistical language models or how to extract semantic feature patterns from texts. In this paper, we apply SA techniques to patient-authored text from online medical communities. Our datasets are patient-authored text (PAT) from a well-known medical website, patientslikeme.com (PLM). Patients can share mood phrases, severity of symptoms, treatments, and quality of life on PLM. PAT is more like a diary or journal reflecting on the patients themselves. Another feature unique to the PLM datasets is the discussion of symptoms and diseases, so we discuss the relationship between sentiment polarity and symptoms. Many studies used bag-of-words to represent document features, but some studies showed that bag-of-words loses part of a word's meaning. In our study, we explore the possibility of using “word vectors” to represent documents. Word2Vec is a tool whose central idea is to train word vectors that not only support finding similar words but also capture multiple levels of meaning. In the first experiment, we used Word2Vec to generate word vectors and used five different methods to generate sentence vectors, including the most commonly used average method, the no-normalization method, the stop-word method, and the sentiment method from the SA domain. We then used two classifiers, support vector machine (SVM) and k-nearest neighbors (k-NN) with cosine similarity, to classify the sentiment polarity of the PATs. Some previous studies claimed that the corpus used to train the Word2Vec model is very important, so we also discuss the effect of corpus composition on the classification results. We prepared two corpora for the second experiment, which examines whether corpus quality or volume is more helpful for classification.
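The sentence-vector step described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the toy embeddings, the stop-word list, and the function names are hypothetical, and it assumes that the "average" method takes the mean of the word vectors while the "no normalization" variant is a plain sum that skips the division by sentence length.

```python
from typing import Dict, FrozenSet, List, Sequence

# Toy 3-dimensional embeddings (hypothetical values standing in for a
# trained Word2Vec model).
TOY_VECTORS: Dict[str, List[float]] = {
    "pain": [1.0, 0.0, 2.0],
    "is": [0.0, 1.0, 0.0],
    "worse": [3.0, 1.0, 0.0],
    "today": [0.0, 2.0, 2.0],
}
STOP_WORDS: FrozenSet[str] = frozenset({"is"})  # assumed stop-word list


def kept_tokens(tokens: Sequence[str], vectors: Dict[str, List[float]],
                stop_words: FrozenSet[str]) -> List[str]:
    """Keep tokens that have a vector and are not stop words."""
    return [t for t in tokens if t in vectors and t not in stop_words]


def sum_vector(tokens: Sequence[str], vectors: Dict[str, List[float]],
               stop_words: FrozenSet[str] = frozenset()) -> List[float]:
    """Sum the word vectors (assumed 'no normalization' variant)."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for tok in kept_tokens(tokens, vectors, stop_words):
        total = [a + b for a, b in zip(total, vectors[tok])]
    return total


def average_vector(tokens: Sequence[str], vectors: Dict[str, List[float]],
                   stop_words: FrozenSet[str] = frozenset()) -> List[float]:
    """Mean of the word vectors (assumed 'average' method)."""
    kept = kept_tokens(tokens, vectors, stop_words)
    total = sum_vector(tokens, vectors, stop_words)
    if not kept:
        return total  # all zeros when no token has a vector
    return [x / len(kept) for x in total]


sentence = ["pain", "is", "worse", "today"]
mean_vec = average_vector(sentence, TOY_VECTORS)
raw_vec = sum_vector(sentence, TOY_VECTORS)
no_stop_vec = sum_vector(sentence, TOY_VECTORS, STOP_WORDS)
```

In practice the dictionary would be replaced by lookups into a trained Word2Vec model; the stop-word variant simply drops tokens before aggregation.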
From past studies, we observed that “PATs with reference to symptoms” have a large effect on classification. Our observation shows that negative polarity and reference to symptoms are highly correlated. Therefore, we build another training model and evaluate the results based on this observation. The results show that the non-normalization method is the best at identifying positive polarity, while the sentiment method is the best at identifying negative polarity. We also found that the normalization method produced worse classification results than the non-normalization method. In the second experiment, we used two different types of classifiers, i.e., SVM and k-NN. All results showed that the Word2Vec model trained on medical corpora yielded better classification performance than the model trained on the Wikipedia corpus. This outcome indicates that the quality of the training corpus is more important than its volume when training Word2Vec models. In the future, we wish to further explore the usage of explicit and implicit references to symptoms in the PATs.
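For readers unfamiliar with k-NN under cosine similarity, the classifier mentioned in the abstract can be sketched roughly as below. The two-dimensional training vectors, the labels, and the choice of k = 3 are hypothetical and serve only to illustrate the technique; they are not data or settings from the thesis.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(query: Sequence[float],
                examples: List[Tuple[List[float], str]],
                k: int = 3) -> str:
    """Label the query by majority vote among its k most similar examples."""
    ranked = sorted(examples,
                    key=lambda ex: cosine_similarity(query, ex[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled sentence vectors (toy values).
TRAIN: List[Tuple[List[float], str]] = [
    ([1.0, 0.1], "positive"),
    ([0.9, 0.2], "positive"),
    ([0.1, 1.0], "negative"),
    ([0.2, 0.9], "negative"),
    ([0.0, 1.0], "negative"),
]

label_a = knn_predict([1.0, 0.0], TRAIN, k=3)
label_b = knn_predict([0.0, 1.0], TRAIN, k=3)
```

Because cosine similarity ignores vector length, a summed and an averaged sentence vector built from the same words score identically under this classifier; any difference between those variants would have to come from other steps in the pipeline.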

