
結合搭配詞與主題概念改善中文口碑分類

Integration of collocation and concepts for improvement of word of mouth classification

Advisor: 洪智力

Abstract


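In the era of big data, people are accustomed to sharing their experiences with products and services online. Because of the sheer volume of information, users need considerable time to find and digest what meets their needs. Sentiment analysis, a text mining technique, can determine whether the sentiment of an electronic text is a positive recommendation or a negative non-recommendation. To analyze and classify unstructured articles effectively, researchers often use sentiment corpora as the basis for sentiment classification. At present, however, most academic work on sentiment analysis targets English and relies on existing sentiment corpora such as SentiWordNet and SenticNet to support feature extraction and weighting. No Chinese sentiment corpus yet provides sentiment polarity and scores, and the corpora above assign static scores that cannot change with the domain or over time. This study therefore takes Chinese domain-specific articles as its basis and builds an adaptive Chinese sentiment corpus for each domain. Using users' product experiences and product ratings on review websites, it determines the sentiment tendency and score of each word from the relationships between words, between words and domains, and between words and user ratings. Relationships between words are computed sentence by sentence to find feature words and opinion words; association rules and mutual information are used to extract each feature word and its paired opinion word, and the pair is treated as a collocation. Relationships between words and domains and between words and user ratings are computed with article probability, the correlation coefficient, and TF-ICF (Term Frequency-Inverse Class Frequency) to assign word polarity scores. Experimental results show that word usage and distribution vary by domain, so the suitable sentiment tagging method varies as well. In this study, we developed several sentiment tagging techniques that can adapt to word-of-mouth articles from different domains.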
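The mutual-information step for pairing feature words with opinion words can be illustrated with a minimal sketch. It assumes each sentence has already been segmented and split into candidate feature words and opinion words, and it uses pointwise mutual information over sentence-level co-occurrence; the thesis's exact formula and its association-rule step are not given here and are not reproduced. The function name `pmi_collocations` and the `min_count` threshold are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def pmi_collocations(sentences, min_count=2):
    """Score (feature, opinion) pairs by pointwise mutual information.

    sentences: list of (feature_words, opinion_words) pairs, one per sentence.
    min_count: hypothetical support threshold; rarer pairs are dropped.
    """
    n = len(sentences)
    feat_counts, opin_counts, pair_counts = Counter(), Counter(), Counter()
    for feats, opins in sentences:
        feats, opins = set(feats), set(opins)  # count each word once per sentence
        feat_counts.update(feats)
        opin_counts.update(opins)
        pair_counts.update(product(feats, opins))

    scores = {}
    for (feat, opin), count in pair_counts.items():
        if count < min_count:
            continue
        # PMI = log2( P(f, o) / (P(f) * P(o)) ), probabilities taken over sentences
        scores[(feat, opin)] = math.log2(
            count * n / (feat_counts[feat] * opin_counts[opin])
        )
    return scores
```

Pairs with a high score co-occur in the same sentence more often than their individual frequencies would predict, which is the property a collocation candidate needs.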

Parallel Abstract


With the rapid development of big data, users routinely post their experiences and opinions about brands, products, services, and companies on the Internet. The resulting volume of information is so large that processing and analyzing it has become a major challenge. Within text mining, sentiment analysis determines whether the sentiment of an electronic text is positive or negative. To classify unstructured text effectively, researchers often rely on sentiment corpora, which assign a fixed sentiment score to each word. Most of this research targets English and builds on existing corpora such as SentiWordNet and SenticNet. Two main problems remain. First, no Chinese sentiment corpus yet provides sentiment polarity and scores. Second, the score of each word in existing corpora is fixed, so it cannot adapt to different domains or to changes over time. In this study, we propose a way to build an adaptive Chinese sentiment corpus based on Chinese word-of-mouth documents. Using product review websites, which contain users' experiences with products and their product ratings, we define the sentiment tendency and sentiment score of each word by analyzing the relationships between words, between words and domains, and between words and user ratings. Relationships between words are computed sentence by sentence to identify feature words and opinion words. We use association rules and mutual information to extract feature words and their associated opinion words; each such pair is treated as a collocation. Three approaches, namely article probability, correlation coefficient, and term frequency-inverse class frequency (TF-ICF), are used to assign word polarity scores.
Experimental results show that the usage and distribution of words vary across domains, and therefore the suitable sentiment tagging approach differs as well. In this research, we have developed several sentiment tagging techniques that can adapt to word-of-mouth documents from various domains.
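As a rough illustration of TF-ICF weighting, the sketch below uses one common formulation: a term's raw frequency within a class multiplied by the logarithm of the number of classes divided by the number of classes containing the term. The abstract does not state the exact variant used, so this formula, the function name `tf_icf`, and the use of rating-based classes such as "pos" and "neg" are assumptions.

```python
import math
from collections import Counter


def tf_icf(docs_by_class):
    """Compute TF-ICF weights per class.

    docs_by_class: dict mapping a class label to a list of tokenized documents.
    Returns: dict mapping class label -> {term: weight}.
    """
    # Term frequency: occurrences of each term across all documents of a class.
    tf = {
        label: Counter(token for doc in docs for token in doc)
        for label, docs in docs_by_class.items()
    }
    n_classes = len(tf)
    # Class frequency: number of classes in which a term appears at least once.
    class_freq = Counter(term for counts in tf.values() for term in counts)
    return {
        label: {
            term: count * math.log(n_classes / class_freq[term])
            for term, count in counts.items()
        }
        for label, counts in tf.items()
    }
```

Under this formulation a term that appears in every class receives a weight of zero, while a term concentrated in a single class is weighted in proportion to its frequency there, which is what makes the measure usable as a signal of class-specific polarity.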
