A Study on Developing Imbalanced Sentiment Classification

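With the rapid growth of virtual communities, bloggers use the Internet to quickly spread reviews of any product or service, and negative reviews often lower consumers' purchase intentions and cause serious harm to enterprises. However, this unstructured text is difficult to comprehend in a short time. In some cases negative reviews are fewer than positive ones, yet these few negative reviews spread quickly and are highly damaging. In the field of sentiment classification, researchers have often overlooked this kind of class imbalance problem, in which a classifier achieves high accuracy on the majority class but an excessively high error rate on the minority class. Detecting the few negative reviews among a large volume of online reviews has therefore become an important issue. This study aims to determine the key factors that affect the accuracy of imbalanced sentiment classification, using Taguchi methods to identify them. Based on the key factors found, it then proposes the "Balanced Category Features (BCF)" method and combines it with the "Latent Semantic Index (LSI)" to improve imbalanced sentiment classification performance. In addition, classification models are built with Support Vector Machines (SVM) and Decision Trees (DT), and experiments on real online reviews are conducted to verify the effectiveness of the proposed method.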
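The class imbalance problem described above can be illustrated with a small arithmetic sketch. The data below are hypothetical (95 positive and 5 negative reviews, not figures from this study), and the "classifier" is a deliberately degenerate one that always predicts the majority class; the helper names `accuracy` and `per_class_recall` are introduced here for illustration only.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of all samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical imbalanced data set: 95 positive reviews, 5 negative reviews.
y_true = ["pos"] * 95 + ["neg"] * 5
# Degenerate classifier: always predicts the majority class.
y_pred = ["pos"] * 100

acc = accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
print(f"overall accuracy: {acc:.2f}")
print(f"recall per class: {recall}")
```

Under these assumptions the overall accuracy looks high while none of the negative reviews are detected, which is why overall accuracy alone can be a misleading measure for imbalanced sentiment classification.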

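Keywords

Decision Tree; Support Vector Machines; Latent Semantic Index; Feature Selection; Taguchi Method; Imbalanced Sentiment Classification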

Parallel Abstract

With the fast development of virtual social networks, bloggers quickly post comments on products or services via the Internet. Some negative comments can reduce consumers' purchase intentions and cause great damage to enterprises. However, the text information in blogs is often unstructured and hard to comprehend in a short time. In some cases, negative comments are fewer than positive ones, yet these few negative comments spread very fast and are very harmful. In sentiment classification studies, researchers have often not considered the class imbalance problem. A classifier induced from an imbalanced data set has high classification accuracy for the majority class but an unacceptable error rate for the minority class. Therefore, effectively identifying consumers' negative sentiments from a large number of online comments has become a serious issue. This study aims to identify the key factors of imbalanced sentiment classification by using the Taguchi method. Then, according to the discovered key factors, we propose a new method, "Balanced Category Features (BCF)", and combine it with the "Latent Semantic Index (LSI)" to improve the performance of imbalanced sentiment classification. Moreover, Support Vector Machines (SVM) and Decision Tree (DT) are employed to construct classifiers. Finally, a case study using real-world blogs is provided to illustrate the effectiveness of the proposed approach.

Parallel Keywords

Decision Tree; Support Vector Machines; Latent Semantic Index; Feature Selection; Taguchi Method; Imbalanced Semantic Classification

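Cited By

林瑞裕 (2013). A Study on Identifying Manipulated Smartphone Reviews [Master's thesis, Chaoyang University of Technology]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0078-2712201314042747

薛仱芸 (2014). A Study on Improving the Classification Performance for Manipulated Online Reviews [Master's thesis, Chaoyang University of Technology]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0078-0905201416542666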
