在虛擬社群的快速發展下,部落客(Bloggers) 藉由網際網路快速傳遞對於任何產品或服務的評論,而一些負面評論往往會降低消費者的購買意願,並帶給企業嚴重的傷害。然而,這些非結構化的文本資訊,卻難以在短時間內理解。在某些情況下,負面評論通常少於正面評論,這些較少的負面評論傳播快速且具有極大殺傷力。在語意分類(Sentiment Classification)的領域中,學者往往未考慮到這種非等量的類別不平衡問題(Class Imbalance Problems)。而類別不平衡問題是指分類器對多數類別有高分類準確率,但對少數類別則會產生過高的分類錯誤率。因此,從大量的網路評論中偵測出少數的負面評論,已成為重要的議題。本研究目的是確定影響不平衡語意分類準確率的關鍵因子,使用田口方法(Taguchi Methods)以找出重要的影響因子。最後,根據所找出的關鍵因子提出「平衡類別特徵(Balanced Category Features, BCF)」方法,並結合「隱含語意索引(Latent Semantic Index, LSI)」,以改善不平衡語意分類效能。此外,本研究透過支持向量機(Support Vector Machines, SVM)、決策樹(Decision Tree, DT)來建構分類模型,並使用真實的網路評論進行實驗,以證實所提出方法的有效性。
With the rapid growth of virtual communities, bloggers can quickly spread comments on any product or service over the Internet. Negative comments can reduce consumers’ purchase intentions and do serious damage to enterprises. However, the text in blogs is unstructured and hard to comprehend in a short time. In some cases, negative comments are fewer than positive ones, yet these few negative comments spread quickly and are highly damaging. Studies of sentiment classification often overlook this class imbalance problem: a classifier induced from an imbalanced data set has high classification accuracy for the majority class but an unacceptably high error rate for the minority class. Detecting the few negative comments among a large number of online comments has therefore become an important issue. This study aims to identify the key factors that affect the accuracy of imbalanced sentiment classification, using the Taguchi method. Based on the key factors found, we propose a new method, “Balanced Category Features (BCF)”, and combine it with “Latent Semantic Index (LSI)” to improve the performance of imbalanced sentiment classification. Support Vector Machines (SVM) and Decision Trees (DT) are used to construct the classification models, and experiments on real online comments are conducted to demonstrate the effectiveness of the proposed approach.
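
The class imbalance problem described above can be illustrated with a minimal sketch. The 95/5 split, the labels `pos` and `neg`, and the always-predict-majority classifier are hypothetical choices made for illustration only; they are not data or models from this study.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of all examples that were predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Map each label to the fraction of its examples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical imbalanced data: 95 positive and 5 negative comments.
y_true = ["pos"] * 95 + ["neg"] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = ["pos"] * 100

overall = accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)

print(overall)  # 0.95
print(recall)   # {'pos': 1.0, 'neg': 0.0}
```

Overall accuracy looks high even though every negative comment is missed, which is why per-class measures are more informative than accuracy alone on imbalanced data.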