  • Degree thesis

發展適應性中文相似詞庫於口碑分類

Developing an Adaptive Chinese Near-Synonym Corpus for Word of Mouth Classification

Advisor: 洪智力

Abstract


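Word-of-mouth (WOM) articles are mostly classified with the bag-of-words approach, whose vector space model suffers from data noise caused by high dimensionality. This study builds an adaptive Chinese near-synonym corpus through character matching, context matching, homophone matching, and the merging of these lexicons, and then applies the corpus to WOM article classification by lexical replacement. Because it is derived from WOM articles themselves, the adaptive near-synonym corpus copes better with new words that emerge over time. In the classification evaluation stage, the adaptive near-synonym corpus is compared against two static corpora, the Ministry of Education's Revised Mandarin Chinese Dictionary (教育部重編國語辭典修訂本) and the Extended Chinese Synonym Forest (同義詞詞林擴展版). The evaluation computes recall, precision, F-measure, accuracy, and the area under the ROC curve (AUC). The results show that, in all four domains studied (movies, leisure travel, food, and cosmetics), classification accuracy with the adaptive near-synonym corpus is higher than with the static corpora. For future use of the adaptive near-synonym corpus to reduce data noise, the hybrid near-synonym corpus is the preferred choice for movies, and the context-based near-synonym corpus is the preferred choice for leisure travel, food, and cosmetics.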
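The lexical replacement step described above can be illustrated with a minimal sketch. It assumes the near-synonym corpus is available as a simple mapping from each variant to one canonical term; the entries in `NEAR_SYNONYMS`, the helper names, and the toy documents are hypothetical and are not taken from the thesis. The point is only to show how replacing near-synonyms before building bag-of-words vectors shrinks the vocabulary, and with it the dimensionality.

```python
from collections import Counter

# Hypothetical near-synonym entries: each variant maps to one canonical term.
# A real corpus would come from character, context, and homophone matching.
NEAR_SYNONYMS = {
    "好看": "精彩",
    "讚": "精彩",
}

def replace_near_synonyms(tokens, lexicon):
    """Map each token to its canonical term; unknown tokens pass through."""
    return [lexicon.get(tok, tok) for tok in tokens]

def bag_of_words(docs):
    """Return a sorted vocabulary and one term-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

# Toy, already-segmented documents (assumed; word segmentation is out of scope).
docs = [
    ["電影", "好看"],
    ["電影", "精彩"],
    ["劇情", "讚"],
]

raw_vocab, raw_vectors = bag_of_words(docs)
merged_docs = [replace_near_synonyms(doc, NEAR_SYNONYMS) for doc in docs]
merged_vocab, merged_vectors = bag_of_words(merged_docs)

print(len(raw_vocab), len(merged_vocab))
```

In this toy case the three variants collapse into one term, so the merged vocabulary is smaller than the raw one while every document keeps the same number of tokens.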

Parallel Abstract


Most word-of-mouth (WOM) articles are classified using the bag-of-words model, whose vector space representation has the disadvantage of high-dimensional data noise. This study combined character matching, context matching, homophone matching, and the merging of the resulting lexicons to establish an adaptive Chinese near-synonym corpus. Lexical replacement was then used to apply the adaptive Chinese near-synonym corpus to the classification of WOM articles. Because it is built from WOM articles, the adaptive near-synonym corpus adapts better to new terms that emerge over time. In the classification evaluation stage, the adaptive near-synonym corpus was compared with two static corpora, the Ministry of Education’s Revised Mandarin Chinese Dictionary and the Extended Chinese Synonym Forest. Evaluations were conducted by calculating recall, precision, F-measure, accuracy, and the area under the receiver operating characteristic curve (AUC). The results indicate that the classification accuracy of the proposed adaptive near-synonym corpus exceeds that of the static corpora in all four domains: movies, leisure and travel, food, and cosmetics. To reduce data noise with the adaptive near-synonym corpus, the researcher recommends the hybrid near-synonym corpus for movies and the context-based near-synonym corpus for leisure and travel, food, and cosmetics.
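The evaluation measures named in the abstract can be sketched for a binary classifier as follows. This is an illustrative sketch, not the thesis's evaluation code: the labels, scores, and 0.5 threshold are made up, the F-measure is assumed to be the balanced F1, and AUC is computed by the pairwise-ranking definition (the fraction of positive/negative pairs in which the positive example scores higher, with ties counted as one half).

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and classifier scores for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
print(metrics, area)
```

Precision, recall, F1, and accuracy depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason to report it alongside the others.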

