
基於順序遷移學習開發繁體中文情感分析工具

Developing Sentiment Analysis Toolkit for Traditional Chinese Using Sequential Transfer Learning

Advisor: 盧信銘

Abstract


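In recent years, with the rise of forums and social media platforms, many people have become accustomed to sharing their opinions on products and services online. These unstructured data contain information that is valuable to individuals and organizations: for example, consumers can use it to support purchasing decisions, and companies can use it to find directions for improving their products. Sentiment analysis is the key technology for capturing this information more quickly and accurately. Most of the existing literature focuses on improving sentiment analysis techniques, and studies dedicated to developing sentiment analysis tools are comparatively rare. We believe that a tool that can perform sentiment analysis directly brings substantial, concrete benefits, so this study focuses on developing an open-source sentiment analysis tool.

The tool developed in this study aims to meet two goals: practicality and performance. By surveying the prior sentiment analysis literature, defining a sentiment analysis framework, and reviewing existing sentiment analysis tools, we established the features of the tool to be developed: it provides sentence-level sentiment classification, aspect term extraction, and aspect-level sentiment classification; it handles Traditional Chinese; and it is based mainly on the pre-training plus fine-tuning paradigm of sequential transfer learning. We designed a pre-training strategy and a fine-tuning model architecture suited to this study, and built a consumer review dataset for training and testing.

Through the four types of experiments designed in this study, we verified the effectiveness of the pre-training strategy, the suitability of the fine-tuning configuration, the reliability of the developed tool, and the usefulness of developing a tool for Traditional Chinese. The results confirm that our training strategy and related configurations outperform open-source pre-trained models and help improve model capability. In addition, in comparisons with other tools and with methods from classic papers, the tool developed in this study, senti_c, outperformed the compared methods on every metric on two datasets, showing that senti_c achieves solid performance on sentiment analysis problems and provides better analysis results. Furthermore, by testing how each tool's performance differs between Traditional and Simplified Chinese text, we verified that the Traditional Chinese tool provided by this study has practical value. Finally, we released the thoroughly tested senti_c package on PyPI (pypi.org), where the general public can freely download and use it.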

Parallel Abstract


Large amounts of user comments and reviews on products, services, and events are readily accessible on social media and e-commerce platforms. These text data contain valuable information for individuals and organizations. Sentiment analysis facilitates the analysis of large amounts of unstructured review data and may benefit consumers and businesses alike. Previous studies have produced many technical approaches to sentiment analysis. However, to the best of our knowledge, few high-quality open-source sentiment analysis tools are available for Traditional Chinese. To fill this gap, this thesis aims to develop an open-source toolkit for analyzing sentiment in Traditional Chinese text. We conducted an extensive review of the sentiment analysis literature and developed a sentiment analysis framework. Reviewing existing tools against this framework allowed us to establish the main functionality of senti_c, a high-quality open-source sentiment analysis toolkit. The senti_c toolkit is a Python-based library that provides three main functions: sentence-level sentiment classification, aspect term extraction, and aspect-level sentiment classification. We built our own training data and adopted a sequential transfer learning approach to develop the prediction module on top of transformer-based deep learning language models. We conducted extensive experiments on different variations of pre-training and fine-tuning strategies. Our experimental results showed that the training strategies we designed delivered models that outperformed current state-of-the-art open-source pre-trained models. Moreover, senti_c consistently performed better than the baseline methods and currently available toolkits we compared against. While the main training data is in Traditional Chinese, senti_c also performs well on Simplified Chinese. The senti_c toolkit is available from PyPI (pypi.org).
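To make the aspect term extraction task concrete, the following is a minimal sketch that assumes the task is framed as character-level sequence labeling with B/I/O tags. This framing is a common one, but the abstract does not say it is the one senti_c uses. The function name `decode_aspect_terms`, the tag set, and the sample sentence are hypothetical illustrations and are not part of the senti_c API.

```python
from typing import List, Tuple


def decode_aspect_terms(tokens: List[str], tags: List[str]) -> List[Tuple[str, int, int]]:
    """Turn B/I/O tags into (term, start, end) spans; `end` is exclusive.

    Assumption: tokens are single Chinese characters, so a term is rebuilt
    by joining its tokens without spaces.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[str, int, int]] = []
    start = None  # index where the currently open aspect term began
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new term begins; close any term that is still open.
            if start is not None:
                spans.append(("".join(tokens[start:i]), start, i))
            start = i
        elif tag == "I":
            # Continue the open term; a stray "I" is treated as a beginning.
            if start is None:
                start = i
        else:
            # Any other tag (normally "O") ends the open term.
            if start is not None:
                spans.append(("".join(tokens[start:i]), start, i))
                start = None
    if start is not None:
        spans.append(("".join(tokens[start:]), start, len(tokens)))
    return spans


# Hypothetical review: "The screen is clear but the battery does not last."
sentence = list("螢幕很清楚但電池不耐用")
predicted = ["B", "I", "O", "O", "O", "O", "B", "I", "O", "O", "O"]
print(decode_aspect_terms(sentence, predicted))
```

In this sketch the two extracted spans are 螢幕 (screen) and 電池 (battery). An aspect-level sentiment classifier would then assign a polarity to each span, while sentence-level classification assigns one label to the whole review.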
