Thesis / Dissertation

意在言外?微文本中情緒、合法性與反諷之辨識與分析

Beyond Literal Meanings: Recognition and Analysis of Emotions, Legality and Irony in Microtexts

Advisor: Hsin-Hsi Chen (陳信希)

Abstract


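Traditional natural language processing mostly detects and recognizes elements that can be linked directly to surface forms. Language, however, also has non-literal and non-lexical aspects that cannot be understood through literal analysis alone. In microtexts, where length is restricted, these aspects are even harder to analyze, and they can hinder applications such as sentiment analysis, opinion mining, question answering, and dialogue systems. This study examines three phenomena in online communication that go beyond the literal level: emotion, legality, and irony. Using messages from a microblogging platform and short statements published by the government as corpora, we study detection algorithms and carry out related linguistic analyses.

For emotion in microblogs, we use the graphic emoticons that microblog authors commonly employ as emotion labels to build datasets of positive-emotion and negative-emotion microblog posts. Emotion detection is performed with classification algorithms; in addition to textual features, factors such as social relations, user behavior, and relevance are used as features. We find that an appropriate combination of textual features and certain non-textual features yields the best detection results. We also examine emotion transitions between posters and responders, and compare microblog text with long-form online articles in terms of word frequency, semantics, and sentiment.

As online marketing continues to grow, a large amount of marketing-related microtext is being produced online. These texts may contain inappropriate information that users should not rely on, but their misleading nature often cannot be seen from the message itself. For website readers, advertisers, advertising service providers, and government regulators alike, identifying such inappropriate marketing information has become an important issue. This study uses illegal advertising statements published by the government and product descriptions from a shopping website as the illegal and legal advertising datasets, predicts legality with one-class and two-class classification algorithms, and experiments with features including unigrams, a thesaurus, the content of government regulations, and the log relative frequency ratio. Combining unigrams with the log relative frequency ratio as features gives the best results. The log relative frequency ratio is also used to mine verb phrases from the illegal advertising dataset; each phrase consists of a verb and its object, and the resulting list of illegal advertising terms can serve advertisers and government agencies as a reference for judging the legality of advertisements. In addition, this study implements a system for recognizing inappropriate online advertisements, in the hope of giving relevant organizations and users an automatic recognition mechanism that saves labor and reduces the harm caused by such marketing.

Irony is an uncommon but highly effective mode of expression. The English term verbal irony can refer to expressions whose literal meaning is the opposite of, or differs in degree from, the intended meaning. This study focuses on short expressions that use a positive literal meaning to convey a negative actual meaning and, using microblogs as the corpus, carries out the following work: (1) construction of a Chinese irony corpus; (2) investigation of the linguistic structure of irony; (3) generalization of irony cues; (4) recognition of irony elements.

To find as many ironic textual forms as possible, this study uses emoticons as markers of sentiment polarity, takes the NTUSD opinion dictionary and a microblog dictionary of positive and negative emotion words as the basis for sentiment judgments, and searches for ironic messages with an iterative bootstrapping approach. We first observe specific ironic forms, then use them in a semi-automatic procedure to find ironic messages in the microblog corpus, and repeat the process with the newly discovered forms until no new form is found. In this way we built the first Chinese irony corpus. Ironic messages are recognized with conditional random fields (CRF), using Chinese words and their part-of-speech tags as features; this reduces the manual effort required by the bootstrapping procedure. After analyzing the structure of irony, we conclude that an ironic text consists of three elements: (1) the ironic expression itself, (2) contextual information, and (3) rhetorical elements. These elements are explicitly annotated in our irony corpus.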
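
The bootstrapping procedure described in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: representing irony forms as regular expressions, the function name `bootstrap_irony`, and the `propose_patterns` callback (a stand-in for the semi-automatic, human-assisted step that spots new forms) are all hypothetical.

```python
import re
from typing import Callable, Iterable, List, Set, Tuple


def bootstrap_irony(
    messages: Iterable[str],
    seed_patterns: Set[str],
    propose_patterns: Callable[[List[str]], Set[str]],
    max_rounds: int = 10,
) -> Tuple[List[str], Set[str]]:
    """Grow a set of irony patterns until no new pattern is discovered.

    propose_patterns is a hypothetical placeholder for the semi-automatic step
    in which new irony forms are identified among the retrieved messages.
    """
    corpus = list(messages)
    patterns = set(seed_patterns)

    def retrieve() -> List[str]:
        # Keep every message that matches at least one known pattern.
        return [m for m in corpus if any(re.search(p, m) for p in patterns)]

    for _ in range(max_rounds):
        candidates = retrieve()
        # Only patterns not already known count as new discoveries.
        new_patterns = propose_patterns(candidates) - patterns
        if not new_patterns:
            break  # stop: no new form was found in this round
        patterns |= new_patterns

    # Retrieve once more so the result reflects the final pattern set.
    return retrieve(), patterns
```

The `max_rounds` cap is an added safeguard for the sketch; the abstract itself only states that the loop ends when no new form is found.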

Parallel Abstract


The non-literal or non-lexical aspects of communication cannot be interpreted directly and literally. The identification and analysis of real intent beyond literal meanings is a challenging task in natural language processing, especially when working on microtexts such as microblogs that are limited to 140 characters. The recognition and analysis of these components are crucial for many applications including sentiment analysis, opinion mining, question answering and chatterbots. In this study, emotion recognition, online advertising legality identification and verbal irony analysis are examined.

In the emotion recognition experiments, the generation of user emotions on a microblogging platform is modeled from both writers’ and readers’ perspectives. Graphic emoticons, which are commonly used to express users’ emotions, serve as emotion labels so that microtext emotion datasets can be constructed. To build classifiers for the emotion identification task, support vector machine (SVM)-based algorithms are adopted. In addition to textual features, non-verbal factors, including social relation, user behavior and relevance degree, are also used as features. The experimental results show that the combination of textual, social and behavioral features can be used to achieve the best emotion-prediction performance. The emotional transitions from the poster to the responder in a conversation are also analyzed and predicted in this study.

As online advertising continues to grow, Internet users, advertisers, online advertising platforms and the authorities all have the need to avoid or prevent the issues that false and/or misleading advertisements can potentially cause. Many of these false advertising messages are present in short texts, and their appropriateness cannot be easily interpreted. This problem is addressed by building one-class and two-class classifiers with datasets consisting of short illegal advertising statements published by the government and product descriptions from an online shopping website. The results show that the models using the log relative frequency ratio (logRF) combined with unigrams as features achieve the best performance. The logRF values are also used to mine verb phrases that are typically used in illegal advertisements. These verb phrases can be used as a reference for both the advertisers and the authorities. A web-based false advertisement recognition system was also built in this study using the techniques applied to the above experiments in order to reduce human effort in filtering false advertising messages and help protect Internet users from misleading advertising.

In verbal irony, the literal meaning of an utterance can be the opposite of what is actually meant. For simplification, this study focuses on ironic expressions in which negative actual meanings are represented by positive words. Ironic messages in microblogs are infrequent and cannot be identified by simply examining the literal meanings of the words. To construct a Chinese irony corpus, ironic messages are collected from microblogs based on emoticon use, linguistic forms and sentiment polarity through a bootstrapping approach. Five types of irony patterns are found in the collected ironic messages. The structure of ironic expressions is also analyzed, and three types of elements are found to form an ironic expression. A conditional random field (CRF)-based approach is used to automatically identify irony elements and ironic messages and reduce the human effort in the bootstrapping approach of irony pattern discovery.
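
To make the log relative frequency ratio more concrete, here is a small sketch. The abstract does not give the exact formula, so the definition used here is an assumption: the logarithm of a term's relative frequency in the illegal corpus divided by its relative frequency in the legal corpus, with add-one smoothing so that terms absent from one corpus do not cause division by zero. The function name `log_rf` and the tokenized toy inputs are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def log_rf(
    illegal_docs: Iterable[List[str]],
    legal_docs: Iterable[List[str]],
) -> Dict[str, float]:
    """Score each term by log(relative freq. in illegal / relative freq. in legal).

    Assumed definition with add-one smoothing; positive scores mark terms that
    are relatively more frequent in the illegal advertising corpus.
    """
    illegal = Counter(tok for doc in illegal_docs for tok in doc)
    legal = Counter(tok for doc in legal_docs for tok in doc)
    vocab = set(illegal) | set(legal)

    # Smoothed corpus sizes: raw token count plus one pseudo-count per term.
    n_illegal = sum(illegal.values()) + len(vocab)
    n_legal = sum(legal.values()) + len(vocab)

    scores: Dict[str, float] = {}
    for term in vocab:
        p_illegal = (illegal[term] + 1) / n_illegal
        p_legal = (legal[term] + 1) / n_legal
        scores[term] = math.log(p_illegal / p_legal)
    return scores
```

Under this assumed definition, sorting terms by descending score surfaces candidates for an illegal-advertising term list, and the per-term scores could be fed to a classifier alongside unigram features.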

