  • Degree thesis

幽默語料庫之建置

The Construction of Humor Corpus

Advisor: 曾元顯

Abstract


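Humor is one of the important elements that add relief to daily life, and as high-pressure conditions intensify, the demand for humor rises as well. To get the greatest value out of humorous content, this study builds a humor corpus of a certain scale that fits Taiwan's national conditions and is mainly in Traditional Chinese. Its main purposes are: (1) to examine the meaning and value of a humor corpus; (2) to draft a metadata format and corpus size suited to a humor corpus; (3) to analyze the construction process of the humor corpus and archive it; (4) to classify the collected material and resolve classification inconsistencies; (5) to explore the extensibility and application areas of the humor corpus.

The study proceeds as follows. First, it summarizes the theories and background relevant to a humor corpus, covering "humor", "corpus", and the combination of the two. Next, it collects material from multiple sources and defines suitable corpus fields and structure, drawing on content analysis and the systems development research method. Corpus processing (including removing duplicate jokes, annotation and cataloging, and topic consistency) is done mainly by hand, with programs in a supporting role. Finally, it compiles counts for each facet of the preliminary humor corpus, analyzes its applications and future research prospects, and designs planned value-added fields, such as the cause that triggers a joke, negative examples, characters, and a humor-level scoring mechanism, along with corpus expansion and development of a corpus retrieval system, in order to advance chatbots and humor generation and recognition technology.

The resulting humor corpus contains 3,691 jokes (as of January 2019). It is both a specialized corpus and a monitor corpus, is both diachronic and synchronic, and has a complete construction process. The material is not restricted to any one language but is mainly in Traditional Chinese. It is a "humor corpus" suited to Taiwan's national conditions and reflects the five characteristics of humor: subjectivity, regionality, cultural specificity, topicality, and language differences.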

Parallel Abstract


Humor is one of the important elements of life, and as pressure increases, the demand for humor gradually increases too. To seek the greatest value of humorous content, this research constructs a humor corpus of a certain scale, in line with Taiwan's national conditions and mainly in Traditional Chinese. The main purposes are: (1) to discuss the meaning and value of the Humor Corpus; (2) to develop a metadata format and a corpus size suitable for the Humor Corpus; (3) to analyze the process of building the Humor Corpus and to archive it; (4) to classify the corpus and solve the problem of classification inconsistency; (5) to explore the extensibility and application orientation of the Humor Corpus. The research is detailed below. First, it summarizes the relevant theories and background of the Humor Corpus, including "humor", "corpus", and the combination of the two. Second, it collects corpus content from multiple sources and develops appropriate corpus fields and structures, using content analysis and the systems development method in information systems research. The corpus processing tasks, which include cleaning up repeated jokes, annotation and cataloging, and topic consistency, are based on manual work with program assistance. Finally, based on the statistics of the preliminary humor corpus, it analyzes applications and future research prospects and designs the expected value-added fields, such as the causes of jokes, negative examples, characters, and a humor-level scoring mechanism, plus corpus expansion and corpus retrieval system development, to promote chatbots and humor identification or generation technology. In the end, the Humor Corpus reached 3,691 jokes (as of January 2019). It is both a specialized corpus and a monitor corpus, both diachronic and synchronic, with a complete construction process. The corpus is not limited to one language, but it is mainly in Traditional Chinese. It is a "Humor Corpus" suitable for Taiwan's national conditions and conforms to the five characteristics of humor: subjectivity, regionality, culture, topicality, and language differences.
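
The duplicate-cleaning step mentioned above can be illustrated with a minimal Python sketch. It is not the thesis's actual procedure, which the abstract describes as mainly manual with program assistance. The `Joke` record, its field names, and the `normalize` and `drop_duplicates` helpers are all hypothetical, and the approach (comparing texts after Unicode normalization and punctuation removal) is only one plausible way a program could flag exact repeats before human review.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import List


@dataclass
class Joke:
    """Hypothetical metadata record; these fields are illustrative only."""
    joke_id: int
    text: str
    source: str
    topics: List[str] = field(default_factory=list)


def normalize(text: str) -> str:
    """Build a comparison key: NFKC-normalize (folds full-width forms),
    lowercase, then keep only letters and digits, CJK included."""
    folded = unicodedata.normalize("NFKC", text).lower()
    return "".join(ch for ch in folded if ch.isalnum())


def drop_duplicates(jokes: List[Joke]) -> List[Joke]:
    """Keep the first joke for each normalized key, preserving input order."""
    seen = set()
    unique = []
    for joke in jokes:
        key = normalize(joke.text)
        if key and key not in seen:
            seen.add(key)
            unique.append(joke)
    return unique
```

A sketch like this only catches copies that differ in punctuation, spacing, letter case, or full-width versus half-width characters. Retold or lightly reworded jokes would still need the manual review the abstract describes.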

Keywords

Humor Corpus; construction of corpus; corpus

