
批踢踢語料庫之建置與應用

PTT Corpus: Construction and Applications

Advisor: Shu-Kai Hsieh (謝舒凱)

Abstract


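In recent years, corpus-based and corpus-driven research has received growing attention. For Taiwan Mandarin, the Academia Sinica Balanced Corpus (Chen et al., 1996) and the Chinese Gigaword Corpus (Huang et al., 2005) are currently the two most widely used corpora. These corpora are not without limitations, however. Most of them have either stopped being updated or have not yet been updated, so they can no longer fully reflect contemporary Taiwan Mandarin usage as it happens. For many researchers, this makes collecting emerging language data difficult and inconvenient. This thesis therefore takes PTT (批踢踢) as its data source and attempts to build the PTT Corpus, a dynamic corpus that supports automatic collection, updating, analysis, and post-processing. The corpus also provides a friendly and convenient web platform for researchers. In the PTT Corpus, word segmentation is performed with Jseg, a Chinese word segmenter trained on the Academia Sinica Balanced Corpus. Part-of-speech tagging uses the algorithm of the Brill Tagger (Brill, 1992), trained on roughly ten thousand Chinese sentences from the Sinica Treebank (Chen et al., 1999). The web interface includes several applications developed from PTT data: a basic concordancer and a collocation extractor, as well as others such as an emoticon detector and a sentiment polarity classifier. Finally, this study hopes that the PTT Corpus, once built, will supplement and update the emerging language data available for modern Taiwan Mandarin and provide practical corpus tools that simplify tedious data collection and support systematic analysis, so that researchers can focus on analyzing and developing the data itself.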

Parallel Abstract


In recent years, corpus-based and corpus-driven studies have received considerable attention. For Taiwan Mandarin, two of the most widely used corpora are the Academia Sinica Balanced Corpus (Chen et al., 1996) and Chinese Gigaword (Huang et al., 2005). However, both corpora are limited in their data sources, and neither has been updated for some time, which makes it difficult to collect more recent examples of language use. This thesis therefore aims to establish a dynamic corpus, the PTT Corpus, which automatically collects, updates, and processes data from PTT (批踢踢) and provides its applications through a user-friendly interface for researchers. The texts are segmented with Jseg, a Chinese word segmenter trained on data from the Sinica Corpus, and part-of-speech (POS) tagged with the Brill Tagger (Brill, 1992), trained on 9,999 sentences from the Sinica Treebank (Chen et al., 1999). The PTT Corpus provides a web interface with several applications, including a concordancer, a collocation extractor, and an emoticon detector. In conclusion, the PTT Corpus may help enrich the sources of modern corpora, provide useful corpus tools, and simplify the analysis of recent language use and change in Taiwan Mandarin for linguists.
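As a rough illustration of what a concordancer and a collocation extractor do, the following minimal sketch works on text that has already been segmented into tokens. It is not the PTT Corpus implementation: the function names `concordance` and `collocates`, the window sizes, and the sample sentence are all assumptions made for this example, and the collocation step uses raw co-occurrence counts rather than any association measure the thesis may use.

```python
from collections import Counter


def concordance(tokens, keyword, window=3):
    """Return keyword-in-context (KWIC) lines for every hit of `keyword`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def collocates(tokens, keyword, window=2):
    """Count tokens that occur within `window` positions of `keyword`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            start = max(0, i - window)
            # Context on both sides of the hit, excluding the keyword itself.
            counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts


# Hypothetical, already-segmented sample (not taken from the PTT Corpus).
tokens = ["我", "覺得", "這", "部", "電影", "很", "好看", "，",
          "電影", "票", "很", "貴"]

for line in concordance(tokens, "電影", window=2):
    print(line)
print(collocates(tokens, "電影", window=2).most_common(3))
```

In a real pipeline the token list would come from the segmenter, and raw counts would usually be replaced by a measure such as pointwise mutual information so that very frequent function words do not dominate the results.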

Keywords

PTT; dynamic corpus; Taiwan Mandarin

