  • Degree thesis

微型語料庫的自動處理:賽夏語詞性標記、部份剖析及其應用

Automatic Processing of Languages with Small-Scaled Corpus: Part-of-Speech Tagging and Partial Parsing SaiSiyat and Applications

Advisor: Li-May Sung

Abstract


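This thesis studies part-of-speech tagging and partial parsing techniques for micro-corpora of fewer than twenty thousand words, and proposes three applications.

The NTU Corpus of Austronesian Languages is a corpus based on intonation units, of which SaiSiyat accounts for about twelve thousand words. Chapter 1 introduces the current difficulties in processing Austronesian corpora. In particular, the corpus is too small for statistical natural language processing, so other approaches must be sought. Chapter 2 introduces a newly designed tag set that faithfully reflects the linguistic characteristics of SaiSiyat and applies it to part-of-speech tagging. The lexical method, which extracts grammatical information from fieldwork records, reaches an accuracy of about 75%; transformation-based error-driven learning (TBL) then raises the accuracy to 85%. The chapter gives special attention to the difficulty of distinguishing the nominative and accusative uses of the SaiSiyat case marker ka.

Chapter 3 introduces binary partial parsing of SaiSiyat. Partial parsing lays the groundwork for noun-phrase extraction and several other applications. We tried a shortest-path method based on Kullback-Leibler divergence and the TBL method. The accuracy of the former drops quickly as clause length grows, and it requires a large amount of computation time; the latter reaches about 70% accuracy, which meets the requirements we set.

Chapter 4 connects the tagged corpus with linguistic research, native speakers, and the general public. Machine-aided annotation lets linguists process collected data faster and more accurately. Considering the different needs of the general public and of linguists, we designed an integrated platform for online multimedia corpora and adjusted its detailed design around three features: standardization, accessibility, and interchangeability.

Finally, the thesis discusses the philosophical significance of natural language processing from the perspective of the early and the late Wittgenstein. We emphasize the connection between the use of a word in language and its meaning, and hold that the computer cannot go beyond the boundary of the micro-cosmos formed by the texts in a corpus.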
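The two-stage tagging idea (a lexicon-based first pass, then TBL corrections) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tokens `v`, `n`, `m`, the tags `V`, `N`, `NOM`, `ACC`, the single rule template "change tag A to B when the previous tag is C", and the function names. The toy sequences are invented and make no claim about SaiSiyat grammar or about the rule templates actually used in the thesis; `m` merely stands in for an ambiguous marker of the kind the abstract mentions.

```python
def initial_tags(tokens, lexicon, default="N"):
    """First pass: look every token up in a lexicon (hypothetical)."""
    return [lexicon.get(tok, default) for tok in tokens]

def apply_rule(tags, rule):
    """Apply one rule (from_tag, to_tag, prev_tag) to a tag sequence."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def count_errors(pred, gold):
    return sum(p != g for p, g in zip(pred, gold))

def learn_rules(current, gold, max_rules=10):
    """Greedy TBL loop: repeatedly keep the rule that removes most errors."""
    current = [list(seq) for seq in current]
    rules = []
    for _ in range(max_rules):
        # Propose candidate rules only from positions that are still wrong.
        candidates = set()
        for cur, ref in zip(current, gold):
            for i in range(1, len(cur)):
                if cur[i] != ref[i]:
                    candidates.add((cur[i], ref[i], cur[i - 1]))
        best_rule, best_gain = None, 0
        for rule in sorted(candidates):
            # Net gain = errors fixed minus errors introduced.
            gain = sum(
                count_errors(cur, ref) - count_errors(apply_rule(cur, rule), ref)
                for cur, ref in zip(current, gold)
            )
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:
            break  # no rule improves the tagging any further
        rules.append(best_rule)
        current = [apply_rule(cur, best_rule) for cur in current]
    return rules

# Invented toy data: the lexicon always tags the ambiguous marker "m" as NOM.
lexicon = {"v": "V", "n": "N", "m": "NOM"}
sentences = [["v", "m", "n"], ["n", "m", "n"], ["v", "m", "n", "m", "n"]]
gold = [["V", "ACC", "N"], ["N", "NOM", "N"], ["V", "ACC", "N", "NOM", "N"]]

start = [initial_tags(s, lexicon) for s in sentences]
rules = learn_rules(start, gold)

tagged = start
for rule in rules:
    tagged = [apply_rule(seq, rule) for seq in tagged]
print(rules)
```

On this toy data the learner keeps a single context rule and then stops, because no remaining candidate reduces the error count. A real TBL tagger would use several rule templates and a held-out set to measure accuracy.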

Parallel Abstract


This thesis demonstrates an effective method for tagging and parsing a corpus of no more than twenty thousand words, along with three applications that take advantage of the processed corpus. The NTU corpus of Austronesian languages, an intonation-unit (IU) based corpus, is chosen for processing. Chapter 1 introduces current problems in the automatic processing of Austronesian languages. Because small-scale corpora limit the use of statistical natural language processing, an alternative method is needed for Austronesian corpora. Chapter 2 defines a new tag set that reflects the linguistic particularities of SaiSiyat, the object language of this thesis. Two methods for assigning part-of-speech tags, the gloss-based approach (accuracy 75%) and transformation-based error-driven learning (TBL, accuracy 85%), are evaluated and found to be robust. The difficulty of distinguishing SaiSiyat nominative and accusative case markers is discussed in particular. A partial parser is useful in preparing a corpus for noun-phrase extraction and further analyses. In Chapter 3, the tagged corpus is parsed into binary trees by two methods: a statistical approach based on Kullback-Leibler divergence, and TBL. The accuracy of the former declines quickly as IU length increases, and the method requires a large amount of computation time, while the accuracy of the latter is a little less than 70%. Chapter 4 shows how an annotated corpus relates to linguistic research, to native speakers of the object language, and to the public. Machine-aided annotation helps linguists quickly rearrange collected data. An integrated platform for multimedia online corpora is also designed in this chapter to serve both linguists and the public. The last chapter discusses natural language processing from the points of view of the early and the late Wittgenstein. We agree with the idea that a word has as many meanings as it has actual uses. Thus, the computer cannot go beyond the boundary of the micro-cosmos composed of the texts given in a corpus.
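For readers unfamiliar with the quantity behind the statistical parser, the following minimal sketch computes the Kullback-Leibler divergence D(P||Q) = sum over x of P(x) * log(P(x) / Q(x)) between two tag distributions. It does not reproduce how the thesis turns this quantity into a parse; the tag names, the counts, and the add-alpha smoothing in `smoothed` are all assumptions chosen for illustration. Smoothing is included because zero counts are common in a small corpus and would otherwise make the divergence undefined.

```python
import math

def smoothed(counts, vocab, alpha=1.0):
    """Turn raw counts into probabilities with add-alpha smoothing (assumed)."""
    total = sum(counts.get(t, 0) for t in vocab) + alpha * len(vocab)
    return {t: (counts.get(t, 0) + alpha) / total for t in vocab}

def kl_divergence(p, q):
    """D(P || Q) in nats for two distributions given as dicts."""
    total = 0.0
    for event, p_x in p.items():
        if p_x == 0:
            continue  # terms with P(x) = 0 contribute nothing
        q_x = q.get(event, 0.0)
        if q_x == 0:
            raise ValueError(f"Q assigns zero probability to {event!r}")
        total += p_x * math.log(p_x / q_x)
    return total

# Invented tag counts for two hypothetical contexts.
vocab = ["N", "V", "ACC"]
p = smoothed({"N": 6, "V": 2}, vocab)
q = smoothed({"N": 2, "V": 5, "ACC": 1}, vocab)

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(d_pq, d_qp)
```

The two printed values differ because the divergence is not symmetric, so any method built on it has to fix the direction of comparison.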

