
台語文處理技術:以變調及詞性標記為例

Processing Techniques for Written Taiwanese: Tone Sandhi and POS Tagging

Advisor: 高成炎 (Cheng-Yan Kao)
Co-advisor: 陳克健 (Keh-Jiann Chen)

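Abstract

Taiwanese is one of the world's important languages, but it has not received the attention it deserves. In some respects, written Taiwanese differs considerably from Mandarin or English. This dissertation discusses processing techniques for written Taiwanese.

Pe̍h-ōe-jī (POJ, Taiwanese romanization) is an important writing system for Taiwanese. We first introduce the character encodings of POJ and describe numbered-tone POJ as an internal representation shared by the different POJ character encodings. For searching POJ text, we propose a two-stage search strategy and a method for approximate matching of POJ syllables. We also describe methods for displaying POJ, application programs for POJ word processing, and a word segmentation method for Taiwanese text written in mixed Han characters and romanization (漢羅).

We propose a rule-based algorithm for tone sandhi. Each Taiwanese word is first translated into a Mandarin word to obtain its part-of-speech (POS) tag; the POS tags and the tone sandhi rules then determine the post-sandhi tone. We implemented a Taiwanese tone sandhi system that achieves tone sandhi accuracies of 97.4% on the training data and 89.0% on the test data.

We also propose a POS tagging method. We first develop a word alignment checker that aligns, word by word, two Taiwanese texts that are already aligned paragraph by paragraph. We then use an HMM to select the most appropriate Mandarin counterpart for each word, and a MEMM classifier to select its POS tag. The method achieves an accuracy of 91.5%.

Over the past several years, we have built a number of useful online tools for written Taiwanese. We hope that these tools and our preliminary research results will help research on Taiwanese text processing flourish.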
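The idea of using numbered-tone POJ as a common internal form can be illustrated with a minimal Python sketch. It assumes Unicode input in which tones are written with the usual POJ diacritics (acute for tone 2, grave for 3, circumflex for 5, macron for 7, vertical line above for 8) and in which unmarked syllables are tone 4 when they end in p, t, k, or h and tone 1 otherwise. The function names are hypothetical, and this is not the implementation described in the thesis.

```python
import unicodedata

# Combining diacritics that mark POJ tones (tones 1 and 4 carry no mark).
TONE_MARKS = {
    "\u0301": "2",  # combining acute accent
    "\u0300": "3",  # combining grave accent
    "\u0302": "5",  # combining circumflex accent
    "\u0304": "7",  # combining macron
    "\u030d": "8",  # combining vertical line above
}


def poj_syllable_to_numbered(syllable: str) -> str:
    """Convert one diacritic-marked POJ syllable to letters plus a tone digit."""
    if not syllable:
        return syllable
    # NFD splits precomposed letters into base letter + combining marks,
    # so precomposed and decomposed inputs are handled the same way.
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = None
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)  # non-tone marks (e.g. the dot of o͘) are preserved
    letters = unicodedata.normalize("NFC", "".join(kept))
    if tone is None:
        # Assumption: unmarked syllables ending in a stop are tone 4, others tone 1.
        tone = "4" if letters[-1:].lower() in ("p", "t", "k", "h") else "1"
    return letters + tone


def poj_to_numbered(word: str) -> str:
    """Convert a hyphen-joined POJ word syllable by syllable."""
    return "-".join(poj_syllable_to_numbered(s) for s in word.split("-"))


print(poj_to_numbered("Tâi-gí"))  # Tai5-gi2
```

Because the conversion works on the decomposed form, texts that store the same syllable with different code point sequences map to the same numbered string, which is what makes a single search index over mixed inputs plausible.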

Parallel Abstract

Taiwan Southern Min (Taiwanese) is an important language that has received little attention worldwide. The characteristics of written Taiwanese differ considerably from those of Mandarin or English in some respects. This dissertation focuses on processing techniques for written Taiwanese.

POJ is an important script for Taiwanese. We introduce the character encodings of POJ and describe numbered POJ as the interchange code for the various POJ encodings. We then propose a two-stage search strategy for POJ text search, together with POJ syllable query expansion. We also describe a display method for POJ, POJ word processing utilities, and a word segmentation method for the Han-Romanization (HR) mixed script.

We propose a rule-based tone sandhi algorithm. We translate every word into Mandarin and obtain its POS information. Using the POS data and tone sandhi rules, we then tag each syllable with its post-sandhi tone marker. We implemented a Taiwanese tone sandhi processing system that achieves accuracy rates of 97.4% on the training data and 89.0% on the test data.

Additionally, we propose a POS tagging method. We develop a word alignment checker to assist word alignment between texts in the two Taiwanese scripts, select the most adequate Mandarin word using a hidden Markov model (HMM), and finally tag the word using a maximum-entropy Markov model (MEMM) classifier. We achieve an accuracy rate of 91.5% on Taiwanese POS tagging.

Over the past several years we have established a number of useful online tools for written Taiwanese. We hope that these tools and our preliminary research results will help promote research on written Taiwanese processing.
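The final step of rule-based tone sandhi, tagging each syllable with its post-sandhi tone, can be sketched in Python as follows. The abstract does not list the thesis's rules, so this sketch makes several assumptions: it uses the commonly described tone changes (1→7, 7→3, 3→2, 2→1, 5→7 in the southern-type pattern, 4→8 and 8→4 before -p/-t/-k, 4→2 and 8→3 before -h), it takes numbered-tone syllables as input, and it is handed a tone group whose boundaries have already been decided. Deciding those boundaries is where the POS information would come in, and that part is not modeled here. All names are hypothetical.

```python
# Commonly described citation-tone -> sandhi-tone changes for non-checked syllables.
# Assumption: southern-type pattern (5 -> 7); northern-type accents use 5 -> 3.
SANDHI = {"1": "7", "7": "3", "3": "2", "2": "1", "5": "7"}


def sandhi_tone(syllable: str) -> str:
    """Return a numbered syllable with its tone digit replaced by the sandhi tone."""
    base, tone = syllable[:-1], syllable[-1]
    if tone in ("4", "8"):
        # Checked tones depend on the final stop; the spelling of the final is kept.
        if base.endswith("h"):
            new_tone = "2" if tone == "4" else "3"
        else:
            new_tone = "8" if tone == "4" else "4"
    elif tone in SANDHI:
        new_tone = SANDHI[tone]
    else:
        raise ValueError(f"unexpected tone digit in {syllable!r}")
    return base + new_tone


def apply_sandhi(tone_group: list[str]) -> list[str]:
    """Apply sandhi to every syllable except the last one in a tone group.

    The last syllable of a tone group keeps its citation tone.
    """
    if not tone_group:
        return []
    return [sandhi_tone(s) for s in tone_group[:-1]] + [tone_group[-1]]


print(apply_sandhi(["tai5", "gi2"]))  # ['tai7', 'gi2']
```

Keeping the tone table separate from the boundary decision mirrors the split in the abstract between tone sandhi rules and POS data: the table is fixed, while the grouping is what a POS-driven component would supply.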

