Taiwanese (Taiwanese Southern Min) is a major language that has received far less attention than it deserves. In some respects, written Taiwanese differs considerably from written Mandarin or English. This dissertation focuses on techniques for processing written Taiwanese.

Pe̍h-ōe-jī (POJ, romanized Taiwanese) is an important writing system for Taiwanese. We first introduce the character encodings of POJ and describe the use of tone-numbered POJ as an internal representation, or interchange code, shared across the various POJ encodings. For POJ text search, we propose a two-stage search strategy and a method for approximate syllable matching (POJ syllable query expansion). We also describe methods for displaying POJ, word-processing utilities for POJ, and a word segmentation method for Han-Romanization (HR) mixed script.

We propose a rule-based algorithm for tone sandhi. Each Taiwanese word is first translated into Mandarin to obtain its part-of-speech (POS) tag; the POS tags and the tone sandhi rules then determine the post-sandhi tone of each syllable. We implemented a Taiwanese tone sandhi system based on this algorithm. It achieves an accuracy of 97.4% on the training data and 89.0% on the test data.

In addition, we propose a POS tagging method. We first developed a word alignment checker that aligns, word by word, two versions of a Taiwanese text (in different scripts) that are already aligned paragraph by paragraph. We then use a hidden Markov model (HMM) to select the most appropriate Mandarin counterpart for each word, and a maximum entropy Markov model (MEMM) classifier to choose its POS tag. The method achieves an accuracy of 91.5% on Taiwanese POS tagging.

Over the past several years, we have built a number of useful online tools for written Taiwanese. We hope that these tools, together with our preliminary research results, will encourage further research on written Taiwanese processing.
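To illustrate the idea of numbered POJ as an interchange code, the following is a minimal sketch that converts POJ written with Unicode tone diacritics into tone-numbered form. It is not the dissertation's implementation. The function names `syllable_to_numbered` and `text_to_numbered` are hypothetical, the tone-mark mapping follows the usual POJ convention (tones 2, 3, 5, 7, 8 marked; unmarked syllables read as tone 4 when they end in p, t, k, or h, and as tone 1 otherwise), and syllables are assumed to be separated only by hyphens or whitespace.

```python
import re
import unicodedata

# Combining diacritics used as POJ tone marks, mapped to tone numbers.
TONE_MARKS = {
    "\u0301": "2",  # acute accent
    "\u0300": "3",  # grave accent
    "\u0302": "5",  # circumflex
    "\u0304": "7",  # macron
    "\u030d": "8",  # vertical line above
}

NASAL_MARK = "\u207f"  # superscript n, marks nasalization in POJ


def syllable_to_numbered(syllable: str) -> str:
    """Convert one POJ syllable with a tone diacritic to numbered form."""
    # Decompose so that tone marks become separate combining characters.
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = None
    letters = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)  # keeps the dot of o͘ (U+0358) and ⁿ
    base = unicodedata.normalize("NFC", "".join(letters))
    if not base:
        return base
    if tone is None:
        # Unmarked: tone 4 for checked syllables (final p/t/k/h), else tone 1.
        core = base.rstrip(NASAL_MARK)
        checked = core[-1:].lower() in ("p", "t", "k", "h")
        tone = "4" if checked else "1"
    return base + tone


def text_to_numbered(text: str) -> str:
    """Convert hyphen- or space-separated POJ text to numbered POJ."""
    # The capturing group keeps delimiters at the odd indices of the result.
    parts = re.split(r"([-\s]+)", text)
    return "".join(
        part if i % 2 == 1 else syllable_to_numbered(part)
        for i, part in enumerate(parts)
    )


if __name__ == "__main__":
    print(text_to_numbered("Tâi-gí"))      # Tai5-gi2
    print(text_to_numbered("Pe̍h-ōe-jī"))  # Peh8-oe7-ji7
```

Normalizing to NFD first means the same code handles precomposed and decomposed input, which is one reason a plain-ASCII numbered form is convenient as a common representation across differently encoded POJ texts.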