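In recent years there has been much research on word segmentation, but it has all segmented sentences from strings of Chinese characters into word sequences. This thesis instead uses Zhuyin (phonetic symbols) as the basis for segmentation, applying segmentation methods in the hope that Zhuyin-based segmentation can achieve results comparable to character-based segmentation. Because recent advances in segmentation methods have brought the accuracy of Chinese word segmentation to 96%, this thesis applies segmentation methods to the Zhuyin-to-character conversion problem in order to improve its accuracy. Using a specialized hidden Markov model based on BIES word-position labels, we take sentences written in Zhuyin as the input for segmentation and convert them into sentences composed of "Zhuyin words"; we then match these Zhuyin words against a dictionary annotated with Zhuyin to convert them into words written in characters. We call this the "two-stage Zhuyin-to-character conversion" method. In addition, we use a modified hidden Markov model to convert Zhuyin directly into characters. Unlike a traditional hidden Markov model, whose set of states is fixed, this model lets the states vary with the observed symbol, which reduces the loss of computational efficiency caused by an excessive number of states. We name this method "one-stage Zhuyin-to-character conversion".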
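The one-stage idea, in which the states considered at each position depend on the observed syllable, can be illustrated with a small Viterbi decoder. This is a minimal sketch under stated assumptions, not the thesis's M-HMM: the names `viterbi_candidates`, `candidates`, `start_p`, `trans_p`, and `floor` are hypothetical, the probabilities are made up for illustration, and the emission probability of a syllable given a candidate character is assumed to be 1.

```python
import math
from typing import Dict, List, Tuple


def viterbi_candidates(
    syllables: List[str],
    candidates: Dict[str, List[str]],
    start_p: Dict[str, float],
    trans_p: Dict[Tuple[str, str], float],
    floor: float = 1e-6,
) -> List[str]:
    """Return the most probable character sequence for a syllable sequence.

    At each position only the characters listed for the observed syllable
    are treated as states, so the state set changes with the observation.
    Assumption: emission probability is 1 for every listed candidate.
    Unseen start/transition probabilities fall back to `floor`.
    """
    if not syllables:
        return []

    # best[ch] = (log-probability of the best path ending in ch, that path)
    best = {
        ch: (math.log(start_p.get(ch, floor)), [ch])
        for ch in candidates[syllables[0]]
    }

    for syllable in syllables[1:]:
        new_best = {}
        for ch in candidates[syllable]:
            # Pick the best predecessor among the previous position's states.
            score, path = max(
                (
                    (prev_score + math.log(trans_p.get((prev, ch), floor)), prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[ch] = (score, path + [ch])
        best = new_best

    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical toy data: two syllables, two candidate characters each.
toy_candidates = {"ㄓㄨˋ": ["住", "注"], "ㄧㄣ": ["因", "音"]}
toy_start = {"住": 0.5, "注": 0.5}
toy_trans = {("注", "音"): 0.6, ("住", "因"): 0.1}
decoded = viterbi_candidates(["ㄓㄨˋ", "ㄧㄣ"], toy_candidates, toy_start, toy_trans)
```

Restricting each step to the candidates of the observed syllable keeps the inner loop proportional to the number of homophones rather than to the full character inventory, which is the efficiency concern the abstract raises.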
The purpose of this thesis is to determine whether a recent segmentation technique, M-HMM (a specialized hidden Markov model), can help convert syllable sequences into character sequences. This conversion is essentially the task of a keyboard input method that enters Chinese characters into a computer using phonetic symbols. Unlike conventional word segmentation, which segments a character sequence into a word sequence, we group syllables (written in phonetic symbols) into a word sequence in which each word is a group of syllables. Based on BIES labels, which mark the position of a Chinese character within a word, M-HMM produces the best segmentation candidate, grouping syllables into words; the groups of syllables are then converted into words written in characters. This is a two-stage approach. For comparison, we also study a one-stage M-HMM approach that does not use the BIES labels. We find that the one-stage approach gives the better result, with 94.60% correctness.
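The two-stage approach can be sketched as follows: BIES labels (Begin, Inside, End, Single) assigned to each syllable are used to group syllables into words, and each group is then looked up in a Zhuyin-annotated lexicon. This is an illustrative sketch only; the function names `group_by_bies` and `lookup_words` and the toy lexicon are hypothetical, the labels are supplied by hand rather than predicted by a model, and the lexicon maps each syllable group to a single word, whereas a real lexicon would have to resolve homophones.

```python
from typing import Dict, List, Tuple


def group_by_bies(syllables: List[str], labels: List[str]) -> List[List[str]]:
    """Group syllables into words according to BIES position labels."""
    if len(syllables) != len(labels):
        raise ValueError("syllables and labels must have the same length")

    words: List[List[str]] = []
    current: List[str] = []
    for syllable, label in zip(syllables, labels):
        if label not in ("B", "I", "E", "S"):
            raise ValueError(f"unknown label: {label!r}")
        # A new word starts at B or S; flush any unfinished word first so
        # that ill-formed label sequences still yield a segmentation.
        if label in ("B", "S") and current:
            words.append(current)
            current = []
        current.append(syllable)
        # A word ends at E or S.
        if label in ("E", "S"):
            words.append(current)
            current = []
    if current:
        words.append(current)
    return words


def lookup_words(
    groups: List[List[str]], lexicon: Dict[Tuple[str, ...], str]
) -> List[str]:
    """Map each syllable group to a word; keep the raw syllables if unknown."""
    return [lexicon.get(tuple(group), "".join(group)) for group in groups]


# Hypothetical toy lexicon keyed by tuples of Zhuyin syllables.
toy_lexicon = {("ㄓㄨˋ", "ㄧㄣ"): "注音", ("ㄉㄨㄢˋ", "ㄘˊ"): "斷詞"}
toy_syllables = ["ㄓㄨˋ", "ㄧㄣ", "ㄉㄨㄢˋ", "ㄘˊ"]
toy_labels = ["B", "E", "B", "E"]

groups = group_by_bies(toy_syllables, toy_labels)
words = lookup_words(groups, toy_lexicon)
```

Keeping segmentation and lookup as separate functions mirrors the two stages described above: an error in the first stage (a wrong label) propagates to the second, which is one reason a direct one-stage conversion is worth comparing against.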