
Word Segmentation over Syllable Sequence with Application to Transformation from Phonetic Symbols to Chinese Characters (以音斷詞與注音轉漢字)

Advisor: 江永進

Abstract


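In recent years there has been much research on word segmentation, but all of it starts from a string of Chinese characters and segments the sentence into a word sequence. This thesis instead uses Zhuyin (Mandarin phonetic symbols) as the basis for segmentation, in the hope that segmenting by Zhuyin achieves results comparable to segmenting by characters. Because segmentation methods have improved and Chinese word segmentation accuracy can now reach 96%, this thesis applies segmentation to the Zhuyin-to-character conversion problem in order to improve conversion accuracy. We use a specialized hidden Markov model based on BIES word-position tags: a sentence written in Zhuyin is segmented into a sentence made up of "Zhuyin words", and each Zhuyin word is then matched against a dictionary annotated with Zhuyin to convert it into a word in Chinese characters. We call this the "two-stage Zhuyin-to-character conversion" method. In addition, we use a modified hidden Markov model to convert Zhuyin directly into characters. Unlike a conventional hidden Markov model, whose set of states is fixed, this model lets the set of states change with the observed symbol, which avoids the loss of computational efficiency caused by an excessive number of states. We name this method "one-stage Zhuyin-to-character conversion".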
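The second stage of the two-stage method can be illustrated with a minimal sketch. It assumes the BIES tags (B = word beginning, I = word interior, E = word end, S = single-syllable word) have already been produced by the tagging model, and it assumes a toy lexicon keyed by Zhuyin syllable tuples with made-up counts. The function names `group_by_bies` and `lookup_words`, the lexicon contents, and the rule of picking the highest-count homophone are all illustrative assumptions, not the thesis's implementation.

```python
from typing import Dict, List, Tuple

ZhuyinWord = Tuple[str, ...]


def group_by_bies(syllables: List[str], labels: List[str]) -> List[ZhuyinWord]:
    """Group syllables into "Zhuyin words" according to their BIES tags."""
    if len(syllables) != len(labels):
        raise ValueError("syllables and labels must have the same length")
    words: List[ZhuyinWord] = []
    current: List[str] = []
    for syllable, label in zip(syllables, labels):
        current.append(syllable)
        if label in ("E", "S"):  # a word closes on E (end) or S (single)
            words.append(tuple(current))
            current = []
    if current:  # tolerate a tag sequence that stops mid-word
        words.append(tuple(current))
    return words


def lookup_words(
    words: List[ZhuyinWord],
    lexicon: Dict[ZhuyinWord, List[Tuple[str, int]]],
) -> List[str]:
    """Map each Zhuyin word to characters; unknown words stay in Zhuyin."""
    result: List[str] = []
    for word in words:
        candidates = lexicon.get(word)
        if candidates:
            # Assumption: choose the homophone with the highest count.
            result.append(max(candidates, key=lambda c: c[1])[0])
        else:
            result.append("".join(word))
    return result


# Hypothetical toy lexicon: Zhuyin word -> [(character word, count), ...]
TOY_LEXICON: Dict[ZhuyinWord, List[Tuple[str, int]]] = {
    ("ㄓㄨㄥ", "ㄨㄣˊ"): [("中文", 90), ("鐘紋", 1)],
    ("ㄉㄨㄢˋ", "ㄘˊ"): [("斷詞", 40)],
}

syllables = ["ㄓㄨㄥ", "ㄨㄣˊ", "ㄉㄨㄢˋ", "ㄘˊ"]
labels = ["B", "E", "B", "E"]  # assumed output of the first (tagging) stage
zhuyin_words = group_by_bies(syllables, labels)
print("".join(lookup_words(zhuyin_words, TOY_LEXICON)))  # prints 中文斷詞
```

In this sketch a Zhuyin word that is missing from the lexicon is left in phonetic form rather than guessed, so segmentation errors from the first stage stay visible in the output.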

Parallel Abstract


The purpose of this thesis is to see whether a recent segmentation technique, M-HMM (a specialized hidden Markov model), can help the transformation from a syllable sequence to a character sequence. This transformation is essentially the task of a keyboard input method for entering Chinese characters into a computer using phonetic symbols. Unlike the usual word segmentation, which segments a character sequence into a word sequence, here we group syllables (written in phonetic symbols) into a word sequence in which each word is a group of syllables. Based on BIES, the position labels of a Chinese character within a word, M-HMM produces the best segmentation candidate, grouping syllables into words, and the groups of syllables are then transformed into words written in characters. This is a two-stage approach. For comparison, we also study a one-stage approach of M-HMM that does not use the BIES labels. The finding is that the one-stage approach gives the better result, with 94.60% correctness.
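The one-stage idea, in which the states considered at each position depend on the observed syllable, can be sketched as a Viterbi search over a candidate lattice. This is a minimal illustration under stated assumptions, not the thesis's M-HMM: it assumes a first-order (bigram) character model, treats every candidate character as emitting its syllable with probability 1, and uses invented probabilities. The names `decode_one_stage`, `CANDIDATES`, `START_LOGP`, `TRANS_LOGP`, and `FLOOR` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

FLOOR = math.log(1e-9)  # assumed log-probability for unseen events


def decode_one_stage(
    syllables: List[str],
    candidates: Dict[str, List[str]],
    start_logp: Dict[str, float],
    trans_logp: Dict[Tuple[str, str], float],
) -> List[str]:
    """Viterbi decoding where the states at position t are only the
    characters that can be read as syllables[t].

    Assumes every syllable has at least one candidate character.
    """
    if not syllables:
        return []
    # Scores for the first position, restricted to its candidate characters.
    scores = {ch: start_logp.get(ch, FLOOR) for ch in candidates[syllables[0]]}
    backpointers: List[Dict[str, str]] = []
    for syllable in syllables[1:]:
        new_scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for ch in candidates[syllable]:
            # Best predecessor among the previous position's candidates.
            best_prev = max(
                scores,
                key=lambda p: scores[p] + trans_logp.get((p, ch), FLOOR),
            )
            new_scores[ch] = scores[best_prev] + trans_logp.get((best_prev, ch), FLOOR)
            pointers[ch] = best_prev
        backpointers.append(pointers)
        scores = new_scores
    # Trace back from the best final character.
    path = [max(scores, key=lambda ch: scores[ch])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical toy model; all numbers are made up for illustration.
CANDIDATES = {"ㄓㄨㄥ": ["中", "鐘", "終"], "ㄨㄣˊ": ["文", "聞", "紋"]}
START_LOGP = {"中": math.log(0.5), "鐘": math.log(0.3), "終": math.log(0.2)}
TRANS_LOGP = {("中", "文"): math.log(0.6), ("鐘", "紋"): math.log(0.05)}

print("".join(decode_one_stage(["ㄓㄨㄥ", "ㄨㄣˊ"], CANDIDATES, START_LOGP, TRANS_LOGP)))  # prints 中文
```

Because each step only scores the homophones of the current syllable, the work per position depends on the number of candidates for the two adjacent syllables rather than on the size of the full character inventory.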

