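Chinese word segmentation is a fundamental and important task in Chinese natural language processing. A recently developed segmentation method, the specialized Hidden Markov Model based on character-position tags, is theoretically sound and simple to implement, and it outperforms the traditional Maximum Matching Algorithm (MM). The aim of this thesis is to use position-tag segmentation to improve the accuracy of Chinese-to-Zhuyin conversion; that is, after segmentation, converting word sequences to Zhuyin should be more accurate than converting character sequences. In the first stage, the text is segmented with various segmentation methods. In the second stage, the segmented word sequences are converted to Zhuyin. Experiments show that this is more accurate than character-by-character conversion. In the third stage, starting from the second-stage output of M-HMM segmentation followed by Zhuyin conversion, we look for specific Zhuyin conversion rules that raise accuracy further. With the second-stage word-sequence accuracy as the baseline, the experiments confirm that accuracy can indeed be improved further.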
Chinese word segmentation is an important and fundamental task. A recent advance in Chinese word segmentation uses a specialized Hidden Markov Model, called M-HMM, based on BIES labels, which mark the position of each constituent character within a word. The main purpose of this thesis is to determine whether M-HMM segmentation improves pronunciation annotation. First, a character sequence (a sentence without spaces marking word boundaries) is segmented into a word sequence; second, the words are converted into pronunciation annotations. Our experiments show that M-HMM does help. In a third stage, we apply transformation rules to further improve the correctness of the pronunciation annotation.
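
The first two stages can be illustrated with a minimal Python sketch. It is not the thesis implementation: the BIES tags (Begin, Inside, End, Single) are assigned by hand as a stand-in for the output of the M-HMM tagger, and the two tiny lexicons `CHAR_ZHUYIN` and `WORD_ZHUYIN`, as well as the function names, are hypothetical and chosen only for illustration. The sketch shows why word-level lookup can resolve a character with more than one reading (here 行) where character-level lookup cannot.

```python
# Hypothetical toy lexicons (assumptions for illustration only).
# CHAR_ZHUYIN holds one default reading per character.
CHAR_ZHUYIN = {"銀": "ㄧㄣˊ", "行": "ㄒㄧㄥˊ", "很": "ㄏㄣˇ"}
# WORD_ZHUYIN holds readings for whole words, one syllable per character.
WORD_ZHUYIN = {"銀行": ["ㄧㄣˊ", "ㄏㄤˊ"]}


def tags_to_words(chars, tags):
    """Group characters into words using BIES position tags.

    A word ends at every character tagged E (end) or S (single).
    """
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


def words_to_zhuyin(words):
    """Look up each word first; fall back to per-character readings."""
    result = []
    for word in words:
        if word in WORD_ZHUYIN:
            result.extend(WORD_ZHUYIN[word])
        else:
            result.extend(CHAR_ZHUYIN.get(ch, "?") for ch in word)
    return result


sentence = list("銀行很行")
# Hand-assigned tags standing in for the tagger's output (assumption).
tags = ["B", "E", "S", "S"]

words = tags_to_words(sentence, tags)
by_word = words_to_zhuyin(words)
by_char = words_to_zhuyin(sentence)  # each character treated as its own word

print(words)
print(by_word)
print(by_char)
```

In this toy example the word-level lookup reads the first 行 as ㄏㄤˊ because it belongs to the word 銀行, while the character-level lookup gives both occurrences the same default reading.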