Exploiting Pinyin Constraints in Pinyin-to-Character Conversion Task: A Class-Based Maximum Entropy Markov Model Approach

Pinyin-to-Character Conversion is the core task of Chinese pinyin-based input methods. Statistical language models, especially n-gram models, are the most widely adopted solution. However, an n-gram model captures only the constraints between characters and ignores the constraints carried by the input pinyin sequence. This paper improves the performance of a Pinyin-to-Character Conversion system by exploiting those pinyin constraints. The Maximum Entropy Markov Model (MEMM) framework is used to describe both the pinyin constraints and the character constraints. A Class-based MEMM (C-MEMM) is proposed to address the efficiency problem of MEMM in this task. The C-MEMM probability functions are derived rigorously from Bayes' rule and the Markov property, and both the hard-class and soft-class cases are discussed. In the experiments, C-MEMM significantly outperforms the traditional n-gram model by exploiting the pinyin constraints. In addition, C-MEMM can make use of the syntactic and semantic information in word classes to further improve system performance.
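To make the contrast in the abstract concrete, the following is a minimal sketch of the baseline n-gram setting, not the paper's C-MEMM. It runs a Viterbi search over candidate characters using a bigram score that looks only at the previous character. The candidate table, the probabilities, and the names `CANDIDATES`, `BIGRAM`, `local_log_prob`, and `decode` are all hypothetical and chosen for illustration only.

```python
import math

# Hypothetical toy lexicon: candidate characters for each pinyin syllable.
CANDIDATES = {
    "zhong": ["中", "种"],
    "guo": ["国", "果"],
}

# Hypothetical bigram probabilities P(cur | prev); "<s>" marks sentence start.
BIGRAM = {
    ("<s>", "中"): 0.6,
    ("<s>", "种"): 0.4,
    ("中", "国"): 0.9,
    ("中", "果"): 0.1,
    ("种", "国"): 0.2,
    ("种", "果"): 0.8,
}


def local_log_prob(prev, cur, pinyin, i):
    """Log-score of `cur` following `prev`.

    This bigram version ignores `pinyin` and `i`. An MEMM-style local model
    would also condition on the pinyin context around position `i`.
    """
    return math.log(BIGRAM.get((prev, cur), 1e-6))


def decode(pinyin):
    """Viterbi search for the highest-scoring character sequence."""
    # best[c] = (log-probability of the best path ending in c, that path)
    best = {"<s>": (0.0, [])}
    for i, syllable in enumerate(pinyin):
        nxt = {}
        for cur in CANDIDATES[syllable]:
            score, path = max(
                (prev_score + local_log_prob(prev, cur, pinyin, i), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            nxt[cur] = (score, path + [cur])
        best = nxt
    return "".join(max(best.values())[1])


print(decode(["zhong", "guo"]))
```

In this sketch the pinyin sequence only restricts which characters are candidates at each position. Under the assumption that an MEMM replaces `local_log_prob` with a maximum entropy model over features of both the previous character and the surrounding pinyin, the search loop itself would stay the same.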

Keywords

Pinyin-to-Character Conversion; MEMM; Class-Based

Cited by

Jiang, T. J. (2012). Syllable Word Segmentation for Mandarin Chinese via Double Ranking of the Left and Right Context [doctoral dissertation, National Tsing Hua University]. Airiti Library. https://doi.org/10.6843/NTHU.2012.00487
