
Reduced N-Grams for Chinese Evaluation

Abstract


Theoretically, a language model improves as the size of its n-grams increases from 3 to 5 or higher. However, as the n-gram size increases, the number of parameters and calculations, and the storage requirement, grow very rapidly if we attempt to store all possible combinations of n-grams. To avoid these problems, the reduced n-grams approach previously developed by O'Boyle and Smith [1993] can be applied. A reduced n-gram language model, called a reduced model, can efficiently store an entire corpus's phrase-history length within feasible storage limits. Another advantage of reduced n-grams is that they are usually semantically complete. In our experiments, the reduced n-gram creation method, the O'Boyle-Smith reduced n-gram algorithm, was applied to a large Chinese corpus. The Chinese reduced n-gram Zipf curves are presented here and compared with previously obtained conventional Chinese n-grams. The Chinese reduced model reduced perplexity by 8.74% and the language model size by a factor of 11.49. This paper is the first attempt to model Chinese reduced n-grams, and it may provide important insights for Chinese linguistic research.
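The storage explosion and the Zipf curves discussed above can be illustrated with a minimal sketch. This is not the O'Boyle-Smith reduced n-gram algorithm itself, only a conventional character n-gram count and a rank-frequency (Zipf) listing; the function names and the toy text are illustrative:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram of size n in a token sequence (a conventional,
    non-reduced model stores all of these)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def zipf_curve(counts):
    """Return (rank, frequency) pairs, frequencies in descending order,
    as plotted in a Zipf curve."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

# Toy Chinese character sequence (14 characters -> 13 bigrams).
tokens = list("我愛北京天安門天安門上太陽升")
bigrams = ngram_counts(tokens, 2)
curve = zipf_curve(bigrams)
```

With a vocabulary of V characters there are V^n possible n-grams, which is why storing all combinations quickly becomes infeasible as n grows; a reduced model instead keeps only a much smaller set of complete phrases.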

References


Ha, L. Q., E. I. Sicilia-Garcia, J. Ming, and F. J. Smith (2002). "Extension of Zipf's Law to Words and Phrases." Proceedings of the 19th International Conference on Computational Linguistics, 1, 315-320.
Baayen, H. (2001). Word Frequency Distributions.
Evert, S. (2004). "A Simple LNRE Model for Random Character Sequences." Proceedings of the 7èmes Journées Internationales d'Analyse Statistique des Données Textuelles, 411-422.
Ferrer i Cancho, R., and R. V. Solé (2002). "Two Regimes in the Frequency of Words and the Origin of Complex Lexicons." Journal of Quantitative Linguistics, 7(3), 165-173.
Francis, W. N., and H. Kucera (1964). Manual of Information to Accompany A Standard Corpus of Present-Day Edited American English, for use with Digital Computers.
