Extension of Zipf's Law to Word and Character N-grams for English and Chinese

It is shown that, for a large corpus, Zipf's law does not hold at all ranks for either English words or Chinese characters. The observed frequency falls below the frequency predicted by Zipf's law for English words at ranks above about 5,000 and for Chinese characters at ranks above about 1,000. However, when single words or characters are merged with word or character n-grams into one list and ordered by frequency, the tokens in the combined list follow Zipf's law approximately, with a slope close to -1 on a log-log plot for all n-grams, down to the lowest frequencies in both languages. The same behaviour is found for English 2-byte and 3-byte word fragments. It occurs only when all n-grams are included, even semantically incomplete ones. Previous theories do not predict this behaviour, possibly because they do not properly represent the conditional probabilities of tokens.
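The combined-list procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `combined_ngram_counts`, `rank_frequency`, and `loglog_slope`, the toy sentence, and the choice of an ordinary least-squares fit are all assumptions made for illustration. For Chinese, the tokens would be individual characters rather than whitespace-separated words.

```python
from collections import Counter
import math


def combined_ngram_counts(tokens, max_n):
    """Count every contiguous n-gram for n = 1..max_n in one shared Counter.

    All n-grams are kept, including semantically incomplete ones.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def rank_frequency(counts):
    """Return frequencies in descending order; list index + 1 is the rank."""
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two ranks. A value near -1 is what Zipf's law predicts.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Toy input (hypothetical); a real test needs a large corpus.
tokens = "the cat sat on the mat and the cat ran".split()
counts = combined_ngram_counts(tokens, max_n=3)
freqs = rank_frequency(counts)
print(freqs[:5], loglog_slope(freqs))
```

On a toy input like this the fitted slope is not meaningful; the sketch only shows how unigrams and higher-order n-grams share a single ranked list before the slope is estimated.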

Keywords

Zipf's law; Chinese character; Chinese compound word; n-grams; phrases
