
Application of N-grams in Language Model, Genomes and COVID-19 Virus

Abstract


In information theory, an n-gram is defined as any subsequence of n consecutive tokens in a sequence. Since the late 1940s, the concept has been developed and applied in multiple fields of technology. This paper introduces three main applications of n-grams: modeling and predicting English, analyzing genomes, and studying the COVID-19 virus. In predicting English, the n-gram entropy is used to calculate the average uncertainty of the next English letter when the previous N-1 letters are known. N-grams have also been used to categorize and characterize genomes. During the COVID-19 pandemic, n-grams precisely identified the origins of COVID-19 viruses from different locations and revealed the psychological effect of the virus as expressed on social media. The n-gram is a promising method for future use in multiple areas.
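As a rough illustration of the two ideas above (extracting n-grams and estimating the uncertainty of the next letter given the previous N-1 letters), the following is a minimal Python sketch. The function names `ngrams` and `conditional_entropy` are hypothetical and chosen for this example; the estimate uses plain empirical frequencies from a single string, which is an assumption of the sketch and not necessarily the procedure used in the paper.

```python
from collections import Counter
from math import log2


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a list of tuples (n >= 1)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def conditional_entropy(tokens, n):
    """Estimate, in bits, the average uncertainty of the next token
    given the previous n-1 tokens, from empirical n-gram counts:

        F_n = -sum over (b, j) of p(b, j) * log2 p(j | b)

    where b is an (n-1)-token prefix and j is the token that follows it.
    """
    grams = Counter(ngrams(tokens, n))
    total = sum(grams.values())

    # Count how often each (n-1)-token prefix occurs among the n-grams.
    prefixes = Counter()
    for gram, count in grams.items():
        prefixes[gram[:-1]] += count

    return -sum(
        count / total * log2(count / prefixes[gram[:-1]])
        for gram, count in grams.items()
    )


# The same helper applies to letters of English text or to bases of a genome.
letter_bigrams = ngrams("banana", 2)
base_trigrams = ngrams("ACGT", 3)
```

With n = 1 the prefix is empty, so `conditional_entropy` reduces to the ordinary entropy of the single-letter distribution. Larger n needs far more text for the counts to be reliable, so values computed on short strings should be read only as illustrations.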

Keywords

n-grams; Entropy; Information Theory; Genome; COVID-19

