Longest Tokenization

Sentence tokenization is the process of mapping sentences from character strings into strings of tokens. This paper studies longest tokenization, a rich family of tokenization strategies that follow the general principle of maximum tokenization. The objectives are to deepen the understanding of the principle of maximum tokenization in general and to establish the notion of longest tokenization in particular. The main results are as follows: (1) Longest tokenization, which takes a token n-gram as the tokenization object and seeks to maximize the object's length in characters, is a natural generalization of the Chen and Liu Heuristic on the table of maximum tokenizations. (2) Longest tokenization is a rich family of distinct tokenization strategies whose members include many widely used maximum tokenization strategies, such as forward maximum tokenization, backward maximum tokenization, forward-backward maximum tokenization, and shortest tokenization. (3) Theoretically, longest tokenization is a proper subclass of critical tokenization, as the essence of maximum tokenization is fully captured by the latter. (4) Practically, longest tokenization is the same as shortest tokenization, as the essence of length-oriented maximum tokenization is captured by the latter. The results are obtained through both mathematical examination and corpus investigation.
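To make the strategies named in result (2) concrete, here is a minimal Python sketch of forward maximum, backward maximum, and shortest tokenization over a toy lexicon. It is an illustration under stated assumptions, not the paper's formal construction. The function names, the `LEXICON` set, and the rule that any single character may stand as a token when no lexicon entry matches are all assumptions introduced here.

```python
# Toy lexicon (hypothetical); single characters are always allowed as a fallback.
LEXICON = {"fund", "funds", "and", "sand"}


def forward_maximum(text, lexicon):
    """Scan left to right, taking the longest lexicon entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            # Accept the longest match; fall back to one character.
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


def backward_maximum(text, lexicon):
    """Scan right to left, taking the longest lexicon entry ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for i in range(0, j):
            if text[i:j] in lexicon or i == j - 1:
                tokens.append(text[i:j])
                j = i
                break
    tokens.reverse()
    return tokens


def shortest(text, lexicon):
    """Return one tokenization with the fewest tokens (dynamic programming)."""
    n = len(text)
    best = [0] + [None] * n   # best[j]: fewest tokens covering text[:j]
    back = [0] * (n + 1)      # back[j]: start index of the last token in that cover
    for j in range(1, n + 1):
        for i in range(j):
            if text[i:j] in lexicon or i == j - 1:
                cand = best[i] + 1
                if best[j] is None or cand < best[j]:
                    best[j], back[j] = cand, i
    tokens, j = [], n
    while j > 0:
        tokens.append(text[back[j]:j])
        j = back[j]
    return tokens[::-1]


if __name__ == "__main__":
    print(forward_maximum("fundsand", LEXICON))   # ['funds', 'and']
    print(backward_maximum("fundsand", LEXICON))  # ['fund', 'sand']
    print(len(shortest("fundsand", LEXICON)))     # 2
```

On the string "fundsand", the forward and backward strategies disagree, yet both outputs contain two tokens, so each is also a shortest tokenization. The `shortest` function returns only one of the minimal solutions; which one depends on its tie-breaking order.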

Keywords

sentence tokenization; tokenization disambiguation; maximum tokenization; critical tokenization; word segmentation; word identification
