
Abstract


Sentence tokenization is the process of mapping sentences from character strings to token strings. This paper studies longest tokenization, a rich family of tokenization strategies that follow the general principle of maximum tokenization. The objectives are to deepen the understanding of the principle of maximum tokenization in general and to establish the notion of longest tokenization in particular. The main results are as follows: (1) Longest tokenization, which takes a token n-gram as the tokenization object and seeks to maximize the object's length in characters, is a natural generalization of the Chen and Liu Heuristic on the table of maximum tokenizations. (2) Longest tokenization is a rich family of distinct tokenization strategies whose members include many widely used maximum tokenization strategies, such as forward maximum tokenization, backward maximum tokenization, forward-backward maximum tokenization, and shortest tokenization. (3) In theory, longest tokenization is a proper subclass of critical tokenization, since the latter fully captures the essence of maximum tokenization. (4) In practice, longest tokenization is the same as shortest tokenization, since the latter captures the essence of length-oriented maximum tokenization. The results are obtained through both mathematical examination and corpus investigation.
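
To make three of the strategies named in the abstract concrete, the following is a minimal sketch of forward maximum, backward maximum, and shortest tokenization over a toy dictionary. It is not the paper's formulation: the function names, the `WORDS` dictionary, the example string, and the rule that any single character may stand as a token are all assumptions made for illustration.

```python
# Toy dictionary and input; both are hypothetical, chosen to show an ambiguity.
WORDS = {"fund", "funds", "and", "sand"}


def forward_maximum(s, dictionary):
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(s):
        # Try the longest candidate first; a single character is always
        # accepted as a fallback (an assumption of this sketch).
        for j in range(len(s), i, -1):
            if s[i:j] in dictionary or j == i + 1:
                tokens.append(s[i:j])
                i = j
                break
    return tokens


def backward_maximum(s, dictionary):
    """Scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(s)
    while j > 0:
        for i in range(0, j):
            if s[i:j] in dictionary or i == j - 1:
                tokens.append(s[i:j])
                j = i
                break
    return tokens[::-1]


def shortest(s, dictionary):
    """Return one tokenization with the fewest tokens (dynamic programming)."""
    n = len(s)
    best = [None] * (n + 1)  # best[i] = fewest-token split of s[i:]
    best[n] = []
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if s[i:j] in dictionary or j == i + 1:
                candidate = [s[i:j]] + best[j]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[0]


print(forward_maximum("fundsand", WORDS))   # ['funds', 'and']
print(backward_maximum("fundsand", WORDS))  # ['fund', 'sand']
print(len(shortest("fundsand", WORDS)))     # 2
```

On this input the forward and backward scans disagree, yet both outputs contain two tokens, so either one is also a shortest tokenization. Ties like this are why a length-based criterion alone may not single out a unique result.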

