
Abstract


Sentence tokenization is the process of mapping sentences from character strings into strings of tokens. This paper studies longest tokenization, a rich family of tokenization strategies that follow the general principle of maximum tokenization. The objectives are to deepen the understanding of the principle of maximum tokenization in general and to establish the notion of longest tokenization in particular. The main results are as follows: (1) Longest tokenization, which takes a token n-gram as the tokenization object and seeks to maximize the object's length in characters, is a natural generalization of the Chen and Liu Heuristic on the table of maximum tokenizations. (2) Longest tokenization is a rich family of distinct tokenization strategies whose members include many widely used maximum tokenization strategies, such as forward maximum tokenization, backward maximum tokenization, forward-backward maximum tokenization, and shortest tokenization. (3) Theoretically, longest tokenization is a proper subclass of critical tokenization, as the essence of maximum tokenization is fully captured by the latter. (4) Practically, longest tokenization is the same as shortest tokenization, as the essence of length-oriented maximum tokenization is captured by the latter. The results are obtained through both mathematical examination and corpus investigation.
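To make the strategies named above concrete, the following is a minimal sketch of three of them: forward maximum tokenization, backward maximum tokenization, and shortest tokenization (fewest tokens). It uses the common greedy-matching and dynamic-programming formulations, which may differ in detail from the formal definitions in the paper. The toy dictionary, the input string "fundsand", and all function names are illustrative assumptions, not taken from the abstract. The n-gram form of longest tokenization is not implemented here.

```python
def forward_maximum(text, dictionary):
    """Greedy left-to-right: at each position take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            j = i + 1  # assumption: fall back to a single character
        tokens.append(text[i:j])
        i = j
    return tokens


def backward_maximum(text, dictionary):
    """Greedy right-to-left: at each position take the longest word ending there."""
    tokens, j = [], len(text)
    while j > 0:
        for i in range(0, j):
            if text[i:j] in dictionary:
                break
        else:
            i = j - 1  # assumption: fall back to a single character
        tokens.append(text[i:j])
        j = i
    return tokens[::-1]


def shortest_tokenizations(text, dictionary):
    """All dictionary tokenizations of text that use the fewest tokens."""
    n = len(text)
    best = [None] * (n + 1)  # best[i]: minimal tokenizations of text[i:]
    best[n] = [[]]
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            if text[i:j] in dictionary and best[j]:
                candidates.extend([text[i:j]] + rest for rest in best[j])
        fewest = min((len(c) for c in candidates), default=0)
        best[i] = [c for c in candidates if len(c) == fewest]
    return best[0]


# Hypothetical toy dictionary: four words plus every single character.
TEXT = "fundsand"
DICTIONARY = {"fund", "funds", "and", "sand"} | set(TEXT)

print(forward_maximum(TEXT, DICTIONARY))         # ['funds', 'and']
print(backward_maximum(TEXT, DICTIONARY))        # ['fund', 'sand']
print(shortest_tokenizations(TEXT, DICTIONARY))  # [['fund', 'sand'], ['funds', 'and']]
```

On this input the forward and backward strategies disagree, while both of their outputs are shortest tokenizations. This is the kind of relationship among family members that the abstract describes.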

