
A Dictionary-based Maximum Match Algorithm Via Statistical Information for Chinese Word Segmentation

Abstract


The first step in Chinese natural language processing is to divide a string of Chinese characters into a sequence of words. Mechanical word segmentation relies on matching strings against a dictionary. It is fast, algorithmically simple, and easy to implement, but it cannot resolve segmentation ambiguity, so its segmentation quality is poor. Statistical word segmentation quantifies how closely Chinese characters are associated, using character co-occurrence frequencies and other relevant information from a corpus, and takes this measure as the basis for segmentation. This approach handles ambiguity well, but it is slow and complex to implement, so at present it is used mainly for disambiguation. In this paper, we propose a new maximum match algorithm for Chinese word segmentation. In the segmentation stage, the dictionary-based maximum match method is used to recognize the initial word. For the words that follow, a co-occurrence dictionary is constructed from the statistical information of the training set, and each subsequent word is identified automatically according to that co-occurrence information, which both improves segmentation efficiency and resolves ambiguity better. Experimental results show that the new method effectively improves segmentation accuracy and recall, and that it is suitable for Chinese text information mining.
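The two-stage idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the algorithm's details, so everything here is an assumption made for illustration: the function names (`build_cooccurrence`, `candidates`, `segment`), the use of adjacent word-pair counts as the co-occurrence statistic, the tie-breaking rule (longest candidate wins), and the toy dictionary and training data. It is not the authors' implementation.

```python
from collections import Counter


def build_cooccurrence(segmented_sentences):
    """Count adjacent word pairs (prev, next) in pre-segmented training data.

    Assumption: 'co-occurrence' is modelled here as simple bigram counts.
    """
    counts = Counter()
    for words in segmented_sentences:
        for prev, nxt in zip(words, words[1:]):
            counts[(prev, nxt)] += 1
    return counts


def candidates(text, start, dictionary, max_len):
    """Return all dictionary words beginning at `start`, longest first.

    Falls back to the single character at `start` if nothing matches.
    """
    limit = min(max_len, len(text) - start)
    found = [text[start:start + n]
             for n in range(limit, 0, -1)
             if text[start:start + n] in dictionary]
    return found or [text[start]]


def segment(text, dictionary, cooc):
    """Segment `text`: maximum match for the first word, then prefer the
    candidate that co-occurs most often with the previous word."""
    max_len = max(map(len, dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        cands = candidates(text, i, dictionary, max_len)
        if words:
            prev = words[-1]
            # max() keeps the first maximal item, and cands is longest-first,
            # so ties (including all-zero counts) fall back to maximum match.
            word = max(cands, key=lambda w: cooc.get((prev, w), 0))
        else:
            word = cands[0]  # initial word: plain forward maximum match
        words.append(word)
        i += len(word)
    return words


# Toy data (hypothetical, for illustration only).
dictionary = {"他", "研究", "研究生", "生命", "起源"}
training = [["他", "研究", "生命"], ["生命", "起源"]]
cooc = build_cooccurrence(training)

print(segment("他研究生命起源", dictionary, cooc))  # ['他', '研究', '生命', '起源']
print(segment("他研究生命起源", dictionary, {}))    # ['他', '研究生', '命', '起源']
```

With an empty co-occurrence table the sketch degenerates to plain forward maximum match and produces the classic mis-segmentation 研究生 / 命. With the toy counts, the pair (他, 研究) outweighs (他, 研究生), so the overlap is resolved in favour of 研究 / 生命.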
