A Dictionary-based Maximum Match Algorithm Via Statistical Information for Chinese Word Segmentation

The first step of Chinese natural language processing is to divide a string of Chinese characters into a sequence of words. Mechanical word segmentation relies on string matching against a dictionary; it is fast, algorithmically simple, and easy to implement. However, it cannot resolve ambiguity, so its segmentation quality is poor. Statistical word segmentation quantifies the closeness between Chinese characters using their co-occurrence frequency and other relevant information in a corpus, and uses this as the basis for segmentation. This method handles ambiguity well, but it is slow and complex to implement, so at present it is mainly used for disambiguation. In this paper, we propose a new maximum match algorithm for Chinese word segmentation. The dictionary-based maximum match method is used to recognize the initial word. For subsequent words, a co-occurrence dictionary is constructed from the statistical information of the training set, and each subsequent word is identified automatically according to co-occurrence, which both improves segmentation efficiency and resolves ambiguity better. Experimental results show that the new method effectively improves segmentation accuracy and recall, and that it is suitable for Chinese text information mining.

Keywords

Character Co-occurrence; Chinese Word Segmentation; Dictionary-based Maximum Match Method; Statistical Information
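
The abstract does not give the algorithm in detail, so the following is only a minimal sketch of one way the two stages could fit together, not the authors' implementation. It assumes forward (left-to-right) matching, a maximum word length of four characters, and a co-occurrence dictionary that simply counts adjacent word pairs in a segmented training set. The names `forward_max_match`, `build_cooccurrence`, `segment`, and `max_len`, and the toy dictionary and training data, are all hypothetical.

```python
from collections import Counter


def forward_max_match(text, dictionary, max_len=4):
    """Plain dictionary matching: always take the longest word at each position."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def build_cooccurrence(segmented_sentences):
    """Count adjacent word pairs in an already segmented training set (assumed statistic)."""
    counts = Counter()
    for sentence in segmented_sentences:
        for left, right in zip(sentence, sentence[1:]):
            counts[(left, right)] += 1
    return counts


def segment(text, dictionary, cooc, max_len=4):
    """First word by maximum match; later words by co-occurrence with the previous word."""
    words, i = [], 0
    while i < len(text):
        limit = min(max_len, len(text) - i)
        # Candidates ordered longest to shortest; a single character is always allowed.
        candidates = [text[i:i + n] for n in range(limit, 0, -1)
                      if n == 1 or text[i:i + n] in dictionary]
        if words:
            prev = words[-1]
            # max() keeps the first maximal item, so ties fall back to the longest match.
            best = max(candidates, key=lambda w: cooc[(prev, w)])
        else:
            best = candidates[0]
        words.append(best)
        i += len(best)
    return words


# Toy data, invented purely for illustration.
dictionary = {"研究", "研究生", "生命", "起源"}
training = [["他", "研究", "生命"], ["研究", "生命", "起源"]]
cooc = build_cooccurrence(training)

print(forward_max_match("他研究生命起源", dictionary))  # ['他', '研究生', '命', '起源']
print(segment("他研究生命起源", dictionary, cooc))      # ['他', '研究', '生命', '起源']
```

In this toy case, plain maximum matching greedily takes 研究生 and leaves 命 stranded, while the co-occurrence step prefers 研究 because it follows 他 in the training data. A real system would need a much larger dictionary and training set, and the paper may weight or normalize the counts differently.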
