
Abstract


A Chinese sentence has no word delimiters, such as white space, between "words". Word boundaries must therefore be identified before any further processing can proceed. The same is true of other languages, such as Japanese. To form words, traditional heuristic approaches rely on dictionary lookup, morphological rules, and heuristics such as matching the longest matchable dictionary entry. Such approaches may not scale to a large system because of the complicated linguistic phenomena involved in Chinese morphology and syntax. In this paper, the various features available in a sentence are used to construct a generalized word segmentation formula, from which various probabilistic word segmentation models are derived. In general, the likelihood measure adopted in a probabilistic model does not provide a scoring mechanism that directly reflects the true ranking of the candidate segmentation patterns. To enhance the baseline models, the model parameters are further adjusted with a robust adaptive learning algorithm. Simulations show that cost-effective word segmentation can be achieved under various contexts with the proposed models. By incorporating word length information into a simple context-independent word model and applying a robust adaptive learning algorithm to the segmentation problem, a word recognition accuracy of 99.39% and a sentence recognition accuracy of 97.65% are achieved on the test corpus. Furthermore, the assumption that all lexical entries can be found in the system dictionary usually does not hold in real applications. This "unknown word problem" is therefore examined for each word segmentation model used here, and some prospective guidelines for handling it are suggested.
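The "longest matchable dictionary entry" heuristic mentioned above can be sketched as a greedy forward maximum-matching segmenter. The toy dictionary, sentence, and window size below are illustrative assumptions, not taken from the paper:

```python
def max_match(sentence, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary word; fall back to a single character."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest window first, then shrink until a hit.
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in dictionary or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words

toy_dict = {"中文", "分詞", "研究"}
print(max_match("中文分詞研究", toy_dict))  # ['中文', '分詞', '研究']
```

Because the choice at each position is purely local, this heuristic can commit to a long word that forces a bad segmentation later, which is one motivation for the probabilistic models the paper derives instead.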

Keywords

None

Parallel Abstract


Chinese has no delimiters, such as white space, between words, so word boundaries must be identified before Chinese information processing can proceed. Traditional segmentation methods rely mainly on dictionary information, supplemented by heuristics such as the longest-match-first rule, to locate segmentation points. Because Chinese morphology and syntax are quite complex, such approaches may not be applicable to large systems. This paper focuses on exploiting all available features in a Chinese sentence to develop a generalized word segmentation formula, from which various statistical segmentation models are derived. When estimating the statistical parameters, maximum likelihood is generally used as the estimation criterion; this criterion, however, does not reflect the relative ranking among the possible segmentation patterns. We therefore adopt a robust adaptive learning method to adjust the parameter estimates and improve system performance. Experimental results show that the proposed segmentation models achieve segmentation cost-effectively under various conditions. By using word length information and applying robust adaptive learning to a simple statistical model, the word-level segmentation accuracy on the test corpus reaches 99.39%, and the sentence-level accuracy reaches 97.65%. In addition, in practice not all words can be found in the system dictionary; such "new words" or "unknown words" often severely degrade segmentation accuracy. We therefore also propose some feasible solutions to this "unknown word problem".
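The simple context-independent statistical model described above scores a segmentation by the product of per-word probabilities and can be searched exactly with dynamic programming. The sketch below assumes a toy unigram probability table; it illustrates the general model family, not the paper's trained parameters or its length-adjusted scoring:

```python
import math

def best_segmentation(sentence, word_prob, max_len=4):
    """Return the segmentation maximizing the sum of log word
    probabilities (i.e. the product of probabilities)."""
    n = len(sentence)
    # best[i] = (best log-probability of sentence[:i], its segmentation)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            w = sentence[j:i]
            if w in word_prob and best[j][0] > -math.inf:
                score = best[j][0] + math.log(word_prob[w])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [w])
    return best[n][1]

probs = {"中": 0.1, "文": 0.1, "中文": 0.3, "分詞": 0.3, "分": 0.05, "詞": 0.05}
print(best_segmentation("中文分詞", probs))  # ['中文', '分詞']
```

Unlike greedy longest matching, this search compares all candidate segmentation patterns globally, which is the kind of scoring the paper's adaptive learning procedure then tunes.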

Parallel Keywords

None
