An Unsupervised Iterative Method for Chinese New Lexicon Extraction

This paper proposes an unsupervised iterative approach for extracting new lexicon entries (unknown words) from a Chinese text corpus. Instead of using a non-iterative segmentation-merging-filtering-and-disambiguation approach, the proposed method iteratively integrates contextual constraints among word candidates with a joint character association metric to progressively improve the segmentation of the input corpus, and thus the new word list. Unlike traditional approaches, which segment with known words only, the method segments the input corpus with an augmented dictionary that contains potential unknown words in addition to known words. During segmentation, the augmented dictionary imposes contextual constraints over known words and potential unknown words within input sentences; an unsupervised Viterbi training process is then applied so that the selected known words and potential unknown words maximize the likelihood of the input corpus. The joint character association metric, which reflects global character association characteristics across the corpus, is derived by integrating several commonly used word association metrics, such as mutual information and entropy, with a joint Gaussian mixture density function. This integration allows the filter to evaluate character association with multiple features simultaneously, unlike traditional filters, which apply multiple features independently. The contextual constraints and the joint character association metric then enhance each other: the joint association metric is applied iteratively to remove unlikely unknown words from the augmented dictionary, and the segmentation result is used to improve the estimation of the joint association metric. The refined augmented dictionary and the improved estimate are then used in the next iteration to obtain better segmentation and more reliable filtering. Experiments show that both precision and recall improve almost monotonically, in contrast to non-iterative segmentation-merging-filtering-and-disambiguation approaches, which often sacrifice precision for recall or vice versa. On a corpus of 311,591 sentences, the method achieves F-measures of 76% (bigram), 54% (trigram), and 70% (quadragram), significantly better than the non-iterative approach, whose F-measures are 74% (bigram), 46% (trigram), and 58% (quadragram).
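
The iterative loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a unigram word model for the Viterbi segmentation, add-one smoothing for re-estimation, and a caller-supplied `association(word, counts)` score that stands in for the joint Gaussian mixture association metric. All function names, parameters, and the threshold are hypothetical.

```python
import math
from collections import Counter

def viterbi_segment(sentence, logp, max_len=4, unk_logp=-20.0):
    """Most likely segmentation of `sentence` under a unigram word model."""
    n = len(sentence)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-probability of sentence[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = sentence[j:i]
            if word in logp:
                score = best[j] + logp[word]
            elif i - j == 1:
                score = best[j] + unk_logp   # assumed fallback for unseen characters
            else:
                continue
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return words[::-1]

def extract_lexicon(corpus, known, candidates, association, threshold, iterations=3):
    """Alternate segmentation, candidate filtering, and re-estimation."""
    known = set(known)
    dictionary = known | set(candidates)          # the augmented dictionary
    uniform = math.log(1.0 / len(dictionary))
    logp = {w: uniform for w in dictionary}       # flat initial model
    for _ in range(iterations):
        # Segment the corpus with the current augmented dictionary.
        counts = Counter(w for s in corpus for w in viterbi_segment(s, logp))
        # Drop candidates that the association score rejects (known words stay).
        dictionary = {w for w in dictionary
                      if w in known or association(w, counts) >= threshold}
        # Re-estimate word probabilities from the segmentation (add-one smoothed).
        total = sum(counts[w] for w in dictionary)
        logp = {w: math.log((counts[w] + 1) / (total + len(dictionary)))
                for w in dictionary}
    return dictionary - known

# Toy usage with Latin letters standing in for Chinese characters; here the
# association score is simply how often the segmenter selected the candidate.
corpus = ["abcab", "abd", "cab"]
new_words = extract_lexicon(
    corpus,
    known={"a", "b", "c", "d"},
    candidates={"ab", "bc", "ca"},
    association=lambda w, counts: counts[w],
    threshold=2,
)
print(new_words)
```

In this sketch the association score is recomputed from the latest segmentation counts in every pass, which mirrors the mutual enhancement between segmentation and filtering only loosely; a closer reproduction would replace the lambda with a density estimated over features such as mutual information and entropy.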

Keywords

Unknown Word Identification; New Lexicon Extraction; Unsupervised Method; Iterative Enhancement; Chinese; Lexicon

Cited by

Chen, R. C. (2013). 資訊保存與自然語言處理的應用 [doctoral dissertation, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2013.02469

Sung, C. L. (2010). 由後綴陣列與序列排比探索有意義的中文文句樣式 [doctoral dissertation, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2010.03352

詹景傑 (1997). WWW主題導向資訊伺服器之建置 [master's thesis, Yuan Ze University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0009-0112200611292892
