Locally Longest Consecutive Common Subsequence and New Phrase Collection

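Times change and word usage changes with them, so dictionary entries should change as well; a dictionary that cannot keep up with the times reflects a foundational culture that has fallen behind. For a single article or a pair of articles, we propose the locally longest consecutive common subsequence (LLCCS) method, which resembles the well-known longest common subsequence (LCS) algorithm and can efficiently extract the strings that are used repeatedly in the text. We then process and filter the extracted strings further to obtain new phrases, as well as new words, that are more grammatically meaningful. Because large volumes of news and articles can be collected automatically from the web, extracting new phrases and new words should help dictionaries accumulate new entries quickly.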
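The extraction step can be illustrated with a minimal Python sketch. It is not the authors' algorithm, whose details are not given here; it assumes that "locally longest" means a run of consecutive matching characters that cannot be extended to the left or right at the position where it matches. The function name `locally_longest_common_substrings` and the `min_len` parameter are hypothetical, and the later filtering into grammatical phrases is not covered.

```python
from collections import Counter


def locally_longest_common_substrings(a, b, min_len=2):
    """Count common substrings of a and b that cannot be extended
    on either side at the place where they match (an assumed
    reading of "locally longest"; not taken from the paper)."""
    found = Counter()
    # prev[j] holds the length of the matching run that ends at
    # a[i - 2] and b[j - 1], i.e. the previous row of the LCS-style table.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run along the diagonal; a run that starts
                # here has length 1, so it is already left-maximal.
                curr[j] = prev[j - 1] + 1
                # The run is right-maximal if either text ends here
                # or the next characters differ.
                at_end = i == len(a) or j == len(b)
                if curr[j] >= min_len and (at_end or a[i] != b[j]):
                    found[a[i - curr[j]:i]] += 1
        prev = curr
    return found


if __name__ == "__main__":
    text_a = "新冠肺炎疫情升溫"
    text_b = "因應新冠肺炎，疫情指揮中心"
    for phrase, count in locally_longest_common_substrings(text_a, text_b).items():
        print(phrase, count)
```

The table is filled row by row as in the standard LCS dynamic program, but only two rows are kept, so memory grows with the length of the second text while time grows with the product of the two lengths. Handling a single article, for example by comparing it with itself and skipping the main diagonal, would need extra care with overlapping matches and is left out of this sketch.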

Keywords

Unknown word; New phrase; Locally longest common subsequence

Parallel Abstract

Adapting the well-known longest common subsequence (LCS) algorithm, we propose an efficient algorithm that extracts locally longest consecutive common subsequences (LLCCS) from one article or from two different articles. Further processing of the extracted subsequences brings them closer to syntactic phrases and words. With the World Wide Web full of abundant articles, we hope this is an efficient way to enrich the entries of the Chinese lexicon.

Parallel Keywords

Unknown word; New phrase; Locally longest consecutive common subsequence
