  • Degree thesis

利用Google互聯網分類新聞語料之新詞自動擷取技術支援詞庫式中文斷詞系統

New Word Extraction Utilizing Google News Corpuses for Supporting Lexicon-based Chinese Word Segmentation Systems

Advisor: 洪欽銘

Abstract


No data

Parallel Abstract


Segmenting Chinese sentences into words is an essential step in Chinese natural language processing because it benefits Chinese text mining and information retrieval. Currently, the lexicon-based Chinese word segmentation scheme is the most widely used method; it can correctly split the sentences of Chinese-language texts into distinct words for real-world applications. However, the word identification ability of the lexicon-based scheme depends heavily on a well-prepared lexicon with enough lexical entries to cover all Chinese words. In particular, this scheme does not segment well texts that change rapidly over time, such as newspaper articles and web documents, because such documents often contain many new words that a lexicon-based Chinese word segmentation system with a fixed lexicon cannot identify. Moreover, maintaining the lexicon manually is inefficient and time-consuming. To address these problems, this study proposes a novel statistics-based scheme for new word extraction, based on categorized Google News corpora retrieved automatically from the Google News site, to improve the word identification ability of lexicon-based Chinese word segmentation systems. Compared with another proposed method, the experimental results indicate that the proposed new word extraction scheme not only retrieves new words from the categorized Google News corpora more accurately, but also obtains a larger number of new words.
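The limitation described in the abstract can be illustrated with a small sketch. The abstract does not specify the statistics used in the thesis, so the code below is not the author's method: it assumes a plain forward maximum matching segmenter and a simple frequency count over runs of unmatched characters. The function names `forward_max_match` and `candidate_new_words`, the `max_len` and `min_count` parameters, and the toy lexicon are all hypothetical.

```python
from collections import Counter


def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation against a fixed lexicon."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def candidate_new_words(sentences, lexicon, min_count=2):
    """Collect runs of adjacent unmatched single characters that recur.

    This is an illustrative heuristic (an assumption), not the scheme
    proposed in the thesis.
    """
    counts = Counter()
    for sentence in sentences:
        run = ""
        # The empty string is a sentinel that flushes the final run.
        for token in forward_max_match(sentence, lexicon) + [""]:
            if len(token) == 1 and token not in lexicon:
                run += token
            else:
                if len(run) >= 2:
                    counts[run] += 1
                run = ""
    return [word for word, count in counts.items() if count >= min_count]


# Toy lexicon that lacks the word "斷詞".
lexicon = {"中文", "系統", "很難"}
print(forward_max_match("中文斷詞系統", lexicon))
print(candidate_new_words(["中文斷詞系統", "斷詞很難"], lexicon))
```

With this toy lexicon the segmenter breaks the unknown word into single characters, which is the failure mode the abstract attributes to a fixed lexicon; the frequency count then proposes the recurring run as a candidate entry. A practical system would likely need stronger statistics and filtering than a raw count.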

References


[1] Mao-yuan Zhang, Zheng-ding Lu, and Chun-yan Zou, “A Chinese word segmentation based on language situation in processing ambiguous words,” Information Sciences: an International Journal, vol. 162, no. 3-4, pp. 275-285, June 2004.
[2] S. Foo and H. Li, “Chinese word segmentation and its effect on information retrieval,” Information Processing and Management, vol. 40, no. 1, pp. 161-190, 2004.
[3] K.J. Chen and S.H. Liu, “Word Identification for Mandarin Chinese Sentences,” Proceedings of COLING, pp. 101-107, 1992.
[6] CKIP, available at: http://ckipsvr.iis.sinica.edu.tw/
[8] K.J. Chen and Wei-Yun Ma, “Unknown Word Extraction for Chinese Documents,” Proceedings of COLING 2002, pp. 169-175.

Cited by


簡立 (2012). 中文意見探勘系統設計 [Master's thesis, Tamkang University]. Airiti Library. https://doi.org/10.6846/TKU.2012.01235
林渝翔 (2011). 一個產生長詞與新詞的中文混合斷詞系統 [Master's thesis, Yuan Ze University]. Airiti Library. https://doi.org/10.6838/YZU.2011.00155
