A Study of Word Segmentation under LangGeh Orthography (讓格書寫下之斷詞探討)

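Chinese word segmentation is a basic operation in information processing, but the definition of a Chinese word is vague, and this limits its applications. The main segmentation standard in Taiwan is the Academia Sinica CKIP specification (CKIP, 1997 [8]), which is built on semantics, syntax, and usage frequency. This paper proposes a new segmentation standard. The main idea is to avoid leaving single-character words stranded, to reduce fragmentary segmentation output, and to give character count a larger role in the standard, so that the standard becomes simpler and easier to use. Under the proposed standard, we prepared an Internet article of nearly 30,000 characters, converted it to LangGeh (spaced) writing, segmented it by the new standard, and then wrote a simple segmentation system. The resulting segmentation F-measure reaches 98%. By comparison, simple longest-word matching reaches only about 70%, while traditional segmentation of conventionally written text, with models trained on large corpora, reaches about 96%. The method is simple to use and simple to implement. Keywords: Chinese word segmentation, segmentation standard, singleton avoidance, LangGeh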

Keywords

Chinese word segmentation; segmentation standard; singleton avoidance; LangGeh
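
The abstract compares its system against a simple longest-word matching baseline and reports results as an F-measure. The following is a minimal sketch of both ideas, not the authors' code. The function names, the `max_len` limit, and the toy lexicon are assumptions made for illustration, and the F-measure here uses the common word-span definition (a word is correct only if both of its boundaries match the reference).

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward longest-match segmentation (illustrative baseline)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    result = set()
    start = 0
    for w in words:
        result.add((start, start + len(w)))
        start += len(w)
    return result


def f_measure(predicted, gold):
    """Word-level F-measure over matching (start, end) spans."""
    p_spans, g_spans = spans(predicted), spans(gold)
    correct = len(p_spans & g_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(p_spans)
    recall = correct / len(g_spans)
    return 2 * precision * recall / (precision + recall)


# Toy lexicon and sentence, both hypothetical.
lexicon = {"中文", "斷詞", "標準"}
predicted = max_match("中文斷詞的標準", lexicon)
```

With this toy lexicon, `predicted` is `["中文", "斷詞", "的", "標準"]`. The stranded single character `的` is the kind of fragment that the proposed standard tries to avoid.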

Parallel Abstract

The concept of a word in Mandarin Chinese is not well defined. As a result, word segmentation, a basic and important module in Chinese natural language processing, is somewhat difficult to implement. The primary word segmentation standard in Taiwan is the CKIP standard of Academia Sinica, which uses semantics, syntax, and usage frequency to define a word. We propose an additional singleton-avoiding principle, which calls for minimizing single-character words in a segmented text. More specifically, two-character and three-character strings are, in principle, treated as words. Because the number of characters is used in defining a word, the standard becomes easy to follow. Furthermore, when Chinese sentences are written with spaces between simple short phrases (called LangGeh orthography) instead of the traditional way with no spaces in between, the segmentation module becomes much easier to implement. A segmentation module implemented in Python was tested on a text corpus of around 30,000 characters, collected from the Internet and transformed into LangGeh orthography. The resulting performance is 98% in F-measure, which compares quite favorably with the roughly 96% achieved by traditional word segmentation using a large amount of training data. For marginalized languages such as Taiwanese and Hakka, LangGeh and the new segmentation standard seem to be the way to follow. Keywords: Chinese word segmentation, singleton-avoiding principle, LangGeh orthography, segmentation standard.
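
The abstract states the singleton-avoiding principle but does not give the segmentation algorithm. The sketch below is only one possible reading, written to make the idea concrete. It assumes that each space-delimited LangGeh chunk of up to three characters is taken as a word, that longer chunks fall back to lexicon-based longest matching, and that a stray single character is attached to the preceding word when the result stays within three characters. The function names, the merge rule, and the toy lexicon are all hypothetical.

```python
def longest_match(chunk, lexicon, max_len=4):
    """Greedy forward longest-match inside one chunk (fallback step)."""
    words, i = [], 0
    while i < len(chunk):
        for size in range(min(max_len, len(chunk) - i), 0, -1):
            if size == 1 or chunk[i:i + size] in lexicon:
                words.append(chunk[i:i + size])
                i += size
                break
    return words


def merge_singletons(words, max_word_len=3):
    """Attach a lone character to the previous word if the result stays short.

    This merge rule is an assumption, not taken from the paper.
    """
    merged = []
    for w in words:
        if merged and len(w) == 1 and len(merged[-1]) + 1 <= max_word_len:
            merged[-1] += w
        else:
            merged.append(w)
    return merged


def segment_langgeh(text, lexicon, max_word_len=3):
    """Segment space-delimited LangGeh text, avoiding stranded single characters."""
    result = []
    for chunk in text.split():
        if len(chunk) <= max_word_len:
            # Short chunks are treated as words in their own right.
            result.append(chunk)
        else:
            pieces = longest_match(chunk, lexicon)
            result.extend(merge_singletons(pieces, max_word_len))
    return result


# Hypothetical spaced input and lexicon.
segmented = segment_langgeh("我們 準備了 網路文章的", {"網路", "文章"})
```

Here the spaces already supplied by the writer do most of the work, which is consistent with the abstract's claim that LangGeh orthography makes the segmentation module much easier to implement.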


References

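[3] 李佳鴻 (2010). "A Preliminary Study of Automatic Phonetic Annotation of Taiwanese in LangGeh Orthography" (讓格書寫的台語自動標音初探). Master's thesis, Institute of Statistics, National Tsing Hua University, Hsinchu.

[4] 陳建忠 (2010). "A Preliminary Study of 延複詞 and 延複詞 Classes" (延複詞與延複詞類初探). Master's thesis, Institute of Statistics, National Tsing Hua University, Hsinchu.

[5] 謝博行 (2013). "Local Longest Contiguous Common Subsequences and New-Word Collection" (局部最長連續共同子序列與收集新詞). Master's thesis, Institute of Statistics, National Tsing Hua University, Hsinchu.

[6] 林千翔 (2006). "A Study of Chinese Word Segmentation Based on a Specialized Hidden Markov Model" (基於特製隱藏式馬可夫模型之中文斷詞研究). Master's thesis, Institute of Computer Science and Information Engineering, National Central University, Taoyuan County.

[1] Hongmei Zhao and Qun Liu (2010). "The CIPS-SIGHAN CLP 2010 Chinese Word Segmentation Bakeoff". In Proceedings of the First CIPS-SIGHAN Joint Conference on Chinese Language Processing. Beijing, China.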


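Cited By

陳薇婷 (2014). From Unspaced Writing to LangGeh Wide-Spaced Writing (從　無間書寫　到　讓格寬格書寫) [Master's thesis, National Tsing Hua University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0016-2912201413492040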
