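In this paper, we propose a cascaded dictionary-and-CRF method for Chinese word segmentation and part-of-speech (POS) tagging. In the first step, the method consults rich dictionary information and applies common linguistic rules to enumerate all possible segmentations of a Chinese sentence. In the second step, it uses the syntactic structure learned by a conditional random field (CRF) to select the most suitable segmentation among the candidates and to assign the correct POS tags.

The method addresses three common problems in Chinese word segmentation at the same time: segmentation ambiguity, inconsistent segmentation standards, and unknown words. We address them as follows: (1) a CRF learns syntactic structure from a POS-tagged training corpus to resolve segmentation ambiguity; (2) a standard procedure is proposed so that integrating several types of dictionary information does not introduce inconsistent segmentation standards; (3) multiple dictionaries of different kinds, together with common linguistic rules, handle unknown words. Besides offering flexibility (e.g., dynamically incorporating whatever dictionary information is needed) and practicality (e.g., choosing dictionaries and system settings according to document type to obtain the best segmentation and POS tagging results), the method is easy to implement, which greatly lowers the barrier to entry for Chinese language processing.

For experimental resources, we collected 16 dictionaries of different types and used the training and test sets provided by Academia Sinica in SIGHAN bakeoff 1 for the segmentation and POS tagging experiments. We also collected a medical corpus in order to run segmentation experiments and analysis on a different document type.

The experiments show that the method achieves good segmentation and POS tagging performance with only a small training set (7,229 sentences). With a training set annotated with 46 POS tags, segmentation and POS tagging reach F-scores of 0.964 and 0.922, respectively; with a training set annotated with a simplified set of 10 POS tags, they reach F-scores of 0.954 and 0.939. Our experimental data also show that choosing dictionaries and system settings suited to the characteristics of each document type yields the best segmentation performance.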
This paper proposes a combined dictionary-and-CRF approach to Chinese word segmentation and part-of-speech (POS) tagging. The approach enumerates all candidate segmentations of a sentence by dictionary lookup and then selects the best candidate with a conditional random field (CRF) model. It can incorporate additional dictionaries to address the unknown-word problem without retraining the model. We also propose a practical procedure for adding terms to the system's dictionary without introducing inconsistencies in the segmentation standard. In addition, the approach can select dictionaries and segmentation settings according to the document type. The experiments use the training and test collections of SIGHAN bakeoff 1 and a medical document collection. The approach achieves an F-score of 0.964 for segmentation and 0.922 for POS tagging. Training uses only 7,229 lines of the training file, which shows that the model can be built from a small amount of training data. Even when only 10 simplified POS tags are used for training, the approach achieves an F-score of 0.954 for segmentation and 0.939 for POS tagging. Its main strengths are simplicity, practicality, and flexibility.
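To make the two-step idea concrete, the following is a minimal Python sketch of enumerating candidate segmentations by dictionary lookup and then picking one of them. It is an illustration under stated assumptions, not the system described here: the function names, the toy lexicon, and the `max_word_len` parameter are hypothetical, and `placeholder_score` is a simple heuristic standing in for the trained CRF, which would score each candidate from learned features and also assign POS tags.

```python
from typing import Callable, List, Set

Segmentation = List[str]


def enumerate_segmentations(
    sentence: str, lexicon: Set[str], max_word_len: int = 4
) -> List[Segmentation]:
    """Step 1: list every split of `sentence` into lexicon words.

    Single characters are always allowed, so a sentence containing
    characters missing from the lexicon still yields candidates.
    """
    results: List[Segmentation] = []

    def extend(start: int, prefix: Segmentation) -> None:
        if start == len(sentence):
            results.append(prefix)
            return
        longest_end = min(len(sentence), start + max_word_len)
        for end in range(start + 1, longest_end + 1):
            word = sentence[start:end]
            if end - start == 1 or word in lexicon:
                extend(end, prefix + [word])

    extend(0, [])
    return results


def placeholder_score(candidate: Segmentation) -> float:
    """Hypothetical stand-in for the CRF score (NOT the paper's model).

    Prefers fewer words and penalizes single-character words.
    """
    single_chars = sum(1 for word in candidate if len(word) == 1)
    return -float(len(candidate) + single_chars)


def select_best(
    candidates: List[Segmentation],
    score: Callable[[Segmentation], float],
) -> Segmentation:
    """Step 2: keep the candidate with the highest score."""
    return max(candidates, key=score)


# Toy lexicon (assumed for illustration only).
lexicon = {"研究", "研究生", "生命", "起源"}
candidates = enumerate_segmentations("研究生命起源", lexicon)
best = select_best(candidates, placeholder_score)
print(len(candidates), "/".join(best))
```

The example sentence is ambiguous because 研究生 overlaps with 研究 followed by 生命; both readings appear among the candidates, and resolving that kind of ambiguity is the role the CRF plays in the second step. Adding a term to `lexicon` changes the candidate set without touching the scorer, which mirrors the claim that dictionaries can be extended without retraining.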