以構詞律與相似法為本的中文動詞自動分類研究

A Hybrid Approach for Automatic Classification of Chinese Unknown Verbs

Abstract


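This thesis combines two methods to predict the syntactic category of unknown verbs. The first is a rule-based method: morphological rules describing how unknown verbs are formed are induced from the training corpus. The rules make their decision in two main ways: (1) by the key character that an unknown verb contains, and (2) by the compositional pattern of the unknown verb.

The key-character method first divides verbs into four groups by length: two-character words, three-character words, four-character words, and words of five or more characters. Observation of real corpus data showed that verbs of different lengths have different structures, so the corpus is grouped by word length. For example, three-character words yield two rules, "好" and "出", that determine a verb's category, whereas unknown verbs of other lengths do not have these two rules. Likewise, the "化" rule does not apply to two-character verbs.

The second part of the rule-based method classifies an unknown verb by its compositional pattern. While examining unknown verbs, we found that some of them are composed in highly regular ways, so we generalized the compositions found in the training corpus into nine patterns. Across ten experiments, the rule-based method could handle about 23.19% of the unknown verbs on average, and 91.67% of its predictions were correct.

The second method, the similarity-based method, predicts the category of an unknown verb from examples that are similar to it. It mainly uses HowNet and the Sinica Treebank 1.0 to measure semantic and syntactic similarity. The similarity between the unknown verb and each known verb is computed, and the unknown verb is assigned the category of the most similar example. An advantage of this method is that, when the similarity is high, the similar word it finds can predict not only the syntactic category but also the semantic and structural class: a high similarity between two words indicates that their syntactic categories, semantic classes, and structures are similar as well. Across ten experiments, the similarity-based method predicted verb categories with an accuracy of about 71.05%.

The rule-based method has high accuracy but can handle only a limited number of unknown verbs; the similarity-based method can handle most unknown verbs, but its accuracy is lower. Finally, we combine the two methods to predict the category of unknown verbs. When both were applied to the final test corpus, the rule-based method reached an accuracy of 87.25% and the similarity-based method 65.04%; the two combined reached 70.80%.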
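The rule-first, similarity-fallback flow described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the rule table, the category labels (`CAT_X`, `CAT_Y`, `CAT_Z`), the function names, and the choice to match the key character against the final character of the word are all assumptions made for the example. Only the characters "好", "出", and "化" and the four length groups come from the abstract.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical rule table: length group -> {key character -> category}.
# The characters appear in the abstract; the category labels are placeholders,
# and placing "化" in groups "3" and "4" is only an illustration.
KEY_CHAR_RULES: Dict[str, Dict[str, str]] = {
    "3": {"好": "CAT_X", "出": "CAT_Y", "化": "CAT_Z"},
    "4": {"化": "CAT_Z"},
}


def length_group(word: str) -> str:
    """Map a word to one of the length groups: '2', '3', '4', or '5+'."""
    n = len(word)
    return "5+" if n >= 5 else str(n)


def rule_predict(word: str) -> Optional[str]:
    """Return a category if a key-character rule fires, else None.

    Assumption: the key character is the last character of the word.
    """
    if not word:
        return None
    rules = KEY_CHAR_RULES.get(length_group(word), {})
    return rules.get(word[-1])


def hybrid_predict(
    word: str, similarity_predict: Callable[[str], str]
) -> Tuple[str, str]:
    """Try the rules first; fall back to the similarity-based predictor."""
    category = rule_predict(word)
    if category is not None:
        return category, "rule"
    return similarity_predict(word), "similarity"
```

Under these placeholder rules, `hybrid_predict("現代化", fallback)` is answered by the rule table, while a two-character word such as "美化" falls through to the fallback, reflecting the statement that the "化" rule does not apply to two-character verbs.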

Parallel Abstract


In this paper we present a hybrid approach for the automatic classification of Chinese unknown verbs. The first method of the hybrid approach uses a set of morphological rules summarized from the training data, i.e. the set of compound verbs extracted from the Sinica Corpus, to determine the category of an unknown compound verb. If the morphological rules are not applicable, instance-based categorization using the k-nearest neighbor method is employed.

We observed that some suffix morphemes occur frequently in compound verbs and also uniquely determine the syntactic categories of the resulting compound verbs. By processing the training data, we derived 15 suffix rules with coverage over 2% and category prediction accuracy higher than 80%. In addition to this type of morphological rule, reduplication rules are also useful for category prediction, for example the well-known Chinese reduplication patterns "AA" in two-character words and "AAB" and "ABB" in three-character words. For instance, "喝喝茶" has the same category as "喝茶", and "研究研究" has the same category as "研究". In total, nine reduplication patterns were generated. Experiments on the training data show that the overall accuracy of the morphological rule classifier is 91.67%, but its coverage is only 23.19%.

Since the coverage of the morphological rule classifier is low, an instance-based categorization method is employed to handle the uncovered cases. Instance-based categorization uses similar examples to predict the category of an unknown verb. Lexical similarity is measured by both semantic similarity and syntactic similarity. The semantic similarity between two words is measured by the semantic distance between their HowNet definitions, and the syntactic similarity is measured by the distance between their syntactic categories. The distance between two syntactic categories is the cosine measure of their grammatical feature vectors, which are derived from the Sinica Treebank. An unknown verb is assigned the category of the examples that are most similar to it under these criteria. When tested on the training data, the best accuracy of instance-based categorization is 71.05%, obtained when the similar examples are drawn from both unknown verbs and verbs in the dictionary (known verbs).

Both the morphological rule classifier and instance-based categorization can not only predict the syntactic categories of unknown words but also recognize their morphological structures and major semantic classes. The advantage of the morphological rule classifier is its higher accuracy; that of instance-based categorization is its higher coverage. Each method also has its own drawback: the former cannot be applied to most unknown verbs, while the latter suffers from lower accuracy. For the open test, 1000 unknown verbs that were unseen in the training process were tested. The accuracy of the morphological rule classifier is 87.25%, and that of instance-based categorization is 65.04%. The overall accuracy of the hybrid approach is 70.80%.
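Two of the ingredients mentioned above, reduplication patterns and the cosine measure between feature vectors, can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names are hypothetical, only the two patterns exemplified in the abstract (AAB as in "喝喝茶", ABAB as in "研究研究") are covered rather than all nine, and the feature vectors passed to `cosine` would in practice be derived from the Sinica Treebank in a way the abstract does not specify.

```python
import math
from typing import Optional, Sequence, Tuple


def reduplication_base(word: str) -> Optional[Tuple[str, str]]:
    """Return (pattern name, base form) for a reduplicated word, else None.

    Covers only the two patterns exemplified in the abstract; the base form
    is the word whose category the reduplicated form would inherit.
    """
    n = len(word)
    if n == 3 and word[0] == word[1] and word[1] != word[2]:
        return "AAB", word[1:]   # 喝喝茶 -> 喝茶
    if n == 4 and word[:2] == word[2:] and word[0] != word[1]:
        return "ABAB", word[:2]  # 研究研究 -> 研究
    return None


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine measure between two equal-length feature vectors.

    Returns 0.0 if either vector is all zeros, to avoid dividing by zero.
    """
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this sketch, a reduplicated unknown verb can be reduced to its base form and looked up in the dictionary of known verbs, while `cosine` gives a similarity score between the grammatical feature vectors of two syntactic categories.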

