
應用支持向量機於判斷中文停用字之研究

The Analysis of identification of Chinese Stop Characters with Support Vector Machine

Advisor: 黃乾綱

Abstract


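In Chinese linguistics, words are divided into two classes: content words (實詞) and function words (虛詞). Function words cannot form a sentence on their own; they can only combine with content words to complete a grammatical structure. For this reason, the usage of function words is a frequent object of study for linguists and one of the important cues for determining sentence structure. Developing tools that identify function words automatically is therefore an important problem in Chinese natural language processing. In this thesis, identifying Chinese stop characters means identifying single Chinese characters that are function characters.

This thesis proposes a method for identifying function words automatically. It combines a one-class support vector machine (SVM) with a two-class SVM and uses manually labeled data to train the learner, producing a tool that identifies Chinese function characters automatically. Forty-five features are constructed for each Chinese character. The one-class SVM, the two-class SVM, and the feature selection tool are all implemented with LIBSVM.

The experimental corpus consists of the sixteen sutras in the Fahua division (法華部類) of the CBETA Buddhist corpus. Training and test samples were drawn from two of these sutras, 《薩曇分陀利經》 and 《佛說法華三昧經》. The training set contains 3660 characters, of which 289 are positive examples; the test set contains 3228 characters, of which 223 are positive examples. The results show that, after parameter optimization, the proposed method reaches a precision of 0.947 and a recall of 0.920. In the independent test, however, precision is 0.311 and recall is 0.318.

Because the independent test performed poorly, the thesis also examines the likely causes. Two are identified: first, the training and test data come from different dynasties, so they differ in wording and style; second, the amount of training data is insufficient.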

Parallel Abstract


In Chinese linguistics, vocabulary is classified into content words and function words. Function words play an attaching or connecting role: they cannot form a sentence by themselves and only work together with content words to complete a grammatical structure. Linguists therefore often study function words for their grammatical function, and identifying them is an important research topic in Chinese natural language processing. In this thesis, the identification of function words is limited to single Chinese characters that are function words.

We propose a method that combines two types of support vector machine (SVM), a one-class SVM and a two-class SVM, to identify Chinese function words. Using function characters curated by humans, we trained the machine learning model to build an automatic identification tool for function characters. For every sample character, we generated a feature vector of 45 features. The LIBSVM tool is used in three parts: the one-class SVM, the two-class SVM, and feature selection.

The training and test data are selected from Buddhist scriptures in the FaHua division of the CBETA corpus. The training data contain 3660 characters, of which 289 are function characters. The test data contain 3228 characters, of which 223 are function characters. In our leave-one-out cross-validation experiment, after parameter optimization, precision and recall reach 0.947 and 0.920, respectively. In the independent test experiment, however, precision and recall drop to 0.311 and 0.318, respectively.

We discuss two factors that may explain the performance gap between the leave-one-out cross-validation experiment and the independent test experiment. One is the difference in writing style and word usage between the training and test data, which come from different dynasties. The other is insufficient training data.
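To put the reported figures side by side, the short Python sketch below computes two quantities that the abstract does not report: the share of positive examples in each data set, and the F1 score (the harmonic mean of precision and recall) for each experiment. The function names `f1_score` and `positive_rate` are illustrative, and choosing F1 as a summary is an assumption made here, not a metric used in the thesis. The input numbers are copied from the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (not reported in the abstract)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def positive_rate(n_positive, n_total):
    """Fraction of samples labeled as function characters."""
    return n_positive / n_total


# Figures copied from the abstract.
cross_validation = {"precision": 0.947, "recall": 0.920}
independent_test = {"precision": 0.311, "recall": 0.318}

train_rate = positive_rate(289, 3660)  # training set: 289 positives of 3660
test_rate = positive_rate(223, 3228)   # test set: 223 positives of 3228

cv_f1 = f1_score(**cross_validation)
ind_f1 = f1_score(**independent_test)

print(f"positive rate, training set: {train_rate:.3f}")
print(f"positive rate, test set:     {test_rate:.3f}")
print(f"F1, cross validation:        {cv_f1:.3f}")
print(f"F1, independent test:        {ind_f1:.3f}")
```

Both data sets are strongly imbalanced, with positives making up well under a tenth of the characters. A classifier that guessed at random would have a precision close to the positive rate, which is one way to read the independent-test precision of 0.311.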


Cited by


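賴映仲 (2014). 應用支持向量機於鯨豚哨音分類之研究 [A study on applying support vector machines to the classification of cetacean whistles] [Master's thesis, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2014.01630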
