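A Study on Techniques for Automatic Classification and Organization of Explanatory Sentences Related to Domain-Specific Terms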

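This thesis investigates techniques for automatically classifying and organizing explanatory sentences related to domain-specific terms. When a user enters a domain-specific term to look up, the proposed approach uses designated PDF e-books as the knowledge source: a sentence retrieval system first finds sentences related to the term, and the lexical patterns of these explanatory sentences are then extracted as classification features. Two methods are proposed for classifying the sentences into three categories: Overview, Detail Description, and Usage. The first is a Bayesian classification method based on language models; it uses a bigram model to represent the semantic association between adjacent words in a lexical pattern and linearly combines the bigram and unigram models by weighted summation to build a probabilistic classification model. The second uses the words that appear within a fixed range before and after the term in a lexical pattern, together with adjacent word pairs, as classification features to build a Support Vector Machine (SVM) classifier. Experimental results show that, on a controlled test data set, the SVM classifier achieves better overall accuracy than the Bayesian classifier, whereas the Bayesian classifier adapts better to the amount of training data: when the training data is reduced, its overall accuracy hardly drops.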
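The first method can be illustrated with a minimal sketch. It assumes the interpolated form P(w | prev, c) = λ·P_bigram(w | prev, c) + (1 − λ)·P_unigram(w | c), add-alpha smoothing, and a `<TERM>` placeholder standing in for the query term inside each lexical pattern. The class name, the values of `lam` and `alpha`, and the toy training patterns are all hypothetical; the abstract does not give the thesis's actual weights, smoothing scheme, or data.

```python
import math
from collections import Counter, defaultdict


class InterpolatedLMClassifier:
    """Bayes-style classifier scoring a token sequence with a weighted
    mix of per-class bigram and unigram probabilities (illustrative only)."""

    def __init__(self, lam=0.7, alpha=1.0):
        self.lam = lam        # assumed weight on the bigram model
        self.alpha = alpha    # assumed add-alpha smoothing constant
        self.unigrams = defaultdict(Counter)   # class -> token counts
        self.bigrams = defaultdict(Counter)    # class -> (prev, token) counts
        self.contexts = defaultdict(Counter)   # class -> counts of tokens used as "prev"
        self.class_counts = Counter()
        self.vocab = set()

    def fit(self, patterns, labels):
        for tokens, label in zip(patterns, labels):
            padded = ["<s>"] + list(tokens)    # sentence-start marker
            self.class_counts[label] += 1
            self.vocab.update(padded)
            self.unigrams[label].update(padded)
            self.contexts[label].update(padded[:-1])
            self.bigrams[label].update(zip(padded, padded[1:]))
        return self

    def log_score(self, tokens, label):
        uni, bi, ctx = self.unigrams[label], self.bigrams[label], self.contexts[label]
        total = sum(uni.values())
        v = len(self.vocab) + 1                # one extra slot for unseen tokens
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        padded = ["<s>"] + list(tokens)
        for prev, word in zip(padded, padded[1:]):
            p_uni = (uni[word] + self.alpha) / (total + self.alpha * v)
            p_bi = (bi[(prev, word)] + self.alpha) / (ctx[prev] + self.alpha * v)
            score += math.log(self.lam * p_bi + (1 - self.lam) * p_uni)
        return score

    def predict(self, tokens):
        return max(self.class_counts, key=lambda c: self.log_score(tokens, c))


# Toy lexical patterns (hypothetical, for illustration only).
train = [
    (["<TERM>", "is", "a", "kind", "of"], "Overview"),
    (["<TERM>", "is", "defined", "as"], "Overview"),
    (["<TERM>", "consists", "of", "several", "steps"], "Detail Description"),
    (["the", "steps", "of", "<TERM>", "consists", "of"], "Detail Description"),
    (["<TERM>", "is", "used", "to"], "Usage"),
    (["we", "can", "use", "<TERM>", "to"], "Usage"),
]
clf = InterpolatedLMClassifier().fit([p for p, _ in train], [c for _, c in train])
print(clf.predict(["<TERM>", "is", "a"]))
print(clf.predict(["<TERM>", "is", "used", "to"]))
```

Setting `lam` to 1.0 would reduce this sketch to a pure bigram model, and 0.0 to a unigram model, which is one way to see what the weighted combination adds.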

Keywords

sentence classification; probabilistic model; SVM

Parallel Abstract

This thesis studies techniques for automatically classifying and organizing informative sentences about domain-specific terms. Given a domain-specific term as a query, we use a set of PDF e-books as the knowledge source for discovering and organizing informative sentences related to the query term. First, a sentence retrieval system retrieves the sentences that are highly relevant to the term. Lexical patterns are then extracted from these sentences, and two classification methods are proposed for automatically classifying the sentences into three predefined categories: Overview, Detail Description, and Usage. The first method is a Bayes-style classifier built on language models: in addition to applying a bigram language model, it uses a weighted linear combination of the unigram and bigram probabilities. The second method takes the unigram and bigram tokens that appear before or after the domain-specific term in a lexical pattern as classification features and trains a Support Vector Machine (SVM) classifier. Experimental results show that, on the controlled test data set, the SVM-based classifier achieves higher overall accuracy than the Bayes-based classifier. However, when the size of the training data set is reduced, the accuracy of the Bayes-based classifier remains almost unchanged.
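The feature extraction behind the second method might look roughly like the following sketch. It assumes that the query term appears in a tokenized lexical pattern as a `<TERM>` placeholder and that features are binary indicators; the function name `window_features`, the feature-name format, and the default window size of 3 are hypothetical, since the abstract does not state the actual range used.

```python
def window_features(tokens, term="<TERM>", window=3):
    """Collect unigram and bigram features from up to `window` tokens
    before and after the term placeholder (illustrative sketch)."""
    if term not in tokens:
        return {}
    pos = tokens.index(term)                       # first occurrence of the term
    before = tokens[max(0, pos - window):pos]
    after = tokens[pos + 1:pos + 1 + window]
    features = {}
    for side, span in (("before", before), ("after", after)):
        for tok in span:                           # single tokens near the term
            features[f"{side}:uni:{tok}"] = 1
        for a, b in zip(span, span[1:]):           # adjacent token pairs
            features[f"{side}:bi:{a}_{b}"] = 1
    return features


pattern = ["we", "can", "use", "<TERM>", "to", "classify", "text", "quickly"]
feats = window_features(pattern, window=2)
for name in sorted(feats):
    print(name)
```

Dictionaries of this shape could then be turned into sparse vectors and passed to any SVM implementation, for example scikit-learn's `DictVectorizer` followed by `LinearSVC`; that pairing is a suggestion here, not the toolkit reported in the thesis.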

Parallel Keywords

sentence classification; probabilistic model; SVM

