
Graduate student: 林倚禛
Yi-Jhen Lin
Thesis title: 專有詞彙相關解釋句自動分類組織技術之研究
Classification of Informative Sentences for Domain-Specific Terms
Advisor: 柯佳伶
Koh, Jia-Ling
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar)
Language: Chinese
Number of pages: 65
Chinese keywords: 句子分類, 機率模型, SVM
English keywords: sentence classification, probabilistic model, SVM
Thesis type: Academic thesis
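  • This thesis studies techniques for automatically classifying and organizing explanatory sentences related to domain-specific terms. When a user enters a domain-specific term as a query, the proposed method uses designated PDF e-books as the knowledge source. A sentence retrieval system first finds the sentences related to the term; the lexical patterns of these explanatory sentences are then extracted as classification features, and two methods are proposed for classifying the sentences into three categories: Overview, Detail Description, and Usage. The first method is a language-model-based Bayesian classifier. Besides using a bigram model to represent the semantic association between adjacent words in a lexical pattern, it linearly combines the bigram and unigram models by weighted summation to build a probabilistic classification model. The second method takes the words that appear within a fixed range before and after the domain-specific term in a lexical pattern, together with pairs of adjacent words, as classification features, and builds a Support Vector Machine (SVM) classifier to classify the sentences. Experimental results show that, on the controlled test data set, the SVM classifier achieves higher overall accuracy than the Bayesian classifier. The Bayesian classifier, however, adapts better to the amount of training data: when the training data is reduced, its overall accuracy barely drops.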
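The first method described above can be illustrated with a minimal sketch. It assumes add-alpha smoothing, an interpolation weight `lam`, a `<s>` start token, and a `TERM` placeholder for the query term; none of these details are stated in the abstract, and the toy training patterns are invented for illustration only. This is not the thesis's implementation.

```python
import math
from collections import Counter, defaultdict


class InterpolatedLMClassifier:
    """Bayes-style classifier with one unigram+bigram language model per category."""

    def __init__(self, lam=0.7, alpha=1.0):
        self.lam = lam      # assumed weight on the bigram model (hypothetical value)
        self.alpha = alpha  # assumed add-alpha smoothing constant (hypothetical value)
        self.unigrams = defaultdict(Counter)
        self.bigrams = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()

    def fit(self, patterns, labels):
        for tokens, label in zip(patterns, labels):
            seq = ["<s>"] + list(tokens)
            self.class_counts[label] += 1
            self.unigrams[label].update(seq)
            self.bigrams[label].update(zip(seq, seq[1:]))
            self.vocab.update(seq)
        return self

    def _denominator_pad(self):
        # Vocabulary size plus one slot reserved for unseen words.
        return self.alpha * (len(self.vocab) + 1)

    def _p_unigram(self, label, w):
        counts = self.unigrams[label]
        return (counts[w] + self.alpha) / (sum(counts.values()) + self._denominator_pad())

    def _p_bigram(self, label, prev, w):
        num = self.bigrams[label][(prev, w)] + self.alpha
        return num / (self.unigrams[label][prev] + self._denominator_pad())

    def log_score(self, label, tokens):
        seq = ["<s>"] + list(tokens)
        # Log prior of the category plus the log-likelihood of the pattern.
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        for prev, w in zip(seq, seq[1:]):
            p = (self.lam * self._p_bigram(label, prev, w)
                 + (1 - self.lam) * self._p_unigram(label, w))
            score += math.log(p)
        return score

    def predict(self, tokens):
        return max(self.class_counts, key=lambda c: self.log_score(c, tokens))


# Invented toy lexical patterns; TERM stands in for the query term.
TRAIN = [
    ("TERM is a kind of structure".split(), "Overview"),
    ("TERM is defined as a model".split(), "Overview"),
    ("TERM is used to sort data".split(), "Usage"),
    ("we use TERM to index records".split(), "Usage"),
    ("TERM consists of nodes and edges".split(), "Detail Description"),
]
clf = InterpolatedLMClassifier().fit([p for p, _ in TRAIN], [y for _, y in TRAIN])
```

Calling `clf.predict("TERM is used to store keys".split())` scores the pattern under each category's interpolated model and returns the highest-scoring category. The weighted sum lets the unigram model back off gracefully when a bigram was never seen in training, which is one plausible reason such a model stays stable with less training data.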

    This thesis studies techniques for automatically classifying informative sentences about domain-specific terms. Given a domain-specific term as a query, we use a set of PDF e-books as the source for discovering and organizing informative sentences related to the query term. First, a sentence retrieval system evaluates the relevance of each sentence and retrieves the informative sentences that are highly relevant to the domain-specific term. Lexical patterns are then extracted from these sentences, and two classification methods are proposed for automatically classifying the sentences into three predefined categories: Overview, Detail Description, and Usage. The first method is similar to a Bayes classifier and is built on the concept of a language model. It applies the bigram language model and also uses a weighted combination of the unigram and bigram probabilities. The second method takes the unigram and bigram tokens that appear before or after the domain-specific term in a lexical pattern as classification features, and uses a Support Vector Machine (SVM) to construct the classification model. The experimental results show that the SVM-based classifier performs better than the Bayes-based classifier on the controlled test data set. However, when the size of the training data set is reduced, the accuracy of the Bayes-based classifier remains almost stable.
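The feature extraction for the second method can be sketched as follows. The window size, the `TERM` placeholder, the left/right side tags (`L`, `R`), and the binary feature values are all assumptions made for illustration; the abstract only says that unigrams and adjacent-word bigrams near the term are used. The last helper renders an example in the sparse `label index:value` text format that LIBSVM reads.

```python
def window_features(tokens, term="TERM", window=3):
    """Collect unigram and adjacent-bigram features within `window` tokens of the term."""
    feats = set()
    for i, tok in enumerate(tokens):
        if tok != term:
            continue
        before = tokens[max(0, i - window):i]
        after = tokens[i + 1:i + 1 + window]
        # Side tags (assumed) keep words before the term distinct from words after it.
        for side, span in (("L", before), ("R", after)):
            feats.update(f"{side}:uni:{w}" for w in span)
            feats.update(f"{side}:bi:{a}_{b}" for a, b in zip(span, span[1:]))
    return feats


def build_index(feature_sets):
    """Map each feature string to a 1-based column index (sorted for determinism)."""
    names = sorted(set().union(*feature_sets))
    return {name: i for i, name in enumerate(names, start=1)}


def to_sparse_line(label_id, feats, index):
    """Render one example as 'label idx:1 idx:1 ...', skipping features not in the index."""
    cols = sorted(index[f] for f in feats if f in index)
    return " ".join([str(label_id)] + [f"{c}:1" for c in cols])
```

For example, `window_features("we use TERM to index records".split(), window=2)` yields the left-side features for "we" and "use", the right-side features for "to" and "index", and one bigram per side. Lines produced by `to_sparse_line` could then be written to a file and passed to an SVM trainer.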

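    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1-1 Research Motivation
      1-2 Research Objectives
      1-3 Research Scope and Limitations
      1-4 Method
      1-5 Thesis Organization
    Chapter 2 Literature Review
      2-1 Text Information Retrieval Techniques
        2-1.1 Overview of Statistical Language Models
        2-1.2 Estimating Term-Query Relatedness and Semantic Association
      2-2 Text Clustering Techniques
      2-3 Text Classification Techniques
    Chapter 3 System Architecture and Workflow
    Chapter 4 Data Collection and Preprocessing
      4-1 Definition of Classification Categories
      4-2 Training Data Collection
      4-3 Training Data Preprocessing
        4-3.1 Lexical Pattern Extraction
        4-3.2 Term Frequency Statistics
      4-4 Test Data Collection and Preprocessing
    Chapter 5 Informative Sentence Classification Methods
      5-1 Language-Model-Based Bayesian Classification Model
        5-1.1 Bigram Language Model
        5-1.2 Weighted Combination of Unigram and Bigram Language Models
      5-2 Support Vector Machine (SVM) Classifier
        5-2.1 Feature Extraction
        5-2.2 Building Sentence Feature Vectors
    Chapter 6 Evaluation of Classification Performance
      6-1 Experimental Setup
      6-2 Experimental Results
        6-2.1 Controlled Test Data
        6-2.2 Open Test Data
    Chapter 7 Conclusions and Future Work
    References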

