Title

應用語意相依關係及超空間模擬語言模型於網頁文本探勘及資訊檢索之研究

Translated Titles

A Study on Semantic Dependencies and HAL Modeling for Web Text Mining and Information Retrieval

Authors

禹良治

Key Words

Text Mining ; Information Retrieval ; Natural Language Processing

Publication Name

Degree thesis, Department of Computer Science and Information Engineering, National Cheng Kung University

Volume or Term/Year and Month of Publication

2008

Academic Degree Category

Doctoral

Advisor

吳宗憲

Content Language

English

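Chinese Abstract (translated)

The purpose of information retrieval is to help users find useful information quickly and effectively. Traditional information retrieval systems usually use a bag-of-words representation for user queries and documents, so they use only word-level information during retrieval and ignore higher-level structural information. However, documents usually contain many topics, and this topic information helps in understanding users' query needs in depth, leading to more precise retrieval results. Therefore, this dissertation proposes using text mining methods to extract the topic information in user queries and documents, and uses the extracted topic information to improve retrieval precision.

This dissertation uses Chinese psychiatric web documents as the experimental corpus. The psychiatric documents consist of depressive problems raised by Internet users and the corresponding expert advice. These documents mainly contain three kinds of depression-related topic information: negative life events, depressive symptoms, and relations between symptoms. The goal of this dissertation is to implement a psychiatric document retrieval system with depression topic analysis, to help users quickly find documents related to their depressive problems.

Each of the three topics is identified with a different method. Depressive symptoms are usually expressed in text by a single sentence or by several sentences, so this dissertation proposes using semantic dependencies to analyze the semantic structure of sentences and determine, sentence by sentence, the depressive symptoms they contain. Relations between depressive symptoms, such as cause-effect and temporal relations, are determined using a domain ontology. Negative life events are composed of semantic patterns, defined as combinations of words that can semantically express a negative life event; this dissertation therefore combines the Hyperspace Analog to Language (HAL) model with evolutionary computation to automatically extract semantic patterns from unannotated psychiatric web documents.

Finally, this dissertation proposes a retrieval model that computes the similarity between a user query and a document according to the negative life events, depressive symptoms, and relations between symptoms that they contain. For experimental evaluation, word-based retrieval models such as the vector space model (VSM) and the Okapi model are used as baselines, and the results show that taking topic information into account yields more precise retrieval results.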

English Abstract

Information retrieval (IR) attempts to retrieve documents relevant to a user’s query from a large collection of documents. Rather than relying on keyword-based approaches, recent IR systems support natural language queries, so users can express their information needs naturally. These systems usually adopt a bag-of-words approach to represent a query and a document, which means that they can exploit only word-level information during the retrieval process. The high-level structural information in documents is often neglected. However, a document is generally structured; that is, it can be characterized as a set of topics and inter-topic relations. Such topic information helps in understanding users’ information needs and thus in obtaining more precise retrieval results. Therefore, this dissertation proposes the use of text mining techniques to extract the topic information contained in both queries and documents to improve retrieval performance.

The experimental corpora used herein are Chinese psychiatry web resources, a large collection of psychiatric documents produced by Internet users and psychiatrists. Each psychiatric document contains a user’s depressive problems and an expert’s suggestions to alleviate them. The psychiatric documents thus contain rich depression-related topic information, including negative life events, depressive symptoms and semantic relations between symptoms. Therefore, this dissertation attempts to help people efficiently and effectively locate the psychiatric documents relevant to their depressive problems according to the depression-related topic information.

The topics are extracted using different approaches. For depressive symptoms, the information is often embedded in a single sentence or multiple sentences. This dissertation proposes a text mining framework integrating the semantic dependencies of a sentence (intra-sentence) and the strength of lexical cohesion between sentences (inter-sentence) to mine the symptoms. Once the symptoms are identified, the semantic relations between them, such as cause-effect and temporal relations, can then be identified. This dissertation uses a domain ontology to mine the semantic relations between extracted symptoms. For negative life events, the information is often represented by meaningful patterns. A pattern refers to a semantically plausible combination of words. Therefore, this dissertation proposes a framework integrating a cognitively motivated model, Hyperspace Analog to Language (HAL), with evolutionary computation to induce variable-length patterns from unannotated psychiatry web resources.

Finally, a retrieval model is designed to calculate the similarity between input queries and psychiatric documents by combining the similarities of the negative life events, depressive symptoms and semantic relations within them. Experiments are conducted to compare the performance of the topic-aware model and conventional word-based models such as the vector space model (VSM) and the Okapi model. Experimental results show that the use of topic information can provide more precise information about users’ depressive problems, thus improving the retrieval precision.
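
As a rough illustration of the HAL model mentioned in the abstract, the sketch below builds a HAL-style weighted co-occurrence table by sliding a fixed-size window over a token sequence and giving closer preceding words larger weights. It follows the commonly described HAL weighting scheme (weight = window size minus distance plus one) and is not taken from the dissertation; the function name `build_hal`, the window size, and the English example tokens are all assumptions for illustration only.

```python
from collections import defaultdict


def build_hal(tokens, window=5):
    """Return {target: {preceding_word: weight}} for a token sequence.

    Hypothetical helper: a minimal HAL-style table, not the thesis code.
    """
    hal = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        # Look back at up to `window` tokens preceding the target.
        for d in range(1, window + 1):
            j = i - d
            if j < 0:
                break
            # Closer words get larger weights: window, window - 1, ..., 1.
            hal[target][tokens[j]] += window - d + 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {word: dict(context) for word, context in hal.items()}


# Assumed toy input; the thesis corpus itself is Chinese web text.
tokens = "i lost my job and i feel sad".split()
hal = build_hal(tokens, window=3)
print(hal["job"])
```

Each row of the table can be read as a vector describing the contexts in which a word tends to appear; how the dissertation combines such vectors with evolutionary computation to induce patterns is not specified in the abstract and is not modeled here.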

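Topic Category Basic and Applied Sciences > Information Science
College of Electrical Engineering and Computer Science > Department of Computer Science and Information Engineering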
Times Cited
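  1. 黃巧媛 (2016). Applying Latent Semantic Analysis to Exam Question Similarity Matching: A Case Study of the Certification Question Bank of the Logistics Association of the Republic of China. Master's thesis, Department of Distribution Management, National Taichung University of Science and Technology, 2016, pp. 1-63.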