A Study on Semantic Dependencies and HAL Modeling for Web Text Mining and Information Retrieval



Text Mining ; Information Retrieval ; Natural Language Processing



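The purpose of information retrieval is to help users find useful information quickly and effectively. Traditional information retrieval systems usually represent user queries and documents with a bag-of-words model, so retrieval relies only on word-level information and ignores higher-level structural information. Documents, however, typically contain multiple topics, and this topic information helps clarify the user’s information need and thereby yields more precise retrieval results. This dissertation therefore proposes using text mining methods to extract topic information from user queries and documents, and uses the extracted topic information to improve retrieval precision. The experimental corpus consists of Chinese psychiatric documents from the web. Each document comprises a depression-related problem posted by an Internet user and the corresponding expert advice. The documents mainly contain three kinds of depression-related topic information: negative life events, depressive symptoms, and relations between symptoms. The goal of this dissertation is to build a psychiatric document retrieval system with depression topic analysis that helps users quickly find documents relevant to their depressive problems. Each of the three topics is identified with a different method. Depressive symptoms are usually expressed in a single sentence or across several sentences, so this dissertation uses semantic dependencies to analyze the semantic structure of sentences and identifies the depressive symptoms contained in each sentence. Relations between symptoms, such as cause-effect and temporal relations, are identified with a domain ontology. Negative life events are expressed by semantic patterns, defined as combinations of words that can semantically represent a negative life event; this dissertation therefore combines the Hyperspace Analog to Language (HAL) model with evolutionary computation to extract semantic patterns automatically from unannotated psychiatric web documents. Finally, this dissertation proposes a retrieval model that computes the similarity between a user query and a document from the negative life events, depressive symptoms, and symptom relations they contain. In the experimental evaluation, word-based retrieval models such as the vector space model (VSM) and the Okapi model serve as baselines, and the results show that taking topic information into account yields more precise retrieval results.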

Information retrieval (IR) aims to retrieve the documents relevant to a user’s query from a large collection. Recent IR systems accept natural language queries rather than keywords alone, so users can express their information needs naturally. These systems usually adopt a bag-of-words representation for queries and documents, which means that they exploit only word-level information during retrieval and often neglect the high-level structural information in documents. A document, however, is generally structured: it can be characterized as a set of topics and inter-topic relations. Such topic information helps clarify users’ information needs and thus yields more precise retrieval results. This dissertation therefore proposes text mining techniques that extract the topic information contained in both queries and documents to improve retrieval performance. The experimental corpus consists of Chinese psychiatry web resources, a large collection of psychiatric documents produced by Internet users and psychiatrists. Each psychiatric document contains a user’s depressive problems and an expert’s suggestions for alleviating them. The documents thus contain rich depression-related topic information, including negative life events, depressive symptoms, and semantic relations between symptoms. This dissertation uses that topic information to help people locate the psychiatric documents relevant to their depressive problems efficiently and effectively. Each topic is extracted with a different approach. Depressive symptoms are often embedded in a single sentence or spread across multiple sentences, so this dissertation proposes a text mining framework that integrates the semantic dependencies within a sentence (intra-sentence) and the strength of lexical cohesion between sentences (inter-sentence) to mine the symptoms. Once the symptoms are identified, a domain ontology is used to mine the semantic relations between them, such as cause-effect and temporal relations. Negative life events are often expressed by meaningful patterns, where a pattern is a semantically plausible combination of words. This dissertation therefore proposes a framework that integrates a cognitively motivated model, such as Hyperspace Analog to Language (HAL), with evolutionary computation to induce variable-length patterns from unannotated psychiatry web resources. Finally, a retrieval model is designed that calculates the similarity between an input query and a psychiatric document by combining their similarities in negative life events, depressive symptoms, and the semantic relations among symptoms. Experiments compare the topic-aware model with conventional word-based models such as the vector space model (VSM) and the Okapi model. The results show that topic information provides more precise information about users’ depressive problems and thereby improves retrieval precision.
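
The abstract names HAL without describing how it is built, so the following is only a minimal sketch of the standard HAL construction, not the dissertation's implementation. A window slides over the token stream, and each word that precedes the target within the window is counted with a weight that decreases with distance. The function name `build_hal`, the window size, and the English example tokens are illustrative assumptions; the dissertation works on segmented Chinese text and does not state its window size here.

```python
from collections import defaultdict


def build_hal(tokens, window=5):
    """Build a HAL-style weighted co-occurrence table.

    hal[target][context] accumulates a weight for every `context` word
    that appears up to `window` positions before `target`. An adjacent
    word gets weight `window`; a word `window` positions away gets 1.
    (Hypothetical helper: name and defaults are not from the source.)
    """
    hal = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break  # ran past the start of the token stream
            hal[target][tokens[j]] += window - distance + 1
    return hal


# Illustrative input only; not taken from the dissertation's corpus.
tokens = "i lost my job and i feel sad".split()
hal = build_hal(tokens, window=3)

for context, weight in hal["job"].items():
    print(context, weight)
```

Each row of the table is a distributional vector for one word. How such vectors are combined with evolutionary computation to induce variable-length patterns is not specified in the abstract and is left out of this sketch.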

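Basic and Applied Sciences > Information Science
College of Electrical Engineering and Computer Science > Department of Computer Science and Information Engineering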
