A Study of Web-Corpus-Based Relatedness Measures and Their Applications to Community Detection and Query Suggestion

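In natural language processing, the resources used to compute statistics matter most. Many ready-made corpora and validated language models are now readily available, yet corpus-based research always faces the question of whether a corpus can reflect the relatedness of the newest vocabulary. Because language is alive, new words and terms are created every day, and determining the relatedness of such new words and terms is an important research issue. In this thesis we define a novel web-based relatedness measure that treats web pages as a corpus, and we examine how different web domains affect the method. The dependency score of two terms is obtained from the web content and the frequency information of the two terms; their actual relatedness score is obtained by passing the dependency score through a transfer function. The thesis proposes four transfer functions: the Poisson function, the log-concave function, the power-concave function, and the Gompertz function. In the experiments we test these four models on three well-known test sets and compare them in detail with the results of other research groups. Personal names have long been an important language resource in prior research. We use the dependency score to decide whether two personal names are related, and we propose three strategies for this identification task, namely direct association, association matrix, and scalar association matrix, to verify that our relatedness measure is sound. We use the relatedness measure to build a social network, label the category of each pair in the network with a Markov random process, and try to extract keywords from web pages to describe each pair's relationship. We also apply the relatedness measure to query suggestion. Unlike traditional query suggestion, which relies on query logs, our suggestions are extracted directly from web pages. In the experiments, the proposed method shows a high level of agreement.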

Keywords

Relatedness Measure; Community Detection; Query Suggestion; Category Labeling; Relationship Labeling; Evolving Social Network

Parallel Abstract

In statistical natural language processing, the resources used to compute the statistics are indispensable. Many kinds of corpora have been made available, and many language models have been experimented with. One major issue behind corpus-based approaches is whether the adopted corpora can reflect up-to-date usage. Languages are alive: new terms and phrases enter daily use, and capturing these new usages is an important research topic. This thesis defines a novel web-based relatedness measure and explores snippets in various web domains as corpora. The mutual dependency score between two objects is calculated from the content information and the frequency information of the two objects. The relatedness score of the two objects is defined by projecting the dependency score through a transfer function. Four transfer functions are considered, based on the Poisson, log-concave, power-concave, and Gompertz functions. The four transfer functions are evaluated on three well-known benchmark datasets: WordSimilarity-353, Miller-Charles, and Rubenstein-Goodenough. Named entities are common foci of searchers. We apply the dependency score to evaluate name-level association with three strategies: direct association, association matrix, and scalar association matrix. Modeling and naming general entity-entity relationships is challenging in the construction of social networks. Given a seed denoting a person name, we use the Google search engine, a named entity recognizer (NER), and the web-based relatedness measure to construct an evolving social network. For each entity pair in the network, we apply a Markov chain random process to extract potential categories defined in the Open Directory Project (ODP). Moreover, to label their relationships, we combine the tf×idf scores of noun phrases extracted from snippets with the rank scores of the categories. We also apply our relatedness measures to query suggestion. Unlike traditional query suggestion, which draws on query logs, we extract suggestion terms from snippets. With the proposed relatedness measures, the extracted query suggestions show a high agreement of relatedness.

Parallel Keywords

Relatedness Measure; Community Chain Detection; Query Suggestion; Category Labeling; Relationships Labeling; Evolving Social Network

