
Identify Human Relationship From Retrieved Snippets (從搜尋結果進行人際關係辨識)

Advisor: 梁婷

Abstract


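Identifying relations between entities has long been an important task in discourse processing. The relations identified to date include co-working relations between persons and organizations, relations between diseases and drugs, relations between authors and their works, interactions between proteins, and equivalence relations between nominals. Most approaches identify relations with learning models or pattern analysis; a few use parse trees to identify the target relation from syntactic structure. The corpora these methods use fall into two kinds: fixed corpora and dynamically updated corpora (such as web search results). Although identifying relations from a fixed corpus yields higher accuracy, search engine results provide more up-to-date information. Because human relationships change frequently, this thesis identifies them from search engine results. We use Wikipedia to build a development corpus and compile relation patterns for kinship and co-working relations. In addition, to identify the domain and domain vocabulary associated with each person entity, we use bootstrapping to extract clue words from the development corpus and use them to expand queries so as to retrieve relevant search results. To speed up text processing, we adopt simple person-name and part-of-speech tagging and resolve personal pronouns. We propose a two-stage identification procedure: the first stage matches patterns, and the second stage applies a support vector machine (SVM) to seven extracted features. The features include the number and position of clue words, the mutual information of the persons, and the similarity between entities. The proposed method achieves an F-score of 0.86 in an experiment on 396 kinship instances and an F-score of 0.75 on 175 co-working instances.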
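The two-stage procedure can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the thesis implementation: the patterns in `KINSHIP_PATTERNS` are hypothetical stand-ins for the Wikipedia-derived patterns, the function names are invented here, and the second stage is passed in as any callable so that a trained SVM (or a stub) can be plugged in.

```python
import re
from typing import Callable

# Hypothetical kinship patterns for illustration only; the actual patterns
# compiled from Wikipedia are not listed in the abstract.
_KIN = "(?:son|daughter|father|mother|wife|husband|brother|sister)"
KINSHIP_PATTERNS = [
    "{a}, (?:the )?" + _KIN + " of {b}",
    "{a} is (?:the )?" + _KIN + " of {b}",
    "{a} and (?:his|her) " + _KIN + ",? {b}",
]


def matches_pattern(snippet: str, a: str, b: str) -> bool:
    """Stage 1: True if any pattern links the two names, in either order."""
    for first, second in ((a, b), (b, a)):
        for template in KINSHIP_PATTERNS:
            regex = template.format(a=re.escape(first), b=re.escape(second))
            if re.search(regex, snippet, flags=re.IGNORECASE):
                return True
    return False


def identify_kinship(
    snippet: str,
    a: str,
    b: str,
    classifier: Callable[[str, str, str], bool],
) -> bool:
    """Accept a pattern match outright; otherwise defer to the stage-2 classifier."""
    if matches_pattern(snippet, a, b):
        return True
    # Stage 2 (assumed interface): a feature-based classifier such as an SVM.
    return classifier(snippet, a, b)
```

Passing the classifier as a parameter keeps the sketch free of third-party dependencies; in practice the callable would wrap feature extraction and a trained model.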

Parallel Abstract


Identifying relations among entities is an important task in document processing. The relations identified in previous research include co-working relations between persons and organizations, relations between diseases and medicines, relations between authors and their works, interactions between proteins, and equivalence relations among nominals. Most identification methods are based on machine learning algorithms or pattern matching; a few are based on parsing results. The corpora used for relation identification can be static or dynamic (such as search engine results). Although identifying relations from a static corpus generally outperforms methods that use dynamic corpora, dynamic corpora contain more up-to-date information. In this thesis, we use retrieved snippets to identify human relationships and Wikipedia to construct a development corpus. We extract domain words from the development corpus with a bootstrapping algorithm and use them to expand queries for more accurate search results. To speed up document processing, simple methods are implemented for part-of-speech tagging, person name tagging, and pronominal anaphora resolution. The proposed kinship identification is implemented by pattern matching and a support vector machine (SVM). The features used for identification include the number and position of clue words and the cosine similarity of entities related to the persons. The kinship identifier yields an F-score of 0.86 in an experiment containing 396 kinship instances, and the co-working identifier yields an F-score of 0.75 on 175 co-working instances.
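As a rough illustration of two of the features named above, the following Python sketch computes a clue-word count with a position value and a cosine similarity between bag-of-words vectors. The abstract does not define how position is measured or how entity vectors are built, so both choices here are assumptions: position is taken as the relative offset of the first clue word, and entities are represented as term-frequency `Counter` objects. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Tuple


def clue_word_features(tokens: Sequence[str], clue_words: Iterable[str]) -> Tuple[int, float]:
    """Return (number of clue words, relative offset of the first one).

    The offset is in [0, 1); -1.0 signals that no clue word occurs.
    This definition of "position" is an assumption made for illustration.
    """
    clues = {w.lower() for w in clue_words}
    positions = [i for i, tok in enumerate(tokens) if tok.lower() in clues]
    if not positions:
        return 0, -1.0
    return len(positions), positions[0] / len(tokens)


def cosine_similarity(x: Counter, y: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_x * norm_y)
```

Feature values of this kind would be concatenated into a vector and handed to the classifier; the remaining features and their exact definitions are given in the thesis body rather than the abstract.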

