General Knowledge Annotation Based on How-net (基於知網的常識知識標注)

Abstract


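How-net is a bilingual general knowledge base that describes the various relations between concepts, including hyponymy, synonymy, antonymy, part-whole, attribute-host, material-product, converse, dynamic role, and concept co-occurrence relations. In this paper we use How-net to annotate a corpus of 30,000 words. The corpus is drawn from newspaper reports on social crime in the Academia Sinica Balanced Corpus (version 3). We summarize the annotation method, the problems found during annotation, and our solutions to them.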


Parallel Abstract


How-net is a bilingual general knowledge base that describes relations between concepts and relations between the attributes of concepts. It covers over 62,000 Chinese concepts and close to 73,000 English equivalents. The relations include hyponymy, synonymy, antonymy, meronymy, attribute-host, material-product, converse, dynamic role, and concept co-occurrence. The philosophy behind the design of How-net is its ontological view that all physical and non-physical matter undergoes a continual process of motion and change in a specific space and time. This motion and change is usually reflected by a change in state, which in turn is manifested by a change in the value of some attribute. The topmost level of classification in How-net thus comprises Entity, Event, Attribute, and Attribute Value. How-net adopts a bottom-up approach to derive a total of over 1,400 sememes, which are extracted from about 6,000 Chinese characters. The sememes are organized hierarchically, and their robustness was evaluated by checking their adequacy in describing over 62,000 Chinese concepts. That evaluation concluded that the set of sememes is stable and robust enough to describe all kinds of concepts, whether current or new.
In this paper, we describe the use of How-net in annotating a corpus of newspaper texts covering the crime domain. The corpus consists of 30,000 words extracted from the Sinica Corpus, version 3.0. Our goals are (i) to create a linguistic resource rich in both syntactic and general knowledge information for the Chinese computational linguistics community, and (ii) to develop a Chinese text understanding approach based directly on How-net. We developed a tool to help human annotators select the correct How-net definitions. We describe in detail the methodology used to distinguish the different definitions of a word form. We also categorize unregistered concepts and explain how we defined them.
We also performed an experiment to compare the degree of consistency among four human annotators. The experiment showed that inter-annotator consistency is over 90% and that the average annotation speed is 7.4 concepts per minute. Our work verifies the robustness of How-net: only 4.2% of concepts are unregistered, and 1.0% of concepts have missing definitions.
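
The consistency figure for four annotators can be illustrated with a minimal sketch. The abstract does not state how consistency was computed, so the sketch assumes mean pairwise percent agreement over the definitions chosen for the same sequence of concepts. The annotator names, the definition IDs, and both function names are hypothetical and are not data or code from the study.

```python
from itertools import combinations


def pairwise_agreement(labels_a, labels_b):
    """Fraction of positions where two annotators chose the same definition."""
    if len(labels_a) != len(labels_b):
        raise ValueError("annotators must label the same concepts")
    if not labels_a:
        raise ValueError("no labels to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def mean_pairwise_agreement(annotations):
    """Average the pairwise agreement over every pair of annotators."""
    pairs = list(combinations(annotations.values(), 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(pairwise_agreement(a, b) for a, b in pairs) / len(pairs)


# Hypothetical data: each string stands in for the ID of a chosen definition.
annotations = {
    "annotator_1": ["d1", "d2", "d3", "d4", "d5"],
    "annotator_2": ["d1", "d2", "d3", "d4", "d6"],
    "annotator_3": ["d1", "d2", "d3", "d4", "d5"],
    "annotator_4": ["d1", "d2", "d7", "d4", "d5"],
}
score = mean_pairwise_agreement(annotations)
print(f"mean pairwise agreement: {score:.2f}")
```

Percent agreement does not correct for agreement by chance. A chance-corrected statistic such as Fleiss' kappa would be a stricter alternative, but nothing in the abstract indicates which measure the authors used.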



