How-net is a bilingual common-sense knowledge base that describes a wide range of relations between concepts, including hyponymy, synonymy, antonymy, meronymy (part-whole), attribute-host, material-product, converse, dynamic role, and concept co-occurrence relations. In this paper we use How-net to annotate a 30,000-word corpus. The corpus consists of newspaper reports on social crime drawn from the Academia Sinica Balanced Corpus (version 3). We summarize the annotation method, the problems encountered during annotation, and our solutions.
How-net is a bilingual general knowledge base describing relations between concepts and relations between the attributes of concepts. It covers over 62,000 concepts in the Chinese language and close to 73,000 English equivalents. The relations include hyponymy, synonymy, antonymy, meronymy, attribute-host, material-product, converse, dynamic role, and concept co-occurrence. The philosophy behind the design of How-net is its ontological view that all physical and non-physical matters undergo a continual process of motion and change in a specific space and time. The motion and change are usually reflected by a change in state, which in turn is manifested by a change in the value of some attributes. The top-most level of classification in How-net thus includes Entity, Event, Attribute, and Attribute Value. How-net adopts a bottom-up approach in deriving a total of over 1,400 sememes. These sememes are extracted from about 6,000 Chinese characters. They are organized hierarchically, and their robustness is carefully evaluated by checking their adequacy in describing over 62,000 concepts in Chinese. This evaluation concluded that the set of sememes is stable and robust enough to describe all kinds of concepts, whether current or new. In this paper, we describe the use of How-net in annotating a corpus of newspaper texts covering the crime domain. The corpus consists of 30,000 words extracted from the Sinica corpus, version 3.0. Our goals are: (i) to create a linguistic resource rich in both syntactic and general knowledge information for the computational Chinese linguistics community; (ii) to develop a Chinese text understanding approach based directly on How-net. In our work, we developed a tool to help human annotators select the correct How-net definitions. We describe in detail the methodology used to differentiate the different definitions of a word form. We categorize unregistered concepts and explain how we defined these concepts. We also performed an experiment to compare the degree of consistency among four human annotators. The experiment showed that the inter-annotator consistency is over 90% and the average annotation speed is 7.4 concepts per minute. Our work verified the robustness of How-net: only 4.2% of the concepts are unregistered, and 1.0% of the concepts have missing definitions.
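The consistency figure cited above can be read, if one assumes it is computed as average pairwise agreement on the chosen How-net definition for each token, roughly as in the following minimal Python sketch. The function name, data layout, and example labels are hypothetical and introduced only for illustration; the paper's actual consistency measure may differ.

```python
from itertools import combinations

def pairwise_agreement(annotations):
    """Average pairwise agreement among annotators.

    `annotations` maps each annotator name to a list of chosen
    definition labels, one per word token, in the same order and
    of the same length for every annotator.
    """
    annotators = list(annotations)
    pair_scores = []
    for a, b in combinations(annotators, 2):
        labels_a, labels_b = annotations[a], annotations[b]
        matches = sum(x == y for x, y in zip(labels_a, labels_b))
        pair_scores.append(matches / len(labels_a))
    return sum(pair_scores) / len(pair_scores)

# Hypothetical example: four annotators labelling five tokens.
example = {
    "A1": ["DEF_1", "DEF_2", "DEF_1", "DEF_3", "DEF_2"],
    "A2": ["DEF_1", "DEF_2", "DEF_1", "DEF_3", "DEF_2"],
    "A3": ["DEF_1", "DEF_2", "DEF_2", "DEF_3", "DEF_2"],
    "A4": ["DEF_1", "DEF_2", "DEF_1", "DEF_3", "DEF_1"],
}
print(f"average pairwise agreement: {pairwise_agreement(example):.2%}")
```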