

Corpus-driven Linguistic Approaches to Sense Prediction

Advisors: Chu-Ren Huang (黃居仁), Kathleen Ahrens (安可思)


In this study, I used corpus-driven distributional analysis as the main method of sense prediction. I concentrated on observing individual semantic features in corpora in order to predict the senses of words that had not yet been analyzed. The corpora and lexical resources used were the Chinese Gigaword Corpus, HowNet, Chinese Wordnet, and Xiandai Hanyu Cidian (Xian Han). With these resources, I determined the collocation clusters of four target words, chi1 “eat”, wan2 “play”, huan4 “change”, and shao1 “burn”, through character similarity and concept similarity. The four target words are all transitive verbs, and each has more than two senses. Their collocates play an important role in this sense prediction study. In the character similarity clustering analysis, collocates that share an identical morpheme were placed in the same cluster. The corpus-based and computational approach therefore comprises two main strategies: (1) character similarity clustering analysis; and (2) concept similarity clustering analysis, which uses HowNet to examine (a) similarity between sememes and (b) similarity between concepts. I first predicted that different clusters can represent different senses, and then examined the accuracy rates for the four target words under both clustering analyses. Finally, I evaluated the four target words against the sense divisions in Chinese Wordnet and Xiandai Hanyu Cidian, showing that an automatic computational procedure can predict the different senses of chi1 “eat”, wan2 “play”, huan4 “change”, and shao1 “burn”.

After the corpus-based and computational study, I used an off-line paper-and-pencil task to test participants’ intuitions and so verify that different clusters can represent different senses. To examine the collocates related to the four lexically ambiguous target words, I employed a multiple-choice task (Burton et al. 1991). Because the stimuli were collected from the character similarity clustering analysis of the corpus-based and computational approach, I used the results of this task to demonstrate the viability of the approach.
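The character similarity clustering step described above (collocates that share a morpheme go into the same cluster) can be illustrated with a minimal Python sketch. This is not the thesis's actual program, which the abstract does not specify. The function name `cluster_by_shared_character` and the sample collocates are hypothetical, and the sketch assumes that one written character stands for one morpheme and that sharing is transitive, so a chain of overlapping words ends up in a single cluster.

```python
from collections import defaultdict


def cluster_by_shared_character(words):
    """Group words so that words sharing a character fall in one cluster.

    Assumption: one character approximates one morpheme, and sharing is
    treated as transitive (A~B and B~C puts A, B, C together).
    """
    words = list(dict.fromkeys(words))  # drop duplicates, keep order
    parent = {w: w for w in words}      # union-find forest over words

    def find(w):
        # Follow parent links to the cluster representative.
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    first_seen = {}  # character -> first word in which it occurred
    for w in words:
        for ch in set(w):
            if ch in first_seen:
                # Merge this word's cluster with the earlier word's cluster.
                parent[find(w)] = find(first_seen[ch])
            else:
                first_seen[ch] = w

    groups = defaultdict(list)
    for w in words:
        groups[find(w)].append(w)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical collocates of chi1 "eat"; illustrative only, not study data.
collocates = ["牛肉", "豬肉", "雞肉", "中藥", "西藥", "苦頭"]
for cluster in cluster_by_shared_character(collocates):
    print(cluster)
```

With these sample words, the collocates containing 肉 form one cluster, those containing 藥 form another, and 苦頭 stays alone. Each cluster would then be a candidate for a distinct sense of the target verb, to be checked against a sense inventory such as Chinese Wordnet.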


