
Studying Textual Entailment Using Negative Entailment Linguistic Phenomena (使用非正向蘊涵語言現象研究文本蘊涵)

Analyses of Negative Entailment Phenomena for Textual Entailment Recognition

Advisor: Hsin-Hsi Chen (陳信希)

Abstract


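The RTE (Recognising Textual Entailment) series of studies on English textual entailment recognition has run for years, and its seventh edition was held in 2011. Also in 2011, the NTCIR-9 meeting in Japan hosted its first textual entailment recognition task, RITE (Recognizing Inference in TExt), which was also the first time a research group released the datasets needed for textual entailment research in Traditional and Simplified Chinese. This shows that Textual Entailment Recognition has drawn wide attention.

This thesis studies which linguistic phenomena are most effective for textual entailment recognition. After analysis and experiments using the linguistic phenomena that Mark Sammons et al. annotated on the RTE-5 test set, we found that using the phenomena in the Negative Entailment Phenomena category as features is quite effective for textual entailment recognition, with accuracy above 90%. Within this category, five phenomena stand out: Disconnected Relation, Exclusive Argument, Exclusive Relation, Missing Argument, and Missing Relation.

We therefore tried two automatic methods, a rule-based method and a machine-learning based method, to extract these five phenomena from text-hypothesis pairs, and tested how well the phenomena extracted by each method perform on textual entailment recognition. Compared with the human-annotated results, a considerable performance gap remains, but the comparison gives us a direction for further work.

Because all earlier discussion and analysis was carried out on English corpora, we manually annotated the dataset released for the NTCIR-9 RITE task. The analysis shows that these phenomena also occur in Chinese corpora and likewise have a decisive influence on textual entailment recognition.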

Parallel Abstract


Research on Textual Entailment (TE) has attracted much attention in recent years. RTE (Recognising Textual Entailment), a series of evaluations focused on developing English textual entailment recognition technologies, had been held 7 times up to 2011. In 2011, the 9th NTCIR Workshop Meeting introduced a textual entailment task called RITE (Recognizing Inference in TExt) into its IR evaluation series for the first time. RITE focuses on textual entailment research in Traditional Chinese, Simplified Chinese, and Japanese, and it distributed the first ground-truth text-hypothesis pair datasets in both Traditional and Simplified Chinese.

In this thesis, we concentrate on which phenomena in text-hypothesis pairs serve as powerful features for the textual entailment problem. After analysis and experiments on the dataset distributed by Mark Sammons et al., which is annotated with linguistic phenomena defined by their research group, we found that the Negative Entailment Phenomena are the most powerful aspect in textual entailment: using them as features achieved an accuracy of more than 90%. Within this aspect, five phenomena stand out: Disconnected Relation, Exclusive Argument, Exclusive Relation, Missing Argument, and Missing Relation.

We then tried to extract these linguistic phenomena from text-hypothesis pairs automatically, using two methods: a rule-based method and a machine-learning method. When the phenomena extracted by these automatic methods are used as features in the TE experiments, the results show a large gap between human-annotated and machine-extracted phenomena, so there is still room for improvement with this kind of feature.

All the above analyses and experiments were made on English data. We also wanted to know whether these phenomena are effective for the TE problem on Chinese data. Following a scheme similar to the one used for English, we annotated the BC-CT text-hypothesis pairs distributed by the NTCIR-9 RITE task with the five phenomena. Experiments on both human-annotated and machine-extracted features show that these negative entailment phenomena still do well in Chinese.
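
To make the idea of using phenomena as features concrete, here is a minimal sketch in Python. It is an illustration only, not the thesis's implementation: the abstract does not say which classifier was used, so the decision rule below (predict non-entailment whenever any of the five negative phenomena is flagged) is an assumption, and the `Pair` class, the function names, and the toy pairs are all hypothetical.

```python
from dataclasses import dataclass

# The five negative entailment phenomena named in the abstract.
NEGATIVE_PHENOMENA = (
    "disconnected_relation",
    "exclusive_argument",
    "exclusive_relation",
    "missing_argument",
    "missing_relation",
)


@dataclass
class Pair:
    """A text-hypothesis pair with its annotated phenomena (hypothetical type)."""
    text: str
    hypothesis: str
    phenomena: frozenset = frozenset()


def to_features(pair):
    """Encode the pair as five binary indicators, one per phenomenon."""
    return [1 if name in pair.phenomena else 0 for name in NEGATIVE_PHENOMENA]


def predict_entailment(pair):
    """Assumed rule (not from the thesis): entailment holds only when
    no negative phenomenon is flagged for the pair."""
    return sum(to_features(pair)) == 0


def accuracy(pairs, gold_labels):
    """Fraction of pairs whose predicted label matches the gold label."""
    correct = sum(predict_entailment(p) == g for p, g in zip(pairs, gold_labels))
    return correct / len(pairs)


# Hypothetical toy data, invented for illustration only.
toy_pairs = [
    Pair("A dog chased a cat in the park.", "A dog chased a cat."),
    Pair("A dog chased a cat.", "A dog chased a cat in Paris.",
         frozenset({"missing_argument"})),
]
toy_gold = [True, False]
```

The same binary vector could instead be passed to a trained classifier. Whether the indicators come from human annotation or from an automatic extractor is exactly the gap the experiments measure.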

