Construction and Application of a Chinese Collocation Resource with Sense Distinction

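As corpora grow ever larger, it becomes necessary to go beyond concordance search and process large corpora automatically to provide further information, such as collocations and word senses. This dissertation constructs a collocation resource with sense distinction for Traditional Chinese and Simplified Chinese, and evaluates the resource by its performance on a natural language processing task. To extract sense-annotated collocations automatically, the Stanford Parser and the SyntaxNet Parser are each used to extract collocation pairs from sense-annotated sentences, and the pairs are then ranked by their logDice scores. For sense annotation of the Traditional Chinese data, a semi-automatic approach is explored: candidate sentences close to the sense being annotated are retrieved from the Academia Sinica Balanced Corpus 4.0. The corpus sentences are first parsed with the Stanford Parser (and with the SyntaxNet Parser), and each sentence is then projected into a semantic vector space according to its dependency parse. Likewise, the example sentences of each sense in the dictionary (Chinese Wordnet) are parsed and projected into the same space. Corpus sentences that lie close to the example sentences of the target sense are extracted first, so that annotators can start with the most likely candidates for that sense. This speeds up annotation and removes the need to search the corpus sentence by sentence for annotatable instances. The Simplified Chinese sense-annotated data come from the SemEval-2007 task and cover 40 words annotated in 2,686 sentences. For comparability with the Simplified Chinese annotation, 17 words in the Traditional Chinese dictionary (Chinese Wordnet) that also occur in the Simplified Chinese data were selected as annotation targets, and 1,646 sentences containing those 17 words were annotated in the Academia Sinica Balanced Corpus. The collocation resource and its sense annotations have been released online (http://lopen.linguistics.ntu.edu.tw/collocation.htm) for users to query. An extrinsic evaluation on word sense disambiguation shows that the collocation data extracted with the SyntaxNet Parser can train a support vector machine classifier that reaches the current best precision for Simplified Chinese word sense disambiguation, P=75.98%, and a precision of P=58.35% for Traditional Chinese, whose sense distinctions are finer-grained. Compared with deep learning models, this study obtains the best current word sense disambiguation performance with a more transparent model and only basic linguistic features, which suggests that a word's collocational behavior almost suffices to determine its sense in a sentence.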
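The candidate-ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how dependency information is turned into vectors, so everything here is an assumption made for illustration: each sentence is represented as a bag of dependency triples `(relation, head, dependent)`, a sense is represented by the summed vector of its dictionary example sentences, and closeness is measured by cosine similarity. The function names and the toy triples are hypothetical and do not come from the dissertation.

```python
import math
from collections import Counter

def to_vector(triples):
    """Bag-of-dependency-triples vector (an assumed representation)."""
    return Counter(triples)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_candidates(sense_examples, corpus_sentences):
    """Rank corpus sentences by similarity to a sense's example sentences.

    sense_examples: list of triple lists, one per dictionary example sentence.
    corpus_sentences: dict mapping a sentence id to its list of triples.
    """
    sense_vec = Counter()
    for triples in sense_examples:
        sense_vec.update(triples)
    scored = [(sid, cosine(sense_vec, to_vector(triples)))
              for sid, triples in corpus_sentences.items()]
    # Most similar sentences first, so annotators see likely instances early.
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy data: example sentences for one sense of 打 ("to phone").
sense_examples = [
    [("dobj", "打", "電話")],
    [("dobj", "打", "電話"), ("nsubj", "打", "他")],
]
corpus_sentences = {
    "s1": [("dobj", "打", "球")],
    "s2": [("dobj", "打", "電話"), ("nsubj", "打", "我")],
    "s3": [("dobj", "打", "電話")],
}
ranked = rank_candidates(sense_examples, corpus_sentences)
ranked_ids = [sid for sid, _ in ranked]
```

In this toy setting the sentences that share the 打 + 電話 object relation with the sense examples are ranked ahead of the unrelated one, which is the behavior an annotator would want from such a pre-sorting step.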

Keywords

collocation; dependency parser; semantic space; sense annotation; word sense disambiguation

Parallel Abstract

As corpora grow ever larger, there is a pressing need to process them automatically and provide information beyond concordance, such as collocation and sense information. In this dissertation, a collocation resource with sense distinction is constructed for Simplified Chinese and Traditional Chinese, and the resource is evaluated on an NLP (Natural Language Processing) task. To automatically extract collocations with sense annotation, the Stanford Parser and the SyntaxNet Parser are each used to extract collocation candidates from sense-annotated sentences. These candidates are then ranked by their logDice scores. For Traditional Chinese sense annotation, a semi-automatic approach is investigated that facilitates annotation by bootstrapping sense instance candidates from the sentences in the Academia Sinica Balanced Corpus 4.0. The sentences in the corpus are first parsed by the Stanford Parser (or, alternatively, by the SyntaxNet Parser), and each sentence is mapped to a vector space according to its dependency parse. Similarly, the example sentences of each sense in the dictionary (Chinese Wordnet, CWN) are parsed and mapped to the same vector space. The corpus sentences are then ranked by their distance in the vector space to the CWN sense being annotated, so that the annotator can begin with the most likely sense instances and does not have to examine the corpus sentence by sentence to find good ones. For Simplified Chinese, the data come from the SemEval-2007 dataset, with 40 word types annotated in 2,686 sentences. For comparability with the Simplified Chinese data, 17 word types in the Traditional Chinese sense inventory (i.e., Chinese Wordnet) that overlap with the SemEval-2007 word types were selected and annotated in 1,646 sentences from the Sinica Corpus.
The proposed collocation resource with sense annotation in Simplified Chinese and Traditional Chinese has been released through a web interface (http://lopen.linguistics.ntu.edu.tw/collocation.htm) for users to query. An extrinsic evaluation on word sense disambiguation (WSD) shows that the collocation data extracted by the SyntaxNet Parser can train an SVM (Support Vector Machine) classifier that achieves state-of-the-art WSD precision of P=75.98% in Simplified Chinese, and P=58.35% on the more fine-grained Traditional Chinese sense inventory (Chinese Wordnet). That a transparent approach using only linguistic features matches the state of the art (compared with deep learning models) implies that the collocational behavior of a word largely determines its sense in a sentence.
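For readers unfamiliar with the ranking score mentioned in the abstract, logDice is commonly defined as $14 + \log_2 \frac{2 f_{xy}}{f_x + f_y}$, where $f_{xy}$ is the co-occurrence frequency of a word and its collocate and $f_x$, $f_y$ are their individual frequencies. The sketch below ranks toy collocation pairs by this score. How the dissertation counts frequencies (for example per sense or per dependency relation) is not stated in the abstract, so the simple pair counts, the function names, and the toy data here are assumptions for illustration only.

```python
import math
from collections import Counter

def log_dice(f_xy, f_x, f_y):
    """logDice score: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

def rank_collocations(pairs):
    """Rank (headword, collocate) pairs by logDice.

    pairs: iterable of (headword, collocate) tuples, e.g. taken from the
    dependency arcs of parsed, sense-annotated sentences (assumed input).
    """
    pairs = list(pairs)
    pair_freq = Counter(pairs)
    head_freq = Counter(head for head, _ in pairs)
    coll_freq = Counter(coll for _, coll in pairs)
    scored = [
        ((head, coll), log_dice(n, head_freq[head], coll_freq[coll]))
        for (head, coll), n in pair_freq.items()
    ]
    # Highest logDice first, i.e. the most strongly associated pairs.
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy pairs; real input would come from parsed corpus sentences.
toy_pairs = (
    [("打", "電話")] * 3
    + [("打", "球")]
    + [("看", "球")] * 2
)
ranked_pairs = rank_collocations(toy_pairs)
top_pair = ranked_pairs[0][0]
```

Because the score depends only on relative frequencies, its theoretical maximum is 14 (when the two words always occur together), which makes scores comparable across corpora of different sizes.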

Parallel Keywords

collocation; dependency parser; semantic space; sense annotation; word sense disambiguation

