Enhancing Neural Keyword Extraction Models with Local Context Information

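Because keywords give a concise summary of a document's content, automatic keyword extraction has been studied extensively over the past two decades. Traditional keyword extraction methods rely heavily on hand-defined features to optimize performance, and extracting effective features is very time-consuming. A recent paper proposed a method that extracts keyphrases with a deep recurrent neural network model, eliminating manual feature extraction. However, the model proposed in that paper has three shortcomings. First, it does not consider that the importance of each word is affected by the other words in the same sentence. Each sentence in a document carries a different degree of importance, and this difference usually comes from certain words in the sentence that carry important meaning, so sentence semantics should be taken into account when extracting keywords. Second, the model does not handle words that never appeared in the training data. About 15 percent of the words in the test documents are absent from the training data, and these unknown words reduce the model's accuracy, so character-level information should be taken into account. Finally, the model uses word embeddings as input, which causes some loss of syntactic information. Yet many studies have confirmed that syntactic information helps in extracting keywords, and it should be used more effectively for this kind of problem. In this thesis, we enhance a neural keyword extraction model with local semantic information composed by two convolutional neural networks and with syntactic information composed of POS embeddings combined with a linear transformation. Finally, we validate our method on two public datasets; the experimental results show that it significantly outperforms state-of-the-art unsupervised and supervised methods.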

Keywords

Keyword extraction; local context information; recurrent neural network; convolutional neural network; word embedding

Parallel Abstract

Keywords provide condensed information about a document, and automatic keyword extraction has therefore attracted the interest of researchers in recent decades. Traditional methods rely largely on handcrafted features to optimize performance, and designing such features is usually time-consuming. A recent work proposes a deep recurrent neural network (RNN) model that extracts keyphrases without manual feature engineering. However, that work has three drawbacks. First, it does not consider that the importance of a word may be influenced by other words in the same sentence. Each sentence has a distinct influence on the document, and this difference usually arises because some sentences contain meaningful words. Hence sentence semantics should be taken into consideration. Second, it does not explicitly handle words that do not appear in the training data. About 15 percent of the words in the testing documents are not seen in the training corpus, and these out-of-vocabulary (OOV) words lower the performance. For this reason, character-level information should be taken into account. Finally, syntactic information is lost to some extent when word embeddings are used as input. Syntactic information has nevertheless been shown to be effective in extracting key terms and ought to be fully utilized for such problems. In this work, we enhance neural keyword extraction with local semantic information and syntactic information, which are composed by two convolutional neural networks and by POS embeddings with a linear transformation, respectively. The experimental results show that our proposed model significantly outperforms both unsupervised and supervised state-of-the-art baselines on two datasets.
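To illustrate why character-level information helps with OOV words, the following is a minimal sketch of a character-level convolution with max-over-time pooling. It is not the model described in the abstract: the dimensions, the random parameters, and the names `char_cnn` and `word_representation` are all assumptions made for illustration, and a trained model would learn these parameters rather than sample them.

```python
import random

CHAR_DIM = 4      # size of each character embedding (assumed)
NUM_FILTERS = 3   # number of convolution filters (assumed)
WIDTH = 3         # number of characters covered by each filter (assumed)

rng = random.Random(0)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# Randomly initialised parameters; a real model would learn these.
CHAR_EMB = {c: [rng.uniform(-1, 1) for _ in range(CHAR_DIM)] for c in ALPHABET}
UNK_CHAR = [0.0] * CHAR_DIM
FILTERS = [[rng.uniform(-1, 1) for _ in range(WIDTH * CHAR_DIM)]
           for _ in range(NUM_FILTERS)]


def char_cnn(word):
    """Return a NUM_FILTERS-dimensional vector for any word, seen or unseen."""
    chars = [CHAR_EMB.get(c, UNK_CHAR) for c in word.lower()]
    # Pad short words so that at least one full window exists.
    while len(chars) < WIDTH:
        chars.append(UNK_CHAR)
    features = []
    for filt in FILTERS:
        best = 0.0  # starting at 0 applies ReLU before max-over-time pooling
        for start in range(len(chars) - WIDTH + 1):
            window = [x for vec in chars[start:start + WIDTH] for x in vec]
            activation = sum(w * x for w, x in zip(filt, window))
            best = max(best, activation)
        features.append(best)
    return features


def word_representation(word, word_emb, emb_dim):
    """Concatenate the word embedding (zeros if OOV) with character features."""
    return word_emb.get(word, [0.0] * emb_dim) + char_cnn(word)
```

In this sketch an unseen word such as "keyphrase" receives an all-zero word embedding, but its character features are still computed from its spelling, so the downstream tagger is not left without any signal for that token.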

Parallel Keywords

Keyword Extraction; Local Context Information; Recurrent Neural Network; Convolutional Neural Network; Word Embedding
