Enhancing Unsupervised and Supervised Keyphrase Extraction Methods

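In recent years, unsupervised graph-based ranking algorithms have been successfully applied to keyphrase extraction. These methods have the advantage of taking global information into account, such as text structure and the relations between words, phrases, and sentences, rather than relying solely on local, vertex-specific information. Graph-based keyphrase extraction methods, however, have a particular drawback that stems from their frequency-based analysis: many common but less relevant terms may receive higher rankings, particularly in short articles. The converse also occurs, with less common (and possibly more relevant) terms receiving lower rankings. We propose an unsupervised method, RankUp, that enhances graph-based keyphrase extraction by applying an error-feedback mechanism similar to the concept of backpropagation. The method has been evaluated on almost 3,300 short texts from a variety of domains. Our experiments show that error-feedback propagation can improve the quality of the keyphrases produced by graph-based keyphrase extraction techniques.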

Keywords

unsupervised methods; supervised methods; keyphrase extraction

Parallel Abstract

Traditionally, keyphrases (or keywords) have been manually assigned to documents by their authors or by human indexers. This has become impractical given the massive daily growth in the number of documents on the Internet, creating a need for systems that automatically extract keyphrases from documents. Automatic keyphrase extraction methods have generally taken either supervised or unsupervised approaches. In recent years, unsupervised, graph-based ranking algorithms have been successfully applied to keyphrase extraction tasks. These methods have the advantage of taking into account global information, such as text structure and relations between words, phrases, and sentences, rather than relying solely on local, vertex-specific information. Graph-based approaches to keyphrase extraction, however, have a particular drawback that stems from their frequency-based analysis methods: many common, less relevant terms may receive higher rankings, particularly in short articles. The converse also occurs, where less common (and possibly more relevant) terms receive lower rankings. First, we propose an unsupervised method, RankUp, that enhances graph-based keyphrase extraction approaches by applying an error-feedback mechanism similar to the concept of backpropagation. Experiments were performed on almost 3,300 short texts from a variety of domains, and they show that error-feedback propagation can improve the quality of keyphrases in graph-based keyphrase extraction techniques. Second, we present HybridRank, a hybrid keyphrase extraction method for short articles that leverages the benefits of both supervised and unsupervised approaches. Our system implements modified versions of the TextRank (unsupervised) and KEA (supervised) methods and applies a merging algorithm to produce an overall list of keyphrases. We have tested HybridRank on more than 900 abstracts covering a wide variety of subjects, including engineering, science, physics, and IT, and show its superior effectiveness. We observe that knowledge collaboration between supervised and unsupervised methods can produce higher-quality keyphrases than applying either method individually.
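The frequency bias of graph-based ranking described above can be illustrated with a minimal sketch of TextRank-style word scoring. This is not RankUp or HybridRank, and it is not the modified TextRank used in this work. The function names, the window size, the damping factor, and the iteration count are all assumptions chosen for illustration.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Build an undirected graph linking tokens that occur within
    `window` positions of each other (window=2 means adjacent tokens).
    The window size is an illustrative assumption."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def rank_words(graph, damping=0.85, iterations=50):
    """Score vertices with a PageRank-style update, as in TextRank.
    Each vertex shares its score equally among its neighbours."""
    scores = {word: 1.0 for word in graph}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(graph[n]) for n in graph[word])
            for word in graph
        }
    return scores


if __name__ == "__main__":
    # A toy token sequence (hypothetical input, already tokenised).
    tokens = "graph ranking graph methods graph keyphrase extraction".split()
    scores = rank_words(cooccurrence_graph(tokens))
    for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{word}\t{score:.3f}")
```

In this toy input the most frequent token, "graph", acquires the most neighbours and therefore the highest score, regardless of how informative it is. This is the kind of behaviour that an error-feedback correction is meant to counteract.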