A Study of Abstractive Summarization Based on a Hybrid Word-Character Approach

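Automatic abstractive text summarization is an important and challenging research topic in natural language processing. Among the many widely used languages, Chinese has a distinctive linguistic property: a single Chinese character carries information comparable in richness to a word. Existing Chinese text summarization methods adopt either a purely character-based or a purely word-based representation, and so fail to fully exploit the information carried by both. To accurately capture the essence of an article, we propose a hybrid word-character approach (HWC) that preserves the advantages of both the character-based and the word-based representations. We evaluate the HWC approach by applying it to two existing architectures and find that, on the widely used LCSTS dataset, it surpasses the current state-of-the-art methods by 24 ROUGE points. In addition, we identify an issue in the LCSTS dataset and provide a script that removes overlapping pairs (a summary and a short text), creating a clean dataset for the community. The proposed HWC approach also achieves the best results on the new, clean LCSTS dataset.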

Keywords

Abstractive Summarization; Neural Networks; Natural Language Processing; Encoder-Decoder Framework

Parallel Abstract

Automatic abstractive text summarization is an important and challenging research topic in natural language processing. Among many widely used languages, Chinese has the special property that a single Chinese character contains rich information comparable to a word. Existing Chinese text summarization methods adopt either purely character-based or purely word-based representations and therefore fail to fully exploit the information carried by both. To accurately capture the essence of articles, we propose a hybrid word-character approach (HWC) that preserves the advantages of both word-based and character-based representations. We evaluate the proposed HWC approach by applying it to two existing methods and find that it outperforms the previous state of the art by a margin of 24 ROUGE points on the widely used LCSTS dataset. In addition, we identify an issue in the LCSTS dataset and offer a script that removes overlapping pairs (a summary and a short text) to create a clean dataset for the community. The proposed HWC approach also achieves the best performance on the new, clean LCSTS dataset.
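
The abstract mentions a script that removes overlapping pairs from LCSTS but does not describe how it works. The following is only a minimal sketch of one way such a cleanup could be done, not the authors' script. It assumes that "overlapping" means a (short text, summary) pair in the training split that also appears in a held-out split, compared after stripping whitespace. The names `normalize` and `remove_overlapping_pairs` are hypothetical.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (short_text, summary); this layout is an assumption


def normalize(text: str) -> str:
    """Strip all whitespace so trivially different copies compare equal."""
    return "".join(text.split())


def remove_overlapping_pairs(train: Iterable[Pair],
                             held_out: Iterable[Pair]) -> List[Pair]:
    """Return the training pairs that do not occur in the held-out split.

    Assumption: two pairs overlap when both the short text and the summary
    match after whitespace normalization. Repeated pairs inside the
    training split are dropped as well, keeping the first occurrence.
    """
    seen = {(normalize(t), normalize(s)) for t, s in held_out}
    clean: List[Pair] = []
    for text, summary in train:
        key = (normalize(text), normalize(summary))
        if key in seen:
            continue  # already in held-out data or seen earlier in train
        seen.add(key)
        clean.append((text, summary))
    return clean


if __name__ == "__main__":
    train_pairs = [("a b c", "x"), ("d e f", "y"), ("abc", "x")]
    test_pairs = [("d e f", "y")]
    print(remove_overlapping_pairs(train_pairs, test_pairs))
```

Exact matching is the strictest possible definition of overlap. A real cleanup might also need to catch near-duplicates, which this sketch does not attempt.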

Parallel Keywords

Abstractive Summarization; Neural Networks; Natural Language Processing; Encoder-Decoder Framework

