
以生成對抗網路達成非監督式文章摘要及主題模型

Unsupervised Text Summarization and Topic Modeling using Generative Adversarial Networks

Advisor: Lin-shan Lee

Abstract


With the rise of the Internet, people leave all kinds of data online. Because most of this data is unannotated, unsupervised learning, which trains on unlabeled data, has become an important research topic in recent years. In this thesis, we use generative adversarial networks (GANs) to explore the possibilities of unsupervised learning in natural language processing, focusing on two topics.

The first topic is unsupervised abstractive text summarization: training a machine to write abstractive summaries without parallel pairs of training documents and their human-written summaries. Here we use the summary as the latent representation of a document auto-encoder, and use a GAN to constrain this latent representation to be human-readable. Providing the discriminator with a relatively small number of human-written summaries of unrelated documents as examples is enough for the machine to learn how humans write summaries. We evaluate the proposed model on English and Chinese news summarization datasets, and its performance verifies the feasibility of this approach.

The second topic is unsupervised topic modeling, in which the machine automatically discovers topics in documents that are close to human cognition. We use an information-maximizing GAN (InfoGAN) to model document generation as driven by a discrete topic distribution together with a continuous vector controlling variation within each topic, rather than, as in previous topic models, by a mixture of many fine-grained sub-topics. Experiments show that our model significantly outperforms previous results both on document classification and on the quality of the keywords extracted for each topic.

Parallel Abstract (English)


With the development of the Internet, people put a wide variety of data on the Internet. As most of this data is unannotated, how to efficiently utilize unlabeled data for unsupervised learning has become an important research direction. In this thesis, we use Generative Adversarial Networks (GANs) to explore the possibility of unsupervised learning on NLP, covering two different topics. The first topic is unsupervised abstractive text summarization, that is, text summarization without any paired data. We use summaries as latent representations of an auto-encoder and use a GAN to constrain the latent representation to be human-readable. With only a small number of summaries as examples for the discriminator, the machine can learn how humans write summaries for documents. The results on English and Chinese news datasets demonstrate the effectiveness of our model. The second topic is unsupervised topic modeling. The goal of this part is to train a machine that can automatically discover latent topics close to human cognition. Unlike prior topic models, which model text as generated from a mixture of sub-topics, we utilize InfoGAN to model text as generated from a discrete code controlling high-level topics and a continuous vector controlling variance within the topics. Compared to prior works, our proposed method greatly improves performance on unsupervised classification and topical word extraction.
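The InfoGAN-style generation described above factors the generator's latent input into a discrete topic code and a continuous variation vector. As a minimal sketch of that latent construction (the dimensions `n_topics` and `noise_dim` are illustrative assumptions, not values from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics = 10   # hypothetical number of discrete high-level topics
noise_dim = 32  # hypothetical size of the continuous variation vector

def sample_latent(rng, n_topics, noise_dim):
    """Sample one InfoGAN-style latent input: a one-hot discrete
    topic code concatenated with a continuous noise vector."""
    topic = int(rng.integers(n_topics))      # choose a high-level topic
    c = np.zeros(n_topics)
    c[topic] = 1.0                           # one-hot discrete code
    z = rng.normal(size=noise_dim)           # within-topic variation
    return topic, np.concatenate([c, z])     # full generator input

topic, latent = sample_latent(rng, n_topics, noise_dim)
print(topic, latent.shape)  # latent has n_topics + noise_dim entries
```

The discriminator-side auxiliary network would then be trained to recover the discrete code `c` from generated text, which is what ties each code value to a coherent topic.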
