
基於類神經之關聯詞向量表示於文本分類任務之研究

Neural Relevance-aware Embedding For Text Categorization

Advisor: 陳柏琳

Abstract


With the rapid growth of information networks, people's demand for accessing text data on the Internet increases day by day, and text categorization has therefore become a popular research topic in natural language processing. Currently, the most central problem in text categorization is the choice of feature representation. Most studies use the bag-of-words (BoW) model as the document representation, but the BoW model cannot effectively capture the relationships between words and consequently loses the semantics of the text. In this thesis, we adopt two novel neural network architectures, Siamese nets and generative adversarial nets, so that the model can learn more robust and semantically richer feature representations during training. The experiments are conducted on well-known classification benchmarks, namely the IMDB movie review dataset and the 20Newsgroups newsgroup dataset. A series of sentiment analysis and topic classification experiments shows that the feature representations learned by these neural networks can effectively improve text categorization performance.
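As a quick illustration of the word-order problem noted above, the following minimal sketch (using scikit-learn's CountVectorizer; the two example sentences are hypothetical and not taken from the thesis) shows how two documents with different meanings collapse to the same bag-of-words vector:

from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical reviews with opposite sentiment but the same multiset of words.
docs = [
    "the plot is good but the acting is not",
    "the acting is good but the plot is not",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())
print(bow[0])
print(bow[1])
# Both rows are identical: the BoW model discards word order,
# so the two documents become indistinguishable to a downstream classifier.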

Parallel Abstract (English)


With the rapid global access to tremendous amounts of text data on the Internet, text categorization (or classification) has emerged as an important and active research topic in the natural language processing (NLP) community, with many applications. Currently, the foremost problem in text categorization is feature representation, which is commonly based on the bag-of-words (BoW) model, where word unigrams, bigrams (n-grams), or some specifically designed patterns are typically extracted as the component features. It has been noted that the loss of word order caused by BoW representations is particularly problematic for document categorization. In order to leverage word order and proximity information in text categorization tasks, we explore a novel use of Siamese nets and generative adversarial nets for document representation and text categorization. In experiments conducted on two benchmark text categorization tasks, viz. IMDB and 20Newsgroups, we take advantage of these novel architectures to learn distributed vector representations of documents that reflect semantic relatedness.
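The record does not include any of the thesis's code; purely as a hypothetical sketch of the Siamese-style idea described above (a shared document encoder trained with a contrastive distance objective; all layer sizes, names, and the toy data below are assumptions rather than the authors' actual architecture), one could write something like the following in PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DocEncoder(nn.Module):
    """Shared encoder: averages word embeddings and projects to a document vector."""
    def __init__(self, vocab_size, embed_dim=100, doc_dim=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.proj = nn.Linear(embed_dim, doc_dim)

    def forward(self, token_ids, offsets):
        return torch.tanh(self.proj(self.embed(token_ids, offsets)))

def contrastive_loss(vec_a, vec_b, same_label, margin=1.0):
    """Pull same-class document pairs together, push different-class pairs apart."""
    dist = F.pairwise_distance(vec_a, vec_b)
    pos = same_label * dist.pow(2)
    neg = (1.0 - same_label) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()

# Toy usage with random token ids; real inputs would be IMDB / 20Newsgroups documents.
# The same encoder instance is applied to both sides, which is the Siamese weight sharing.
encoder = DocEncoder(vocab_size=5000)
tokens_a = torch.randint(0, 5000, (20,)); offsets_a = torch.tensor([0, 12])
tokens_b = torch.randint(0, 5000, (18,)); offsets_b = torch.tensor([0, 9])
same = torch.tensor([1.0, 0.0])  # 1 = same class, 0 = different class

loss = contrastive_loss(encoder(tokens_a, offsets_a), encoder(tokens_b, offsets_b), same)
loss.backward()

Under this kind of objective, document pairs from the same class are pulled together and pairs from different classes are pushed apart, so the learned document vectors can then be fed to any standard classifier.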

