
Applications of Doc2vec in Natural Language Processing

Advisor: 陳開煇

Abstract

With advances in computer hardware and the reach of the internet, most information can now be found by searching online, and downloading text data from the web has become increasingly common. Analyzing that text requires tools that help a computer make sense of the available data. Tools such as tf-idf, Word2vec, and Doc2vec were created to fill this role: they transform words into numbers and vectors that a computer can process. This thesis focuses on applications of Doc2vec.

Chapter 2 introduces natural language processing (NLP), the preprocessing of Chinese and English text data, and the concepts of tf-idf, one-hot encoding, and Word2vec. The two languages require different preprocessing because of the nature of each language. Chapter 3 introduces Doc2vec and shows how to implement it with the open-source package gensim. Chapter 4 gives a brief introduction to deep learning and then uses feedforward neural networks to implement classifiers. With these classifiers, we perform text classification on both English and Chinese text data and analyze the results. The final chapter discusses possible implementation approaches and directions for future work.
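To make the idea of turning words into numbers and vectors concrete, the sketch below builds one-hot vectors and tf-idf vectors in plain Python. It is an illustration only and is not taken from the thesis: the function names `one_hot` and `tf_idf`, the toy documents, and the unsmoothed weighting tf × log(N/df) are all assumptions, and libraries typically apply smoothed variants of this formula.

```python
import math
from collections import Counter


def one_hot(vocab, word):
    """Return a vector with a 1 at the position of `word` in `vocab` (hypothetical helper)."""
    return [1 if w == word else 0 for w in vocab]


def tf_idf(docs):
    """Compute unsmoothed tf-idf vectors for a list of non-empty token lists.

    tf is the term count divided by the document length;
    idf is log(N / df), where df is the number of documents containing the term.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: count each token once per document.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            [(counts[w] / len(doc)) * math.log(n_docs / df[w]) for w in vocab]
        )
    return vocab, vectors


# Toy, already-tokenized documents (assumed for illustration).
docs = [
    ["doc2vec", "embeds", "documents"],
    ["word2vec", "embeds", "words"],
]
vocab, vectors = tf_idf(docs)
print(vocab)
print(one_hot(vocab, "embeds"))
print(vectors[0])
```

A term that occurs in every document, such as "embeds" here, receives a weight of zero under this formula, while a term unique to one document receives a positive weight only in that document.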

