Design and Implementation for a Document Similarity Recognition Based on Deep Learning

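With the rapid development of the Internet, the content one is looking for can easily be found online. This has made people more dependent on the Internet for the information they need, and it has also given rise to the problem of plagiarism. In recent years, news reports of plagiarism in Master's theses have been common. To prevent such incidents, a text similarity comparison system is needed to confirm whether a newly written thesis is overly similar to existing theses. Existing text similarity comparison systems can report the similar parts to users so that they can revise them, but the comparison process takes a large amount of time. This study, "Design and Implementation for a document similarity recognition based on Deep Learning", proposes using a bag-of-words approach to extract important keywords as the article vector of each article. The article vectors are used to find similar keywords (terms). For those similar terms, the sentences containing them are compared; if the sentences are also similar, the paragraphs containing the similar sentences are compared; finally, articles with particularly similar paragraphs are compared word by word. Comparing article similarity bottom-up in this way saves a large amount of comparison time, and the keywords (terms), sentences, and paragraphs become the features of the article. In addition, because the comparison uses the similarity of word vectors, sentence vectors, and paragraph vectors, near-synonyms and semantically similar sentences can also be checked for plagiarism, which makes the approach more flexible.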
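The keyword extraction step can be illustrated with a minimal sketch. It assumes plain TF-IDF scoring over pre-tokenized documents; the function name `tfidf_keywords`, the `top_k` parameter, and the sample documents are hypothetical and are not taken from the thesis, whose actual extraction procedure may differ.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each tokenized document.

    Hypothetical helper: assumes every document is a non-empty list of tokens.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by score (descending), then alphabetically so ties are deterministic.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords


# Illustrative documents, already split into tokens.
docs = [
    "deep learning improves text similarity detection".split(),
    "text similarity detection catches plagiarism in a thesis".split(),
    "cooking pasta requires boiling water".split(),
]
keywords = tfidf_keywords(docs, top_k=2)
```

Terms that appear in several documents, such as "text" and "similarity" here, receive a lower score than terms unique to one document, so the selected keywords tend to characterize the individual article.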

Keywords

Text similarity; Deep learning; Keyword extraction; Word2vec; TF-IDF

Parallel Abstract

With the rapid growth of Internet technology, finding information has become fast and easy. This makes people rely on Internet search to obtain information, which in turn gives rise to the problem of plagiarism. In recent years, the news has frequently reported cases of plagiarism in Master's theses. To prevent such incidents, a document similarity comparison system is needed to confirm whether articles are overly similar. Existing document similarity comparison systems can show users which parts are similar so that they can revise them accordingly; however, the comparison process takes a lot of time. This study, “Design and Implementation for a document similarity recognition based on Deep Learning”, proposes extracting important keywords with a Bag of Words approach and using them as the article vector of each article. The article vectors are used to find similar keywords, and the sentences containing those keywords are then compared. If the sentences are also similar, the paragraphs containing the similar sentences are compared. Finally, articles with particularly similar paragraphs are compared. In this way, the similarity of articles is determined bottom-up. This method saves a lot of comparison time, and the keywords, sentences, and paragraphs become the features of the article. In addition, because the comparison uses keyword vectors, sentence vectors, and paragraph vectors, we can check whether similar words and sentences have been copied, which makes the approach more flexible.
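The staged comparison described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: raw term-count vectors with cosine similarity stand in for the Word2vec-based word, sentence, and paragraph vectors the study uses; the names `cosine` and `cascade_compare` and the thresholds `sent_thr` and `para_thr` are hypothetical; and the final full comparison of flagged articles is omitted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def cascade_compare(doc_a, doc_b, keywords_a, keywords_b, sent_thr=0.7, para_thr=0.5):
    """Return (i, j) index pairs of paragraphs flagged as similar.

    A document is a list of paragraphs, a paragraph is a list of sentences,
    and a sentence is a list of tokens. Thresholds are illustrative values.
    """
    # Stage 1: documents with no keyword in common skip all finer comparison.
    shared = set(keywords_a) & set(keywords_b)
    if not shared:
        return []
    flagged = []
    for i, para_a in enumerate(doc_a):
        for j, para_b in enumerate(doc_b):
            # Stage 2: compare only sentences that contain a shared keyword.
            cand_a = [s for s in para_a if shared & set(s)]
            cand_b = [s for s in para_b if shared & set(s)]
            if not any(cosine(sa, sb) >= sent_thr for sa in cand_a for sb in cand_b):
                continue
            # Stage 3: compare the whole paragraphs that hold similar sentences.
            flat_a = [t for s in para_a for t in s]
            flat_b = [t for s in para_b for t in s]
            if cosine(flat_a, flat_b) >= para_thr:
                flagged.append((i, j))
    return flagged


# Illustrative input: one paragraph in doc_a, two paragraphs in doc_b.
doc_a = [[
    "the model detects plagiarism in a thesis".split(),
    "results are shown below".split(),
]]
doc_b = [
    ["the model detects plagiarism in a paper".split()],
    ["pasta needs boiling water".split()],
]
flagged = cascade_compare(doc_a, doc_b, ["plagiarism", "thesis"], ["plagiarism", "paper"])
no_overlap = cascade_compare(doc_a, doc_b, ["thesis"], ["paper"])
```

The time saving comes from the early exits: most document pairs are rejected at the keyword stage, and most paragraph pairs at the sentence stage, so the expensive comparisons run only on a small set of candidates.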

Parallel Keywords

Word2vec; TF-IDF; Document similarity; Deep Learning; Keyword Extraction

