A Similar Document Detection System Based on Search Engines and Text Mining

林崇德

doi:10.6837/ncnu202100263

Thesis

Abstract

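The national project to archive master's and doctoral theses began in academic year 83 (1994). As of academic year 109, the National Digital Library of Theses and Dissertations in Taiwan (NDLTDT) has accumulated 1,241,363 theses, of which 561,731 are licensed for full-text access. Over these 26 years, an average of 42,804 theses have been produced each year. Every year, during the graduation period of just one or two months in which theses are uploaded, tens of thousands of new theses must be compared against more than a million existing ones, and effectively singling out those suspected of plagiarism is a major challenge. This thesis automatically crawls and extracts the large NDLTDT database and designs and develops SDDS (Similar Document Detection System), which uses search engine and text mining techniques to analyze all thesis abstracts in NDLTDT and compute their pairwise similarities, as an aid in checking whether a thesis is suspected of plagiarism. Experimental results show that theses whose abstract similarity exceeds 50% account for 23% of the total; after inspecting the full text of some of these theses, we identified theses suspected of plagiarism. Based on the association between text-mined keywords and research fields, we also propose a method for analyzing keyword importance, which effectively improves the recall rate of SDDS in detecting plagiarism.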

Keywords

Text mining; search engine; keyword extraction; similarity detection

Parallel Abstract

The National Digital Library of Theses and Dissertations in Taiwan (NDLTDT) was established in 1994. To date it holds 1,241,363 theses, including 561,731 with full text. On average, 42,804 theses are published per year in Taiwan. It is a major challenge to compare tens of thousands of new theses against more than a million published ones and to detect highly similar theses within the one or two months of the submission period, usually June and July of each year. In this thesis, we crawl the NDLTDT databases and design and implement the Similar Document Detection System (SDDS), based on search engine and text mining methods. SDDS builds a full-text index over all NDLTDT abstracts and estimates pairwise similarities among all theses based on string and keyword matching. Given an abstract as a text string, SDDS can rapidly detect similar theses within ??? seconds. We therefore conduct large-scale experiments to detect highly similar thesis abstracts. The results show that 23% of abstracts have a similarity above 50%. We carefully reviewed some of these abstracts and observed that some theses can be regarded as plagiarism. Consequently, SDDS is useful for rapidly detecting similar abstracts before students submit their theses to NDLTDT.
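
The abstract does not say which similarity measure SDDS uses beyond "string and keyword matching". As a rough illustration of the general idea only, the sketch below assumes Jaccard similarity over character bigrams (which also works for Chinese text without word segmentation) and uses a small inverted index so that only abstracts sharing at least one bigram are compared. The function names, the bigram choice, and the 0.5 threshold parameter are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from itertools import combinations


def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of character n-grams of text, ignoring whitespace."""
    compact = "".join(text.split())
    if len(compact) < n:
        return {compact} if compact else set()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(abstracts: dict[str, str], threshold: float = 0.5, n: int = 2):
    """Return (id_a, id_b, score) for pairs whose similarity exceeds threshold.

    Assumption: this is a toy stand-in for a search-engine index, not the
    method used by SDDS.
    """
    grams = {doc_id: char_ngrams(text, n) for doc_id, text in abstracts.items()}

    # Inverted index: n-gram -> ids of the abstracts that contain it.
    index = defaultdict(set)
    for doc_id, doc_grams in grams.items():
        for gram in doc_grams:
            index[gram].add(doc_id)

    # Only pairs that share at least one n-gram can have a nonzero score.
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))

    results = []
    for a, b in candidates:
        score = jaccard(grams[a], grams[b])
        if score > threshold:
            results.append((a, b, score))
    # Highest similarity first.
    return sorted(results, key=lambda r: -r[2])


if __name__ == "__main__":
    demo = {
        "t1": "text mining for plagiarism detection",
        "t2": "text mining for plagiarism detection",
        "t3": "quantum chromodynamics lattice",
    }
    for a, b, score in similar_pairs(demo):
        print(a, b, round(score, 2))
```

At the scale described above (over a million abstracts), enumerating candidate pairs this way would be far too slow, because common bigrams appear in almost every abstract; a real system would rely on a search engine's ranked retrieval to limit the candidates per query.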