From the plagiarism of large software code bases between enterprises to the copying of programming assignments among students, detecting code plagiarism has long been an important problem. Detection methods fall roughly into two categories: textual analysis and structural analysis. Most textual analysis methods use a single algorithm to extract substrings from the source code, estimate the similarity between each pair of programs, and then judge the likelihood of plagiarism from that similarity. Structural analysis methods mainly record the syntactic structure of a program as a tree and estimate program similarity by mining the parts that two trees have in common. Every algorithm has its own strengths and weaknesses, so judging plagiarism with a single method is not comprehensive enough. This thesis therefore proposes an approach that combines methods from both categories in order to detect code plagiarism from several perspectives. To verify its feasibility, our experiments use source code from real student assignments and evaluate accuracy against a manually confirmed list of plagiarism cases. Compared with existing tools, our approach performs better on each of the accuracy measures.
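The idea of combining a textual score with a structural score can be illustrated with a minimal sketch. The abstract does not name the specific algorithms used in the thesis, so everything below is an assumption made for illustration: character k-grams with Jaccard similarity stand in for the textual analysis, parent-child node-type pairs from Python's `ast` module stand in for the tree comparison, and the function names and the equal `weight` of 0.5 are hypothetical.

```python
import ast


def kgrams(text, k=5):
    """Character k-grams of the source with all whitespace removed."""
    s = "".join(text.split())
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def text_similarity(src_a, src_b, k=5):
    """Textual score (assumed stand-in): overlap of character k-grams."""
    return jaccard(kgrams(src_a, k), kgrams(src_b, k))


def node_type_pairs(source):
    """Set of (parent type, child type) pairs found in the syntax tree."""
    pairs = set()
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            pairs.add((type(parent).__name__, type(child).__name__))
    return pairs


def structure_similarity(src_a, src_b):
    """Structural score (assumed stand-in): overlap of node-type pairs.

    Identifier names are ignored, so renaming variables does not change it.
    """
    return jaccard(node_type_pairs(src_a), node_type_pairs(src_b))


def combined_similarity(src_a, src_b, weight=0.5):
    """Weighted average of both scores; the weight is a hypothetical choice."""
    return (weight * text_similarity(src_a, src_b)
            + (1 - weight) * structure_similarity(src_a, src_b))
```

In this sketch, a copy whose variables have been renamed loses textual overlap but keeps an identical structural score, which is the kind of case where a single textual measure alone would understate the similarity.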