Previous studies have shown that detecting duplicate bug reports is an important problem in software maintenance. On the one hand, triaging duplicate bug reports consumes a large amount of human effort. On the other hand, duplicate reports often contain abundant debugging information that, once consolidated, can help developers with debugging and testing. Among existing duplicate-detection methods, those that rely only on text mining techniques do not achieve good detection performance. Adding software execution information improves performance substantially, but it raises user-privacy concerns. In this thesis, we propose a new method that uses n-gram features and a cluster shrinkage technique to improve duplicate bug report detection. Empirical studies on four open-source projects, Apache, ArgoUML, SVN, and Eclipse, show that the proposed method effectively improves detection performance.
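To make the two ingredients named above concrete, the following is a minimal sketch and not the implementation evaluated in this thesis. It assumes character-level n-grams, cosine similarity between reports, and a linear shrinkage that moves each report vector toward the centroid of its known duplicate group. The function names and the shrinkage weight `lam` are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and
    collapsing whitespace (assumed preprocessing, for illustration)."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of the sparse vectors in one duplicate group."""
    total = Counter()
    for v in vectors:
        total.update(v)  # adds counts component-wise
    return {g: w / len(vectors) for g, w in total.items()}


def shrink(vec, center, lam=0.5):
    """Move vec toward center: (1 - lam) * vec + lam * center.
    lam = 0 leaves vec unchanged; lam = 1 collapses it onto center."""
    keys = set(vec) | set(center)
    return {g: (1 - lam) * vec.get(g, 0.0) + lam * center.get(g, 0.0)
            for g in keys}


# Hypothetical usage: two reports already known to be duplicates.
group = [char_ngrams("Crash when opening project file"),
         char_ngrams("Application crashes on opening a project")]
center = centroid(group)
shrunk = [shrink(v, center, lam=0.5) for v in group]

# A new report is compared against the shrunk group members.
query = char_ngrams("crash while opening project")
scores = [cosine(query, v) for v in shrunk]
```

In this sketch, shrinkage pulls the members of a duplicate group closer together in feature space, so a new report is compared against a tighter representation of the group. How the weight is chosen and how groups are formed are left open here.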