

The primary study of applying machine learning in data matching problem

Advisor: 項衛中


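Merging heterogeneous databases often raises data integration problems: the databases may contain duplicate records, and duplicates introduce errors into the merged data. Identifying duplicates purely by manual inspection is time-consuming and laborious. A common solution is to use the SQL JOIN command to find identical records, but when data formats and writing conventions differ, JOIN cannot effectively find the duplicates between two databases.

This thesis examines the data matching problem in data integration. To find duplicate data across heterogeneous databases, the approach first computes similarity values and then uses machine learning to classify the matching results. Similarity is computed by first calculating term weights and then the similarity values. At that point records could be judged as matching or not by hand based on the similarity values, but a large volume of data would require substantial manual effort, so machine learning classification is adopted to reduce manual processing time while maintaining a certain level of accuracy. The thesis uses machine learning to build a decision tree and a neural network, applies both to data matching to find duplicates, and compares their matching quality.

The proposed approach is validated experimentally on two examples with different characteristics and schemas. The matching quality of the decision tree and the neural network is measured with four classification quality indices: Recall, Precision, F-measure (the harmonic mean of Recall and Precision), and Accuracy. The experiments show that the two methods differ by no more than 0.01 in both F-measure and Accuracy, and that Accuracy is above 0.97 for both. These two machine learning methods, combined with similarity computation, can therefore help enterprises resolve the duplicate-data problem encountered in database integration.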


The integration of heterogeneous databases has been studied for many years and involves many different problem definitions and applications. This research focuses on finding identical data instances across heterogeneous databases with inconsistent text formats, a problem that falls under data matching in data integration. Manual data matching is labor-intensive and time-consuming for large datasets. Querying for duplicates with the SQL JOIN command is a simple solution, but it is not effective at finding duplicates that are written in different text formats. To find duplicate data across heterogeneous databases, this thesis proposes first calculating similarity values between data instances and then using machine learning methods to find mappings between them. A set of attribute similarity values between data instances is calculated based on term frequency, inverse document frequency, and the vector space model. To reduce human involvement in data matching, machine learning techniques are applied to construct a decision tree and a neural network for mapping data instances. Two experiments with data in different schemas and domains were conducted to verify the proposed methods. The quality of the matching results for the decision tree and the neural network was quantified with Recall, Precision, F-measure, and Accuracy. The experiments showed that the matching performance of the decision tree and the neural network was very close: the difference was less than 0.01 in F-measure and Accuracy. Moreover, the Accuracy of both methods was above 0.97. The experiments indicate that the proposed method is suitable for finding identical data instances across heterogeneous databases during database integration.
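
As an illustration of the similarity step described above, the following Python sketch weights terms by term frequency and inverse document frequency, compares two records by the cosine of their term vectors (the usual vector space model similarity), and computes the four quality measures from predicted and true match labels. It is a minimal sketch under assumptions, not the thesis's implementation: the whitespace tokenizer, the smoothed idf formula, the sample records, and the function names `tfidf_vectors`, `cosine_similarity`, and `matching_quality` are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, with weight = tf * idf."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed idf (an assumption of this sketch) keeps every weight positive.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matching_quality(predicted, actual):
    """Recall, Precision, F-measure and Accuracy for boolean match labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"recall": recall, "precision": precision,
            "f_measure": f_measure, "accuracy": accuracy}


if __name__ == "__main__":
    # Hypothetical records: the first two describe the same company in different formats.
    records = [
        "Acme Corp. 12 Main Street",
        "ACME Corporation 12 Main St",
        "Globex Inc. 99 Ocean Avenue",
    ]
    vectors = tfidf_vectors(records)
    print("record 0 vs 1:", cosine_similarity(vectors[0], vectors[1]))
    print("record 0 vs 2:", cosine_similarity(vectors[0], vectors[2]))
```

In a pipeline like the one the abstract describes, one such similarity value per attribute would presumably serve as an input feature to the decision tree or neural network, which then labels each pair of records as a match or a non-match.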




