Most Possible Partition: Utilizing Semantic Links for Duplicate Detection

Duplicate detection is an active research topic in heterogeneous data integration and information retrieval, where the goals are efficient and precise detection. In this paper, we introduce a duplicate detection method based on semantic links among data and propose a novel approach, named Most Possible Partition (MPP), to detect duplicates efficiently. The main principle of MPP is to partition the data into most-possible-duplicate parts, that is, parts in which duplicates are more likely to occur. Unlike the classical Sorted Neighborhood Method (SNM), MPP does not sort the data into a particular order. We give an effective partition method that uses semantic links among entities. Experiments on publication datasets show that the proposed method is efficient and that MPP outperforms SNM in both performance and accuracy.

Keywords

Semantic links; Partition; Duplicate detection
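
The abstract does not specify how the partition is built, so the following is only an illustrative sketch of the general idea of link-based partitioning, not the MPP algorithm itself. It assumes, hypothetically, that the semantic link between two publication records is a shared author name, that partitions may overlap, and that title similarity is measured with Python's `difflib.SequenceMatcher`. The record layout, the function names `partition_by_links` and `find_duplicates`, the 0.9 threshold, and the sample data are all invented for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def partition_by_links(records):
    """Group record ids that share a linked entity (assumed here: an author name).

    Partitions may overlap; authors linked to a single record yield no partition.
    """
    by_author = defaultdict(set)
    for rec_id, rec in records.items():
        for author in rec["authors"]:
            by_author[author].add(rec_id)
    return [ids for ids in by_author.values() if len(ids) > 1]


def find_duplicates(records, threshold=0.9):
    """Compare titles only inside each partition and return candidate pairs."""
    pairs = set()
    for part in partition_by_links(records):
        for a, b in combinations(sorted(part), 2):
            title_a = records[a]["title"].lower()
            title_b = records[b]["title"].lower()
            if SequenceMatcher(None, title_a, title_b).ratio() >= threshold:
                pairs.add((a, b))
    return sorted(pairs)


# Hypothetical sample data, not taken from the paper's experiments.
records = {
    1: {"title": "Duplicate detection with semantic links",
        "authors": ["A. Author", "B. Writer"]},
    2: {"title": "Duplicate detection with semantic link",
        "authors": ["A. Author"]},
    3: {"title": "Query optimization in stream engines",
        "authors": ["B. Writer"]},
    4: {"title": "Duplicate detection with semantic links",
        "authors": ["C. Other"]},
}

print(find_duplicates(records))  # [(1, 2)]
```

Records 1 and 2 are compared because they share an author, while record 4 is never compared with record 1 despite having an identical title, since no link connects them. This shows the trade-off in any partition-based scheme: fewer comparisons, at the cost of missing duplicates that fall into different parts.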
