A Component Histogram Map Based Text Similarity Detection Algorithm

Abstract


Conventional text similarity detection usually uses word frequency vectors to represent texts, but these vectors are high-dimensional and sparse. This research therefore proposes a new text similarity detection algorithm based on the component histogram map (CHM-TSD). The method relies on the mathematical expression of Chinese characters, with which each character can be split into components. The occurrence frequency of each component in a text is then counted to build the component histogram map (CHM), which serves as the text's characteristic vector. Four distance formulas are compared to determine which performs best for text similarity detection. The experimental results indicate that CHM-TSD achieves better precision, recall, and F1 than methods based on the cosine theorem and the Jaccard coefficient.
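
The pipeline described in the abstract (split characters into components, count component frequencies, compare histograms with a distance) can be illustrated with a minimal sketch. Everything specific here is an assumption: the `COMPONENTS` table is a tiny hypothetical decomposition, not the paper's mathematical expression of Chinese characters, and the abstract does not name its four distance formulas, so Manhattan and Euclidean distance are used purely as stand-ins.

```python
from collections import Counter
import math

# Hypothetical decomposition table for illustration only; the paper derives
# components from its own mathematical expression of Chinese characters.
COMPONENTS = {
    "好": ["女", "子"],
    "媽": ["女", "馬"],
    "明": ["日", "月"],
    "林": ["木", "木"],
}


def component_histogram(text):
    """Count how often each component occurs in the text.

    Characters missing from the table are counted as a single component
    (an assumption made so the sketch handles arbitrary input).
    """
    counts = Counter()
    for ch in text:
        counts.update(COMPONENTS.get(ch, [ch]))
    return counts


def manhattan_distance(h1, h2):
    """Sum of absolute count differences over all components (stand-in distance)."""
    keys = set(h1) | set(h2)
    return sum(abs(h1[k] - h2[k]) for k in keys)


def euclidean_distance(h1, h2):
    """Square root of summed squared count differences (stand-in distance)."""
    keys = set(h1) | set(h2)
    return math.sqrt(sum((h1[k] - h2[k]) ** 2 for k in keys))


if __name__ == "__main__":
    a = component_histogram("好林")
    b = component_histogram("媽林")
    print(a, b)
    print(manhattan_distance(a, b), euclidean_distance(a, b))
```

A smaller distance between two histograms indicates more similar texts. Because the histogram is indexed by components rather than words, its dimensionality is bounded by the size of the component inventory, which is the motivation the abstract gives for preferring it over word frequency vectors.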
