This paper analyzes the advantages and disadvantages of the traditional K-means and Canopy algorithms and proposes an improved K-means algorithm based on Canopy. The method uses the "min-max principle" to reduce the algorithm's space complexity and randomness, and it is implemented on the MapReduce programming model of the Hadoop platform. Experiments show that the method achieves higher accuracy and stability than the traditional K-means and Canopy algorithms.
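
To make the seeding idea concrete, the following is a minimal single-machine sketch, not the paper's implementation. It assumes that the "min-max principle" means choosing each new initial center as the point whose minimum distance to the centers already chosen is largest; the abstract does not define the term, so this reading is an assumption. The Canopy thresholds and the MapReduce stages are omitted, and the function names `min_max_seeds` and `kmeans` are hypothetical.

```python
import math


def min_max_seeds(points, k):
    """Pick k initial centers deterministically (assumed min-max rule).

    Starts from the first point, then repeatedly adds the point whose
    minimum distance to the centers chosen so far is the largest.
    """
    seeds = [points[0]]
    # nearest[i] = distance from points[i] to its closest chosen seed
    nearest = [math.dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        seeds.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return seeds


def kmeans(points, centers, iters=20):
    """Plain Lloyd iterations starting from the given centers."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        # Update step: move each center to the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return centers, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, min_max_seeds(points, 2))
```

Because the seeds are chosen by a deterministic rule rather than at random, repeated runs on the same data yield the same clustering, which is one way to read the stability claim above.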