
Fast Nearest Neighbor Classification Using Class-Based Clustering

Advisor: 陳同孝

Abstract

The Nearest Neighbor Rule (NNR) is a parameter-free classifier whose strengths are a simple principle, ease of use, and high predictive accuracy. However, its time and memory costs grow with the size of the training data. Many improvements have been proposed: some reorganize the structure of the training data, some condense it by removing redundant training samples, and others integrate these two techniques. Yet earlier methods either fail to balance speed and accuracy, are overly complex, or require many parameters to be set. This study proposes a parameter-free method for accelerating NNR. The idea is to first apply class-based clustering to divide the training data into clusters whose members all belong to the same class, and then to extract each cluster's border samples, found through nearest-neighbor searches between clusters of different classes, as cluster representatives that form the training model. Because the representatives are border samples rather than cluster centers, discarding a cluster's interior data does not reduce predictive accuracy. In the predicting phase, the method further narrows the nearest-neighbor search area using the distance between a test sample and its nearest cluster, which speeds up prediction. The method was evaluated on 4 two-dimensional datasets and 13 public benchmark datasets, and compared with well-known classifiers: NNR, Condensed NNR, Edited NNR, Clustering-Launched Classification, and the Support Vector Machine. Experimental results show that the proposed algorithm is very easy to use while balancing speed and accuracy, so it can make classification work faster and more efficient.
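The abstract describes the algorithm only at a high level, so the Python sketch below is one plausible reading rather than the thesis's exact procedure: per-class k-means stands in for the unspecified class-based clustering, border points are taken to be the between-cluster nearest neighbors, and a triangle-inequality test stands in for the search-narrowing rule. All names (class_based_clusters, build_model, predict) are hypothetical.

```python
import numpy as np

def class_based_clusters(X, y, k=3, iters=20, seed=0):
    """Partition the training data into single-class clusters.

    The thesis's clustering procedure is not given in the abstract, so this
    simply runs k-means independently inside each class (an assumption).
    Returns a list of (member_indices, class_label) pairs.
    """
    rng = np.random.default_rng(seed)
    clusters = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        pts = X[idx]
        k_c = min(k, len(pts))
        centers = pts[rng.choice(len(pts), size=k_c, replace=False)].copy()
        for _ in range(iters):
            assign = np.argmin(((pts[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
            for j in range(k_c):
                if np.any(assign == j):
                    centers[j] = pts[assign == j].mean(axis=0)
        clusters += [(idx[assign == j], c) for j in range(k_c) if np.any(assign == j)]
    return clusters

def build_model(X, clusters):
    """Keep each cluster's border points as its representatives.

    Border points are read here as the between-cluster nearest neighbors:
    for every pair of different-class clusters, each point's nearest
    neighbor in the opposite cluster marks a border point on each side.
    The center and radius of the retained points support pruning later.
    """
    keep = set()
    for i, (ia, ca) in enumerate(clusters):
        for ib, cb in clusters[i + 1:]:
            if ca == cb:
                continue
            d = np.linalg.norm(X[ia][:, None, :] - X[ib][None, :, :], axis=-1)
            keep.update(ia[np.argmin(d, axis=0)])  # border of the first cluster
            keep.update(ib[np.argmin(d, axis=1)])  # border of the second cluster
    model = []
    for members, c in clusters:
        reps = np.array([m for m in members if m in keep])
        if reps.size == 0:
            reps = members  # degenerate case: keep the whole cluster
        center = X[reps].mean(axis=0)
        radius = np.linalg.norm(X[reps] - center, axis=1).max()
        model.append((X[reps], c, center, radius))
    return model

def predict(x, model):
    """1-NN over border representatives only, visiting clusters from the
    nearest center outward; a cluster is skipped when the triangle
    inequality proves it cannot hold a closer representative."""
    dc = np.array([np.linalg.norm(x - center) for _, _, center, _ in model])
    best, label = np.inf, None
    for i in np.argsort(dc):
        reps, c, _, radius = model[i]
        if dc[i] - radius > best:
            continue  # every representative here is farther than `best`
        d = np.linalg.norm(reps - x, axis=1).min()
        if d < best:
            best, label = d, c
    return label

# Toy usage: two Gaussian classes in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
model = build_model(X, class_based_clusters(X, y))
print(predict(np.array([3.0, 3.5]), model))  # predicted class label
```

The pruning test is conservative: by the triangle inequality, every representative of a cluster is at least |x - center| - radius away from the query, so skipping such clusters can never change the 1-NN answer over the retained representatives.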

References


[3] P. E. Hart, "The condensed nearest neighbor rule," IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 515–516, 1968.
[4] G. W. Gates, "The reduced nearest neighbor rule," IEEE Transactions on Information Theory, vol. 18, no. 3, pp. 431–433, 1972.
[5] D. L. Wilson, "Asymptotic properties of nearest neighbor rules using edited data," IEEE Transactions on Systems, Man, and Cybernetics, vol. 2, no. 3, pp. 408–421, 1972.
[6] R. F. Sproull, "Refinements to nearest-neighbor searching in k-dimensional trees," Algorithmica, vol. 6, no. 4, pp. 579–589, 1991.
[7] R. L. Brown, "Accelerated template matching using template trees grown by condensation," IEEE Transactions on Systems, Man, and Cybernetics, vol. 25, no. 3, pp. 523–528, 1995.
