Finding Sequence Clusters: A Shared Near Neighbors Approach
Jia-Lien Hsu; Tzu-Han Hsiao
SNN; MBR; multi-label clustering; subsequences clustering; sequence clustering
Journal of Information Science and Engineering
Vol. 31, No. 5 (2015/09/01)
1647 - 1667
Sequence clustering is one of the most fundamental topics and can be applied in various research fields. Most previous work on sequence clustering is dedicated to single-label clustering, in which the overall similarity of equal-length sequences is considered and measured by the Euclidean distance function. However, the intrinsic properties of sequences demand multi-label clustering. In addition, Euclidean distance in high-dimensional space suffers from the curse of dimensionality. Therefore, in this paper, we employ the concept of shared near neighbors (SNN) as the sequence similarity measure and integrate it into the multi-label clustering process. Given a set of sequences, our approach first applies the sliding window technique and the DCT mapping to the sequences to obtain feature vectors. These feature vectors, associated with the SNN similarity, are then grouped by applying graph-based clustering and hierarchical clustering, respectively. We also design a validity measure and perform experiments to show the efficiency and effectiveness of our approach. In addition, the feature vectors are approximated by minimum bounding rectangles (MBRs). Because there are fewer MBRs than feature vectors, the computational complexity is reduced accordingly without compromising clustering validity.
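The feature-extraction and similarity steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the window width, the step, the number of retained DCT coefficients, and the neighbor count `k` are all hypothetical choices, the DCT is an unnormalized DCT-II, and SNN similarity is taken here in its common form as the number of neighbors shared by two k-nearest-neighbor lists, which may differ from the exact definition used in the paper.

```python
import math


def sliding_windows(seq, width, step=1):
    """Return every length-`width` subsequence of `seq`, advancing by `step`."""
    return [seq[i:i + width] for i in range(0, len(seq) - width + 1, step)]


def dct_features(window, n_coeffs):
    """First `n_coeffs` unnormalized DCT-II coefficients of `window`."""
    n = len(window)
    return [
        sum(x * math.cos(math.pi / n * (t + 0.5) * k)
            for t, x in enumerate(window))
        for k in range(n_coeffs)
    ]


def knn_lists(vectors, k):
    """For each vector, the index set of its k nearest neighbors (Euclidean)."""
    result = []
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: math.dist(v, vectors[j]))
        result.append(set(others[:k]))
    return result


def snn_similarity(neighbors, i, j):
    """Number of neighbors shared by the k-NN lists of vectors i and j."""
    return len(neighbors[i] & neighbors[j])


# Toy sequence (assumed data): windows of width 4 taken every 2 samples.
sequence = [0, 0, 0, 0, 5, 5, 5, 5]
windows = sliding_windows(sequence, width=4, step=2)
features = [dct_features(w, n_coeffs=2) for w in windows]

# Toy feature vectors (assumed data) forming two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
neighbors = knn_lists(points, k=2)
same_group = snn_similarity(neighbors, 0, 1)   # vectors from one group
cross_group = snn_similarity(neighbors, 0, 3)  # vectors from different groups
```

In this sketch, vectors from the same group share at least one neighbor while vectors from different groups share none, which is the property a graph-based or hierarchical clustering step could then exploit. Ranking neighbors still uses Euclidean distance, but the similarity itself depends only on neighbor-list overlap rather than on raw distance values.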