非監督式學習之圖論分聚與分聚加速

聚類是監督學習中的一個重要領域，具有許多現實生活應用，如模式識別，圖像分析，信息分析，醫學，計算機圖形學等。本論文重點研究聚類的高效算法的開發，旨在提供低復雜度（例如非線性）聚類結構的數據的解決方案。在論文的第一部分，我們在相當溫和的假設下開發了一種基於圖論的新算法。我們的想法是確定一組合適的噪聲對象（在本論文中稱為奇點），以便表示數據的圖可以很好地分解為具有同類對象的子圖。我們還為選擇此算法中使用的參數提供了有用的指導。在論文的第二部分，我們介紹了另一種基於頻譜分析的聚類算法。這個想法是將經典的基於分區的聚類問題表示為雷利商(Rayleigh quotient)，並使用譜分析來近似解決方案。這種算法的好處是它可以應用於基於分區的聚類的任何內核技巧(kernel tricks)，它克服了譜聚類的限制（它只能用於RBF內核函數）。此外，對於大數據集，所提出的算法可以減少大量的計算負荷而不會削弱聚類精度。總之，提出的兩種算法主要用於減少具有復雜聚類結構的大尺寸數據的計算量。與其他復雜的聚類算法（例如kernel K-means）相比，它們具有相對較低的計算複雜度，並且數值證據表明它們相對於許多眾所周知的基準數據集也具有相當高的聚類精度。

關鍵字

非監督是學習；分群；圖論；頻譜分析；雷利商

並列摘要

Clustering is an important field in supervised learning and has many real life appli-cations, such as pattern recognition, image analysis, information analysis, medicine, computer graphics, etc. This thesis focuses on the development of efficient algorithms for clustering, which aims to provide low-complexity solutions for data with compli-cated (e.g. nonlinear) clustering structures. In the first part of the thesis, we develop a novel algorithm based on graph theory under fairly mild assumptions. The idea is to identify a suitable set of noise objects (named as singular points in this thesis) so that the graph that represents data can be well split into sub-graphs with homogeneous objects. We also provide useful guidelines for choosing the parameters used in this algorithm. In the second part of the thesis, we introduce another clustering algorithm based on spectral analysis. The idea is to formulate the classical partitioning-based clustering problem as a Rayleigh quotient one and approximate the solution by using spectral analysis. The benefit of this algorithm is that it can be applied to any kernel tricks for the partitioning-based clustering – which overcomes the limit of the spectral clustering (it can be used merely for the Radial Basis Function kernel). Further, for large data sets the proposed algorithm can reduce a considerable amount of computa-tion load without scarifying the clustering accuracy. In summary, the proposed two algorithms are mainly designed to reduce the computation load for large sizes of data with complicated clustering structures. They have relatively lower computation com-plexity in comparison with other sophisticated clustering algorithms (e.g. the kernel K-means), and, numerical evidences show that they also have fairly high clustering accuracy with respect to a number of well-known benchmark data sets.

並列關鍵字

Unsupervised Learning ； Clustering ； Graph Theory ； Spectral Analysis ； Rayleigh quotient

參考文獻

Bewley, A., & Upcroft, B. (2013). Advantages of Exploiting Projection Structure for Segmenting Dense 3D Point Clouds. In Australian Conference on Robotics and Automation.

Google Scholar

Antonio Di Marco - Roberto Navigili,"Clustering and Diversifying Web Search Results with Graph Based Word Sense Induction".

Google Scholar

R. Sibson (1973). "SLINK: an optimally efficient algorithm for the single-link cluster method" (PDF). The Computer Journal. British Computer Society.

Google Scholar

Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M., eds. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press.

Google Scholar

Newman, M. E. J. (2006). "Modularity and community structure in networks".

Google Scholar

國際替代計量

非監督式學習之圖論分聚與分聚加速

全文下載

主題瀏覽