近乎線性時間之社群偵測與分群演算法

在網路分析領域中，社群偵測與分群這兩個相近的議題十分被重視。由於網路資料大小快速增長，社群偵測與分群演算法的效率及擴展性變得日漸重要。在本篇論文中，我們提供許多方法來在多樣且大型的網路資料上進行社群偵測與分群。本篇論文可以分成三個部分。於第一部分中，我們提出了兩項目標：一、在有向（directed）網路中，正式且精準的定義圖的分群與社群偵測。二、在有向網路中，演算法設計以及分析。為此，我們開發了用於有向網路的機率框架，此框架是奠基於我們之前用於無向網路的版本。縱使將對聯合分佈的要求從「對稱」放寬至只需擁有「相同的邊緣分佈」，我們仍然可以正式的在有向網路中定義何謂「重要性」（centrality）、「相對重要性」（relative centrality）、「社群」（community）、「模組度」（modularity）。透過模組度與稀疏度守恆轉換，我們也將許多常見於無向網路的社群偵測演算法拓展至有向網路，例如：hierarchical agglomerative演算法、partitional演算法、fast unfolding演算法。透過機率框架，我們可以得知這些三種演算法會在有限步內收斂。其中，partitional演算法更被證明是近乎線性時間的演算法，而hierarchical agglomerative演算與fast unfolding演算法輸出結果必定符合社群的數學定義。這些演算法在經過少許修改之後，都可以被拓展至更普遍的聯合分佈。我們實驗了使用「PageRank」和「隨機漫步與後跳」兩種方法得到的聯合分佈。在本論文的第二部份中，我們提出了一個新的迭代演算法「K-sets+」。該演算法可用於對半度量空間（metric space）中的資料分群。而在半度量空間中，三角不等式未必會成立。我們證明了K-sets+會在有限步內收斂，並擁有與K-sets相同的效能。此外我們還將K-sets+拓展至僅有對稱性的相似性度量（similarity measure）。這樣的拓展大幅降低了計算複雜度，使得當相似性矩陣是疏鬆時，演算法會具有線型的時間複雜度與空間複雜度。我們也進行了許多實驗驗證K-sets+演算法的效率。實驗資料分別來自於隨機塊狀模型（stochastic block model）和WonderNetwork的網頁。在本論文的第三部份中，我們詳細描述了如何建構一個線性時間的fast unfolding演算法。由於時間複雜度與空間複雜度受資料結構影響甚大，我們介紹了三個在實現線性時間的fast unfolding演算法時必要的資料結構。分別為adjacency list、disjoint sets和array set。其中adjacency lsit是廣泛用於儲存疏鬆網路拓樸的資料結構，而disjoint sets和array set為我們所開發的資料結構，可用於避免超線性時間的運算，例如排序或在雜湊樹（或二元樹）中插入元素。我們也用實驗去驗證我們的實踐方法的效率與擴展性。在該實驗中，我們的方法速度為比較對象的3.6倍，而處理的網路連結數也可以上達十億個。

關鍵字

社群偵測；分群；網路科學；線性時間演算法

並列摘要

Community detection and clustering are two closely related issues that have drawn much of the attention in network analysis. Due to the rapid growth of the scale of networked data, the efficiency and the scalability of community detection algorithms and clustering algorithms are taken more seriously. In this thesis, we provide several efficient methods to perform community detection and clustering that can deal with diverse and large-scale data. The thesis is organized into three parts. In the first part of this thesis, we address two major points: (i) a formal and precise definition of the graph clustering and community detection problem in directed networks, and (ii) algorithm design and evaluation of community detection algorithms in directed networks. Motivated by these, we develop a probabilistic framework for structural analysis and community detection in directed networks based on our previous work in undirected networks. By relaxing the assumption from symmetric bivariate distributions in our previous work to bivariate distributions that have the same marginal distributions in this thesis, we can still formally define various notions for structural analysis in directed networks, including centrality, relative centrality, community, and modularity. We also extend three commonly used community detection algorithms in undirected networks to directed networks: the hierarchical agglomerative algorithm, the partitional algorithm, and the fast unfolding algorithm. These are made possible by two modularity preserving and sparsity preserving transformations. In conjunction with the probabilistic framework, we show these three algorithms converge in a finite number of steps. In particular, we show that the partitional algorithm is a nearly-linear time algorithm for large sparse graphs. Moreover, the outputs of the hierarchical agglomerative algorithm and the fast unfolding algorithm are guaranteed to be communities. These three algorithms can also be extended to general bivariate distributions with some minor modifications. We also conduct various experiments by using two sampling methods in directed networks: (i) PageRank and (ii) random walks with self-loops and backward jumps. In the second part of this thesis, we first propose a new iterative algorithm, called the K-sets+ algorithm for clustering data points in a semi-metric space, where the distance measure does not necessarily satisfy the triangular inequality. We show that the K-sets+ algorithm converges in a finite number of iterations and it retains the same performance guarantee as the K-sets algorithm for clustering data points in a metric space. We then extend the applicability of the K-sets+ algorithm from data points in a semi-metric space to data points that only have a symmetric similarity measure. Such an extension leads to great reduction of computational complexity. In particular, for an n × n similarity matrix with m nonzero elements in the matrix, the computational complexity of the K-sets+ algorithm is O((Kn+m)I), where I is the number of iterations. The memory complexity to achieve that computational complexity is O(Kn+m). As such, both the computational complexity and the memory complexity are linear in n when the n × n similarity matrix is sparse, i.e., m=O(n). We also conduct various experiments to show the effectiveness of the K-sets+ algorithm by using a synthetic dataset from the stochastic block model and a real network from the WonderNetwork website. In the third part of this thesis, we detail the implementation of the fast unfolding algorithm that has a nearly-linear time complexity and a linear memory complexity. Since the time and the memory complexity depend heavily on the data structures, we introduce three essential data structures for the implementation of the nearly-linear time fast unfolding algorithm: (i) adjacency list, (ii) disjoin sets, and (iii) array set. The adjacency list is a commonly used memory-efficient data structure for storing sparse networks. The disjoint sets and array set are our newly invented data structure that can allow us to avoid using superlinear operations such as sorting and insertings in a hash (or binary) tree. We also do an experiment to test the efficiency and scalability of our implementation of the fast unfolding algorithm. With the data structures and techniques designed by us, our implementation is 3.6 times faster than the competitor, and can cope with networks with one billion edges.