Design and Implementation of a Distributed Social Network Data Crawler

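This thesis presents the design and implementation of a distributed data crawler. The crawler targets fan page data on the Facebook social platform and retrieves it through the Graph API provided by Facebook. It can be directed at the user population of a specific region; the current experiments cover fan pages related to Taiwan. Using a master-worker (client-server) distributed architecture, the crawler lets multiple machines fetch data at the same time. Experiments show that the system can effectively and comprehensively collect the relevant fan page nodes through search and automatic expansion, and that it can efficiently collect the latest data from Facebook fan pages every day.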

Keywords

Data crawler

Parallel Abstract

In this thesis, we design and implement a distributed data crawling system. The system collects data from the Facebook social network via the Graph API provided by Facebook. It can be tuned to crawl specific regions of interest; our current experiments cover the Facebook Pages related to users in Taiwan. The system uses a client-server distributed architecture in which the master server manages the job queues and the clients fetch data from the target pages. It automatically and comprehensively collects Facebook Pages of interest through edge exploration and keyword searches. The experiments show that our concurrent crawlers can fetch new data from a large number of targeted pages very efficiently every day.
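The division of labor described above, a master holding a job queue while several clients fetch pages and report newly discovered neighbors, can be illustrated with a minimal sketch. This is not the thesis implementation: threads in one process stand in for separate machines, and `FAKE_GRAPH`, `fetch_linked_pages`, and `crawl` are hypothetical names introduced only for illustration. No real Graph API call is made; the fetch function reads from an in-memory dictionary.

```python
import queue
import threading

# Hypothetical stand-in for the page graph. A real crawler would
# query the Graph API here instead of reading a dictionary.
FAKE_GRAPH = {
    "seed_a": ["page_b", "page_c"],
    "page_b": ["page_c", "page_d"],
    "page_c": [],
    "page_d": ["seed_a"],
}


def fetch_linked_pages(page_id):
    """Placeholder for a network call returning pages linked from page_id."""
    return FAKE_GRAPH.get(page_id, [])


def crawl(seeds, num_workers=3):
    """Expand from the seed pages until no unseen page remains."""
    jobs = queue.Queue()      # job queue owned by the "master"
    seen = set(seeds)         # pages already scheduled
    lock = threading.Lock()   # protects `seen` across workers
    for seed in seeds:
        jobs.put(seed)

    def worker():
        while True:
            page_id = jobs.get()
            if page_id is None:           # sentinel: no more work
                jobs.task_done()
                break
            try:
                for neighbor in fetch_linked_pages(page_id):
                    with lock:
                        if neighbor in seen:
                            continue
                        seen.add(neighbor)
                    jobs.put(neighbor)    # schedule the new page
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    jobs.join()                # wait until every page job is processed
    for _ in threads:
        jobs.put(None)         # one sentinel per worker
    for t in threads:
        t.join()
    return seen
```

Calling `crawl(["seed_a"])` on the toy graph returns all four page identifiers. In a multi-machine deployment the in-process queue would be replaced by a networked one, and the `seen` set would live on the master; both choices are assumptions of this sketch rather than details stated in the abstract.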