  • Thesis

基於預測熱門度之大規模即時社群爬蟲演算法分析與設計

An efficient crawling algorithm for large-scale real-time social stream data collection based on popularity prediction

Advisor: 黃乾綱

Abstract


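In recent years, social networks have changed the way we communicate and have accumulated vast amounts of data on human behavior, drawing many emerging research topics to combine with social network behavior analysis. Such analysis usually requires a large volume of data, and recent work has moved toward time-domain analysis, in which a snapshot of a specific research target must be taken at regular intervals; popular messages in particular need denser snapshots to reveal how user behavior changes over time. Because these social networks have complex structures and limit the volume and frequency of data that crawlers may access, collection is difficult for the data acquisition departments of most institutions, and the efficiency of data retrieval cannot be optimized effectively. Obtaining timely and sufficient data requires accessing the social network at high frequency, which both wastes network resources and increases the load on the social network. In addition, current social network privacy policies do not allow different organizations to share data, and Facebook even uses encrypted IDs to protect user data. These restrictions make it harder for a single research institution to share data with others, so it cannot use existing crawl scheduling algorithms to divide data collection among institutions. In this paper, we propose a new crawl ordering algorithm that takes users' past behavior into account: the more data it collects, the better it can predict whether a collection target is popular and whether it will publish more posts. The algorithm addresses resource allocation in large-scale vertical crawling and the problem that dynamic web pages cannot be collected by general-purpose crawlers. In this study, we evaluate crawling performance by the message popularity collected per unit of resource. Experimental results show that, when collecting 99.5% of the popular messages on a social network, our algorithm saves up to 40% of the crawler's network calls.

Parallel Abstract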


Social media has greatly changed the way we communicate, and a huge amount of social behavior data is recorded and accumulated as a result. This data is now widely applied to many emerging research problems in combination with social behavior analysis. More recently, time-domain analysis has become especially popular for investigating behavior change: researchers take snapshots of a particular subject in the network at regular intervals, and hot messages (posts) need more frequent snapshots so that changes in user behavior over time can be tracked precisely. Scraping social networking sites such as Twitter and Facebook is not an easy task for the data acquisition departments of most institutions, since these sites often have complex structures and restrict the amount and frequency of data they release to crawlers. To get more snapshots, groups often consume more computation power and network resources, and even increase the load on OSN (Online Social Network) sites. In addition, current privacy control policies do not allow different groups to share data with one another. These constraints make it challenging for an individual research group to collect sufficient data, whether by using existing crawl scheduling algorithms or by collaborating with other partners. In this paper, we propose a novel crawling ordering algorithm that allows our crawlers to focus on popular content by collecting and analyzing user behavior. The designed crawler also addresses the problems of large-scale vertical crawling and of dynamic web pages. The performance of our crawling ordering algorithm is evaluated with a set of purpose-designed metrics. The experimental results show that the algorithm can save up to 40% of requests when crawling the top 99.5% of the popular social stream.
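
The abstract does not spell out the algorithm itself, so the following is only a minimal Python sketch of the general idea of popularity-ordered crawling under a request budget, not the authors' method. The `Source` class, the `crawl` function, the `fetch` callback, and the use of a running mean of past engagement as the popularity predictor are all assumptions introduced here for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Source:
    """A crawl target (e.g. a page or account); all fields are hypothetical."""
    name: str
    history: list = field(default_factory=list)  # engagement seen on past visits

    def predicted_popularity(self) -> float:
        # Assumption: the mean of past observations stands in for the predictor.
        # Unseen sources get +inf so that each one is visited at least once.
        if not self.history:
            return float("inf")
        return sum(self.history) / len(self.history)


def crawl(sources, fetch, budget):
    """Spend `budget` requests, always visiting the source predicted most popular.

    `fetch(name, visit_index)` is a caller-supplied stand-in for one API call
    and returns the engagement (e.g. likes + comments) observed on that visit.
    Returns (total engagement collected, list of visited source names).
    """
    # heapq is a min-heap, so priorities are negated; the unique index breaks ties.
    heap = [(-s.predicted_popularity(), i, s) for i, s in enumerate(sources)]
    heapq.heapify(heap)
    collected, order = 0.0, []
    for _ in range(budget):
        if not heap:
            break
        _, i, src = heapq.heappop(heap)
        seen = fetch(src.name, len(src.history))
        src.history.append(seen)  # more data refines the next prediction
        collected += seen
        order.append(src.name)
        heapq.heappush(heap, (-src.predicted_popularity(), i, src))
    return collected, order


# Toy run with made-up engagement rates per visit.
RATES = {"hot": 50.0, "warm": 5.0, "cold": 0.0}
demo_sources = [Source("hot"), Source("warm"), Source("cold")]
demo_collected, demo_order = crawl(demo_sources, lambda name, k: RATES[name], budget=6)
print(demo_order, demo_collected / len(demo_order))
```

In this toy run the sketch visits each of the three hypothetical sources once and then spends the remaining budget on the one that has been most popular so far. Dividing the collected engagement by the number of requests gives a popularity-per-request figure in the spirit of the evaluation described in the abstract.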

