  • Dissertation


Incremental Significant URL Mining and Comments Summarization in Real-Time Social Media

Advisor: 陳銘憲


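Social media networks have in recent years become a real-time and effective communication platform. Whether for news, natural disasters, or festivals, people often use social media to share and spread information. With the spread and rapid growth of handheld devices, people can post messages on social media even more promptly, so the information there reflects current events more directly. However, because a huge amount of information is posted on social media platforms, browsing all of it is time-consuming and laborious for users. How to mine this massive data effectively, or to provide at-a-glance summaries of it, has therefore become a highly promising research direction (for example, breaking news discovery, traffic monitoring, and natural disaster monitoring). On the other hand, although celebrities, companies, and organizations often use social media to interact with their fans or customers, the volume of data is very large and keeps growing over time, which makes these messages hard to analyze. To analyze them effectively, in this dissertation we first propose a URL mining method (SURLMINE for short) that draws on a variety of features to help find noteworthy URLs in social media. Unlike previous search approaches, URLs carry no language restriction, which makes the information sources more complete and comprehensive. In addition, according to our experimental findings, data posted in English currently accounts for about 35% of all data on Twitter; providing information through URLs therefore not only improves the completeness of the information but also reduces the errors introduced by translating between languages, and so achieves real-time message recommendation more efficiently. Furthermore, to capture the gist of a message stream effectively, we propose an incremental summarization method that lets users grasp the outline of the entire message stream at a glance. In experiments on real data, URL mining finds important URLs with precision as high as 92%, and the incremental summarization clustering method effectively finds the important clusters while excluding outliers.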


Social media platforms have recently emerged as a powerful, real-time means of communication. People use social media to share and exchange information about all kinds of events, ranging from breaking news stories and natural disasters to local festivals. With the rapid development of mobile technologies, messages posted on social media typically reflect these events as they happen. However, because of the dramatic growth of social media data, it has become infeasible for users to read all posts or comments. Mining and summarizing rich user-generated content in social media therefore presents great opportunities for developing applications (e.g., breaking news discovery, traffic monitoring, and natural disaster monitoring). On the other hand, celebrities, corporations, and organizations also set up social pages to interact with their fans and the public. Although it is important for them to understand how their fans and customers react to certain topics and content, the volume and rapid growth of social media data make it time-consuming to get an overview of a comment stream. Therefore, in this dissertation, we first propose a significant URL mining approach (named SURLMINE) that ranks URLs on social media based on various features. Note that URLs are language independent. It is also worth noting that only 35% of tweets on Twitter are posted in English. In other words, mining social media content through URLs can draw on data in many different languages. Above all, it is efficient, and nothing is lost in translation. In addition, to summarize comment streams, we propose a real-time incremental short text summarization method for comment streams (abbreviated as IncreSTS), which provides an at-a-glance presentation from which users can easily and rapidly understand the main points of similar comments.
Our experiments on real datasets show that SURLMINE reaches up to 92% precision on YouTube datasets and that IncreSTS offers high efficiency, high scalability, and better handling of outliers on the target problem.


