
從二階段分群萃取輿情事件

Extracting the Opinion Events from Two-Stage Clustering

Advisor: 洪智力

Abstract


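Online news is a common source from which people gather and receive information. By reading online news, many people learn about current social issues and events, then leave word-of-mouth comments and ideas that form online public opinion. Online news is timely and continuous: to fully understand a given issue or event, a reader must look back through a large volume of past news and also keep tracking how the event develops. Collecting, summarizing, and analyzing the public issues that circulate widely on the Internet is called public opinion mining. In the literature, the technique used for public opinion mining is topic detection and tracking (TDT), which identifies and analyzes possible topics in Internet information in an automated way. The clustering models used in TDT are unsupervised learning methods. A common one is K-means, whose main advantages are that it is easy to understand and operate and that it produces clearly separated clusters; however, when clustering large-scale data it has difficulty handling overlapping data. Another common method is the self-organizing map (SOM), which can quickly reveal how clusters are distributed in TDT, but its graphical output and its inability to partition clusters automatically make topic events hard to extract. A further method commonly used to extract topic-event keywords is latent Dirichlet allocation (LDA), which finds latent semantics in documents and extracts representative topic words. This study combines these methods: SOM produces the initial keyword clusters, K-means then yields the final keyword clusters, and finally each cluster is treated as a bag of words from which LDA extracts keywords. The experimental results indicate that the proposed method, by using two-stage clustering to reduce the first-stage output and extract important public-opinion keywords, performs better than traditional single-stage clustering on both macro-average and micro-average measures, whereas the SOM-only clustering method is better at presenting topic keywords.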
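The two clustering stages described above can be sketched in pure Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the 1-D chain of four SOM units, the linear weight initialization, the learning-rate and radius schedules, the first-k K-means initialization, the toy 2-D vectors, and all function names are hypothetical choices made for the example. The LDA keyword-extraction step is omitted.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, candidates):
    """Index of the candidate closest to vec (ties go to the lower index)."""
    return min(range(len(candidates)), key=lambda i: sq_dist(vec, candidates[i]))


def train_som(data, n_units=4, epochs=6, lr0=0.5):
    """Stage 1: train a 1-D SOM chain and return its unit weight vectors."""
    dims = len(data[0])
    lo = [min(v[d] for v in data) for d in range(dims)]
    hi = [max(v[d] for v in data) for d in range(dims)]
    # Assumed deterministic init: units spaced evenly between per-dimension min and max.
    units = [[lo[d] + (hi[d] - lo[d]) * i / (n_units - 1) for d in range(dims)]
             for i in range(n_units)]
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)           # linearly decaying learning rate
        radius = 1 if epoch < epochs // 2 else 0  # neighbourhood shrinks to the BMU only
        for vec in data:
            bmu = nearest(vec, units)             # best-matching unit
            for i in range(max(0, bmu - radius), min(n_units, bmu + radius + 1)):
                units[i] = [w + lr * (x - w) for w, x in zip(units[i], vec)]
    return units


def kmeans(points, k=2, iterations=10):
    """Stage 2: plain K-means; returns one cluster label per point."""
    centroids = [list(p) for p in points[:k]]     # assumed naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                           # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def two_stage_cluster(data, n_units=4, k=2):
    """Cluster the SOM units with K-means, then label each input via its BMU."""
    units = train_som(data, n_units=n_units)
    unit_labels = kmeans(units, k=k)
    return [unit_labels[nearest(vec, units)] for vec in data]


# Toy stand-ins for keyword vectors: two low values and two high values.
keyword_vectors = [[0, 0], [1, 1], [10, 10], [11, 11]]
labels = two_stage_cluster(keyword_vectors)
print(labels)
```

On this toy input the two low vectors end up sharing one label and the two high vectors share the other. K-means here runs on the four SOM units rather than on the raw vectors, which is the sense in which the second stage reduces the first-stage output.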

Parallel Abstract


Internet news is a source from which people collect and receive information. They read Internet news to learn about social events and leave word-of-mouth comments, which become opinions. Internet news is broadcast continuously. To get the full picture of a topic event, people need to review a large amount of earlier news and keep tracking how the event develops. Gathering, extracting, summarizing, and analyzing popular news events on the Internet is the task of public opinion mining. Traditional opinion mining usually uses topic detection and tracking (TDT) as its main method, which automatically identifies and analyzes possible topics in the information. TDT usually relies on clustering, an unsupervised learning approach. The most common model is K-means, which is easy to use and clusters data effectively. However, it has difficulty with overlapping data when clustering large-scale data. Another method is the self-organizing map (SOM), which can obtain the cluster distribution quickly, but its graphical results and its inability to partition clusters automatically make topic events harder to extract. The last method is latent Dirichlet allocation (LDA), which finds latent semantics in documents and extracts topic keywords. This paper proposes a two-stage clustering approach that combines these methods. The first step produces the initial keyword clusters with SOM. K-means then produces the final keyword clusters. Finally, each cluster is treated as a bag of words, and the final keywords are extracted by LDA. According to the experiments, the two-stage model performs better than traditional one-stage clustering under both macro-average and micro-average criteria, but one-stage clustering with SOM alone is better at the visual presentation of news events.
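The difference between the macro-average and micro-average criteria can be shown with a small worked example. The abstract does not say which base metric is averaged, so precision is assumed here purely for illustration; the function name and the per-cluster counts are hypothetical, not results from the thesis.

```python
def macro_micro_precision(counts):
    """counts: one (true_positives, false_positives) pair per cluster."""
    # Macro-average: compute precision per cluster, then average the scores,
    # so every cluster carries equal weight regardless of its size.
    per_cluster = [tp / (tp + fp) for tp, fp in counts]
    macro = sum(per_cluster) / len(per_cluster)
    # Micro-average: pool the counts first, then compute one precision,
    # so large clusters dominate the result.
    total_tp = sum(tp for tp, _ in counts)
    total_fp = sum(fp for _, fp in counts)
    micro = total_tp / (total_tp + total_fp)
    return macro, micro


# Hypothetical counts: a large accurate cluster and a small noisy one.
macro, micro = macro_micro_precision([(9, 1), (1, 1)])
print(round(macro, 3), round(micro, 3))  # prints: 0.7 0.833
```

The macro score averages the per-cluster precisions 0.9 and 0.5, while the micro score divides the 10 pooled true positives by all 12 assigned items, so the small noisy cluster pulls the macro score down further than the micro score.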

