Thesis

運用語意分析技術與Spark分析開放資料 -以社群網路資料為例

Analyzing Open Data by Semantic Analysis and Spark - Using Data of Social Network Data

Advisor: 胡念祖


Big data has become a popular topic in data analysis. Its sources are diverse, including open government data, social networking sites, and bulletin board systems (BBS). Users communicate through these platforms, and the messages they post accumulate into a massive community database that can support analysis across many topics. In the past, however, collecting user opinions relied on manual market surveys, which were time-consuming and yielded too few samples to produce the expected results. Moreover, slow data collection that fails to match customer needs accurately can cause important business opportunities to be missed. Extracting the important information from the huge volume of data generated on popular community sites is therefore an important problem. This study uses data from PTT (批踢踢實業坊) as an example: it retrieves users' posts and converts the unstructured data into a structured format. After preprocessing, text mining methods are applied to extract meaningful information. In the text mining process, each article and each reply is first retrieved and decomposed, and a lexicon of positive and negative terms is built as the basis for analysis. Semantic analysis techniques then use this lexicon to determine whether the sentiment of each article or reply is positive or negative, and terms are clustered to reduce the problem of disorganized topics. A weighted lexicon is introduced to improve the accuracy of the Bayesian classifier, and users can define their own vocabulary to further improve the accuracy of sentiment analysis. The system not only shows users the tendency of post content but also provides analytical information as a reference for decision-making. In addition, this study improves the word segmentation algorithm and uses Spark distributed processing to improve analysis performance.
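The abstract does not state how the lexicon weights enter the Bayesian classifier. As a minimal sketch under one possible assumption, the example below multiplies each token's log-likelihood by a lexicon weight, so that sentiment-bearing words count more than neutral ones. The lexicon entries, the weight values, the training posts, and the function names `train` and `classify` are all hypothetical, and the tokens are assumed to be already segmented.

```python
import math
from collections import Counter

# Hypothetical weighted lexicon (an assumption, not the thesis's lexicon):
# words listed here count double; all other words get weight 1.0.
LEXICON_WEIGHTS = {"good": 2.0, "great": 2.0, "bad": 2.0, "awful": 2.0}


def train(docs):
    """Count documents per class and words per class.

    docs is a list of (tokens, label) pairs, tokens already segmented.
    """
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(tokens, model, weights=LEXICON_WEIGHTS):
    """Return the label with the highest weighted naive Bayes log-score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Laplace-smoothed likelihood of the token given the class.
            p = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            # Assumed weighting scheme: scale the log-likelihood by the lexicon weight.
            score += weights.get(tok, 1.0) * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, pre-segmented training posts.
TRAIN = [
    (["great", "phone", "good", "battery"], "pos"),
    (["good", "service"], "pos"),
    (["bad", "screen", "awful", "battery"], "neg"),
    (["awful", "service"], "neg"),
]
MODEL = train(TRAIN)

if __name__ == "__main__":
    print(classify(["good", "battery"], MODEL))
    print(classify(["awful", "screen"], MODEL))
```

Because each post is scored independently of the others, a function like `classify` could in principle be applied record by record in a distributed map step, which is one way the Spark processing mentioned in the abstract might be organized.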


