Analyzing Open Data with Semantic Analysis Techniques and Spark
- A Case Study of Social Network Data

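Big data has become a highly popular topic in data analysis, and its sources are diverse: open government data, social networking sites, bulletin board systems (BBS), and many other platforms where users interact. The messages users post through these platforms form a massive repository of social data that supports analysis across many topics. In the past, however, collecting user opinions for analysis usually relied on manual market surveys, which were time-consuming and yielded too few samples to deliver the expected results; moreover, when data collection is slow and cannot precisely match customer needs, important market opportunities are missed. How to surface the important information in this vast body of social data is therefore a pressing issue. This study takes data from PTT (批踢踢實業坊) as an example, captures users' opinions, and restructures the unstructured data into structured data. After preprocessing, text mining methods are used to extract information. In the text mining process, each article and each reply is first captured and broken down, and a lexicon of positive and negative word polarity is built as the basis for analysis. Next, semantic analysis techniques apply this lexicon to determine whether the sentiment of users' articles and replies is positive or negative, and word clustering is used to reduce topical clutter. A weighted lexicon is then introduced to improve the accuracy of the Bayesian classifier, and users may define their own vocabulary to further improve the accuracy of sentiment analysis. Besides reporting the sentiment tendency of article content to users, the system provides analytical information as a reference for decision making. In addition, this study improves the word segmentation algorithm and uses Spark distributed processing to raise analysis performance.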
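As a rough illustration of the lexicon-based step described above, the following minimal Python sketch scores a post that has already been segmented into tokens. The lexicon entries, the `polarity` function, and its parameters are hypothetical and chosen only for illustration; the abstract does not specify the study's actual lexicon or scoring rule.

```python
# Hypothetical toy lexicons; the study's real lexicons are not given in the abstract.
POSITIVE = {"good", "great", "recommend"}
NEGATIVE = {"bad", "terrible", "avoid"}


def polarity(tokens, user_positive=(), user_negative=()):
    """Label a token list as positive, negative, or neutral.

    user_positive / user_negative model the user-defined vocabulary:
    they are merged into the base lexicons before scoring.
    """
    pos = POSITIVE | set(user_positive)
    neg = NEGATIVE | set(user_negative)
    # +1 for each positive word, -1 for each negative word.
    score = sum((t in pos) - (t in neg) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A word unknown to the base lexicon leaves the label neutral until the user adds it, for example `polarity(["meh"], user_negative=["meh"])`, which is one plausible way user-defined vocabulary could change the result.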

Keywords

Big data; social networking sites; Spark; sentiment analysis

Parallel Abstract

Big data has become a popular topic in the field of data analysis, and its sources are highly diverse, including open government data, social networking sites, and bulletin board systems (BBS). Users communicate with each other on these platforms, and their messages accumulate into a massive community database that can be used to analyze many topics. In the past, however, gathering user feedback through manual market surveys was time-consuming and laborious, and the small samples degraded the quality of the results. In addition, slow data collection not only fails to meet customers' requirements accurately but may also cause business opportunities to be missed. Summarizing the important information within the huge volume of data generated on popular community sites is therefore an important issue. Using data from PTT as an example, this study fetches users' posts and transforms the unstructured data into a structured format. We then adopt text mining methods to extract meaningful information. In the text mining process, we first acquire and decompose every article and reply and establish a lexicon of positive and negative words. Then, using semantic analysis techniques, the proposed system analyzes published articles and replies to determine their sentiment, and clusters words to reduce the problem of disorganized topics. We also improve the accuracy of the Bayesian classifier by introducing a weighted lexicon. Moreover, users can define their own vocabulary to enhance the precision of the analysis. These approaches not only help users understand the tendency of posts but also provide reference information for decision making. Finally, this study improves the word segmentation algorithm and uses Spark distributed processing to improve computational performance.
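One possible reading of a weighted lexicon combined with a Bayesian classifier is a multinomial naive Bayes model in which each token's log-likelihood is scaled by a lexicon weight. The Python sketch below makes that assumption explicit; the function names, the weighting scheme, and the toy data are hypothetical and are not taken from the study.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (tokens, label) pairs. Returns per-class counts and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, tokens, weights=None):
    """Return the label with the highest weighted log-posterior.

    weights maps a word to a multiplier on its log-likelihood
    (assumed scheme; words absent from the map get weight 1.0).
    """
    class_counts, word_counts, vocab = model
    weights = weights or {}
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # Laplace smoothing
        score = math.log(n_docs / total_docs)
        for t in tokens:
            score += weights.get(t, 1.0) * math.log((counts[t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration only.
docs = [
    (["good", "service"], "pos"),
    (["great", "food"], "pos"),
    (["bad", "service"], "neg"),
    (["terrible", "food"], "neg"),
]
model = train(docs)
```

With this toy data, giving a strongly negative word a larger weight, as in `predict(model, ["good", "bad"], {"bad": 2.0})`, tips an otherwise balanced post toward the negative class.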

Parallel Keywords

Big Data; Social Network Sites; Spark; Sentiment Analysis
