Analyzing Open Data with R and Hadoop: A Case Study of Weather and Agricultural Product Data

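In recent years, open data has become a prominent topic in information technology because it is regarded as holding substantial latent value. Within open data, open government data has received particular attention from Western countries and international organizations such as the United Nations, which have promoted it vigorously. However, open data released on the Internet comes in widely varying formats, and data from different sources often differ in how their fields are defined, which makes integration and analysis inconvenient. How to collect and integrate diverse open data so that analysts can analyze it and extract key information more quickly has therefore become a pressing question. This study proposes a prototype platform for data consolidation and analysis. Its main feature is that it automatically retrieves and merges open data, combines Hadoop's big data processing tools with R's data mining tools to analyze the data, and automatically retains the key factors once the analysis is complete so that they can be used in subsequent decision analysis. Finally, taking agricultural product transaction records and historical weather data as an example, the study retrieves and merges the data through the platform and applies the platform's decision tree data mining method in a looping analysis. After the model from each run is stored, the factors that jointly influence each agricultural product category are aggregated to give decision makers more complete reference information.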

Keywords

Big Data; Open Data; Open Government Data; Hadoop; R Language

Parallel Abstract

In recent years, open data has become a popular topic in information technology because of its substantial potential value. Western countries and international organizations such as the United Nations have worked to promote open government data. However, open data is obtained from various sources that rarely publish content in a uniform format, which makes the data inconvenient to integrate and analyze. It is therefore important to develop a mechanism that can collect and integrate heterogeneous open datasets seamlessly and help analysts retrieve potentially useful information efficiently. This study adopts the Hadoop platform and the R language to implement a prototype system that automatically captures and consolidates open data. After processing finishes, all results, including summarized data, analytical models, decision tree rules, and discovered key factors, are stored in a relational database and HDFS. As a case study, we collect agricultural transaction data and historical climate records through these procedures. The system then derives the key factors common to the crops in a specified category by applying the proposed looping decision tree mechanism.
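The looping decision tree mechanism described above can be illustrated with a minimal sketch. It is written in Python rather than the R and Hadoop stack used in the study so that it stays dependency-free, and it makes several assumptions not found in the abstract: a small variance-reduction regression tree stands in for whatever tree algorithm the platform uses, the function names (`tree_factors`, `common_factors`), the `min_share` threshold, and the field names are hypothetical, and the toy records are invented purely for illustration.

```python
from collections import Counter

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, target, features):
    """Return (feature, threshold, gain) for the best binary split, or None."""
    base = sse([r[target] for r in rows])
    best = None
    for f in features:
        # Dropping the largest value keeps both sides of the split non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [r[target] for r in rows if r[f] <= t]
            right = [r[target] for r in rows if r[f] > t]
            gain = base - sse(left) - sse(right)
            if gain > 1e-12 and (best is None or gain > best[2]):
                best = (f, t, gain)
    return best

def tree_factors(rows, target, features, depth=2):
    """Grow a small regression tree; return the set of features it splits on."""
    if depth == 0 or len(rows) < 2:
        return set()
    split = best_split(rows, target, features)
    if split is None:
        return set()
    f, t, _ = split
    left = [r for r in rows if r[f] <= t]
    right = [r for r in rows if r[f] > t]
    return ({f}
            | tree_factors(left, target, features, depth - 1)
            | tree_factors(right, target, features, depth - 1))

def common_factors(data_by_crop, target, features, min_share=1.0, depth=2):
    """Loop over crops, fit one tree per crop, and keep the factors used by
    at least `min_share` of the crops (hypothetical aggregation rule)."""
    counts = Counter()
    for rows in data_by_crop.values():
        counts.update(tree_factors(rows, target, features, depth))
    needed = min_share * len(data_by_crop)
    return sorted(f for f, n in counts.items() if n >= needed)

# Invented toy records for two crops in one category (not data from the study).
toy_data = {
    "crop_a": [
        {"rainfall": 0, "temperature": 20, "price": 10},
        {"rainfall": 0, "temperature": 25, "price": 10},
        {"rainfall": 10, "temperature": 20, "price": 20},
        {"rainfall": 10, "temperature": 25, "price": 20},
    ],
    "crop_b": [
        {"rainfall": 0, "temperature": 18, "price": 30},
        {"rainfall": 5, "temperature": 18, "price": 22},
        {"rainfall": 0, "temperature": 30, "price": 29},
        {"rainfall": 5, "temperature": 30, "price": 21},
    ],
}

print(common_factors(toy_data, "price", ["rainfall", "temperature"]))  # ['rainfall']
```

In this toy case both trees split on rainfall, while only the second crop's tree also splits on temperature, so rainfall is the only factor common to the whole category. Lowering `min_share` to 0.5 would keep factors that appear in at least half of the crops.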

Parallel Keywords

Big Data; Open Data; Open Government Data; Hadoop; R Language

