  • Thesis


Using ETL and Decision Tree to Integrate and Analyze Open Data - A Case Study of PM2.5

Advisor: 胡念祖


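Open data has become a very popular topic in data analysis. Its sources are diverse and wide-ranging, and the questions that can be analyzed vary widely, so it can yield useful analytical results that serve as a reference for decision-making. Meanwhile, air quality has drawn growing attention in recent years. Among air pollutants, fine particulate matter (PM2.5), a harmful factor that poses a high risk to human health, receives the most attention from governments, which have listed it as a regulated air pollutant and set air quality standards as a safeguard. The hidden danger that PM2.5 poses to daily life can no longer be ignored.

However, collecting open data has traditionally required analysts to download data manually, day by day. When heterogeneous data is integrated for analysis, the differing formats of the sources and the many kinds of invalid values mean that a great deal of time goes into parsing and processing. This is labor-intensive, and manual judgment is error-prone, so the expected analytical results are often not achieved. How to integrate heterogeneous open data more effectively, and let decision-makers analyze it and extract key information more quickly, is a major challenge that big data analysis currently faces.

Therefore, taking PM2.5 as a case study, this research proposes an ETL data-processing framework that integrates data extraction (Extract), data parsing (Transform), and data warehousing (Load). First, the framework integrates open weather and air quality data, running the extract, transform, and load steps to refine the data and exclude invalid content. Next, correlation analysis and decision tree analysis are used to identify the important factors that influence changes in PM2.5 concentration. Finally, visual charts are drawn so that decision-makers can grasp how PM2.5 concentrations change and how they are distributed, and make decisions accordingly.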
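The Transform step described above, which excludes invalid content before loading, can be sketched roughly as follows. This is a minimal illustration in plain Python, not the study's actual pipeline. The field names, the `-999` sentinel, and the valid ranges are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical raw rows, as they might arrive from open-data sources.
# Field names and the -999 sentinel are assumptions for illustration only.
RAW_ROWS = [
    {"station": "A", "date": "2016-01-01", "pm25": "35", "temp": "18.2"},
    {"station": "A", "date": "2016-01-02", "pm25": "", "temp": "17.9"},      # empty value
    {"station": "A", "date": "2016-01-03", "pm25": "-999", "temp": "19.0"},  # abnormal value
    {"station": "A", "date": "2016-01-04", "pm25": "4x", "temp": "20.1"},    # bad format
    {"station": "A", "date": "2016-01-05", "pm25": "52", "temp": "21.4"},
]


def parse_measurement(text, low: float, high: float) -> Optional[float]:
    """Return the value as a float, or None if it is missing, malformed, or out of range."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    # NaN fails both comparisons, so it is rejected here as well.
    return value if low <= value <= high else None


def transform(rows):
    """Keep only rows whose PM2.5 and temperature readings are both valid."""
    clean = []
    for row in rows:
        pm25 = parse_measurement(row.get("pm25"), 0.0, 1000.0)   # assumed plausible range
        temp = parse_measurement(row.get("temp"), -50.0, 60.0)   # assumed plausible range
        if pm25 is None or temp is None:
            continue  # exclude the row instead of guessing a replacement value
        clean.append(
            {"station": row["station"], "date": row["date"], "pm25": pm25, "temp": temp}
        )
    return clean


CLEAN_ROWS = transform(RAW_ROWS)
```

Dropping invalid rows is only one possible policy. A pipeline could instead impute or flag such values, and the abstract does not specify which rule was applied to each field.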


Open data has become a popular topic in data analysis. Its data sources are highly diverse and can yield useful analytical results that serve as a reference for decision-making. In addition, air quality has been a focus of attention in recent years. Fine particulate matter (PM2.5) has become an indicator of air pollution in Europe, the United States, and other advanced countries. As an air pollutant that is highly dangerous to human health, it is the one governments take most seriously, and it is being included among regulated air pollution items. In academia, many scholars have begun to analyze the causes and impact of PM2.5.

However, obtaining results has always been time-consuming and laborious when data is acquired manually, and the quality of those results suffers. How to make heterogeneous data analysis more effective, and enable decision-makers to analyze data and extract key information more rapidly, is a major challenge that big data analysis currently faces.

Therefore, this study proposes a data processing flow that adopts a well-known ETL tool, SQL Server Integration Services, to download and integrate climate and air quality data from open data provided by governments. During the ETL process, a data quality procedure corrects invalid data, such as null values, abnormal values, and incorrectly formatted values. The study then identifies the relationships among all variables using correlation analysis. In addition, a decision tree algorithm is used to examine the corresponding factors.
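As a rough illustration of the two analysis steps named above, the sketch below computes a Pearson correlation coefficient and then finds the single best threshold split on one variable, which is the basic building block of a decision tree. The abstract does not say which decision tree algorithm or split criterion was used, so the variance-reduction (sum of squared errors) criterion here is an assumption. The variable names and sample values are made up and are not data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, remaining_sse) for the split on xs that minimizes the SSE of ys."""
    best = None
    distinct = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = sse(left) + sse(right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Made-up illustrative values, not measurements from the study.
wind_speed = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
pm25 = [60.0, 58.0, 62.0, 25.0, 22.0, 28.0]

r = pearson(wind_speed, pm25)
threshold, remaining = best_split(wind_speed, pm25)
```

A full decision tree would apply `best_split` recursively over every candidate variable and stop at some depth or purity limit. A variable that yields a large drop in error at the first split is one the tree would treat as an important factor.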


