  • Thesis

應用Python網路爬蟲技術於政府開放資料平台PM2.5即時動態資料分析

Python Web Crawler Technology Applied to Dynamic Data Analysis of PM2.5 on the Government Open Data Platform

Advisor: 劉振隆
Co-advisor: 洪誌隆 (Chih-Lung Hung)

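Abstract


Fine particulate matter (PM2.5) affects an ever wider range of areas and causes considerable harm to the environment and to public health, including allergies, so air quality has become a widely discussed topic. The government wants to solve the problem of excessive PM2.5 concentrations, and the general public also wants to know whether current air conditions are suitable for outdoor activities. This study uses Python web crawling to obtain the real-time PM2.5 data provided on the government open data platform and stores it in a Mongo database. It also uses Python to back the data up to CSV files, both to guard against data corruption or loss and to offer future researchers a choice of data formats. R is then connected to the Mongo database to run dynamic analysis on the freshly crawled data and to visualize it as boxplots, pie charts, histograms, line charts (broken-line graphs), scatter plots, and maps. These charts make it quicker to grasp the characteristics of the data; the map in particular gives the public the most intuitive view of current PM2.5 values across Taiwan. Once the analysis finishes, the charts are automatically converted to image files and saved in a folder corresponding to the time of the run. The whole system is then automated so that it crawls, stores, analyzes, and visualizes the data every hour. Over time this accumulates a long-term data set that supports statistics and analysis over larger time units. To fill in data that had not been collected earlier, the study additionally imports the 2017 hourly data provided by the Environmental Protection Administration and uses Power BI to analyze the distribution of PM2.5 over the whole of 2017.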
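The crawl-and-backup step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's actual code: the field names (`Site`, `County`, `PM25`, `DataCreationDate`), the sample payload, and the helper names `parse_records` and `write_csv_backup` are all assumptions made for illustration, and the real open data feed may be shaped differently. Fetching the payload over HTTP and inserting the records into the Mongo database are deliberately left out so that the example runs with the standard library alone.

```python
import csv
import io
import json

# Hypothetical sample of what one crawl might return; the real feed's
# field names and formats are assumptions here.
SAMPLE_PAYLOAD = json.dumps([
    {"Site": "A", "County": "X", "PM25": "12", "DataCreationDate": "2018-05-01 10:00"},
    {"Site": "B", "County": "X", "PM25": "", "DataCreationDate": "2018-05-01 10:00"},
    {"Site": "C", "County": "Y", "PM25": "35", "DataCreationDate": "2018-05-01 10:00"},
])

FIELDS = ["Site", "County", "PM25", "DataCreationDate"]


def parse_records(payload):
    """Decode a JSON payload and convert PM2.5 readings to floats.

    Readings that are missing or not numeric become None so that a
    faulty sensor value does not abort the whole hourly run.
    """
    records = []
    for row in json.loads(payload):
        try:
            value = float(row.get("PM25", ""))
        except (TypeError, ValueError):
            value = None
        records.append({
            "Site": row.get("Site"),
            "County": row.get("County"),
            "PM25": value,
            "DataCreationDate": row.get("DataCreationDate"),
        })
    return records


def write_csv_backup(records, handle):
    """Write the parsed records to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)  # None values are written as empty cells


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv_backup(parse_records(SAMPLE_PAYLOAD), buffer)
    print(buffer.getvalue())
```

In a real run the handle would be a file opened with `newline=""`, and the same list of dictionaries could be passed to the database layer before or after the CSV backup is written.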

Parallel Abstract


Air pollution has become progressively worse, and air quality is now a widely discussed topic. Particulate matter 2.5 (aerodynamic diameter ≤ 2.5 μm; PM2.5), one of the components of ambient urban air pollution, is increasingly recognized as a hazard to human health. The government is trying to solve the problem of excessive PM2.5 concentrations, and the public also wants to know the current air conditions in their region. This research used a web crawler written in Python to obtain real-time PM2.5 data from the government open data portal and stored the data in a Mongo database. We also used Python to back up the data in CSV format, both to prevent data loss and to give future researchers a choice of data formats. By connecting R to the Mongo database, we could immediately run a dynamic analysis of the newly obtained data and present it as boxplots, pie charts, histograms, line charts, scatter plots, and maps. The charts help people grasp the key points of the data quickly and clearly; the map is especially useful because it lets the public see at a glance the current PM2.5 concentration in every region of Taiwan. When the analysis is completed, the charts are automatically converted into image files and stored in a folder corresponding to the time of the run. We then automated the system so that it crawls, stores, analyzes, and visualizes the data every hour. After a long period of accumulation, this yields a large data set that supports further statistics and analysis over larger time units. To fill in data that had not been collected earlier, we additionally imported the 2017 hourly data provided by the Environmental Protection Administration and used Power BI to analyze the distribution of PM2.5 over the whole of 2017.
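Two of the steps above, saving each hour's charts in a folder named after the time and summarizing readings for a boxplot, can be sketched as follows. This is an illustration only: the thesis performs its analysis in R, so Python is swapped in here, and the `base/YYYY-MM-DD/HH` folder layout and the helper names `hourly_folder` and `five_number_summary` are assumptions, not the author's actual scheme.

```python
import statistics
from datetime import datetime
from pathlib import Path


def hourly_folder(base, timestamp):
    """Return the folder for one hourly run, e.g. base/2018-05-01/10.

    The date/hour layout is an assumed convention for illustration.
    """
    return Path(base) / timestamp.strftime("%Y-%m-%d") / timestamp.strftime("%H")


def five_number_summary(readings):
    """Return the values a boxplot draws: min, Q1, median, Q3, max.

    Missing readings (None) are dropped first. Quartiles use the
    'inclusive' method, which treats the data as the full population.
    """
    values = sorted(v for v in readings if v is not None)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": values[0], "q1": q1, "median": median, "q3": q3, "max": values[-1]}


if __name__ == "__main__":
    folder = hourly_folder("charts", datetime(2018, 5, 1, 10, 30))
    print(folder.as_posix())
    print(five_number_summary([10, None, 20, 30, 40, 50]))
```

A scheduler such as cron or Windows Task Scheduler could invoke a script like this once an hour, creating the folder with `folder.mkdir(parents=True, exist_ok=True)` before the chart images are written into it.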
