Applying Python Web Crawling to the Analysis of Real-Time PM2.5 Data from the Government Open Data Platform

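As the impact of fine particulate matter (PM2.5) widens, it increasingly harms the environment and public health, including allergies, and air quality has become a widely discussed issue. The government wants to reduce excessive PM2.5 concentrations, and the general public wants to know whether current air conditions are suitable for outdoor activity. This study uses a Python web crawler to retrieve the real-time PM2.5 data published on the government open data platform and stores it in a Mongo database. Python is also used to back the data up as CSV files, both to guard against data corruption or loss and to give future researchers a choice of data formats. R is then connected to the Mongo database to analyze and visualize the newly crawled data immediately, producing box plots, pie charts, histograms, line charts, scatter plots, and maps. These charts make the main features of the data quick to grasp; the map in particular gives the public the most intuitive view of current PM2.5 values across Taiwan. Once the analysis finishes, the charts are automatically saved as image files in a folder corresponding to the time of the run. The whole system is then automated so that it crawls, stores, analyzes, and visualizes every hour. Over time this accumulates a long-term data set that supports statistics and analysis over larger time units. To fill in data from before collection began, the study additionally imports the hourly data for all of 2017 provided by the Environmental Protection Administration and uses Power BI to analyze the distribution of PM2.5 over that year.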
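
The crawl-and-backup step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the field names in `FIELDS`, the helper names, and the assumption that the feed is a JSON array of station readings are all hypothetical, since the abstract does not give the feed's schema or URL. The database write is left out because it needs a running server; with the `pymongo` driver it would typically be a single `insert_many` call on a collection.

```python
import csv
import json
from pathlib import Path
from urllib.request import urlopen

# Hypothetical field names; the real feed's schema is not given in the abstract.
FIELDS = ["SiteName", "County", "PM25", "PublishTime"]


def fetch_records(url: str, timeout: float = 10.0) -> list[dict]:
    """Download a JSON array of station readings from a caller-supplied URL."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def clean_record(raw: dict) -> dict:
    """Keep only the fields of interest; map a non-numeric PM2.5 value to None."""
    row = {name: raw.get(name, "") for name in FIELDS}
    try:
        row["PM25"] = float(row["PM25"])
    except (TypeError, ValueError):
        row["PM25"] = None
    return row


def append_to_csv(records: list[dict], path: Path) -> int:
    """Append cleaned records to a CSV backup, writing the header only once."""
    is_new = not path.exists()
    rows = [clean_record(r) for r in records]
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Appending rather than overwriting means each hourly run extends the same backup file, and cleaning the PM2.5 value at write time keeps placeholder strings from stations that are offline out of later numeric analysis.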

Keywords

Fine particulate matter (PM2.5); web crawling; R language; data visualization

Parallel Abstract

Air pollution has become progressively worse, and air quality is now a widely discussed issue. Particulate Matter 2.5 (aerodynamic diameter ≤ 2.5 μm; PM2.5), one of the components of ambient urban air pollution, is increasingly recognized as a hazard to human health. The government is trying to reduce excessive PM2.5 concentrations, and the public wants to know the current air conditions in their region. This research used a web crawler written in Python to obtain real-time PM2.5 data from the government open data portal and stored the data in a Mongo database. We also used Python to back the data up in CSV format, both to prevent data loss and to offer future researchers a choice of data formats. By connecting R to the Mongo database, we could immediately present a dynamic analysis of the newly obtained data, including box plots, pie charts, histograms, line charts, scatter plots, and maps. The charts help people grasp the key points of the data quickly and clearly; the map is especially useful, letting the public see at a glance the current PM2.5 concentration in every region of Taiwan. When the analysis is complete, the charts are automatically converted to image files and stored in a folder corresponding to the time of the run. We then automated the system so that it crawls, stores, analyzes, and visualizes the data every hour. Over a long period this accumulates a large data set, which supports further statistics and analysis over larger time units. To fill in data that had not been collected earlier, we additionally imported the hourly data for all of 2017 provided by the Environmental Protection Administration and used Power BI to analyze the distribution of PM2.5 over that year.
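
The hourly automation and the time-stamped output folders can be illustrated with a small sketch. It is written under assumptions: the abstract does not state the folder naming scheme or the scheduling mechanism, so the `base/YYYY-MM-DD/HH` layout and both helper names below are hypothetical.

```python
from datetime import datetime, timedelta
from pathlib import Path


def run_folder(base: Path, when: datetime) -> Path:
    """Return the folder for the hourly run containing `when` (base/YYYY-MM-DD/HH)."""
    return base / when.strftime("%Y-%m-%d") / when.strftime("%H")


def seconds_until_next_hour(now: datetime) -> float:
    """Return the number of seconds from `now` until the next full hour."""
    next_hour = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    return (next_hour - now).total_seconds()
```

A long-running script could sleep for `seconds_until_next_hour(datetime.now())`, create the folder with `run_folder(...).mkdir(parents=True, exist_ok=True)`, and then run the crawl, store, analyze, and visualize steps. An operating-system scheduler such as cron would be an equally reasonable way to trigger the hourly run.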

Parallel Keywords

Particulate Matter 2.5 (PM2.5); Python; Web Crawling; R language; Data Visualization; Power BI

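Further Reading

高聖帆 (2017). Integrating ETL and decision trees to analyze open data: a case study of PM2.5 [Master's thesis, National Formosa University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0028-1408201717460400
劉振隆, 張愷珉, 吳芷軒, 于采玉, 李彥君, & 黃鈺珊 (2018). Analysis of real-time PM2.5 monitoring data from the government open data platform. 管理資訊計算 (Management Information Computing), 7(), 1-12. https://doi.org/10.6285/MIC.201808_7(S1).0001
楊永盛 (2018). Investigating the relationship between traffic flow and PM2.5 based on open data and light-scattering sensors [Doctoral dissertation, Chaoyang University of Technology]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0078-2103201910471669
許雅婷 (2017). Chemical composition analysis and fingerprint profile construction of PM2.5 emitted from mobile sources [Master's thesis, Chaoyang University of Technology]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0078-2712201714433152
Jane, C. J. (2021). A Hybrid Pareto Particle Swarm Optimization with Geographic Information System for Water Resources Optimization. International Journal of Uncertainty and Innovation Research, 3(1), 19-32. https://www.airitilibrary.com/Article/Detail?DocID=P20190619001-202104-202103300013-202103300013-19-32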
