  • Conference paper
  • OpenAccess

Development of a Platform for Storing and Processing Real-Time Air Quality and Influenza-Like Illness Data

Abstract


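At the end of 2015, 9 of the 11 air quality monitoring stations in Taichung City exceeded the standard at the same time. In addition, the reported number of influenza-like illness (ILI) cases has been rising gradually. To understand the association between air quality and ILI, this study builds a big data platform that integrates air quality and ILI data. On the implementation side, first, we build a storage cluster (HDFS) and a Spark environment for computation, with ELK Stack as the visualization platform and Ceph Object Storage for data backup. Second, we connect to an Open Data API to import air quality and ILI data into MySQL automatically. Several problems arose during the study. First, the relational database gave poor I/O performance, so this study uses indexes to double read and write performance. Next, in the Sqoop environment, typical usage can only split the source data into multiple files, but splitting into multiple files also increases transfer time. This study therefore combines the "with direction" method with splitting into multiple files, which achieves the same transfer performance while still splitting the data. Finally, this study uses Spark, with Alluxio to accelerate data access: data stored in HDFS is transferred automatically into Alluxio memory, so that Spark reads it from memory more quickly. The air quality and ILI data are then imported into ELK Stack for visual analysis. On this platform we observed that ILI onset tends to lag behind AQI, so we added a lag time when assessing the association and found that the association with AQI is most apparent at a delay of about four weeks. Going further, we used the R language to enter multiple air pollutants into a multiple regression model and tested the significance of each variable for ILI at different lag times. The results show that most pollutants reach a p-value below 0.05 at lags of four to ten weeks, which indicates an association.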

Parallel Abstract


Air quality has become a major concern in Taiwan, where air pollution episodes have recurred in recent years. The government therefore needs systems that can serve as a benchmark for air pollution. In addition, to understand whether air quality is associated with influenza-like illness (ILI), an integrated system that combines air pollution data with ILI data is needed. The purpose of this study is to provide an innovative research environment that focuses on performance and value-added applications. First, we build an HDFS cluster and a Spark environment, with ELK Stack as the visualization platform and Ceph Object Storage as cluster backup storage. Second, we use an Open Data API to import air quality and ILI data into MySQL. Several problems arose during the study. First, the relational database that sits between the front end and the big data back end of this ecosystem is a poor fit, and reading and writing data is slow. We therefore add table indexes to speed up operations. Second, data must be transferred between MySQL and HDFS, and using Sqoop to split the data into multiple files takes considerable time. We therefore use the “with direction” function so that the data can be split into multiple files in the same duration. Finally, the more data operations there are, the slower processing becomes, so we use Alluxio as an in-memory intermediate storage layer. In the end, we imported the AQI and ILI data into ELK Stack, then visualized and analyzed them. The visualization showed that ILI occurrence lagged behind AQI, so we introduced a lag time. The correlation between AQI and ILI was significant when the lag time was set to 4 weeks. We then used the R language to enter air pollutant data into a multiple regression model and tested different AQI lag times against ILI. The results show that most air pollutants reach a p-value below 0.05 at lags of four to ten weeks, which indicates an association.
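
The lag-time screening described above can be illustrated with a minimal Python sketch. This is not the study's code: the study used R and a multiple regression model, whereas this sketch only computes a plain Pearson correlation between weekly AQI and ILI shifted by each candidate lag. The function names (`pearson`, `lagged_correlation`, `best_lag`) are hypothetical, and the data are synthetic, generated with a built-in 4-week delay purely so the example has something to find.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(aqi, ili, lag):
    """Correlate AQI in week t with ILI in week t + lag (lag in weeks)."""
    if len(aqi) != len(ili):
        raise ValueError("series must have the same length")
    if lag < 0 or lag > len(aqi) - 2:
        raise ValueError("lag out of range")
    if lag == 0:
        return pearson(aqi, ili)
    # Drop the last `lag` AQI weeks and the first `lag` ILI weeks.
    return pearson(aqi[:-lag], ili[lag:])


def best_lag(aqi, ili, max_lag):
    """Return (lag with the highest correlation, all scores by lag)."""
    scores = {lag: lagged_correlation(aqi, ili, lag)
              for lag in range(max_lag + 1)}
    return max(scores, key=scores.get), scores


# Synthetic weekly data (an assumption, not the study's data):
# ILI follows AQI with a 4-week delay plus noise.
random.seed(0)
WEEKS = 104
TRUE_LAG = 4
aqi = [random.gauss(60, 15) for _ in range(WEEKS)]
ili = []
for week in range(WEEKS):
    source = aqi[week - TRUE_LAG] if week >= TRUE_LAG else 60.0
    ili.append(100 + 2.0 * source + random.gauss(0, 5))

lag, scores = best_lag(aqi, ili, max_lag=10)
print("lag with strongest correlation:", lag)
for k in sorted(scores):
    print(f"lag {k:2d} weeks: r = {scores[k]:+.3f}")
```

A correlation scan like this only suggests which lags are worth testing. Judging significance per pollutant, as the abstract describes, would still require fitting the regression model at each lag and inspecting the p-values.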
