
基於文件物件模型的網站表格資料擷取與分析之研究-以COVID-19為例

Web Table Extraction and Analysis Based on Document Object Model - A Case of COVID-19 Web Site

Advisor: 陳彥錚

Abstract


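Since the novel coronavirus (COVID-19) began ravaging the world, people have used technology to look for clues in the data. With the growth of the Internet, all kinds of data are available online, but they are usually presented as tables, which makes it hard to see the message the data are meant to convey. Web table extraction and data visualization have therefore become quite interesting research topics. Many data visualization tools are available online, for example Power BI, offered by Microsoft in 2011, Tableau from Tableau Software, and Google Charts from Google, which let people examine data more efficiently.

This study proposes a website table data extraction system based on the Document Object Model. Using web table extraction techniques, it parses the structure of a web page through the Document Object Model (DOM) and imports the extracted data into a database. Finally, through web development and Google Charts, the data are presented as charts, so users need only a web browser to use the system developed in this study. The study takes the COVID-19 data updated daily on the worldometers website as an example: the per-country confirmed-case data stored in tables are extracted, stored, and visualized in order to verify the effectiveness of the approach.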
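The DOM-based extraction step can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes well-formed markup so that Python's standard `xml.dom.minidom` can parse it (real pages usually need an HTML-tolerant parser), and the table id, column names, and figures in `SAMPLE_HTML` are made up for illustration. The helper names `cell_text` and `extract_table` are hypothetical.

```python
from xml.dom import Node, minidom

# Hypothetical, well-formed sample table; names and figures are invented.
SAMPLE_HTML = """
<table id="sample_table">
  <thead><tr><th>Country</th><th>Total Cases</th></tr></thead>
  <tbody>
    <tr><td>Country A</td><td>1,234</td></tr>
    <tr><td>Country B</td><td>567</td></tr>
  </tbody>
</table>
"""


def cell_text(node):
    """Concatenate all text found beneath a DOM element."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            parts.append(cell_text(child))
    return "".join(parts).strip()


def extract_table(markup):
    """Return the first table as a list of dicts keyed by header text."""
    dom = minidom.parseString(markup.strip())
    table = dom.getElementsByTagName("table")[0]
    headers = [cell_text(th) for th in table.getElementsByTagName("th")]
    rows = []
    for tr in table.getElementsByTagName("tr"):
        cells = [cell_text(td) for td in tr.getElementsByTagName("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(dict(zip(headers, cells)))
    return rows


rows = extract_table(SAMPLE_HTML)
```

Walking the tree by tag name (`table`, `tr`, `th`, `td`) is what makes this a DOM-style approach: the code depends on the page's element structure rather than on text patterns.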

Parallel Abstract


As coronavirus disease 2019 (COVID-19) has ravaged the world, people have tried to apply technology to find clues in the data collected from different countries. Most of these data come from websites. Websites contain large amounts of useful data, most of which are displayed in web tables. Because web tables are built from a number of HTML elements, it is not easy to retrieve the data they contain. Web table extraction has therefore become a very interesting research topic. Tools for visualizing the extracted data are also required. Many data visualization tools are currently available, e.g., Power BI from Microsoft, Tableau from Tableau Software, and Google Charts from Google, which allow people to view data more efficiently. This thesis proposes a web table extraction and analysis system based on the Document Object Model (DOM). The proposed system applies web table extraction technology, analyzes the structure of a web page using the DOM, and imports the data into a database for further processing. Finally, for data visualization, we use Google Charts to present the data as charts. Only a web browser is required to use these functions. To evaluate the effectiveness of the proposed system, we apply it to the COVID-19 web page provided by the worldometers website. The reported COVID-19 statistics of each country are retrieved, analyzed, stored, and visualized.
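The storage and charting stages can be sketched in the same hedged way. The abstract does not name the database or its schema, so the in-memory SQLite database, the `cases` table, and the sample rows below are assumptions for illustration only. The final list has a header row followed by data rows, which is the array shape that Google Charts' `google.visualization.arrayToDataTable` accepts once it is serialized to JSON and passed to the browser.

```python
import json
import sqlite3

# Hypothetical rows standing in for data already extracted from a web table.
extracted = [("Country A", "1,234"), ("Country B", "567")]


def to_int(text):
    """Turn a display string such as '1,234' into an integer."""
    return int(text.replace(",", ""))


# Assumed schema: one row per country with its cumulative case count.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cases (country TEXT PRIMARY KEY, total_cases INTEGER)"
)
conn.executemany(
    "INSERT INTO cases (country, total_cases) VALUES (?, ?)",
    [(country, to_int(total)) for country, total in extracted],
)
conn.commit()

# Header row first, then one data row per country, largest count first.
chart_rows = [["Country", "Total Cases"]]
query = "SELECT country, total_cases FROM cases ORDER BY total_cases DESC"
for country, total in conn.execute(query):
    chart_rows.append([country, total])

chart_json = json.dumps(chart_rows)
```

Converting the counts to integers before storing them matters for the chart: a bar or line chart needs numeric values, and strings such as `1,234` would otherwise be treated as labels.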

