Schema Matching for Unsupervised Wrapper Maintenance

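A wrapper is an extraction program that collects specific data from web pages. Users access the data through the wrapper and then pass it through an information integration step to turn it into useful information, on which an integrated web service or data analysis system can be built. However, website developers often modify their sites for various reasons, which breaks the original wrapper so that the same program can no longer extract the information; the developer then has no choice but to rewrite or modify the extraction program. For this reason, unsupervised wrapper induction has been widely discussed in recent years: it exploits the regularity of dynamic web pages to generate an extraction module for a website and uses that module to extract data automatically, so the wrapper does not have to be rewritten every time. The maintenance problem that unsupervised wrapper induction may face is this: when a website is modified over time, the data extracted at time t and at time t' can differ greatly in both schema and instances, and how to integrate these data is the problem this thesis examines in depth. Once the schemas at time t and time t' have been obtained, the two schemas can be matched by exploiting the strong correlation between the structure information the schemas provide and the instance content. This thesis selects matching attributes in two steps, an instance step and a structure step. The instance step comprises data type identification, finding identical records, and using instance similarity to find candidate matching attributes. The structure step proposes several types of structural similarity measures and then combines them to reflect the structural characteristics of the data, from which the matching attributes are selected. Instance similarity captures the characteristics of each attribute, and structural similarity captures the relationships among attributes, so the system needs no training data, matches attributes automatically, and performs satisfactorily across domains. The F-measure for attribute matching reaches 92% in the Book domain, 95% in the Job domain, and 86% in the Hotel domain, and even the CarBuyer domain, the hardest to match, reaches 84%. Overall, structural similarity does help attribute matching.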
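To make the two-step idea concrete, the following is a minimal sketch of how an instance similarity and a structure similarity might be combined to match attributes between the schema at time t and the schema at time t'. It is not the thesis's method: the Jaccard measure, the shared-path-prefix measure, the equal weighting, the threshold, the greedy one-to-one selection, and all names (`match_attributes`, `old_schema`, `A1`, and so on) are assumptions chosen only for illustration. The `f_measure` helper uses the standard definition as the harmonic mean of precision and recall.

```python
from itertools import product


def jaccard(a, b):
    """Instance similarity: overlap of the two attributes' value sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def structure_similarity(path_a, path_b):
    """Structure similarity: shared leading path steps over the longer path."""
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    longest = max(len(path_a), len(path_b))
    return common / longest if longest else 0.0


def match_attributes(old, new, weight=0.5, threshold=0.3):
    """Greedy one-to-one matching on a weighted sum of both similarities.

    `old` and `new` map an attribute name to a (path, values) pair.
    `weight` and `threshold` are illustrative values, not tuned ones.
    """
    scored = []
    for (a, (pa, va)), (b, (pb, vb)) in product(old.items(), new.items()):
        score = weight * jaccard(va, vb) + (1 - weight) * structure_similarity(pa, pb)
        scored.append((score, a, b))
    scored.sort(key=lambda item: item[0], reverse=True)

    matches, used = {}, set()
    for score, a, b in scored:
        if score < threshold:
            break  # remaining pairs are too dissimilar to match
        if a in matches or b in used:
            continue  # keep the mapping one-to-one
        matches[a] = b
        used.add(b)
    return matches


def f_measure(predicted, gold):
    """Harmonic mean of precision and recall over attribute pairs."""
    correct = sum(1 for a, b in predicted.items() if gold.get(a) == b)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical schemas extracted from the same site at time t and time t'.
old_schema = {
    "title": (("html", "body", "div", "span"), ["Dune", "Emma", "Ulysses"]),
    "price": (("html", "body", "div", "b"), ["9.99", "5.50", "12.00"]),
}
new_schema = {
    "A1": (("html", "body", "table", "td"), ["Dune", "Emma", "Hamlet"]),
    "A2": (("html", "body", "table", "i"), ["9.99", "5.50", "7.25"]),
}
gold = {"title": "A1", "price": "A2"}

matches = match_attributes(old_schema, new_schema)
score = f_measure(matches, gold)
```

In this toy input the page layout changed from `div` to `table`, so structure alone cannot separate the candidates, and the overlap in instance values decides the match. A fuller version would first restrict the instance comparison to attributes of the same data type and to records found in both snapshots, as the instance step describes.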

Keywords

Wrapper Maintenance; Information Integration; Schema Matching

Parallel Abstract

A wrapper is a program that extracts specific data from web pages. Researchers can access specific data through a wrapper and use information integration to turn the data into useful information, which can then support an integrated web service or data analysis system. However, site developers often modify a website for various reasons, which breaks the original wrapper so that it can no longer extract data. In this situation, the developer can only rewrite or modify the original wrapper. For this reason, unsupervised wrapper induction has been widely discussed in recent years. It automatically builds an extraction module from the regularity of dynamic web pages and extracts data with that module, so programmers do not need to write a wrapper for each specific website every time. The problem unsupervised wrapper induction may encounter is maintenance. If the website changes over time, we will have two sets of extracted data, one at time t and one at time t’. Our goal is to identify the related information and integrate the two. We use the instance and structure information generated by FiVaTech (the unsupervised wrapper induction tool we used) to match the corresponding attributes.

Parallel Keywords

Wrapper Maintenance; Data Integration; Schema Matching
