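In recent years the rapid growth of the Internet has produced many application service platforms on which users search, create content, socialize, and carry out many other activities. How to efficiently extract the parts of interest from this wealth of web data for further value-added applications has therefore become an important issue. Citation analysis for literature review research is one such application. Researchers have already developed an automated literature analysis system for citation analysis, the Intellectual Structurer System. However, rapid changes in web page content and format often leave its information extraction function, also called the web crawler, unable to handle the modified pages, and the complicated, hard-to-understand extraction logic must be revised again and again before correct information can be extracted. Designing a flexible web information extraction system that adapts to diverse page types and also maintains itself is quite difficult. This research therefore addresses the issues involved, including a mechanism for excluding noisy page data, the redefinition of extraction rules, flexible adjustment of the extraction workflow, and improved extraction performance. It implements a supervised web information extraction system with rule extraction and self-maintenance mechanisms, with the aim of improving the Intellectual Structurer System's extraction of literature information and of verifying the feasibility of the method and its underlying theory.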
Web crawling is an essential component of any automatic information extraction system that must trawl web sites for up-to-date information. Researchers have tried different ways to develop a flexible, adaptable web crawler that parses web pages according to a set of pre-defined web syntax rules, which may be learned and derived from the target web sites. A universal solution is elusive because the markup used by web sites is often loose and syntactically incomplete. This research designed, developed, and validated a supervised adaptable web crawler that derives extraction rules from a web page segment selected by the user. The crawler then uses the derived rules to extract the desired information from the web site. This supervised rule learning and application scenario makes the information extraction component easier to maintain when the syntax of the pages on a target web site changes. A working system for extracting web page syntax rules and crawling, written in Java, was implemented and tested against two popular citation data web sites. The user highlights the portion of a web page of interest, and the system generates XML-based web syntax rules from that selection. The crawler then uses these rules to extract the desired citation information from the target web sites. If the syntax of the pages on a target web site changes, the system detects the change and regenerates most of the correct rules for the crawler to use.
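To make the rule-derivation idea concrete, the following is a minimal sketch and not the system described above. It assumes a simple landmark technique: the rule stores the few characters found immediately to the left and right of the user's selection, and extraction searches for those same landmarks on other pages. The class name `LandmarkRuleSketch`, the `Rule` fields, the `context` parameter, and the XML element names are all hypothetical; the actual rule format used by the system is not given in this abstract.

```java
import java.util.Optional;

/** Hypothetical sketch: derive a landmark rule from a user selection and apply it. */
class LandmarkRuleSketch {

    /** A rule is the text found immediately before and after the wanted field. */
    static final class Rule {
        final String field, left, right;

        Rule(String field, String left, String right) {
            this.field = field;
            this.left = left;
            this.right = right;
        }

        /** Serialize to XML; the element and attribute names are illustrative only. */
        String toXml() {
            return "<rule field=\"" + escape(field) + "\"><left>" + escape(left)
                    + "</left><right>" + escape(right) + "</right></rule>";
        }
    }

    /** Escape the characters that are not allowed verbatim in XML text or attributes. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Derive a rule from the first occurrence of the selection, keeping `context` chars per side. */
    static Rule derive(String field, String page, String selected, int context) {
        int start = page.indexOf(selected);
        if (start < 0) {
            throw new IllegalArgumentException("selection not found in page");
        }
        int end = start + selected.length();
        String left = page.substring(Math.max(0, start - context), start);
        String right = page.substring(end, Math.min(page.length(), end + context));
        return new Rule(field, left, right);
    }

    /** Apply a rule; an empty result suggests the page syntax has changed. */
    static Optional<String> apply(Rule rule, String page) {
        int l = page.indexOf(rule.left);
        if (l < 0) {
            return Optional.empty();
        }
        int start = l + rule.left.length();
        // An empty right landmark means the field runs to the end of the page.
        int r = rule.right.isEmpty() ? page.length() : page.indexOf(rule.right, start);
        if (r < 0) {
            return Optional.empty();
        }
        return Optional.of(page.substring(start, r));
    }

    public static void main(String[] args) {
        // Made-up sample pages, not taken from any real citation site.
        String sample = "<div><span class=\"title\">Wrapper Induction</span><i>2009</i></div>";
        Rule rule = derive("title", sample, "Wrapper Induction", 7);
        System.out.println(rule.toXml());

        String other = "<div><span class=\"title\">Citation Mining</span><i>2010</i></div>";
        System.out.println(apply(rule, other).orElse("(no match)"));

        String changed = "<li><b id=\"t\">Citation Mining</b></li>";
        System.out.println(apply(rule, changed).isPresent()); // false: the rule needs regenerating
    }
}
```

In this sketch a failed match is the only change-detection signal, and regenerating a rule would mean calling `derive` again on a fresh selection. Choosing `context` is a trade-off: landmarks that are too long capture page-specific text such as the year, while landmarks that are too short may match the wrong element.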