A Classifier for Identifying Domain-Specific Deep Web Query Interfaces

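Studies estimate that the deep web is roughly 400 to 550 times the size of the surface web. To retrieve the content of deep web databases, one must first locate their entry points, namely deep web query forms. Because deep web content usually belongs to a specific domain, and in order to pick out the deep web query forms among the many web forms in that domain, this study proposes a two-phase analysis method that combines pre-query form analysis with post-query form analysis to build an automated technique for identifying deep web query interfaces. Unlike prior work, our approach not only identifies query forms but also filters out non-deep-web query forms such as search engines and site search, which only index static web pages. In a preparation stage, we build a set of feature words for the fields of non-query forms and crawl a large number of domain-specific query forms to extract the field semantics common in that domain. In the pre-query analysis phase, the classification system first uses the non-query feature words to filter out common non-query forms, which reduces the time cost of submitting queries. In the post-query analysis phase, it uses the common field semantics to fill in form values automatically, submits the query, and analyzes the returned results to decide whether the form is a deep web query interface for the target domain. Experimental results show that the proposed method achieves high precision: it filters out non-deep-web query forms such as search engines, and it also automatically detects and filters out query forms whose links are broken.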

Keywords

Deep Web; Query Interface; Search Engine

Parallel Abstract

According to previous research, the deep web holds about 400 to 550 times as much data as the surface web. To retrieve deep web content residing in databases, we need to find the entrances to those databases, which are the deep web query interfaces. Moreover, since deep web content is usually domain-specific, we propose a two-phase analysis methodology that combines pre-query and post-query analyses to identify deep web query interfaces among the many web forms of a domain, and we develop an automatic deep web query interface classification technique. Our approach not only identifies deep web query forms but also filters out search engine forms and site search forms, which only index static web pages. Before classification, we build feature words for non-query forms and crawl a large number of domain-specific query forms to extract the semantics of the common fields of that domain. In the pre-query analysis phase, the classification system uses the feature words to filter out non-query forms, which reduces the processing time of the next phase. In the post-query analysis phase, it uses the field semantics to fill in values and submit forms automatically, and then classifies each form according to its query results. The experimental results show that our two-phase analysis methodology achieves high precision. It filters out not only search engine forms and site search forms, but also deep web query forms whose databases are no longer reachable.

Parallel Keywords

Deep Web; Query Interface; Search Engine
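
The pre-query analysis phase described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature-word list, the substring matching rule, and the names FormFieldCollector, is_candidate_query_form, and prefilter_forms are all assumptions made for illustration. The sketch only shows the idea of rejecting obvious non-query forms (login, registration, subscription) before any query is submitted.

```python
from html.parser import HTMLParser

# Hypothetical feature words for non-query forms. The paper builds its own
# list; these entries are assumptions for illustration only.
NON_QUERY_FEATURE_WORDS = {
    "login", "password", "passwd", "signup",
    "register", "subscribe", "email", "newsletter",
}


class FormFieldCollector(HTMLParser):
    """Collects, per <form>, a text descriptor for each of its fields."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one list of field descriptors per form
        self._current = None   # descriptors of the form being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
        elif tag in ("input", "select", "textarea") and self._current is not None:
            # Join name, id and type into one lowercase string to match against.
            self._current.append(
                " ".join((attrs.get(k) or "").lower() for k in ("name", "id", "type"))
            )

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def is_candidate_query_form(fields):
    """Reject forms with no fields or with any non-query feature word."""
    if not fields:
        return False
    return not any(
        word in field for field in fields for word in NON_QUERY_FEATURE_WORDS
    )


def prefilter_forms(html):
    """Return one boolean per form: True if it is worth a probe query."""
    collector = FormFieldCollector()
    collector.feed(html)
    return [is_candidate_query_form(fields) for fields in collector.forms]
```

Forms that pass this filter would then go to the post-query phase, where values are filled in and the returned results are analyzed. Plain substring matching is deliberately crude and can reject legitimate query forms (for example, a form that searches by email address), so a real feature-word list would need tuning per domain.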

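Cited By

蕭子竣 (2014). 基於漸進式匹配與合併之深網查詢介面整合-以書籍領域為例 [Deep web query interface integration based on progressive matching and merging: the book domain as an example] (Master's thesis, Tamkang University). Airiti Library. https://doi.org/10.6846/TKU.2014.00946

鄭又誠 (2012). 深層網路查詢介面之綱要擷取研究 [A study of schema extraction for deep web query interfaces] (Master's thesis, Tamkang University). Airiti Library. https://doi.org/10.6846/TKU.2012.00427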
