An Information Retrieval System That Uses a Topic Web Crawler to Improve Cross-Language Information Retrieval Performance

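This paper examines how to build a topic web crawler with high retrieval efficiency and how to use it to improve the performance of cross-language information retrieval (CLIR). A topic web crawler fetches mainly pages related to a given topic, so we combine a typical web crawler with a classifier that judges whether a page is relevant or irrelevant. The crawler starts from relevant seed pages, the classifier decides whether each fetched page is on topic, and the URLs found in relevant pages are then used to fetch further pages. In this paper we use the topic web crawler as one source of query expansion and integrate it into a CLIR system. In our experiments we also compare it with a previously proposed approach that draws query expansion candidate terms from Wikipedia. Finally, we combine the topic web crawler, Wikipedia, and the Okapi BM25 algorithm for query expansion to improve the CLIR system. We evaluate the system mainly on the NTCIR-8 IR4QA document collection. The experimental results show that query expansion combining different resources outperforms query expansion from a single resource and effectively improves CLIR performance.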
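
The crawl loop described above (seed pages, a relevance classifier, and link following restricted to relevant pages) can be sketched as follows. This is a minimal illustration under stated assumptions, not the system evaluated in the paper: the names `focused_crawl`, `TOY_WEB`, and `keyword_classifier` are hypothetical, the "web" is an in-memory dictionary so the example runs offline, and a single keyword test stands in for the trained relevance classifier.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# A page is (text, outgoing links). All names here are hypothetical.
Page = Tuple[str, List[str]]


def focused_crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], Optional[Page]],
    is_relevant: Callable[[str], bool],
    max_fetches: int = 100,
) -> List[str]:
    """Return URLs of relevant pages reachable from the seeds.

    Links are followed only from pages the classifier accepts,
    which is what keeps the crawl on topic.
    """
    frontier = deque(seeds)
    seen = set(frontier)
    relevant: List[str] = []
    fetches = 0
    while frontier and fetches < max_fetches:
        url = frontier.popleft()
        page = fetch(url)
        fetches += 1
        if page is None:  # page could not be retrieved
            continue
        text, links = page
        if not is_relevant(text):
            continue  # do not expand links of off-topic pages
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return relevant


# Toy in-memory "web" (an assumption made so the sketch needs no network).
TOY_WEB: Dict[str, Page] = {
    "a": ("information retrieval with query expansion", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("cross-language retrieval experiments", ["a", "e"]),
    "d": ("retrieval of lost luggage", []),
    "e": ("gardening tips", []),
}


def keyword_classifier(text: str) -> bool:
    """Stand-in for a trained relevance classifier."""
    return "retrieval" in text


print(focused_crawl(["a"], TOY_WEB.get, keyword_classifier))
```

In this toy graph page `d` is never fetched, because it is linked only from the off-topic page `b`. A real crawler would also need URL normalization, politeness delays, and robots.txt handling, none of which are shown here.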

Keywords

Topic web crawler; NTCIR; cross-language information retrieval; query expansion; Wikipedia

Parallel Abstract

This paper describes how to build an efficient topic web crawler and use it to improve the performance of cross-language information retrieval (CLIR). A topic web crawler extracts web pages related to a given topic. It is built by combining a standard crawler with a relevance classifier. Given some seed URLs, the crawler fetches pages from the World Wide Web, and the relevance classifier judges which pages are relevant. The URLs in the relevant pages are treated as seeds for further page retrieval. In this paper, we adopt the topic web crawler as a source of query expansion for CLIR: the crawler extracts candidate query terms from web pages. We conduct experiments to compare this method with previous work that extracts candidate query terms from Wikipedia to assist CLIR. We also combine these resources for query expansion, i.e., the topic web crawler, Wikipedia, and the Okapi BM25 algorithm, to improve the performance of our information retrieval system. We evaluate our CLIR system on the NTCIR-8 IR4QA data set. The experimental results show that query expansion from combined resources performs better than query expansion from a single resource.
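
The abstract names Okapi BM25 without saying how it enters the query expansion step. As background only, the standard BM25 ranking function can be sketched as below. The function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the toy documents are all assumptions for illustration and are not taken from the paper.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(
    query: Sequence[str],
    docs: Sequence[Sequence[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> List[float]:
    """Score each tokenized document against the query with Okapi BM25.

    Assumes `docs` is non-empty and contains at least one token overall.
    """
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores: List[float] = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (the "+ 1" keeps the value positive).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1.0 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1.0) / denom
        scores.append(score)
    return scores


# Toy tokenized documents (hypothetical data).
docs = [
    ["query", "expansion", "wikipedia"],
    ["web", "crawler"],
    ["query", "query", "expansion"],
]
scores = bm25_scores(["query", "expansion"], docs)
print(scores)
```

Documents containing none of the query terms score zero, and repeated occurrences of a term raise the score with diminishing returns, controlled by `k1`.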

Parallel Keywords

Topic web crawler; NTCIR; Wikipedia; Query Expansion; Cross-Language Information Retrieval

