A Study on Offline Wikipedia Search to Reduce NGD Computation Time

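With the rapid growth of the Internet, the volume of web content keeps increasing, and users can easily obtain large amounts of information from search engines and portals such as Google and Yahoo! Kimo. However, Jansen et al. report that users typically enter only 2.35 keywords per query, and that most of these keywords are vague or incomplete, so too many documents are returned and information overload results. Earlier studies commonly used information classification or filtering to reduce users' information-access cost, but these methods work well only when a large amount of training data is available. More recent work proposed NGD (Normalized Google Distance), which takes the number of results that the Google search engine returns for given keywords and computes an abstract distance between two terms, from which one can judge whether the documents containing those terms are similar. Because NGD depends on Google's online search service, however, frequent queries lead Google to refuse further requests. Unlike previous work, this study therefore builds an offline search engine from Wikipedia, relying on Wikipedia's structured concepts and relatively clean content to overcome the difficulties encountered with the Google search engine. Experiments show that, with the offline Wikipedia search engine, the proposed method still maintains stable filtering performance while saving users a substantial amount of time.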

Keywords

NGD; Wikipedia; Google

Parallel Abstract

As the Internet has developed rapidly, the number of informational websites has grown steadily, and users can easily obtain a great deal of information from search engines and portals such as Google and Yahoo!. However, Jansen et al. pointed out that most users normally enter only 2.35 keywords, and that these keywords are mostly unclear or incomplete; as a result, too many pages are returned, which leads to information overload. Past research often relied on information classification or filtering to reduce the cost of accessing information, but these methods perform well only when a large amount of training data is available. Recent studies have proposed NGD, which queries Google's search engine with keywords, takes the number of results returned, and computes an abstract distance between two words in order to decide whether the documents containing them are similar. However, NGD relies on Google's online search function, and under high-frequency querying Google refuses to serve further searches. To solve this problem, this study proposes building an offline search engine from Wikipedia, because Wikipedia offers structured concepts and relatively clean content. Experimental results show that, when the offline Wikipedia database is used, the proposed method still gives users stable filtering performance and saves them a considerable amount of time.
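The distance computation that the abstract describes can be illustrated with counts taken from a local index instead of Google result counts. The abstract does not state the formula; the sketch below assumes the standard NGD definition, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(x) is the number of documents containing x, f(x, y) the number containing both terms, and N the collection size. The function names `build_index` and `ngd`, the whitespace tokenizer, and the four-sentence toy corpus are hypothetical stand-ins and are not taken from the thesis, which would index Wikipedia articles.

```python
import math
from typing import Dict, Iterable, Set


def build_index(docs: Iterable[str]) -> Dict[str, Set[int]]:
    """Map each lowercase token to the ids of the documents that contain it."""
    index: Dict[str, Set[int]] = {}
    for doc_id, text in enumerate(docs):
        for token in set(text.lower().split()):
            index.setdefault(token, set()).add(doc_id)
    return index


def ngd(x: str, y: str, index: Dict[str, Set[int]], n_docs: int) -> float:
    """Standard NGD formula, fed with local document counts (an assumption)."""
    docs_x = index.get(x.lower(), set())
    docs_y = index.get(y.lower(), set())
    fx, fy = len(docs_x), len(docs_y)
    fxy = len(docs_x & docs_y)  # documents containing both terms
    if fx == 0 or fy == 0 or fxy == 0:
        return math.inf  # the terms never co-occur in this collection
    log_fx, log_fy = math.log(fx), math.log(fy)
    denom = math.log(n_docs) - min(log_fx, log_fy)
    if denom == 0:
        return 0.0  # both terms occur in every document
    return (max(log_fx, log_fy) - math.log(fxy)) / denom


# Toy stand-in for an offline article collection (hypothetical data).
docs = [
    "wikipedia is a free online encyclopedia",
    "google is a web search engine",
    "a search engine indexes web pages",
    "an encyclopedia collects articles",
]
index = build_index(docs)

print(ngd("search", "engine", index, len(docs)))        # always co-occur
print(ngd("google", "search", index, len(docs)))        # partly co-occur
print(ngd("google", "encyclopedia", index, len(docs)))  # never co-occur
```

Because every count comes from an in-memory index, repeated distance queries need no network requests, which is the property the offline approach relies on; smaller values indicate terms that tend to appear in the same documents.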

Parallel Keywords

NGD; Wikipedia; Google

References

[1] Jansen, B. J., Spink, A., Bateman, J., and Saracevic, T., “Real Life Information Retrieval: A Study of User Queries on the Web,” ACM SIGIR Forum, vol. 32, pp. 5–17, 1998.

[2] Montebello, M., “Information overload-an IR problem?,” String Processing and Information Retrieval: A South American Symposium, September 1998.

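[4] 李浩平, “Using NGD to Build a Document Filtering System Suited to Insufficient User Feedback” (運用NGD建立適用於使者回饋資訊不足之文件過濾系統), master's thesis, National Central University, 2011.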

[6] Pazzani, M., and Billsus, D., “Content-Based Recommendation Systems,” The Adaptive Web, vol. 4321, pp. 325–341, 2007.

[7] Basilico, J., and Hofmann, T., “Unifying Collaborative and Content-Based Filtering,” Proceedings of the Twenty-First International Conference on Machine Learning, Banff, Alberta, Canada, 2004.

Cited By

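吳登翔 (2014). Concept Drift Prediction Based on User Models (使用者模型為基礎的概念飄移預測) [Master's thesis, National Central University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0031-0412201511590773

李佩儒 (2014). Applying a Self-Built Ontological User Profile to Text Document Recommendation (利用自建Ontological User Profile應用於文字文件推薦) [Master's thesis, National Central University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0031-0412201511590667

江欣鴻 (2015). User Interest Detection and Document Recommendation with a Self-Built Ontology (以自建本體進行使用者興趣偵測與文件推薦) [Master's thesis, National Central University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0031-0412201512073363

陳靜儀 (2015). Identifying Scientific Names and Aliases with Normalized Google Distance: The Case of Chemical Substances (以Normalized Google Distance辨識學名與別名-以化學物質為例) [Master's thesis, National Central University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0031-0412201512070659

蘇鼎文 (2015). A Study of User Interest Models That Apply Multiple Memory Systems to the Forgetting Factor (探討多重記憶系統應用於遺忘因子的使用者興趣模型) [Master's thesis, National Central University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0031-0412201512073572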
