A Distributed News Query Model Combining Dynamic Clustering and Term-Type Weighting

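Traditional single-machine processing is limited by the computing capacity of the hardware. Faced with very large volumes of data, it can only support keyword retrieval or retrieval over data that has been classified in advance. To build a conceptual retrieval environment that can handle large volumes of data, this paper develops an environment for document storage and conceptual retrieval under a grid computing architecture, improving the effectiveness of the returned results. For the returned results, this study applies term-type weighting (Extended Significance Vector Model; ESVM) with X-means to perform dynamic clustering, improving the readability of what the system returns. When the data is centralized, the matrix formed by the large number of documents and the terms occurring in them becomes too large for the memory of most computers. By partitioning the data sources across a distributed environment, the dimensions of the term matrix are greatly reduced, which makes conceptual retrieval feasible in a distributed setting. The experiments compare different retrieval types, different numbers of clusters, and centralized versus distributed modes. They record the time from the start of retrieval to the completion of clustering and analyze the time spent on text preprocessing and on clustering. The results show that the number of clusters has little effect on text preprocessing time but a substantial effect on clustering time: with fewer clusters, the time drops significantly, but the similarity among documents within a cluster also decreases slightly. Differences in the size of the retrieval type have little effect on time or on intra-cluster similarity. Distributed clustering effectively reduces processing time, but intra-cluster document similarity decreases further.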
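The effect of partitioning on matrix size can be illustrated with a minimal sketch. It is not the implementation used in this study: the toy corpus, the feed names, and the helper functions `build_term_document_matrix` and `matrix_shape` are all hypothetical, and raw term counts stand in for the ESVM weights. It only shows that each node's term-document matrix is smaller than the centralized one when the corpus is split by data source.

```python
from collections import Counter

# Hypothetical toy corpus, grouped by data source (one source per grid node).
SOURCES = {
    "finance_feed": ["stock market rises", "market falls on rate fears"],
    "sports_feed": ["team wins final match", "coach praises team"],
}


def build_term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    vocabulary = sorted({term for doc in documents for term in doc.split()})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for term, count in Counter(doc.split()).items():
            row[column[term]] = count
        matrix.append(row)
    return vocabulary, matrix


def matrix_shape(documents):
    """Return (number of documents, number of distinct terms)."""
    vocabulary, matrix = build_term_document_matrix(documents)
    return (len(matrix), len(vocabulary))


# Centralized mode: one matrix over every document from every source.
all_documents = [doc for docs in SOURCES.values() for doc in docs]
centralized_shape = matrix_shape(all_documents)

# Distributed mode: each node builds a matrix over its own source only.
distributed_shapes = {name: matrix_shape(docs) for name, docs in SOURCES.items()}

if __name__ == "__main__":
    print("centralized:", centralized_shape)
    for name, shape in distributed_shapes.items():
        print(name + ":", shape)
```

In this toy case the centralized matrix has 4 rows and 13 columns, while each node holds 2 rows and at most 7 columns. The saving depends on how little vocabulary the sources share, which is an assumption of the example rather than a result reported here.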

Keywords

Distributed Computing; Grid; Dynamic Clustering

Parallel Abstract

The data storage capacity required is increasing rapidly as more information becomes available in electronic form. A traditional data retrieval system uses a word-form comparison approach to obtain search results. Although this approach can handle huge amounts of information, it still suffers from the semantic problem. A conceptual retrieval system, on the other hand, can improve on this by using the vector space model (VSM), but it is limited by the curse of dimensionality. To handle large amounts of data, we first develop a conceptual retrieval environment on a grid computing structure to improve the effectiveness of a search system based on the vector space model. Next, we cluster the results with the X-means dynamic clustering model. The document space vector model and the extended significance vector model (ESVM) are used to improve the readability of the system's search results. In this research, we evaluate our models across different retrieved types, different numbers of clusters, and centralized versus decentralized modes, measuring time and clustering similarity. According to our experiments, the number of clusters has little impact on text preprocessing time but a great impact on clustering time. In other words, the time is significantly reduced when the number of clusters decreases. In this case, however, the similarity between documents in the same cluster is slightly reduced. The size of the retrieved type has only a small effect on clustering time and similarity. Distributed clustering can greatly enhance processing efficiency.

Parallel Keywords

Dynamic Clustering; Grid; Distributed Computing; ESVM

