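Traditional single-machine processing is limited by the computing capacity of the hardware: when faced with large volumes of data, it can support only keyword retrieval or retrieval over pre-classified data. To build a conceptual retrieval environment that can handle large volumes of data, this study develops an environment for document storage and conceptual retrieval under a grid computing architecture, improving the effectiveness of the returned results. In addition, the returned results are dynamically clustered with X-means using term-type weights (Extended Significance Vector Model; ESVM), which improves the readability of what the system returns. When the data are centralized, the matrix formed by the large number of documents and the terms occurring in them is too large for the memory of most computers. Partitioning the data sources across a distributed environment greatly reduces the dimensionality of the term matrix, which makes conceptual retrieval feasible in a distributed setting. The experiments compare different types, different numbers of clusters, and centralized versus distributed modes. They record the time from the start of retrieval to the completion of clustering and analyze the time spent on text preprocessing and on clustering. The results show that the number of clusters has little effect on text preprocessing time but a considerable effect on clustering time: with fewer clusters the time drops significantly, but the similarity of documents within each cluster also decreases slightly. Differences in type size have little effect on time or on intra-cluster similarity. Distributed clustering effectively reduces processing time, but intra-cluster document similarity decreases further as well.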
The data storage capacity required is increasing rapidly as more information becomes available in electronic form. A traditional data retrieval system obtains search results by comparing word forms. Although this approach can handle huge amounts of information, it still suffers from the semantic problem. A conceptual retrieval system, on the other hand, can improve on this by using the vector space model ( VSM ), but it is limited by the curse of dimensionality. To handle large amounts of data, we first develop a conceptual retrieval environment on a grid computing structure to improve the effectiveness of a search system based on the vector space model. Next, we cluster the results with the X-means dynamical clustering model. The document space vector model and the extended significance vector model ( ESVM ) are used to improve the readability of the system's search results. In this research, we evaluate our models across different retrieved types, numbers of clusters, and centralized versus decentralized processing, measuring time and clustering similarity. Our experiments show that the number of clusters has little impact on text-processing time but a great impact on clustering time. In other words, the time required is significantly reduced when the number of clusters decreases. However, in this case, the similarity among documents in the same cluster is slightly reduced. The size of the retrieved type has only a small effect on clustering time and similarity. Distributed clustering can greatly enhance processing efficiency.
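The claim that partitioning the data lowers the dimensionality of the term matrix can be illustrated with a minimal sketch. It is not the system described here: the whitespace tokenizer, the raw term-frequency weighting, the round-robin `partition` helper and the four toy documents are all assumptions made for illustration, and X-means and ESVM weighting are not implemented.

```python
from collections import Counter
import math


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def build_term_matrix(docs):
    """Return (vocabulary, rows); each row is a raw term-frequency vector."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {term: i for i, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for term, count in Counter(tokenize(doc)).items():
            row[index[term]] = count
        rows.append(row)
    return vocab, rows


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def partition(docs, n_nodes):
    # Hypothetical round-robin split of documents across n_nodes workers.
    return [docs[i::n_nodes] for i in range(n_nodes)]


# Toy corpus (invented for this example).
docs = [
    "grid computing distributes storage",
    "grid nodes share computing load",
    "clustering groups similar documents",
    "documents cluster by similar terms",
]

# Centralized mode: one matrix over the whole vocabulary.
central_vocab, central_rows = build_term_matrix(docs)

# Distributed mode: each node builds a matrix over its own shard only.
shard_dims = [len(build_term_matrix(shard)[0]) for shard in partition(docs, 2)]

print("centralized columns:", len(central_vocab))
print("columns per shard:", shard_dims)
print("similarity(doc 0, doc 1):", round(cosine(central_rows[0], central_rows[1]), 3))
```

Each shard sees only the terms in its own documents, so its matrix has fewer columns than the centralized one. The trade-off is that similarities are computed in separate, smaller term spaces, which is consistent with the lower intra-cluster similarity reported for the distributed mode.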