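This study explores how Latent Semantic Analysis (LSA) can be used to find terms related to a Google search query, to cluster those terms according to the concept dimensions that the analysis produces, and then to cluster the search results according to the related terms each web page contains. We crawl the top 20 to 50 Google search results and apply LSA to them to find terms related to the query keyword. A concept dimension filtering mechanism then groups the related terms into clusters, and the more representative terms are computed and used as cluster titles so that users can readily understand what each cluster contains. The proposed method uses a concept dimension threshold to determine a suitable number of clusters for each query's search results, so the number need not be fixed in advance. Each web page in the search results is assigned either to one primary document cluster or to multiple document clusters, according to the concept dimension scores of the related terms it contains across the term clusters. Finally, we use the Silhouette Coefficient to evaluate the performance of the proposed term clustering and document clustering methods and compare them with other clustering systems.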
This study proposes a Latent Semantic Analysis (LSA) based method that finds terms semantically related to a given query in Google search results and groups those terms into clusters. Each search result is then clustered according to the terms it contains. The top 20 to 50 search results for each query are crawled for LSA. A heuristic method clusters the semantically related terms based on their concept dimension significance after LSA. For each cluster, the terms with high representative values are chosen as title words. The proposed method determines the best-fit number of clusters for each query, removing the burden of defining the number of clusters in advance. A web page containing multiple terms is either assigned to a primary cluster or allocated to multiple clusters, based on either coverage or concept dimension significance. Finally, clustering quality is evaluated using the silhouette coefficient on experimental results for a mixed set of popular and industrial keywords, and is compared with that of Carrot2, a popular clustering engine.
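The silhouette coefficient used for evaluation can be illustrated with a minimal sketch. For an item i, let a(i) be its mean distance to the other members of its own cluster and b(i) its mean distance to the members of the nearest other cluster; the score is s(i) = (b(i) - a(i)) / max(a(i), b(i)), and the overall coefficient is the mean of s(i). The code below is an assumption-laden illustration, not the study's implementation: the Euclidean distance, the function names, and the toy 2-D points (standing in for concept dimension scores) are all hypothetical, since the abstract does not specify them.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_coefficient(points, labels, dist=euclidean):
    """Mean silhouette score over all points for a given cluster labeling."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Common convention: a singleton cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated groups of 2-D vectors.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1]   # mixes the two groups

print(silhouette_coefficient(toy_points, good_labels))
print(silhouette_coefficient(toy_points, bad_labels))
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest that items sit closer to a neighboring cluster than to their own. In the toy data, the labeling that matches the visible grouping scores much higher than the one that mixes the groups.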