本研究提出了一個以潛在語意分析(Latent Semantic Analysis)為基礎的方法來推估Google搜尋引擎的排名。我們對關鍵字查詢結果的網頁進行潛在語意分析,來評估語意相關詞會對排名造成的影響。我們對搜尋結果網頁進行啟發式n-gram斷詞以擷取出n-grams,並建立詞文矩陣(term-document matrix),來找出文章與詞之間隱含的語意關係。我們使用聚合式分群技術建立概念群組並使用泡泡圖(bubble graph)來呈現。我們由文章與查詢虛擬詞文章的文章-文章相關矩陣來評估文章與查詢詞的相關度。實驗結果顯示使用啟發式n-gram斷詞系統來推估排名,效果比僅使用uni-gram更為出色,而且R-Precision平均值可以達到70%。
This study proposes a method based on Latent Semantic Analysis (LSA) for analyzing Google’s ranking. We applied LSA to Google’s search results for a given set of queries to evaluate whether semantically related terms contribute to ranking. We implemented a heuristic n-gram extraction tool to extract n-gram terms from search engine results pages, and constructed a term-document matrix to explore the latent relationships between terms and documents. We used agglomerative clustering to build concept groups and visualized them with a bubble graph. To assess how relevant each document is to the query, we treated the query as a pseudo-document and computed a document-document correlation matrix that includes it. Experimental results show that the method performs better with heuristic n-gram extraction than with unigrams alone, achieving an average R-Precision of up to 70%.
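The pipeline summarized above (term-document matrix, LSA, query pseudo-document, R-Precision) can be illustrated with a minimal sketch. This is not the implementation used in the study: the function names `lsa_query_scores` and `r_precision`, the rank `k`, the toy matrix, and the use of cosine similarity as the document-document measure are all assumptions made for illustration.

```python
import numpy as np

def lsa_query_scores(term_doc, query_vec, k=2):
    """Score documents against a query pseudo-document in a rank-k LSA space.

    term_doc:  (n_terms, n_docs) array of term weights (hypothetical input).
    query_vec: (n_terms,) array describing the query with the same terms.
    """
    # Append the query as one extra column, i.e. a pseudo-document.
    A = np.column_stack([term_doc, query_vec]).astype(float)
    # Singular value decomposition; keep only the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    # Each row is one document (the last row is the query) in concept space.
    coords = (Vt[:k, :] * s[:k, None]).T
    # Cosine similarity between all columns gives a document-document matrix
    # (assumed here; the source only says "correlation").
    norms = np.linalg.norm(coords, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = coords / norms
    sim = unit @ unit.T
    # The last row compares the query with every document; drop the
    # query's similarity to itself.
    return sim[-1, :-1]

def r_precision(ranked_ids, relevant_ids):
    """Fraction of the top-R ranked documents that are relevant, R = |relevant|."""
    R = len(relevant_ids)
    if R == 0:
        return 0.0
    return sum(1 for d in ranked_ids[:R] if d in relevant_ids) / R

if __name__ == "__main__":
    # Toy data: documents 0 and 1 share terms 0-1, documents 2 and 3 share terms 2-3.
    term_doc = np.array([[2, 1, 0, 0],
                         [1, 2, 0, 0],
                         [0, 0, 2, 1],
                         [0, 0, 1, 2]])
    query = np.array([1, 0, 0, 0])  # the query mentions term 0 only
    scores = lsa_query_scores(term_doc, query, k=2)
    ranking = np.argsort(-scores).tolist()
    print(scores, ranking, r_precision(ranking, {0, 1}))
```

In this sketch the estimated ranking is simply the documents sorted by their similarity to the query column, and R-Precision compares the top R of that ranking against a reference set of R documents.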