Prior research has shown that clustering co-citation links with the Expectation Maximization (EM) algorithm yields good results, but link-derived clusters must be named manually by the researcher, raising concerns about both accuracy and efficiency. Latent Dirichlet Allocation (LDA), a topic model, instead clusters by analyzing the full text of each document through steps such as tokenization and lemmatization; its clusters can therefore be labeled with the words most strongly associated with each topic, eliminating the manual naming step required by link-derived clustering. We argue that if link-derived clusters can be shown to agree closely with topic-model clusters, then the topic-related words uncovered by the topic model can serve as cluster labels, improving the efficiency with which researchers screen the literature. We collected two document sets on different topics from the Microsoft Academic Search database and clustered each with both a link-based model (here, a co-citation matrix) and a topic model (here, an LDA topic probability matrix), yielding four sets of clusters. Cluster agreement across the four results was assessed with indices of cluster congruity (the Rand index, Kappa, and Gwet's statistic), and content consistency was verified with the Jensen-Shannon divergence (JSD), which also served as an objective criterion of clustering quality. We find that the topic-model clusters agree significantly with the link-derived clusters, and that the topic-model clusters have a lower mean JSD, i.e., better content consistency. We therefore tentatively conclude that topic-model clustering outperforms link-derived clustering.
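The content-consistency check described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-document word distributions are made-up toy data, and the mean-pairwise aggregation is an assumption about how a cluster-level JSD score might be computed.

```python
# Hypothetical sketch: Jensen-Shannon divergence (JSD) between two word
# probability distributions, averaged over all document pairs in a cluster
# as a proxy for content consistency. All distributions are toy data.
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; 0*log(0) := 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded in [0, 1] (base 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_pairwise_jsd(docs):
    """Average JSD over all unordered document pairs in a cluster."""
    pairs = [(i, j) for i in range(len(docs)) for j in range(i + 1, len(docs))]
    return sum(jsd(docs[i], docs[j]) for i, j in pairs) / len(pairs)

docs = [
    [0.5, 0.3, 0.2],  # per-document word distributions (toy)
    [0.4, 0.4, 0.2],
    [0.1, 0.2, 0.7],
]
print(round(mean_pairwise_jsd(docs), 3))
```

A lower mean within-cluster JSD indicates that documents in the cluster use more similar vocabulary, which is the sense in which a lower JSD average signals better content consistency here.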
Latent Dirichlet Allocation (LDA) is widely used to elicit latent topics from documents. Each latent topic derived by LDA comes with associated keywords and a topic probability distribution, and a document can be characterized by several of its associated latent topics. Specifically, we can use the array of latent-topic probabilities as the feature vector of a document. In this study we empirically explore the feasibility and applicability of characterizing a document by its LDA latent-topic probabilities. Two document corpora retrieved from Microsoft Academic Search are used in the experiment. Each corpus is partitioned into ten clusters by applying Expectation Maximization (EM) both to the co-citation matrix and to the arrays of latent-topic probabilities. The congruity of the clusters derived by the two methods is evaluated with the Rand index, Kappa, and Gwet's statistic. The modest congruity between the clusters generated by the two methods indicates that the LDA latent-topic probability vector is a viable alternative feature for characterizing a document. As such, we can use LDA to elicit latent topics from a document corpus, and then use the associated topic probability vectors to characterize documents and partition them into clusters of similar topics.
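The cluster-congruity comparison can be illustrated with the plain Rand index, one of the agreement measures named above. This is a toy sketch: the two labelings below stand in for the co-citation-derived and LDA-derived cluster assignments and are invented for illustration; the paper's corpora, EM clustering, and the Kappa and Gwet's statistics are not reproduced.

```python
# Hypothetical sketch: Rand index between two cluster labelings of the
# same documents -- one link-derived, one topic-model-derived (toy data).
from itertools import combinations

def rand_index(a, b):
    """Fraction of document pairs on which labelings a and b agree:
    both place the pair in the same cluster, or both separate it."""
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i, j in combinations(range(len(a)), 2))
    return agree / (len(a) * (len(a) - 1) / 2)

cocitation_labels = [0, 0, 1, 1, 2, 2]  # toy link-derived clusters
lda_labels        = [0, 0, 1, 2, 2, 2]  # toy topic-model clusters
print(round(rand_index(cocitation_labels, lda_labels), 3))  # → 0.8
```

A Rand index near 1 means the two clusterings group documents almost identically; values well above chance are what would support treating LDA topic-probability vectors as a substitute feature for co-citation links.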