  • Thesis

主題模型集群與鏈結導出集群之一致性檢定

The Congruity between Document Clusters Derived From LDA Topic Modeling and Traditional Link-Based Expectation Maximization Method

Advisor: 陳宗天

Abstract


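Prior research reports that clustering co-citation links with the Expectation Maximization (EM) algorithm yields comparatively good results. However, link-derived clusters must be named by the researcher, which raises concerns about accuracy and efficiency. Latent Dirichlet Allocation (LDA), a topic model, instead clusters documents by analyzing the text of each document after preprocessing steps such as word segmentation and lemmatization. The words most strongly associated with each topic can therefore serve as that topic's name, which removes the manual naming step that link-derived clustering requires. This study argues that if link-derived clusters and topic-model clusters can be shown to be highly consistent, the topic model can be used to find topic-related words to serve as topic names, making it more efficient for researchers to select literature. We collected two document sets on different subjects from the Microsoft Academic Search database and clustered each with the link-derived model (using the co-citation matrix) and the topic model (using the LDA probability distribution matrix), producing four clustering results. Cluster agreement among the four results was assessed with the Rand index, Kappa, and Gwet's statistic. Content coherence was checked with the Jensen-Shannon divergence (JSD), which also served as an objective criterion for clustering quality.

The results show significant cluster agreement between the topic-model clusters and the link-derived clusters. The topic-model clusters also have a lower mean JSD, that is, better content coherence, so we tentatively conclude that topic-model clustering performs better than link-derived clustering.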
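The abstract uses the Jensen-Shannon divergence (JSD) as a content-coherence measure, where a lower mean value indicates more coherent clusters. The following is a minimal sketch of how such a score could be computed. It assumes that each document is represented as a word probability distribution over a shared vocabulary and that coherence is the mean JSD over all document pairs in a cluster. The abstract does not specify either choice, and the helper names `kl_divergence`, `jensen_shannon_divergence`, and `mean_pairwise_jsd` are hypothetical.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """JSD between two distributions; lies in [0, 1] when using log base 2."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def mean_pairwise_jsd(distributions):
    """Assumed coherence score: mean JSD over all pairs in one cluster."""
    pairs = list(combinations(distributions, 2))
    if not pairs:
        raise ValueError("need at least two distributions")
    return sum(jensen_shannon_divergence(p, q) for p, q in pairs) / len(pairs)


# Toy cluster of three documents over a four-word vocabulary (made-up numbers).
cluster = [
    [0.4, 0.4, 0.1, 0.1],
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
]
print(mean_pairwise_jsd(cluster))
```

Under this reading, a cluster whose documents share similar vocabulary scores close to 0, while a cluster mixing unrelated documents scores closer to 1.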

Parallel Abstract


Latent Dirichlet Allocation (LDA) is widely used to elicit the latent topics from documents. Each latent topic derived by LDA comes with a set of associated keywords and a topic probability distribution. A document can be characterized by several of its associated latent topics. Specifically, the array of latent-topic probabilities can serve as the feature vector of a document. In this study, we empirically explore the feasibility and applicability of characterizing a document by its LDA latent-topic probabilities. Two document corpora retrieved from Microsoft Academic Search are used in the experiment. Each corpus is partitioned into ten clusters by applying Expectation Maximization (EM) to the co-citation matrix as well as to the arrays of latent-topic probabilities. The congruity of the clusters derived from these two methods is evaluated through their Rand index, Kappa, and Gwet’s value. The modest congruity between the clusters generated by these two methods indicates that the LDA latent-topic probability vector is a viable alternative feature for characterizing a document. As such, we can use LDA to elicit latent topics from a document corpus and also use the associated topic probability vectors to characterize documents and partition them into clusters of similar topics.
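The congruity check described above compares two partitions of the same documents. As an illustration, here is a minimal sketch of the plain (unadjusted) Rand index, which is the fraction of document pairs on which two clusterings agree. The function name `rand_index` and the toy label lists are hypothetical, and the abstract does not say whether the plain or an adjusted variant was used.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs that both partitions treat the same way.

    A pair counts as an agreement when it is grouped together in both
    partitions or separated in both partitions. Cluster label values
    themselves do not matter, only the grouping they induce.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Made-up cluster assignments for six documents under the two feature sets.
cocitation_clusters = [0, 0, 1, 1, 2, 2]
lda_clusters = [1, 1, 0, 0, 2, 0]
print(rand_index(cocitation_clusters, lda_clusters))  # 12 of 15 pairs agree: 0.8
```

Because the index depends only on which documents are grouped together, the arbitrary cluster numbers assigned by each EM run do not need to be aligned before comparison.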

