Title

主題模型集群與鏈結導出集群之一致性檢定

Translated Titles

The Congruity between Document Clusters Derived From LDA Topic Modeling and Traditional Link-Based Expectation Maximization Method

Authors

林盈萱

Key Words

Latent Dirichlet Allocation ; Expectation Maximization ; Congruity

PublicationName

Theses of the Graduate Institute of Information Management, National Taipei University

Volume or Term/Year and Month of Publication

2016

Academic Degree Category

Master's

Advisor

陳宗天

Content Language

Traditional Chinese

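Chinese Abstract (translated)

Prior research reports that clustering documents by applying the Expectation Maximization (EM) algorithm to co-citation links yields comparatively good results. However, link-derived clusters must be named manually by the researcher, which raises concerns about accuracy and efficiency. Latent Dirichlet Allocation (LDA), a topic model, instead clusters documents by analyzing the text of each one after preprocessing steps such as word segmentation and reducing words to their base forms. The terms most strongly associated with each topic can therefore serve as that topic's name, which removes the manual naming step that link-derived clusters require. This study argues that if link-derived clusters and topic-model clusters can be shown to be highly congruent, topic models can be used to discover topic-related terms as topic names, making it more efficient for researchers to select literature. Two document collections on different topics were gathered from the Microsoft Academic Search database. Each collection was clustered with the link-derived model (using the co-citation matrix) and with the topic model (using the LDA probability distribution matrix), producing four clustering results. Cluster congruity among the four results was assessed with the Rand Index, Kappa, and Gwet's coefficient. Content consistency was assessed with Jensen Shannon Divergence (JSD), which also served as an objective criterion for judging clustering quality.

The results show significant cluster congruity between the topic-model clusters and the link-derived clusters. The topic-model clusters also have a lower mean JSD, that is, better content consistency. The study therefore tentatively concludes that topic-model clustering gives better results than link-derived clustering.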
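As an illustration of the JSD measure named in this abstract, the following is a minimal sketch rather than the thesis's actual procedure. It assumes each document is represented by a probability distribution over the same vocabulary or topic set, and that content consistency within a cluster is summarized as the mean JSD over all document pairs. That aggregation rule and the function names are assumptions; the abstract does not state how the mean was computed.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence between two distributions of equal length.

    With base-2 logarithms the value lies between 0 (identical) and 1 (disjoint).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # m is the midpoint distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def mean_pairwise_jsd(distributions):
    """Mean JSD over all pairs in one cluster (assumed aggregation rule)."""
    pairs = list(combinations(distributions, 2))
    if not pairs:
        return 0.0  # a cluster with fewer than two documents has no pairs
    return sum(jensen_shannon_divergence(p, q) for p, q in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical cluster of three documents over a two-term vocabulary.
    cluster = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]
    print(mean_pairwise_jsd(cluster))
```

Under this reading, a lower mean corresponds to documents whose distributions are closer to one another, which matches the abstract's interpretation of a lower mean JSD as better content consistency.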

English Abstract

Latent Dirichlet Allocation (LDA) is widely used to elicit latent topics from documents. Each latent topic derived by LDA comes with its associated keywords and a topic probability distribution, and a document can be characterized by several of its associated latent topics. Specifically, the array of a document's latent topic probabilities can serve as its feature vector. In this study we empirically explore the feasibility and applicability of characterizing a document by its LDA latent topic probabilities. Two document corpora retrieved from Microsoft Academic Search are used in the experiment. Each corpus is partitioned into ten clusters by applying the Expectation Maximization (EM) algorithm to the co-citation matrix and, separately, to the arrays of latent topic probabilities. The congruity of the clusters derived from these two methods is evaluated through their Rand index, Kappa, and Gwet’s value. The modest congruity between the clusters generated by the two methods indicates that the LDA latent topic probability vector is a viable alternative feature for characterizing a document. As such, we can use LDA to elicit latent topics from a document corpus and also use the associated topic probability vectors to characterize documents and partition them into clusters of similar topics.
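To make the congruity comparison concrete, the following is a minimal sketch of how two cluster assignments over the same documents can be compared. It computes the Rand index from pairwise co-membership decisions, plus a Cohen-style Kappa over the same pair decisions. Treating each document pair as one "same cluster / different cluster" rating is an assumption made here for illustration; the abstract does not say how Kappa or Gwet's value was applied, and the function names are hypothetical.

```python
from itertools import combinations


def pair_agreement_counts(labels_a, labels_b):
    """Count document pairs by whether each clustering puts them in the same cluster.

    Returns (both_same, both_different, same_in_a_only, same_in_b_only).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same documents")
    if len(labels_a) < 2:
        raise ValueError("at least two documents are needed to form a pair")
    both_same = both_diff = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both_same += 1
        elif not same_a and not same_b:
            both_diff += 1
        elif same_a:
            only_a += 1
        else:
            only_b += 1
    return both_same, both_diff, only_a, only_b


def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which the two clusterings agree."""
    ss, dd, oa, ob = pair_agreement_counts(labels_a, labels_b)
    return (ss + dd) / (ss + dd + oa + ob)


def pairwise_kappa(labels_a, labels_b):
    """Cohen-style Kappa over pair decisions (assumed formulation)."""
    ss, dd, oa, ob = pair_agreement_counts(labels_a, labels_b)
    total = ss + dd + oa + ob
    observed = (ss + dd) / total
    p_same_a = (ss + oa) / total  # share of pairs that clustering A joins
    p_same_b = (ss + ob) / total  # share of pairs that clustering B joins
    expected = p_same_a * p_same_b + (1 - p_same_a) * (1 - p_same_b)
    if expected == 1:
        return 1.0  # both clusterings make the same constant decision on every pair
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical cluster labels for six documents from two methods.
    co_citation_clusters = [0, 0, 1, 1, 2, 2]
    lda_clusters = [1, 1, 0, 0, 0, 2]
    print(rand_index(co_citation_clusters, lda_clusters))
    print(pairwise_kappa(co_citation_clusters, lda_clusters))
```

Both measures depend only on which documents are grouped together, not on the label values, so two methods that number their clusters differently can still be compared directly.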

Topic Category

College of Business > Graduate Institute of Information Management
Social Sciences > Management