Topic Similarity Estimation and Its Application to Measuring Topic Modeling Stability

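Topic modeling stability measures the degree to which models produced by the same modeling method, for the same text collection and under the same initial conditions, contain similar topics. Because estimating the similarity between topics is the foundation of stability measurement, and topic alignment is a key step in it, this study first compares different similarity estimation methods by the proportion of paired topics that are identical after topic alignment, and then examines the distribution of similarity scores for each method. Finally, it analyzes the effect of the number of topics on the stability measurement. The topic modeling method used is the widely used latent Dirichlet allocation (LDA), and the models analyzed were built from about 30,000 posts on the Book board of PTT BBS. The results show that the similarity estimation methods yield a high proportion of identical paired topics, but that their similarity scores on the paired topics follow different distributions. They also show that topic modeling stability decreases as the number of topics increases.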
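The alignment step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: cosine similarity over topic-word distributions is assumed as the similarity measure, the optimal one-to-one pairing is found by exhaustive search over permutations (workable only for a handful of topics), and the stability score is assumed to be the mean similarity of the paired topics. The names `cosine`, `align_topics`, `model_a`, and `model_b` are hypothetical, and the toy distributions are invented for the example.

```python
from itertools import permutations
from math import sqrt


def cosine(p, q):
    """Cosine similarity between two topic-word probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def align_topics(model_a, model_b, sim=cosine):
    """Pair each topic of model_a with exactly one topic of model_b.

    Returns (pairs, mean_similarity), where pairs is a list of
    (index_in_a, index_in_b) tuples maximizing the total similarity.
    Exhaustive search: only suitable for a small number of topics.
    """
    k = len(model_a)
    scores = [[sim(a, b) for b in model_b] for a in model_a]
    best_perm = max(
        permutations(range(k)),
        key=lambda perm: sum(scores[i][perm[i]] for i in range(k)),
    )
    pairs = list(enumerate(best_perm))
    mean_similarity = sum(scores[i][j] for i, j in pairs) / k
    return pairs, mean_similarity


# Toy example (assumed data): two runs with two topics over a four-word vocabulary.
model_a = [[0.7, 0.2, 0.1, 0.0],
           [0.0, 0.1, 0.2, 0.7]]
model_b = [[0.1, 0.1, 0.1, 0.7],
           [0.6, 0.3, 0.1, 0.0]]

pairs, stability = align_topics(model_a, model_b)
print(pairs, round(stability, 3))
```

In this toy case the topics of the second run appear in swapped order, so alignment pairs topic 0 with topic 1 and vice versa before the similarities are averaged. For realistic topic counts, a polynomial-time assignment solver would replace the permutation search.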

Keywords

Topic modeling; latent Dirichlet allocation (LDA); stability measurement; topic similarity estimation; topic alignment

Parallel Abstract

Topic modeling stability measures the extent to which models produced by the same modeling approach, for the same corpus and with the same initial conditions, have similar topics. Because the method used to calculate similarity between topics is the basis for measuring topic modeling stability, and topic alignment is a key step in that measurement, the present study first calculated the proportion of identical paired topics among the optimal combinations of paired topics generated by different topic similarity calculation methods, and then observed the distribution of similarity scores of paired topics for each method. Finally, the study analyzed the effect of the number of topics on topic modeling stability. The topic modeling method used was the commonly used latent Dirichlet allocation (LDA), and the corpus used to build the topic models consisted of about 30,000 posts collected from the Book board of the PTT Bulletin Board System (BBS). The results indicated a high proportion of identical paired topics across the different similarity measures, although the similarity scores of paired topics had different distributions for each method, because the methods use different kinds and amounts of information about the word distribution of each topic. The results also revealed that stability decreased noticeably as the number of topics increased.
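The comparison of similarity measures by their proportion of identical paired topics can be illustrated with a small sketch. The two measures here are assumptions chosen to contrast "different kinds and amounts of information": cosine similarity uses the full word distribution, while a Jaccard index over the top-`n` words uses only the highest-ranked words. The abstract does not name the measures the study compared. All function names and the toy data are hypothetical.

```python
from itertools import permutations
from math import sqrt


def cosine(p, q):
    """Similarity from the full topic-word distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def top_word_jaccard(p, q, n=2):
    """Similarity from only the n most probable words of each topic."""
    top_p = set(sorted(range(len(p)), key=lambda i: -p[i])[:n])
    top_q = set(sorted(range(len(q)), key=lambda i: -q[i])[:n])
    return len(top_p & top_q) / len(top_p | top_q)


def best_pairing(model_a, model_b, sim):
    """Tuple t where topic i of model_a is paired with topic t[i] of model_b."""
    k = len(model_a)
    scores = [[sim(a, b) for b in model_b] for a in model_a]
    return max(
        permutations(range(k)),
        key=lambda perm: sum(scores[i][perm[i]] for i in range(k)),
    )


def pairing_agreement(model_a, model_b, sim_1, sim_2):
    """Proportion of topics paired identically under two similarity measures."""
    pairing_1 = best_pairing(model_a, model_b, sim_1)
    pairing_2 = best_pairing(model_a, model_b, sim_2)
    return sum(a == b for a, b in zip(pairing_1, pairing_2)) / len(pairing_1)


# Toy example (assumed data): three topics over a five-word vocabulary.
model_a = [[0.50, 0.30, 0.10, 0.05, 0.05],
           [0.05, 0.05, 0.50, 0.30, 0.10],
           [0.30, 0.05, 0.05, 0.10, 0.50]]
model_b = [[0.05, 0.10, 0.45, 0.35, 0.05],
           [0.35, 0.05, 0.05, 0.05, 0.50],
           [0.45, 0.35, 0.10, 0.05, 0.05]]

agreement = pairing_agreement(model_a, model_b, cosine, top_word_jaccard)
print(agreement)
```

Both measures select the same pairing in this toy case, yet the scores they assign to each pair differ in scale, which is the kind of contrast the abstract reports between agreement on pairings and the distribution of similarity scores.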

Parallel Keywords

Topic modeling; latent Dirichlet allocation (LDA); stability measurement; topic similarity estimation; topic alignment

