

Estimation of Topic Similarity and Its Application to Measuring Stability of Topic Modeling


Topic modeling stability measures the extent to which models produced by the same modeling method, for the same corpus and under the same initial conditions, contain similar topics. The method used to estimate similarity between topics is the basis of this measurement, and topic alignment is its key step. This study therefore first compared different similarity estimation methods by the proportion of identical paired topics among the optimal pairings that topic alignment produced under each method, and then examined the distribution of similarity scores of the paired topics for each method. Finally, it analyzed the effect of the number of topics on stability. The topic modeling method was the commonly used latent Dirichlet allocation (LDA), and the models were built from a corpus of about 30,000 posts collected from the Book board of the PTT Bulletin Board System (BBS). The results showed a high proportion of identical paired topics across the similarity estimation methods, although the similarity scores of the paired topics were distributed differently for each method because the methods use different kinds and amounts of information about each topic's word distribution. The results also showed that stability decreased noticeably as the number of topics increased.
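
The alignment step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the similarity estimation methods or the alignment algorithm used in the study, so everything here is an assumption made for illustration: the Jaccard similarity over each topic's top words, the brute-force search over one-to-one pairings, the function names `jaccard` and `align_topics`, and the toy word lists.

```python
from itertools import permutations


def jaccard(words_a, words_b):
    """Jaccard similarity between the top-word sets of two topics (assumed measure)."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def align_topics(model_a, model_b, similarity=jaccard):
    """Pair every topic of model_a with exactly one topic of model_b.

    Returns (pairs, stability), where pairs is a list of
    (index_in_a, index_in_b, score) and stability is the mean score of the
    pairing that maximizes total similarity. The search is brute force over
    all permutations, so it is only practical for a handful of topics.
    """
    k = len(model_a)
    if k == 0 or k != len(model_b):
        raise ValueError("both models need the same, non-zero number of topics")
    # Similarity of every topic in model_a against every topic in model_b.
    scores = [[similarity(ta, tb) for tb in model_b] for ta in model_a]
    # Choose the one-to-one pairing with the highest total similarity.
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(scores[i][perm[i]] for i in range(k)),
    )
    pairs = [(i, best[i], scores[i][best[i]]) for i in range(k)]
    stability = sum(score for _, _, score in pairs) / k
    return pairs, stability


# Hypothetical top words from two runs of the same model on the same corpus.
run_1 = [
    ["novel", "author", "story", "plot"],
    ["price", "buy", "sell", "store"],
    ["library", "borrow", "shelf", "card"],
]
run_2 = [
    ["sell", "price", "store", "discount"],
    ["library", "borrow", "card", "branch"],
    ["novel", "story", "author", "series"],
]

pairs, stability = align_topics(run_1, run_2)
print([(i, j) for i, j, _ in pairs])  # topic index pairs: [(0, 2), (1, 0), (2, 1)]
print(round(stability, 3))
```

Passing a different function as `similarity` and comparing the resulting index pairs is one way to estimate the proportion of identical paired topics between two similarity methods. For realistic topic counts, a polynomial-time assignment solver would replace the permutation search.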


