Topic modeling enables the automatic discovery of latent thematic structures in large collections of unstructured text. Although many new techniques have been developed, the stability of topic modeling remains a noteworthy concern. This study focuses on the stability of Latent Dirichlet Allocation (LDA) and proposes a two-stage ensemble learning approach to improve it. In the first stage, stable and meaningful topics are selected from multiple base models; in the second stage, the word occurrence information of these selected topics is used to guide the construction of improved models with higher stability. The approach was evaluated on two different corpora, and the results demonstrate that the proposed method consistently improves model stability. Furthermore, to reduce overall modeling time, this study suggests sampling only a subset of the corpus when building each base model; experimental results indicate that a small proportion of the text data suffices to achieve satisfactory stability scores.
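The first stage described above can be sketched as follows. This is a minimal illustration only, assuming each topic is represented by its top-word list and that topic similarity is measured by Jaccard overlap; the function name `stable_topics`, the similarity threshold, and the support count are illustrative assumptions, not the paper's exact procedure.

```python
def jaccard(a, b):
    """Jaccard similarity between two top-word lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def stable_topics(runs, threshold=0.5, min_support=2):
    """Keep topics that recur (above `threshold` similarity) in at least
    `min_support` base-model runs, skipping near-duplicate matches.
    (Illustrative stand-in for the paper's stage-one selection.)"""
    selected = []
    for i, run in enumerate(runs):
        for topic in run:
            # Count how many runs (including this one) contain a similar topic.
            support = 1 + sum(
                1 for j, other in enumerate(runs)
                if j != i and any(jaccard(topic, t) >= threshold for t in other)
            )
            duplicate = any(jaccard(topic, s) >= threshold for s in selected)
            if support >= min_support and not duplicate:
                selected.append(topic)
    return selected

# Three simulated base-model runs; each topic is its top-word list.
runs = [
    [["cat", "dog", "pet", "fur"],  ["stock", "market", "trade", "price"]],
    [["dog", "cat", "pet", "paw"],  ["random", "noise", "junk", "misc"]],
    [["pet", "cat", "dog", "tail"], ["stock", "price", "market", "share"]],
]
stable = stable_topics(runs)  # keeps the two recurring topics, drops the one-off noise topic
```

In the paper's second stage, the word occurrence information of topics selected this way would then seed a guided LDA run; how that seeding is performed (e.g., via asymmetric word-topic priors) is not specified in the abstract.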