A Study on Improving the Stability of LDA Topic Modeling Using Ensemble Learning

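Topic modeling can be applied to automatically analyze the topical structure of large volumes of text. Although many new techniques have been developed, the stability of topic modeling remains a concern. This study addresses the stability of the Latent Dirichlet Allocation (LDA) topic modeling algorithm and proposes an ensemble-learning-based improvement. In the first stage, the method selects valid topics that appear stably across multiple base models; in the second stage, it uses the word occurrence information of these valid topics to perform guided topic modeling, producing an improved model with higher stability. The method was tested on two different corpora, and the results show that it improves stability. To reduce overall modeling time, the study further recommends building each base model from only a portion of the documents in the corpus; the experimental results indicate that a small proportion of the documents is enough to obtain good stability scores.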

Keywords

Ensemble Learning; Latent Dirichlet Allocation (LDA); Stability; Topic Modeling

Parallel Abstract

Topic modeling enables the rapid discovery of latent thematic structures in large amounts of unstructured text. Although many new techniques have been developed, the stability of topic modeling remains a notable concern. This study focuses on Latent Dirichlet Allocation (LDA) topic modeling and proposes a two-stage ensemble learning approach to improve its stability. The first stage selects stable and meaningful topics from multiple base models, while the second stage uses the word occurrence information of these selected topics to guide the creation of improved models with higher stability. The method was tested on two different corpora, and the results demonstrate that it consistently improves model stability. Furthermore, to save overall modeling time, this study suggests sampling only a subset of the documents in the corpus when building each base model, as the experimental results indicate that a small proportion of the documents can still yield satisfactory stability scores.
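The two stages described above can be illustrated with a minimal Python sketch. The abstract does not state how topic similarity is measured, which thresholds are used, or how the guidance is encoded, so everything specific here is an assumption made for illustration: cosine similarity between topic-word vectors, the `sim_threshold` and `min_support` parameters, the use of the first base model as the reference, and the asymmetric topic-word prior built by the hypothetical helper `build_seed_prior`. The toy matrices stand in for topic-word distributions of three base models.

```python
import numpy as np


def cosine_matrix(a, b):
    """Pairwise cosine similarity between rows of a (Ka x V) and b (Kb x V)."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


def select_stable_topics(models, sim_threshold=0.8, min_support=0.8):
    """Stage 1 (assumed criterion): keep topics of the first base model that
    have a close match in at least `min_support` of the other base models.
    Expects at least two models."""
    reference, others = models[0], models[1:]
    support = np.zeros(reference.shape[0])
    for other in others:
        # Best-matching topic in this base model for every reference topic.
        best_match = cosine_matrix(reference, other).max(axis=1)
        support += best_match >= sim_threshold
    keep = support / len(others) >= min_support
    return reference[keep]


def build_seed_prior(stable_topics, num_topics, top_n=2, base=0.01, boost=1.0):
    """Stage 2 (assumed encoding): raise the prior weight of each stable
    topic's top words so a guided model is pulled toward those topics."""
    vocab_size = stable_topics.shape[1]
    prior = np.full((num_topics, vocab_size), base)
    for k, topic in enumerate(stable_topics[:num_topics]):
        top_words = np.argsort(topic)[::-1][:top_n]
        prior[k, top_words] += boost
    return prior


# Toy topic-word matrices (3 topics x 6 words) from three base models.
model_a = np.array([
    [0.50, 0.40, 0.05, 0.03, 0.01, 0.01],
    [0.01, 0.01, 0.50, 0.40, 0.05, 0.03],
    [0.04, 0.06, 0.05, 0.05, 0.42, 0.38],  # recurs in model_b only
])
model_b = np.array([
    [0.02, 0.02, 0.45, 0.45, 0.03, 0.03],
    [0.45, 0.45, 0.04, 0.03, 0.02, 0.01],
    [0.05, 0.05, 0.05, 0.05, 0.40, 0.40],
])
model_c = np.array([
    [0.48, 0.42, 0.04, 0.03, 0.02, 0.01],
    [0.30, 0.05, 0.30, 0.05, 0.25, 0.05],
    [0.02, 0.02, 0.50, 0.40, 0.03, 0.03],
])

stable = select_stable_topics([model_a, model_b, model_c])
prior = build_seed_prior(stable, num_topics=3)
print(stable.shape, prior.shape)
```

In this toy run the third topic of `model_a` is dropped because it has a close match in only one of the two other models. A prior matrix of this shape could, as one option, be supplied as the topic-word prior of an LDA implementation that accepts a per-topic, per-word prior; whether that matches the guided modeling used in the study is not stated in the abstract.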

Parallel Keywords

Ensemble Learning; Latent Dirichlet Allocation (LDA); Stability; Topic Modeling
