  • Journal
  • Open Access

Code switching: exploring perplexity and coherence metrics for optimizing topic models of historical documents

Abstract


The Latent Dirichlet Allocation (LDA) model has two important hyperparameters: alpha (α), which controls the document-topic distribution, and beta (β), which controls the topic-word distribution. Finding suitable values for both is necessary to obtain accurate topic clusters. Because of the size and complexity of the dataset, a single evaluation method is insufficient for determining the optimal hyperparameter values. An experiment was therefore conducted to study how the hyperparameters relate to perplexity and coherence scores, and to establish a baseline for further topic modelling studies. It is the first topic modelling study to focus on multiple languages in the Sarawak Gazette data. The study was conducted on LDA using the Gensim package. The results show that while perplexity scores were a good indicator of the model's ability to predict new or hidden data, the word clusters within a topic did not always reflect the similarity or relationships between words, which compromised topic interpretation. The lowest perplexity score was observed when α was set to 5 and β to 0.4. The coherence evaluation indicated the optimal number of topics for each set of hyperparameter values, although its relationship with hidden words remains unclear. Coherence scores were highest when the number of topics was 5 and 4. In conclusion, perplexity scores are effective indicators of word prediction accuracy for each hyperparameter setting, while coherence captures the number of topics needed to produce highly coherent word clusters within a topic. Combining both evaluation methods ensures optimal results, producing topics that are both accurate and interpretable.
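
To make the two evaluation measures concrete, the following is a minimal sketch in plain Python rather than the study's Gensim pipeline. The function names, the toy documents, and the choice of the UMass coherence variant (averaged over word pairs) are assumptions made for illustration; the abstract does not state which coherence measure was used.

```python
import math


def perplexity(token_log_probs):
    """Perplexity as exp of the negative mean natural-log likelihood per token.

    `token_log_probs` is a hypothetical input: the log-probability a model
    assigns to each held-out token. Lower perplexity means better prediction.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic, averaged over word pairs.

    For each pair (w_i, w_j) with w_j ranked above w_i, the score is
    log((D(w_i, w_j) + 1) / D(w_j)), where D counts documents containing
    the given words. Averaging instead of summing is an assumption here.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every one of `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_freq(top_words[j])
            if denom == 0:
                continue  # skip words that never occur in the corpus
            joint = doc_freq(top_words[i], top_words[j])
            scores.append(math.log((joint + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy corpus (invented for illustration, not taken from the Sarawak Gazette).
docs = [
    ["rice", "padi", "harvest"],
    ["rice", "padi", "river"],
    ["court", "fine", "river"],
    ["court", "fine", "rice"],
]

uniform_pp = perplexity([math.log(0.25)] * 8)
related = umass_coherence(["rice", "padi"], docs)
unrelated = umass_coherence(["rice", "fine"], docs)
```

In this toy setting a model that spreads probability uniformly over four words has a perplexity of 4, and the pair of words that co-occur often scores higher on coherence than the pair that rarely appears together. The two numbers measure different things, which is why a setting with low perplexity can still yield topics that are hard to interpret.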
