Mining and Analyzing Classical Chinese Poetry with Topic Models Based on Lexical-Chain and Metrical Word Segmentation Methods

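Because word segmentation techniques designed for vernacular Chinese often fit classical poetry poorly, this study extracts key poetic terms with two segmentation methods: CSCP, which is based on syntactic chains, and CCPF, which is based on poetic metrical rules. The experimental material is drawn from Tang and Song poetry, the golden age of Chinese poetry, and totals 204,633 poems. From it we build the bag of feature words for Latent Dirichlet Allocation (LDA) and then run CSCP-LDA and CCPF-LDA separately for each dynasty, producing four topic models for the Tang and Song dynasties. All topics are estimated and inferred with Gibbs sampling; the parameters use the original best default values of α and β, and the lowest perplexity value sets the number of LDA topics to 110 and the number of iterations to 600. The study finds that topic words in Tang and Song poetry are mostly one-character and two-character words. CSCP segmentation depends on the distribution rate of syntactic chains, so the words it produces are those with higher link frequency, and it therefore yields fewer words than CCPF. The experiments also show that although there are far fewer Tang poems than Song poems, Tang poetry contains more unique topic words than Song poetry, which suggests that the vocabulary of Tang poetry is more pluralistic, lively, and diverse, whereas Song poetry tends to be conservative and cautious. We speculate that this may be because the main schools of thought in the Song dynasty, such as Buddhism, Daoism, and Confucianism, had gradually merged into a unified whole, so that word choice became more uniform. The results show that although topic words segmented by CSCP are less accurate than those segmented by CCPF, both UMass Topic Coherence and expert evaluation indicate that CSCP-LDA topics are more coherent than CCPF-LDA topics and are highly relevant to the original poems. This suggests that CSCP-LDA, which segments by distribution rate, has a better chance of bringing out poetic themes.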
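As an illustration of the UMass Topic Coherence measure mentioned above, the following minimal sketch scores one topic from document co-occurrence counts. It follows the usual definition (sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of a topic's top words); the exact variant used in this study is not stated, so the function name `umass_coherence`, the `smoothing` parameter, and the toy poems are all assumptions for illustration.

```python
import math


def umass_coherence(top_words, documents, smoothing=1.0):
    """UMass-style coherence of one topic (illustrative sketch).

    top_words: the topic's top words, most probable first.
    documents: list of token lists, one per segmented poem.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every given word.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            freq_l = doc_freq(top_words[l])
            if freq_l == 0:
                continue  # skip words absent from the corpus
            freq_ml = doc_freq(top_words[m], top_words[l])
            score += math.log((freq_ml + smoothing) / freq_l)
    return score


# Hypothetical toy corpus: four already-segmented "poems".
toy_poems = [["月", "山", "水"], ["月", "山"], ["月", "酒"], ["花", "酒"]]
coherent = umass_coherence(["月", "山"], toy_poems)    # words that co-occur
incoherent = umass_coherence(["月", "花"], toy_poems)  # words that never co-occur
print(coherent > incoherent)
```

Scores are at most zero when every co-occurrence count is below the single-word count, and values closer to zero indicate top words that appear together in more poems.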

Keywords

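Topic model; topic coherence; classical poetry classification; poetic metrical rules; Chinese syntactic chain; Latent Dirichlet Allocation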

Parallel Abstract

Purpose - This study investigates the feasibility of applying Latent Dirichlet Allocation (LDA) to a large corpus of ancient Chinese poems. It explores word usage, the connotations of poems, and the topical associations between poems, and observes how vocabulary changes between dynasties. Design/methodology/approach - Because word segmentation techniques for vernacular Chinese are often inadequate for classical Chinese poetry, this study proposes two segmentation methods: Chinese Syntactic Chain Processing (CSCP) and the Chinese Classic Poetic Formula (CCPF). The experimental material was collected from "The Complete Tang Poetry" and "The Complete Song Poems", totaling 204,633 poems. From it we constructed the bag of words for LDA and then ran CSCP-LDA and CCPF-LDA, producing four topic models for the Tang and Song dynasties. All topics were estimated and inferred using Gibbs sampling, with the parameters set to the preset values α = 0.5 and β = 0.1. Perplexity was calculated to set the number of LDA topics to 110 and the number of iterations to 600. Findings - Although there are far fewer Tang poems than Song poems, more unique words were identified in Tang poetry than in Song poetry, indicating that Tang poetry is more pluralistic, lively, and diversified, whereas Song poetry tends to be conservative and cautious. The experimental results show that words segmented by CSCP are less accurate than those segmented by CCPF, but both UMass Topic Coherence and expert evaluation indicate that the poetic topics generated by CSCP-LDA are more coherent than those of CCPF-LDA. Research limitations/implications - Although CCPF segments words accurately, it cannot be applied to non-regulated verse, and CCPF-LDA classifies less effectively than CSCP-LDA. Future research is recommended to explore ancient poetry classification using other approaches, such as deep neural networks. Practical implications - Although literati distinguish poets and poetry by style, the rules behind these distinctions are neither obvious nor generally agreed upon; it is therefore difficult to derive rules for classifying poetry from critics' comments or from the poems alone. To the best of our knowledge, CSCP is the first method of its kind to analyze ancient poetry without relying on the rules of classical Chinese regulated verse, and this study is the only one to apply LDA to analyze the meaning of verses. The promising topic modeling results suggest that traditional vernacular word segmentation and the removal of single characters are not suitable for processing ancient poetry. Originality/value - We propose a new poetry segmentation method. The fundamental idea of CSCP is a bottom-up concatenation process that uses the intensity and significance of the distribution rate to extract meaningful descriptors from a string, processing the direct link and the inverted link in parallel. The process iterates until no further concatenation can be found.
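The bottom-up concatenation idea can be illustrated with a deliberately simplified sketch. It is not the CSCP algorithm itself, whose scoring is not specified in this abstract. Here the "direct link" is assumed to be the share of occurrences of a unit that are followed by its neighbour, the "inverted link" the share of occurrences of the neighbour that are preceded by the unit, and the names `merge_pass`, `chain_segment`, `threshold`, and `min_count` are hypothetical.

```python
from collections import Counter


def merge_pass(lines, threshold, min_count):
    """One bottom-up pass: join adjacent units whose two links are both strong."""
    unit_freq = Counter(u for line in lines for u in line)
    pair_freq = Counter(p for line in lines for p in zip(line, line[1:]))
    merged_any = False
    result = []
    for line in lines:
        new_line, i = [], 0
        while i < len(line):
            if i + 1 < len(line):
                a, b = line[i], line[i + 1]
                count = pair_freq[(a, b)]
                direct = count / unit_freq[a]    # assumed direct link: a followed by b
                inverted = count / unit_freq[b]  # assumed inverted link: b preceded by a
                if count >= min_count and min(direct, inverted) >= threshold:
                    new_line.append(a + b)
                    i += 2
                    merged_any = True
                    continue
            new_line.append(line[i])
            i += 1
        result.append(new_line)
    return result, merged_any


def chain_segment(texts, threshold=0.6, min_count=2):
    """Start from single characters and repeat passes until nothing merges."""
    lines = [list(text) for text in texts]
    while True:
        lines, merged_any = merge_pass(lines, threshold, min_count)
        if not merged_any:
            return lines


# Hypothetical toy input: three verse lines that share one recurring pair.
toy_lines = ["明月出天山", "明月何時有", "舉頭望明月"]
result = chain_segment(toy_lines)
print(result)
```

In this toy input the only pair that recurs is 明月, so it is the only unit joined and every other character stays a one-character word. Each pass that merges anything reduces the number of units, so the loop always terminates.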