Applying Word Embedding to Buddhist Studies: With a Discussion of Word Embedding Model Evaluation

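Word embedding is a method for automatically generating semantic vectors from a corpus. This paper explores possible applications of word embedding to the Chinese Buddhist canon in the Comprehensive Buddhist Electronic Text Archive (CBETA). To obtain the word embedding model best suited to Buddhist studies, we build an experimental dataset from Chunjiang Zhuang's dictionary, Fubao Ding's dictionary, and the Digital Dictionary of Buddhism, and design two evaluation experiments, synonym detection and outlier-word detection, to establish a baseline for model optimization. The best hyperparameter combination found is Word2Vec CBOW (continuous bag-of-words) with Dimension 400, Window 10, and Epoch 10, which gives a validation accuracy of 0.87 and a test accuracy of 0.86. On this basis, we partition the CBETA corpus and train separate word embedding models, generate comparative word lists for sub-corpora divided by period, translator, and textual category (部類), and carry out applied analyses. The paper makes three main contributions: (1) building a word embedding synonym dataset suited to research on the Chinese Buddhist canon; (2) identifying word embedding hyperparameters suited to Chinese Buddhist texts; and (3) discussing and analyzing concrete cases of word embedding in research on the Chinese Buddhist canon, showing that it can be used to trace how the semantic core of translated terms evolves, to pin down unclear meanings, to find related concepts through semantic analogy, to identify the core concepts of each textual category, to broaden and deepen research, and to verify the results of traditional scholarship.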
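The synonym-detection evaluation mentioned above can be pictured with a small sketch. The abstract does not spell out the exact protocol, so the following assumes a multiple-choice setup: for each target term, the model picks the candidate whose vector has the highest cosine similarity, and accuracy is the share of questions answered correctly. The three-dimensional vectors are hand-made for illustration and are not output from any model in the paper; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pick_synonym(vectors, target, candidates):
    """Return the candidate most similar to the target term."""
    return max(candidates, key=lambda w: cosine(vectors[target], vectors[w]))


def synonym_accuracy(vectors, questions):
    """Fraction of (target, candidates, answer) questions answered correctly."""
    correct = sum(
        1
        for target, candidates, answer in questions
        if pick_synonym(vectors, target, candidates) == answer
    )
    return correct / len(questions)


# Hand-made toy vectors (an assumption for illustration, not trained embeddings).
TOY_VECTORS = {
    "涅槃": [0.9, 0.1, 0.0],
    "泥洹": [0.8, 0.2, 0.1],  # alternative rendering of nirvana
    "布施": [0.1, 0.9, 0.0],
    "檀那": [0.2, 0.8, 0.1],  # transliteration of dana (giving)
    "精進": [0.0, 0.1, 0.9],
}

QUESTIONS = [
    ("涅槃", ["泥洹", "布施", "精進"], "泥洹"),
    ("布施", ["檀那", "涅槃", "精進"], "檀那"),
]

if __name__ == "__main__":
    print(synonym_accuracy(TOY_VECTORS, QUESTIONS))
```

In a real evaluation the question list would come from dictionary-derived synonym pairs, and the vectors from a trained model; the scoring logic would stay the same.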

Keywords

word embedding; Chinese Tripitaka; Buddhist studies; semantic relations; semantic analogy

Parallel Abstract

Word embedding is a method for automatically generating semantic vectors from corpora. This paper explores possible applications of word embedding to the Chinese Buddhist database (Comprehensive Buddhist Electronic Text Archive, CBETA). To obtain the best word embedding model for Buddhist studies, we compile an experimental dataset from Chunjiang Zhuang's dictionary, Fubao Ding's dictionary, and the Digital Dictionary of Buddhism, and design two evaluation experiments, one detecting synonyms and one detecting outlier words, to obtain a baseline for model optimization. The best set of parameters is found to be Word2vec CBOW, Dimension 400, Window 10, Epoch 10, with a validation score of 0.87 and a test score of 0.86. Accordingly, we categorize the CBETA corpus to train different models, generate comparative word lists for different chronologies, translators, and schools of Buddhism, and demonstrate the applications in real cases. The main contribution of this paper is threefold: 1. to build a synonym collection for word embedding used in the study of Chinese Buddhism; 2. to identify the hyper-parameters of word embedding for the study of Chinese Buddhism; 3. to explore and demonstrate the results of word embedding in Chinese Buddhist studies, including the ability to determine the semantic core evolution of translated words, to define new words, to identify related concepts through semantic analogy, to identify the core concepts of each school, and to expand the scope of research. In addition, it can be used to verify the results of traditional research.
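The outlier-word experiment can be sketched in a similar spirit. The abstract gives no details of its scoring, so this assumes one common formulation: in a small group of words, the outlier is the word with the lowest mean cosine similarity to the rest. The two-dimensional vectors are invented for illustration and `find_outlier` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_outlier(vectors, words):
    """Return the word least similar, on average, to the other words."""

    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)

    return min(words, key=mean_similarity)


# Invented toy vectors (an assumption for illustration, not trained embeddings):
# three terms for monastics and one unrelated term.
TOY_VECTORS = {
    "比丘": [0.90, 0.10],
    "苾芻": [0.85, 0.15],
    "沙門": [0.80, 0.30],
    "布施": [0.10, 0.90],
}

if __name__ == "__main__":
    print(find_outlier(TOY_VECTORS, ["比丘", "苾芻", "沙門", "布施"]))
```

A model scores well on this task when the word it singles out matches the one a dictionary-based dataset marks as not belonging to the group.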

Parallel Keywords

word embedding; Chinese Tripitaka (CBETA); Buddhist studies; word relations; word analogy
