Word Embedding in Buddhist Studies: On the Basis of Evaluation of Word Embedding Models

Abstract


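Word embedding is a method for automatically generating semantic vectors from a corpus. This paper explores the possible applications of word embedding to the Chinese Buddhist canon in the Comprehensive Buddhist Electronic Text Archive (CBETA). To obtain the word embedding model best suited to Buddhist studies, we build an experimental dataset from Chunjiang Zhuang's dictionary, Fubao Ding's dictionary, and the Digital Dictionary of Buddhism, and design two evaluation experiments, synonym detection and outlier-word detection, to establish a baseline for model optimization. The results show that Word2Vec CBOW (continuous bag-of-words) with Dimension 400, Window 10, and Epoch 10 is the best hyperparameter combination, with a validation accuracy of 0.87 and a test accuracy of 0.86. On this basis, we partition the CBETA corpus and train separate word embedding models, generate comparative word lists for sub-corpora defined by period, translator, and canonical division, and analyze practical applications. The paper makes three main contributions: (1) building a word embedding synonym dataset suited to the study of Chinese Buddhist texts; (2) identifying word embedding hyperparameters suited to Chinese Buddhist texts; and (3) discussing and analyzing examples of word embedding in the study of Chinese Buddhist texts, including its use in tracing the evolution of the semantic core of translated terms, delimiting unclear meanings, finding related concepts through semantic analogy, identifying the core concepts of each canonical division, broadening and deepening research, and verifying the results of traditional research.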

Parallel Abstract


Word embedding is a method for automatically generating semantic vectors from corpora. This paper aims to explore the possible applications of word embedding in the Chinese Buddhist database (Comprehensive Buddhist Electronic Text Archive, CBETA). To obtain the best word embedding model for Buddhist studies, we compile an experimental dataset from Chunjiang Zhuang's dictionary, Fubao Ding's dictionary, and the Digital Dictionary of Buddhism, and design two evaluation experiments, one detecting synonyms and one detecting outlier words, to obtain a baseline for model optimization. We find that Word2Vec CBOW with Dimension 400, Window 10, and Epoch 10 is the best set of hyper-parameters, with a validation score of 0.87 and a test score of 0.86. Accordingly, we partition the CBETA corpus to train separate models, generate comparative word lists for different chronologies, translators, and schools of Buddhism, and demonstrate the applications in real cases. The main contribution of this paper is threefold: 1. to build a synonym collection for word embedding used in the study of Chinese Buddhism; 2. to identify the hyper-parameters of word embedding for the study of Chinese Buddhism; 3. to explore and demonstrate the results of word embedding in Chinese Buddhist studies, including the ability to determine the semantic core evolution of translated words, to define new words, to identify related concepts through semantic analogy, to identify the core concepts of each school, and to expand the scope of research. In addition, it can be used to verify the results of traditional research.
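The abstract names two evaluation tasks, synonym detection and outlier-word detection, without giving their exact scoring procedure. The following is a minimal sketch of one common way such tasks can be scored with cosine similarity; it is not the authors' implementation. The function names, the placeholder terms, and the three-dimensional toy vectors are all assumptions made for illustration, and a real run would use vectors from a trained model (for example, a CBOW model with the 400-dimensional setting the abstract reports).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_synonym(vectors, target, candidates):
    """Synonym detection: choose the candidate closest to the target."""
    return max(candidates, key=lambda w: cosine(vectors[target], vectors[w]))


def pick_outlier(vectors, group):
    """Outlier detection: choose the word least similar, on average, to the rest."""
    def mean_similarity(word):
        others = [w for w in group if w != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)
    return min(group, key=mean_similarity)


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold answers."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical toy vectors (assumption): stand-ins for embeddings of real terms.
TOY_VECTORS = {
    "term_a": [1.0, 0.1, 0.0],
    "term_a_syn": [0.9, 0.2, 0.0],
    "term_b": [0.0, 1.0, 0.1],
    "term_c": [0.0, 0.1, 1.0],
}

# Each synonym item: (target, candidate list, gold synonym).
synonym_items = [("term_a", ["term_a_syn", "term_b", "term_c"], "term_a_syn")]
# Each outlier item: (word group, gold outlier).
outlier_items = [(["term_a", "term_a_syn", "term_c"], "term_c")]

synonym_predictions = [pick_synonym(TOY_VECTORS, t, c) for t, c, _ in synonym_items]
outlier_predictions = [pick_outlier(TOY_VECTORS, g) for g, _ in outlier_items]

synonym_accuracy = accuracy(synonym_predictions, [gold for _, _, gold in synonym_items])
outlier_accuracy = accuracy(outlier_predictions, [gold for _, gold in outlier_items])
```

Under this sketch, comparing hyper-parameter settings would amount to computing both accuracies for each trained model on a validation split and keeping the setting with the highest score, then reporting the score on a held-out test split.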
