Development and Application of an Active-Learning-Based Sentence Segmentation System for Ancient Chinese

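This study aims to develop an active-learning-based sentence segmentation system for ancient Chinese texts in support of digital humanities research. The system combines active learning with machine learning algorithms and relies on human-machine cooperation to reduce the training corpus needed to build an automated ancient Chinese sentence segmentation model, and to help humanities scholars segment previously uninterpreted documents more efficiently. To identify the most suitable algorithm and feature template for the system, the first experiment compared segmentation models built with different algorithms and feature templates under two text selection methods: sequential selection and active learning. The results show that conditional random fields with a three-character feature template learn effectively under active learning and are suitable for developing the "active learning sentence segmentation model". The second experiment invited scholars with humanities expertise to use the system to segment ancient Chinese texts. The segmentation models built from each scholar's annotations were compared and analyzed, and semi-structured interviews were used to gain an in-depth understanding of the scholars' experience with, and suggestions for, the system as a segmentation aid. The results show that the system does learn effectively from the scholars' segmentation annotations, and that the model's predictive ability improves steadily through human-machine cooperation. Finally, the interviews indicate that the scholars evaluated the system's workflow and interface positively, and most interviewees considered its segmentation prediction function an effective aid for ancient Chinese sentence segmentation. Future work could add a named entity model or feature templates based on other ancient Chinese rules to further improve segmentation prediction. It is also hoped that the system can be applied in humanities education and developed into a digital humanities education platform for training in ancient Chinese sentence segmentation.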
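As a rough illustration of how sentence segmentation can be framed for a sequence model such as a conditional random field, the sketch below turns a punctuated passage into per-character labels and builds a three-character window of features around each position. The punctuation set, the function names `to_labeled_chars` and `window_features`, and the exact shape of the template are assumptions made for illustration; the abstract does not specify the templates actually used in the study.

```python
# Hypothetical punctuation set used only for this sketch.
PUNCT = set("，。；：？！、")

def to_labeled_chars(punctuated):
    """Strip punctuation and return (chars, labels).

    labels[i] is 1 when a break follows chars[i] in the punctuated
    source, otherwise 0.
    """
    chars, labels = [], []
    for ch in punctuated:
        if ch in PUNCT:
            if labels:          # ignore punctuation before any character
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return chars, labels

def window_features(chars, i):
    """Assumed three-character template: previous, current, next character."""
    prev_ch = chars[i - 1] if i > 0 else "<BOS>"
    next_ch = chars[i + 1] if i + 1 < len(chars) else "<EOS>"
    return {
        "cur": chars[i],
        "prev": prev_ch,
        "next": next_ch,
        "tri": prev_ch + chars[i] + next_ch,
    }

chars, labels = to_labeled_chars("學而時習之，不亦說乎。")
features = [window_features(chars, i) for i in range(len(chars))]
```

Feature dictionaries of this kind, one per character, are the usual input format for CRF toolkits; the labels mark where a reader would place a break.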

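Keywords

Digital humanities; Active learning; Machine learning; Automatic ancient Chinese sentence segmentation; Human-computer interaction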

Parallel Abstract

This study aims to develop a sentence segmentation system for ancient Chinese texts based on active learning. Through human-machine cooperation, the system is expected to reduce the training corpus needed to build a model for automated ancient Chinese sentence segmentation and to help humanities researchers work more efficiently on sentence identification in uninterpreted texts. Two experiments were conducted for system development and evaluation. The first experiment compared automatic sentence segmentation models built by applying different algorithms and feature templates under sequential text selection and active learning text selection, in order to choose the most suitable algorithm and feature template for the system. The results show that conditional random fields combined with a three-character feature template learn effectively under active learning and are appropriate for building the active learning sentence segmentation model for ancient Chinese texts. In the second experiment, six humanities researchers were invited to use the system to segment assigned ancient Chinese texts in order to evaluate its performance. The segmentation results produced by the individual researchers using the system were compared and analyzed. Semi-structured interviews were also conducted to gain an in-depth understanding of their experience with the system and their suggestions for it. The experimental results show that the developed system could effectively learn from the humanities researchers' sentence segmentation data and steadily improve the model's predictions through human-machine cooperation.
Moreover, according to the interviews, most of the humanities researchers who participated in this study reported a positive experience with the system and indicated that its sentence segmentation prediction function could effectively assist their segmentation work. In future studies, the predictions of the active learning sentence segmentation model could be further improved by embedding a named entity model or by applying other features, such as phonological features or POS tagging of ancient Chinese. It is also hoped that the system can be developed into a digital humanities learning platform for training in ancient Chinese sentence segmentation.
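The text selection step of an active learning loop can be sketched as follows: score each unlabeled passage by how uncertain the current model is about it, and hand the most uncertain one to the annotator next. This is a minimal sketch under stated assumptions. `CharFrequencyModel` is a deliberately simple stand-in for the CRF, and least-confidence scoring is one common query strategy chosen here for illustration; the abstract does not state which selection criterion the system uses.

```python
class CharFrequencyModel:
    """Stand-in for a CRF: estimates P(break after char) from counts."""

    def __init__(self):
        self.breaks = {}
        self.total = {}

    def update(self, chars, labels):
        """Add one annotated passage (labels[i] == 1 means a break follows)."""
        for ch, y in zip(chars, labels):
            self.total[ch] = self.total.get(ch, 0) + 1
            self.breaks[ch] = self.breaks.get(ch, 0) + y

    def prob_break(self, ch):
        # Laplace smoothing, so an unseen character gets probability 0.5.
        return (self.breaks.get(ch, 0) + 1) / (self.total.get(ch, 0) + 2)


def uncertainty(model, chars):
    """Mean least-confidence score over a passage (0.5 is the maximum)."""
    if not chars:
        return 0.0
    scores = []
    for ch in chars:
        p = model.prob_break(ch)
        scores.append(1 - max(p, 1 - p))
    return sum(scores) / len(scores)


def select_most_uncertain(model, pool):
    """Return the index of the passage the model is least sure about."""
    return max(range(len(pool)), key=lambda i: uncertainty(model, pool[i]))


model = CharFrequencyModel()
model.update(list("學而時習之"), [0, 0, 0, 0, 1])
pool = [list("學而時習之"), list("有朋自遠方來")]
next_index = select_most_uncertain(model, pool)
```

In a full loop, the annotator would segment `pool[next_index]`, the passage would be passed to `update`, and the selection would repeat, in contrast to sequential selection, which simply takes passages in document order.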

Parallel Keywords

Digital humanities; Active learning; Machine learning; Automatic ancient Chinese sentence segmentation; Human-computer interaction
