Sentence segmentation and punctuation are crucial to understanding the meaning of a text. The modern notion of punctuation marks was introduced to Chinese from the West only in recent times, so classical Chinese texts typically carry no punctuation, which makes them difficult to read. The THDL (Taiwan History Digital Library) database consists of three collections. One of them, the collection of Taiwanese Land Deeds (古契書), gathers 40,428 old land documents from Taiwan, of which 10,492 are unsegmented or only partially segmented. Given the volume of text, information technology is needed to help segment these documents. After trying several existing sentence segmentation tools, we found that their results fell short of expectations, especially on documents containing special vocabulary, special formats, or Japanese kana, so we had to train reliable segmentation models ourselves. The experiments in this research center on fine-tuning the pre-trained model SikuBERT for sentence segmentation and punctuation tasks. Besides training a segmentation model on the already segmented Taiwanese Land Deeds, and to demonstrate that doing so is necessary, we also trained a segmentation model on classical texts from ctext, which cover the four traditional categories of classics, histories, masters, and collections (經史子集). We compared the two models on whole documents, on special vocabulary, and on Japanese kana. The evaluations show that the model trained on Taiwanese Land Deeds significantly outperforms the model trained on ctext texts, which implies that training distinct models for different genres of classical Chinese is still worthwhile. Classical texts other than the Taiwanese Land Deeds also need sentence segmentation and punctuation. We therefore fine-tuned SikuBERT on ctext texts for the punctuation task and combined the resulting model with the land deed segmentation model and the ctext segmentation model into a segmentation and punctuation tool that lets users segment or punctuate classical Chinese texts in batches.
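As an illustration of how already segmented documents can serve as training data, the sketch below converts a punctuated string into per-character labels, the usual input for fine-tuning a BERT-style model as a token classifier. The abstract does not specify the labeling scheme, so the binary "break after this character" labels, the set of break marks, and the helper names `to_char_labels` and `apply_labels` are all assumptions made for this example, not the authors' implementation.

```python
# Minimal sketch under stated assumptions: segmentation is framed as
# per-character binary labeling. The label scheme, the mark set, and both
# helper names are hypothetical and not taken from the paper.

# Full-width marks treated as break positions (an assumed set).
BREAK_MARKS = set(",。、;:?!")


def to_char_labels(punctuated):
    """Strip break marks and label each remaining character.

    Returns (chars, labels) where labels[i] is 1 if a break mark
    followed chars[i] in the source text, and 0 otherwise.
    """
    chars = []
    labels = []
    for ch in punctuated:
        if ch in BREAK_MARKS:
            # Mark the preceding character as a break position;
            # a mark at the very start of the text is ignored.
            if labels:
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return chars, labels


def apply_labels(chars, labels, mark="。"):
    """Rebuild segmented text by inserting `mark` after each labeled character."""
    return "".join(c + (mark if flag else "") for c, flag in zip(chars, labels))


if __name__ == "__main__":
    chars, labels = to_char_labels("學而時習之,不亦說乎。")
    print(chars, labels)
    print(apply_labels(chars, labels, mark="/"))
```

Under this framing, a segmentation model predicts one binary label per character, while a punctuation model would predict one class per punctuation mark instead. At inference time a helper like `apply_labels` turns the predicted labels back into segmented text.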