From Text to Data: Extracting Posting Data from Chinese Local Gazetteers




彭維謙(Wai-Him Pang);程卉(Hui Cheng);陳詩沛(Shih-Pei Che)

地方志 ; 中國史 ; 數位人文 ; 資訊擷取 ; 正規表達式 ; local gazetteers ; Chinese history ; digital humanities ; data extraction ; regular expressions



Issue 1 (2018/04/01)

79 - 125

This paper introduces a semi-automatic text tagging interface that helps historians efficiently gather posting records from Chinese local gazetteers (difangzhi 地方志) in the format "who, when, which posting." By turning texts into tabular data, the interface aims to lay the basis for analyzing Chinese local gazetteers on a large scale. Although local gazetteers from different locations all follow a general pattern when recording posting data, the sheer number of gazetteers means that they differ in the details. It is therefore infeasible to ask programmers to extract the posting data with a one-size-fits-all computer program. The tagging interface instead provides a simple user interface with built-in patterns for extracting the subjects' names, posting titles, dynasties, posting times, basic addresses, and entry methods. This allows users to tag most of the data in a text quickly and then proofread the tagging result themselves to correct any mistakes. The interface also lets users adjust the extraction patterns for each text, so that posting data can be extracted accurately from local gazetteers with distinct patterns.
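
As a rough illustration of pattern-based extraction of this kind, the following minimal Python sketch applies one adjustable regular expression to a few lines of text and collects the matches as tabular records. It is not the interface described in the paper. The sample lines are invented, and the line layout (name, native place, entry method, reign year), the pattern, and the names `PATTERN`, `SAMPLE`, and `extract_postings` are all assumptions made for this example. The posting title is assumed to come from a section heading and is passed in by the caller.

```python
import re

# Hypothetical pattern for lines shaped like "<name> <place>人 <entry> <reign><year>年任".
# Real gazetteers vary, so this pattern is meant to be adjusted for each text.
PATTERN = re.compile(
    r"(?P<name>[\u4e00-\u9fff]{2,4})\s+"          # subject's name
    r"(?P<address>[\u4e00-\u9fff]+?)人\s+"        # native place ("basic address")
    r"(?P<entry>進士|舉人|貢生|監生)\s+"          # entry method
    r"(?P<reign>[\u4e00-\u9fff]{2})"              # reign title
    r"(?P<year>[元一二三四五六七八九十]+)年任"     # year of appointment
)

# Invented sample lines, for illustration only.
SAMPLE = """張三 山西太原人 進士 萬曆三年任
李四 直隸保定人 舉人 萬曆十二年任
王五 失考"""


def extract_postings(text, title, pattern=PATTERN):
    """Return (records, unmatched): matched lines as dicts, the rest for manual review."""
    records, unmatched = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = pattern.search(line)
        if match:
            record = match.groupdict()
            record["title"] = title  # posting title supplied by the caller
            records.append(record)
        else:
            unmatched.append(line)  # left for proofreading or a revised pattern
    return records, unmatched


if __name__ == "__main__":
    rows, leftovers = extract_postings(SAMPLE, title="知縣")
    for row in rows:
        print(row)
    print("needs review:", leftovers)
```

Lines that the pattern does not match are returned separately rather than discarded, which mirrors the proofreading step described above: a user can inspect them and either correct them by hand or revise the pattern for that particular gazetteer.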

