Translated Titles

From Text to Data: Extracting Posting Data from Chinese Local Gazetteers




彭維謙 (Wai-Him Pang); 程卉 (Hui Cheng); 陳詩沛 (Shih-Pei Chen)

Key Words

地方志 ; 中國史 ; 數位人文 ; 資訊擷取 ; 正規表達式 ; local gazetteers ; Chinese history ; digital humanities ; data extraction ; regular expressions



Volume or Term/Year and Month of Publication

Issue 1 (2018/04/01)

Page #

79 - 125

Content Language


Chinese Abstract


English Abstract

This paper introduces a semi-automatic text-tagging interface that helps historians efficiently gather posting records from Chinese local gazetteers (difangzhi 地方志) in the form "who, when, which posting." By turning text into tabular data, the interface aims to lay the groundwork for large-scale analysis of Chinese local gazetteers. Although gazetteers from different localities follow a general pattern when recording posting data, they differ in detail because there are so many of them. It is therefore infeasible to ask programmers to extract the posting data with a one-size-fits-all computer program. The tagging interface instead provides a simple user interface with built-in patterns for extracting subjects' names, posting titles, dynasties, posting times, basic addresses, and entry methods. Users can thus tag most of the data in a text quickly and then proofread the tagging result themselves to correct mistakes. The interface also lets users adjust the extraction patterns for each text, so that posting data can be extracted accurately from local gazetteers with distinct patterns.
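
The pattern-based extraction described in the abstract can be illustrated with a minimal sketch. This is not the interface's actual implementation: the regular expression, the field names, the function `extract_postings`, and the sample text are all assumptions invented for illustration, and real gazetteer entries vary far more than this single pattern allows.

```python
import re

# Hypothetical pattern for entries shaped like
# "name,place人,entry method,reign + year 年任".
# Real gazetteers differ in layout, so this pattern would need adjusting per text.
POSTING_PATTERN = re.compile(
    r"(?P<name>[^,。]{2,4}),"
    r"(?P<address>[^,。]+?)人,"
    r"(?P<entry_method>進士|舉人|貢生|監生),"
    r"(?P<reign>康熙|雍正|乾隆|嘉慶|道光)"
    r"(?P<year>[元一二三四五六七八九十]+)年任"
)


def extract_postings(text, title, dynasty="清"):
    """Return one dict per matched entry: who, when, which posting."""
    rows = []
    for match in POSTING_PATTERN.finditer(text):
        row = match.groupdict()
        # The posting title is assumed to come from the section heading,
        # so it is supplied by the caller rather than matched in the entry.
        row["title"] = title
        row["dynasty"] = dynasty
        rows.append(row)
    return rows


# Invented sample text, not quoted from any gazetteer.
sample = "張三,山西太原人,進士,乾隆五年任。李四,直隸大興人,舉人,乾隆十二年任。"

for row in extract_postings(sample, title="知州"):
    print(row)
```

Each returned dictionary corresponds to one row of the tabular output. Entries that the pattern fails to match are simply skipped, which is why the manual proofreading and per-text pattern adjustment mentioned in the abstract remain necessary.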

Topic Category

Humanities > General Humanities
Basic and Applied Sciences > Information Science

References

  1. 清.王道亨修,張慶源纂(1788)。德州志(清乾隆五十三年刻本)。北京:中國方志庫。
  2. 清.王贈芳、王鎮修,成瓘、冷烜纂(1840)。濟南府志(清道光二十年刻本)。北京:中國方志庫。
  3. 清.沈世銓修,李勗纂(1899)。惠民縣志(清光緒二十五年柳堂校補刻本)。北京:中國方志庫。
  4. 清.彭君穀修,鍾應元纂(1869)。新會縣續志(清同治九年刻本)。北京:中國方志庫。
  5. 清.盧承業原編,馬振文增修(1915)。偏關志(清道光間刊民國四年鉛印本)。北京:中國方志庫。
  6. Bol, P. K., Ge, J., Henderson, M., Lavely, B., Man, Z., Skinner, G. W., & Tang, X. (2001). China Historical Geographical Information System (CHGIS). Retrieved from http://www.fas.harvard.edu/~chgis
  7. Bol, P., Chen, S.-P., & Yamangil, E. (2012). A RegEx machine. New Directions in Analyzing Text as Data, Cambridge, MA.
  8. Chen, S.-P., Schäfer, D., & Che, Q. (n.d.). Local Gazetteers Project. Retrieved from https://www.mpiwg-berlin.mpg.de/en/research/projects/departmentSchaefer_SPC_MS_LocalGazetteers
  9. Goyvaerts, J. (2003). Regular-Expressions.info. Retrieved from http://www.regular-expressions.info/
  10. Harvard University, Academia Sinica, & Peking University. (n.d.). China Biographical Database. Retrieved from https://projects.iq.harvard.edu/cbdb/
  11. Ho, H. I. (2015). MARKUS: A fundamental semi-automatic markup platform for classical Chinese. Digital Humanities Conference, Sydney, Australia.
  12. Ho, H. I. (n.d.). MARKUS. Retrieved from http://dh.chinese-empires.eu/markus/
  13. Import.io Inc. (n.d.). import.io. Retrieved from https://www.import.io
  14. Regular Expression. (n.d.). In Wikipedia. Retrieved November 29, 2017 from https://en.wikipedia.org/wiki/Regular_expression
  15. Sync RO Soft SRL. (n.d.). oXygen. Retrieved from https://www.oxygenxml.com
  16. Text Encoding Initiative Consortium. (2017). TEI: P5 Guidelines. Retrieved from http://www.tei-c.org/P5/
  17. World Wide Web Consortium. (2016). Extensible markup language (XML). Retrieved from https://www.w3.org/XML/
  18. Yamangil, E. (n.d.). CBDBRegexMachine. Retrieved from https://projects.iq.harvard.edu/cbdb/cbdbregexmachine
  19. 中央研究院地理資訊科學研究專題中心(n.d.)。中國大陸各省地方志書目查詢系統。取自http://webgis.sinica.edu.tw/place/
  20. 王德毅編、李榮村編、潘柏澄編(1979)。元人傳記資料索引。臺北:新文豐出版公司。
  21. 何浩洋(2014)。MARKUS:古籍文本半自動標記平臺。2014第五屆數位典藏與數位人文國際研討會,臺北,臺灣。
  22. 何浩洋(n.d.)。古籍半自動標記平臺MARKUS。取自http://www.dhtaiwan.org/detail.do?action=interperspective&id=1
  23. 昌彼得編、王德毅編、程元敏編、侯俊德編(1974)。宋人傳記資料索引。臺北:鼎文書局。
  24. 法鼓文理學院(2008)。佛學規範資料庫(BSADB)。取自http://authority.dila.edu.tw/。doi:10.6741/DILA.DB_BSADB/Text
  25. 國立中央圖書館(1965)。明人傳記資料索引。臺北:國立中央圖書館。
Times Cited
  1. 徐力恆,包弼德,王宏甦(2020)。用於中國歷史研究的網路基礎設施:對相關探索的建議和展望。數位典藏與數位人文,6,1-35。
  2. 林敬智(2020)。道教開光儀式疏文之文本探勘與數位人文探索:以府城延陵道壇為例。圖資與檔案學刊,97,44-75。