
人工智慧在中文歷史文獻判讀領域應用初探:以國立故宮博物院典藏為例

A Preliminary Study on the Application of Artificial Intelligence in the Interpretation of Chinese Historical Documents: A Case Study of National Palace Museum Collection

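Abstract


The National Palace Museum (hereafter NPM) holds nearly 700,000 objects in its collection. A collection of this size is a major challenge for digitization, and interpreting and applying the digitized material afterward is a further hurdle for researchers. Since 2017, the Digital Archives Section of the Department of Rare Books and Historical Documents has carried out the "Subordinate Program of Digitalizing Crucial Historical Documents in High Resolutions". The program plans to complete nearly 400,000 pages of digital files; together with the files completed in previous years, the total comes to nearly 2.4 million pages. The long schedule and the large body of digital assets prompted the author to consider how new technology could improve value-added applications of documents that have already been scanned. The important first step in document digitization is establishing full-text search. Producing the scanned images is already time-consuming, and recognizing their content manually consumes even more resources. The program therefore introduces artificial intelligence to perform text recognition and metadata-assisted classification while the images are being scanned, in order to speed up digitization. This also leaves room for later value-added applications. One example is linking geographic information in the documents to a GIS (Geographic Information System), so that all Qing archives and palace memorials can be retrieved by place name. Another is automatically linking the persons mentioned in the documents to the Qing archives name authority database, tagging the official titles they held at the time, and automatically building networks of shared regional ties or social connections, which would greatly aid researchers of Qing history. Although artificial intelligence cannot yet recognize and punctuate the documents perfectly on its own, the academic community has already explored a number of cases. This paper offers some observations on that basis, in the hope of prompting further discussion and advancing the digitization of the museum's Qing historical documents.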

Parallel Abstract


The National Palace Museum (NPM) holds a world-class collection of nearly 700,000 objects. Its sheer size is not only a great challenge for digitization but also a high threshold for researchers in subsequent interpretation and application. In 2017, the Department of Rare Books and Historical Documents submitted the "Subordinate Program of Digitalizing Crucial Historical Documents in High Resolutions" to bid for the Executive Yuan's Forward-looking Infrastructure Development Program. The department's main goal was to digitize at least 400,000 pages, which, together with the files completed in earlier years, brings the total to nearly 2.4 million pages of digital files. The long schedule and the large body of digital assets have prompted us to think about ways to leverage new technologies and optimize the value-added applications of completed digital scans. A major first milestone in digitizing documents is the creation of full-text search. Producing the digital scans is already time-consuming, and recognizing their content manually is even more resource-intensive. To address this, artificial intelligence has been introduced to perform text recognition and metadata-assisted classification as the documents are scanned, speeding up the digitization process. This also opens up possibilities for subsequent value-added applications. One is connecting geographic data in the documents to a GIS (Geographic Information System) to facilitate the retrieval of all Qing Dynasty archives by place name. Another is automatically linking the persons named in the documents to a Qing Dynasty archival name authority database, tagging their official titles, and automatically building networks of shared regional ties or social connections, making research on Qing history more convenient.
Although it is still difficult for artificial intelligence to recognize and punctuate these documents perfectly, a number of case studies exist in the academic world. This paper offers some observations on that basis, in the hope of facilitating the digitization of these documents.

