  • Degree thesis


Information Technology for Historical Document Analysis

Advisor: 項潔


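This thesis explores how, in the post-digital-archive era, when an unprecedented volume of historical material has been digitized, information technology can take part in the process of historical research and help historians make effective use of large-scale collections of sources.

The thesis first introduces two important collections of Taiwanese historical documents held in the Taiwan History Digital Library (THDL), together with the retrieval system and observation tools we developed for them. The two collections used in this study, the Ming and Qing Taiwan Administrative Archives (『明清臺灣行政檔案』) and the Old Land Deeds (『古契約文書』), together comprise 73,287 documents whose full text exceeds 54 million characters. We describe in detail their content, their provenance, and their importance to research on the history of Taiwan. We then introduce the retrieval tools that THDL developed for these two collections and the concept of regarding query returns as a sub-collection, that is, treating the set of retrieved documents as a meaningful whole. We also describe how tools such as post-query classification and term-frequency analysis examine query results on behalf of historians and guide them toward connections that may be implicit among the documents.

The thesis then proposes two methods, feature analysis of document collections and construction of relations among historical documents, to further extend the ways in which historians can use their sources. Feature analysis treats a large body of documents as an environment for observing features: each document serves as evidence (called support) that a feature occurs, where a feature is a person, place, topic, or similar item that the historian wishes to observe. For the sub-collection the historian is currently interested in, the method analyzes how often each feature occurs in that sub-collection (called the feature quantity) and thereby guides the historian toward the features closely associated with the sub-collection and toward the nature of that association. We formulate this method as a mathematical model and apply it to both the Ming and Qing Taiwan Administrative Archives and the Old Land Deeds, obtaining interesting observations that would be difficult to make by manual reading.

Relation construction refers to using information technology to uncover implicit relations among documents in an environment where large numbers of digitized sources are gathered together. The thesis presents three kinds of relations as examples: citation relations among the Ming and Qing archives, relations among land deeds, and content-similarity relations. We explain how each is constructed and what results it yields, with a detailed methodological discussion of the citation relations. Through our methods we found 6,802 citation pairs among the 37,836 documents of the Ming and Qing Taiwan Administrative Archives, 3,910 groups of deed relations among the 35,451 documents of the Old Land Deeds, and 3,973 and 3,570 content-similarity groups in the two collections respectively. We also show that relation construction not only builds up the context linking the documents but can also lead historians to new discoveries. To support this argument we present the 1,101 citation graphs and 2,219 land transfer graphs formed from these relations, as well as an application of content analysis to groups of template documents.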
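The support counting behind feature analysis can be illustrated with a minimal sketch. It is not the mathematical model developed in the thesis. It assumes each document has already been reduced to a set of feature labels, and it ranks features by the share of their collection-wide support that falls inside the sub-collection, which is an assumption made here for illustration. The function name `feature_support` and the toy data are hypothetical.

```python
from collections import Counter


def feature_support(docs, sub_ids):
    """Count, for each feature, how many documents support it.

    docs    -- dict mapping a document id to the set of features it contains
    sub_ids -- set of document ids forming the sub-collection of interest

    Returns a list of (feature, support_in_sub, support_in_all) tuples,
    ranked by the share of a feature's support that lies in the
    sub-collection (an illustrative ranking, not the thesis's model).
    """
    total = Counter()  # support over the whole collection
    sub = Counter()    # support inside the sub-collection
    for doc_id, features in docs.items():
        for feature in set(features):
            total[feature] += 1
            if doc_id in sub_ids:
                sub[feature] += 1
    ranked = sorted(sub, key=lambda f: (-sub[f] / total[f], -sub[f], f))
    return [(f, sub[f], total[f]) for f in ranked]


# Toy collection: four documents, each tagged with a person and a place.
docs = {
    "d1": {"person:A", "place:X"},
    "d2": {"person:A", "place:Y"},
    "d3": {"person:B", "place:X"},
    "d4": {"person:B", "place:Y"},
}
result = feature_support(docs, {"d1", "d2"})
for feature, in_sub, in_all in result:
    print(feature, in_sub, in_all)
```

In this toy example the sub-collection {d1, d2} contains all of the support for `person:A` but only half of the support for each place, so `person:A` is ranked first as the feature most closely tied to the sub-collection.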


This thesis proposes two IT methods to help historians utilize digitized historical documents. The availability of large quantities of historical documents that can be searched and retrieved has become a challenge for historians, since the traditional practice of carefully reading through a small number of documents is no longer sufficient. In this thesis we first give an overview of THDL, the Taiwan History Digital Library, a full-text digital library of primary historical documents about Taiwan. The documents in THDL, currently numbering 73,287 and containing over 54,000,000 words, are the main experimental materials of this thesis. We then introduce the feature analysis method, which places a collection of historical documents in an observation environment so that they can be studied collectively rather than as individual documents. Feature analysis takes as its input a sub-collection, meaning a set of documents related to a research topic that the user is currently interested in, and analyzes the features shared by these documents. By calculating the support for each feature (the number of documents that serve as evidence of the feature's occurrence), this method discovers features that are highly related to a sub-collection. We have developed a mathematical model for this method. We have also applied it to two of the corpora in THDL and obtained unexpected and interesting observations. We then present several relation discovery methods that find relationships among historical documents in a large collection. We give three examples of relation discovery carried out on the Imperial Court documents and the Taiwanese land deeds: citation relations, land transaction relations, and template relations.
Through our methods, we have discovered 6,802 citation relations among the 37,836 Imperial Court documents selected from 280 sources, 3,910 transaction relations among the 35,451 land deeds from 117 sources, and 105 templates, each of which was used to create documents following a specific format. We argue that relation discovery not only helps historians consider more angles while reading the documents, but can also lead to new findings. The citation relations have been transformed into 1,101 successive citation graphs, each of which reveals how a historical event evolved through the correspondence between a Qing emperor and his officials. The transaction relations have likewise been transformed into 2,219 land transitivity graphs, some of which indicate land development activities that have never been studied before.
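One plausible way to move from pairwise relations to graphs is to group documents that are linked, directly or indirectly, into connected components. The sketch below does this with a small union-find structure. The abstract does not say how the citation graphs were actually built, so the grouping rule, the function name `citation_graphs`, and the document ids are all assumptions made for illustration.

```python
from collections import defaultdict


def citation_graphs(pairs):
    """Group documents into connected components of the citation relation.

    pairs -- iterable of (citing_id, cited_id) tuples

    Returns a sorted list of components, each a sorted list of document ids.
    Connected components are an illustrative assumption, not necessarily
    the construction used in the thesis.
    """
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root
        # while shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for citing, cited in pairs:
        root_cited = find(cited)
        root_citing = find(citing)
        parent[root_citing] = root_cited  # merge the two groups

    groups = defaultdict(set)
    for doc in list(parent):
        groups[find(doc)].add(doc)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical citation pairs: doc3 cites doc2, which cites doc1.
pairs = [("doc2", "doc1"), ("doc3", "doc2"), ("doc5", "doc4")]
graphs = citation_graphs(pairs)
print(graphs)
```

Here the three pairs collapse into two groups, one chain of three documents and one pair, which mirrors how several thousand pairwise relations can reduce to a much smaller number of graphs.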




