

Automatic Classification of the Taiwan-Related Archives in the Qing Dynasty by Classifying Documents into Historical Events

Advisor: 項潔 (Jieh Hsiang)


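The Taiwan History Digital Library (THDL) is a full-text database built to serve researchers of Taiwanese history.
Plotting the documents in the database as a "document distribution by year" chart, with the year on the horizontal axis and the number of documents on the vertical axis, shows that the peaks of the trend lines correspond closely to the dates of major historical events. This raises the question of which historical events occurred at each peak. To answer it, the documents in the database's Qing-dynasty Taiwan administrative archives must first be classified automatically into historical events.
This study collects the events recorded in 「台灣小事典」 and 「臺灣歷史辭典」 and, after preliminary sorting and classification, selects forty-one historical events. The automatic classification method first identifies, by manual search, the "initial keywords" that represent each event. It then sets a value of the association-rule confidence as a threshold and uses it to filter the "candidate feature keywords" gathered from several name authority databases. Next, it restricts the search to the years in which the event took place, queries THDL with the union of the event's initial keywords and feature keywords, and judges the returned documents to be related to the event.
The system classified 11,826 documents in total, 32% of the Qing-dynasty Taiwan administrative archives. The remaining 68% are routine administrative memorials, such as memorials concerning the Six Ministries, memorials on the appointment and dismissal of officials, local government reports on rice and grain prices, and customs duty reports.
For three events, the Tai Chao-chuen incident (戴潮春事件), the Mudan incident (牡丹社事件), and the First Sino-Japanese War (清日甲午戰爭), this thesis selects the documents dated to the years of each event, reads each one manually, and judges whether it is related to the event. These judgments serve as ground truth against which the documents returned by the automatic classification method are compared, so that recall and precision can be computed to evaluate the method.
When t→q is 0.2, the recall for the Mudan incident, the First Sino-Japanese War, and the Tai Wan-sheng incident (戴萬生事件) is 0.7241, 0.9941, and 0.8928 respectively, and the precision is 0.6117, 0.6175, and 0.6735 respectively. Because historians retrieving documents prefer to obtain all the documents first and then read and analyze them one by one (a recall-oriented approach), a classification result with an average recall above 80% and an average precision above 60% is considered acceptable.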
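The classification procedure described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the thesis's implementation. The confidence of t→q is assumed to be the standard association-rule confidence, that is, the share of documents containing a candidate term t that also match the initial-keyword query q. The abstract does not say whether the threshold is inclusive or on which documents the confidence is measured, so the sketch uses `>=` and the year-restricted documents. All function names, keywords, and documents below are hypothetical.

```python
from typing import Iterable

# Assumed document shape: {"id": str, "year": int, "text": str}
Doc = dict


def matches_any(text: str, keywords: Iterable[str]) -> bool:
    """Return True if the text contains at least one of the keywords."""
    return any(k in text for k in keywords)


def confidence(term: str, initial_keywords: set[str], docs: list[Doc]) -> float:
    """Standard association-rule confidence of term -> q.

    Assumption: q means "the document matches an initial keyword".
    """
    with_term = [d for d in docs if term in d["text"]]
    if not with_term:
        return 0.0
    both = [d for d in with_term if matches_any(d["text"], initial_keywords)]
    return len(both) / len(with_term)


def classify_event(docs, initial_keywords, candidates, years, threshold=0.2):
    """Return the ids of documents judged to belong to one event."""
    # Restrict the search to the years in which the event took place.
    in_years = [d for d in docs if d["year"] in years]
    # Keep candidate keywords whose confidence reaches the threshold
    # (assumption: measured on the year-restricted documents).
    features = {
        t for t in candidates
        if confidence(t, initial_keywords, in_years) >= threshold
    }
    # Query with the union of initial keywords and feature keywords.
    query = set(initial_keywords) | features
    return [d["id"] for d in in_years if matches_any(d["text"], query)]


# Hypothetical toy corpus, for illustration only.
docs = [
    {"id": "d1", "year": 1874, "text": "alpha incident report"},
    {"id": "d2", "year": 1874, "text": "general X leads troops, alpha incident"},
    {"id": "d3", "year": 1874, "text": "general X requests supplies"},
    {"id": "d4", "year": 1874, "text": "rice price report by clerk Y"},
    {"id": "d5", "year": 1874, "text": "clerk Y tariff report"},
    {"id": "d6", "year": 1860, "text": "alpha incident rumor"},
]
initial = {"alpha incident"}
candidates = {"general X", "clerk Y"}

result = classify_event(docs, initial, candidates, years=range(1874, 1875))
print(result)
```

In this toy corpus, "general X" co-occurs with the initial keyword in one of its two documents, so it passes the 0.2 threshold and pulls in d3, which mentions no initial keyword. "clerk Y" never co-occurs and is dropped, and d6 is excluded because it falls outside the event's years.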


The Taiwan History Digital Library (THDL) is a full-text database built for researchers of Taiwanese history. Plotting the number of THDL documents per year (horizontal axis: year A.D.; vertical axis: number of documents) reveals that the peaks of the graph coincide closely with major historical events. To explore which historical events lie behind each peak, a method is needed to classify the documents into historical events.
After organizing, classifying, and removing unnecessary entries from the Taiwan historical events listed in two dictionaries, forty-one Taiwan-related historical events of the Qing dynasty (from A.D. 1684 to A.D. 1895) were chosen as the experimental material. To classify the documents into the events, the “initial keywords” of each event were first selected manually. Next, the confidence of an association rule, written t→q, was used to decide whether a candidate “feature keyword” should be selected. If a document contains the event's “initial keywords” or a selected feature keyword whose t→q exceeds the threshold, and the document was written in the years in which the event took place, the document is considered to belong to that event and is classified into it.
In total, 11,826 documents (32% of the archive) were classified into historical events. The remaining 68% of the archive consists of routine administrative documents, for example the appointment and dismissal of government officers, crop price reports, and tariff reports.
To evaluate the performance of the automatic classification method, the documents written in the years around the outbreak of three events were selected: (1) the Tai Chao-chuen incident, (2) the Taiwan Expedition of 1874 (a.k.a. the Mudan incident), and (3) the First Sino-Japanese War. Each document was read manually and judged, one by one, as to whether it belongs to the event; these judgments form the ground truth. The results of the automatic classification were then compared with the ground truth to calculate recall and precision. When the parameter t→q equals 0.2, the recall for the “Tai Chao-chuen incident”, the “Taiwan Expedition of 1874”, and the “First Sino-Japanese War” is 0.7241, 0.9941, and 0.8928 respectively, and the precision is 0.6117, 0.6175, and 0.6735 respectively. Because historians prefer to retrieve all the related documents first and then read them one by one (a recall-oriented approach), an automatic classification method with an average recall above 80% and an average precision above 60% is acceptable.
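The evaluation above can be illustrated with a short Python sketch. It assumes the usual set-based definitions of recall and precision, which the abstract does not spell out, and uses hypothetical document ids for the toy comparison. The last part averages the per-event figures exactly as reported in the abstract.

```python
from statistics import mean


def recall_precision(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Set-based recall and precision (assumed standard definitions)."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical toy example: manual ground truth vs. automatic result.
ground_truth = {"d1", "d2", "d3", "d4"}
auto_classified = {"d2", "d3", "d4", "d5", "d6"}
r, p = recall_precision(auto_classified, ground_truth)
print(f"toy recall={r:.2f}, toy precision={p:.2f}")

# Per-event figures reported in the abstract for t→q = 0.2.
reported_recall = [0.7241, 0.9941, 0.8928]
reported_precision = [0.6117, 0.6175, 0.6735]
avg_recall = mean(reported_recall)
avg_precision = mean(reported_precision)
print(f"average recall={avg_recall:.4f}, average precision={avg_precision:.4f}")
```

The averages of the reported figures come out to roughly 0.87 for recall and 0.63 for precision, consistent with the "over 80%" and "over 60%" figures quoted above.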




