《清實錄》人名擷取自動化

Automated Annotation of Person Name of the Veritable Records of the Qing Dynasty

Advisor: 項潔

Abstract


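Of everything recorded in historical sources, people are among the most representative and research-worthy subjects, so tagging person names is especially important in digitized historical texts. The Veritable Records of the Qing Dynasty (《清實錄》) run from Taizu to Dezong in twelve parts totaling 4,484 juan (卷), far too many for anyone to mark every name by paging through them manually. Qing-era names are also much less regular than modern Chinese names. Nearly all modern Chinese names can be extracted with the Hundred Family Surnames (百家姓), whereas the Veritable Records contain, besides Han Chinese names, Manchu names, names of foreign missionaries, and names composed of numerals. No single rule covers all of these, which makes the task considerably harder. The aim of this thesis is to tag the person names of the Veritable Records in the metadata automatically, by program. The thesis first describes how to use pointwise mutual information (PMI) to segment the text of the Veritable Records into terms, and then how to apply rules to find candidate names. Once names have been segmented correctly, the next stage is to decide which of the many two-character terms (bigrams) are likely to be person names, which requires name validation. The thesis then presents the complete automated pipeline: segmentation is used first to raise recall, and name validation is then used to raise the precision of the candidate names. With the method fixed, name recognition is run on each reign covered by the Veritable Records, producing the candidate names listed in the appendix.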
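As a rough illustration of the PMI step mentioned in the abstract, the sketch below scores adjacent character pairs with the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The function name `bigram_pmi`, the probability estimates from raw counts, and the toy string are assumptions made for this example. The abstract does not state the thesis's actual thresholds, smoothing, or segmentation rules, so none are shown.

```python
import math
from collections import Counter


def bigram_pmi(text):
    """Return {bigram: PMI} for every adjacent character pair in `text`.

    Probabilities are plain relative frequencies (an assumption of this
    sketch); no smoothing or punctuation handling is applied.
    """
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())

    scores = {}
    for pair, count in pairs.items():
        p_xy = count / n_pairs
        p_x = chars[pair[0]] / n_chars
        p_y = chars[pair[1]] / n_chars
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores


# Invented toy string for illustration only; not quoted from the 《清實錄》.
toy = "鰲拜奏言鰲拜又言鰲拜往"
top = sorted(bigram_pmi(toy).items(), key=lambda kv: kv[1], reverse=True)[:3]
for pair, score in top:
    print(pair, round(score, 3))
```

Pairs whose characters occur together more often than their individual frequencies would predict receive higher scores, which is why a high-PMI bigram is a plausible candidate for a word or name boundary.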

Keywords

Veritable Records of the Qing Dynasty (清實錄); automation; PMI; name validation; recall; precision; person name recognition

Parallel Abstract


Across all historical material, people are a highly representative subject with rich research value, so it is important to tag person names correctly in the electronic versions of historical texts. The Veritable Records of the Qing Dynasty, spanning the reigns from Taizu to Dezong, comprise twelve parts with a total of 4,484 chapters (juan), so marking every person name by hand is not feasible. In addition, names in the Qing records follow far fewer regularities than modern Chinese names. Almost all modern Chinese names can be found using the Hundred Family Surnames, but the names in the Veritable Records include not only Han Chinese names but also Manchu names, names of foreign missionaries, and names made up of numerals. No single rule can tag all of these correctly. The main purpose of this thesis is therefore to tag the person names in the Veritable Records in the metadata automatically, by program. The thesis introduces how to use pointwise mutual information (PMI) to segment the text of the Veritable Records into terms and then identify candidate names with rules. Once the names have been segmented correctly, the next stage is to pick out the likely person names from the large pool of bigrams, which requires name validation. The thesis also presents the whole automated algorithm: segmentation is used first to improve recall, and name validation is then used to improve precision. With this method, person names are identified for each reign, and the resulting candidate names from the Veritable Records are given in the appendix.
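The recall and precision trade-off described in the abstract can be made concrete with a small sketch. The function name `precision_recall` and the placeholder labels are assumptions for illustration. The abstract does not describe how the thesis built its gold-standard name list or report its scores, so no real results are shown.

```python
def precision_recall(candidates, gold):
    """Compare extracted candidate names against a gold-standard set.

    precision = correct candidates / all candidates
    recall    = correct candidates / all gold names
    """
    candidates, gold = set(candidates), set(gold)
    hits = len(candidates & gold)
    precision = hits / len(candidates) if candidates else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Placeholder labels, not real names: a loose segmentation pass proposes
# many candidates (high recall), and validation removes the false ones.
gold = {"name_a", "name_b", "name_c"}
after_segmentation = {"name_a", "name_b", "name_c", "noise_1", "noise_2", "noise_3"}
after_validation = {"name_a", "name_b", "name_c", "noise_1"}

print(precision_recall(after_segmentation, gold))
print(precision_recall(after_validation, gold))
```

In this toy case recall stays at 1.0 across both stages while precision rises once validation discards false candidates, which mirrors the two-stage ordering the abstract describes.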


Cited By


彭維謙 (2013). 不同脈絡中的歷史文本之自動分析：以《資治通鑑》、《冊府元龜》及《正史》為例 [Master's thesis, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2013.02636
高欣愷 (2013). 歷史文件自動地名標註：以《清實錄》為例 [Master's thesis, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2013.00182
