Applying the Word-Clip Algorithm to Proper Noun Recognition:
A Case Study of Historical Documents

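The Chinese lexicon is an open set: at present, no dictionary or method can enumerate every Chinese word. When documents from different domains are processed, domain-specific vocabulary and proper nouns often cause recognition errors. Current approaches fall roughly into three types: rule-based methods, which rely on hand-written rules; corpus-based methods, which rely mainly on building a lexicon; and machine-learning methods. Most approaches are lexicon-based, but a complete lexicon is hard to build. The aim of this thesis is to provide a proper noun recognition method that does not require building a lexicon. We propose the word-clip algorithm for this task. A word-clip is a combination of a "preceding context", a "word head", a "word tail", and a "following context". The main idea is to exploit certain writing habits and the coupling between characters and words in order to locate proper nouns. Given a set of sample words, the algorithm first finds the word-clips associated with them, then uses those word-clips to find candidate words similar to the samples, and then iterates, repeatedly generating new word-clips and new candidate words. Our experimental data are historical documents: 33,025 files from the Ming-Qing archives and 21,575 files of old land deeds. On the Ming-Qing archives, person name recognition reached a precision of 56.1% at a recall of 77.1%, and place name recognition reached a precision of 87.0% at a recall of 87.9%. On the old land deeds, person name recognition reached a precision of 45.6% at a recall of 72.9%, and place name recognition reached a precision of 77.6% at a recall of 80.3%.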

Keywords

word-clip; candidate word; proper noun recognition

Parallel Abstract

Chinese characters can in principle be combined into a countless number of words, which no existing method or dictionary can completely enumerate. This leads to erroneous detections and misses when attempting to identify proper nouns (PNs) in a document. In this thesis, we propose a method based on the notion of a word-clip to identify proper nouns in documents from a specific domain. Methods for PN recognition fall into three categories: rule-based methods, corpus-based methods, and machine-learning methods. Corpus-based methods are the most widely used, but they usually require building a large dictionary, which is where the bulk of the work lies. The word-clip method does not need a dictionary, which makes our algorithm more efficient. The main concept of the word-clip method is to exploit existing relationships between a PN and the text surrounding it. For example, the abbreviation "Mr." is usually followed by the name of a person (with a few exceptions such as "Mr. President"). A typical word-clip is thus formed by combining a "leading phrase", a "PN prefix", a "PN postfix", and an "ending phrase". Our algorithm uses a set of initial sample PNs plus a set of training documents to generate word-clips. These word-clips are then used to identify new PNs for the next training cycle, and the process is iterated to generate candidate PNs. We tested our method on two large sets of historical documents: a set of 33,025 court documents from the Ming and Qing Dynasties, and a set of 21,575 old land deeds. For the former, we generated 74,825 person names with a precision of 56.1% and a recall of 77.1%, and 6,306 place names with a precision of 87.0% and a recall of 87.9%. For the latter, we generated 28,358 person names with a precision of 45.6% and a recall of 72.9%, and 4,132 place names with a precision of 77.6% and a recall of 80.3%.
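The iterative procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation, and it rests on several simplifying assumptions: each context is a single character, a word-clip is split into a (leading character, name head) pair and a (name tail, ending character) pair that may be recombined, candidates are 2 to 3 characters long, and no scoring or filtering of candidates is applied. The function names (`harvest_clips`, `find_candidates`, `word_clip_bootstrap`) and the toy text are hypothetical and chosen only for illustration.

```python
from typing import Iterable

# Assumption: a clip half is (context character, boundary character of the name).
Clip = tuple[str, str]


def harvest_clips(text: str, names: Iterable[str]) -> tuple[set[Clip], set[Clip]]:
    """Collect (leading char, name head) and (name tail, ending char) pairs
    around every occurrence of every known name in the text."""
    left_clips: set[Clip] = set()
    right_clips: set[Clip] = set()
    for name in names:
        start = text.find(name)
        while start != -1:
            end = start + len(name)
            # Only occurrences with context on both sides yield clips.
            if start > 0 and end < len(text):
                left_clips.add((text[start - 1], name[0]))
                right_clips.add((name[-1], text[end]))
            start = text.find(name, start + 1)
    return left_clips, right_clips


def find_candidates(text: str, left_clips: set[Clip], right_clips: set[Clip],
                    min_len: int = 2, max_len: int = 3) -> set[str]:
    """Return substrings whose left side matches a left clip and whose
    right side matches a right clip (lengths are an assumed range)."""
    found: set[str] = set()
    for start in range(1, len(text)):
        if (text[start - 1], text[start]) not in left_clips:
            continue
        for length in range(min_len, max_len + 1):
            end = start + length
            if end < len(text) and (text[end - 1], text[end]) in right_clips:
                found.add(text[start:end])
    return found


def word_clip_bootstrap(text: str, seeds: Iterable[str], max_rounds: int = 5) -> set[str]:
    """Alternate between generating clips from known names and generating
    candidate names from clips until no new candidate appears."""
    names = set(seeds)
    for _ in range(max_rounds):
        left_clips, right_clips = harvest_clips(text, names)
        new_names = find_candidates(text, left_clips, right_clips) - names
        if not new_names:
            break
        names |= new_names
    return names


# Invented toy text, not taken from the thesis corpora.
TEXT = "知縣王大明奏。知縣李小華稟。知縣李小明奏。巡撫李小明稟。巡撫李大華稟。巡撫張三奏。"
SEEDS = {"王大明", "李小華"}
print(sorted(word_clip_bootstrap(TEXT, SEEDS)))
```

In this toy run, the first round recombines clips from the two seeds and picks up 李小明; the second round uses the new contexts in which 李小明 appears and picks up 李大華. The name 張三 is never recovered, because no known name shares its leading context and head character, which illustrates how recall depends on the coverage of the harvested clips.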

Parallel Keywords

word-clip; candidate; named entity recognition

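Cited By

宋浩 (2015). 自動化資料豐富程序 [Doctoral dissertation, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2015.10213

卓文福 (2014). 旅遊網頁觀光目的地意象之內容分析工具研究 [Doctoral dissertation, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2014.01561

彭維謙 (2013). 不同脈絡中的歷史文本之自動分析　以《資治通鑑》、《冊府元龜》及《正史》為例 [Master's thesis, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2013.02636

高欣愷 (2013). 歷史文件自動地名標註-以《清實錄》為例 [Master's thesis, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2013.00182

劉士綱 (2012). 《清實錄》人名擷取自動化 [Master's thesis, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2012.01315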
