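This paper proposes a Web-based method for automatically collecting the career information and professional domain associated with Chinese personal names. Extracting a person's career information and classifying the professional domain effectively addresses the problem of personal name disambiguation. Domain classification also allows personal information to be presented to users in a systematic and consistent way. In the training phase, we use linguistic knowledge and statistical techniques to collect surface patterns of career information from the Web; these patterns serve as the basis for gathering information about a name from the Web and for extracting personal information. We also apply the bootstrapping method of Yarowsky (1995) to train a document classifier from Web resources. At runtime, career information for an input name is collected with the help of the surface patterns, and the career information and domain classification are then used to separate the information of different people who share the same name. We also describe a system implementation of this method. Experimental results show that our method effectively extracts the career information attached to a name and distinguishes people with the same name who work in different domains, making Web-based collection of personal information more effective.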
We introduce a method for automatically collecting a person's personal information and professional domain. In our approach, personal information is extracted and the domain is identified from Web data in order to support personal name disambiguation. In the training phase, the method generates surface patterns for personal information extraction from linguistic and statistical information on the Web, and uses an unsupervised algorithm to construct a Web-based text categorizer. At runtime, a person name is submitted to a search engine, personal information is extracted, and the domain of each retrieved passage is identified with respect to the queried name; finally, the referents are sorted by domain, personal information, and degree of popularity. We also describe an implementation of the proposed method. Blind evaluation on a set of names shows that our method performs well at extracting personal information and cleanly classifying individuals' domain-specific knowledge. The method can help users quickly find information about a person by displaying personal information in a systematic and consistent way.
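The runtime stage described above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the implementation reported here: the two surface patterns, the keyword sets that stand in for the trained text categorizer, the English example passages, and the function names `extract_info`, `classify_domain`, and `disambiguate` are all hypothetical, and popularity is approximated by the number of passages that fall into each domain.

```python
import re
from collections import defaultdict

# Hypothetical surface patterns; "{name}" marks the slot for the queried name.
SURFACE_PATTERNS = [
    "{name}, professor of (?P<info>[^.,]+)",
    "{name} is the (?P<info>[^.,]+)",
]

# Hypothetical keyword sets standing in for a trained text categorizer.
DOMAIN_KEYWORDS = {
    "academia": {"professor", "university", "research"},
    "sports": {"coach", "team", "league"},
}


def extract_info(name, passage):
    """Return the text captured by every surface pattern that matches."""
    found = []
    for template in SURFACE_PATTERNS:
        pattern = template.format(name=re.escape(name))
        for match in re.finditer(pattern, passage):
            found.append(match.group("info").strip())
    return found


def classify_domain(passage):
    """Pick the domain with the most keyword hits, or "unknown" if none."""
    words = set(re.findall(r"[a-z]+", passage.lower()))
    scores = {d: len(words & kws) for d, kws in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def disambiguate(name, passages):
    """Group passages mentioning `name` by domain, most frequent domain first."""
    groups = defaultdict(lambda: {"info": [], "count": 0})
    for passage in passages:
        if name not in passage:
            continue
        group = groups[classify_domain(passage)]
        group["count"] += 1  # passage count as a stand-in for popularity
        for item in extract_info(name, passage):
            if item not in group["info"]:
                group["info"].append(item)
    return sorted(groups.items(), key=lambda kv: kv[1]["count"], reverse=True)


if __name__ == "__main__":
    passages = [
        "Wang Ming, professor of physics at a university.",
        "Wang Ming is the head coach of the team.",
        "Research by Wang Ming, professor of physics, was published.",
    ]
    for domain, group in disambiguate("Wang Ming", passages):
        print(domain, group["count"], group["info"])
```

In this sketch each domain group plays the role of one referent of the queried name. In the method as described, the patterns would be learned from the Web and the domain decision would come from the bootstrapped categorizer rather than from fixed keyword lists.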