Ambiguity Resolution for Author Names of Bibliographic Data

As scholarly information rapidly accumulates on the Internet, users face serious problems caused by ambiguous author names, so research on author name disambiguation is indispensable. In contrast to previous work, this study attempts to address the problem using only the information contained in bibliographic data. Five features are used: co-author (C), article title (T), journal title (J), year (Y), and number of pages (P). Features Y and P have not been used in previous studies. Both supervised learning methods (Naïve Bayes and Support Vector Machine) and an unsupervised learning method (K-means) are employed to explore 28 different feature combinations. The findings show that journal title (J) and co-author (C) are very effective features: J plays an important role in all three methods, and C is effective with SVM. In addition, features Y and P clearly enhance accuracy, and the average improvement from Y (+2.5%) is larger than that from P. The feature combination CTJ does not outperform JYP, and the combinations CJY, JY, and J alone are also very effective across the three methods. Finally, disambiguation accuracy on the larger datasets is 10% lower than on the smaller ones, which indicates the limitation of relying on bibliographic data alone. Consequently, an effective approach to author name disambiguation must not only make full use of bibliographic data but also incorporate appropriate external resources.

Keywords

Author disambiguation; Bibliographic data; Machine learning

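Parallel Abstract

A large amount of scholarly information has rapidly accumulated on the Internet, and users frequently encounter the problem of author ambiguity, which makes disambiguating authors who share the same name an important research topic. In contrast to previous studies, this study makes full use of only the information available in bibliographic data and uses no information beyond it. It examines five features: co-author names (C), article title (T), journal title (J), publication year (Y), and number of pages (P); of these, publication year and number of pages have never been used in other studies. Supervised learning methods (Naïve Bayes and SVM) and an unsupervised method (K-means) are used to examine 28 different feature combinations. The study finds that journal title (J) and co-author (C) are especially effective features: J performs well in all three methods, while C is highly effective with SVM. Publication year (Y) and number of pages (P), when combined with other features, clearly raise disambiguation accuracy, and of the two, publication year (Y) provides the more prominent support (an average improvement of 2.5%). The feature combination CTJ, frequently used in previous studies, does not necessarily achieve the best accuracy, whereas combinations such as JYP, JY, and CJ can also achieve the best accuracy. Finally, experiments comparing dataset size and complexity show that accuracy on the larger, more complex datasets is 10% lower, indicating that as test datasets grow larger and more heterogeneous, relying entirely on bibliographic data can hardly deliver satisfactory identification results. This suggests that future research seeking to resolve name ambiguity effectively must, in addition to making full use of the features of bibliographic data, still draw on appropriate external information.

Parallel Keywords

Author name ambiguity; Bibliographic data; Machine learning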
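
The feature-combination experiment summarized in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the record layout, the toy records, the author labels, and the helper names (`feature_subsets`, `tokens`, `train_nb`, `predict_nb`) are all assumptions made for illustration, and only a simple multinomial Naive Bayes classifier with add-one smoothing is shown (SVM and K-means are omitted). Five features yield 31 non-empty subsets; the abstract reports 28 combinations without saying which were left out, so the sketch simply enumerates all 31.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Feature codes from the abstract: co-author, article title, journal title,
# year, number of pages.
FEATURES = ["C", "T", "J", "Y", "P"]


def feature_subsets(features):
    """Yield every non-empty subset of feature codes, e.g. 'J', 'JY', 'CTJ'."""
    for size in range(1, len(features) + 1):
        for combo in combinations(features, size):
            yield "".join(combo)


def tokens(record, subset):
    """Turn the selected fields of one record into feature-tagged tokens."""
    return [f"{code}={value}" for code in subset for value in record[code]]


def train_nb(records, labels, subset):
    """Count tokens per author for a multinomial Naive Bayes model."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for record, label in zip(records, labels):
        for tok in tokens(record, subset):
            token_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, token_counts, vocab


def predict_nb(model, record, subset):
    """Return the author label with the highest smoothed log-probability."""
    class_counts, token_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens(record, subset):
            if tok in vocab:  # tokens unseen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy records for one ambiguous name shared by two authors.
RECORDS = [
    {"C": ["lee"], "T": ["protein", "folding"], "J": ["bioinformatics"], "Y": ["2005"], "P": ["8"]},
    {"C": ["lee", "kim"], "T": ["gene", "expression"], "J": ["bioinformatics"], "Y": ["2006"], "P": ["9"]},
    {"C": ["wang"], "T": ["query", "optimization"], "J": ["vldb journal"], "Y": ["1998"], "P": ["24"]},
    {"C": ["wang", "liu"], "T": ["index", "structures"], "J": ["vldb journal"], "Y": ["1999"], "P": ["22"]},
]
LABELS = ["author_1", "author_1", "author_2", "author_2"]
QUERY = {"C": ["park"], "T": ["sequence", "alignment"], "J": ["bioinformatics"], "Y": ["2006"], "P": ["8"]}

if __name__ == "__main__":
    for subset in ("J", "JY", "JYP", "CTJ"):
        model = train_nb(RECORDS, LABELS, subset)
        print(subset, predict_nb(model, QUERY, subset))
```

In a real experiment each subset would be scored on held-out records rather than a single query, so that accuracies for combinations such as CTJ and JYP could be compared directly.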

