A Study on Applying Cangjie Code Features to Gender Prediction of Chinese Personal Names

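In daily life, our first impression of a stranger often comes from the person's name. We try to infer from the name the person's gender, relationships with others (for example, whether the person is a brother of someone we know), and even appearance. Of these, gender is generally the most apparent and the least disputed. We can therefore infer that Chinese personal names themselves carry gender information, and that this information often gives us important interpersonal cues.
This study encodes Chinese personal names with Cangjie codes and, together with gender data, uses a support vector machine (SVM) to learn the gender features of Chinese characters, so that gender can be predicted from a Chinese name. We compare the results of the k-nearest neighbors (k-NN) method with those of the SVM and apply different combination schemes to the Cangjie codes, in an attempt to find the most accurate method for predicting gender from Chinese names.
Because some Chinese names are used by both genders, gender prediction can hardly reach 100% accuracy. In our experiments, the SVM combined with Cangjie 4-grams achieved the highest accuracy, reaching 93.59% of the highest possible prediction result. We also used a questionnaire to compare human and system gender judgments; the difference was not statistically significant, which indicates that the system's gender judgments for Chinese names are indistinguishable from human judgments. In addition, we tested the model on other data sets, such as the names of Facebook friends and Chinese transliterations of English names, and it likewise achieved accuracy above 85%. Finally, we applied the model to the names of Taiwanese shops and Taiwanese stocks to examine whether different types of shops or stock sectors show different gender proportions; the results show that such differences do exist.
This study extends gender prediction from Chinese personal names to non-name Chinese text such as shop names, and finds that decomposing Chinese characters with Cangjie codes can indeed capture certain properties of a character through its written form, which broadens the possibilities for Chinese natural language processing. Beyond building an automated system for large-scale gender determination of names, the results can supply a gender attribute as an additional feature in text mining, which may improve the accuracy of document classification, clustering, or opinion analysis. Most importantly, the experiments show that Cangjie codes can describe the gender tendency of Chinese characters, opening the door to future research on describing other attributes of Chinese text with Cangjie codes.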

Keywords

text mining; Chinese names; gender prediction; support vector machine; Chinese character substructure; Cangjie coding

Parallel Abstract

In daily life, when we meet people we do not know, our first impressions usually come from their names: we often try to guess their gender, their relationship with others (e.g., whether someone is a brother of a person we know), or even their appearance. Generally speaking, gender is the most obvious characteristic conveyed by a name. We can even infer that a Chinese name contains gender information, and such information usually provides important clues about interpersonal relationships. This paper uses Cangjie codes to represent Chinese names and uses a support vector machine (SVM) to learn their gender characteristics. We compared the results of k-NN with those of the SVM and applied different combination schemes to the Cangjie codes in order to find the best method for predicting a person's gender from a Chinese name. Because some Chinese names are used by both genders, it is difficult to achieve 100% accuracy when predicting gender. We found that the highest accuracy of gender prediction is about 93.59% (SVM with Cangjie 4-grams). We also compared the gender prediction accuracy of humans and of the system through a questionnaire and found no statistically significant difference, which suggests that the system predicts the gender of Chinese names about as well as humans do. In addition, we applied the model to other data sets, such as the names of Facebook friends and English names transliterated into Chinese, and the accuracy again exceeded 85%. Finally, we applied the model to local shop names and stock names in Taiwan to examine whether different shop types or stock sectors have different gender proportions; the experimental results show that such differences do exist.
We found that gender prediction for Chinese names can be extended to shop names and other non-name Chinese text, and that Cangjie codes can express aspects of the structure of Chinese characters, thus increasing the potential of Chinese natural language processing. The results of the experiment not only provide a framework for a large-scale automatic name-gender prediction system, but can also be applied to text mining by providing additional features of articles, which may increase the accuracy of document classification, clustering, or viewpoint analysis. Most importantly, Cangjie codes can describe the gender characteristics of Chinese characters, thus opening the door to future research on using Cangjie codes to extract more attributes from Chinese characters.
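As a rough illustration of the feature type named in the abstract, the following minimal sketch extracts 4-grams from the Cangjie codes of a name. The abstract does not specify how the codes of several characters are combined, so joining them with a `_` boundary marker is an assumption, as are the function name `cangjie_ngrams` and the tiny lookup table `TOY_CANGJIE`, which is illustrative only and stands in for a full Cangjie table.

```python
from collections import Counter

# Hypothetical toy table: character -> Cangjie key sequence.
# Illustrative entries only; a real system would load a complete Cangjie table.
TOY_CANGJIE = {
    "明": "ab",
    "林": "dd",
    "好": "vnd",
    "美": "tgk",
}


def cangjie_ngrams(name, n=4, table=TOY_CANGJIE):
    """Return a Counter of n-grams over the Cangjie codes of `name`.

    Assumption: the codes of consecutive characters are joined with "_"
    so that n-grams can span character boundaries.
    """
    code = "_".join(table[ch] for ch in name)
    if len(code) < n:
        # Too short for a single n-gram: keep the whole code as one feature.
        return Counter([code])
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


features = cangjie_ngrams("美好")  # n-grams over the joined code "tgk_vnd"
```

Counters of this kind could then be turned into sparse vectors and passed to a classifier such as an SVM or k-NN; the classifier itself is not shown here.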

Parallel Keywords

text mining; Chinese name; gender prediction; support vector machine; Chinese sub-character; Cangjie coding

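Cited By

薛仱芸 (2014). 改善網路操弄評論分類績效之研究 [A study on improving the classification performance for manipulated online reviews] [Master's thesis, 朝陽科技大學 (Chaoyang University of Technology)]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0078-0905201416542666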
