  • Thesis


New Word Extraction from News and Social Media

Advisor: 項潔


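News and social media are important sources of information on the Internet, and the new words that appear in them are especially meaningful because they reflect current events and the stances taken toward them. This study therefore proposes a new word extraction system consisting of a new word extraction module and a classification prediction module. In the extraction module, the corpus is first segmented with an existing lexicon; new words that are not in the lexicon are broken into small morpheme fragments, and n-grams are then computed over these fragments to obtain candidate new words. The candidates are next filtered by three statistical features: term frequency, Pointwise Mutual Information (PMI), and branching entropy. Finally, hand-written rules remove candidates that contain numerals, prepositions, or stop words, yielding the extracted new words. In the classification prediction module, a Support Vector Machine (SVM) predicts the category of each document in which a new word appears, and the predictions are averaged to give the category of the new word. Experimental results show that the system performs better on the social media corpus, reaching an F1-score of 70.4% with a precision of 62.7% and a recall of 80.2%. The experiments also show that new words allow public opinion to be analyzed more comprehensively and reveal events and stances that were previously difficult to observe.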
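The statistical filtering step can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the corpus is already segmented into fragments, considers only bigram candidates, takes the smaller of the left and right branching entropies, and uses illustrative thresholds (`min_freq`, `min_pmi`, `min_entropy`). The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in bits) of a Counter of neighbouring tokens."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def bigram_stats(sentences):
    """Compute frequency, PMI and branching entropy for adjacent fragment pairs.

    `sentences` is a list of token lists, i.e. text already segmented
    with an existing lexicon (an assumption of this sketch).
    """
    unigrams, bigrams = Counter(), Counter()
    left = defaultdict(Counter)   # candidate -> tokens seen just before it
    right = defaultdict(Counter)  # candidate -> tokens seen just after it
    for tokens in sentences:
        unigrams.update(tokens)
        for i in range(len(tokens) - 1):
            cand = (tokens[i], tokens[i + 1])
            bigrams[cand] += 1
            if i > 0:
                left[cand][tokens[i - 1]] += 1
            if i + 2 < len(tokens):
                right[cand][tokens[i + 2]] += 1

    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    stats = {}
    for (a, b), freq in bigrams.items():
        p_ab = freq / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        stats[(a, b)] = {
            "freq": freq,
            "pmi": math.log2(p_ab / (p_a * p_b)),
            "left_entropy": entropy(left[(a, b)]),
            "right_entropy": entropy(right[(a, b)]),
        }
    return stats


def select_candidates(stats, min_freq=2, min_pmi=1.0, min_entropy=0.5):
    """Keep candidates passing all three thresholds (illustrative values)."""
    return sorted(
        "".join(cand)
        for cand, s in stats.items()
        if s["freq"] >= min_freq
        and s["pmi"] >= min_pmi
        and min(s["left_entropy"], s["right_entropy"]) >= min_entropy
    )


# Toy corpus: the fragments "小" and "確幸" recur together in varied contexts.
sentences = [
    ["我", "愛", "小", "確幸", "啊"],
    ["你", "有", "小", "確幸", "嗎"],
    ["他", "說", "小", "確幸", "好"],
    ["我", "說", "你", "好"],
]
print(select_candidates(bigram_stats(sentences)))
```

A high PMI indicates that the fragments co-occur more often than chance would predict, while a high branching entropy indicates that the candidate appears in varied contexts, which suggests it forms a complete word rather than part of a longer one. The rule-based removal of numerals, prepositions, and stop words would follow this step and is not shown.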


News and social media are major sources of information on the Internet, and new words are coined in them every day. Although new words carry timely and meaningful information, most word extraction tools cannot extract them. This study therefore proposes a new word extraction system. We first segment the corpus with an existing dictionary, so that new words, which are absent from the dictionary, are broken into small morphemes. We then obtain new words by computing statistical features such as term frequency, Pointwise Mutual Information (PMI), and Branching Entropy. We also predict the domains of new words with a Support Vector Machine (SVM). Our results show that the system performs better on the social media corpus, where it achieves an F1-score of 70.4%, with 62.7% precision and 80.2% recall. We also find that new words enable broader opinion analysis and a better understanding of the corpus.
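As a quick consistency check on the reported figures, the sketch below recomputes the F1-score as the harmonic mean of precision and recall. It assumes the 62.7% figure is the precision paired with the 80.2% recall; the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures reported for the social media corpus.
reported_f1 = f1_score(0.627, 0.802)
print(f"{reported_f1:.1%}")  # prints 70.4%
```

Under that assumption the result matches the reported F1-score of 70.4%.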


