Topic-Based Chinese Text Clustering Based on Word Pattern Relations

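Information searching and reading news and magazines are the activities people most often carry out online. If the information returned by the network is not grouped by topic, however, users must spend more time identifying the topic of each search result. The topic-grouping problem can be solved with clustering methods, and the vector space model (VSM) is one way of representing documents before they are clustered. The drawback of applying this model to document clustering lies in its basic assumption that feature terms are independent of one another, whereas words in natural language are not independent: some words often appear together. Because the model compares texts only by matching index terms, it can suffer from term-matching errors. Polysemy among words causes too much information to be retrieved, and synonymy causes too little. To solve these problems, this study uses word expansion to combine related features into a single semantic concept, which then leads to the corresponding documents. We expect that this mechanism of indexing by semantic concept will reduce the problems of word co-occurrence, polysemy, and synonymy. In the experiment, sequences of two or three words from the same sentence form a word pattern, which replaces the keyword as the text feature. The strength of each word pattern's distribution across documents is measured by pattern frequency, pattern frequency with inverse document frequency, conditional probability, mutual information, and association norm, and the word patterns are then clustered with a hierarchical clustering method. Each resulting cluster is treated as one semantic concept. Several semantic concepts are then merged into a single topic on the basis of the documents in which the concepts co-occur, and the texts corresponding to that topic are regarded as related to it. The experimental results show that the proposed clustering method outperforms the traditional VSM clustering method under all five feature-strength measures. For Average Recall, pattern frequency performs best at 98.84%. For Average Precision, association norm performs best at 95.26%. For Average F-measure, association norm is again the best at 96.7%.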
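The pattern-extraction and strength-measurement steps described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: documents are already segmented into sentences and words (English tokens stand in for segmented Chinese words), a "pattern" is taken to be any order-preserving selection of two or three words from one sentence, and the formulas are the common textbook forms of pattern frequency, PF-IDF, and pointwise mutual information rather than the thesis's own definitions. Conditional probability and association norm are left out because their exact definitions are not given here. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def extract_patterns(sentence, sizes=(2, 3)):
    """Return the set of 2- and 3-word patterns in one tokenized sentence.

    Assumption: a pattern keeps the word order of the sentence but the
    words need not be adjacent.
    """
    patterns = set()
    for n in sizes:
        patterns.update(combinations(sentence, n))
    return patterns


def pattern_counts(documents):
    """Count pattern frequency (pf) and document frequency (df).

    pf counts a pattern once per sentence that contains it;
    df counts it once per document that contains it.
    """
    pf, df = Counter(), Counter()
    for doc in documents:
        in_doc = set()
        for sentence in doc:
            found = extract_patterns(sentence)
            pf.update(found)
            in_doc |= found
        df.update(in_doc)
    return pf, df


def pf_idf(pattern, pf, df, n_docs):
    """Textbook PF-IDF; assumes the pattern occurs in at least one document."""
    return pf[pattern] * math.log(n_docs / df[pattern])


def pmi(pair, sentences):
    """Pointwise mutual information of a two-word pattern,
    with probabilities estimated at sentence level."""
    x, y = pair
    n = len(sentences)
    cx = sum(1 for s in sentences if x in s)
    cy = sum(1 for s in sentences if y in s)
    cxy = sum(1 for s in sentences if pair in extract_patterns(s, sizes=(2,)))
    if cxy == 0:
        return float("-inf")
    return math.log2(cxy * n / (cx * cy))


# Toy corpus: three documents, each a list of tokenized sentences.
docs = [
    [["stock", "rises"], ["stock", "rises", "index"]],
    [["typhoon", "lands"]],
    [["stock", "falls"]],
]
pf, df = pattern_counts(docs)
sentences = [s for d in docs for s in d]
print(pf_idf(("stock", "rises"), pf, df, len(docs)))
print(pmi(("stock", "rises"), sentences))
```

Any of these scores could then serve as the feature strength passed to a hierarchical clustering routine. Which score is used, and how it is normalized, is a design choice that the abstract reports as affecting recall and precision differently.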

Keywords

text clustering; topic; pattern relation; pattern; word

Parallel Abstract

Searching for information and reading news and magazines are among the most common online activities today. When the returned information is not grouped by topic, however, users must spend extra time working out the topic of each search result. Topic grouping can be handled by clustering, and the vector space model (VSM) is a common way to represent documents before clustering. Its weakness for document clustering is the underlying assumption that features are independent of one another; words in natural language are not independent, and some words frequently occur together. Because the model compares texts only by matching index terms, it is prone to term-mismatch errors: polysemy causes too much information to be retrieved, while synonymy causes too little. To address these problems, this study uses word expansion to group related features into a single semantic concept and then retrieves the corresponding documents through that concept. We expect that indexing by semantic concept will reduce the problems of word co-occurrence, polysemy, and synonymy. In the experiment, sequences of two or three words from the same sentence form a word pattern, which replaces the keyword as the text feature. The distributional strength of each word pattern across documents is measured by Pattern Frequency, Pattern Frequency-Inverse Document Frequency, Conditional Probability, Mutual Information, and Association Norm, and hierarchical clustering is then applied to cluster the word patterns according to that strength. Each resulting cluster is treated as one semantic concept. Then, based on the documents in which concepts co-occur, several semantic concepts are merged into a single topic, and the texts corresponding to that topic are considered topic-related. The experimental results show that the proposed clustering method outperforms traditional VSM clustering under all five feature-strength measures. Pattern Frequency gives the best Average Recall, 98.84%; Association Norm gives the best Average Precision, 95.26%, and also the best Average F-measure, 96.7%.
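The final step, merging semantic concepts into topics according to the documents they share, might look like the following Python sketch. The abstract does not say how document overlap is scored or when two concepts are merged, so the Jaccard overlap, the `threshold` parameter, the greedy merge order, and the function names are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Share of documents common to both sets (assumed overlap score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_concepts(concept_docs, threshold=0.5):
    """Greedily merge concepts whose document sets overlap enough.

    concept_docs maps a concept name to the set of document ids that
    contain at least one of the concept's word patterns.
    Returns a list of (concept names, document ids) pairs, one per topic.
    """
    topics = [(frozenset([name]), set(docs)) for name, docs in concept_docs.items()]
    merged = True
    while merged:
        merged = False
        for i in range(len(topics)):
            for j in range(i + 1, len(topics)):
                if jaccard(topics[i][1], topics[j][1]) >= threshold:
                    # Fold topic j into topic i, then rescan from the start.
                    topics[i] = (topics[i][0] | topics[j][0],
                                 topics[i][1] | topics[j][1])
                    del topics[j]
                    merged = True
                    break
            if merged:
                break
    return topics


# Hypothetical concepts and the documents in which they appear.
concept_docs = {"c1": {1, 2, 3}, "c2": {2, 3, 4}, "c3": {9, 10}}
for names, doc_ids in merge_concepts(concept_docs):
    print(sorted(names), sorted(doc_ids))
```

Each returned pair is one topic, and its document set is the group of texts treated as topic-related. A lower threshold produces fewer, broader topics, which would tend to raise recall at some cost in precision.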

Parallel Keywords

text clustering; topic; pattern relation; pattern; word

