The co-occurrence of heterogeneous data across documents gives rise to complex structures, and explaining the relationships among such data has long been an open problem for researchers. With the arrival of the Internet era, most people have been drawn to the convenience and speed of the network, and the Internet has gradually become the main channel for searching for and sharing information. As a result, the volume of electronic text has grown sharply: the number of research articles, web pages, news reports, and business documents is growing exponentially. How to manage such large document collections effectively has therefore become an important issue.

The main goal of this thesis is to develop an automatic clustering system for biomedical literature that groups documents on similar topics and domains together out of an otherwise unorganized collection, helping users grasp the knowledge structure and content quickly and effectively when faced with a large body of medical literature. In this thesis, we use association rules to implement the simplex concept of the Clique Percolation Method. We then compare the result with Literature Clustering Search on two document classification benchmarks, Reuters-21578 and OHSUMED, evaluating the differences in precision, recall, normalized mutual information, and pairwise testing.
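The evaluation measures named in the abstract can be illustrated with a small sketch. It is not the thesis's implementation: the function names are hypothetical, "pairwise" is assumed to mean precision and recall counted over document pairs placed in the same cluster, and normalized mutual information is assumed to be normalized by the arithmetic mean of the two entropies (other normalizations exist).

```python
import math
from collections import Counter
from itertools import combinations


def pairwise_precision_recall(true_labels, cluster_labels):
    """Precision and recall over document pairs (assumed reading of 'pairwise').

    A pair is a true positive when both documents share a cluster and a class.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(true_labels, cluster_labels):
    """Mutual information divided by the mean of the two entropies (assumed)."""
    n = len(true_labels)
    class_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_labels)
    joint_counts = Counter(zip(true_labels, cluster_labels))
    mi = sum(
        (c / n) * math.log(c * n / (class_counts[t] * cluster_counts[k]))
        for (t, k), c in joint_counts.items()
    )
    denom = (entropy(true_labels) + entropy(cluster_labels)) / 2
    # Both labelings consist of a single group: treat them as identical.
    return mi / denom if denom else 1.0


if __name__ == "__main__":
    # Toy data: six documents, two true classes, one document misplaced.
    truth = ["a", "a", "a", "b", "b", "b"]
    clusters = [0, 0, 1, 1, 1, 1]
    print(pairwise_precision_recall(truth, clusters))
    print(normalized_mutual_information(truth, clusters))
```

On a benchmark such as Reuters-21578 or OHSUMED, `true_labels` would hold the category assigned to each document and `cluster_labels` the cluster produced by the system under test; documents with several categories would need a convention that this sketch does not cover.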