Applying Data Clustering Principles to the Clustering of Unstructured Documents: The Case of an Information Bulletin

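Cluster analysis is a branch of data mining. Its purpose is to partition a large body of heterogeneous data into clusters of highly homogeneous items, so that similarity among the documents within a cluster is maximized and dissimilarity between clusters is maximized. Clustering and classification are both used to divide data into groups, but they differ in one essential way. Classification starts from groups defined in advance and assigns each data item to whichever predefined group suits it best. Clustering has no predefined groups: the character of each group depends on how many groups the data are to be divided into, or on the attributes of the data themselves.

The purpose of this thesis is to apply the clustering techniques of data mining to text mining of unstructured documents. After data preprocessing, a keyword set is used to search the document collection and build a frequency matrix of keywords against documents. The Euclidean distance formula is then used to compute the distances between keywords in the matrix, and the agglomerative hierarchical clustering algorithm is used to build document clusters, so that the best-matching set of documents can be offered to the user. The thesis also evaluates the retrieval performance of full-text search against that of data clustering, as a reference for practitioners building information systems.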
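The pipeline described above (keyword-by-document frequency matrix, Euclidean distance, agglomerative hierarchical clustering) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the sample documents and keywords, the naive substring counting, the single-linkage merge rule, and the function names `frequency_matrix`, `euclidean`, and `agglomerative`. The sketch clusters documents by their keyword-frequency rows; the abstract speaks of distances between keywords, and the same routine could be applied to the transposed matrix for that purpose.

```python
import math


def frequency_matrix(documents, keywords):
    """Rows are documents, columns are keywords; each cell is a substring count."""
    return [[doc.lower().count(kw.lower()) for kw in keywords] for doc in documents]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(vectors, n_clusters):
    """Bottom-up clustering; single linkage is an assumed choice, not the thesis's."""
    clusters = [[i] for i in range(len(vectors))]  # start with one cluster per row
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(euclidean(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical bulletin snippets and keyword set, for illustration only.
documents = [
    "virus alert: new virus spreads by email",
    "email virus warning for mail servers",
    "router firmware update fixes network outage",
    "network outage traced to router fault",
]
keywords = ["virus", "email", "router", "network"]

matrix = frequency_matrix(documents, keywords)
clusters = agglomerative(matrix, 2)
print(matrix)    # [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
print(clusters)  # [[0, 1], [2, 3]]
```

With single linkage, replacing the Euclidean distance by its square would not change the merge order, since squaring preserves the ordering of non-negative distances. Other linkage rules (complete, average) may behave differently.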

Keywords

Text mining; unstructured documents; agglomerative hierarchical clustering algorithm

Parallel Abstract

Data clustering analysis is one of the techniques of data mining. Its purpose is to separate a large amount of diverse data into clusters of homogeneous data, with maximal similarity within each cluster and maximal dissimilarity between different clusters. Clustering and classification are both methods for grouping data, but they differ in one major respect. In classification the groups are defined in advance, and each data item is assigned to whichever predefined group it fits best. In cluster analysis no groups are defined in advance; the characteristics of each group depend on the number of groups the data are to be divided into or on the attributes of the data. The purpose of this thesis is to use the clustering techniques of data mining to carry out text mining on unstructured documents. After the data are preprocessed, keywords are used to search the text database and create a frequency matrix of keywords and documents. The squared Euclidean distance formula is then used to calculate the distances between keywords, and the agglomerative hierarchical clustering algorithm is used to build the clusters. The result offers users the best collections of similar documents for reference, and provides an assessment of the retrieval performance of full-text search and data clustering for those who want to build an information system.

Parallel Keywords

Text Mining; Unstructured Documents; Agglomerative Hierarchical Clustering Algorithm
