Information today grows as rapidly as dividing cells, and the ability to acquire, organize, present, and apply it effectively will be a key to success. Clustering techniques automatically organize and classify data according to some set of features. Applied to document data, clustering can improve the retrieval effectiveness (precision or recall) of information retrieval systems, organize and present information effectively, and automatically build a classification hierarchy for documents (for example, a taxonomy of Web documents like the category directory provided by Yahoo!).

Traditional document clustering involves two main steps: (1) extracting document features and mapping each document to a point in the vector space model; (2) applying a specific clustering algorithm to group those points into clusters. The vector space model used in the first step has an inherent weakness, however: it cannot distinguish the relationships among the terms in a document, which can make the subsequent clustering inaccurate. This study therefore uses association rule mining, a technique from the field of data mining, to remedy this weakness of traditional document clustering and improve the quality of the resulting clusters.

The study uses association rule mining to discover relationships among the terms in documents and uses those relationships to revise the vector space model. We evaluated the approach experimentally on the Reuters-21578 corpus, comparing the proposed document clustering method with the traditional one; the results show that the proposed method does improve clustering effectiveness and produces higher-quality clusters. In future work, we plan to apply the proposed method to other document clustering algorithms based on the vector space model in order to further improve clustering quality.
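The idea of revising the vector space model with term associations can be illustrated with a small sketch. The abstract does not specify how the revision is carried out, so everything below is an assumption made for illustration: the function names, the restriction to rules between single terms, the support and confidence thresholds, the spreading factor `alpha`, and the toy documents (which are invented and not taken from Reuters-21578). The sketch treats each document as a transaction, mines rules of the form a -> b, and then adds a fraction of the weight of term a to term b in each document vector.

```python
import math
from collections import Counter
from itertools import permutations


def term_vectors(docs):
    """Map each tokenized document to a term-frequency vector (sparse dict)."""
    return [Counter(doc) for doc in docs]


def mine_pair_rules(docs, min_support=0.3, min_confidence=0.6):
    """Mine rules a -> b between single terms; each document is one transaction.

    support(a, b)      = fraction of documents containing both a and b
    confidence(a -> b) = (documents with a and b) / (documents with a)
    """
    n = len(docs)
    doc_sets = [set(doc) for doc in docs]
    single = Counter(term for terms in doc_sets for term in terms)
    pair = Counter()
    for terms in doc_sets:
        for a, b in permutations(sorted(terms), 2):
            pair[(a, b)] += 1
    rules = {}
    for (a, b), count in pair.items():
        confidence = count / single[a]
        if count / n >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


def adjust_vector(vec, rules, alpha=0.5):
    """Assumed correction: for each rule a -> b, add alpha * confidence * w(a) to b."""
    adjusted = dict(vec)
    for (a, b), confidence in rules.items():
        if a in vec:  # only original terms spread weight, not newly added ones
            adjusted[b] = adjusted.get(b, 0.0) + alpha * confidence * vec[a]
    return adjusted


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented toy corpus: documents 2 and 3 share no terms, but "oil" and
# "crude" co-occur elsewhere in the collection.
docs = [
    ["oil", "crude", "price"],
    ["oil", "crude", "barrel"],
    ["crude", "opec"],
    ["oil", "export"],
    ["wheat", "grain"],
]

raw = term_vectors(docs)
rules = mine_pair_rules(docs)
adjusted = [adjust_vector(vec, rules) for vec in raw]

print("rules:", rules)
print("raw similarity:     ", cosine(raw[2], raw[3]))
print("adjusted similarity:", cosine(adjusted[2], adjusted[3]))
```

In the plain vector space model the two documents that share no terms have zero similarity; after the assumed adjustment they become partially similar because each gains weight on the other's associated term. Any clustering algorithm that operates on vectors could then be run on `adjusted` instead of `raw`.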