Applying Association Rule Techniques to Effectively Support Document Clustering Based on the Vector Space Model

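Information today grows as rapidly as proliferating cells, and the ability to acquire, organize, present, and apply it effectively will be the key to success. Clustering techniques automatically organize and classify data according to some feature; applied to document data, they can improve the search performance of information retrieval systems, organize and present information effectively, and automatically build classification structures for documents (such as the category directory of the Yahoo website). Traditional document clustering involves two important steps: (1) extracting document features and mapping the documents into the vector space model; (2) clustering with a specific clustering algorithm. However, the vector space model used in the first step has an inherent deficiency: it cannot distinguish the relationships among the terms in a text, which may make the subsequent clustering computation inaccurate. This study therefore uses association rule mining, a technique from the field of data mining, to remedy this deficiency of traditional document clustering and to improve clustering quality. Association rule mining is used to find the relationships among the terms in the documents, and the vector space model is corrected accordingly. Finally, the method is evaluated experimentally on the Reuters-21578 document collection; compared with the traditional document clustering method, the proposed method is shown to improve document clustering and to produce high-quality document clusters. In the future, we hope to apply it to other document clustering algorithms based on the vector space model to further improve clustering results.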

Keywords

Text Mining; Data Mining; Document Clustering; Vector Space Model; Association Rule

Parallel Abstract

Nowadays, information grows as fast as cell division; being able to retrieve, organize, and present this fast-growing information efficiently will be the key to success. Clustering has been investigated as a way to organize and classify information automatically according to some features. When applied to document data, it can improve precision or recall in information retrieval systems and allow a system to organize and present information efficiently. Furthermore, document clustering has been used to automatically generate hierarchical clusters of documents (e.g., the automatic generation of a taxonomy of Web documents like that provided by Yahoo!). Traditional document clustering involves two phases: first, feature extraction maps each document or record to a point in the vector space model; then a specific clustering algorithm groups the points into clusters. However, the vector space model has an inherent defect: it cannot differentiate the relationships among the terms in a document, which may cause errors in the subsequent clustering. Therefore, this study proposes to use association rules, a data mining technique, to make up for this inadequacy of traditional document clustering and to improve the quality of clustering. Association rules are used to mine the relationships between terms in documents, and these relationships are then used to correct the vector space model. Finally, we conducted experiments on the Reuters-21578 corpus comparing the proposed document clustering method with the traditional one; the results show that the proposed method generates higher-quality clusters than the traditional method. In the future, we plan to apply the proposed method to other clustering algorithms based on the vector space model in order to further improve the quality of clustering.
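The idea summarized in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's actual weighting formula, thresholds, or clustering algorithm, so everything below is an assumption made for illustration: the toy corpus, the function names (`build_vectors`, `mine_term_rules`, `adjust_vector`, `cosine`), the thresholds, and in particular the correction step, which simply adds to term b's weight a share of term a's weight scaled by the confidence of the rule a → b.

```python
from collections import Counter
from itertools import permutations
import math


def build_vectors(docs):
    """Map each tokenized document to a raw term-frequency vector."""
    vocab = sorted({t for d in docs for t in d})
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append([float(counts[t]) for t in vocab])
    return vocab, vectors


def mine_term_rules(docs, min_support, min_confidence):
    """Mine one-to-one rules a -> b over term co-occurrence in documents.

    support(a -> b)    = fraction of documents containing both a and b
    confidence(a -> b) = docs containing both a and b / docs containing a
    """
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    df = Counter(t for s in doc_sets for t in s)
    pair_df = Counter()
    for s in doc_sets:
        for a, b in permutations(sorted(s), 2):
            pair_df[(a, b)] += 1
    rules = {}
    for (a, b), both in pair_df.items():
        support = both / n
        confidence = both / df[a]
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


def adjust_vector(vec, vocab, rules):
    """Hypothetical correction: for each rule a -> b, add
    confidence * weight(a) to weight(b). Not the thesis's formula."""
    index = {t: i for i, t in enumerate(vocab)}
    adjusted = list(vec)
    for (a, b), conf in rules.items():
        adjusted[index[b]] += conf * vec[index[a]]
    return adjusted


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy corpus (invented for this sketch, not taken from Reuters-21578).
docs = [
    ["stock", "market", "price"],
    ["stock", "market", "trade"],
    ["stock", "market"],
    ["wheat", "grain", "harvest"],
    ["wheat", "grain", "export"],
    ["stock"],
    ["market"],
]

vocab, vectors = build_vectors(docs)
rules = mine_term_rules(docs, min_support=0.25, min_confidence=0.7)
adjusted = [adjust_vector(v, vocab, rules) for v in vectors]

# The last two documents share no term, so the plain model sees them as unrelated.
raw_sim = cosine(vectors[5], vectors[6])
adj_sim = cosine(adjusted[5], adjusted[6])
print(f"raw similarity: {raw_sim:.2f}, adjusted similarity: {adj_sim:.2f}")
```

In this toy setting, the documents containing only "stock" and only "market" have no term in common, so the plain vector space model gives them zero similarity; once the mined rules between the two terms are applied, they become similar, while documents from the unrelated "wheat" group stay orthogonal to them. Any clustering algorithm that operates on the vectors could then be run on `adjusted` instead of `vectors`.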

Parallel Keywords

Text Mining; Data Mining; Association Rule; Vector Space Model; Document Clustering
