Developing a Two-Stage Document Clustering Method Using the Characteristics of Co-Occurring Words

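Clustering is a widely used technique in data mining, and applying it to text mining has recently become an active research topic. When clustering is applied to document data, the Vector Space Model (VSM) is commonly used to represent the documents. Although the VSM is a good way to represent documents, research has identified two inherent weaknesses. First, it cannot recognize relationships among the terms in a text, which leads to documents being misjudged. In the VSM, the dimension formed by each keyword is independent, so the model cannot distinguish relationships among terms (including polysemy, synonymy, and co-occurring words). Document similarity comparisons may therefore be misjudged, which lowers the quality of the document clusters [Sul01]. Second, the dimensionality is too high, which makes clustering inaccurate. The number of VSM dimensions is determined by the number of keywords in the whole document collection; when too many keywords are extracted from the documents, the dimensionality grows and the clustering results become less accurate [RG00][Sul01]. Some researchers have since focused on remedying the weaknesses of the VSM [RG00][DM01], while others have studied document clustering methods that are not based on the VSM [ZE98][MHB97]. To address the two weaknesses of the VSM, this study proposes a two-stage document clustering method: the first stage clusters the keywords, and the second stage then clusters the documents. The study applies association rule techniques to remedy the weaknesses of the VSM and improve the quality of document clustering; in addition, the keyword clusters can be used to give a summary description of the clustered documents. Finally, the study evaluates the approach experimentally on the Reuters-21578 document collection, comparing the proposed method with a traditional document clustering method, and confirms that the proposed method does produce high-quality document clusters.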
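The first weakness can be illustrated with a small sketch. It assumes raw term-frequency vectors and cosine similarity, which are common VSM choices but are not stated in the abstract; the three toy documents are invented for illustration only.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents: doc_a and doc_b describe the same topic with
# different words; doc_a and doc_c share two literal terms.
doc_a = Counter("car engine repair".split())
doc_b = Counter("automobile motor maintenance".split())
doc_c = Counter("car engine price".split())

sim_ab = cosine(doc_a, doc_b)  # no shared dimension, so the dot product is zero
sim_ac = cosine(doc_a, doc_c)  # two shared dimensions out of three

# Every distinct term adds one dimension to the vector space.
vocabulary = set(doc_a) | set(doc_b) | set(doc_c)

print(sim_ab, sim_ac, len(vocabulary))
```

Because each term is its own axis, the synonym pair scores as completely dissimilar, and even three short documents already need seven dimensions.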

Keywords

Document Clustering; Data Mining; Co-Occurring Words; Text Mining; Association Rule

Parallel Abstract

Clustering is a general-purpose data mining technique that has been used in many application domains and, in recent years, especially on text. Clustering algorithms for text data often adopt the Vector Space Model (VSM), a popular retrieval model, to represent documents. However, research has identified two main disadvantages of the model. First, the dimension created for each distinct term is independent of the others, so the model cannot capture relationships among terms in a document collection, such as synonymy, polysemy, and co-occurring words. This makes similarity calculations across the collection unreliable and can reduce clustering accuracy [Sul01]. Second, the number of dimensions grows as the document collection becomes larger and more varied, and several researchers have found that clustering results are inaccurate when the number of dimensions is extremely high [RG00][Sul01]. Some researchers have worked on improving document clustering within the Vector Space Model [RG00][DM01], while others have developed non-VSM-based document clustering methods [ZE98][MHB97]. To overcome these two disadvantages of the VSM, this research introduces a two-stage document clustering method: the first stage clusters the keywords, and the second stage clusters the documents using the results of the first stage. The research applies association rule techniques to overcome the disadvantages and improve the quality of document clustering. The keyword clusters can also serve as conceptual descriptions of the documents after clustering. In experiments on the Reuters-21578 corpus, we compared the proposed method with a traditional document clustering method. The results confirm that the proposed method improves document clustering and yields higher-quality clusters.
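The two-stage idea can be sketched as follows. The abstract does not specify the algorithms, so everything here is an assumption made for illustration: keywords are linked when a co-occurrence rule between them meets hypothetical `min_support` and `min_confidence` thresholds in both directions, keyword clusters are the connected components of those links, and each document is assigned to the keyword cluster it overlaps most. The function names and toy documents are likewise invented and are not the thesis's implementation.

```python
from collections import defaultdict
from itertools import combinations

def keyword_clusters(docs, min_support=0.3, min_confidence=0.6):
    """Stage 1 (assumed): group keywords linked by strong co-occurrence rules."""
    n = len(docs)
    df = defaultdict(int)       # number of documents containing each term
    pair_df = defaultdict(int)  # number of documents containing both terms
    for terms in (set(d) for d in docs):
        for t in terms:
            df[t] += 1
        for a, b in combinations(sorted(terms), 2):
            pair_df[(a, b)] += 1

    parent = {t: t for t in df}  # union-find forest over keywords

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for (a, b), both in pair_df.items():
        support = both / n
        # Require the rule to hold in both directions (a -> b and b -> a).
        confidence = min(both / df[a], both / df[b])
        if support >= min_support and confidence >= min_confidence:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for t in df:
        groups[find(t)].add(t)
    return list(groups.values())

def cluster_documents(docs, clusters):
    """Stage 2 (assumed): assign each document to its dominant keyword cluster."""
    term_to_cluster = {t: i for i, c in enumerate(clusters) for t in c}
    labels = []
    for d in docs:
        weights = defaultdict(int)
        for t in d:
            if t in term_to_cluster:
                weights[term_to_cluster[t]] += 1
        labels.append(max(weights, key=weights.get) if weights else None)
    return labels

# Hypothetical keyword lists, one per document.
docs = [
    ["oil", "crude", "barrel"],
    ["oil", "crude", "opec"],
    ["crude", "oil", "barrel"],
    ["wheat", "grain", "harvest"],
    ["grain", "wheat", "export"],
    ["wheat", "grain", "crop"],
]
clusters = keyword_clusters(docs)
labels = cluster_documents(docs, clusters)
for doc, label in zip(docs, labels):
    print(doc, "->", sorted(clusters[label]))
```

In this sketch the documents are compared in a space with one dimension per keyword cluster rather than one per keyword, and the keywords of the assigned cluster double as a rough description of each document group.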

Parallel Keywords

Data Mining; Text Mining; Association Rule; Co-Occurring Words; Document Clustering

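References

1. 鍾明璇, "Applying Association Rule Techniques to Effectively Assist Vector-Space-Model-Based Document Clustering," Master's Program, Department of Information Management, Chung Yuan Christian University, August 2002.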

26. [KL00] H. J. Kim and S. G. Lee, "A Semi-Supervised Document Clustering Technique for Information Organization," Proceedings of the Ninth International Conference on Information and Knowledge Management (CIKM), 2000.

31. [MHB97] J. Moore, E. H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, and B. Mobasher, "Web Page Categorization and Feature Selection Using Association Rule and Principal Component Clustering," 7th Workshop on Information Technologies and Systems (WITS'97), 1997.

5. [BM98] L. D. Baker and A. K. McCallum, "Distributional Clustering of Words for Text Classification," ACM SIGIR, 1998, pp. 96-103.

10. [Dh01] I. S. Dhillon, "Co-clustering Documents and Words Using Bipartite Spectral Graph Partitioning," Conference on Knowledge Discovery and Data Mining, ACM, 2001.

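Cited By

劉世琪 (2007). Automatic Clustering of Chinese Professional Documents [Master's thesis, Yuan Ze University]. Airiti Library. https://doi.org/10.6838/YZU.2007.00238

詹岳縉 (2009). Integration and Presentation of Web Multi-Document Summaries [Master's thesis, Chang Jung Christian University]. Airiti Library. https://doi.org/10.6833/CJCU.2009.00018

谷佳臻 (2007). A Study of Computer-Assisted Analysis Software Applied to the Content Analysis of Interview Transcripts in Qualitative Research [Master's thesis, National Taiwan Normal University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0021-2910200810572104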
