A Two-Stage Document Clustering Method Based on the Properties of Co-occurring Terms

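Clustering is a widely used technique in data mining, and applying it to text mining has recently become an active research topic. When clustering is applied to document data, the vector space model (VSM) is commonly used to represent the documents. Research has, however, identified two shortcomings of this model. First, it cannot capture the relationships between the terms in a text, which leads to documents being misjudged. In the VSM, each keyword forms an independent dimension, so relationships between terms (including polysemy, synonymy, and co-occurring terms) cannot be distinguished; document similarity comparisons may therefore be wrong, which lowers the quality of the document clusters. Second, high dimensionality tends to make clustering inaccurate. The dimensionality of the VSM is determined by the number of keywords in the document collection; when too many keywords are extracted from the documents, the dimensionality grows and the clustering results become less accurate. To address these two shortcomings, this paper proposes a two-stage document clustering method: the first stage clusters the keywords, and the second stage uses those keyword clusters to group the documents. The method applies association rule techniques to remedy the shortcomings of the VSM and to improve clustering quality; in addition, the keyword clusters can serve as summary descriptions of the document clusters. We evaluate the method on the Reuters-21578 collection, comparing it with traditional document clustering methods; the experimental results confirm that the proposed method produces high-quality document clusters.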
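The first shortcoming can be illustrated with a small sketch. It assumes raw term counts as vector weights and cosine similarity as the comparison measure; neither choice is stated in the abstract, and the three toy documents are invented for illustration only.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy documents (not from Reuters-21578).
doc_a = Counter("car engine repair".split())
doc_b = Counter("automobile motor maintenance".split())  # same topic, different words
doc_c = Counter("car engine price".split())              # shares surface terms with doc_a

# Each distinct term is its own independent dimension in the VSM.
vocabulary = set(doc_a) | set(doc_b) | set(doc_c)

sim_ab = cosine(doc_a, doc_b)  # no shared dimensions, so the similarity is zero
sim_ac = cosine(doc_a, doc_c)  # two of three terms shared

print(len(vocabulary), sim_ab, round(sim_ac, 3))
```

Because `doc_a` and `doc_b` share no dimension, the model treats them as unrelated even though they describe the same topic, and every new term adds another dimension to `vocabulary`.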

Keywords

document clustering; association rules; text mining; co-occurring terms

Parallel Abstract

Clustering techniques have been developed in many application domains. When text documents are clustered, the Vector Space Model (VSM) is often used to represent them. However, the VSM has two major disadvantages in text-clustering research. First, it cannot distinguish correlations between terms, such as synonymy, polysemy, and co-occurring words. Second, its dimensionality increases as more keywords are extracted from the documents. These disadvantages increase the complexity of calculating similarity between documents and adversely affect the accuracy of the clustering. We propose a two-stage document-clustering method to mitigate these disadvantages. In the first stage, the keywords are clustered; in the second stage, the documents are clustered using the keyword clusters obtained in the first stage. We used the Reuters-21578 corpus to evaluate the proposed method. The results indicate that our method yields higher document-clustering quality than traditional clustering methods.
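The two stages can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the abstract does not say how association rules are turned into keyword clusters, so the sketch assumes pairwise rules filtered by hypothetical `min_support` and `min_confidence` thresholds, merges linked keywords with a union-find structure, and assigns each document to the keyword cluster it overlaps most. The toy documents are invented.

```python
from collections import Counter
from itertools import combinations

def cluster_keywords(docs, min_support=2, min_confidence=0.6):
    """Stage 1 (assumed): link keywords whose pairwise co-occurrence rules
    hold in both directions, then return the connected groups."""
    term_count = Counter(t for d in docs for t in set(d))
    pair_count = Counter(p for d in docs for p in combinations(sorted(set(d)), 2))
    parent = {t: t for t in term_count}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for (a, b), n in pair_count.items():
        # Confidence of a -> b and b -> a; keep the weaker direction.
        confidence = min(n / term_count[a], n / term_count[b])
        if n >= min_support and confidence >= min_confidence:
            parent[find(a)] = find(b)

    groups = {}
    for t in term_count:
        groups.setdefault(find(t), set()).add(t)
    return [g for g in groups.values() if len(g) > 1]  # drop unlinked keywords

def cluster_documents(docs, keyword_clusters):
    """Stage 2 (assumed): label each document with the index of the keyword
    cluster it shares the most terms with, or None if it shares none."""
    labels = []
    for d in docs:
        overlaps = [len(set(d) & kc) for kc in keyword_clusters]
        best = max(range(len(overlaps)), key=overlaps.__getitem__, default=None)
        labels.append(best if best is not None and overlaps[best] > 0 else None)
    return labels

# Hypothetical keyword lists, one per document.
docs = [
    ["oil", "crude", "barrel"],
    ["oil", "crude", "opec"],
    ["wheat", "grain", "harvest"],
    ["wheat", "grain", "export"],
    ["oil", "barrel", "crude"],
]
keyword_clusters = cluster_keywords(docs)
labels = cluster_documents(docs, keyword_clusters)
print(keyword_clusters, labels)
```

In this sketch the documents are compared against a handful of keyword clusters rather than against every individual term, and each keyword cluster doubles as a rough description of the document cluster it labels.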

Parallel Keywords

document clustering; association rule; text mining; co-occurring words
