Co-Clustering Based on Dynamically Adjusted Weights

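Advances in technology and the growth of the Internet have caused the volume of information to rise rapidly, yet this progress also forces users to spend more time browsing for the documents they need. Given the widespread use of search engines, people expect to obtain information more efficiently and effectively, and clustering techniques play an important role here. If documents are properly clustered beforehand, a search system can present more structured results, which reduces the time spent searching and helps users find the documents they want sooner. This study takes Co-Clustering as its base method and improves on it, discussing both the improvement in clustering performance and the raising and lowering of feature weights. Using the Reuters, 20newsgroup, and classic3 data sets, it extracts core keywords and assigns them appropriate weights, thereby filtering out unnecessary noise and strengthening the keywords. Using coordinate information, keyword weights are adjusted on the basis of each core keyword's distance from its cluster center. A logistic function then maps the keyword weights into the range between 0 and 1; after the adjusted weights are assigned to the keywords, Co-Clustering is run again. These steps are repeated until convergence, yielding better clustering results.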
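The weight-adjustment step described above can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's actual formula: the abstract states that weights depend on each keyword's distance to its cluster center and are squashed into the range 0 to 1 by a logistic function, but it does not give the exact mapping. The function name `adjust_term_weights`, the `steepness` parameter, and the standardization of distances by their mean and standard deviation are all assumptions made for illustration.

```python
import math

def adjust_term_weights(term_coords, term_labels, centroids, steepness=1.0):
    """Map each term's distance to its cluster centroid to a weight in (0, 1).

    Hypothetical helper: standardizing the distances before the logistic
    function is an assumption, not something the abstract specifies.
    """
    # Euclidean distance from every term to the centroid of its own cluster.
    dists = [math.dist(p, centroids[c]) for p, c in zip(term_coords, term_labels)]
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 when all distances are equal
    weights = []
    for d in dists:
        z = (mean - d) / std  # positive when the term is closer than average
        weights.append(1.0 / (1.0 + math.exp(-steepness * z)))
    return weights

# Toy embedding: three terms in cluster 0, one term in cluster 1.
coords = [(0.0, 0.0), (0.1, 0.0), (2.0, 0.0), (5.0, 5.0)]
labels = [0, 0, 0, 1]
centroids = {0: (0.0, 0.0), 1: (5.0, 5.0)}
weights = adjust_term_weights(coords, labels, centroids)
```

In this toy example, the term at `(2.0, 0.0)` lies farthest from its centroid and so receives the smallest weight. In the iterative scheme the abstract describes, weights like these would be applied to the term features before the next Co-Clustering pass, and the loop would stop once the assignments no longer change.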

Keywords

Document clustering; document analysis; data mining; co-clustering

Parallel Abstract

This paper proposes a weighted co-clustering algorithm and applies it to the document clustering problem. Weighted co-clustering is an extension of co-clustering that exploits two properties of co-clustering to design a dynamic weighting algorithm for terms. First, co-clustering represents both documents and words in the same coordinate system using a spectral embedding technique. Second, co-clustering clusters documents and words simultaneously, so documents within the same cluster should be grouped together with their corresponding words. Based on these two properties, weighted co-clustering changes term weights iteratively. In addition, this paper proposes an outlier detection mechanism that removes outlier documents from the clustering process. When the clustering process is complete, these outlier documents are assigned to appropriate clusters. We conduct experiments on three data sets, and the results show that weighted co-clustering can effectively improve clustering performance.

Parallel Keywords

Document Clustering; Text Analysis; Information Retrieval; Co-Clustering
