Constructing Personalized Document Clustering Based on User-Defined Labels

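With the rapid growth of the Internet, people increasingly rely on it to find all kinds of information, and it has become a major source of information. Managing large numbers of documents efficiently has therefore become an important issue. Documents have traditionally been organized by hand, which is time-consuming, laborious, and inconsistent in its criteria, so technical support is needed to manage documents effectively. Document clustering is one such technique: it partitions documents into clusters according to their pairwise similarity, aiming for the lowest similarity between clusters and the highest similarity within each cluster. Traditional document clustering, however, has two shortcomings. First, the meaning of each cluster is not clear to the user, because individual clusters cannot be given an explicit semantic description. Second, it cannot produce clustering results suited to different users. This study therefore proposes the LHC (Label Hierarchical Cluster) algorithm to address these shortcomings. LHC not only achieves personalized document clustering but also assigns appropriate labels to individual clusters and produces a hierarchical structure. In the experiments, the subjects were 32 students from the graduate institutes of Information Management and Healthcare Information Management at National Chung Cheng University. Each data set consisted of the literature for one student's thesis proposal, about 10 to 25 documents per data set. The evaluation criteria were Precision, Recall, F1-measure, Purity, Cluster Precision, and Cluster Recall. The experiments covered parameter tuning and observation of the resulting behavior, analysis of the relationship between data set size and accuracy, comparison of LHC with well-known traditional algorithms, and presentation of the hierarchical relationships among clusters. The results show that LHC clearly outperforms the well-known traditional algorithms. Cluster Recall was lower than expected, presumably because of issues such as the LHC parameter settings, which may produce fewer clusters or fail to produce clusters for the labels selected by the subjects; in those cases Cluster Recall is relatively low. The hierarchy results show that most data sets agree with the users' notion of upper and lower levels, while a few differ from the subjects' notion of the hierarchy. Even so, the hierarchical structure remains meaningful, since users can treat the leaf nodes as documents with multi-label classification. Overall, the results verify that the proposed LHC algorithm improves on the shortcomings of earlier clustering approaches and achieves personalized document clustering.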
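As an illustration of the evaluation criteria named above, the following minimal sketch computes Purity and per-cluster Precision, Recall, and F1-measure using their common textbook definitions. It is not the thesis's own evaluation code: the function names, the input layout, and the simplification that each document has a single reference label are assumptions made for this example, and Cluster Precision and Cluster Recall are omitted because the abstract does not define them.

```python
from collections import Counter


def purity(clusters, labels):
    """Purity of a clustering (standard definition).

    clusters: list of lists of document ids (hypothetical layout).
    labels:   dict mapping document id -> one reference label
              (assumption: single label per document).
    """
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    # For each non-empty cluster, count the documents carrying its majority label.
    majority = sum(
        Counter(labels[d] for d in c).most_common(1)[0][1]
        for c in clusters
        if c
    )
    return majority / total


def precision_recall_f1(cluster, relevant):
    """Precision, recall and F1 of one cluster against the set of
    documents the user tagged with the corresponding label."""
    cluster, relevant = set(cluster), set(relevant)
    hits = len(cluster & relevant)
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, invented for illustration only.
clusters = [["d1", "d2", "d3"], ["d4", "d5"]]
labels = {"d1": "A", "d2": "A", "d3": "B", "d4": "B", "d5": "B"}

p, r, f = precision_recall_f1(clusters[0], ["d1", "d2"])
print(purity(clusters, labels), p, r, f)
```

In the toy data, the first cluster contains both documents tagged "A" plus one stray document, so its recall is perfect while its precision is not. A clustering that yields fewer clusters than the user has labels would leave some labels without a matching cluster, which is the kind of situation the abstract suggests may lower Cluster Recall.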

Keywords

hierarchical structure; personalization; document clustering

Parallel Abstract

With the rapid development of the Internet, people increasingly rely on it to find all kinds of information, and it has become an important source of information. Managing a large number of documents efficiently has therefore become an important issue. Documents have traditionally been sorted by hand, but manual sorting is time-consuming, laborious, and inconsistent in its standards, so techniques are needed to manage documents effectively. Document clustering is one way to help manage documents, but traditional document clustering algorithms have two drawbacks. First, users cannot understand the meaning of each cluster. Second, traditional document clustering cannot give appropriate clustering results for different users. This study proposes the LHC (Label Hierarchical Cluster) algorithm to improve on traditional document clustering algorithms. LHC not only achieves personalized document clustering but also gives appropriate labels to individual clusters and, finally, generates a hierarchical structure. In the experiments, the subjects were 32 students of information management at CCU. Each data set consisted of the literature for one student's thesis proposal, with 10 to 25 documents per data set. The experimental results show that the LHC algorithm performs significantly better than the well-known traditional algorithms. Cluster Recall was lower than expected, probably because issues with the LHC parameter settings led to fewer clusters. Furthermore, the document clustering hierarchy shows that most of the data sets conform to the users' notion of upper and lower levels, while a small number do not. Even so, the hierarchical structure is still meaningful, because users can regard the leaf nodes as multi-label classified documents. Overall, the LHC algorithm improves on traditional clustering algorithms and achieves personalized document clustering.

Parallel Keywords

hierarchical structure; personal; document cluster
