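This study builds a system that automatically labels and validates datasets with a hierarchical structure, combining methods from feature selection, semantic similarity analysis, and term association. We collected articles from the ACM Digital Library to form the cluster dataset for this study, analyzed that dataset, and then assessed the results with the evaluation framework. The results show that MMI cluster labels achieve a high prediction rate and precision on two-word (bigram) cluster labels: they predict 76% of the clusters, with an MRR score as high as 0.75, which indicates that the labels produced by this method are well suited to serve as category names for the clusters. RMI, in turn, outperforms MMI on three-word (trigram) cluster labels, so it is well suited to compensate for the conceptual gaps in bigram cluster labels. This study also shows that our proposed methods, MMI and RMI, are effective, reduce the burden on users, and achieve performance close to that of expert classification. These results allow researchers to quickly identify the topics discussed in an academic field and to save the time and effort required when first entering that field. In addition, the automated validation framework evaluates the generated cluster labels, verifying in an automated and objective way how closely the system's cluster labels match the classification labels assigned by experts.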
This study developed an automatic labeling system that generates text labels for hierarchical document clusters. An evaluation framework was also developed to assess the precision of these labels. The label extraction procedure combines techniques originating from feature selection, semantic similarity analysis, term co-occurrence correlation, and automatic labeling systems. The system uses hierarchical datasets collected from the ACM Digital Library as empirical data to evaluate its precision and effectiveness. The experimental results showed that the MMI algorithm can generate precise bigram cluster labels. The MRR of the bigram cluster labels is as high as 0.75, indicating that they are very suitable as category labels. The trigram cluster labels generated by RMI performed better than those generated by MMI, so RMI is suitable for producing labels that supplement the bigram labels. This study also showed that our proposed methods, MMI and RMI, can generate labels that closely match the labels assigned by human experts.
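
For readers unfamiliar with the MRR figure quoted above, the following is a minimal sketch of how a mean reciprocal rank can be computed for ranked cluster labels. It is not the evaluation framework of this study: the function names, the sample labels, and the matching rule (case-insensitive exact match against a single expert label per cluster) are all assumptions made for illustration.

```python
from typing import Sequence


def reciprocal_rank(ranked_labels: Sequence[str], expert_label: str) -> float:
    """Return 1/rank of the first candidate equal to the expert label, else 0.

    Assumption: a candidate matches when it equals the expert label after
    trimming whitespace and lowercasing. The study may use a looser criterion.
    """
    target = expert_label.strip().lower()
    for rank, label in enumerate(ranked_labels, start=1):
        if label.strip().lower() == target:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    predictions: Sequence[Sequence[str]], expert_labels: Sequence[str]
) -> float:
    """Average the reciprocal ranks over all clusters."""
    if len(predictions) != len(expert_labels):
        raise ValueError("one ranked label list is required per expert label")
    if not predictions:
        return 0.0
    total = sum(
        reciprocal_rank(ranked, expert)
        for ranked, expert in zip(predictions, expert_labels)
    )
    return total / len(predictions)


# Hypothetical data: three clusters, each with a ranked list of candidate labels.
sample_predictions = [
    ["information retrieval", "search engine"],  # expert label at rank 1
    ["data mining", "machine learning"],         # expert label at rank 2
    ["computer graphics"],                       # expert label not found
]
sample_experts = ["information retrieval", "machine learning", "operating systems"]

sample_mrr = mean_reciprocal_rank(sample_predictions, sample_experts)
print(sample_mrr)
```

Under this definition, a cluster whose expert label appears first contributes 1, one whose expert label appears second contributes 0.5, and a cluster with no match contributes 0, so a score of 0.75 means the expert label typically sits at or very near the top of the ranked candidates.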