Thesis


Hierarchical Concept-based Documents Clustering by Applying the Latent Semantic Analysis Iteratively

Advisor: 陳宗天


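Text-mining clustering methods that rely on document content consider only how often terms occur and how they are distributed across the corpus. They are also vulnerable to polysemy (one word form with several meanings) and synonymy (several word forms with one meaning), so clustering quality tends to degrade when the documents in a corpus address closely related topics. Latent semantic analysis (LSA) can estimate the knowledge implicit in documents and adequately represent how humans reason from knowledge, so it can reduce the effect of polysemy and synonymy on clustering and yield better clustering results. Without information for estimating a good number of clusters, however, it is still hard to decide how many clusters are appropriate. This study therefore builds on LSA: it uses the scree test to determine the number of dimensions retained in the dimension reduction of the singular value decomposition (SVD), then applies K-means to cluster the corpus, and presents the concept clusters of the document collection as a hierarchy, which helps in understanding the concepts contained in the corpus.

The study first used Intellectual Structurer, an intellectual-structure system developed in-house, to collect corpora on two topics from the Microsoft Academic database. Factor analysis, retaining factors with eigenvalues greater than 1, then produced the intellectual-structure document set for each topic. To compare the proposed approach with general text-mining clustering methods and with the original LSA-based clustering, the documents in all factors produced by the factor analysis were treated as a single corpus, and four clustering methods were applied to it: the proposed LSA hierarchical clustering method, K-means clustering, hierarchical clustering, and LSA combined with K-means. Finally, Jensen-Shannon divergence (JSD) showed that the clusters produced by the LSA hierarchical clustering method have better textual coherence.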
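The pipeline described above (LSA via SVD, a scree-based choice of retained dimensions, K-means, and a hierarchy obtained by clustering each cluster again) can be illustrated with a minimal sketch. It is not the thesis implementation. The toy corpus, whitespace tokenisation, the "largest drop" scree heuristic, the fixed `k=2`, the farthest-point K-means initialisation, and the `min_size` and `max_depth` stopping rules are all assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def tfidf(docs):
    """Documents-by-terms TF-IDF matrix from whitespace-tokenised text."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.split():
            tf[i, index[w]] += 1
    df = (tf > 0).sum(axis=0)
    idf = np.log(len(docs) / df) + 1.0  # +1 keeps terms found in every document
    return tf * idf

def scree_rank(s):
    """Assumed scree heuristic: keep the components before the largest drop."""
    if len(s) < 2:
        return len(s)
    drops = s[:-1] - s[1:]
    return int(np.argmax(drops)) + 1

def kmeans(X, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def ilsh(docs, ids=None, k=2, min_size=3, depth=0, max_depth=3):
    """Re-run LSA + K-means inside each cluster to build a hierarchy."""
    ids = list(range(len(docs))) if ids is None else ids
    node = {"docs": ids, "children": []}
    if len(ids) < min_size or depth >= max_depth:
        return node
    A = tfidf([docs[i] for i in ids])       # re-weight terms within this cluster
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    r = scree_rank(s)
    X = U[:, :r] * s[:r]                     # documents in the reduced latent space
    labels = kmeans(X, k)
    groups = [[ids[i] for i in np.flatnonzero(labels == j)] for j in range(k)]
    groups = [g for g in groups if g]
    if len(groups) < 2:                      # no real split: stop at this node
        return node
    node["children"] = [ilsh(docs, g, k, min_size, depth + 1, max_depth)
                        for g in groups]
    return node

# Hypothetical toy corpus, for illustration only.
docs = [
    "latent semantic analysis svd",
    "svd latent semantic indexing",
    "semantic analysis latent svd indexing",
    "citation network factor analysis",
    "factor citation network cocitation",
    "cocitation citation factor network",
]
tree = ilsh(docs)
print(tree)
```

Each node records the document indices it covers, and its children partition those indices. Recomputing TF-IDF and the SVD inside every cluster is what makes the procedure iterative: the latent space at a deeper level reflects only the vocabulary of that sub-cluster.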


Most current text clustering methods are based on text mining techniques that characterize documents using features derived from term frequency and inverse document frequency (TF-IDF). However, TF-IDF-based methods do not handle polysemy and synonymy, which may degrade the clustering result. Latent Semantic Analysis (LSA) exploits the semantic structure underlying the correlations among terms in documents to reduce the problems of synonymy and polysemy. LSA has been used in conjunction with a flat clustering method (K-means) to obtain better clustering results. We combine LSA with K-means iteratively to derive hierarchical document clusters, in which a document cluster ascribed to a general concept can be divided into several sub-clusters denoted by more specific concepts. The resulting hierarchy corresponds to an epistemic concept hierarchy that eludes conventional hierarchical clustering methods. We applied this Iterative Latent Semantic Hierarchical (ILSH) clustering method to two document corpora and compared its results with those of the K-means and hierarchical clustering methods. We also used Jensen-Shannon divergence (JSD) to compare the textual coherence of the clustering results. The ILSH method produced document clusters with higher textual coherence than the other methods. It also produced a corresponding concept hierarchy, which could be used to represent the ontological knowledge of a domain.
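To make the JSD-based comparison concrete, the sketch below computes the Jensen-Shannon divergence between term distributions and uses it to score one cluster. The abstract does not say how the thesis aggregates JSD into a coherence score, so the choice here (the mean divergence between each document and the cluster's pooled term distribution, where lower means more homogeneous) is an assumption, as are the helper names and the toy clusters.

```python
import math
from collections import Counter

def term_distribution(text, vocab):
    """Relative term frequencies of a non-empty text over a fixed vocabulary."""
    counts = Counter(text.split())
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total for w in vocab]

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded in [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cluster_divergence(cluster_docs):
    """Assumed score: mean JSD between each document and the pooled cluster."""
    vocab = sorted({w for d in cluster_docs for w in d.split()})
    pooled = term_distribution(" ".join(cluster_docs), vocab)
    return sum(jsd(term_distribution(d, vocab), pooled)
               for d in cluster_docs) / len(cluster_docs)

# Hypothetical clusters: one topically tight, one mixing two topics.
tight = ["lsa svd", "svd lsa"]
mixed = ["lsa svd", "citation factor"]
print(cluster_divergence(tight), cluster_divergence(mixed))
```

Under this scoring, a clustering whose clusters have a lower average divergence would be read as more textually coherent. The mixture `m` is never zero where `p` or `q` is positive, so the divergence stays finite even when two documents share no terms.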




