
以核運算方法與LDA主題模型產生文字標籤之比較研究

A Comparative Study of Automatic Text Labeling Using Von Neumann Kernel and LDA Topic Model

Advisor: 陳宗天

Abstract


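The large body of literature in an academic field can be grouped into document clusters according to the relationships among documents or the similarity of their content. To help researchers grasp the concept each cluster expresses, an Automatic Labeling System analyzes the text of each cluster and automatically generates labels for clusters of academic documents. In recent years, Latent Dirichlet Allocation (LDA) has also been widely applied across fields to produce topic models through statistical and probabilistic analysis. Starting from document clusters in a specific domain, this study uses LDA to uncover the probability-distribution characteristics of the documents in each cluster; building a topic model yields the model's probability parameters, from which the topic keyword distribution of each cluster is derived and composed into labels. To test the accuracy of the cluster labels predicted by the LDA topic model, this study adopts the label evaluation framework proposed by Treeratpituk to assess the quality of the labels the LDA system composes. Performance and experimental data are recorded for each combination of LDA parameter settings and used to analyze the system's cluster-labeling accuracy. The results are then compared with the labels produced by the keyword extraction technique of the Automatic Labeling System, so that the more accurate of the two methods can be identified and adopted. The results show that, with 4 clusters and the number of topics ranging from 4 to 50, setting the number of topic words to 30 performs best under the Precision and MTRR label evaluation measures, and label performance declines slightly as the number of topic words increases. The keyword phrases produced by the Automatic Labeling System and by the LDA topic model are both related to the cluster names to some degree, and each has its own explanatory power for the clusters. The overall cluster-label scores of the two systems are 8.65375 and 6.59098, with the labels generated by the LDA topic model system receiving the higher label quality score and therefore the higher cluster-label naming accuracy.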
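The abstract names Precision and MTRR as evaluation measures but does not define them. The sketch below assumes the common reading: precision at k is the share of the top k candidate labels judged correct, and MTRR (mean total reciprocal rank) averages, over clusters, the sum of 1/rank for every correct label in the top k. The function names, the sample labels, and the set of "correct" labels are hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], correct: Set[str], k: int) -> float:
    """Share of the top-k candidate labels that are judged correct."""
    if k <= 0:
        return 0.0
    hits = sum(1 for label in ranked[:k] if label in correct)
    return hits / k


def total_reciprocal_rank(ranked: Sequence[str], correct: Set[str], k: int) -> float:
    """Sum of 1/rank over every correct label found within the top k."""
    return sum(
        1.0 / rank
        for rank, label in enumerate(ranked[:k], start=1)
        if label in correct
    )


def mean_total_reciprocal_rank(
    clusters: List[Tuple[Sequence[str], Set[str]]], k: int
) -> float:
    """Average the total reciprocal rank over all clusters."""
    if not clusters:
        return 0.0
    scores = [total_reciprocal_rank(ranked, correct, k) for ranked, correct in clusters]
    return sum(scores) / len(scores)


# Hypothetical ranked label candidates and judged-correct labels for two clusters.
cluster_a = (["topic model", "neural network", "text labeling", "graph"],
             {"topic model", "text labeling"})
cluster_b = (["hashing", "clustering"], {"clustering"})

p_at_3 = precision_at_k(cluster_a[0], cluster_a[1], k=3)
trr_a = total_reciprocal_rank(cluster_a[0], cluster_a[1], k=3)
mtrr = mean_total_reciprocal_rank([cluster_a, cluster_b], k=3)
```

Total reciprocal rank rewards every correct label in the list, weighted by position, so a system that places several good labels near the top scores higher than one that places a single good label first.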

Parallel Abstract


Tools and techniques exist that can group large numbers of documents into cohesive clusters based on relatedness or similarity metrics between the documents. The resulting clusters need to be properly labeled to facilitate a fast and holistic comprehension of the main themes or topics they carry. Earlier systems have employed various theoretical or empirical approaches to label clusters of documents automatically. Our study applies Latent Dirichlet Allocation (LDA) to obtain the most likely keywords for topics in the document clusters. The obtained keywords are then composed into key phrases that serve as representative labels of the clusters. The appropriateness of the labels is evaluated using the evaluation framework proposed by Treeratpituk. We found that the LDA-based automatic labeling system generates proper cluster labels. We also compared the effectiveness of the LDA-based labeling system with that of our home-grown kernel-based system. In most cases in the experiment, the LDA-based system generated better cluster labels than our kernel-based system.
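As a rough illustration of the keyword step, the sketch below picks the most probable words from a topic's word distribution. It assumes the LDA model has already been fitted and that each topic is available as a word-to-probability mapping; the function names and the probabilities shown are hypothetical. How the thesis composes keywords into key phrases is not described in the abstract, so that step is left out.

```python
from typing import Dict, List


def top_keywords(topic_word_probs: Dict[str, float], n: int) -> List[str]:
    """Return the n most probable words of one topic.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(topic_word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def keywords_per_topic(topics: List[Dict[str, float]], n: int) -> List[List[str]]:
    """Apply top_keywords to every topic of a fitted model."""
    return [top_keywords(topic, n) for topic in topics]


# Hypothetical word distributions for two topics (not data from the thesis).
topics = [
    {"kernel": 0.10, "topic": 0.40, "label": 0.30, "cluster": 0.20},
    {"hash": 0.50, "document": 0.25, "similarity": 0.25},
]

keywords = keywords_per_topic(topics, n=2)
```

The parameter `n` plays the role of the number of topic words that the study varies in its experiments.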

References


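林佳宜. (2008). 相關文件群集之階層式自動標籤 [Hierarchical automatic labeling of related document clusters] (Master's thesis). National Taipei University, New Taipei City.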
Anthes, G. (2010). Topic models vs. unstructured data. Communications of the ACM, 53(12), 16-18. doi: 10.1145/1859204.1859210
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
Ferrer i Cancho, R., & Solé, R. V. (2001). The small world of human language. Proceedings of the Royal Society of London. Series B: Biological Sciences, 268(1482), 2261-2265. doi: 10.1098/rspb.2001.1800
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235. doi: 10.1073/pnas.0307752101

Cited by


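林盈萱 (2016). 主題模型集群與鏈結導出集群之一致性檢定 [Consistency test between topic-model clusters and link-derived clusters] (Master's thesis, National Taipei University). Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0023-1303201714253370
李治平 (2016). 應用區域敏感雜湊對文獻進行分類之研究 [A study of document classification using locality-sensitive hashing] (Master's thesis, National Taipei University). Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0023-1303201714253371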
