Application of Latent Class Analysis to Text Mining

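Now that internet use is mainstream, websites hold large amounts of textual information, and text mining has consequently become a popular method of data analysis. Latent class analysis (LCA) is a method commonly used in the social sciences to uncover the latent classes underlying a data set. In this paper, we apply LCA to text data in order to evaluate its feasibility for text mining. Two case studies are presented: the first is a similarity detection task comparing Water Margin (水滸傳) and Romance of the Three Kingdoms (三國演義); the second is a classification task on news articles, in which keywords are identified. Conclusions and suggestions are offered on the basis of these results.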

Keywords

Classification; latent class analysis; text mining; similarity detection

Parallel Abstract

A large amount of the information on websites is in text form, and with the growth of internet usage, text mining has become a popular method for information retrieval. In this paper, we apply latent class analysis (LCA), a technique often used in the social sciences to reveal underlying latent classes, to text mining and examine whether it is an appropriate method for this purpose. Two case studies are presented: one is similarity detection, comparing two novels, Water Margin and Romance of the Three Kingdoms; the other is classification, assigning news articles to categories in order to find important keywords. Conclusions and suggestions are provided.

Parallel Keywords

Classification; Latent class analysis; Similarity detection; Text mining

