Semi-Supervised Document Clustering Based on Constrained-PLSA

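The volume of data on the Web is enormous, and large amounts of unlabeled data are easy to obtain. Supervised learning methods, however, need a sufficient amount of labeled data to train a classification model, and labeling data usually costs considerable manpower and time. Unsupervised learning methods need no labeled data, but users often have some background knowledge before clustering; in principle, this knowledge should be fed into the system so that it can cluster quickly and effectively. This paper therefore adds a small amount of labeled data and uses this known information to obtain better results without requiring much human effort in clustering. We propose Constrained-PLSA, a semi-supervised learning algorithm that integrates a small amount of label information and uses the labeled data to guide the unlabeled data in the right direction, thereby improving clustering performance. Experimental results show that a small amount of labeled data is enough for Constrained-PLSA to achieve stable and good results. This paper also uses Constrained-PLSA to study tag analysis, with experiments on a data set of academic papers. Each article in this data set carries two kinds of information, an abstract and tags; the tags are keywords assigned by users after reading the article, so they are an important source of information. This paper identifies four ways to combine abstracts and tags: Words only, Tags only, Words+Tags, and Tags as words. Experiments are run with these combinations, and different clustering algorithms are used to analyze under which combination tags yield the largest performance improvement. These experiments also show that Constrained-PLSA can effectively improve clustering performance with a small amount of labeled data.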

Keywords

semi-supervised learning; machine learning; tag analysis

Parallel Abstract

Text classification is of great practical importance today, given the massive volume of text available online. Supervised learning is one of the popular techniques for tackling text classification problems, but it requires a sufficient amount of labeled data. Labeling must typically be done manually and is a time-consuming process, whereas unlabeled data is generally easy to collect. Unsupervised learning methods do not need any labeled data, but users often have some background knowledge before clustering, and in practice this knowledge should be incorporated into the algorithm to improve clustering accuracy. This paper extends the PLSA clustering model to propose Constrained-PLSA, a semi-supervised learning algorithm. Constrained-PLSA assumes that the data is generated by a mixture model and that the correspondence between documents and class labels is one to one. By introducing seed documents as constraints, we show that Constrained-PLSA can compute maximum-likelihood estimates for the latent variable model using the Expectation Maximization (EM) algorithm. Experimental results show that Constrained-PLSA can effectively improve clustering performance with only a small number of labeled examples. In addition, this paper discusses tag usage using Constrained-PLSA. The experiments use a data set of academic papers, in which each paper consists of an abstract and tag information. Tags are assigned by users after reading the article. This paper analyzes four combinations of abstracts and tags: “words only”, “tags only”, “words + tags”, and “tags as words”, and reports the best-performing combination. The experimental results also show that Constrained-PLSA outperforms the other clustering algorithms compared.
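The abstract describes the idea only at a high level, so the following is a minimal sketch of how seed documents could constrain PLSA's EM updates, not the authors' actual formulation. It assumes the asymmetric PLSA parameterization P(w|d) = Σ_z P(w|z) P(z|d), assumes the seed constraint is imposed by clamping P(z|d) of each seed document to its known class after every update, and assumes P(w|z) is initialized from the seed documents' word counts. The function name `constrained_plsa`, its parameters, and the toy corpus are all hypothetical.

```python
import numpy as np


def constrained_plsa(counts, n_topics, seeds, n_iter=50, rng_seed=0):
    """EM for PLSA with seed documents clamped to their known class.

    counts : (n_docs, n_words) array of term counts n(d, w)
    seeds  : dict mapping a document index to its known class index
    Returns P(z|d) with shape (n_docs, n_topics) and P(w|z) with
    shape (n_topics, n_words).
    """
    counts = np.asarray(counts, dtype=float)
    n_docs, n_words = counts.shape
    rng = np.random.default_rng(rng_seed)
    eps = 1e-12  # guards against division by zero

    def clamp(p_z_d):
        # Assumed constraint: a seed document belongs to exactly one class.
        for d, z in seeds.items():
            p_z_d[d] = 0.0
            p_z_d[d, z] = 1.0
        return p_z_d

    # Unlabeled documents start undecided; seeds start at their label.
    p_z_d = clamp(np.full((n_docs, n_topics), 1.0 / n_topics))

    # P(w|z) starts from smoothed seed counts, plus a little jitter so
    # that classes without any seed are not exactly identical.
    p_w_z = 1.0 + 0.01 * rng.random((n_topics, n_words))
    for d, z in seeds.items():
        p_w_z[z] += counts[d]
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)

        # M-step: re-estimate both distributions from expected counts.
        expected = counts[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

        # Re-impose the seed constraints after every update.
        p_z_d = clamp(p_z_d)

    return p_z_d, p_w_z


# Made-up toy corpus: documents 0-2 use words 0-2, documents 3-5 use words 3-5.
toy_counts = np.array([
    [4, 3, 2, 0, 0, 0],
    [3, 4, 1, 0, 0, 0],
    [2, 2, 5, 0, 0, 0],
    [0, 0, 0, 5, 2, 3],
    [0, 0, 0, 3, 3, 3],
    [0, 0, 0, 1, 4, 4],
])
toy_seeds = {0: 0, 3: 1}  # one labeled document per class
p_z_d, p_w_z = constrained_plsa(toy_counts, n_topics=2, seeds=toy_seeds)
print(p_z_d.argmax(axis=1))  # cluster assignment per document
```

Clamping is only one way to encode the one-to-one correspondence between a seed document and its class; a softer prior on P(z|d) would be another option. With a single seed per class, as in the toy corpus, the labeled documents fix which latent class corresponds to which label, which plain PLSA cannot determine.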

Parallel Keywords

PLSA; machine learning; semi-supervised learning

