Analyzing User Behavior through Clustering with Automatic Tagging

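Clustering has been studied extensively in machine learning. Most clustering methods concentrate on clustering quality; because no labeled data are available, they cannot assign a meaningful tag to each cluster. A few document clustering methods use word occurrence counts and, after clustering, take the most frequent words in a cluster as its tag, but little comparable work exists for other types of data. This thesis proposes a framework that integrates user behavior with tags, so that suitable tags are assigned once clustering is done and users no longer need to inspect each cluster individually to obtain them.

The framework also addresses two further problems that clustering methods commonly face. First, many algorithms, such as k-means, require the number of clusters in advance. When data are collected over a small scope, the number of clusters may be known beforehand, but in the era of big data new categories can appear at any time. The framework therefore uses a clustering method that lets the data determine the number of clusters, which makes it better suited to clustering applications. Second, data about user behavior tend to contain missing values. This thesis uses matrix factorization to estimate the missing values, which broadens the range of problems that can be handled. Finally, experiments use the number of times each user listened to each artist to identify groups corresponding to music genres and the group each user belongs to. Scores on both clustering and tagging show that the framework is better suited to this problem than earlier methods.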
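To make the missing-value step concrete, the following is a minimal sketch of matrix factorization trained by stochastic gradient descent on the observed cells only. It is an illustration under stated assumptions, not the thesis's implementation: the function names (`factorize`, `mse`, `fill_missing`), the hyperparameters, and the convention of marking missing cells with `None` are all hypothetical.

```python
import random


def factorize(counts, n_factors=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Fit counts[u][i] ~ dot(P[u], Q[i]) using only observed (non-None) cells.

    Hypothetical helper: hyperparameters are illustrative, not from the thesis.
    """
    rng = random.Random(seed)
    n_users, n_items = len(counts), len(counts[0])
    P = [[rng.uniform(0.1, 1.0) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(n_factors)] for _ in range(n_items)]
    observed = [
        (u, i, counts[u][i])
        for u in range(n_users)
        for i in range(n_items)
        if counts[u][i] is not None
    ]
    for _ in range(steps):
        for u, i, r in observed:
            err = r - sum(P[u][k] * Q[i][k] for k in range(n_factors))
            for k in range(n_factors):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on squared error with L2 regularization.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    """Estimated value for user u and item i from the latent factors."""
    return sum(p * q for p, q in zip(P[u], Q[i]))


def mse(counts, P, Q):
    """Mean squared error over the observed cells only."""
    errs = [
        (counts[u][i] - predict(P, Q, u, i)) ** 2
        for u in range(len(counts))
        for i in range(len(counts[0]))
        if counts[u][i] is not None
    ]
    return sum(errs) / len(errs)


def fill_missing(counts, P, Q):
    """Keep observed cells as they are; replace None with the model estimate."""
    return [
        [v if v is not None else predict(P, Q, u, i) for i, v in enumerate(row)]
        for u, row in enumerate(counts)
    ]
```

Given a small user-by-artist table of (rescaled) listening counts with `None` for unobserved pairs, `P, Q = factorize(counts)` followed by `fill_missing(counts, P, Q)` returns a complete table. The rows of `P` can also serve as per-user latent factors for a later clustering step.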

Keywords

clustering; tagging; nonparametric

Parallel Abstract

Over the past decades, a large body of research on clustering has been conducted, and many clustering algorithms have been applied across application domains. However, most of this work focuses on improving clustering performance without defining what the clusters mean, since no labeled data are available from which to infer their meaning. Some document clustering algorithms can use word features, such as frequencies, probabilities, or topic models, to give clusters appropriate tags, but this technique does not carry over to other domains. This thesis proposes a framework that clusters users according to their behavior and tags the clusters automatically. The framework comprises three stages: latent factor discovery, clustering, and tagging. In most application settings the number of clusters is unknown, especially when the data set is very large, so the thesis proposes using nonparametric clustering algorithms in the second stage; the distance dependent Chinese restaurant process (DDCRP) is used in the experiments. The output of the framework is a set of clusters, each associated with tags. We conduct experiments on three data sets, compare against several algorithms, and evaluate both clustering performance and tag accuracy. The experimental results indicate that the proposed approach works well and outperforms the other algorithms in most experiments.
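The clustering and tagging stages can be sketched as follows. This is only an illustration: it swaps in DP-means, a simple nonparametric variant of k-means that opens a new cluster whenever a point is too far from every existing centroid, in place of the DDCRP used in the thesis, and it tags each cluster by majority vote over per-user tag lists. The function names, the threshold `lam`, and the `user_tags` input are hypothetical; the abstract does not specify how tags are derived.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dp_means(points, lam, max_iter=100):
    """DP-means clustering (swapped in for DDCRP; an assumption of this sketch).

    Works like k-means, except that a point whose squared distance to every
    centroid exceeds `lam` starts a new cluster, so the data decide the
    number of clusters.
    """
    n, dim = len(points), len(points[0])
    centroids = [[sum(p[d] for p in points) / n for d in range(dim)]]
    labels = [0] * n
    for _ in range(max_iter):
        changed = False
        for idx, p in enumerate(points):
            dists = [sq_dist(p, c) for c in centroids]
            best = min(range(len(centroids)), key=dists.__getitem__)
            if dists[best] > lam:
                centroids.append(list(p))  # open a new cluster at this point
                best = len(centroids) - 1
            if labels[idx] != best:
                labels[idx] = best
                changed = True
        # Recompute centroids, dropping clusters that ended up empty.
        members = {}
        for idx, lab in enumerate(labels):
            members.setdefault(lab, []).append(points[idx])
        old_ids = sorted(members)
        remap = {old: new for new, old in enumerate(old_ids)}
        centroids = [
            [sum(p[d] for p in members[old]) / len(members[old]) for d in range(dim)]
            for old in old_ids
        ]
        labels = [remap[lab] for lab in labels]
        if not changed:
            break
    return labels, centroids


def tag_clusters(labels, user_tags, top_n=1):
    """Tag each cluster with the tags most common among its members.

    `user_tags[i]` is a hypothetical list of tags attached to user i.
    """
    counters = {}
    for lab, tags in zip(labels, user_tags):
        counters.setdefault(lab, Counter()).update(tags)
    return {lab: [t for t, _ in c.most_common(top_n)] for lab, c in counters.items()}
```

Here `points` would be one latent-factor vector per user from the first stage. The threshold `lam` controls granularity: a larger value yields fewer, broader clusters, while a smaller value lets more clusters form.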

Parallel Keywords

clustering; tagging; nonparametric; labeling
