A User Interest Mining Technique Based on Content Analysis of Social Network Fan Pages

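In this thesis, we propose a method that learns a topic model from an analysis of social network fan pages and applies it to the text and image content that ordinary users share on social networking sites, in order to find each user's interest distribution; this distribution can be used for precisely targeted ad recommendation. The framework has three parts: preprocessing and feature extraction of text and image content, training of a Labeled Topic Space, and user interest discovery. The study uses the text and image content of Facebook fan pages that carry topic labels as training data. First, text content is segmented into words, stop words are removed, and keywords are extracted; image content goes through feature point detection, clustering, and key cluster extraction. These steps produce the words and visual words that a topic model can process. A filtering method then removes posts whose topic is unclear. Next, the vocabulary of each fan page is assembled into one text document and one image document, and each is used to train an LDA (Latent Dirichlet Allocation) topic model. For every fan page, LDA yields one topic distribution for the text part and one for the image part. We then find the dimension with the highest value in each topic distribution. Fan pages that share a topic label vote, the dimension with the most votes is selected, and the label of those fan pages is assigned to that dimension. We then check whether every dimension has received a unique label; if not, the LDA hyperparameter (the Dirichlet parameter) is adjusted and training is repeated until every dimension has a unique label. After training, each dimension of the topic space has a concrete topic. Finally, the model is applied to ordinary users' social network data: a user's text and image content is assembled into documents and passed through the trained topic model to obtain a personalized interest distribution, in which the value of each dimension is the user's degree of preference for a particular interest. This completes the discovery of a user's interest distribution from an analysis of their social network content. Experimental results demonstrate the feasibility of this improved labeled LDA topic model architecture in practical applications. The study makes four main contributions: it effectively solves the problem that conventional unsupervised LDA cannot construct a topic space with concrete topics; it automatically selects, between text and images, the data modality whose topic is clearer for training, so that each modality compensates for the other's weaknesses and the advantages of multimedia are fully used; it replaces the current reliance on empirical hyperparameter values with a method that selects suitable parameters automatically; and it can be applied to the everyday language and photos shared by ordinary users, with classification accuracy comparable to that obtained on precisely worded news content.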
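The voting step described in the abstract can be illustrated with a minimal sketch. It assumes that each fan page already has a topic distribution from LDA (a plain list of floats here) and exactly one topic label. The function names `assign_labels_by_voting` and `labels_are_unique`, the data layout, and the requirement that the number of labels equals the number of topic dimensions are illustrative assumptions, not details taken from the thesis.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def assign_labels_by_voting(
    page_topics: Dict[str, List[float]],
    page_labels: Dict[str, str],
) -> Dict[str, int]:
    """Map each topic label to the dimension its fan pages vote for most often.

    Hypothetical helper: every page votes for the highest-valued dimension of
    its own topic distribution, and votes are counted per label.
    """
    votes: Dict[str, Counter] = defaultdict(Counter)
    for page, dist in page_topics.items():
        # The page's vote is the index of its largest topic weight.
        top_dim = max(range(len(dist)), key=dist.__getitem__)
        votes[page_labels[page]][top_dim] += 1
    # For each label, keep the dimension that received the most votes.
    return {label: counter.most_common(1)[0][0] for label, counter in votes.items()}


def labels_are_unique(label_to_dim: Dict[str, int], num_topics: int) -> bool:
    """Check that every dimension received exactly one label (assumed criterion)."""
    return sorted(label_to_dim.values()) == list(range(num_topics))


if __name__ == "__main__":
    # Toy distributions, invented purely for illustration.
    topics = {
        "page_a": [0.7, 0.2, 0.1],
        "page_b": [0.6, 0.3, 0.1],
        "page_c": [0.1, 0.8, 0.1],
        "page_d": [0.2, 0.1, 0.7],
    }
    labels = {"page_a": "sports", "page_b": "sports", "page_c": "food", "page_d": "travel"}
    mapping = assign_labels_by_voting(topics, labels)
    print(mapping, labels_are_unique(mapping, num_topics=3))
```

If two labels end up on the same dimension, `labels_are_unique` returns `False`, which corresponds to the case in which the abstract says the hyperparameter is adjusted and training is repeated.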

Keywords

Topic model; interest mining; social media analysis

Parallel Abstract

This thesis presents a model that learns a labeled topic space from the analysis of social network fan pages and applies it to ordinary users' posts to mine their interest distributions. The distribution can be used for personalized ad recommendation. The framework consists of three steps: preprocessing and feature extraction, Labeled Topic Space learning, and user interest mining. The study uses Facebook fan pages that have topic labels as the training data. First, text posts are segmented, stop words are removed, and keywords are extracted. Similarly, image content goes through feature detection, clustering, and key visual word extraction. Noisy and ambiguous posts are then filtered out. To improve LDA performance, the posts of each fan page are aggregated into one text document and one photo document, and the LDA (Latent Dirichlet Allocation) model is run on each. For every fan page, LDA outputs a topic distribution for the text document and a topic distribution for the photo document. Next, the highest-valued dimension of each distribution is found. Fan pages sharing the same topic label vote for the highest-valued dimensions of their own distributions, and the dimension with the most votes is assigned the topic label of these fan pages. We then check whether each dimension has a unique label. If not, the LDA hyperparameter (the Dirichlet parameter) is adjusted and LDA is run again, until every dimension has a unique label. The result is a topic space in which each dimension has a specific and meaningful topic label. When the trained model is applied to an ordinary user's posts, it yields a personal interest distribution in which the value of each dimension represents the user's preference for a certain topic. The experimental results show that the improved model can effectively mine users' interests.
The study makes four main contributions: it solves the problem that conventional unsupervised LDA cannot reveal the specific meaning of each dimension of the topic space; it proposes a method that selects, between texts and photos, the posts that better explain the topic, taking advantage of multimedia data; it chooses appropriate parameters automatically; and it can be applied to real data shared by users, with results comparable to those obtained on news data.
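The "adjust the hyperparameter and run LDA again" loop can be sketched as a simple search. The abstract does not say how the Dirichlet parameter is adjusted, so this sketch assumes a fixed list of candidate values tried in order. The function `train_until_unique` and its two callbacks are hypothetical placeholders: `train` stands in for an actual LDA training run, and `is_uniquely_labeled` stands in for the voting and uniqueness check.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

PageTopics = Dict[str, List[float]]  # fan page name -> topic distribution


def train_until_unique(
    train: Callable[[float], PageTopics],
    is_uniquely_labeled: Callable[[PageTopics], bool],
    candidate_alphas: Iterable[float],
) -> Optional[Tuple[float, PageTopics]]:
    """Return the first (alpha, distributions) pair that passes the label check.

    Assumption: candidate values are tried in the order given; the thesis may
    use a different adjustment rule.
    """
    for alpha in candidate_alphas:
        page_topics = train(alpha)  # stand-in for one LDA training run
        if is_uniquely_labeled(page_topics):
            return alpha, page_topics
    return None  # no candidate produced a uniquely labeled topic space


if __name__ == "__main__":
    # Toy stand-ins, invented for illustration only.
    def fake_train(alpha: float) -> PageTopics:
        if alpha < 0.5:
            return {"p1": [0.9, 0.1], "p2": [0.8, 0.2]}
        return {"p1": [0.9, 0.1], "p2": [0.3, 0.7]}

    def distinct_peaks(page_topics: PageTopics) -> bool:
        peaks = [max(range(len(d)), key=d.__getitem__) for d in page_topics.values()]
        return len(set(peaks)) == len(peaks)

    print(train_until_unique(fake_train, distinct_peaks, [0.1, 0.5, 1.0]))
```

Passing the training run and the check as callbacks keeps the loop independent of any particular LDA library; returning `None` makes the case where no candidate works explicit.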

Parallel Keywords

Topic model; interest mining; social media analysis
