A Study of the Association Between Document Importance and LDA Topic Distribution Probabilities

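Previous research has offered preliminary evidence that clusters derived from the link relationships between documents are consistent with clusters derived from document content. In other words, relatedness between documents also implies similarity in their content. That research took document clusters as its unit of analysis. We want to know whether the important documents identified from the topology of reference relationships among documents on a research topic are also more relevant to that topic. Here the unit of analysis for both document importance and topic relevance is the individual document. To test whether document importance also implies higher relevance to a topic, we apply link analysis to derive importance and use LDA to model topic relevance. Topic relevance is derived by computing the probabilities of important terms with text mining methods, while document importance is derived from the links between documents within a corpus. This study explores whether the importance derived from reference links is associated with the topic probability distribution derived from document text. If the topic probability distribution of a document can be shown to correlate with its importance, this shows from a new angle that document link relationships are consistent with document topics. We first used the Intellectual Structurer, a system developed in our laboratory, to collect corpora on three topics from Microsoft's academic database. The collected citation information helped us find an appropriate threshold. We then ran link analysis (PageRank and HITS) on each corpus to obtain an importance weight for every document, kept the higher-weighted documents within the threshold, and used the Latent Dirichlet Allocation (LDA) model to compute their LDA topic distribution probabilities. We cross-checked the information from LDA against the results of the K-means method. After this check, we ran Spearman's rank correlation coefficient tests on three measures: the citation count of each document from citation analysis, the weight of each document from link analysis, and the LDA topic distribution probability of each document. The results show a significant correlation between the link-analysis weights and document topics.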
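The link-analysis step described above can be illustrated with a minimal PageRank sketch. This is not the Intellectual Structurer's implementation: the function name `pagerank`, the damping factor of 0.85, the convergence tolerance, and the four-document toy corpus are all assumptions chosen for illustration. HITS is omitted for brevity.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank.

    links maps each document to the list of documents it cites.
    The damping factor and tolerance are illustrative assumptions.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by documents that cite nothing is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each citing document splits its rank evenly among the documents it cites.
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical toy corpus: A, B and C cite D; C also cites A; D cites nothing.
toy_citations = {"A": ["D"], "B": ["D"], "C": ["D", "A"], "D": []}
scores = pagerank(toy_citations)

# Documents ordered from most to least important; a threshold would keep a prefix.
ranked_documents = sorted(scores, key=scores.get, reverse=True)
print(ranked_documents)
```

In this toy graph the most-cited document D ranks first, and keeping only the leading entries of `ranked_documents` mirrors the step of retaining the higher-weighted documents before topic modeling.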

Keywords

HITS; PageRank; link analysis; citation analysis; text mining; LDA

Parallel Abstract

Recent research has found that the relatedness between documents implies their content similarity. Relatedness metrics are derived from the linkage relationships between documents, such as citation and co-citation relationships. Content similarity is represented by the cosine values between their VSM (Vector Space Model) document vectors. Linkage relationships between documents have also been used to derive importance metrics through link analysis algorithms such as PageRank and HITS. Past research has already shown that linkage between documents is correlated with their content similarity. We speculate that the linkage-derived importance metrics may also correlate with content-derived topics. We collected document corpora for three distinct research fields from the Microsoft Academic Research academic database and calculated the importance metrics of these corpora using PageRank and HITS. We kept the 100 to 200 documents with the highest importance metrics from each corpus and modeled the latent topics of those documents using the Latent Dirichlet Allocation (LDA) model. The probabilities that documents belong to a topic form a ranked list. Through Spearman's rank correlation analysis, we found that the sorted topic-membership probabilities correlated with the importance metrics of these documents. This result implies that the more important a document is, the more relevant it is to a topic. We found preliminary evidence that the importance of a document implies its relevance to a topic.
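The final correlation step can be sketched as follows. This is a minimal pure-Python version of Spearman's rank correlation (Pearson correlation of average ranks, so ties are handled), not the study's actual analysis. The helper names and the five importance scores and topic probabilities are hypothetical values invented for illustration, and no significance test is included.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data for five documents (not taken from the study).
importance = [0.40, 0.25, 0.20, 0.10, 0.05]         # e.g. link-analysis weights
topic_probability = [0.90, 0.70, 0.75, 0.30, 0.10]  # e.g. LDA topic probabilities

rho = spearman(importance, topic_probability)
print(round(rho, 3))
```

For these invented values only one adjacent pair of documents swaps order between the two lists, so rho comes out at approximately 0.9. A real analysis would also report a p-value, which this sketch leaves out.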

Parallel Keywords

HITS; PageRank; Latent Dirichlet allocation; correlation


