Applying a Distributed Architecture to Data Analysis for Knowledge Management Systems

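Besides good search, a knowledge management system (KMS) also needs a good classification mechanism. Analyzing and classifying documents manually consumes human resources. When a document is to be uploaded to a KMS, the importance of its keywords must be analyzed before a classification method can categorize the document. Before keyword weights are computed, the document must be preprocessed: inflected word forms are normalized, and words that have no bearing on the document are removed. We use a stemming algorithm to reduce words to their root forms, and WordNet lookups to remove useless words. We then use the Jena RDF API to write the document data into RDF, and compute the weight of each word in each document with TF-IDF. Processing a large number of documents on a single computer takes a long time because of the heavy workload, so we use the Hadoop distributed architecture to spread the work across multiple computers. We also use the Pig data analysis platform, because Pig is well suited to data analysis computations and optimizes MapReduce execution. Experimental results show that, for large amounts of data, the distributed approach does substantially reduce computation time.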

Keywords

Knowledge management system; Stemming; TF-IDF; WordNet; Hadoop; Jena; Pig Latin

Parallel Abstract

A knowledge management system should provide both search and classification functionality. Analyzing documents manually is labor-intensive. Before a document is uploaded to a knowledge management system, it must be analyzed to determine the importance of its keywords; the document can then be assigned to a category by a classification method. Before this analysis, words with different derivative forms should be stemmed and meaningless stop words deleted. The Snowball stemming algorithm was used to reduce each word to its root form, and WordNet was used to extract the relevant words in the documents. The extracted keywords are stored in RDF format using the RDF API of Jena, and the word weights are then calculated with TF-IDF. Processing a large number of documents on a single computer is time-consuming, so we used the Hadoop distributed architecture to distribute the processing across multiple computers. Pig was used for data analysis because it optimizes MapReduce execution. Our experimental results show that the distributed approach significantly reduces processing time for large amounts of data.
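As a rough illustration of the preprocessing and weighting steps described in the abstract, the following is a minimal single-machine sketch in Python. It is not the implementation used in this work: `crude_stem` is a toy suffix stripper standing in for the Snowball stemmer, the hypothetical `STOP_WORDS` set stands in for the WordNet-based word filtering, and the TF-IDF variant (term count divided by document length, multiplied by the natural log of N over document frequency) is an assumption, since the abstract does not state which variant was used.

```python
import math
from collections import Counter

# Hypothetical stop list; the abstract describes filtering words via WordNet instead.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on", "for"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real Snowball/Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Strip the first matching suffix, keeping at least three characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip surrounding punctuation, drop stop words, then stem."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    tokenized = [preprocess(d) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights
```

Under this variant, a term that appears in every document receives a weight of zero. The document-frequency count is the step that requires a pass over the whole collection, which is the kind of aggregation that a MapReduce job (for example, one generated from a Pig script) can spread across machines.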

Parallel Keywords

Hadoop; WordNet; Stemming; TF-IDF; Pig Latin; Jena

