知識管理系統(Knowledge Management System,KMS)除了有好的搜尋功能以外,還需有好的歸納系統。若是使用人工方式分析文件並分類,將會耗損人力資源。 當文件資料欲上傳到知識管理系統時,必須分析出文件關鍵字的重要性,再由分類法將文件歸納。在計算關鍵字權重前,文件需先處理,消除字詞詞性以及刪除對文件無影響的字詞等。我們使用Stemming演算法將字詞詞性還原,以及透過WordNet詞彙查詢,刪除沒有用的字詞。然後再用Jena的RDF API將文件資料寫入RDF中,並以TF-IDF運算出每文件內字詞權重值。當有大量文件只用單台電腦處理,因工作量龐大以致運算時間耗時,因此我們以Hadoop分散式架構將工作量分散於多台電腦處理,並使用Pig資料分析平台,因為Pig善於分析資料運算以及MapReduce最佳化,實驗結果顯示在大量資料情況下,分散式確能減少大量運算時間。
A knowledge management system (KMS) needs good classification as well as good search. Analyzing and classifying documents manually is labor-intensive. Before a document is uploaded to a KMS, the importance of its keywords must be determined so that a classification method can then categorize the document. Before keyword weights are computed, the document is preprocessed: inflected and derived word forms are reduced to a common stem, and words that have no bearing on the document are removed. We used the Snowball stemming algorithm to reduce each word to its stem, and WordNet lookups to filter out uninformative words. The document data were then written to RDF using the Jena RDF API, and the weight of each word in each document was computed with TF-IDF. Processing a large number of documents on a single computer is time-consuming, so we used the Hadoop distributed architecture to spread the workload across multiple computers. We used the Pig data analysis platform because it is well suited to data analysis and to optimizing MapReduce jobs. Our experimental results show that, for large amounts of data, distributed processing substantially reduces computation time.
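The TF-IDF weighting step mentioned above can be illustrated with a minimal single-machine sketch. This is not the implementation used in this work: the stop-word set, the function names `tokenize` and `tf_idf`, and the specific formula variant (term frequency normalized by document length, unsmoothed logarithmic inverse document frequency) are all assumptions made for illustration. Stemming, WordNet filtering, RDF storage, and distribution over Hadoop and Pig are omitted.

```python
import math
from collections import Counter

# Hypothetical stop-word set for illustration only; the work described
# above filters words through WordNet lookups instead.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "in"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per input document.

    Assumed variant:
      tf  = occurrences of the term in the document / terms in the document
      idf = log(number of documents / documents containing the term)
    """
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

Under this variant, a term that occurs in every document receives a weight of zero, which is why words shared by the whole collection contribute nothing to classification.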