With the growth of the Internet, the volume of electronic documents keeps expanding, and automatic document classification has become an important research topic. Its goal is to let a classification algorithm learn the implicit knowledge that people apply when they classify documents, so that it can replace manual classification; this shortens classification time and reduces misjudgments caused by human factors. Semantics-based automatic classification has been practiced for many years, but comparatively few studies use high-level semantic concepts in the pre-processing stage. ConceptNet, a common-sense knowledge base and natural-language-processing toolkit developed by the MIT Media Lab, is one application of such high-level semantics. During classification, the category of an article is decided by the feature words that recur in it, and these feature words are an important lexical resource for semantic analysis. To classify documents effectively, this study therefore builds on ConceptNet and combines it with text mining techniques to filter out uninformative words. ConceptNet's extract_concepts and Assertion methods are used to parse the words of each article and assign them to categories, yielding a concept-based lexicon from which a category weight is computed for each article. A Support Vector Machine (SVM) classifier is then used to evaluate the results and to verify whether the experimental hypothesis is feasible. The experimental results show that classifying articles with ConceptNet can achieve the classification result using fewer words.
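The pipeline described above (filter the vocabulary, map the remaining words to concepts, then compute a category weight per article) can be illustrated with a minimal sketch. It does not call ConceptNet: the two small dictionaries below are hypothetical stand-ins for the concept lexicon, and the weighting rule (each category's share of the concept-bearing words) is an assumption chosen for illustration, not the formula used in the study. The SVM step is omitted; in a full pipeline the weights would serve as its input features.

```python
import re
from collections import Counter

# Hypothetical stand-ins for a concept lexicon (NOT taken from ConceptNet):
# each feature word maps to a concept, and each concept to a category.
TERM_TO_CONCEPT = {
    "goalkeeper": "soccer",
    "penalty": "soccer",
    "marathon": "running",
    "dividend": "stock",
    "bank": "banking",
}
CONCEPT_TO_CATEGORY = {
    "soccer": "sports",
    "running": "sports",
    "stock": "finance",
    "banking": "finance",
}

# A tiny illustrative stopword list used for the filtering step.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "its", "after"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_concept_terms(text):
    """Drop stopwords and keep only words that appear in the lexicon."""
    return [
        t for t in tokenize(text)
        if t not in STOPWORDS and t in TERM_TO_CONCEPT
    ]


def category_weights(text):
    """Return each category's share of the concept-bearing words."""
    terms = extract_concept_terms(text)
    counts = Counter(CONCEPT_TO_CATEGORY[TERM_TO_CONCEPT[t]] for t in terms)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


def classify(text):
    """Pick the category with the largest weight, or None if nothing matched."""
    weights = category_weights(text)
    return max(weights, key=weights.get) if weights else None


doc = ("The goalkeeper saved a penalty, and the bank raised "
       "its dividend after the marathon.")
print(extract_concept_terms(doc))
print(category_weights(doc))
print(classify(doc))
```

Only five of the fourteen tokens in the sample sentence survive the filtering step, which mirrors the idea that a small set of concept-bearing words can carry the classification decision.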