Classifying a document collection in advance helps researchers grasp the topics it covers, but classification methods depend on text mining techniques, and existing text mining techniques were developed with English text as the target, so they do not fully apply to Chinese documents. Tokenization of Chinese differs greatly from English: Chinese word segmentation requires matching against a dictionary, and that dictionary must be maintained. An algorithm that can classify effectively without dictionary lookup would therefore simplify the workflow for clustering or classifying Chinese document collections. The feature extraction used by Locality Sensitive Hashing differs greatly from ordinary tokenization: it extracts features of a fixed length at regular intervals, so no dictionary comparison is needed and Chinese collections can be classified directly. In this study, we built a system for classifying Chinese document collections using LSH as well as the previously used VSM algorithm, and collected document sets on six topics from an online academic database as the system's input. The collected sets were divided into training sets of different sizes, and after feature extraction and computation the system produced classification results for each training size, allowing us to examine whether growing the training set affects classification accuracy. To verify that the system also classifies well compared with past methods, these classification results were evaluated with confusion matrices, and the resulting accuracy served as the basis for comparing the two algorithms' classification ability. The final results show that the LSH algorithm can also achieve good performance in classifying Chinese documents.
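The dictionary-free, fixed-length feature extraction described above can be illustrated with a short sketch. The shingle length `k=3` is an assumption for illustration; the abstract does not state the length actually used:

```python
def shingle(text, k=3):
    """Split text into overlapping fixed-length character shingles.

    The same code handles Chinese and English alike, because it never
    consults a dictionary or looks for word boundaries.
    """
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (the quantity that
    MinHash-style LSH estimates)."""
    return len(a & b) / len(a | b) if a | b else 0.0
```

For example, `shingle("中文斷詞", k=2)` yields `{"中文", "文斷", "斷詞"}` directly, with no segmentation dictionary involved.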
Document classification is an important technique that usually relies on traditional text mining methods, which involve several pre-processing steps such as stop-word removal and tokenization. However, these text mining techniques were developed to process English corpora and are not directly applicable to Chinese corpora. For example, words in English are separated by spaces, which makes tokenizing English corpora straightforward; Chinese text has no spaces between words, which makes tokenizing Chinese corpora a daunting task. To overcome the difficulty of Chinese tokenization, we utilize an algorithm that does not require tokenization, Locality Sensitive Hashing (LSH): it breaks a text file into strings (shingles) of fixed length, a language-neutral method. We classified research papers from different research fields using the LSH and VSM algorithms, each in conjunction with the KNN method, and compared their classification results. First, we collected papers on 6 topics, with 60 Chinese articles per topic. We then randomly chose 10 to 50 articles as the model training data and used the remaining articles as the test data. We calculated the accuracy of both classification methods and found them on par, which provides empirical evidence that the LSH algorithm can classify Chinese papers properly.
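The LSH-plus-KNN pipeline outlined above could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the abstract does not specify the LSH family used, so a MinHash variant is assumed here, and the shingle length, number of hash functions, and `k` are illustrative, not the thesis's actual parameters:

```python
import hashlib
import random
from collections import Counter

def shingles(text, k=3):
    """Fixed-length character shingles; no dictionary or word boundaries needed."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def _stable_hash(s):
    # Deterministic 32-bit hash of a shingle (Python's built-in hash() is
    # randomized across runs, which would make signatures irreproducible).
    return int(hashlib.md5(s.encode("utf-8")).hexdigest()[:8], 16)

def minhash_signature(shingle_set, num_hashes=64, seed=0):
    """MinHash signature: the minimum under each random hash function."""
    rng = random.Random(seed)
    prime = 2**31 - 1
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    return [min((a * _stable_hash(s) + b) % prime for s in shingle_set)
            for a, b in params]

def signature_similarity(sig1, sig2):
    # The fraction of agreeing components estimates the Jaccard similarity.
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

def knn_classify(query_sig, labeled_sigs, k=3):
    """Majority vote among the k most similar labeled signatures."""
    neighbors = sorted(labeled_sigs,
                       key=lambda item: signature_similarity(query_sig, item[0]),
                       reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

In use, each training article is reduced to a signature paired with its topic label, and a test article is assigned the majority label of its nearest signatures; comparing the resulting predictions with the true labels yields the confusion matrix from which accuracy is computed.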