Classifying a document collection in advance helps researchers grasp the topics it covers, but classification methods depend on text mining techniques, and existing text mining techniques were developed with English text as the target, so they do not fully apply to Chinese documents. Tokenization of Chinese differs greatly from English: Chinese word segmentation requires matching against a dictionary, and that dictionary must be maintained. An algorithm that can classify effectively without dictionary lookup would therefore simplify the workflow for clustering or classifying Chinese document collections. The feature extraction used by Locality Sensitive Hashing differs greatly from ordinary tokenization: it extracts features of a fixed length at regular intervals, so no dictionary comparison is needed and Chinese collections can be classified directly. In this study, we built a system for classifying Chinese document collections using LSH as well as the previously used VSM algorithm, and collected document sets on six topics from an online academic database as the system's input. The collected sets were divided into training sets of different sizes, and after feature extraction and computation the system produced classification results for each training size, allowing us to examine whether growing the training set affects classification accuracy. To verify that the system also classifies well compared with past methods, these classification results were evaluated with confusion matrices, and the resulting accuracy served as the basis for comparing the two algorithms' classification ability. The final results show that the LSH algorithm can also achieve good performance in classifying Chinese documents.
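The dictionary-free, fixed-length feature extraction described above can be illustrated with a short sketch. The shingle length `k=3` is an assumption for illustration; the abstract does not state the length actually used:

```python
def shingle(text, k=3):
    """Split text into overlapping fixed-length character shingles.

    The same code handles Chinese and English alike, because it never
    consults a dictionary or looks for word boundaries.
    """
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (the quantity that
    MinHash-style LSH estimates)."""
    return len(a & b) / len(a | b) if a | b else 0.0
```

For example, `shingle("中文斷詞", k=2)` yields `{"中文", "文斷", "斷詞"}` directly, with no segmentation dictionary involved.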
Document classification is an important technique that usually relies on traditional text mining methods, which involve several pre-processing steps such as stop-word removal and tokenization. However, these text mining techniques were developed to process English corpora and are not directly applicable to Chinese corpora. For example, words in English are separated by spaces, which makes tokenizing English corpora straightforward; Chinese text has no spaces between words, which makes tokenizing Chinese corpora a daunting task. To overcome the difficulty of Chinese tokenization, we utilize an algorithm that does not require tokenization, Locality Sensitive Hashing (LSH): it breaks a text file into strings (shingles) of fixed length, a language-neutral method. We classified research papers from different research fields using the LSH and VSM algorithms, each in conjunction with the KNN method, and compared their classification results. First, we collected papers on 6 topics, with 60 Chinese articles per topic. We then randomly chose 10 to 50 articles as the model training data and used the remaining articles as the test data. We calculated the accuracy of both classification methods and found them on par, which provides empirical evidence that the LSH algorithm can classify Chinese papers properly.
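The LSH-plus-KNN pipeline outlined above could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the abstract does not specify the LSH family used, so a MinHash variant is assumed here, and the shingle length, number of hash functions, and `k` are illustrative, not the thesis's actual parameters:

```python
import hashlib
import random
from collections import Counter

def shingles(text, k=3):
    """Fixed-length character shingles; no dictionary or word boundaries needed."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def _stable_hash(s):
    # Deterministic 32-bit hash of a shingle (Python's built-in hash() is
    # randomized across runs, which would make signatures irreproducible).
    return int(hashlib.md5(s.encode("utf-8")).hexdigest()[:8], 16)

def minhash_signature(shingle_set, num_hashes=64, seed=0):
    """MinHash signature: the minimum under each random hash function."""
    rng = random.Random(seed)
    prime = 2**31 - 1
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    return [min((a * _stable_hash(s) + b) % prime for s in shingle_set)
            for a, b in params]

def signature_similarity(sig1, sig2):
    # The fraction of agreeing components estimates the Jaccard similarity.
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

def knn_classify(query_sig, labeled_sigs, k=3):
    """Majority vote among the k most similar labeled signatures."""
    neighbors = sorted(labeled_sigs,
                       key=lambda item: signature_similarity(query_sig, item[0]),
                       reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

In use, each training article is reduced to a signature paired with its topic label, and a test article is assigned the majority label of its nearest signatures; comparing the resulting predictions with the true labels yields the confusion matrix from which accuracy is computed.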