A Study on Applying Locality Sensitive Hashing to Document Classification

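Clustering articles with similar content within a corpus is a common method in topic and content analysis. However, the world has hundreds of languages, and not every language can be classified with the steps long used for English text: tokenization, stop-word removal, word stemming, and term weighting. This study therefore proposes classifying documents with Locality Sensitive Hashing (LSH). LSH has previously been used for similarity matching of web content and is not tied to any particular language, so it is adopted here as the document classification algorithm. LSH takes the shingles extracted from a document as its features. For the shingle length k, both the literature and our own tests indicate that k=9 gives higher accuracy for larger documents such as research papers and journal articles. After shingling, MinHash reduces the shingle sets to a smaller signature matrix, and LSH then classifies the documents by hashing these features. To show that LSH can classify documents with adequate accuracy, the Vector Space Model (VSM), long used for classifying English documents, serves as the baseline, and both pipelines are combined with K-NN and K-means so that they are compared on the same footing. In the K-NN pipeline, training models are built from different numbers of documents, the documents to be classified are then labeled, and the results of the two pipelines are examined with confusion matrices to assess model performance. In the K-means pipeline, the clustering results are evaluated by testing the consistency between clusters and document content. Documents on six topics (Cloud Computing, Information Analysis, Enterprise Resource Planning, Image Processing, Music, and Criminal Justice) were retrieved from the IEEE website and used to train and classify with both the LSH and VSM pipelines. The results show that when each category supplies 50 or more documents for model training, the accuracy on the documents to be classified approaches 90% in every case, demonstrating that LSH can be applied to document classification.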
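The shingling, MinHash, and LSH steps described above can be sketched as follows. This is a minimal illustration and not the thesis's implementation: only the shingle length k=9 comes from the abstract. The use of character-level shingles, the seeded `blake2b` hash family, the signature length `num_hashes=100`, and the banding parameters `bands=20`, `rows=5` are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=9):
    """Return the set of character k-grams (shingles) of a document."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(shingle, seed):
    # Assumed hash family: blake2b over "seed:shingle", one function per seed.
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=100):
    """Compress a shingle set into a fixed-length MinHash signature."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signatures, bands=20, rows=5):
    """Hash each band of each signature; documents sharing a band share a bucket."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    return buckets


def candidate_pairs(buckets):
    """Collect every pair of documents that collide in at least one bucket."""
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

Because the shingles are taken over raw characters, nothing in this sketch depends on tokenization or stemming, which is the language-neutral property the abstract relies on. A classifier such as K-NN could then vote over the candidate pairs using `estimated_jaccard` as the similarity measure.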

Keywords

Vector Space Model; Locality Sensitive Hashing; Document Classification

Parallel Abstract

Most contemporary text clustering and classification methods are based on text mining techniques that characterize documents using features derived from the frequency of term occurrences and the distribution of these terms across the document corpus. In these methods, a document is represented by a vector of term frequencies weighted by inverse document frequency (TF-IDF), and the similarity between two documents is measured by the cosine of the angle between their vectors. However, TF-IDF-based methods were devised to process English. Other languages may take a very different form, which can render term-based methods inapplicable. We propose a language-neutral classification and clustering method that uses Locality Sensitive Hashing (LSH), which does not rely on the lexical terms in a document. We compare the accuracy of the classical Vector Space Model (VSM) with that of the LSH method and find their performance to be on par. Our proposed method is thus a language-neutral text classification and clustering scheme that is suitable for very large document corpora.
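For comparison, the VSM baseline's TF-IDF weighting and cosine similarity can be sketched as below. This is an illustrative sketch only: the abstract does not state which TF-IDF variant was used, so the length-normalized term frequency and the plain `log(N / df)` inverse document frequency are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors from a list of tokenized documents.

    docs: list of token lists. Returns a list of {term: weight} dicts.
    Assumed weighting: (count / doc length) * log(N / document frequency).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Note that this pipeline needs the input already split into terms, which is exactly the language-dependent preprocessing that the LSH approach avoids.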

Parallel Keywords

Document Classification; Locality Sensitive Hashing; Vector Space Model
