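A Method for Applying Relevance Feedback Information to Document Classification Using Association Rules and Latent Semantic Analysis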

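In information retrieval, the vector space model (VSM) is a common way to represent documents. Earlier relevance feedback research on the VSM extracted terms from the list of relevant documents returned by the system, as judged by the user, and used those terms as feedback features. That approach considers only term frequency, whereas latent semantic analysis (LSA) can uncover hidden relationships between terms and documents. This study develops a feature extraction method with two parts. The first part, the association rule feature extractor, mines association rules from the top 20 relevant and non-relevant documents in the user's feedback, handling the two groups separately. Each document is treated as a transaction whose items are its terms, and the terms that exceed the minimum support and minimum confidence thresholds are kept as document features because they are strongly associated. The second part, the feature combiner, adds to those strongly associated terms the terms that occur only in the relevant or only in the non-relevant documents and that occur infrequently; such terms can serve as keywords for a specific class. Once the term features are applied to the documents, terms are weighted by TF-IDF, singular value decomposition (SVD) is applied to the term-document matrix, an appropriate number of dimensions is selected for reduction, and the term-document matrix is rebuilt to reveal latent semantic relationships between terms and documents. Experimental results show that the proposed feature extraction method improves classification performance over a baseline that uses TF-IDF document features without feature selection; the feature combiner together with LSA gives the best classification results. The study shows that applying association rules and LSA to relevance feedback information not only reduces storage space but also improves document classification accuracy.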
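The association rule step can be illustrated with a minimal sketch. It is not the study's implementation: it assumes rules between pairs of terms only (a full miner such as Apriori would also consider larger itemsets), and the function name `strong_association_terms`, the threshold values, and the toy documents are all hypothetical.

```python
from itertools import combinations


def strong_association_terms(docs, min_support=0.5, min_confidence=0.8):
    """Return terms that take part in at least one strong pairwise rule.

    docs: list of documents, each a list of terms (one transaction each).
    Simplifying assumption: only rules of the form x -> y between two
    single terms are examined.
    """
    transactions = [set(d) for d in docs]
    n = len(transactions)

    # Count how many transactions contain each term and each term pair.
    item_count = {}
    pair_count = {}
    for t in transactions:
        for term in t:
            item_count[term] = item_count.get(term, 0) + 1
        for a, b in combinations(sorted(t), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1

    selected = set()
    for (a, b), count in pair_count.items():
        if count / n < min_support:
            continue  # the pair itself is not frequent enough
        # Check both rule directions: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            if count / item_count[x] >= min_confidence:
                selected.update((x, y))
    return selected


# Hypothetical feedback documents, already tokenized.
relevant_docs = [
    ["svd", "matrix", "term"],
    ["svd", "matrix"],
    ["svd", "query"],
    ["matrix", "svd", "term"],
]
print(sorted(strong_association_terms(relevant_docs)))
# -> ['matrix', 'svd', 'term']
```

In the setting the abstract describes, a function like this would be run once on the relevant documents and once on the non-relevant documents, and the two resulting term sets would feed the feature combiner.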

Keywords

Relevance feedback; latent semantic analysis; association rules

Parallel Abstract

In information retrieval, the vector space model (VSM) is a common representation method. Within this model, relevance feedback has mainly been applied by aggregating term frequencies in the feedback documents. To uncover and apply the hidden relationships between terms and documents, this study develops a feature selection method with two parts. The first part is the association rule feature extractor. It takes the top 20 relevant and non-relevant documents from user feedback and extracts association rules from each group, treating documents as transactions and terms as items. Terms that satisfy both a user-specified minimum support and a user-specified minimum confidence are extracted and used as document features. The second part is the feature combiner. In addition to the association rule terms, it extracts terms that occur only in relevant or only in non-relevant documents and that appear infrequently; these keywords can represent a specific class. After the features are applied to the documents, terms are weighted by TF-IDF. Singular value decomposition (SVD) is applied to the term-document matrix, an appropriate number of dimensions is chosen for reduction, and the term-document matrix is rebuilt. The rebuilt matrix exposes latent semantic relationships between terms and documents. Experimental results show that the proposed feature selection methods improve classification performance compared with using TF-IDF document features without feature selection. The best document classification result is obtained by the feature combiner plus latent semantic analysis (LSA). This study demonstrates that applying association rules and LSA to relevance feedback information in document classification can not only reduce storage space but also improve classification accuracy.
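The weighting and dimension-reduction steps can be sketched with NumPy. This is an illustration under stated assumptions, not the study's code: the abstract does not say which TF-IDF variant was used, so the sketch assumes raw term frequency multiplied by log(N / df), and the helper names `tfidf` and `lsa_reconstruct`, the toy counts, and the choice k = 2 are hypothetical.

```python
import numpy as np


def tfidf(counts):
    """Weight a terms x documents matrix of raw counts.

    Assumed variant: tf * log(N / df), where N is the number of
    documents and df the number of documents containing the term.
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)      # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))   # guard against df == 0
    return counts * idf[:, None]


def lsa_reconstruct(matrix, k):
    """Rebuild a term-document matrix from its k largest singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Scale the first k left singular vectors by their singular values,
    # then project back with the first k right singular vectors.
    return (u[:, :k] * s[:k]) @ vt[:k, :]


# Hypothetical counts: 4 selected terms (rows) in 4 documents (columns).
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
])
weighted = tfidf(counts)
rebuilt = lsa_reconstruct(weighted, k=2)
print(rebuilt.shape)  # same shape as the weighted matrix
```

The rebuilt matrix has the same shape as the input but rank at most k. The storage saving mentioned in the abstract is consistent with keeping only the k-column factors instead of the full matrix, though the abstract does not spell out how it was measured.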
