Clustering-Based Method for Positive and Unlabeled Text Categorization Enhanced by Improved TFIDF

PU learning arises frequently in Web page classification and text retrieval applications, because users are often interested in information on one particular topic. Collecting reliable negative examples is a key step in PU (Positive and Unlabeled) text classification, which addresses the machine learning problem in which the training set contains no labeled negative examples or negative examples are difficult to collect. This paper therefore presents a novel clustering-based method for collecting reliable negative examples (C-CRNE). Unlike traditional methods, we remove as many probable positive examples as possible from the unlabeled set, so that more reliable negative examples are found. For building the classifier, we present a novel TFIDF-improved feature weighting approach, which reflects the importance of a term in the positive and negative training examples respectively, to represent documents in the Vector Space Model. We also build a weighted voting classifier (WVC) by iteratively applying the SVM algorithm, and implement the OCS (One-class SVM), PEBL (Positive Example Based Learning) and 1-DNFII (Constrained 1-DNF) methods for comparison. Experimental results on three real-world datasets (Reuters Corpus Volume 1 (RCV1), Reuters-21578 and 20 Newsgroups) show that the proposed C-CRNE extracts more reliable negative examples than the baseline algorithms, with very low error rates. Our classifier also outperforms other state-of-the-art classification methods on traditional performance metrics.
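The abstract does not specify the clustering algorithm or the removal rule, so the following is only a minimal sketch of the general idea of clustering-based reliable negative extraction, not the paper's C-CRNE procedure. It assumes plain smoothed TF-IDF (not the improved weighting described above), cosine k-means with deterministic farthest-first seeding, and a simple rule: any unlabeled document that shares a cluster with a positive document is treated as a probable positive and discarded. All function names are hypothetical.

```python
import math
from collections import Counter


def normalise(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    """Unit-length sparse TF-IDF vectors using standard smoothed IDF.

    Assumption: this is ordinary TF-IDF, not the paper's improved variant.
    """
    n = len(docs)
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        vectors.append(normalise(vec))
    return vectors


def cosine(a, b):
    """Dot product, which equals cosine similarity for unit-length vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(vectors):
    """Unit-length sum of the member vectors."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return normalise(total)


def farthest_first_seeds(vectors, k):
    """Deterministic seeding: start at index 0, then add the least similar document."""
    seeds = [0]
    while len(seeds) < k:
        candidates = [i for i in range(len(vectors)) if i not in seeds]
        seeds.append(min(candidates,
                         key=lambda i: max(cosine(vectors[i], vectors[s]) for s in seeds)))
    return seeds


def kmeans(vectors, k, iterations=10):
    """Cosine k-means; returns one cluster label per vector."""
    centroids = [vectors[i] for i in farthest_first_seeds(vectors, k)]
    labels = []
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = centroid(members)
    return labels


def reliable_negatives(positive_docs, unlabeled_docs, k=2):
    """Keep only unlabeled documents whose cluster contains no positive document."""
    docs = list(positive_docs) + list(unlabeled_docs)
    labels = kmeans(tfidf_vectors(docs), k)
    n_pos = len(positive_docs)
    positive_clusters = set(labels[:n_pos])
    return [doc for doc, lab in zip(unlabeled_docs, labels[n_pos:])
            if lab not in positive_clusters]


# Toy data invented for illustration only.
positive = ["stock market shares rise", "market shares fall stock"]
unlabeled = ["stock market trading shares",
             "football match goal score",
             "goal score football league"]
negatives = reliable_negatives(positive, unlabeled, k=2)
```

On this toy input the finance-like unlabeled document lands in the same cluster as the positives and is dropped, while the two football documents are kept as reliable negatives. In practice the number of clusters and the strictness of the removal rule would need tuning, and the extracted negatives would then be passed to the classifier-building stage.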

Keywords

text classification; reliable negative examples; clustering; C-CRNE; WVC

