A Semi-Supervised Feature Selection Algorithm Based on Soft-Label Information

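Advances in technology and the spread of the Internet have brought us into an era of information explosion, and making good use of these resources has become an important problem. Before such large volumes of data can be used, they must first be organized into categories, but because the volume is so large, classifying the data manually would take a considerable amount of time. To solve this problem, we must rely on computers and classification algorithms to classify the data automatically. Before a classification algorithm can be applied, features must be selected from the data so that the algorithm can determine the class of each item from those features. The quality of the selected features therefore has a large effect on the classification result, and finding good features requires a suitable feature selection algorithm. In this thesis, we propose a semi-supervised feature selection algorithm called the "soft-label semi-supervised feature selection algorithm". We incorporate soft-label information into feature selection: the soft labels let us obtain the class information implicit in the unlabeled data, and the soft-label mutual information formula proposed in this thesis integrates that information with the label information in the labeled data. The algorithm uses the integrated information to find the best feature subset, which is then used to run the classification algorithm so that it achieves its best results. In our experiments, we apply the soft-label semi-supervised feature selection algorithm to several datasets of different types and compare it with other feature selection algorithms. The experimental results show that the proposed algorithm performs well and outperforms the other algorithms in the comparison.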

Keywords

machine learning; feature selection

Parallel Abstract

Feature selection is an important task in machine learning. In practice, the quality of the features affects the results of machine learning algorithms. Supervised feature selection requires sufficient labeled data. However, labeling is a time-consuming process that is typically done manually, whereas unlabeled data is relatively easy to collect. Although unsupervised feature selection does not require labeled data, the additional prior information should be used when labeled data is available. Therefore, this paper proposes a semi-supervised feature selection algorithm that considers both labeled and unlabeled data, called the soft-label semi-supervised feature selection algorithm. The algorithm applies a semi-supervised logistic regression algorithm to obtain soft-label information for the unlabeled data, and applies the proposed soft-label mutual information formula to combine label information and soft-label information in order to find the best feature subset. In the experimental section, we conduct experiments on several datasets, and the experimental results indicate that the proposed algorithm can effectively improve classification performance.
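The abstract does not give the soft-label mutual information formula itself, so the following is only a minimal sketch of one plausible reading, not the thesis's actual method. It assumes discrete features, treats each labeled sample as a one-hot class distribution and each unlabeled sample as a soft label (class probabilities that some probabilistic classifier would supply), accumulates fractional counts into a joint distribution, and ranks features by the resulting mutual information. The function names `soft_label_mutual_information` and `rank_features`, and the toy data, are hypothetical.

```python
import math
from collections import defaultdict


def soft_label_mutual_information(values, label_probs):
    """Estimate I(X; Y) for one discrete feature from fractional class counts.

    values: the feature value of each sample.
    label_probs: one class-probability list per sample. A labeled sample
    uses a one-hot list; an unlabeled sample uses its soft label.
    (Assumed formulation; the thesis's own formula is not shown here.)
    """
    joint = defaultdict(float)  # accumulated weight of (feature value, class)
    total = 0.0
    for x, probs in zip(values, label_probs):
        for y, p in enumerate(probs):
            joint[(x, y)] += p
            total += p

    # Marginals of the normalized joint distribution.
    px = defaultdict(float)
    py = defaultdict(float)
    for (x, y), w in joint.items():
        px[x] += w / total
        py[y] += w / total

    mi = 0.0
    for (x, y), w in joint.items():
        pxy = w / total
        if pxy > 0.0:
            mi += pxy * math.log(pxy / (px[x] * py[y]))
    return mi


def rank_features(rows, label_probs):
    """Return (feature indices by decreasing score, per-feature scores)."""
    n_features = len(rows[0])
    scores = [
        soft_label_mutual_information([row[j] for row in rows], label_probs)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy data: feature 0 follows the class closely, feature 1 is mostly noise.
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 1), (1, 0)]
label_probs = [
    [1.0, 0.0],  # labeled, class 0
    [1.0, 0.0],  # labeled, class 0
    [0.0, 1.0],  # labeled, class 1
    [0.0, 1.0],  # labeled, class 1
    [0.9, 0.1],  # unlabeled, hypothetical soft label
    [0.2, 0.8],  # unlabeled, hypothetical soft label
]
order, scores = rank_features(rows, label_probs)
```

In this sketch a feature subset would be formed by keeping the top-ranked indices in `order`. How the thesis weights labeled against unlabeled samples, and whether it selects features jointly rather than one at a time, cannot be determined from the abstract.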

Parallel Keywords

machine learning; feature selection

