  • Dissertation

在有限標記下之高維度數據特徵快速強韌選取方法之研究

Robust and Fast Feature Selection Methods for High-dimensional Data with Limited Labels

Advisor: 陳榮靜

Abstract


Feature selection aims to identify the most representative attributes of a dataset and to reduce the interference of redundant features, thereby improving the performance of subsequent tasks such as classification and clustering built on that data; it is one of the key techniques in pattern recognition. In recent years, with the rapid development of multimedia and computing technologies, high-dimensional data have grown explosively. Because collecting label information for such data is extremely labor-intensive, effectively labeled data are exceptionally scarce, which poses a great challenge to existing feature selection methods. How to design effective feature selection models under limited label information has therefore become increasingly important.

To meet the needs of applications at different data scales under limited label information, this dissertation combines classic machine learning methods such as manifold learning, multi-task learning, and cluster analysis, and proposes three general feature selection models, from the unsupervised and semi-supervised perspectives respectively, to select feature subsets efficiently. First, for high-dimensional data of modest scale, a semi-supervised feature selection model based on ℓ1-norm graph learning is proposed: by constraining the Laplacian matrix with the ℓ1-norm, the model removes redundant connections between data nodes in the similarity matrix and thereby suppresses external noise. Second, for scenarios with large data volumes and scarce effective labels, a semi-supervised feature selection model based on multi-task learning is proposed: exploiting the model-level correlation among multiple similar learning tasks, it uses a low-rank constraint together with feature selection to transfer relevant information across tasks and to screen the optimal feature subset, effectively improving multi-task feature selection. Finally, for massive unlabeled high-dimensional data, the selection of the optimal feature subspace is constrained by the data's cluster structure and an adaptive learning mechanism is introduced to strengthen the model's adaptability to the data, yielding a fast and robust feature selection model.

Experimental results show that, in limited-label scenarios, all three proposed models achieve efficient feature selection with high accuracy compared with other classic algorithms, and exhibit good generality.
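To make the first model concrete, a minimal illustrative sketch of a graph-regularized semi-supervised feature selection objective with an ℓ1-sparsified similarity graph is given below; the symbols (labeled data block X_l with labels Y_l, projection matrix W, learned similarity matrix S, and trade-off weights α, β, γ) are generic notation assumed for illustration, not the dissertation's own formulation.

\[
\min_{W,\,S}\;
\lVert X_l^{\top} W - Y_l \rVert_F^2
+ \alpha\,\operatorname{Tr}\bigl( W^{\top} X L_S X^{\top} W \bigr)
+ \beta\,\lVert S \rVert_1
+ \gamma\,\lVert W \rVert_{2,1},
\qquad L_S = D_S - S,
\]

where the ℓ1 penalty on S prunes redundant connections between nodes (the stated source of robustness to noise), and the ℓ2,1-norm on W induces row sparsity, i.e., the actual feature selection.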

Abstract (English)


Feature selection is one of the most representative techniques in the area of pattern recognition. It aims to filter data attributes, remove redundant features, and improve the performance of follow-up classification or clustering tasks. In recent years, with increasingly powerful multimedia and computer technologies, high-dimensional data have been generated rapidly. As it is extremely expensive to collect sufficient labels for such a large amount of data, a growing amount of data comes with few labels, which poses a great challenge to existing feature selection methods. Therefore, how to design reasonable and effective feature selection models becomes more and more important for data with limited label information.

Under such circumstances, to meet the requirements of data of different scales with limited labels, this dissertation designs several semi-supervised and unsupervised feature selection algorithms and combines them with classic techniques such as manifold learning, multi-task learning, and clustering. First, to handle small-scale high-dimensional data, we propose a semi-supervised feature selection model in which the construction of the Laplacian matrix is constrained by the ℓ1-norm; by removing redundant connections among nodes, the model is robust to outliers. Second, a semi-supervised feature selection model based on multi-task learning is proposed for large-scale data. This model does not depend on graph construction and explores the information shared among tasks through a low-rank regularization; by transferring the relevant information among tasks, it properly preserves the most important features. Finally, for large-scale high-dimensional data without labels, we propose a flexible objective function that adaptively performs feature learning together with clustering and is suitable for data with different kinds of distributions.

Experimental results show that, in the limited-label scenario, the three proposed models can efficiently select the most representative features with higher accuracy than other classic algorithms. Moreover, the proposed models are general and can be extended to other applications.
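For the second and third models, two equally generic sketches follow, assuming T related tasks with labeled blocks (X_{t,l}, Y_{t,l}) and per-task projection matrices stacked as W = [W_1, …, W_T], and, for the unsupervised case, n samples, a learned similarity matrix S, and c target clusters; none of these symbols or trade-off weights are the dissertation's own notation, and both objectives sketch the general techniques rather than the proposed models. A common low-rank multi-task feature selection form is

\[
\min_{W}\;
\sum_{t=1}^{T} \lVert X_{t,l}^{\top} W_t - Y_{t,l} \rVert_F^2
+ \lambda_1 \lVert W \rVert_{*}
+ \lambda_2 \lVert W \rVert_{2,1},
\]

where the nuclear norm couples the tasks through a shared low-rank structure (the vehicle for transferring relevant information) and the ℓ2,1-norm selects features shared across tasks. A common form for unsupervised feature selection with an adaptively learned, cluster-structured graph is

\[
\min_{W,\,S}\;
\sum_{i,j} s_{ij}\,\lVert W^{\top} x_i - W^{\top} x_j \rVert_2^2
+ \gamma \lVert S \rVert_F^2
+ \lambda \lVert W \rVert_{2,1}
\quad \text{s.t.}\;
W^{\top} W = I,\;
S\mathbf{1} = \mathbf{1},\;
S \ge 0,\;
\operatorname{rank}(L_S) = n - c,
\]

where the rank constraint on the graph Laplacian L_S forces the learned graph over the n samples into exactly c connected components, so the cluster structure directly constrains the selected feature subspace.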

