利用高維索引技術解決大規模分類問題

近幾年來，線性分類器在大規模資料分類問題上有良好的發展與表現。然而，實際上仍存在著兩個重要的議題尚未被解決。第一個問題是現實生活中所收集到的資料有很大部分是沒有辦法被線性分類器所解釋，如果線性分類器將這些雜訊納入考慮的話，將會影響線性分類器的表現。第二個問題，對於一般使用者而言，記憶體的容量遠小於硬碟容量，所以很容易發生資料無法放進記憶體的情形，這時候線性分類器便需要不斷地在硬碟和記憶體之間不斷地讀取以及寫入，這是個非常花時間的過程。於是本篇論文提出一個索引架構套用在線性分類器的最佳化過程以同時解決這兩個議題。每筆資料都用高維度的特徵向量空間表示，我們將這些資料透過概似高維索引技術建立索引值，讓我們可以很有效率地取出具有用的資料，忽略對線性分類器有害的資料；也因為這樣，我們可以只將這些有用的資料讀進記憶體，進而同時解決兩個議題，我們做了數個實驗比較我們的架構和其他目前最先進的方法，而結果顯示我們的架構有較佳的表現。

關鍵字

高維索引； Ramp loss ；支持向量機；機器學習

並列摘要

Recently, linear classifier has been shown to be able to handle large-scale classification problem well. However, there are two main issues accompanied by large-scale classification problem. First, there may exist many unexplainable or noise instances in the datasets which will hurt the linear classifier’s performance. Second, when data is too large to load in memory, the linear classifier will spend much time on reading/writing between memory and disk. In this thesis, we propose an indexing optimization framework to solve these two issues simultaneously. We apply approximate indexing technique on high dimensional features space to help us efficiently retrieve the informative instances rather than outliers, and so that we can only load those instances into memory. We conduct several experiments to compare our framework with the state-of-the art methods, and the results show that we have a better performance.

並列關鍵字

high dimensional indexing ； ramp loss ； support vector machine ； machine learning

參考文獻

[1] C .-J. Hsieh, K.-W. Chang, C.-J. Lin , S. S. Keerthi, and S. Sundararajan , “A dual coordinate descent method for large-scale linear SVM,” in ICML , 2008.

[2] S. Shalev-Shwartz, Y. Singer , and N. Srebro, “Pegasos : primal estimated sub-gradient solver for SVM,” in ICML , 2007.

[5] R. Collobert, F. Sinz, J. Weston, and L. Bottou, “Trading Convexity for Scalability,” ICML, 2006

[6] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517, 1975.

[8] P. Jain, S. Vijayanarasimhan, and K. Grauman. Hashing Hyper- plane Queries to Near Points with Applications to Large-Scale Active Learning. In NIPS, 2010.

國際替代計量

利用高維索引技術解決大規模分類問題

全文下載

主題瀏覽