
Large Scale Mobile Visual Recognition

Advisor: 徐宏民

Abstract


Large-scale mobile visual recognition refers to semantic analysis of image/video content on mobile devices, covering scenes, objects, and even situational contexts. With the proliferation of mobile devices and the popularity of media sharing services, it has been attracting growing attention. Although years of development have produced many large-scale visual recognition methods that are effective in real-world settings, most of them demand massive computational resources and can therefore only run on high-end servers. Mobile devices, constrained by their physical form, offer only limited computing power and storage, and cannot employ the existing large-scale recognition techniques. In this thesis, we propose two system designs that realize large-scale visual recognition on mobile devices under different usage scenarios.

When the system cannot access a wireless network, we propose a new linear dimension reduction method, kernel preserving projection (KPP). Unlike conventional dimension reduction methods, KPP is designed to directly optimize the separability of the reduced features, and thus provides better recognition ability at low dimensions. KPP also accounts for the resource constraints of mobile devices: it performs the reduction with a sparse linear projection, which cuts both the storage and the computation required at run time.

When the system has network connectivity, we propose a client-server architecture to scale up the number of semantic concepts the system can recognize. The greatest challenge of this architecture is guaranteeing the system's response time under limited network bandwidth. To this end, we introduce the concept of low-bandwidth visual recognition and experiment with various transmission strategies to optimize the recognition bitrate. Our results indicate that thumbnail images preserve multiple visual features at once and are thus an efficient transmission strategy. We further propose combining the thumbnail image with a feature signature based on local visual features to improve recognition under low bandwidth.

We also conduct systematic experiments on the properties of deep learning for visual recognition. Deep learning is currently the most promising approach to visual recognition, yet many practical difficulties remain unsolved, such as meta-parameter selection and the heavy demand for training data. We propose to adopt transfer learning to enable deep learning on sparse data, so that deep learning can be applied to a broader range of visual recognition problems. Our experiments also provide clues for meta-parameter selection, facilitating the practical use of deep learning. These results form a solid basis for investigating how deep learning will affect large-scale mobile visual recognition.

Parallel Abstract


Scalable mobile visual classification – classifying images/videos in a large semantic space on mobile devices in real time – is an emerging problem, given the paradigm shift towards mobile platforms and the explosive growth of visual data. Despite the advances in detecting thousands of concepts on servers, such scalability is handicapped on mobile devices due to their severe resource constraints. However, certain emerging applications require scalable visual classification with prompt response, whether for detecting local contexts (e.g., Google Glass) or for ensuring user satisfaction. In this thesis, we point out the overlooked challenges of scalable mobile visual classification and provide feasible solutions under different resource constraints.

For systems that operate without a mobile network, we propose an unsupervised linear dimension reduction algorithm, kernel preserving projection (KPP), to reduce the size of the classifiers and the computational cost. We further introduce sparsity to the projection matrix to ensure its compliance with mobile computing (with merely 12% non-zero entries). Experimental results on three public datasets confirm that the proposed method outperforms existing dimension reduction methods. Moreover, we can greatly reduce storage consumption and efficiently compute the classification results on mobile devices.

When the mobile network is available under limited bandwidth, we propose to adopt a client-server framework to ensure scalability. The main challenge of this framework is the recognition bitrate: the amount of data that must be transmitted to reach the same recognition performance. We exploit and compare various strategies, such as compact features, feature compression, feature signatures by hashing, and image scaling, to enable low-bitrate mobile visual recognition. We argue that the thumbnail image is a competitive candidate for low-bitrate visual recognition because it carries multiple features at once, and multi-feature fusion grows more important as the semantic space expands. We further suggest a new strategy that combines a single (local) feature signature with the thumbnail image, achieving a significant bitrate reduction from (on average) 102,570 to 4,661 bytes with merely (overall) 10% performance degradation.

We also investigate the properties of deep convolutional networks (DCNs), which appear to be a promising direction for large-scale visual recognition. These studies serve as the basis of further investigation on how DCNs will affect visual recognition on mobile devices. Our preliminary studies reveal the correlation between the meta-parameters and the performance of a DCN, given the properties of the target problem and data. These results lead to a heuristic for meta-parameter selection in future DCN research that does not rely on time-consuming meta-parameter search. We also point out that the lack of training samples limits the usage of DCNs on a wide range of computer vision problems where obtaining training data is difficult. To solve this problem, we propose to adopt transfer learning to learn a better representation of natural images using large image corpora with sufficient labeled samples and diversity. We show that, by means of transfer learning from images to videos, we can learn a frame-based recognizer with only 4k videos, far fewer than the million-scale image datasets required by previous DCN work.
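To make the on-device side of the KPP design concrete, the following is a minimal Python sketch of classification with a sparse linear projection. The dimensions, the random matrix standing in for a learned KPP projection, and the random classifier weights are illustrative placeholders; only the roughly 12% density of non-zero entries comes from the abstract, and the sketch does not reproduce the thesis's training procedure.

```python
import numpy as np
from scipy import sparse

# Hypothetical sizes: d-dim raw feature -> k-dim projected feature.
d, k, n_classes = 4096, 128, 1000
rng = np.random.default_rng(0)

# Stand-in for a learned KPP projection: ~12% non-zero entries,
# stored in CSR form to cut storage and multiply cost.
W = sparse.random(d, k, density=0.12, format="csr", random_state=0)

# Stand-in linear classifiers trained in the projected space.
C = rng.standard_normal((k, n_classes)).astype(np.float32)

def classify(x):
    """Project a raw feature with the sparse matrix, then score all classes."""
    z = W.T @ x          # cost scales with nnz(W), not d * k
    return int((z @ C).argmax())

x = rng.standard_normal(d).astype(np.float32)
print(classify(x))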
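The low-bitrate strategy of uploading a thumbnail instead of raw features can be sketched on the client side as follows. This assumes Pillow is available; the maximum side length and JPEG quality are hypothetical tuning knobs, not values reported in the thesis.

```python
from io import BytesIO
from PIL import Image

def thumbnail_payload(path, max_side=160, quality=40):
    """Downscale and JPEG-encode an image for low-bitrate upload."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))       # in-place, keeps aspect ratio
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

payload = thumbnail_payload("query.jpg")      # hypothetical input file
print(f"upload size: {len(payload)} bytes")   # typically a few KB
```

Because the server can re-extract any number of global and local features from the decoded thumbnail, one small JPEG can stand in for several transmitted feature vectors, which is consistent with the bitrate reduction the abstract reports for the combined thumbnail-plus-signature strategy.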
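The image-to-video transfer learning idea can be illustrated with a short fine-tuning sketch. It uses PyTorch and torchvision's ImageNet-pretrained AlexNet purely as a stand-in for a DCN trained on a large labeled image corpus; the target class count and the choice to freeze the convolutional layers are assumptions for illustration, not the thesis's exact recipe.

```python
import torch
import torch.nn as nn
from torchvision import models

# Source domain: a DCN pre-trained on a large labeled image corpus.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Freeze the transferred representation layers.
for p in net.features.parameters():
    p.requires_grad = False

# Replace the last layer for the small target task, e.g. frame-based
# video concept recognition with few labeled videos.
num_target_classes = 101   # hypothetical target label count
net.classifier[6] = nn.Linear(4096, num_target_classes)

optimizer = torch.optim.SGD(
    (p for p in net.parameters() if p.requires_grad), lr=1e-3, momentum=0.9
)
criterion = nn.CrossEntropyLoss()

def train_step(frames, labels):
    """One fine-tuning step on a batch of video frames."""
    optimizer.zero_grad()
    loss = criterion(net(frames), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only the replaced classifier head (and any unfrozen layers) is trained on the target frames, which is how a representation learned from millions of images can be reused when only a few thousand labeled videos are available.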

