利用三維模型提升影像搜尋和分類效能

Augmenting Image Retrieval and Classification Using 3D Models

Advisor: 徐宏民

Abstract


Image classification and retrieval are key techniques for analyzing large-scale image and video data. Although a considerable body of work has studied these problems, image retrieval under large viewpoint changes, fine-grained image classification, and video summarization on wearable cameras remain highly challenging. In this dissertation, we effectively address four important recent topics in computer vision and multimedia analysis.

The first topic is fine-grained image classification. Unlike conventional image classification, which relies mainly on the presence or absence of parts, fine-grained classification distinguishes objects by the subtle differences between their parts. For this problem, we propose an approach that jointly optimizes 3D model fitting and fine-grained classification. Detailed 3D models encode more information than traditional 2D methods and can therefore improve fine-grained classification performance. Meanwhile, the predicted class label can also improve 3D model fitting accuracy, e.g., by providing a more accurate initial shape model. We evaluate the proposed approach on a newly built fine-grained car dataset, demonstrating that it outperforms other state-of-the-art methods. We further conduct a series of analyses to explore the relationship between fine-grained classification and 3D models.

The second topic is an attribute-based vehicle retrieval system. We rectify car images taken from different viewpoints into a common reference view and extract high-level semantic attributes, such as the styles of the grille, lamps, and wheels, then use these attributes to search for vehicles in the database. In the experiments, we compare different 3D model fitting methods and verify that the rectified attributes improve retrieval performance. The results show that our approach significantly outperforms previous content-based image retrieval methods.

The third topic is a sketch-based multi-view image retrieval system. We automatically reconstruct the user's 2D sketches into an approximate 3D sketch model, and then generate multi-view sketches as expanded sub-queries to improve retrieval performance. To learn the weights of the synthesized sketches, we propose a new multi-query feature representation that models the similarity between the query sketches and database images, and cast the weighting as a convex optimization problem. Experiments on a public image dataset show that our method outperforms the current state of the art.

The last topic is video summarization on wearable cameras. We propose a joint optimization approach that efficiently generates informative video summaries without requiring the user to specify the video's class in advance. Our method simultaneously detects video highlights and predicts the class label, generating summaries on the fly. After observing enough video, it accurately infers the target class label at an early stage and then uses only the class-specific model to detect highlights, saving computation. On a public dataset, our method performs comparably to state-of-the-art methods that require the class label to be given manually; the early class prediction significantly reduces computational cost while retaining the original performance, demonstrating the effectiveness of our approach.

Parallel Abstract


Image classification and retrieval are key techniques for managing exponentially growing image and video collections, e.g., consumer photos, surveillance videos, and egocentric videos. It is still very challenging to retrieve objects under large pose transformations, classify objects with subtle differences, and extract a brief summary of unconstrained egocentric videos. In this dissertation, we aim to leverage 3D representations to improve image retrieval and classification performance, and to generate compact and informative highlights for egocentric videos. We investigate four important and emerging topics in the computer vision and multimedia community.

The first is fine-grained classification. Different from conventional basic-level classification, which relies on the presence or absence of parts, fine-grained classification (i.e., subordinate-level categorization) finds salient distinctions between part/landmark-level characteristics of objects. We develop an approach that jointly optimizes 3D model fitting and fine-grained classification. Detailed 3D object representations encode more information (e.g., precise part locations and viewpoint) than traditional 2D-based approaches and can therefore improve fine-grained classification performance. Meanwhile, the predicted class label can also improve 3D model fitting accuracy, e.g., by providing more detailed class-specific shape models. We evaluate our method on a new fine-grained 3D car dataset (FG3DCar), demonstrating that it outperforms several state-of-the-art approaches. Furthermore, we conduct a series of analyses to explore the dependence between fine-grained classification performance and 3D models.

The second is attribute-based car retrieval in unconstrained environments. Using the fitted 3D models, we rectify car images from disparate views into the same reference view and search for cars based on informative attributes (i.e., parts) such as the grille, lamps, and wheels. In the experiments, we compare different 3D model fitting approaches and verify the significant impact of part rectification on car retrieval performance. The experimental results demonstrate that our approach significantly outperforms previous content-based image retrieval (CBIR) methods.

The third is sketch-based multi-view image retrieval. We automatically convert two (guided) 2D sketches into an approximate 3D sketch model, and then generate multi-view sketches as expanded sub-queries to improve retrieval performance. To learn the weights among synthesized views (sub-queries), we present a new multi-query feature to model the similarity between sub-queries and dataset images, and formulate the weighting as a convex optimization problem. Our approach shows superior performance compared with the state-of-the-art approach on a public multi-view image dataset. Moreover, we conduct sensitivity tests to analyze the parameters of our approach based on the gathered user sketches.

The last is video summarization on egocentric cameras. We propose a joint approach that efficiently generates compact and informative summaries without requiring the class label to be given in advance. Our approach simultaneously detects video highlights and estimates the class label, generating summaries immediately without watching the whole video sequence. After observing enough video, it correctly infers the target class label at an early stage and from then on uses only the class-specific model to summarize video highlights, saving computational cost. Experimental results on a public egocentric dataset show that our method is very competitive with state-of-the-art methods that require class labels to be known during testing. Moreover, the early class prediction aspect of our method significantly reduces the computational cost while retaining the original performance, demonstrating the efficiency and effectiveness of our method for video highlighting.
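To make the joint scheme of the first topic concrete, below is a minimal sketch of the alternating idea: fit geometry under the current class's shape prior, then re-predict the class from the fitted geometry. Everything here (the toy fit_3d, the nearest-mean classifier, the 2D "shape" vectors) is a hypothetical stand-in for illustration, not the dissertation's actual models.

```python
# Minimal sketch of alternating 3D fitting and classification.
# All functions and priors are toy stand-ins, not the actual method.
import numpy as np

def fit_3d(features, init_shape):
    # Toy "fitting": pull the prior shape toward the observed features.
    return 0.5 * (init_shape + features)

def classify(shape, class_means):
    # Toy classifier: nearest class mean in shape space.
    dists = {c: np.linalg.norm(shape - m) for c, m in class_means.items()}
    return min(dists, key=dists.get)

def joint_fit_and_classify(features, class_means, n_iters=3):
    shape = np.mean(list(class_means.values()), axis=0)  # class-agnostic init
    label = None
    for _ in range(n_iters):
        shape = fit_3d(features, init_shape=shape)       # geometry step
        label = classify(shape, class_means)             # label step
        shape = 0.5 * (shape + class_means[label])       # refine with class prior
    return label, shape

class_means = {"sedan": np.array([0.0, 1.0]), "suv": np.array([1.0, 0.0])}
print(joint_fit_and_classify(np.array([0.9, 0.2]), class_means))
```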
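For the second topic, part rectification can be illustrated with a homography: once a fitted 3D model predicts a part's corners in the photo, the part is warped into a canonical reference view so attributes (grille, lamp, wheel style) become comparable across viewpoints. The sketch below uses real OpenCV calls, but the corner coordinates and the synthetic image are made up; in practice the photo would be loaded and the corners would come from the fitted model.

```python
# Sketch of warping a detected car part into a canonical view.
import cv2
import numpy as np

image = np.zeros((240, 320, 3), np.uint8)  # stand-in for the car photo
# Grille corners predicted in the photo (hypothetical values).
src = np.float32([[120, 80], [260, 70], [270, 180], [110, 190]])
# Corresponding corners in the canonical frontal view.
dst = np.float32([[0, 0], [128, 0], [128, 64], [0, 64]])

H = cv2.getPerspectiveTransform(src, dst)          # 3x3 homography
rectified = cv2.warpPerspective(image, H, (128, 64))
print(H.shape, rectified.shape)  # (3, 3) (64, 128, 3)
```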
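The sub-query weighting in the third topic can be posed, at least schematically, as the convex problem below: non-negative, simplex-constrained weights over synthesized views that minimize a squared loss against relevance labels. cvxpy and numpy are real libraries, but the loss and the toy similarity matrix are illustrative guesses; the dissertation's actual multi-query feature is not reproduced here.

```python
# Sketch: learn weights over multi-view sub-queries via convex optimization.
import numpy as np
import cvxpy as cp

def learn_view_weights(S, y):
    """S: (num_images, num_views) similarity of each database image to each
    synthesized sketch view; y: binary relevance labels per image."""
    n_views = S.shape[1]
    w = cp.Variable(n_views, nonneg=True)      # non-negative view weights
    loss = cp.sum_squares(S @ w - y)           # squared loss (assumed form)
    prob = cp.Problem(cp.Minimize(loss), [cp.sum(w) == 1])
    prob.solve()
    return w.value

# Toy usage with random similarities and labels.
rng = np.random.default_rng(0)
S = rng.random((100, 5))                        # 100 images, 5 views
y = (rng.random(100) > 0.7).astype(float)
print(learn_view_weights(S, y))
```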
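Finally, the early-class-prediction idea in the last topic can be sketched as a streaming loop: accumulate per-class evidence frame by frame, lock in a class once its posterior is confidently ahead, and from then on run only that class's highlight model. The models, thresholds, and scalar "frames" below are toy stand-ins, not the dissertation's detectors.

```python
# Sketch of early class prediction for streaming video summarization.
import numpy as np

def summarize_stream(frames, class_loglik, highlight_models, conf=0.9):
    """class_loglik[k](frame): per-frame log-likelihood of class k;
    highlight_models[k](frame): highlight score under class k."""
    evidence = np.zeros(len(class_loglik))
    locked, highlights = None, []
    for t, frame in enumerate(frames):
        if locked is None:
            evidence += np.array([f(frame) for f in class_loglik])
            post = np.exp(evidence - evidence.max())
            post /= post.sum()
            if post.max() > conf:
                locked = int(post.argmax())     # early class prediction
            score = np.mean([h(frame) for h in highlight_models])
        else:
            score = highlight_models[locked](frame)  # cheap, class-specific path
        if score > 0.5:
            highlights.append(t)
    return locked, highlights

# Toy usage: scalar "frames" drawn near 1.0, so class 1 fits better.
frames = np.random.default_rng(1).normal(1.0, 0.3, size=50)
loglik = [lambda x: -(x - 0.0) ** 2, lambda x: -(x - 1.0) ** 2]
models = [lambda x: 1 - abs(x), lambda x: x]
print(summarize_stream(frames, loglik, models))
```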
