

Multiple Information Fusion for Multimedia Applications

Advisor: 李德財 (D. T. Lee)


Abstract


In this dissertation, we address the problem of multimedia content analysis by leveraging the information captured by various descriptors and by incorporating data from different domains. Content analysis is a key component of multimedia understanding, and it is hence an inherent part of a wide range of multimedia applications, such as image re-ranking, video event detection, and image classification. Although the design of descriptors for multimedia data has made significant progress, a general conclusion remains that no single descriptor can adequately characterize multimedia data in today's complex applications. We address this problem by exploiting the complementary information captured by various descriptors and by incorporating data from different domains, and we develop corresponding approaches to multimedia content analysis.

Approaches to information fusion can be roughly divided into two categories: \emph{early fusion} and \emph{late fusion}. While early fusion approaches integrate feature fusion into the process of model construction, late fusion approaches combine features after learning a model for each descriptor. Which strategy is preferable is typically application-dependent. We develop a set of feature fusion approaches that utilize both strategies. Specifically, this dissertation is composed of three mutually complementary information fusion parts and makes the following three main contributions. First, we address image re-ranking by leveraging multiple kernel learning, which effectively combines heterogeneous visual features and facilitates image search. Second, we design multiple bi-modal bag-of-words (BoW) representations together with different pooling strategies for video event detection; it turns out that the underlying structure of the joint audio-visual feature space of complex videos can be effectively exploited, and our approach to multimodal analysis leads to superior performance in video event detection. Third, we present a rank minimization algorithm with two salient features: (1) for late fusion, it is isotonic to the numeric scales of the scores produced by the different models; (2) it provides a new way to handle data from different domains via domain adaptation. The details of the three parts are given as follows.

In the first part, we propose to use multiple kernel learning for image re-ranking based on image content analysis. As image recognition is often limited by insufficient visual information, adopting multiple visual features can enhance recognition and increase the effectiveness of image search. We therefore present a learning algorithm that boosts the image similarity measure for image re-ranking. Specifically, we represent the various features in a unified domain, i.e., kernels, and carry out feature fusion by incrementally combining the kernels. For video event detection, we make use of both visual and audio information by creating a joint audio-visual representation, in which different pooling strategies re-quantize the visual and audio words into bi-modal words. We use multiple kernel learning to combine multimodal representations built with codebooks of various sizes during event classifier learning. Experiments on three benchmark datasets consistently show that the proposed multiple bi-modal representations yield significant performance improvements. For example, on the CCV dataset, our algorithm achieves the best performance in terms of mean average precision (MAP) and outperforms the most widely used multimodal fusion methods by $7.36\%$.
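As an illustration of the kernel-level fusion idea in the first part, the following is a minimal sketch that combines two heterogeneous descriptors through a convex combination of base kernels and ranks a pool of images with a precomputed-kernel SVM. The descriptor names, the fixed uniform weights (standing in for learned MKL weights), and the use of scikit-learn are illustrative assumptions, not the dissertation's exact algorithm.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(X, Y, gamma):
    """RBF kernel matrix between row-vector sets X and Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def combine_kernels(kernels, weights):
    """Convex combination of precomputed base kernels."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # enforce sum-to-one weights
    return sum(wi * K for wi, K in zip(w, kernels))

# Hypothetical setup: two heterogeneous descriptors for the same images
# (e.g., a color histogram and a texture descriptor; names are illustrative).
rng = np.random.default_rng(0)
color = rng.random((100, 64))     # 100 images, 64-d color feature
texture = rng.random((100, 128))  # 100 images, 128-d texture feature
y = rng.integers(0, 2, 100)       # binary relevance labels

K_color = rbf_kernel(color, color, gamma=0.5)
K_texture = rbf_kernel(texture, texture, gamma=0.1)

# Fixed uniform weights stand in for weights an MKL solver would learn.
K = combine_kernels([K_color, K_texture], [0.5, 0.5])

clf = SVC(kernel="precomputed").fit(K, y)
scores = clf.decision_function(K)  # fused similarity scores for the pool
order = np.argsort(-scores)        # images re-ranked by fused similarity
```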
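The bi-modal BoW construction can likewise be sketched as re-quantizing co-occurring visual and audio codewords into a joint vocabulary and then pooling over a segment. The segment granularity, the product-vocabulary indexing, and the two pooling variants below are assumptions made for illustration only.

```python
import numpy as np

def bimodal_bow(visual_words, audio_words, n_visual, n_audio, pooling="avg"):
    """Sketch of a joint audio-visual BoW: each (visual, audio) codeword
    pair observed in the same temporal segment is re-quantized into one
    bi-modal word out of n_visual * n_audio bins."""
    hist = np.zeros(n_visual * n_audio)
    for v, a in zip(visual_words, audio_words):
        hist[v * n_audio + a] += 1          # index into the product vocabulary
    if pooling == "avg":                    # average pooling: normalized counts
        hist /= max(len(visual_words), 1)
    elif pooling == "max":                  # max pooling: presence indicator
        hist = (hist > 0).astype(float)
    return hist

# Toy segment: per-segment visual/audio codeword indices (illustrative).
visual = [3, 3, 7, 1]  # indices into a visual codebook of size 8
audio = [0, 2, 2, 1]   # indices into an audio codebook of size 4
x = bimodal_bow(visual, audio, n_visual=8, n_audio=4, pooling="avg")
```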
In the second part, we propose a robust rank minimization algorithm to fuse the predicted confidence scores of multiple models, each of which is obtained from a particular kind of feature. Specifically, we convert each confidence score vector obtained from one model into a pairwise relationship matrix, in which each entry characterizes the comparative relationship between the scores of two test samples. We formulate the score fusion problem as one of seeking a shared pairwise relationship matrix, based on which each original matrix from an individual model can be reconstructed as the combination of the shared matrix and sparse residues. Our method not only achieves isotonicity, i.e., scale invariance, among the numeric scores of different models, but also recovers a robust prediction score for each individual test sample by removing the prediction error. Experimental results show that the proposed method achieves strong performance gains on various tasks, including object categorization and video event detection. For example, on TRECVID MED 2011, our method outperforms the two baseline methods, \emph{kernel average} and \emph{average late fusion}, by $5.2\%$ and $4.7\%$, respectively.

In the third and last part of this dissertation, we consider domain adaptation and use it as a way of enriching the given information. Rather than directly integrating multiple features in the original data domain, i.e., the target domain, we propose a robust low-rank reconstruction technique to capture the relatedness of data from different domains, i.e., source domains. The key idea is to transform the visual data in the source domain into an intermediate representation such that each transformed source sample can be linearly reconstructed by the data of the target domain. By formulating the problem as a constrained nuclear norm optimization, the valuable data can be reconstructed through a low-rank structure while the noises and outliers in the source domain are filtered out. One of our experiments, on the Caltech 256 dataset, shows that the proposed method outperforms domain adaptation baseline methods by $3.2\%$ in terms of MAP, a significant improvement over the state of the art.
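As a concrete illustration of the rank-minimization formulation in the second part, one plausible way to write it down is the following low-rank-plus-sparse decomposition of the pairwise relationship matrices. The matrix definition, the penalty weight $\lambda$, and the notation are assumed for illustration; the dissertation's exact objective may differ.

```latex
% Assumed notation: s^{(m)} is the confidence score vector from model m,
% T^{(m)} its pairwise relationship matrix, \hat{T} the shared matrix,
% and E^{(m)} the sparse residue of model m.
T^{(m)}_{ij} = s^{(m)}_i - s^{(m)}_j, \qquad m = 1, \dots, M

\min_{\hat{T},\, \{E^{(m)}\}} \; \|\hat{T}\|_* \;+\; \lambda \sum_{m=1}^{M} \big\| E^{(m)} \big\|_1
\quad \text{s.t.} \quad T^{(m)} = \hat{T} + E^{(m)}, \quad m = 1, \dots, M
```

Here the nuclear norm $\|\hat{T}\|_*$ serves as the standard convex surrogate for matrix rank, so the shared relationship matrix can be recovered with singular-value-thresholding-style solvers while the $\ell_1$ term absorbs model-specific prediction errors.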
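Similarly, the constrained nuclear norm optimization described in the third part can be sketched as below, under the assumption that the transformed source data are linearly reconstructed from the target data. The notation is assumed for illustration, and any additional constraints the dissertation may place on the transformation $W$ (e.g., orthogonality) are omitted here.

```latex
% Assumed notation: the columns of X_S and X_T hold source and target
% samples, W transforms the source data, Z holds the reconstruction
% coefficients, and E collects sample-wise noise via the l_{2,1} norm.
\min_{W,\, Z,\, E} \; \|Z\|_* \;+\; \lambda \|E\|_{2,1}
\quad \text{s.t.} \quad W X_S = X_T Z + E
```

The low-rank constraint on $Z$ couples the reconstructions of the source samples so that related structure transfers as a whole, while the column-sparse $E$ filters out source-domain noises and outliers.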

