Multimedia data are being produced in our daily lives at an ever-increasing rate, yet appropriate processing and search techniques are still lacking. Many researchers are therefore interested in multimedia technologies such as video annotation and video retrieval. Video annotation, which assigns predefined semantic concepts to videos according to their content, is an essential step toward video search, and most research focuses on how to bridge the semantic gap between low-level features and high-level concepts. Most learning-based video semantic analysis methods aim to obtain a good semantic model, but they require a large number of training samples to achieve good performance. Annotating a large amount of data is extremely time- and labor-intensive, so effective training samples are difficult to obtain. Moreover, the quality of such a semantic model should depend on the distribution of the training data rather than on the size of the database. Conventional training-set construction relies on random selection or on choosing a portion of the video data; it ignores the similarity and coverage characteristics of the training set and therefore cannot achieve effective results. In this thesis, we propose several methods for selecting training data while reducing user involvement: clustering-based selection, spatial dispersiveness, temporal dispersiveness, and sample-based selection. With these methods, we aim to build a small yet effective training set from the spatial and temporal distribution of the video data. If the selected training data represent the characteristics of the whole video collection, classification performance remains good even when the training set is far smaller than the original video data. In our experiments, we classify video semantics into five categories: person, landscape, cityscape, map, and others. Experimental results confirm that these methods are effective for training-set selection and outperform random selection.
Most learning-based video semantic analysis methods aim to obtain a good semantic model, which requires a large training set to achieve good performance. However, annotating a large video collection is labor-intensive, and collecting a training set is not easy either. Most training-set selection schemes adopt random selection or select parts of the video data, neglecting the similarity and coverage characteristics of the training set. In this thesis, we propose several methods to construct the training set while reducing user involvement: clustering-based, spatial-dispersiveness, temporal-dispersiveness, and sample-based selection. Using these schemes, we aim to construct a small yet effective training set from the spatial and temporal distribution and clustering information of the whole video data. If the selected training data represent the characteristics of the whole video data, classification performance remains good even when the training set is much smaller than the whole data set. We choose the best samples for training a semantic model and use an SVM to classify each sample. This thesis classifies shots into five semantic categories: person, landscape, cityscape, map, and others. Experimental results show that these methods are effective for training-set selection in video annotation and outperform random selection.
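As a rough illustration of the clustering-based selection idea described above (the thesis gives no code; the function name, the plain k-means procedure, and all parameters here are hypothetical assumptions), one could cluster the shot feature vectors and keep, for each cluster, the sample nearest the centroid as a training exemplar:

```python
import numpy as np

def select_training_set(X, n_clusters, seed=0, n_iter=20):
    """Clustering-based training-set selection sketch (hypothetical).

    X: (n_samples, n_features) array of shot feature vectors.
    Returns indices of the samples chosen as training exemplars.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids with randomly chosen distinct samples.
    centroids = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):  # a few Lloyd iterations
        # Distance of every sample to every centroid: shape (n, k).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    # For each centroid, pick the single closest sample as an exemplar.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.unique(d.argmin(axis=0))
```

The selected indices would then be the only shots a user annotates before training the SVM; the temporal-dispersiveness variant could analogously be sketched as sampling shots at a fixed stride along the time axis.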