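With the popularity of smart handheld devices and social networking sites, people have grown used to accumulating large numbers of personal photos for sharing. This creates a need for effective ways to manage them, including semantic image retrieval. Semantic image retrieval has made great progress in recent years, mainly by using the large numbers of photos and accompanying text annotations that can be collected from the Internet as the basis for semantic analysis of images. When the retrieval target is a personal photo collection, however, the photos carry personalized semantics but lack large quantities of photos and associated text annotations, so the methods commonly used in semantic image retrieval cannot be applied. In this dissertation we therefore propose a semantic retrieval framework for personal photos that lets users annotate their photos by voice and uses those annotations as the basis for retrieval. We designed experiments around this framework and built a retrieval system that can be applied to personal photos in practice.

In the preliminary research, we used a set of personal photos with clearly read speech annotations as the experimental corpus. Probabilistic latent semantic analysis (PLSA) was used to integrate the information from image and speech features, and the trained topic model then served as the basis for retrieval. A retrieval system built on this framework achieved good retrieval results even when only about 10\% of the photos carried speech annotations.

Building on those results, the advanced research tried to simulate more realistically the difficulties met in actual use. We first collected a much larger set of personal photos whose speech annotations are no longer clearly read speech: they are close to spontaneous speech in speaking rate and fluency and include background noise, so the speech recognition error rate is far higher than in the preliminary research. We therefore adopted speaker adaptation, language model interpolation, and related techniques to improve recognition performance, and used lattices to make the speech features more robust. We also replaced the image features of the preliminary research with visual words and introduced the Columbia374 semantic concept detectors to provide additional image information. Besides the original PLSA method, we used non-negative matrix factorization (NMF) for topic analysis as a comparison; the final experimental results show that NMF performs far better than PLSA.

Finally, based on the experimental results, we implemented a semantic personal photo retrieval system and adopted the concept of diversifying retrieval results so that users can quickly browse photos of various types. The results show that the proposed framework offers a good chance of developing a successful semantic retrieval system for personal photos.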
With the prevalence of hand-held smart devices and social networks, people tend to collect large numbers of personal photos for sharing, and efficient approaches to managing those photos are therefore highly desired. Semantic image retrieval has been very successful in recent years: huge quantities of photos and their annotations available over the Internet have been used to derive semantic relationships between high-level semantic terms and the photos for retrieval. For personal photos, however, the personal annotations can be very sparse, far too sparse to derive such semantic relationships, so those successful approaches to semantic image retrieval cannot be used. In this dissertation, we adopt a new scenario and propose a new framework to tackle this problem: users annotate their photos by voice while taking pictures, and the semantic relationships between the annotations and the photos are analyzed by fusing the speech and image features. A series of studies was carried out to construct a practical solution for semantic image retrieval of personal photos.

In the preliminary research, we collected a set of personal photos with clean read-speech annotations describing roughly defined categories of information. By fusing low-level image features with speech features in probabilistic latent semantic analysis (PLSA), very good results were obtained with only 10\% of the photos manually annotated.

In the second-stage work, we collected a new, larger database of personal photos with fluent, free-form speech annotations as the experimental dataset. Recognition errors became a much more challenging problem. We adopted cepstral normalization, acoustic model adaptation, and language model interpolation to improve the recognition results. We also used the expected term frequency derived from lattices as a more robust speech feature.
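To make the lattice-based feature concrete, here is a minimal sketch of one common way to compute an expected term frequency: sum the posterior probabilities of all lattice arcs labeled with a given word. The abstract does not spell out the exact formulation used in the dissertation, so treat this as an assumption. The function name `expected_term_frequency`, the `(word, posterior)` arc representation, and the toy values are all hypothetical, and the arc posteriors are assumed to have been computed beforehand by the recognizer.

```python
from collections import defaultdict


def expected_term_frequency(arcs):
    """Return the expected count of each word in one spoken annotation.

    `arcs` is an iterable of (word, posterior) pairs, one per lattice arc.
    The expected count of a word is the sum of the posteriors of its arcs.
    """
    etf = defaultdict(float)
    for word, posterior in arcs:
        etf[word] += posterior
    return dict(etf)


# Hypothetical lattice for a single annotation (values are illustrative only).
arcs = [
    ("beach", 0.7), ("peach", 0.3),   # competing hypotheses for one segment
    ("sunset", 0.9), ("sun", 0.1),    # competing hypotheses for another
    ("beach", 0.5),                   # a later, less certain occurrence
]

etf = expected_term_frequency(arcs)
```

Unlike counts taken from the single best transcript, these soft counts keep some weight on words that the recognizer ranked second, which is the sense in which the feature is more robust to recognition errors.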
We further used visual words as representative image features in place of the low-level image features used in the preliminary research, and tried to integrate the outputs of the Columbia374 content-based concept detectors as additional image information. Moreover, we replaced the PLSA model with non-negative matrix factorization (NMF) to analyze the latent "topics". The experimental results showed that the NMF model outperformed the PLSA model in this task.

Finally, we implemented a prototype system based on these results, and adopted the concept of diversifying retrieval results for better presentation. Together, these results show that the proposed framework is an effective solution to the problem of semantic image retrieval of personal photos.
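As an illustration of the topic analysis step, the following is a minimal sketch of NMF applied to a photo-by-feature matrix whose columns concatenate speech features and visual-word counts. It uses the standard multiplicative update rules for the squared-error objective, written in plain Python for self-containment. The dissertation's actual NMF variant, feature weighting, and dimensions are not given in the abstract, so the function names, the toy matrix `V`, and the choice of `k=2` are all assumptions.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Hypothetical fused features per photo:
# [etf("beach"), etf("sea"), etf("birthday"), etf("cake"), visual word A, visual word B]
V = [
    [1.2, 0.8, 0.0, 0.0, 3.0, 0.0],  # annotated photo
    [0.0, 0.0, 0.0, 0.0, 4.0, 0.0],  # unannotated photo, image features only
    [0.0, 0.0, 1.0, 0.9, 0.0, 2.0],  # annotated photo
    [0.0, 0.0, 0.0, 0.0, 0.0, 3.0],  # unannotated photo, image features only
]

W, H = nmf(V, k=2)
# Row i of W gives the topic weights of photo i; row t of H gives the
# feature profile (spoken terms and visual words) of latent topic t.
```

In a setup like this, an unannotated photo can still be matched to a spoken-term query when it shares visual words, and hence latent topics, with annotated photos; that is the role the fused topic model plays in the framework described above.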