
Image Graph Construction and Semantic Annotation for Large-Scale Social Multimedia

Advisor: 徐宏民

Abstract




In recent years, mobile devices equipped with cameras have become prevalent in the consumer market. Together with the emerging trend of multimedia sharing on social networks, this has caused the scale of multimedia data on the web to grow explosively. These raw multimedia data are usually stored without proper organization, which poses significant challenges for subsequent retrieval and use. Within such large-scale multimedia content, hidden relations and semantic meanings can be explored and leveraged to build useful multimedia applications. In this dissertation, we focus on two problems in dealing with large-scale multimedia: data volume and semantics.

First, to address the data volume problem and improve navigation and search over large-scale image data, we investigate efficient methods for constructing image graphs that represent the visual and semantic relations between images. We leverage the constructed graphs to build an efficient and scalable group-based image search system. Binary codes are a very compact representation for storing and searching image data; however, efficiently indexing and searching very large-scale image collections encoded as longer binary codes remains a challenging problem. We propose a new search framework for very large-scale binary image codes that leverages GPU devices to achieve better performance and storage efficiency than previous work. Second, regarding multimedia semantics, we propose several methods to extract semantics from multimedia content shared on social networks.

Both visual and semantic relations exist between images, and these relations can be explored to help users better navigate and use image collections. However, current image search systems generally display their results as multi-page image lists. Such lists cause no significant harm when the user's search target is obvious, but for queries of higher ambiguity, users usually find it difficult to locate their targets in a long image list. Paged image lists are also problematic on mobile devices, whose display screens are of limited size. We therefore propose a group-based image search system that summarizes image search results into semantic and visual groups. We leverage the visual and semantic relations between images to construct image graphs in an offline stage, a design that lets the system respond to online user queries efficiently. To scale up to large image collections, we use the modern parallel framework MapReduce to address the system's scalability; compared with constructing the graphs on a single machine, our construction method is 69 times faster.

To address the data volume problem in processing very large-scale image data, binary codes have recently been recognized as an enabling and promising technique for encoding and searching images. Their compact representation provides better storage efficiency when dealing with huge image collections. Moreover, compared with other image representations, pairwise similarity computation on binary codes is much faster: comparing a query against millions of binary codes can be done in under one second even with a simple linear-scan baseline. These advantages make binary codes an important component of applications over very large-scale image data.
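To make the linear-scan baseline above concrete, the following is a minimal CPU sketch of brute-force Hamming-distance search over packed binary codes. It assumes numpy; the function name hamming_search and the parameters are illustrative only, and the dissertation's actual framework runs this kind of scan on GPUs.

    import numpy as np

    def hamming_search(query, codes, k=10):
        """Brute-force k-NN search over packed binary codes (illustrative sketch).

        query: uint8 array of shape (code_bytes,)
        codes: uint8 array of shape (n, code_bytes), one packed code per row
        Returns indices of the k codes nearest to the query in Hamming distance.
        """
        # XOR isolates differing bits; unpacking and summing counts them per row.
        diff = np.bitwise_xor(codes, query)
        dists = np.unpackbits(diff, axis=1).sum(axis=1)
        return np.argsort(dists)[:k]

    # Toy example: one million 64-bit (8-byte) codes.
    rng = np.random.default_rng(0)
    codes = rng.integers(0, 256, size=(1_000_000, 8), dtype=np.uint8)
    query = rng.integers(0, 256, size=8, dtype=np.uint8)
    print(hamming_search(query, codes, k=5))

The same XOR-and-popcount scan parallelizes naturally across codes on a GPU, which is consistent with the speed advantage the proposed framework draws from GPU parallelism.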
However, when very large-scale image data (a billion images or more) must be encoded as longer binary codes (more than 32 bits), efficiently storing and searching these codes remains a challenging problem. We propose a new GPU-based framework for storing and searching very large-scale binary codes. Compared with the multiple-hashing index proposed in previous work, our random-sampling index is simpler and more storage-efficient, and it supports both exact and approximate nearest-neighbor search on binary codes. By leveraging the parallel computation of GPUs, we also achieve faster search times than previous work. To further improve the storage efficiency of our index, we propose a compression scheme for binary codes called bit compression; with a GPU-based decompression method, the compressed index sacrifices little search performance.

The lack of proper annotations on large-scale image data hinders image browsing and search applications, which motivates the development of effective automatic image annotation methods. Given an image without textual information, an automatic annotation method selects the best textual annotations for it. Prior work in this area mostly focuses on supervised learning approaches, which are impractical due to poor performance, the out-of-vocabulary problem, and the high cost of acquiring training data and of learning. We therefore argue that search-based automatic image annotation over user-contributed photo sites (e.g., Flickr) is an alternative solution: the intuition is to select the most suitable annotations for an unlabeled image from the tags of visually similar user-contributed photos. However, such tags are generally few and noisy. To solve this problem, we propose a tag expansion method that exploits the visual and semantic consistency between tags and images. We show that the proposed method significantly outperforms prior work and provides more diverse annotations.

Microblogging, a new form of communication on the Internet, has recently attracted the attention of researchers. Relying on its real-time and conversational properties, microblogging users update their statuses and share experiences within their social networks. These characteristics also make microblogging an important tool for sharing and discussing real-world events such as earthquakes or sports games. We propose a novel and flexible solution for detecting and recognizing real-time events in sports games by analyzing the messages posted on microblogging services. Taking Twitter as the experimental platform, we collect a large-scale dataset of Twitter messages (tweets) for 18 prominent games covering four types of sports in 2011, along with the corresponding game videos. The proposed solution applies moving-threshold burst detection to the volume of tweets to detect highlights in the games, and a tf-idf-based weighting method to the tweets within detected highlights for semantic extraction. Experiments on the tweet and video datasets show that the proposed methods achieve competent performance in sports event detection and recognition; moreover, our method can find non-predefined tidbits that are difficult to detect with previous approaches.
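As an illustration of the burst-detection step, here is a toy sketch of moving-threshold burst detection over a per-bin tweet volume series. The window, factor, and floor parameters are assumptions made for illustration, not the dissertation's actual settings or thresholding rule.

    from collections import deque

    def detect_bursts(counts, window=10, factor=2.0, floor=5):
        """Flag time bins whose tweet volume bursts above a moving threshold.

        counts: per-bin tweet counts (e.g., tweets per fixed-length time bin)
        window: number of preceding bins used to estimate the baseline volume
        factor: a bin is a burst when it exceeds factor * moving average
        floor:  minimum absolute count, to suppress bursts in quiet periods
        """
        history = deque(maxlen=window)
        bursts = []
        for i, c in enumerate(counts):
            # Baseline is the moving average of the previous `window` bins.
            baseline = sum(history) / len(history) if history else 0.0
            if c >= floor and c > factor * baseline:
                bursts.append(i)  # candidate highlight bin
            history.append(c)
        return bursts

    # Toy volume series with a spike at bin 6.
    print(detect_bursts([3, 4, 5, 4, 6, 5, 40, 8, 5], window=4))

In the pipeline described above, the tweets falling into the flagged bins would then be passed to the tf-idf-based weighting step for semantic extraction.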
Not all images are interesting to people: viewers are drawn to interesting images and ignore tasteless ones. Image interestingness is no less important than other subjective image properties that have received significant research interest, yet it has not been systematically studied before. In this proposal, we focus on the visual and social aspects of image interestingness. We rely on crowdsourcing tools to survey human perceptions of these subjective properties and verify the data by analyzing its consistency and reliability. We show that people reach a degree of agreement when deciding whether an image is interesting. We then examine the correlations among the social and visual aspects of interestingness and image aesthetics, and find that: (1) the weak correlation between social interestingness and both visual interestingness and aesthetics indicates that the images people frequently re-share are not necessarily aesthetic or visually interesting; (2) the high correlation between aesthetics and visual interestingness implies that aesthetic images are more likely to be visually interesting. We then ask which features of an image lead to social interestingness, e.g., receiving more likes and shares on social networking sites. We train classifiers to predict visual and social interestingness and investigate the contributions of different image features, finding that social and visual interestingness are best predicted by color and texture, respectively, which suggests a way to influence the social and visual appeal of images through image features. Further, we investigate the correlation between social/visual interestingness and image color, and find that colors with an arousal effect appear more frequently in images of higher social interestingness. This can be explained by previous studies of the activation-related affect of colors, and it offers useful and important advice for advertising on social networking sites.
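To sketch what a feature-based interestingness predictor looks like, the following toy pipeline trains a classifier on a coarse color histogram. The descriptor, classifier choice, and placeholder data are all assumptions for illustration; they are stand-ins for the dissertation's actual features and models. It assumes numpy and scikit-learn.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def color_histogram(image, bins=8):
        """Simple per-channel RGB histogram, one coarse descriptor per image.

        image: uint8 array of shape (H, W, 3)
        """
        hist = [np.histogram(image[..., ch], bins=bins, range=(0, 256))[0]
                for ch in range(3)]
        hist = np.concatenate(hist).astype(float)
        return hist / hist.sum()

    # Placeholder data: random "images" and binary interestingness labels.
    rng = np.random.default_rng(0)
    images = rng.integers(0, 256, size=(200, 32, 32, 3), dtype=np.uint8)
    labels = rng.integers(0, 2, size=200)

    X = np.stack([color_histogram(img) for img in images])
    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, X, labels, cv=5).mean())

Swapping texture descriptors in for the color histogram would give the corresponding visual-interestingness predictor discussed above.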
