
從多模態資料建立個人知識庫與生活事件檢索

Personal Knowledge Base Construction from Multimodal Data for Lifelog Retrieval

Advisor: 陳信希

Abstract


As technology advances, wearable devices have become increasingly common, and people are more inclined to record their lives with them. The form of these records has also shifted from text-and-photo lifelogs to video blogs (vlogs) that pair footage with speech. However, as the volume of data grows rapidly, building an effective retrieval method for quick memory recall has become a difficult problem. The difficulty lies not only in the semantic gap between images and text, but also in the differences between high-level and low-level interpretations of an event. In this thesis, we introduce external semantic knowledge into text-image retrieval to bridge the semantic gap between images and text, and we build a vlog dataset from three travel-oriented YouTubers, which we use to train a model that exploits the complementary nature of text and images to construct a personal knowledge base. We use two different datasets for the following two experiments: (1) retrieving specific events from a lifelogger's experience with multimodal data, and (2) automatically constructing a personal knowledge base. For retrieving specific events from daily lifelogs, we extract image information with an external image-recognition model and combine it with semantic knowledge from external resources to strengthen the training of the text-matching encoding. For automatic personal knowledge base construction, we obtain video information with a pre-trained feature-extraction model and combine it with the encoded text to classify the events contained in each video, thereby building the personal knowledge base. The two proposed methods not only improve lifelog retrieval performance, but also effectively exploit the complementarity of text and images to improve the construction of the personal knowledge base.
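To make the retrieval step more concrete, the following is a minimal sketch, not the thesis implementation, of how a text query expanded with concept tags from an external knowledge resource can be matched against pre-extracted image features in a shared embedding space. All module names, dimensions, and the mean-pooling text encoder are illustrative assumptions.

# Minimal sketch (assumptions, not the thesis model) of cross-modal lifelog
# retrieval: a concept-expanded query is ranked against image features that
# an external recognition model has already extracted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceMatcher(nn.Module):
    def __init__(self, vocab_size=10000, text_dim=300, img_dim=2048, joint_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, text_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)   # projects the pooled query words
        self.img_proj = nn.Linear(img_dim, joint_dim)     # projects the external CNN features

    def encode_text(self, token_ids):
        # mean-pool word embeddings of the (concept-expanded) query
        pooled = self.word_emb(token_ids).mean(dim=1)
        return F.normalize(self.text_proj(pooled), dim=-1)

    def encode_image(self, img_feats):
        return F.normalize(self.img_proj(img_feats), dim=-1)

def retrieve(model, query_ids, concept_ids, gallery_feats, k=5):
    # append external-knowledge concept tokens to the raw query before encoding
    expanded = torch.cat([query_ids, concept_ids], dim=1)
    q = model.encode_text(expanded)            # (1, joint_dim)
    g = model.encode_image(gallery_feats)      # (N, joint_dim)
    scores = F.cosine_similarity(q, g)         # (N,) similarity of query to each image
    return torch.topk(scores, k)

# toy usage with random tensors standing in for real lifelog data
model = SharedSpaceMatcher()
query = torch.randint(0, 10000, (1, 6))       # tokenized query, e.g. "breakfast at the hotel"
concepts = torch.randint(0, 10000, (1, 3))    # concept tags from the external knowledge resource
gallery = torch.randn(100, 2048)              # features from the external image-recognition model
print(retrieve(model, query, concepts, gallery))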

Parallel Abstract


As science and technology progress, wearable devices have become increasingly popular, and people tend to record their daily lives with them. In the past, people recorded their lives on blogs, which mostly paired lines of text with illustrative pictures. Nowadays, people record their daily lives as vlogs, videos that carry spoken narration. However, with the enormous growth of data, processing personal data efficiently has become a critical problem. The difficulty stems not only from the semantic gap between words and images, but also from the different ways people interpret an image. In this thesis, we utilize external knowledge to bridge the semantic gap between words and images. We also propose a new subtitled video dataset, recorded mainly by three YouTubers, whose content is entirely about traveling. We build a model that exploits the complementary property of words and images to construct a personal knowledge base. We use two different datasets for the following two experiments: (1) retrieving specified lifelogger events for memory recall, and (2) automatically constructing the personal knowledge base. For retrieving a lifelogger's events, we extract information from the images with a pre-trained model and combine it with external resources to enhance the training of the semantic embedding. For the construction of the personal knowledge base, our model summarizes the possible events in a video from the extracted video information and the encoded subtitles. The proposed approaches not only improve lifelog retrieval performance, but also effectively exploit the complementary property of words and images for constructing the personal knowledge base.
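The knowledge base construction step can likewise be sketched as a simple fusion classifier. The event labels, feature dimensions, and the to_kb_entries helper below are hypothetical placeholders, assuming that clip-level video features and subtitle encodings have already been produced by external pre-trained models; this is a sketch of the general technique, not the thesis architecture.

# Minimal sketch (assumptions only): fuse a video feature vector with an
# encoded subtitle vector, classify the clip into event types, and store
# the predicted events as simple personal-knowledge-base entries.
import torch
import torch.nn as nn

EVENTS = ["eating", "sightseeing", "transport", "shopping", "lodging"]  # illustrative labels

class EventClassifier(nn.Module):
    def __init__(self, video_dim=1024, text_dim=768, hidden=512, n_events=len(EVENTS)):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(video_dim + text_dim, hidden),  # fuse the two modalities
            nn.ReLU(),
            nn.Linear(hidden, n_events),
        )

    def forward(self, video_feat, subtitle_feat):
        x = torch.cat([video_feat, subtitle_feat], dim=-1)
        return torch.sigmoid(self.fuse(x))            # multi-label event scores

def to_kb_entries(clip_id, scores, threshold=0.5):
    # turn per-clip predictions into (clip, relation, event) entries
    return [(clip_id, "depicts_event", EVENTS[i])
            for i, s in enumerate(scores.tolist()) if s >= threshold]

model = EventClassifier()
video_feat = torch.randn(1, 1024)    # from a pre-trained video feature extractor (assumed)
subtitle_feat = torch.randn(1, 768)  # from a sentence encoder over the vlog subtitles (assumed)
with torch.no_grad():
    scores = model(video_feat, subtitle_feat)[0]
print(to_kb_entries("vlog_017_clip_03", scores))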

