
Trope Understanding and Detection in Movies and Animations

Trope Understanding in Movies and Animations

Advisor: 徐宏民 (Winston H. Hsu)

Abstract


Understanding and comprehending video content is crucial for many real-world applications, such as search and recommendation systems. Although recent advances in deep learning have leveraged visual cues to boost performance on a variety of tasks, there is still much room for improvement on more challenging problems that require deep cognition, such as reasoning about intentions, motivation, or causality. Existing datasets that claim to test video reasoning ability focus on surface-level visual signals, such as actions, objects, or relations between objects, or can be solved by exploiting textual bias. In light of this, we propose a new task along with a new dataset, Trope Understanding in Movies and Animations (TrUMAn), whose goal is to develop and evaluate deep learning systems that reason about videos from visual signals. Tropes are devices that creators frequently use to convey ideas and concepts in their works. By having machines learn to solve the trope understanding task, we strengthen their deep cognition capability, and we believe this can take data mining applications and algorithm performance to the next level. To tackle the challenging TrUMAn dataset, we propose a Trope Understanding and Storytelling (TrUSt) model with a new Conceptual Storyteller module, which strengthens our video encoder by learning to tell stories about videos in a latent space; the stories generated by the model are then fed into the trope understanding module to provide further signals for learning trope understanding. Our experiments show that the best deep learning models from existing tasks achieve only 12.01% accuracy using video input signals. Moreover, even in the ideal case with human-annotated video descriptions, a BERT-based language understanding model achieves at most 28% accuracy. Our proposed TrUSt improves model performance using only video input signals, reaching 13.94% accuracy. We also provide detailed experimental analysis to pave the way for future related research. Our dataset, TrUMAn, is publicly available at: https://www.cmlab.csie.ntu.edu.tw/project/trope.

Abstract (English)


Understanding and comprehending video content is crucial for many real-world applications such as search and recommendation systems. While recent progress in deep learning has boosted performance on various tasks using visual cues, deep cognition that reasons about intentions, motivation, or causality remains challenging. Existing datasets that aim to examine video reasoning capability focus on visual signals such as actions, objects, and relations, or can be answered by exploiting text bias. Observing this, we propose a novel task along with a new dataset, Trope Understanding in Movies and Animations (TrUMAn), intended to evaluate and develop learning systems that go beyond surface visual signals. Tropes are storytelling devices frequently used in creative works. By taking on the trope understanding task and enabling the deep cognition skills of machines, we are optimistic that data mining applications and algorithms can be taken to the next level. To tackle the challenging TrUMAn dataset, we present Trope Understanding and Storytelling (TrUSt), a model with a new Conceptual Storyteller module that guides the video encoder by performing video storytelling in a latent space. The generated story embedding is then fed into the trope understanding module to provide further signals. Experimental results demonstrate that state-of-the-art learning systems from existing tasks reach only 12.01% accuracy on raw input signals. Moreover, even in the oracle case with human-annotated descriptions, BERT contextual embeddings achieve at most 28% accuracy. Our proposed TrUSt boosts model performance, reaching 13.94% accuracy with video input alone. We also provide detailed analysis to pave the way for future research. TrUMAn is publicly available at: https://www.cmlab.csie.ntu.edu.tw/project/trope.
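
To make the TrUSt data flow described above concrete (video encoder, Conceptual Storyteller operating in a latent space, story embedding fed back as an extra signal for trope classification), here is a minimal PyTorch sketch. All module choices, dimensions, the concatenation-based fusion, and the placeholder trope count are illustrative assumptions, not the implementation from this thesis.

```python
import torch
import torch.nn as nn

class TrUStSketch(nn.Module):
    """Hypothetical sketch of the TrUSt data flow: a video encoder, a
    Conceptual Storyteller producing a latent story embedding, and a trope
    classifier consuming both. Modules and dimensions are placeholders."""

    def __init__(self, feat_dim=2048, hidden_dim=512, num_tropes=100):
        super().__init__()
        # Aggregate per-frame visual features into one clip-level vector.
        self.video_encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Conceptual Storyteller: maps the clip vector to a latent "story"
        # embedding (the thesis trains this branch with a storytelling
        # objective; here it is just a feed-forward placeholder).
        self.storyteller = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Trope head: the story embedding is fed back as an extra signal.
        self.classifier = nn.Linear(2 * hidden_dim, num_tropes)

    def forward(self, frame_feats):             # (batch, frames, feat_dim)
        _, h_n = self.video_encoder(frame_feats)
        video_emb = h_n[-1]                     # (batch, hidden_dim)
        story_emb = self.storyteller(video_emb)
        joint = torch.cat([video_emb, story_emb], dim=-1)
        return self.classifier(joint)           # trope logits

# Toy usage: 4 clips, 32 sampled frames each, 2048-d frame features.
logits = TrUStSketch()(torch.randn(4, 32, 2048))
print(logits.shape)  # torch.Size([4, 100])
```

The point mirrored here is the dual role of the storyteller branch: during training, a storytelling objective on the latent story embedding would shape the shared video representation, while its output also serves as an additional input to the trope classifier.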

