
多模態影片理解及其擴展

Multi-modal Video Comprehension and Beyond

Advisor: 徐宏民

Abstract


Multi-modal video understanding by machine learning models, which mirrors human visual and textual perception, is essential for a wide range of applications. However, previous studies based on supervised learning face two fundamental challenges: (1) the need for large amounts of manually annotated data and (2) the lack of System 2 reasoning ability. This research proposes three novel solutions to address these challenges.

First, to reduce the cost of data annotation, we introduce a new task, Video Question-Answer Generation, whose goal is to automatically generate question-answer pairs for training Video QA systems. Unlike previous QA-generation methods that rely on textual captions, we take video as direct input and avoid information loss. To tackle this task, we design a Generator-Pretester Network that encourages the model to produce answerable questions by attempting to answer them.

Second, we propose a solution for causal Video QA: extracting causal commonsense knowledge from language models. Unlike conventional Video QA, which relies solely on visual observations, causal Video QA integrates commonsense knowledge to perform more sophisticated reasoning. Existing caption-based QA methods can only extract associative knowledge and are therefore unsuitable for causal Video QA. To address this challenge, we leverage language models that have learned extensive causal relations during training to extract commonsense knowledge. We feed intention-action pairs into a language model, collect its responses, convert them into question-answer pairs, and use these pairs to train Video QA systems.

Third, to examine and develop machines' System 2 reasoning ability, we propose a new task, Trope Understanding. Tropes are storytelling devices used in creative works, and understanding them requires causal and motivational reasoning. We collect two trope-understanding datasets and find a significant gap between state-of-the-art models and human performance in our experiments. To address this problem, we propose two methods: (1) a Multi-level Comprehension Model that reasons over both the temporal (storyline) and spatial (character-relation) dimensions, and (2) a Trope Understanding and Storytelling Model that leverages the human interpretation conveyed in textual descriptions by learning to align visual and textual features in a latent space.

Experimental results demonstrate the effectiveness of our proposed methods, which outperform state-of-the-art approaches on Video Question-Answer Generation, Trope Understanding, and zero-shot causal Video QA. In addition, we provide detailed analyses to open up new research directions.

Abstract (English)


Understanding multi-modal video inputs, reflecting human visual and textual cognition, is crucial for various applications. However, previous supervised learning studies have faced two fundamental challenges: (1) the need for significant human annotation and (2) the lack of ability to perform System 2-level reasoning. In this research, we propose three novel components to address these challenges.

First, to lower the cost of data annotation, we present a task called Video Question-Answer Generation (VQAG), which automatically generates question-answer pairs for training Video QA systems. Unlike previous QA-generation methods that rely on captions, we directly input video and avoid information loss. To address VQAG, we design a Generator-Pretester Network that encourages the model to output answerable questions by attempting to answer them.

Second, we propose a solution for tackling causal Video QA by extracting causal commonsense knowledge from language models. Unlike traditional Video QA that relies solely on visual observations, Causal Video QA integrates commonsense knowledge for more sophisticated reasoning. Existing caption-based QA methods are limited to extracting association knowledge, making them unsuitable for causal Video QA. To address this challenge, we utilize language models that have observed vast causal relations during training to extract commonsense knowledge. We prompt the models with intention-action pairs and extract responses, which are then transformed into question-answer pairs. These pairs can be used to train Video QA systems.

Third, we propose a novel task, Trope Understanding, to examine and develop machines' System 2 reasoning capabilities. Understanding storytelling devices called tropes requires causal and motivational reasoning skills. We collect two movie trope understanding datasets and highlight the significant gaps between state-of-the-art models and human performance. To address this, we propose two methods: (1) the Multi-level Comprehension Model, which comprehends both temporal (storyline) and spatial (character relations) dimensions, and (2) the Trope Understanding and Storytelling Model, which leverages human interpretation by learning to align visual and textual features in a latent space.

Experimental results demonstrate the effectiveness of our proposed components, outperforming previous state-of-the-art methods on Video Question Generation, Trope Understanding, and Zero-Shot Causal Video Question Answering. Moreover, we provide detailed analysis to pave the way for future work.
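The Generator-Pretester idea above can be summarised in code. The following is a minimal PyTorch sketch rather than the thesis's actual architecture: a generator proposes a question representation from video features, a pretester tries to answer it from the same video, and the answering loss is the signal that pushes the generator toward answerable questions. The module sizes, the single-vector question encoding, and the classification-style answer head are illustrative assumptions.

# Minimal sketch of a Generator-Pretester training signal (assumed design, not the thesis model).
import torch
import torch.nn as nn

class GeneratorPretesterSketch(nn.Module):
    def __init__(self, video_dim=512, hidden_dim=256, num_answers=1000):
        super().__init__()
        # Generator: video features -> latent question representation.
        self.generator = nn.Sequential(
            nn.Linear(video_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Pretester: (video, generated question) -> answer distribution.
        self.pretester = nn.Sequential(
            nn.Linear(video_dim + hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, video_feat):
        question = self.generator(video_feat)              # proposed question
        fused = torch.cat([video_feat, question], dim=-1)
        answer_logits = self.pretester(fused)              # attempt to answer it
        return question, answer_logits

# Toy usage: the cross-entropy on the pretester's answer acts as the
# "answerability" signal that trains the generator end to end.
model = GeneratorPretesterSketch()
video_feat = torch.randn(4, 512)                 # batch of pooled video features
target_answer = torch.randint(0, 1000, (4,))     # assumed answer-vocabulary labels
_, logits = model(video_feat)
loss = nn.functional.cross_entropy(logits, target_answer)
loss.backward()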
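The causal-knowledge pipeline, prompting a language model with intention-action pairs and rewriting its responses into question-answer pairs, could be sketched as follows. Here query_lm is a hypothetical stand-in for whatever language-model call is used, and the prompt and question templates are illustrative, not the ones from the thesis.

# Sketch of converting intention-action pairs into causal QA training pairs (assumed templates).
from typing import Callable, List, Tuple

def build_causal_qa(
    intention_action_pairs: List[Tuple[str, str]],
    query_lm: Callable[[str], str],
) -> List[Tuple[str, str]]:
    qa_pairs = []
    for intention, action in intention_action_pairs:
        # Ask the language model for the causal link behind the observed action.
        prompt = f"A person wants to {intention}, so they {action}. Why did they {action}?"
        explanation = query_lm(prompt).strip()
        # Turn the model's response into a (question, answer) training pair for Video QA.
        question = f"Why did the person {action}?"
        qa_pairs.append((question, explanation))
    return qa_pairs

# Toy usage with a dummy language model in place of a real one.
dummy_lm = lambda p: "Because they wanted to " + p.split("wants to ")[1].split(",")[0] + "."
pairs = build_causal_qa([("stay dry", "open an umbrella")], dummy_lm)
print(pairs)  # [('Why did the person open an umbrella?', 'Because they wanted to stay dry.')]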
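Aligning visual and textual features in a shared latent space, the mechanism named for the Trope Understanding and Storytelling Model, is commonly realised with a symmetric contrastive objective. The sketch below assumes an InfoNCE-style loss with linear projections; the feature dimensions, temperature, and loss choice are assumptions for illustration, not the model's documented design.

# Sketch of visual-textual alignment in a shared latent space (assumed InfoNCE-style objective).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualTextAligner(nn.Module):
    def __init__(self, video_dim=512, text_dim=768, latent_dim=256, temperature=0.07):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.temperature = temperature

    def forward(self, video_feat, text_feat):
        # Project both modalities into the shared latent space and normalise.
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        logits = v @ t.T / self.temperature        # pairwise similarities
        targets = torch.arange(v.size(0))          # matched pairs lie on the diagonal
        # Symmetric loss: video-to-text and text-to-video directions.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy usage on random features standing in for video clips and their textual descriptions.
aligner = VisualTextAligner()
loss = aligner(torch.randn(8, 512), torch.randn(8, 768))
loss.backward()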

