Understanding multi-modal video, which mirrors human visual and textual cognition, is crucial for a wide range of applications. However, previous supervised learning studies face two fundamental challenges: (1) the need for extensive human annotation and (2) the lack of System 2 reasoning ability. In this research, we propose three novel components to address these challenges.

First, to lower the cost of data annotation, we present a new task, Video Question Answer Generation (VQAG), which automatically generates question-answer pairs for training Video QA systems. Unlike previous QA-generation methods that rely on captions, our method takes video directly as input and thereby avoids information loss. To address VQAG, we design a Generator-Pretester Network that encourages the model to output answerable questions by attempting to answer them.

Second, we propose a solution to causal Video QA that extracts causal commonsense knowledge from language models. Unlike traditional Video QA, which relies solely on visual observations, causal Video QA integrates commonsense knowledge for more sophisticated reasoning. Existing caption-based QA methods are limited to extracting association knowledge, making them unsuitable for causal Video QA. To address this challenge, we utilize language models, which have observed vast causal relations during training, to extract commonsense knowledge. We prompt the models with intention-action pairs, extract their responses, and transform them into question-answer pairs that can be used to train Video QA systems.

Third, to examine and develop machines' System 2 reasoning capabilities, we propose a novel task, Trope Understanding. Tropes are storytelling devices, and understanding them requires causal and motivational reasoning. We collect two movie trope understanding datasets and highlight the significant gap between state-of-the-art models and human performance. To close this gap, we propose two methods: (1) the Multi-level Comprehension Model, which comprehends both the temporal (storyline) and spatial (character relations) dimensions, and (2) the Trope Understanding and Storytelling Model, which leverages human interpretation by learning to align visual and textual features in a latent space.

Experimental results demonstrate the effectiveness of our proposed components, which outperform state-of-the-art methods on Video Question Answer Generation, Trope Understanding, and Zero-Shot Causal Video Question Answering. Moreover, we provide detailed analyses to pave the way for future work.
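To make the Generator-Pretester idea above concrete, the following is a minimal sketch of the training signal it describes: a generator proposes a question from video features, a pretester attempts to answer it, and the answering loss flows back to the generator so it learns to emit answerable questions. All module names, dimensions, and the soft-token trick below are illustrative assumptions, not the thesis architecture.

```python
import torch
import torch.nn as nn

VOCAB, DIM, Q_LEN = 1000, 256, 12  # illustrative sizes

class Generator(nn.Module):
    """Proposes a question (token logits) from a video feature."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(DIM, Q_LEN * VOCAB)

    def forward(self, video_feat):                 # (B, DIM)
        logits = self.proj(video_feat)
        return logits.view(-1, Q_LEN, VOCAB)       # (B, Q_LEN, VOCAB)

class Pretester(nn.Module):
    """Attempts to answer the generated question given the video."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(VOCAB, DIM)         # embeds soft question tokens
        self.cls = nn.Linear(2 * DIM, VOCAB)       # predicts the answer token

    def forward(self, q_probs, video_feat):
        q_feat = self.embed(q_probs).mean(dim=1)   # (B, DIM)
        return self.cls(torch.cat([q_feat, video_feat], dim=-1))

gen, tester = Generator(), Pretester()
video_feat = torch.randn(4, DIM)                   # stand-in for real video features
answer_ids = torch.randint(0, VOCAB, (4,))         # stand-in ground-truth answers

q_logits = gen(video_feat)
q_probs = q_logits.softmax(dim=-1)                 # soft tokens keep the path differentiable
ans_logits = tester(q_probs, video_feat)
loss = nn.functional.cross_entropy(ans_logits, answer_ids)
loss.backward()                                    # the answering loss reaches the generator
print(loss.item())
```

Because the pretester's loss back-propagates through the soft question tokens, questions that the pretester cannot answer are penalized, which is one plausible way to operationalize "encouraging answerable questions."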
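The knowledge-extraction pipeline of the second component can likewise be sketched: prompt a language model with an intention-action pair, then reshape its completion into a causal question-answer pair. The prompt template, the choice of model ("gpt2"), and the example pair are assumptions for illustration only, not the prompts used in the thesis.

```python
from transformers import pipeline

# Hypothetical intention-action pair; in practice these would come from a dataset.
intention, action = "stay dry in the rain", "open an umbrella"

generate = pipeline("text-generation", model="gpt2")  # any causal LM would do

# Prompt the LM so its continuation expresses the causal link.
prompt = f"The person decided to {action} because"
output = generate(prompt, max_new_tokens=20)[0]["generated_text"]
completion = output[len(prompt):].strip()

# Reshape the completion into a QA pair for training a causal Video QA system.
qa_pair = {
    "question": f"Why did the person {action}?",
    "answer": completion or f"to {intention}",  # fall back to the known intention
}
print(qa_pair)
```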
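Finally, "learning to align visual and textual features in a latent space" admits a standard contrastive reading; the sketch below shows a CLIP-style symmetric InfoNCE objective as one plausible instantiation. The projection sizes and temperature are illustrative, and this is not claimed to be the exact objective of the Trope Understanding and Storytelling Model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """Projects visual and textual features into a shared latent space."""
    def __init__(self, vis_dim=512, txt_dim=300, latent_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, latent_dim)
        self.txt_proj = nn.Linear(txt_dim, latent_dim)
        self.temperature = 0.07

    def forward(self, vis_feat, txt_feat):
        v = F.normalize(self.vis_proj(vis_feat), dim=-1)  # (B, latent)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)  # (B, latent)
        logits = v @ t.T / self.temperature               # pairwise similarities
        targets = torch.arange(len(v))                    # matched pairs on the diagonal
        # Symmetric InfoNCE: video-to-text and text-to-video directions.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

model = AlignmentHead()
loss = model(torch.randn(8, 512), torch.randn(8, 300))   # stand-in feature batches
loss.backward()
print(loss.item())
```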