
透過電影中的隱喻探討大型語言模型的影片推理能力

Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

Advisor: 徐宏民

Abstract

Large Language Models (LLMs) have demonstrated effectiveness not only in language tasks but also in video reasoning. This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills: (1) Abstract Perception: understanding and tokenizing abstract concepts in videos, and (2) Long-range Compositional Reasoning: planning and integrating intermediate reasoning steps for understanding long-range videos with numerous frames. Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches. Our experiments show that current methods, including Captioner-Reasoner, Large Multimodal Model Instruction Fine-tuning, and Visual Programming, only marginally outperform a random baseline when tackling the challenges of Abstract Perception and Long-range Compositional Reasoning. To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR), which enhance Visual Programming by fostering role-interaction awareness and progressively refining movie contexts and trope queries during reasoning, significantly improving performance by 15 F1 points. However, this performance still lags behind human levels (40 vs. 65 F1). Additionally, we introduce a new protocol to evaluate whether Abstract Perception and Long-range Compositional Reasoning are necessary for task resolution, by analyzing the code generated through Visual Programming with an Abstract Syntax Tree (AST), thereby confirming that TiM is more complex than existing datasets.
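The abstract's AST-based protocol can be illustrated with a minimal sketch using Python's standard `ast` module. The two metrics below (API-call count and maximum AST nesting depth) and the toy Viper-style program are illustrative assumptions for exposition, not the thesis's actual evaluation criteria:

```python
import ast


def program_complexity(source: str) -> dict:
    """Compute rough proxy metrics for how compositional a generated
    program is: the number of function calls it makes and the maximum
    nesting depth of its abstract syntax tree."""
    tree = ast.parse(source)

    def depth(node: ast.AST) -> int:
        # Recursively find the deepest chain of child nodes.
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(c) for c in children), default=0)

    num_calls = sum(isinstance(n, ast.Call) for n in ast.walk(tree))
    return {"num_calls": num_calls, "max_depth": depth(tree)}


# A hypothetical Viper-style program: scan clips, perceive faces,
# then reason over them against the trope query.
demo = (
    "clips = scan(video)\n"
    "faces = [detect_faces(c) for c in clips]\n"
    "answer = reason(faces, query)\n"
)
print(program_complexity(demo))
```

Under this kind of protocol, tasks whose generated programs consistently require more calls and deeper nesting would be taken as demanding more compositional reasoning.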


Keywords

LLMs, LMMs, reasoning, visual programming, tropes

