
大型語言模型的跨模態理解:多模互動中的非語言描繪

Cross-Modality Understanding in Large Language Model: Non-verbal Depiction in Multimodal Interaction

Advisor: 謝舒凱 (Shu-Kai Hsieh)

Abstract


The development of Large Language Models (LLMs) has brought a new wave of tasks and research directions to Natural Language Processing. Because an LLM's text-generation ability can be directed through natural-language prompts and instructions to solve many tasks, it has also given rise to entirely new applications in industry. Multimodal Large Language Models (MLLMs) have likewise developed rapidly within a matter of months, and models that can interpret audio-visual content are now available. This study investigates how well the most recently developed multimodal LLMs understand, across modalities, the communicative strategy of "depiction." Depiction is a method of communication that people use frequently in everyday life: it refers to creating and presenting physical, iconic scenes that allow the recipient to imagine the depicted scene, and it often appears through non-verbal channels such as gestures, sounds, and facial expressions. The ability to integrate the visual, auditory, and linguistic modalities is therefore crucial to the future development of LLMs. This thesis collects 100 video clips from American talk shows. The visual and audio modalities are first preprocessed with face recognition, pose estimation, speech transcription, and speaker identification; the data are then annotated to extract the segments that contain depiction, and finally four experiments are conducted with Video-LLaMA, a multimodal LLM. The experimental dataset is divided into four types of depiction: adjunct depiction, indexed depiction, embedded depiction, and independent depiction. The four experiments use different prompt designs, manipulating variables such as zero-shot versus few-shot prompting and Chain-of-Thought (CoT) prompting. The results show that current state-of-the-art LLMs still cannot reliably interpret gestures or produce well-integrated understanding, judgments, and explanations. The findings point to the present limitations of LLMs in gesture understanding and to the importance of continued development in this direction.
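The abstract names the preprocessing steps but not the tools behind them. The following is a minimal Python sketch of what such a pipeline could look like, assuming commonly used open-source libraries (OpenCV, face_recognition, MediaPipe, OpenAI Whisper, and pyannote.audio); these tool choices, the one-frame-per-second sampling, and the function `preprocess_clip` are illustrative assumptions rather than the thesis's actual implementation.

```python
# Illustrative sketch only: the abstract names the preprocessing steps
# (face recognition, pose estimation, transcription, speaker identification)
# but not the tools; the libraries below are assumptions.
import cv2                            # frame extraction
import face_recognition               # face detection (assumed tool)
import mediapipe as mp                # pose landmarks (assumed tool)
import whisper                        # speech transcription (assumed tool)
from pyannote.audio import Pipeline   # speaker diarization (assumed tool)

def preprocess_clip(video_path: str, audio_path: str, hf_token: str) -> dict:
    """Run the four preprocessing steps on one talk-show clip."""
    # --- Visual modality: faces and body pose, sampled ~once per second ---
    faces, poses = [], []
    cap = cv2.VideoCapture(video_path)
    fps = int(cap.get(cv2.CAP_PROP_FPS)) or 30
    with mp.solutions.pose.Pose(static_image_mode=True) as pose_model:
        frame_idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if frame_idx % fps == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                faces.append(face_recognition.face_locations(rgb))
                poses.append(pose_model.process(rgb).pose_landmarks)
            frame_idx += 1
    cap.release()

    # --- Audio modality: transcription and speaker identification ---
    asr_model = whisper.load_model("base")
    transcript = asr_model.transcribe(audio_path)   # segment-level text + timestamps
    diarizer = Pipeline.from_pretrained(
        "pyannote/speaker-diarization", use_auth_token=hf_token)
    diarization = diarizer(audio_path)               # who spoke when
    speakers = [(turn.start, turn.end, label)
                for turn, _, label in diarization.itertracks(yield_label=True)]

    return {"faces": faces, "poses": poses,
            "segments": transcript["segments"], "speakers": speakers}
```

In practice the sampling interval and model sizes would be tuned to the clip length and the available compute; the outputs are then time-aligned so that annotators can locate the segments containing depiction.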

Parallel Abstract


Large Language Models (LLMs) have revolutionized Natural Language Processing, showcasing remarkable achievements and rapid advancements. Despite significant progress in meaning construal and multimodal capabilities, LLMs at the time of writing still struggle to accurately interpret the iconic gestures that occur in "depiction." Depiction, a prevalent communicative method in daily life, involves creating and presenting physical, iconic scenes that enable recipients to imagine the depicted meaning. It is crucial for multimodal LLMs to comprehend and potentially acquire this communicative strategy. This thesis presents an investigation into these capabilities with a dataset comprising 100 video clips from four American talk shows. A pipeline is developed to automatically process the multimodal data, and the identified depiction segments are used to assess the performance of Video-LLaMA, a multimodal large language model capable of interpreting video. Four experiments are designed to evaluate whether LLMs can identify and accurately interpret four distinct types of depiction: adjunct depiction, indexed depiction, embedded depiction, and independent depiction. The four experiments use different prompt designs, including zero-shot, few-shot, zero-shot-CoT (zero-shot Chain-of-Thought), and few-shot-CoT prompting. Experimental results reveal that current state-of-the-art LLMs are unable to complete these tasks successfully. The findings underscore the existing limitations of LLMs in capturing the nuanced meaning conveyed through depiction. Addressing these challenges will be crucial for advancing the capabilities of LLMs and enabling more sophisticated multimodal interactions in Natural Language Processing.
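The four prompt conditions are named above, but the actual prompt wording is not reproduced in the abstract. Below is a hypothetical Python sketch of how the four conditions (zero-shot, few-shot, zero-shot-CoT, few-shot-CoT) could be assembled for a video-grounded model such as Video-LLaMA; the task wording, the few-shot exemplar, and the `model.generate` call are placeholders, not the prompts or inference interface used in the thesis.

```python
# Hypothetical prompt templates for the four experimental conditions.
# The wording and the few-shot exemplar are placeholders.
TASK = ("Watch the speaker in the video. Does the speaker use a gesture, "
        "sound, or facial expression to depict a scene? If so, state which "
        "type of depiction it is: adjunct, indexed, embedded, or independent.")

FEW_SHOT_EXEMPLAR = (
    "Example: The speaker says 'the fish was this big' while holding both "
    "hands far apart.\n"
    "Answer: adjunct depiction (the gesture depicts the fish's size alongside speech).")

COT_CUE = "Let's think step by step before answering."   # Chain-of-Thought cue

PROMPTS = {
    "zero-shot":     TASK,
    "few-shot":      FEW_SHOT_EXEMPLAR + "\n\n" + TASK,
    "zero-shot-cot": TASK + "\n" + COT_CUE,
    "few-shot-cot":  FEW_SHOT_EXEMPLAR + "\n\n" + TASK + "\n" + COT_CUE,
}

def run_condition(model, video_clip, condition: str) -> str:
    """Send one clip and one prompt condition to a video-capable LLM.

    `model.generate` stands in for whatever inference interface the
    deployed Video-LLaMA checkpoint exposes.
    """
    return model.generate(video=video_clip, prompt=PROMPTS[condition])
```

In this design the only difference between the plain and CoT conditions is the appended reasoning cue, and the only difference between zero-shot and few-shot is the prepended exemplar, so any performance gap between conditions can be attributed to that single change.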

