
Evaluation of LLM on Verbal to Text Summarization: A Study Using the Meeting Corpus

Advisor: 曹承礎

Abstract


As modern enterprises and organizations increasingly rely on meetings for communication and decision-making, producing meeting minutes and summaries has become especially important. Writing them manually is tedious and error-prone, and the quality often depends on the note-taker's skill, so earlier research has already applied language models to meeting summarization. In recent years large language models such as the GPT, Gemini, and Llama families have risen to prominence, and related studies have shown that they perform remarkably well on natural language processing tasks such as text generation, translation, and question answering. Meeting transcripts, however, are typically long and structurally complex, contain dialogue among multiple speakers, and cover a wide range of topics, so summarizing a meeting differs from summarizing an ordinary article. Because research on generating meeting summaries with large language models remains relatively scarce, this study aims to fill that gap. It evaluates how well several large language models generate meeting summaries, using the AMI meeting corpus and comparing the models under different preprocessing methods (Google speech recognition and Whisper Base transcription) and different prompt designs. The results show that GPT-4 performs best in most settings, GPT-3.5 holds the advantage when high precision is required, and Gemini 1.5 Pro stands out in recall. The study offers recommendations for applying different large language models to practical meeting summarization, so that a solution can be chosen according to specific needs, and these findings are expected to help enterprises and organizations select suitable technologies to improve the efficiency and accuracy of meeting minutes and summaries.
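
To make the workflow described above concrete, the following is a minimal sketch of the transcription-plus-prompting pipeline, assuming the open-source openai-whisper package for Whisper Base transcription and the OpenAI chat API for summarization (Google speech recognition would be an alternative front end). The audio file name, meeting ID, and prompt wording are illustrative assumptions rather than the exact settings used in the study.

    # Sketch: transcribe a meeting recording with Whisper (base model),
    # then prompt an LLM to summarize the transcript.
    import whisper
    from openai import OpenAI

    def transcribe(audio_path: str) -> str:
        # Load the Whisper "base" checkpoint (the Whisper Base setting mentioned above).
        model = whisper.load_model("base")
        return model.transcribe(audio_path)["text"]

    def summarize(transcript: str, client: OpenAI, llm: str = "gpt-4") -> str:
        # One simple prompt design; the study compares several.
        prompt = (
            "Below is the transcript of a project meeting. Write a concise summary "
            "covering the main topics discussed, decisions made, and action items.\n\n"
            + transcript
        )
        response = client.chat.completions.create(
            model=llm,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    if __name__ == "__main__":
        client = OpenAI()  # API key read from the OPENAI_API_KEY environment variable
        text = transcribe("ES2002a.Mix-Headset.wav")  # an AMI meeting recording (illustrative path)
        print(summarize(text, client))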

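The abstract reports results in terms of precision and recall. For summarization these are commonly taken from ROUGE scores computed against reference summaries; the abstract does not name the metric, so ROUGE here is an assumption, and the two summary strings below are placeholders rather than study data.

    # Sketch: score a generated summary against a reference summary with ROUGE,
    # reading off precision, recall, and F-measure (rouge-score package assumed).
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    reference = "The team agreed on the remote control design and assigned follow-up tasks."
    generated = "The meeting discussed the remote control design and assigned tasks to members."

    # score(target, prediction): each entry holds precision, recall, and F-measure.
    scores = scorer.score(reference, generated)
    for name, s in scores.items():
        print(f"{name}: precision={s.precision:.3f} recall={s.recall:.3f} f1={s.fmeasure:.3f}")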

