
提升大語言模型推理效率:優化非自迴歸生成模型與推測解碼

Improving Efficient Inference for Large Language Models: Non-Autoregressive Models and Speculative Decoding Approaches

Advisor: 李宏毅 (Hung-yi Lee)

Abstract


Generative systems based on large language models (LLMs) have achieved remarkable success in recent years, but their practical deployment is often limited by high inference latency, in particular the latency introduced by autoregressive decoding. This thesis addresses the dual challenge of improving LLM efficiency while maintaining high output quality.

The first focus of this work is fully non-autoregressive models. By generating the entire output in a single forward pass, these models offer substantial potential for faster inference in tasks such as machine translation. However, compared with autoregressive models, non-autoregressive models often suffer a marked drop in output quality. To address this limitation, we propose CTCPMLM, an encoder-only large language model trained with the Connectionist Temporal Classification (CTC) loss, which effectively handles the challenges of sequence alignment and output-length prediction. In addition, we adopt a MASK-insertion scheme for up-sampling in place of conventional token duplication, and apply an embedding distillation strategy to further improve non-autoregressive model quality. Experimental results show that CTCPMLM not only surpasses autoregressive models on multiple datasets but also achieves a speedup of up to 16.35x, establishing it as the state of the art among non-autoregressive models.

The second focus of this work is speculative decoding, an emerging strategy in which an efficient draft model generates multiple tokens whose correctness is then verified in parallel by the target model. Although speculative decoding shows great promise for accelerating LLM inference, its performance depends heavily on a single hyperparameter, the window size. Conventional approaches rely on static heuristics or dynamic adjustment schemes that require extra training or additional GPU allocation. To overcome these limitations, we propose Hierarchical Speculative Decoding with Dynamic Window (HSDDW), a new framework that requires no additional training. HSDDW introduces a self-verification mechanism that lets the draft model decide on its own when to stop generating, and it integrates a hierarchical structure that leverages models of different sizes to further optimize overall speed and efficiency. The method delivers competitive results on multiple benchmarks, with clear gains in computational performance and scalability.

In summary, this thesis makes substantial contributions to efficient LLM inference by introducing novel methods and frameworks that address the key trade-off between speed and output quality. These advances lay a foundation for future research and improve the practicality and scalability of LLM applications.
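To make the single-pass decoding idea above concrete, the minimal Python sketch below illustrates MASK-insertion up-sampling followed by a CTC-style collapse of position-wise predictions. The helper names, the up-sampling ratio, and the hard-coded "predictions" are illustrative assumptions only, not the CTCPMLM implementation.

```python
# Minimal sketch of single-pass, CTC-style non-autoregressive decoding:
# MASK-insertion up-sampling plus a CTC collapse of position-wise predictions.
# All names and values here are illustrative, not the thesis implementation.

BLANK = "<blank>"   # CTC blank symbol
MASK = "<mask>"     # placeholder inserted during up-sampling

def upsample_with_mask(src_tokens, ratio=2):
    """Insert MASK placeholders instead of duplicating source tokens,
    giving the model room to emit outputs longer than the source."""
    out = []
    for tok in src_tokens:
        out.append(tok)
        out.extend([MASK] * (ratio - 1))
    return out

def ctc_collapse(predictions):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    collapsed, prev = [], None
    for tok in predictions:
        if tok != prev and tok != BLANK:
            collapsed.append(tok)
        prev = tok
    return collapsed

# One prediction per up-sampled position. A real system would take the
# per-position argmax of the encoder's logits from a single forward pass;
# here that argmax result is simply hard-coded.
upsampled = upsample_with_mask(["ich", "bin", "sehr", "müde"], ratio=2)
per_position_argmax = ["i", "i", "am", BLANK, "very", "very", "tired", BLANK]
assert len(per_position_argmax) == len(upsampled)
print(ctc_collapse(per_position_argmax))  # -> ['i', 'am', 'very', 'tired']
```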

Abstract (English)


Large language model (LLM)-based generative systems have achieved remarkable success, yet their practical deployment is often hindered by excessive inference latency, primarily caused by autoregressive decoding. This thesis addresses the dual challenge of enhancing the efficiency of LLMs while maintaining high output quality.

The first focus of this work is fully non-autoregressive (NAT) models. These models, which generate outputs in a single forward pass, offer significant potential for improving inference speed in tasks such as machine translation. However, fully non-autoregressive models often suffer a substantial degradation in output quality compared with their autoregressive counterparts. To overcome this limitation, we propose CTCPMLM, an encoder-only LLM trained with the Connectionist Temporal Classification (CTC) loss, which effectively addresses the challenges of sequence alignment and output-length prediction. Additionally, we adopt a MASK-insertion scheme for up-sampling, replacing traditional token duplication, and implement an embedding distillation strategy to further refine NAT model quality. Experimental results demonstrate that CTCPMLM not only surpasses the performance of autoregressive models on multiple datasets but also achieves an impressive 16.35x speedup, establishing it as a state-of-the-art NAT approach.

The second focus of this thesis is speculative decoding, an emerging strategy that employs an efficient draft model to generate multiple tokens, which are subsequently verified in parallel by a target model. While speculative decoding shows significant potential for accelerating LLM inference, its performance is highly dependent on a single hyperparameter, the window size. Traditional methods often rely on static heuristics or dynamic adjustments that require additional training or meticulous resource allocation. To address these limitations, we propose Hierarchical Speculative Decoding with Dynamic Window (HSDDW), a novel framework that eliminates the need for additional training. HSDDW introduces a self-verification mechanism, allowing the draft model to autonomously decide when to stop generating, and incorporates a hierarchical structure leveraging models of varying sizes to further optimize speed and efficiency. This approach demonstrates competitive results across multiple benchmarks, significantly improving both computational performance and scalability.

Overall, this thesis makes substantial contributions to the field of efficient LLM inference by introducing innovative methods and frameworks that address the critical trade-off between speed and output quality. These advancements lay a foundation for future research, paving the way for more practical and scalable LLM applications.
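As a companion illustration, the sketch below shows the draft-then-verify loop behind speculative decoding in its simplest greedy form: a cheap draft model proposes a fixed window of tokens, and the target model checks them, keeping the longest matching prefix. The toy integer "models", the fixed WINDOW, and the greedy acceptance rule are illustrative assumptions; HSDDW additionally sizes the window dynamically and stacks models of different sizes hierarchically.

```python
# Greedy-verification sketch of a speculative decoding step. The toy integer
# "models" and the fixed WINDOW are illustrative assumptions, not the HSDDW
# implementation (which sizes the window dynamically and uses a hierarchy).

WINDOW = 4  # number of tokens the draft model proposes per step

def target_next(prefix):
    """Stand-in for the expensive target model's greedy next token."""
    return (prefix[-1] * 7 + 3) % 100

def draft_next(prefix):
    """Stand-in for the cheap draft model; usually agrees with the target."""
    guess = target_next(prefix)
    return guess if prefix[-1] % 5 else (guess + 1) % 100  # occasional mismatch

def speculative_step(prefix):
    # 1) Draft proposes WINDOW tokens autoregressively (cheap, sequential).
    ctx, proposals = list(prefix), []
    for _ in range(WINDOW):
        tok = draft_next(ctx)
        proposals.append(tok)
        ctx.append(tok)
    # 2) Target verifies the proposals. Conceptually this is one parallel
    #    forward pass over the extended prefix; here it is unrolled.
    ctx, accepted = list(prefix), []
    for tok in proposals:
        expected = target_next(ctx)
        if tok == expected:
            accepted.append(tok)        # draft token accepted "for free"
            ctx.append(tok)
        else:
            accepted.append(expected)   # first mismatch: keep the target's token
            break
    else:
        accepted.append(target_next(ctx))  # all accepted: target adds one more
    return prefix + accepted

seq = [1]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)  # multiple tokens are emitted per expensive verification step
```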

