The goal of this study is to explore lightweight, parameter-efficient fine-tuning methods for self-supervised speech models, so that these pre-trained models can be used more efficiently. Recent research has shown that self-supervised learning holds great potential for a wide range of speech tasks and can be adapted to different downstream tasks through fine-tuning. However, traditional fine-tuning is parameter-inefficient when applied to large-scale self-supervised models with millions of parameters. To address this issue, we introduce adapters, lightweight modules commonly used in natural language processing, so that pre-trained self-supervised speech models can be transferred to downstream tasks more effectively and efficiently. In our approach, the parameters of the pre-trained self-supervised speech model are frozen, and only the adapter parameters are fine-tuned for each downstream task. Because the effectiveness of adapters in self-supervised speech tasks has not been well studied, we fill this gap by inserting different adapter modules into pre-trained speech self-supervised learning models. Specifically, we apply different efficient fine-tuning methods, including adapter tuning and prompt tuning, to self-supervised speech models on the SUPERB benchmark, and we propose an adapter framework that handles multiple downstream speech processing tasks such as speech recognition, classification, and speaker identification. Through this research, we aim to leverage efficient fine-tuning methods to improve the performance of speech models and to provide better solutions for multiple downstream tasks in speech processing.
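To make the frozen-backbone adapter setup concrete, the following is a minimal PyTorch sketch, not the implementation used in this study. It assumes a Houlsby-style bottleneck adapter (down-projection, nonlinearity, up-projection, residual connection) and a hypothetical pre-trained model whose Transformer layers are exposed as model.encoder.layers and return a single hidden-state tensor; those names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Houlsby-style bottleneck adapter: down-project, nonlinearity,
    up-project, then a residual connection back to the input."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()
        # Start near the identity mapping so the frozen pre-trained
        # representations are not disturbed at the beginning of training.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

def attach_adapters(model: nn.Module, hidden_dim: int,
                    bottleneck_dim: int = 64) -> nn.ModuleList:
    """Freeze the whole pre-trained SSL backbone and hook one adapter onto
    the output of every Transformer layer; only the adapters are trained."""
    for p in model.parameters():
        p.requires_grad = False  # backbone stays fixed

    adapters = nn.ModuleList()
    # `model.encoder.layers` is an assumed attribute; adjust it to however
    # the chosen SSL model actually exposes its Transformer layers.
    for layer in model.encoder.layers:
        adapter = BottleneckAdapter(hidden_dim, bottleneck_dim)
        adapters.append(adapter)
        # A forward hook that returns a value replaces the layer output,
        # so the adapter is applied without touching the layer's weights.
        # This assumes each layer returns a single hidden-state tensor.
        layer.register_forward_hook(
            lambda module, inputs, output, a=adapter: a(output))
    return adapters

# Only the adapter parameters (a small fraction of the backbone) are
# handed to the optimizer for each downstream task, e.g.:
# adapters = attach_adapters(ssl_model, hidden_dim=768)
# optimizer = torch.optim.Adam(adapters.parameters(), lr=1e-4)
```

Prompt tuning, the other efficient fine-tuning method mentioned above, follows the same frozen-backbone recipe, except that a small set of trainable embeddings is prepended to the input sequence rather than inserting modules inside the layers.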