

Investigation of Cost Function for Task-Oriented Speech Enhancement

Advisor: 林守德 (Shou-De Lin)
Co-advisor: 曹昱 (Yu Tsao)

Abstract


In recent years, thanks to the rapid development of deep learning, the denoising ability of speech enhancement algorithms has improved substantially. Nevertheless, deep-learning-based speech enhancement still has several directions worth exploring. For example, most studies train their models with a simple mean-square error (MSE) loss function, whereas different speech enhancement applications emphasize different requirements: hearing-aid users particularly need denoising algorithms that improve speech intelligibility; when the environment is not so noisy that speech becomes hard to understand, effectively improving speech quality matters most; and for access-control systems based on automatic speaker verification (ASV), the main purpose of speech enhancement is to keep the verification error rate low even in noisy conditions. Because a denoising model cannot perfectly recover clean speech in unseen test environments, a loss function that is inconsistent with the target requirement (e.g., MSE) cannot reach the best solution. This thesis therefore focuses on using different loss functions to train speech enhancement models.

Since short-time objective intelligibility (STOI) is a commonly used measure of speech intelligibility, the first part of this thesis applies STOI directly as the loss function to train a fully convolutional neural network (FCN). Conventional deep-learning-based speech enhancement models mostly operate on time-frequency representations and process speech frame by frame, which makes it difficult to directly optimize STOI, whose "short-time" computation spans 30 frames. The proposed FCN instead operates directly on the time-domain waveform and takes the whole utterance as the processing unit.

Perceptual evaluation of speech quality (PESQ) is often used to evaluate speech quality. Compared with STOI, the PESQ computation is more complicated and contains non-differentiable functions, so it cannot be used directly as a loss function the way STOI can. The second part of this thesis therefore optimizes the PESQ score: another neural network, called Quality-Net, is trained to mimic the behavior of the PESQ function, and this learned Quality-Net is used to guide the training of the speech enhancement model. Because a Quality-Net with fixed parameters is easily fooled by the samples produced by the updated enhancement model (it assigns high scores to speech whose true PESQ score is low), we introduce an adversarial learning mechanism in which Quality-Net and the enhancement model are updated alternately; we call this framework MetricGAN. Like reinforcement learning, MetricGAN can treat the evaluation function as a black box without knowing its computational details.

Finally, to demonstrate another application of MetricGAN, we use it to minimize the false rejection rate of a speaker verification model in noisy environments. Experimental results show that these methods further improve the corresponding objective evaluation scores. Listening tests also confirm that a loss function that takes STOI into account further improves speech intelligibility, and that the speech produced by the PESQ-optimized model has higher perceived quality.
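
The sketch below (a toy illustration, not the thesis's implementation) shows the shape of the first idea in PyTorch: a fully convolutional network maps the raw noisy waveform of a whole utterance to an enhanced waveform, and the training loss is an intelligibility-style score computed over the utterance rather than a frame-wise MSE. The WaveformFCN layout, the frame_correlation_proxy stand-in, and all hyperparameters are assumptions for illustration; a faithful setup would replace the proxy with a differentiable re-implementation of the actual STOI computation.

```python
import torch
import torch.nn as nn

class WaveformFCN(nn.Module):
    """Fully convolutional model that maps a noisy waveform to an enhanced
    waveform, so a loss defined over the whole utterance (such as STOI)
    can be optimized directly."""
    def __init__(self, channels=32, kernel_size=55):
        super().__init__()
        pad = kernel_size // 2
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size, padding=pad), nn.LeakyReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad), nn.LeakyReLU(),
            nn.Conv1d(channels, 1, kernel_size, padding=pad), nn.Tanh(),
        )

    def forward(self, noisy):               # noisy: (batch, 1, samples)
        return self.net(noisy)

def frame_correlation_proxy(enhanced, clean, frame_len=4096, eps=1e-8):
    """Toy stand-in for a differentiable STOI surrogate: mean per-frame
    correlation between enhanced and clean waveforms. Real STOI is computed
    on one-third-octave bands of the STFT over 30-frame windows."""
    e = enhanced.squeeze(1).unfold(-1, frame_len, frame_len)
    c = clean.squeeze(1).unfold(-1, frame_len, frame_len)
    e = e - e.mean(dim=-1, keepdim=True)
    c = c - c.mean(dim=-1, keepdim=True)
    corr = (e * c).sum(-1) / (e.norm(dim=-1) * c.norm(dim=-1) + eps)
    return corr.mean(dim=-1)                # one score per utterance

model = WaveformFCN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(noisy, clean):
    """One update that maximizes the intelligibility-style score
    (i.e., minimizes its negative) over whole utterances."""
    enhanced = model(noisy)
    loss = -frame_correlation_proxy(enhanced, clean).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```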
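
The following is a minimal sketch of the MetricGAN training loop described above, assuming PyTorch. Quality-Net (the discriminator) is fit to the scores of a black-box metric on the enhancement model's current outputs, and the enhancement model (the generator) is then updated to maximize the score predicted by Quality-Net; the two updates alternate. The toy SNR-based metric_score, the tiny Generator and QualityNet modules, and the hyperparameters are all placeholders rather than the thesis's actual architecture; in MetricGAN proper the black box would be a real PESQ (or other target metric) implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def metric_score(enhanced, clean):
    """Black-box evaluation metric, used only as a training target and never
    back-propagated through. A toy SNR-based score squashed to (0, 1) stands
    in here for PESQ."""
    with torch.no_grad():
        snr = 10 * torch.log10(clean.pow(2).mean(dim=(1, 2)) /
                               ((enhanced - clean).pow(2).mean(dim=(1, 2)) + 1e-8))
        return torch.sigmoid(snr / 10)       # shape: (batch,)

class Generator(nn.Module):
    """Toy waveform enhancement model (the speech enhancement network)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, 31, padding=15), nn.LeakyReLU(),
            nn.Conv1d(16, 1, 31, padding=15))
    def forward(self, x):
        return self.net(x)

class QualityNet(nn.Module):
    """Surrogate evaluator: predicts the metric score of (enhanced, clean)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, 16, 31, padding=15), nn.LeakyReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 1), nn.Sigmoid())
    def forward(self, enhanced, clean):
        return self.net(torch.cat([enhanced, clean], dim=1)).squeeze(-1)

generator, quality_net = Generator(), QualityNet()
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(quality_net.parameters(), lr=1e-4)

def metricgan_step(noisy, clean):
    # Discriminator step: keep Quality-Net an honest surrogate of the metric
    # on the generator's current outputs (and on clean speech itself).
    with torch.no_grad():
        enhanced = generator(noisy)
    d_loss = (F.mse_loss(quality_net(enhanced, clean), metric_score(enhanced, clean)) +
              F.mse_loss(quality_net(clean, clean), metric_score(clean, clean)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: push the *predicted* score toward its maximum (1.0).
    enhanced = generator(noisy)
    pred = quality_net(enhanced, clean)
    g_loss = F.mse_loss(pred, torch.ones_like(pred))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example usage with random tensors standing in for a real batch:
# noisy, clean = torch.randn(4, 1, 16000), torch.randn(4, 1, 16000)
# metricgan_step(noisy, clean)
```

Re-labeling Quality-Net with the true metric scores of the generator's latest outputs at every step is what keeps the surrogate from being fooled as the generator evolves, which is the role the adversarial alternation plays in the abstract above.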

Keywords

speech enhancement, deep learning, loss function, STOI, PESQ


