

Cross-domain Speech Enhancement Model based on Complex Neural Network and Generative Adversarial Network

Advisor: 雷欽隆 (Chin-Laung Lei)

Abstract


Early speech enhancement models have several shortcomings: they perform poorly on heavily noisy or non-stationary signals, and they cannot accurately remove higher-frequency noise. Deep learning models have therefore been proposed to address these problems. Most deep learning models take as input spectrograms converted from the noisy audio, while a smaller number operate directly on the raw waveform. Spectrograms help the model learn the information carried in the signal more easily, but ordinary deep learning models cannot handle the imaginary components produced by the short-time Fourier transform (STFT), so many methods process only the real part or the magnitude of the signal. The emergence of complex-valued neural networks solved this problem, so our method also adopts a complex-valued neural network architecture and incorporates a U-Net structure.

In addition, the distance between the enhanced signal and the clean signal does not accurately reflect perceptual quality. We therefore take the signal quality score as the training objective and adopt the MetricGAN technique: by training a separate discriminator model, our model learns to produce higher-quality audio.

Our method has several advantages. First, the complex-valued architecture lets the model see the complete spectrogram information. Second, we use both spectrogram and waveform information, so the model obtains more content about the signal. Third, by taking the audio quality score as the training target, the model produces audio that achieves higher quality scores. Our experiments use the VoiceBank and DEMAND datasets as the training and test sets; the training set contains 28 speakers and a total of 40 different noise conditions. We evaluate with several standard metric scores, and our model achieves better results than other methods on these scores.
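To make the complex-valued idea concrete, below is a minimal sketch (not the thesis's actual implementation) of a complex convolution layer built from two real-valued convolutions, following (a+bi)(w+vi) = (aw − bv) + i(av + bw). The `ComplexConv2d` name and the STFT parameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex convolution built from two real-valued convolutions (sketch).

    For complex input x = x_re + i*x_im and complex kernel W = W_re + i*W_im:
        W * x = (W_re*x_re - W_im*x_im) + i*(W_re*x_im + W_im*x_re)
    """
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv_re = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.conv_im = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)

    def forward(self, x_re, x_im):
        out_re = self.conv_re(x_re) - self.conv_im(x_im)
        out_im = self.conv_re(x_im) + self.conv_im(x_re)
        return out_re, out_im

# Example: a complex spectrogram from the STFT of a 1-second, 16 kHz signal.
wave = torch.randn(1, 16000)
spec = torch.stft(wave, n_fft=512, hop_length=128,
                  window=torch.hann_window(512), return_complex=True)
x_re = spec.real.unsqueeze(1)   # (batch, 1, freq, time)
x_im = spec.imag.unsqueeze(1)
layer = ComplexConv2d(1, 16, kernel_size=3, padding=1)
y_re, y_im = layer(x_re, x_im)
```

In a complex U-Net, such layers replace the ordinary convolutions in the encoder and decoder, so the phase information in the imaginary part is carried through the whole network rather than being discarded.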
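Likewise, one hedged sketch of how spectrogram and waveform information could be combined in a cross-domain training loss: the complex spectrogram is inverted back to a waveform with an inverse STFT, and losses from both domains are mixed. The weight `alpha` and the particular loss terms are assumptions for illustration, not the thesis's exact objective.

```python
import torch
import torch.nn.functional as F

def cross_domain_loss(est_spec, clean_spec, n_fft=512, hop=128, alpha=0.5):
    """Combine a spectrogram-domain loss with a waveform-domain loss (sketch).

    est_spec, clean_spec: complex STFT tensors of shape (batch, freq, time).
    """
    window = torch.hann_window(n_fft, device=est_spec.device)
    # Waveform-domain term: L1 distance after inverting the STFT.
    est_wav = torch.istft(est_spec, n_fft=n_fft, hop_length=hop, window=window)
    clean_wav = torch.istft(clean_spec, n_fft=n_fft, hop_length=hop, window=window)
    wav_loss = F.l1_loss(est_wav, clean_wav)
    # Spectrogram-domain term: distance between the complex spectrograms,
    # viewed as real tensors with a trailing (real, imag) dimension.
    spec_loss = F.mse_loss(torch.view_as_real(est_spec),
                           torch.view_as_real(clean_spec))
    return alpha * wav_loss + (1 - alpha) * spec_loss
```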

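The MetricGAN idea mentioned in the abstract can be sketched as follows: a discriminator is trained to regress the (normalized) quality score of enhanced audio, and the enhancer is then trained to push the discriminator's prediction toward the maximum score of 1.0. The `enhancer`, `discriminator`, and `quality_score` callables here are placeholders rather than the thesis's actual code; in practice `quality_score` would wrap a metric such as PESQ computed on waveforms.

```python
import torch
import torch.nn.functional as F

def metricgan_step(enhancer, discriminator, opt_g, opt_d,
                   noisy_spec, clean_spec, quality_score):
    """One MetricGAN-style training step (sketch).

    quality_score(enhanced, clean) -> tensor of scores in [0, 1],
    e.g. PESQ normalized to [0, 1].
    """
    # Discriminator update: regress the true metric score of the
    # enhanced audio, and pin the clean/clean pair to the top score 1.0.
    with torch.no_grad():
        enhanced = enhancer(noisy_spec)
        score = quality_score(enhanced, clean_spec)
    d_loss = (F.mse_loss(discriminator(enhanced, clean_spec), score) +
              F.mse_loss(discriminator(clean_spec, clean_spec),
                         torch.ones_like(score)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: make the discriminator predict the best score
    # for the enhanced output (the discriminator is not stepped here).
    enhanced = enhancer(noisy_spec)
    g_loss = F.mse_loss(discriminator(enhanced, clean_spec),
                        torch.ones_like(score))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

Because the discriminator is differentiable, it acts as a learned surrogate for the black-box metric, letting the enhancer optimize a quality score that is otherwise non-differentiable.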

References


[1] D. Bagchi, P. Plantinga, A. Stiff, and E. Fosler-Lussier. Spectral feature mapping with mimic loss for robust speech recognition. CoRR, abs/1803.09816, 2018.
[2] M. Berouti, R. Schwartz, and J. Makhoul. Enhancement of speech corrupted by acoustic noise. In ICASSP '79. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 4, pages 208–211, 1979.
[3] H. S. Choi, J.-H. Kim, J. Huh, A. Kim, J.-W. Ha, and K. Lee. Phase-aware speech enhancement with deep complex U-Net, 2019.
[4] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 708–712, 2015.
[5] S. Fu, C. Liao, Y. Tsao, and S. Lin. MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement. CoRR, abs/1905.04874, 2019.
