
A Study of Model Adaptation Scheme for Speech Enhancement under Unseen Noisy Environments

Advisor: 陳祝嵩 (Chu-Song Chen)
Co-advisor: 曹昱 (Yu Tsao)

Abstract


A learning-based speech enhancement (SE) model suffers a mismatch problem in unseen noisy environments and therefore requires model adaptation. Existing adaptation schemes need sufficient data collected for each target noisy environment, but the scarcity and variability of target-environment data make these schemes difficult to apply; moreover, limited memory prevents devices from storing all the data they encounter. In this thesis, we propose two methods, NASTAR and SERIL, which immediately prepare paired data for the target environment to adapt the SE model, and which preserve performance when the adapted model returns to previously seen environments without re-training. Our methods enrich the training data to mitigate the scarcity problem and reduce catastrophic forgetting of previous environments. NASTAR uses a noise extractor and a noise retrieval model to reuse publicly available datasets and simulate data of the target environment: given a single noisy sample from that environment, it effectively exploits existing datasets to re-train the model with a target-conditional resampling scheme. SERIL applies a regularization-based continual learning strategy to SE model adaptation, constraining the sequential training process with the previous model's parameters; a model adapted with SERIL thus maintains acceptable performance in previous environments without storing their data. To evaluate our methods, we use three standard SE metrics: STOI, PESQ, and SI-SDR. Experimental results show that NASTAR and SERIL obtain significantly higher scores when adapting under data scarcity and variability.
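The description of NASTAR above suggests a three-step pipeline: extract a pseudo-noise signal from the single target sample, retrieve acoustically similar noises from an existing noise pool, and remix them with clean speech into simulated paired data. The sketch below illustrates this target-conditional resampling idea; the `embed` and `extract_noise` helpers are hypothetical placeholders (the thesis trains models for both roles), and the spectral embedding and SNR mixing are illustrative choices, not the thesis's definitive implementation.

```python
# Hypothetical sketch of NASTAR-style target-conditional resampling.
# `extract_noise` and `embed` stand in for the thesis's noise extractor
# and noise retrieval model; here they are simple spectral placeholders.
import numpy as np

def embed(signal: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Placeholder noise embedding: average log-magnitude spectrum."""
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::n_fft // 2]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1))
    return np.log1p(spec).mean(axis=0)

def extract_noise(noisy: np.ndarray) -> np.ndarray:
    """Placeholder noise extractor: in NASTAR this is a trained model;
    here we simply treat the noisy sample itself as the pseudo-noise."""
    return noisy

def retrieve(pseudo_noise, noise_pool, k=5):
    """Rank pool noises by cosine similarity to the pseudo-noise embedding."""
    q = embed(pseudo_noise)
    sims = []
    for n in noise_pool:
        e = embed(n)
        sims.append(q @ e / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-8))
    order = np.argsort(sims)[::-1][:k]
    return [noise_pool[i] for i in order]

def simulate_pairs(clean_pool, retrieved, snr_db=5.0):
    """Mix clean speech with retrieved noises at a fixed SNR to
    synthesize (noisy, clean) training pairs for re-training."""
    pairs = []
    for clean, noise in zip(clean_pool, retrieved):
        noise = np.resize(noise, len(clean))  # tile/trim noise to match length
        gain = np.sqrt((clean @ clean) / (10 ** (snr_db / 10) * (noise @ noise) + 1e-8))
        pairs.append((clean + gain * noise, clean))
    return pairs
```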

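SERIL's constraint "with the previous model's parameters" is characteristic of regularization-based continual learning, where a quadratic penalty anchors the adapted parameters to their pre-adaptation values. The sketch below assumes an EWC-like form; `omega` (per-parameter importance), `lam`, and the penalty itself are illustrative assumptions rather than the thesis's exact formulation.

```python
# Minimal sketch of a SERIL-style regularized adaptation loss
# (assumed EWC-like form; the thesis's exact regularizer may differ).
import torch

def snapshot(model):
    """Copy the current parameters before adapting to a new environment."""
    return {n: p.detach().clone() for n, p in model.named_parameters()}

def seril_loss(model, se_loss, prev_params, omega, lam=100.0):
    """SE training loss plus a quadratic penalty that anchors each
    parameter to its pre-adaptation value, weighted by importance."""
    penalty = sum(
        (omega[n] * (p - prev_params[n]) ** 2).sum()
        for n, p in model.named_parameters()
    )
    return se_loss + 0.5 * lam * penalty
```

In practice `omega` would be estimated on the previous environment (for example from squared gradients), so the parameters that matter most to earlier environments are the most strongly anchored, which is what limits catastrophic forgetting without storing old data.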

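Of the three metrics, SI-SDR has a simple closed form that is easy to reproduce; STOI and PESQ require dedicated implementations (e.g. the `pystoi` and `pesq` packages). A reference computation of SI-SDR:

```python
# SI-SDR = 10 * log10(||a*s||^2 / ||a*s - s_hat||^2), a = <s_hat, s> / ||s||^2,
# computed on zero-mean signals; higher is better.
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant SDR in dB between an enhanced signal and its clean reference."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference        # projection of estimate onto reference
    noise = estimate - target         # residual treated as distortion
    return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))
```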