
A Preliminary Study of a Voice Conversion System for ALS Patients

Deep Neural-Network Bandwidth Extension and Denoising Voice Conversion System for ALS Patients

Abstract


Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease for which there is currently no cure. It gradually deprives patients of the ability to speak, until they can no longer communicate by voice and lose part of their sense of identity. ALS patients therefore need suitable voice output communication aids (VOCAs), and in particular personalized synthetic voices, i.e., their own voices as they sounded before disease onset, so that they can preserve their sense of self. However, most patients in the late stage of ALS, who can no longer speak, did not properly preserve recordings of their own voice in advance; at best, only about 20 minutes of low-quality speech can be found, for example recordings that are lossy-compressed (MP3), narrowband (8 kHz), or contaminated by strong background noise, which makes it impossible to build a personalized speech synthesis system for them. To address these difficulties, this paper combines a general-purpose speech synthesis system with a voice conversion algorithm, adds a speech denoising front end, and appends a speech super-resolution back end, so that the system can tolerate recordings with background noise and restore the high-frequency components (up to 16 kHz) of narrowband synthetic speech. The goal is to reconstruct, from low-quality recordings, high-quality synthetic speech that is as close as possible to the ALS patient's original voice. Speech denoising is performed with WaveNet, and speech super-resolution with a U-Net architecture. A 20-hour high-quality (studio-recorded) Education Radio (教育電台) corpus was first used to simulate paired noisy/clean and narrowband/wideband utterances, which were used to train the WaveNet and U-Net models, respectively; the trained models were then applied to the patients' low-quality recordings. Experimental results show that the trained WaveNet and U-Net models can restore noisy or narrowband Education Radio recordings to a considerable degree, and can be used to reconstruct high-quality personalized synthetic voices for ALS patients.
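
The abstract describes training the WaveNet denoiser and the U-Net super-resolution model on simulated pairs of degraded and clean speech, but gives no implementation details. The following is a minimal Python sketch, under assumed parameters (librosa/soundfile, a 5 dB SNR, and hypothetical file names), of how such paired training utterances could be generated from clean studio recordings; it is not the paper's actual pipeline.

```python
# Minimal sketch (not from the paper): simulating paired training data
# from clean, wideband studio recordings. File names, the 8 kHz target
# rate, and the 5 dB SNR are illustrative assumptions.
import numpy as np
import librosa
import soundfile as sf

SR_HIGH, SR_LOW = 16000, 8000   # wideband / narrowband sampling rates

def make_narrowband_pair(wav_path):
    """Wideband target + simulated narrowband input for the U-Net."""
    clean, _ = librosa.load(wav_path, sr=SR_HIGH)
    narrow = librosa.resample(clean, orig_sr=SR_HIGH, target_sr=SR_LOW)
    # Upsample back to 16 kHz so input and target have the same length,
    # while the high-frequency band (>4 kHz) stays missing.
    degraded = librosa.resample(narrow, orig_sr=SR_LOW, target_sr=SR_HIGH)
    return np.resize(degraded, clean.shape), clean

def make_noisy_pair(wav_path, noise_path, snr_db=5.0):
    """Clean target + simulated noisy input for the WaveNet denoiser."""
    clean, _ = librosa.load(wav_path, sr=SR_HIGH)
    noise, _ = librosa.load(noise_path, sr=SR_HIGH)
    noise = np.resize(noise, clean.shape)        # tile/truncate to match length
    # Scale the noise so the mixture has the requested signal-to-noise ratio.
    gain = np.sqrt(np.mean(clean**2) /
                   (np.mean(noise**2) * 10 ** (snr_db / 10) + 1e-12))
    return clean + gain * noise, clean

if __name__ == "__main__":
    noisy, clean = make_noisy_pair("studio_0001.wav", "babble.wav")
    sf.write("noisy_0001.wav", noisy, SR_HIGH)
```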

Keywords

Neural network, ALS, WaveNet

Parallel Abstract


ALS (amyotrophic lateral sclerosis) is a neurodegenerative disease. There is no cure for this disease, and it eventually causes patients to lose the ability to communicate with others in their own voice. Therefore, personalized voice output communication aids (VOCAs) are essential for ALS patients to improve their daily lives. However, most ALS patients have not properly preserved their personal recordings in the early stage of the disease. Usually, only a few low-quality speech recordings, such as lossy-compressed, narrowband (8 kHz), or noisy speech, are available for developing their personalized VOCAs. In order to reconstruct high-quality synthetic speech close to the original voices of ALS patients, this paper proposes a voice conversion system with speech denoising and bandwidth extension capabilities. A front-end WaveNet-based speech enhancement network and a back-end U-Net-based super-resolution network were constructed and integrated with the backbone voice conversion system. The experimental results show that the WaveNet and U-Net models can restore noisy and narrowband speech, respectively, and are therefore promising for reconstructing high-quality personalized VOCAs for ALS patients.
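
The abstracts name a U-Net architecture for speech super-resolution but do not specify its layers. Below is a minimal, hypothetical 1-D U-Net sketch in PyTorch; the channel widths, kernel sizes, and depth are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch (assumed configuration, not the paper's): a tiny 1-D U-Net
# that maps an upsampled narrowband waveform to a 16 kHz-bandwidth one.
import torch
import torch.nn as nn

class TinyUNet1d(nn.Module):
    def __init__(self, ch=(16, 32, 64)):
        super().__init__()
        # Encoder: strided convolutions halve the time resolution.
        self.enc1 = nn.Sequential(nn.Conv1d(1, ch[0], 9, stride=2, padding=4), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv1d(ch[0], ch[1], 9, stride=2, padding=4), nn.ReLU())
        self.bottleneck = nn.Sequential(nn.Conv1d(ch[1], ch[2], 9, padding=4), nn.ReLU())
        # Decoder: transposed convolutions restore the resolution;
        # skip connections concatenate the matching encoder features.
        self.dec2 = nn.Sequential(
            nn.ConvTranspose1d(ch[2] + ch[1], ch[1], 8, stride=2, padding=3), nn.ReLU())
        self.dec1 = nn.Sequential(
            nn.ConvTranspose1d(ch[1] + ch[0], ch[0], 8, stride=2, padding=3), nn.ReLU())
        self.out = nn.Conv1d(ch[0], 1, 9, padding=4)

    def forward(self, x):                      # x: (batch, 1, samples)
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        b = self.bottleneck(e2)
        d2 = self.dec2(torch.cat([b, e2], dim=1))
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        # Residual output: the network only predicts the missing high band.
        return x + self.out(d1)

if __name__ == "__main__":
    net = TinyUNet1d()
    wav = torch.randn(1, 1, 16000)             # 1 s of 16 kHz audio
    print(net(wav).shape)                      # torch.Size([1, 1, 16000])
```

The residual connection is a common design choice for bandwidth extension, since the narrowband content is already correct and only the high band needs to be added; whether the paper's model uses it is not stated in the abstract.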

Parallel Keywords

Neural network, ALS, WaveNet
