
Low-resourced End-to-end Speech Synthesis and Recognition by Transfer Learning with Cross-lingual Sound Unit Mapping

Advisor: 李琳山 (Lin-shan Lee)

Abstract


In recent years technology has advanced rapidly, and advanced techniques have entered daily life through mobile phones and wearable devices. One example is the voice assistant, which completes tasks through simple conversation using text-to-speech (TTS) and speech recognition. Both technologies have achieved great success with the recent development of neural networks, repeatedly surpassing previous systems in synthesis quality, fluency, and recognition accuracy. However, training these models relies on large amounts of annotated data, which demands substantial money and labor and is unaffordable for many low-resource languages. This thesis therefore investigates how to use transfer learning more effectively to help low-resource languages build speech synthesis and speech recognition models; that is, how to reuse models trained on the abundant annotated data of a high-resource language for a low-resource language.

In cross-lingual transfer learning, models often suffer from a mismatch at the input or output. For example, the sound units at the input of a speech synthesis model and at the output of a speech recognition model differ across languages, which degrades the effect of transfer learning. If the correspondence between the sound units of different languages were known, it could be used to resolve this mismatch and improve transfer. This thesis therefore proposes an automatic cross-lingual sound unit mapping method and applies it to transfer learning for the recently popular neural speech synthesis model Tacotron and for a connectionist temporal classification (CTC) speech recognition model, with experiments verifying whether the discovered mappings help transfer learning.

Several evaluation measures are used, including subjective naturalness ratings of the synthesized speech by human listeners and the objective character error rate (CER) of the recognition results. The results show that the proposed automatic cross-lingual sound unit mapping improves transfer learning on both tasks, producing more natural speech and more accurate recognition with only a small amount of annotated data.
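The abstract does not spell out how the sound unit mapping is obtained, so the sketch below is only a minimal illustration of the general idea under an assumed setup: each target-language unit is mapped to its most similar source-language unit in a shared embedding space, and the resulting mapping is used to warm-start the low-resource model's unit embedding table before fine-tuning. The function and variable names (map_sound_units, src_table, tgt_feats) are hypothetical, not from the thesis.

    # Hypothetical sketch: cross-lingual sound unit mapping by nearest-neighbor
    # search in an embedding space, then warm-starting the low-resource model.
    import numpy as np

    def map_sound_units(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
        """For each target-language unit, return the index of the most similar
        source-language unit under cosine similarity.
        src_emb: (num_src_units, dim) embedding table from the high-resource model
        tgt_emb: (num_tgt_units, dim) embeddings for the target-language units
        """
        src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
        tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
        sim = tgt @ src.T                # (num_tgt, num_src) cosine similarities
        return sim.argmax(axis=1)        # nearest source unit for each target unit

    # Toy usage: initialize the low-resource embedding table from the mapping.
    rng = np.random.default_rng(0)
    src_table = rng.normal(size=(60, 256))   # e.g., 60 source-language phonemes
    tgt_feats = rng.normal(size=(40, 256))   # e.g., 40 target-language phonemes
    mapping = map_sound_units(src_table, tgt_feats)
    tgt_table_init = src_table[mapping]      # copied rows serve as a warm start

In a Tacotron input layer or a CTC output layer, such a warm start would replace random initialization of the new language's unit parameters, which is one plausible way the input/output mismatch described above could be bridged.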

Keywords

Speech synthesis, Speech recognition

Abstract (English)


In recent years, advanced technologies have entered our lives through mobile phones and wearable devices. One example is the voice assistant, which helps people complete tasks through simple conversations using text-to-speech and speech recognition technologies. These two technologies have achieved great success with the development of neural networks, surpassing previous systems in synthetic sound quality, fluency, and recognition accuracy. However, training these models relies on a large amount of annotated data, which requires substantial money and manpower and is unaffordable for many low-resourced languages. Therefore, this thesis explores how to use transfer learning more effectively to help languages with fewer resources build speech synthesis and speech recognition models; that is, we try to reuse models trained on the abundant annotated data of high-resourced languages for low-resourced languages.

In cross-lingual transfer learning, models often encounter a mismatch at the input or output. For example, the sound units at the input of a speech synthesis model and at the output of a speech recognition model differ across languages, which makes transfer learning less effective. If we know the mapping between these sound units across languages, we can use it to resolve the mismatch and improve transfer learning. Therefore, this thesis proposes an automatic cross-lingual sound unit mapping method and applies it to transfer learning for the recently popular neural speech synthesis model Tacotron and for a connectionist temporal classification (CTC) speech recognition model, and uses experiments to verify whether the mappings are helpful for transfer learning.

This thesis uses a variety of measurements, including subjective naturalness assessments of the speech synthesis model by human subjects and the objective character error rate (CER) of the speech recognition results. The results show that the proposed cross-lingual sound unit mapping improves transfer learning on both tasks, producing better speech and more accurate recognition with only a small amount of labeled data.
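As a concrete reference for the objective metric mentioned above, character error rate is the character-level edit distance between the recognizer's hypothesis and the reference transcript, divided by the reference length. The function below is a generic sketch of that standard computation, not code taken from the thesis.

    # Generic CER: Levenshtein distance over characters / reference length.
    def cer(reference: str, hypothesis: str) -> float:
        ref, hyp = list(reference), list(hypothesis)
        dp = list(range(len(hyp) + 1))          # distances for the previous row
        for i, r in enumerate(ref, start=1):
            prev, dp[0] = dp[0], i
            for j, h in enumerate(hyp, start=1):
                cur = min(dp[j] + 1,            # deletion
                          dp[j - 1] + 1,        # insertion
                          prev + (r != h))      # substitution (free if equal)
                prev, dp[j] = dp[j], cur
        return dp[-1] / max(len(ref), 1)

    print(f"CER: {cer('speech recognition', 'speech recognitoin'):.3f}")  # 2 edits / 18 chars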

Keywords (English)

Speech synthesis, Speech recognition

