
半監督式學習之端到端語音辨識及辨識合成串連框架

Semi-supervised End-to-end Speech Recognition and a Cascading Recognition-Synthesis Framework

Advisor: 李琳山 (Lin-shan Lee)

Abstract


With advances in machine learning and the related hardware, end-to-end speech recognition systems based on deep learning have gradually begun to replace traditional multi-module systems. While a relatively simple model architecture and purely data-driven training are the strengths of end-to-end speech recognition, they also create its greatest weakness: an excessive reliance on supervised learning over large amounts of manually transcribed speech-text paired data. In recent years both academia and industry have recognized this problem and begun studying how to reduce the dependence of end-to-end speech recognition on manually labeled data. In view of this, this thesis proposes three semi-supervised end-to-end speech recognition methods that exploit different resources to mitigate the above shortcoming.

The first method uses word embeddings trained on text-only data (with no paired speech) as guidance, proposing a regularization for sequence-to-sequence speech recognition models that takes the semantic information carried by the word embeddings as an additional training target. Furthermore, the word embeddings are incorporated into the text decoding process, so that the recognizer's output is influenced by the relative structure of the embedding space, in order to produce recognition results that better fit the context. Experiments show that word embeddings let the recognition system benefit from text-only data simply and effectively, at very low computational cost, while remaining compatible with existing text-only techniques.

The second method introduces adversarial learning between the speech recognition model and the language model, moving the language-model correction of recognizer outputs forward into the training stage, so that the recognition model learns linguistic knowledge directly from text-only data. Experiments show that this method lets end-to-end speech recognition benefit from large amounts of text-only data and significantly improves the recognition rate when transcribed speech is limited.

The third method proposes a cascading recognition-synthesis framework that learns speech representations from large amounts of untranscribed speech and uses a small amount of labeled data to establish a one-to-one correspondence between the representations and phonemes. The mapped representations substantially reduce the recognizer's dependence on supervised training. Experiments confirm that with no more than 20 minutes of labeled data, a recognition model built on these speech representations can effectively reduce the recognition error rate.

Parallel Abstract


With the rapid progress of machine learning and related hardware technologies, end-to-end speech recognition systems based on deep learning have gradually begun to replace traditional multi-module systems. Although end-to-end speech recognition has clear advantages (e.g., a relatively simple model architecture and purely data-driven training), it also has a major downside: excessive reliance on supervised learning over large amounts of manually labeled speech-text paired data. In recent years, both the research community and industry have recognized this problem and have begun studying how to reduce the reliance of end-to-end speech recognition on artificially labeled data. In view of this, this thesis proposes three semi-supervised end-to-end speech recognition methods that use different resources to address the above shortcoming.

The first method uses word embeddings trained on plain text with no corresponding speech as a guide, and proposes a regularization of the sequence-to-sequence speech recognition model that takes the semantic information carried by the word embeddings as an additional training target. In addition, the word embeddings are incorporated into the text decoding process, so that the output of the speech recognition model is influenced by the relative structure of the word embedding space, in order to produce recognition results that better match the context. Experimental results show that word embeddings can simply yet effectively make the speech recognition system benefit from pure text data, at very low computational cost, while remaining compatible with existing text-only techniques.

The second method introduces adversarial training between the speech recognition model and the language model, moving the language-model correction of recognizer outputs forward into the training stage, so that the speech recognition model can learn linguistic knowledge directly from pure text data. Experimental results show that this method allows end-to-end speech recognition to benefit from a large amount of pure text data and significantly improves the recognition rate when labeled speech-text data is limited.

The third method proposes a cascading recognition-synthesis framework, which learns speech representations from a large amount of speech data that has not been manually labeled, and uses a small amount of labeled data to establish a one-to-one correspondence between the speech representations and phonemes. The mapped representations effectively reduce the importance of supervised training for speech recognition. Experiments show that with no more than 20 minutes of labeled data, a speech recognition model that uses these representations can effectively reduce the recognition error rate.
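To make the first method concrete, below is a minimal PyTorch sketch of embedding-space regularization for a sequence-to-sequence recognizer. The class name, the linear projection, the 0.1 loss weight, and the assumption that padding uses index 0 are illustrative choices, not the thesis's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingRegularizedLoss(nn.Module):
    """Cross-entropy plus a term pulling a projection of the decoder
    state toward the pretrained (frozen) embedding of the target word."""
    def __init__(self, pretrained_emb, dec_dim, weight=0.1, pad_id=0):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(pretrained_emb, freeze=True)
        self.proj = nn.Linear(dec_dim, pretrained_emb.size(1))
        self.weight = weight
        self.pad_id = pad_id

    def forward(self, logits, dec_states, targets):
        # logits: (B, T, V), dec_states: (B, T, D), targets: (B, T)
        ce = F.cross_entropy(logits.transpose(1, 2), targets,
                             ignore_index=self.pad_id)
        cos = F.cosine_similarity(self.proj(dec_states),
                                  self.emb(targets), dim=-1)  # (B, T)
        mask = targets != self.pad_id
        reg = (1.0 - cos)[mask].mean()  # pull states toward target embeddings
        return ce + self.weight * reg
```

At decoding time, the same frozen embedding table could analogously be used to bias output scores toward words whose embeddings lie close to the projected decoder state.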
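The second method can be pictured as a GAN-style alternating update in which a critic trained on real text scores the recognizer's output sequences. The sketch below assumes hypothetical asr_model.output_distribution, asr_model.supervised_loss, and asr_model.vocab_size interfaces and a critic returning one logit per sequence; it illustrates only the training-loop structure, not the thesis's exact objective.

```python
import torch
import torch.nn.functional as F

def adversarial_step(asr_model, critic, speech_batch, text_batch,
                     opt_asr, opt_critic, adv_weight=0.5):
    """One alternating update: the critic learns to separate real text
    from ASR hypotheses; the recognizer is then pushed to fool the
    critic while still fitting its supervised targets."""
    real = F.one_hot(text_batch, asr_model.vocab_size).float()  # (B, T, V)
    with torch.no_grad():
        fake = asr_model.output_distribution(speech_batch)      # (B, T, V)
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(fake.size(0), 1)

    # Critic update: real text scores high, recognizer outputs score low.
    loss_d = (F.binary_cross_entropy_with_logits(critic(real), ones)
              + F.binary_cross_entropy_with_logits(critic(fake), zeros))
    opt_critic.zero_grad(); loss_d.backward(); opt_critic.step()

    # Recognizer update: supervised loss plus a "look like real text" term,
    # so linguistic knowledge flows from text-only data during training.
    fake = asr_model.output_distribution(speech_batch)
    loss_g = (asr_model.supervised_loss(speech_batch)
              + adv_weight
              * F.binary_cross_entropy_with_logits(critic(fake), ones))
    opt_asr.zero_grad(); loss_g.backward(); opt_asr.step()
```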
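For the third method, the key semi-supervised step is aligning unsupervised speech representations with phonemes using only minutes of labeled data. One simple, assumed instance is a majority-vote mapping from discrete representation units to phonemes, sketched below with toy data; the thesis's actual correspondence procedure may differ.

```python
from collections import Counter, defaultdict

def map_units_to_phonemes(unit_seqs, phone_seqs):
    """Assign each representation unit the phoneme it most often
    co-occurs with in the small frame-aligned labeled subset."""
    votes = defaultdict(Counter)
    for units, phones in zip(unit_seqs, phone_seqs):
        for u, p in zip(units, phones):
            votes[u][p] += 1
    return {u: cnt.most_common(1)[0][0] for u, cnt in votes.items()}

# Toy illustration: units come from the unsupervised model,
# phoneme labels from a few transcribed utterances.
labeled_units = [[3, 3, 7, 7], [7, 2, 2]]
labeled_phones = [["a", "a", "t", "t"], ["t", "s", "s"]]
mapping = map_units_to_phonemes(labeled_units, labeled_phones)

# Decode a new utterance by reading phonemes off its unit sequence.
new_units = [3, 7, 2, 9]
hypothesis = [mapping.get(u, "<unk>") for u in new_units]
print(hypothesis)  # ['a', 't', 's', '<unk>']
```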
