Unsupervised Phoneme Recognition Based on Generative Adversarial Networks

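With the rapid progress of machine learning, supervised speech recognition has reached good accuracy and has long been part of everyday life. Such supervised systems depend on large amounts of manually labeled data for training, and obtaining labeled data often requires substantial resources. By contrast, in the era of Big Data, large amounts of unlabeled data are easy to obtain, which is why unsupervised speech recognition is both attractive and necessary. This thesis therefore starts from phoneme recognition and proposes two unsupervised phoneme recognition architectures, both of which use the recently widely studied Generative Adversarial Network (GAN) to achieve unsupervised learning. Earlier unsupervised speech processing techniques could only discover similar acoustic patterns (speech tokens) in speech signals; they could not identify which words or phonemes those tokens correspond to. The first method proposed in this thesis therefore uses a GAN to learn the mapping between speech tokens and phonemes, and thereby performs speech recognition. However, experiments and other studies show that the greatest difficulty in building an unsupervised speech recognition system lies in the variable length and segmental structure of speech: each recognition unit, such as a character, word, or phoneme, corresponds to a continuous stretch of audio that may be long or short. The second method still learns with a GAN but improves how the speech signal is processed, and it proposes a co-training scheme that also uses a Hidden Markov Model (HMM): alternating training between the GAN and the HMM improves overall recognition accuracy.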

Keywords

Generative adversarial network; unsupervised; speech recognition

Parallel Abstract

With the rapid development of machine learning, supervised speech recognition has achieved good accuracy and has become part of people's daily lives. Supervised speech recognition must rely on a large amount of manually labeled data to train the model, but obtaining labeled data often requires considerable resources. In contrast, in the era of Big Data, a large amount of unlabeled data can be obtained easily, which is why unsupervised speech recognition is both appealing and necessary. This thesis therefore starts with phoneme recognition and proposes two different unsupervised phoneme recognition architectures, both of which use the recently widely studied Generative Adversarial Network (GAN) to achieve unsupervised learning. Previous unsupervised speech processing techniques could only find similar acoustic patterns (speech tokens) in speech signals, and had no way to identify which words or phonemes these patterns correspond to. The first method proposed in this thesis therefore uses a GAN to learn the mapping between speech tokens and phonemes in order to perform speech recognition. However, experiments and other studies show that the biggest difficulty in constructing an unsupervised speech recognition system is handling the variable length and segmental structure of the speech signal: each recognition unit, such as a character, word, or phoneme, corresponds to a continuous sound signal that may be long or short. The second method of this thesis therefore still learns with a GAN, but it improves the way speech signals are processed and proposes a co-training method that also uses a Hidden Markov Model (HMM). Training the GAN and the HMM alternately and cooperatively improves the overall recognition accuracy.

Parallel Keywords

Generative Adversarial Network; Unsupervised; ASR

