
A Study and Implementation on Speaker Verification Based on Deep Learning

Advisor: 張智星

Abstract


This thesis studies and implements several text-independent speaker verification systems based on deep learning. Acoustic features obtained from speech front-end processing (MFCC) serve as the input, and neural networks are trained with speaker identification or speaker clustering as the objective. After training, part of the network is used as a feature extractor that derives a speaker feature from a given utterance. For each enrolled speaker, the trained network extracts a speaker feature from each of that speaker's utterances, and the average of the resulting feature vectors serves as the speaker model. In the verification phase, the same network extracts the corresponding speaker feature from the given test utterance, and the cosine similarity between this feature and the claimed speaker's model is computed. If the similarity exceeds a threshold, verification succeeds; otherwise it fails. Several neural network architectures were explored, and the systems were trained and evaluated on the 8conv portion of the NIST SRE2010 corpus. Experimental results show that the proposed systems hold a clear advantage over the i-vector system when the test utterance is short. When enrolling with full-length recordings and verifying with 2-second utterances, the best system in this thesis achieves an EER of only 9.75%, nearly half that of the i-vector system. For speaker identification, the best system reaches an accuracy above 85%.
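The enrollment and scoring steps described above (averaging per-utterance speaker features into a speaker model, then thresholding a cosine similarity) can be sketched as follows. This is a minimal illustration that assumes embeddings are plain lists of floats already produced by the trained feature extractor; the function names are illustrative, not from the thesis:

```python
import math

def enroll(utterance_embeddings):
    # Speaker model = element-wise mean of the per-utterance embeddings.
    dim = len(utterance_embeddings[0])
    n = len(utterance_embeddings)
    return [sum(e[i] for e in utterance_embeddings) / n for i in range(dim)]

def cosine_similarity(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(test_embedding, speaker_model, threshold):
    # Accept the claimed identity only if the similarity clears the threshold.
    return cosine_similarity(test_embedding, speaker_model) >= threshold
```

Because cosine similarity ignores vector magnitude, the averaged model and the test feature need not be length-normalized before scoring, although normalization is common in practice.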

Parallel Abstract (English)


This thesis studies and implements several text-independent speaker verification systems based on deep learning. Acoustic features obtained from speech front-end processing (such as MFCC) are used as the input, and neural networks are trained with the aim of speaker identification or speaker clustering. After training, part of the neural network is used as a feature extractor to extract the speaker feature from a given utterance. For each enrollment speaker, we use the trained neural network to extract speaker features from each of his/her utterances, and use the averaged feature vector as his/her speaker model. In the verification phase, we use the same neural network to extract the speaker feature from the given test utterance, and then calculate the cosine similarity between it and the speaker model to be verified. If the similarity exceeds a predefined threshold, the test speaker is accepted by the system; otherwise he/she is rejected. In this thesis, we tried various designs of the neural network architecture and conducted experiments on the 8conv part of the NIST SRE2010 corpus. The experimental results show that the systems presented in this thesis hold a clear advantage over the i-vector system when the test utterance is short. Specifically, when enrolling with full-length utterances and verifying with utterances of only 2 seconds, the best system in this thesis achieves an EER of only 9.75%, almost half that of the i-vector system. In terms of speaker identification, the best system in this thesis reaches an accuracy of more than 85%.
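The EER (equal error rate) reported above is the operating point at which the false acceptance rate (impostor trials accepted) equals the false rejection rate (target trials rejected). A minimal sketch of computing it from trial scores by sweeping candidate thresholds; all names are illustrative and do not reflect the thesis's actual evaluation tooling:

```python
def error_rates(target_scores, impostor_scores, threshold):
    # False rejection rate: fraction of target trials scored below the threshold.
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    # False acceptance rate: fraction of impostor trials scored at or above it.
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return far, frr

def equal_error_rate(target_scores, impostor_scores):
    # Sweep every observed score as a candidate threshold; the EER is taken
    # where FAR and FRR are closest, averaging the two at that point.
    candidates = sorted(set(target_scores) | set(impostor_scores))
    best = min(
        candidates,
        key=lambda t: abs(error_rates(target_scores, impostor_scores, t)[0]
                          - error_rates(target_scores, impostor_scores, t)[1]),
    )
    far, frr = error_rates(target_scores, impostor_scores, best)
    return (far + frr) / 2
```

Sweeping only the observed scores is sufficient because FAR and FRR are step functions that change value only at those points; production toolkits typically interpolate the DET curve instead.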

