
Applications of Bi-LSTM Combined with Two-Feature Factor Based on Deep Learning for Speaker Recognition

Advisor: 陳永隆

Abstract


In today's society, neural network techniques are flourishing; the most popular applications include image recognition and voice recognition. Of these, image recognition is the most mature, while neural network techniques have also made considerable progress in voice recognition. In the past, voice recognition relied mostly on traditional methods such as the Gaussian mixture model (GMM) and the i-vector. Voice recognition covers many representative tasks, such as word classification, emotion recognition, and speaker recognition. In recent years, with the rise of neural networks, voice recognition has achieved greater results than before, most broadly in various classification tasks and speaker recognition.

This thesis proposes nine text-independent speaker recognition methods; text-independent means that recognition is not affected by the lexical content of the audio. All nine methods are based on long short-term memory (LSTM), a variant of the recurrent neural network (RNN). The nine methods are: long short-term memory combined with mel-frequency cepstral coefficients using triplet loss for speaker recognition (LSTM-MFCC-TL); bidirectional long short-term memory combined with mel-frequency cepstral coefficients using triplet loss for speaker recognition (BLSTM-MFCC-TL); convolutional bidirectional long short-term memory combined with mel-frequency cepstral coefficients using triplet loss for speaker recognition (CBLSTM-MFCC-TL); bidirectional long short-term memory combined with mel-frequency cepstral coefficients and neural network features using triplet loss for speaker recognition (BLSTM-MFCCNN-TL); convolutional bidirectional long short-term memory combined with mel-frequency cepstral coefficients and neural network features using triplet loss for speaker recognition (CBLSTM-MFCCNN-TL); bidirectional long short-term memory combined with mel-frequency cepstral coefficients and convolutional neural network features using triplet loss for speaker recognition (BLSTM-MFCCCNN-TL); convolutional bidirectional long short-term memory combined with mel-frequency cepstral coefficients and convolutional neural network features using triplet loss for speaker recognition (CBLSTM-MFCCCNN-TL); convolutional bidirectional long short-term memory with time averaging combined with mel-frequency cepstral coefficients and convolutional neural network features using triplet loss for speaker recognition (CBLSTMT-MFCCCNN-TL); and convolutional bidirectional long short-term memory with time averaging combined with mel-frequency cepstral coefficients and convolutional neural network features using softmax, additive angular margin, and triplet loss for speaker recognition (CBLSTMT-MFCCCNN-SATL).

The first three methods, LSTM-MFCC-TL, BLSTM-MFCC-TL, and CBLSTM-MFCC-TL, take mel-frequency cepstral coefficients (MFCC) as the model input and train different models with triplet loss. The fourth and fifth methods, BLSTM-MFCCNN-TL and CBLSTM-MFCCNN-TL, combine additional neural network (NN) features with MFCC as the model input and train with BLSTM and CBLSTM, respectively. The sixth through eighth methods, BLSTM-MFCCCNN-TL, CBLSTM-MFCCCNN-TL, and CBLSTMT-MFCCCNN-TL, combine additional convolutional neural network (CNN) features with MFCC as the model input and train with BLSTM, CBLSTM, and CBLSTMT, respectively. The ninth method, CBLSTMT-MFCCCNN-SATL, constructs a three-stage training model, where SATL denotes the three loss functions softmax, additive angular margin (AAM), and triplet loss; it still combines additional CNN features with MFCC as the model input and is trained with CBLSTMT.

Past work on speaker recognition computed similarity distances and then set a threshold on the resulting distances to perform recognition. In addition to building the speaker recognition system with neural networks, we add extra features as input, using an NN and a CNN for the additional feature extraction, and we combine three loss functions: softmax, triplet loss, and AAM. Triplet loss and AAM were previously applied to face recognition and have been very successful in metric learning experiments. Their core idea is very similar to the similarity distance computation used in earlier speaker recognition, so we integrate them into our system architecture and apply them to speaker recognition in the audio domain.

Keywords: image recognition, voice recognition, recurrent neural network, long short-term memory, neural network, mel-frequency cepstral coefficients, convolutional neural network, triplet loss, additive angular margin loss, metric learning
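To make the pipeline above more concrete, the following minimal sketch (assuming librosa for MFCC extraction and PyTorch for the model; all layer sizes, the margin, and the function names are illustrative, not the thesis's actual implementation) shows MFCC features fed to a bidirectional LSTM embedding network trained with triplet loss:

# Minimal sketch of an MFCC -> Bi-LSTM -> triplet-loss pipeline; dimensions,
# margin, and names are illustrative assumptions, not values from the thesis.
import librosa
import torch
import torch.nn as nn

def extract_mfcc(path, n_mfcc=40):
    """Load an utterance and return its MFCC sequence as (n_frames, n_mfcc)."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, n_frames)
    return torch.tensor(mfcc.T, dtype=torch.float32)

class BiLSTMSpeakerEmbedder(nn.Module):
    """Bi-LSTM that maps an MFCC sequence to a fixed-size speaker embedding."""
    def __init__(self, n_mfcc=40, hidden=128, emb_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, x):                       # x: (batch, n_frames, n_mfcc)
        out, _ = self.lstm(x)                   # (batch, n_frames, 2 * hidden)
        emb = self.proj(out.mean(dim=1))        # average over time, then project
        return nn.functional.normalize(emb, dim=-1)

model = BiLSTMSpeakerEmbedder()
triplet = nn.TripletMarginLoss(margin=0.3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(anchor, positive, negative):     # anchor/positive: same speaker
    loss = triplet(model(anchor), model(positive), model(negative))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The same skeleton would extend to the CBLSTM variants by adding convolutional layers in front of the LSTM, and to the two-feature methods by concatenating the extra NN or CNN features with the MFCC input.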

Parallel Abstract


In today's society, neural network technologies are flourishing, and the most popular applications are image recognition and voice recognition. Among them, the development of image recognition is the most mature. In voice recognition, neural network technology has also made considerable progress. In the past, voice recognition was mostly based on traditional methods such as the Gaussian mixture model (GMM) and the i-vector. Voice recognition has many representative tasks, such as word classification, emotion recognition, and speaker recognition. In recent years, due to the rise of neural networks, voice recognition has achieved greater results than before, most broadly in various classification tasks and speaker recognition. This paper proposes nine text-independent speaker recognition methods; text-independent means that the recognition process is not affected by the meaning of the words in the audio. All nine methods are based on long short-term memory (LSTM), a variant of the recurrent neural network (RNN). The nine methods are: long and short-term memory combined mel-frequency cepstral coefficients using triplet loss for speaker recognition (LSTM-MFCC-TL), bidirectional long and short-term memory combined mel-frequency cepstral coefficients using triplet loss for speaker recognition (BLSTM-MFCC-TL), convolutional bidirectional long and short-term memory combined mel-frequency cepstral coefficients using triplet loss for speaker recognition (CBLSTM-MFCC-TL), bidirectional long and short-term memory combined mel-frequency cepstral coefficients with neural network features using triplet loss for speaker recognition (BLSTM-MFCCNN-TL), convolutional bidirectional long and short-term memory combined mel-frequency cepstral coefficients with neural network features using triplet loss for speaker recognition (CBLSTM-MFCCNN-TL), bidirectional long and short-term memory combined mel-frequency cepstral coefficients with convolutional neural network features using triplet loss for speaker recognition (BLSTM-MFCCCNN-TL), convolutional bidirectional long and short-term memory combined mel-frequency cepstral coefficients with convolutional neural network features using triplet loss for speaker recognition (CBLSTM-MFCCCNN-TL), convolutional bidirectional long and short-term memory with time average combined mel-frequency cepstral coefficients with convolutional neural network features using triplet loss for speaker recognition (CBLSTMT-MFCCCNN-TL), and convolutional bidirectional long and short-term memory with time average combined mel-frequency cepstral coefficients with convolutional neural network features using softmax, AAM, and triplet loss for speaker recognition (CBLSTMT-MFCCCNN-SATL). The first to third methods, LSTM-MFCC-TL, BLSTM-MFCC-TL, and CBLSTM-MFCC-TL, use mel-frequency cepstral coefficients (MFCC) as the model input and train different models with triplet loss. The fourth method, BLSTM-MFCCNN-TL, and the fifth method, CBLSTM-MFCCNN-TL, use extra neural network (NN) features combined with MFCC as the model input, and the models are trained using BLSTM and CBLSTM, respectively. The sixth to eighth methods, BLSTM-MFCCCNN-TL, CBLSTM-MFCCCNN-TL, and CBLSTMT-MFCCCNN-TL, use extra convolutional neural network (CNN) features combined with MFCC as the model input, and the models are trained using BLSTM, CBLSTM, and CBLSTMT, respectively.
The ninth method, CBLSTMT-MFCCCNN-SATL, constructs a three-stage training model, where SATL denotes the three loss functions softmax, additive angular margin (AAM), and triplet loss; it still uses the additional CNN features combined with MFCC as the model input, and finally CBLSTMT is used for training. In past work on speaker recognition, a similarity distance calculation was used to obtain the result, and a threshold was then set on the resulting distance to achieve speaker recognition. In addition to using a neural network to construct the speaker recognition system, we add extra features as input, using an NN and a CNN for additional feature extraction. We then combine three loss functions: softmax, triplet loss, and AAM. In the past, both triplet loss and AAM were applied to face recognition, and their experimental results in the field of metric learning were very successful. Their core concept is quite similar to the similarity distance calculation used in past speaker recognition, so we combine them with our system architecture and apply them to speaker recognition in the audio domain.
Keywords: image recognition, voice recognition, RNN, LSTM, neural network, MFCC, CNN, triplet loss, AAM, metric learning
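As a hedged illustration of the scoring and losses described above, the sketch below shows a threshold decision on the cosine similarity between two embeddings and an additive-angular-margin (AAM, ArcFace-style) classification head in PyTorch; the threshold, margin, and scale values are illustrative assumptions rather than figures taken from the thesis:

import torch
import torch.nn as nn
import torch.nn.functional as F

def same_speaker(emb_a, emb_b, threshold=0.7):
    """Threshold decision on cosine similarity between two utterance embeddings."""
    score = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
    return score >= threshold  # 0.7 is an illustrative threshold, not a thesis value

class AAMSoftmaxHead(nn.Module):
    """Additive-angular-margin (ArcFace-style) classification head over speakers."""
    def __init__(self, emb_dim, n_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_speakers, emb_dim))
        self.margin, self.scale = margin, scale

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalised embeddings and class weights.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin only to the target class, then rescale for softmax.
        target = F.one_hot(labels, num_classes=cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(self.scale * logits, labels)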

