Title

基於鑑別式自編碼解碼器之錄音回放攻擊偵測系統

Translated Titles

A Replay Spoofing Detection System Based on Discriminative Autoencoders

Authors

吳家隆(Chia-Lung Wu);許祥平(Hsiang-Ping Hsu);呂淯鼎(Yu-Ding Lu);曹昱(Yu Tsao);李鴻欣(Hung-Shin Lee);王新民(Hsin-Min Wang)

Key Words

Speaker Verification ; Speaker Verification Attack ; Replay Attack Detection ; Spoofing Attack ; Discriminative Autoencoder ; Deep Neural Network

PublicationName

中文計算語言學期刊 (International Journal of Computational Linguistics and Chinese Language Processing)

Volume or Term/Year and Month of Publication

Vol. 22, No. 2 (2017/12/01)

Page #

63 - 72

Content Language

Traditional Chinese

Chinese Abstract

In this paper, we propose a neural network model based on a discriminative autoencoder to automatically detect replay attacks against speaker verification systems, i.e., to determine whether the audio received by a speaker verification system is genuine human speech or human speech played back through a recording device. In the field of speaker verification, attacks that use artificially faked voices against a verification system are called spoofing attacks. Since deep neural network models have been widely and successfully applied to speech processing problems, we expect such models to be applicable to this task as well. In the proposed discriminative autoencoder model, the middle layer of the network is used for feature extraction, and a new loss function is proposed so that the middle-layer features are clustered according to the labels of the data; the extracted features therefore carry information that discriminates genuine from spoofed speech. Finally, cosine similarity is used to measure how close the extracted features are to those of genuine human speech, yielding the detection result. We evaluated the proposed system on the database provided by the 2017 Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2017). The system performed well on the development set, achieving a relative improvement of about 42% over the official baseline method.
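
The clustering objective described above, which pulls same-label middle-layer features together and scores an utterance by cosine similarity to genuine speech, can be sketched as follows. This is an illustrative NumPy sketch, not the paper's actual implementation: the function names (`hinge_code_loss`, `score_utterance`) and the margin value are assumptions made for the example.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature (i-code) vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hinge_code_loss(codes, labels, margin=0.5):
    """Hinge-style loss over pairwise cosine similarities of middle-layer codes.

    Same-label pairs are penalized when their similarity falls below `margin`
    (pulling them together); different-label pairs are penalized once their
    similarity exceeds 1 - margin (pushing them apart).
    """
    loss, n_pairs = 0.0, 0
    for i in range(len(codes)):
        for j in range(i + 1, len(codes)):
            s = cosine_sim(codes[i], codes[j])
            if labels[i] == labels[j]:
                loss += max(0.0, margin - s)          # pull same-label codes together
            else:
                loss += max(0.0, s - (1.0 - margin))  # push different labels apart
            n_pairs += 1
    return loss / n_pairs

def score_utterance(code, genuine_centroid):
    """Detection score: cosine similarity of a code to the genuine-speech centroid."""
    return cosine_sim(code, genuine_centroid)
```

In a full system this loss term would be combined with the autoencoder's reconstruction loss and minimized jointly; the sketch only shows the discriminative part of the objective.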

English Abstract

In this paper, we propose a discriminative autoencoder (DcAE) neural network model for the replay spoofing detection task, in which the system has to tell whether a given utterance comes directly from the mouth of a speaker or indirectly through playback. The proposed DcAE model focuses on the midmost (code) layer, where a speech utterance is factorized into distinct components with respect to its true label (genuine or spoofed) and metadata (speaker, playback, and recording devices, etc.). Moreover, the concept of a modified hinge loss is introduced to formulate the cost function of the DcAE model, which ensures that utterances with the same speech type or meta information share similar identity codes (i-codes) and thus a higher similarity score computed from their i-codes. Tested on the development set provided by ASVspoof 2017, our system achieved a much better result, with up to a 42% relative improvement in equal error rate (EER) over the official baseline based on a standard GMM classifier.
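
The EER reported above is the operating point at which the false-rejection rate on genuine trials equals the false-acceptance rate on spoofed trials. A minimal NumPy sketch of how this metric is computed from two score lists is shown below; the function name `compute_eer` and the threshold-sweep strategy are assumptions for illustration, not the official ASVspoof scoring tool.

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Equal error rate: sweep thresholds over all observed scores and
    return the error rate where FRR (genuine rejected) and FAR (spoof
    accepted) are closest to equal."""
    thresholds = np.sort(np.unique(np.concatenate([genuine_scores, spoof_scores])))
    best_eer, best_gap = 1.0, np.inf
    for t in thresholds:
        frr = np.mean(genuine_scores < t)   # genuine trials scored below threshold
        far = np.mean(spoof_scores >= t)    # spoofed trials scored above threshold
        gap = abs(frr - far)
        if gap < best_gap:
            best_gap, best_eer = gap, (frr + far) / 2
    return best_eer
```

For perfectly separable score distributions the EER is 0; a 42% relative improvement means the system's EER is 42% lower than the baseline's EER, not an absolute difference.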

Topic Category

Humanities > Library and Information Science
Basic and Applied Sciences > Information Science
Engineering > Electrical Engineering
Reference
  1. Abe, M.,Nakamura, S.,Shikano, K.,Kuwabara, H.(1990).Voice conversion through vector quantization.Journal of the Acoustical Society of Japan (E),11(2),71-76.
  2. Alam, M. J.,Kenny, P.,Bhattacharya, G.,Stafylakis, T.(2015).Development of CRIM system for the automatic speaker verification spoofing and countermeasures challenge 2015.Proceedings of Interspeech 2015
  3. Alegre, F.,Amehraye, A.,Evans, N.(2013).Spoofing countermeasures to protect automatic speaker verification from voice conversion.Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  4. Bone, D.,Lee, C.-C.,Narayanan, S.(2014).Robust unsupervised arousal rating: A rule-based framework with knowledge-inspired vocal features.IEEE Transactions on Affective Computing,5(2),201-213.
  5. Chen, L.-H.,Ling, Z.-H.,Liu, L.-J.,Dai, L.-R.(2014).Voice conversion using deep neural networks with layer-wise generative training.IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP),22(12),1859-1872.
  6. Chen, Y.-N.,Sun, M.,Rudnicky, A. I.,Gershman, A.(2016).Unsupervised user intent modeling by feature-enriched matrix factorization.Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  7. Chung, Y.-A.,Wu, C.-C.,Shen, C.-H.,Lee, H.-Y.,Lee, L.-S.(2016).Audio Word2Vec: Unsupervised Learning of Audio Segment Representations Using Sequence-to-Sequence Autoencoder.Proceedings of Interspeech 2016
  8. Glorot, X.,Bengio, Y.(2010).Understanding the difficulty of training deep feedforward neural networks.Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics,9,249-256.
  9. Goodfellow, I.,Bengio, Y.,Courville, A.(2016).Deep learning.Cambridge, MA:MIT press.
  10. Huang, K.-Y.,Wu, C.-H.,Su, M.-H.,Fu, H.-C.(2017).Mood detection from daily conversational speech using denoising autoencoder and LSTM.Proceedings of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  11. Kenny, P.,Gupta, V.,Stafylakis, T.,Ouellet, P.,Alam, J.(2014).Deep neural networks for extracting Baum-Welch statistics for speaker recognition.Proceedings of Odyssey 2014
  12. Kingma, D. P.,Ba, J.(2014).Adam: A method for stochastic optimization.Proceedings of the 3rd International Conference for Learning Representation
  13. Kinnunen, T.,Evans, N.,Yamagishi, J.,Lee, K. A.,Sahidullah, Md.,Todisco, M.,Delgado, H.(2017).ASVspoof 2017: automatic speaker verification spoofing and countermeasures challenge evaluation plan.
  14. Krizhevsky, A.,Sutskever, I.,Hinton, G. E.(2012).Imagenet classification with deep convolutional neural networks.Proceedings of Advances in Neural Information Processing Systems
  15. Lee, H.-S.,Lu, Y.-D.,Hsu, C.-C.,Tsao, Y.,Wang, H.-M.,Jeng, S.-K.(2017).Discriminative autoencoders for speaker verification.Proceedings of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  16. Lei, Y.,Scheffer, N.,Ferrer, L.,McLaren, M.(2014).A novel scheme for speaker recognition using a phonetically-aware deep neural network.Proceedings of 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  17. Lin, W.-W.,Mak, M.-W.,Chien, J.-T.(2017).Fast scoring for PLDA with uncertainty propagation via i-vector grouping.Computer Speech & Language,45,503-515.
  18. Liu, X.,Gao, J.,He, X.,Deng, L.,Duh, K.,Wang, Y.-Y.(2015).Representation learning using multi-task deep neural networks for semantic classification and information retrieval.proceedings of HLT-NAACL 2015
  19. Luong, M.-T.,Le, Q. V.,Sutskever, I.,Vinyals, O.,Kaiser, L.(2015).Multi-task sequence to sequence learning.Proceedings of ICLR 2016
  20. Richardson, F.,Reynolds, D.,Dehak, N.(2015).Deep neural network approaches to speaker and language recognition.IEEE Signal Processing Letters,22(10),1671-1675.
  21. Sarkar, A. K.,Do, C.-T.,Le, V.-B.,Barras, C.(2014).Combination of cepstral and phonetically discriminative features for speaker verification.IEEE Signal Processing Letters,21(9),1040-1044.
  22. Todisco, M.,Delgado, H.,Evans, N.(2016).A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients.Proceedings of Odyssey 2016
  23. van der Maaten, L.,Hinton, G.(2008).Visualizing data using t-SNE.Journal of Machine Learning Research,9,2579-2605.
  24. Van Santen, J. P. H.,Sproat, R.,Olive, J.,Hirschberg, J.(2013).Progress in speech synthesis.New York, NY:Springer Science & Business Media.
  25. Variani, E.,Lei, X.,McDermott, E.,Lopez Moreno, I.,Gonzalez-Dominguez, J.(2014).Deep neural networks for small footprint text-dependent speaker verification.Proceedings of 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  26. Villalba, J.,Miguel, A.,Ortega, A.,Lleida, E.(2015).Spoofing detection with DNN and one-class SVM for the ASVspoof 2015 challenge.Proceedings of Interspeech 2015
  27. Wu, Z.,Kinnunen, T.,Evans, N. W. D.,Yamagishi, J.,Hanilci, C.,Sahidullah, M.,Sizov, A.(2015).ASVspoof 2015 - the first automatic speaker verification spoofing and countermeasures challenge.Proceedings of Interspeech 2015
  28. Xiao, X.,Tian, X.,Du, S.,Xu, H.,Siong, C. E.,Li, H.(2015).Spoofing speech detection using high dimensional magnitude and phase features: The NTU approach for ASVspoof 2015 challenge.Proceedings of Interspeech 2015
  29. Yamada, T.,Wang, L.,Kai, A.(2013).Improvement of distant-talking speaker identification using bottleneck features of DNN.Proceedings of Interspeech 2013
  30. Yang, M.-H.,Lee, H.-S.,Lu, Y.-D.,Chen, K.-Y.,Tsao, Y.,Chen, B.,Wang, H.-M.(2017).Discriminative autoencoders for acoustic modeling.Proceedings of Interspeech 2017
  31. Ze, H.,Senior, A.,Schuster, M.(2013).Statistical parametric speech synthesis using deep neural networks.Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)