

NSYSU+CHT Speaker Verification System for Far-Field Speaker Verification Challenge 2020




In this paper, we describe the system that Team NSYSU+CHT implemented for the 2020 Far-field Speaker Verification Challenge (FFSVC 2020). The single systems are embedding-based neural speaker recognition systems. The front-end feature extractor is a neural network architecture called TDResNet, which is built from TDNN and CNN modules and combines the advantages of both. In the pooling layer, we experimented with different methods, such as statistics pooling and GhostVLAD. The back-end is a PLDA scorer; we evaluate PLDA training and adaptation and use it for system fusion. We participated in the text-dependent (Task 1) and text-independent (Task 2) speaker verification tasks on single microphone array data of FFSVC 2020. The best performance we achieved with the proposed methods is minDCF 0.7703 and EER 9.94% on Task 1, and minDCF 0.8762 and EER 10.31% on Task 2.
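For readers unfamiliar with the EER figures reported above, the following is a minimal sketch of how an equal error rate can be estimated from verification scores. It is an illustration only, not the challenge's scoring tool: the function name `equal_error_rate` and the toy scores are hypothetical, and the threshold sweep is a simple approximation that assumes higher scores mean "same speaker".

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption: a trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # Miss rate: fraction of same-speaker trials that are rejected.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False-alarm rate: fraction of different-speaker trials accepted.
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Toy scores (hypothetical, not from the paper).
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
print(f"EER = {equal_error_rate(targets, nontargets):.2%}")
```

At the EER operating point the miss rate and the false-alarm rate are equal, so an EER of 9.94% means that roughly one in ten trials of each kind is misclassified at that threshold.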


Keywords: Speaker Verification, TDNN, CNN, TDResNet, GhostVLAD


