
採用知識蒸餾與模型壓縮之低功耗可變關鍵字的喚醒詞辨識系統

Small-footprint Open-vocabulary Keyword Spotting Using Knowledge Distillation and Model Quantization

Advisor: 張智星

Abstract


With the widespread adoption of smart devices, voice wake-up technology has become increasingly important. Voice wake-up is mainly realized through keyword spotting, whose goal is to determine whether a specific keyword occurs in continuous speech. Thanks to the rapid development of deep neural networks, DNN-based keyword spotting has achieved substantial gains in recognition accuracy. However, traditional DNN-based keyword spotting systems require a large amount of speech containing the target keyword as training data, so they can only recognize a fixed keyword and cannot easily switch to a new one after training; changing the keyword requires collecting a new corpus for the target keyword and retraining the model. This thesis focuses on implementing an open-vocabulary keyword spotting system: an acoustic model is trained with connectionist temporal classification (CTC), a confidence score is computed from the model's output, and the decision of whether to wake up the system is made based on that score. For convenience of use, the keyword spotting system also needs to be deployed on edge devices. To this end, the thesis adopts knowledge distillation and model quantization, which greatly improve the system's inference speed without degrading recognition accuracy. In experiments on Mobvoi Hotwords, compared with the baseline method, the proposed approach improves running speed by 40% relative while reducing the false rejection rate at one false alarm per hour by 15.54% relative.
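The abstract only states that a confidence score is computed from the CTC output and compared against a wake-up threshold. The snippet below is a minimal sketch of one common greedy scoring heuristic consistent with that description, assuming per-frame CTC posteriors are available as a NumPy array; the function name, the geometric-mean formula, and the 0.5 threshold are illustrative assumptions, not necessarily the thesis's exact scoring rule.

import numpy as np

def keyword_confidence(posteriors, keyword_tokens):
    """Score a keyword given per-frame CTC posteriors.

    posteriors: array of shape (T, V), softmax output of the acoustic model
    keyword_tokens: token IDs of the keyword, e.g. looked up in a lexicon

    Returns the geometric mean of the best posterior found for each keyword
    token, searching the frames in temporal order (greedy alignment).
    """
    t = 0
    token_scores = []
    for tok in keyword_tokens:
        if t >= posteriors.shape[0]:
            return 0.0  # ran out of frames before matching every token
        best = t + int(np.argmax(posteriors[t:, tok]))
        token_scores.append(posteriors[best, tok])
        t = best + 1
    return float(np.exp(np.mean(np.log(np.maximum(token_scores, 1e-10)))))

# Wake the system when the score exceeds a threshold tuned on held-out data,
# e.g. keyword_confidence(post, tokens) > 0.5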

Parallel Abstract


With the widespread adoption of smart devices, wake-up word detection has become more and more important. Wake-up word detection is based on keyword spotting (KWS), whose goal is to identify whether a specific keyword occurs in continuous speech. Traditional deep learning-based KWS approaches require a large number of keyword recordings to train a keyword-specific network, and it is hard to change the keyword without extensive retraining. If we want to change the keyword, we have to collect a new audio corpus for it and retrain the network. In this thesis, we focus on the implementation of a system for open-vocabulary keyword spotting. We use connectionist temporal classification (CTC) to train the acoustic model and decide whether to wake up the system based on the target keyword's confidence score computed from the CTC output. Moreover, the keyword spotting system needs to be deployed on an edge device for convenience of use. To achieve this, we use knowledge distillation and model quantization to reduce the system latency without performance degradation. Experiments on the Mobvoi Hotwords database show that, compared with the baseline model, the relative reduction in latency reaches 40% and the relative reduction in false rejection rate reaches 15.54% at one false alarm per hour.
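Neither abstract details the distillation or quantization recipe. The following PyTorch sketch shows a standard Hinton-style soft-target distillation loss and post-training dynamic quantization, which is one way to realize what the abstract describes; the placeholder student model, the temperature, and the choice of dynamic int8 quantization are assumptions for illustration only, not the thesis's configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Hinton-style soft-target loss: KL divergence between the teacher's and
    the student's temperature-softened output distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t * t)

# Placeholder student acoustic model (80-dim features to 72 output tokens);
# in training this term would typically be combined with the CTC loss, e.g.
#   total_loss = ctc_loss + alpha * distillation_loss(student_out, teacher_out)
student = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 72))

# Post-training dynamic quantization: Linear weights are stored as int8 and
# dequantized on the fly, shrinking the model and speeding up CPU inference
# on edge devices.
quantized_student = torch.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8)

# Sanity check of the distillation loss on dummy logits (4 frames, 72 tokens).
loss = distillation_loss(torch.randn(4, 72), torch.randn(4, 72))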

