Language-Model-Based Ensemble and Reranking for Improving ASR

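Adding neural language models has substantially improved the accuracy of speech recognition, yet some errors still occur. This paper applies an idea similar to ensemble learning: several different language models are used to pick suitable representative words, which are then assembled into a new sentence to improve recognition. In the first decoding stage, a conventional speech recognition method is used with an adaptation corpus added; in the second stage, different neural language models rescore the results. We propose selecting words from the outputs that five different rescoring models produce for the same sentence. Selection depends on how important each word is and on whether its position is correct: importance is judged with majority weights and cumulative weights, and position is decided with shift alignment and longest-common-subsequence alignment. The selected representative words are then recombined into a new sentence, a method we call sentence ensemble. Comparing sentence ensemble against rescoring on the Aishell-1 test data, the error reduction rate can reach 9.84%, which confirms the effectiveness of the proposed method.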

Keywords

sentence ensemble; rescoring; neural language model; language model adaptation; automatic speech recognition

Parallel Abstract

Neural language models have greatly improved the recognition rate of speech recognition, but some errors remain. This paper aims to improve speech recognition by using different language models, together with a concept similar to ensemble learning, to select suitable word representatives and form new sentences. First, a traditional speech recognition method, with an adaptation corpus added, is used for first-stage decoding; different neural language models are then used for rescoring in second-stage decoding. This paper proposes using five different rescoring models to select words from the decoding results of the same sentence. Word selection is based on the importance of each word and on whether the position of the word is correct. Word importance is judged using the majority weight and the cumulative weight, and word position is determined using shift alignment and longest common subsequence alignment. The selected word representatives are then reorganized to create a new sentence. This method is called sentence ensemble. We compare the results of sentence ensemble and rescoring: on the Aishell-1 test data, the error reduction rate can reach 9.84%, which verifies the effectiveness of the proposed method.
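The word-selection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not define the weights or the alignment details, so everything below is an assumption made for illustration. The function names (`lcs_pairs`, `align_to_pivot`, `sentence_ensemble`) are hypothetical, the first hypothesis is arbitrarily used as the pivot, equal weights stand in for a majority-style vote and per-model weights for a cumulative-style vote, and only longest-common-subsequence alignment is shown (shift alignment is omitted).

```python
from collections import defaultdict


def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def align_to_pivot(pivot, hyp):
    """Map hyp tokens onto pivot positions; None means no vote at that position.

    Assumption: a gap between two LCS anchors is treated as substitutions
    only when it has the same length in both sequences.
    """
    slots = [None] * len(pivot)
    anchors = [(-1, -1)] + lcs_pairs(pivot, hyp) + [(len(pivot), len(hyp))]
    for (pi, hi), (pj, hj) in zip(anchors, anchors[1:]):
        if pj < len(pivot):
            slots[pj] = hyp[hj]          # anchor: identical token
        if pj - pi == hj - hi:           # equal-length gap: pair positionally
            for k in range(1, pj - pi):
                slots[pi + k] = hyp[hi + k]
    return slots


def sentence_ensemble(hyps, weights=None):
    """Pick, for each pivot position, the token with the largest summed weight."""
    if weights is None:
        weights = [1.0] * len(hyps)      # equal weights: plain vote counting
    pivot = hyps[0]                      # assumption: first hypothesis is the pivot
    votes = [defaultdict(float) for _ in pivot]
    for hyp, w in zip(hyps, weights):
        for pos, tok in enumerate(align_to_pivot(pivot, hyp)):
            if tok is not None:
                votes[pos][tok] += w
    return [max(v, key=v.get) for v in votes]


# Made-up outputs of five rescoring models for the same utterance.
hyps = [s.split() for s in [
    "the cat sad on the mat",
    "the cat sat on the mat",
    "the cat sat on the mat",
    "the cut sat on the mat",
    "the cat sat on a mat",
]]
print(" ".join(sentence_ensemble(hyps)))  # -> the cat sat on the mat
```

Because the output keeps the pivot's length, this sketch cannot recover words that the pivot inserted or dropped; handling those cases would need a fuller alignment than the one assumed here.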

Parallel Keywords

sentence ensemble; rescoring; neural language model; language model adaptation; automatic speech recognition

