
Using BERT Semantic Embeddings for 3-Stage ASR

Advisor: 張智星

Abstract


This research simulates the process by which an infant learns a language and proposes a semantically oriented three-stage automatic speech recognition (ASR) architecture: an infant first learns the meaning of the sounds it hears, and only later, as it grows, learns the corresponding written characters. In the first stage, a conventional DNN-HMM acoustic model converts acoustic features into phonetic posteriorgrams (PPG). In the second stage, a Transformer-based end-to-end (E2E) model converts the PPG sequence into semantic character embeddings, with teacher forcing and scheduled sampling applied during training to effectively improve recognition accuracy. In the final stage, the character embeddings are converted into characters for subsequent human use. To suppress spurious characters generated from noise, noisy data are added to the training set and the entropy of the output distribution is additionally exploited. This research also proposes a data-augmentation method that reorganizes sentences, providing contexts with different semantics for the model to learn. Experimental results show that the proposed three-stage ASR architecture achieves a character error rate (CER) of 11.65% on the MATBN test set, a 4.5% relative reduction compared with the 12.2% CER of a hybrid CTC/attention E2E model.
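The abstract mentions using entropy to suppress characters generated from noise but does not detail the mechanism. A minimal sketch of one plausible reading is to gate each decoded character on the Shannon entropy of its output distribution, dropping steps where the model is diffusely uncertain; the threshold value and function names below are illustrative assumptions, not taken from the thesis:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    p = np.asarray(probs, dtype=float)
    return float(-(p * np.log(p + 1e-12)).sum())

def filter_by_entropy(char_probs, threshold=1.0):
    """Keep only decoding steps whose output distribution is confident.

    char_probs: list of (character, distribution) pairs, one per step.
    A near-uniform distribution (high entropy) suggests the step was
    driven by noise rather than speech, so its character is dropped.
    """
    return [ch for ch, dist in char_probs if entropy(dist) < threshold]

# A peaked distribution passes; a near-uniform one is rejected.
steps = [("你", [0.97, 0.01, 0.01, 0.01]),
         ("好", [0.25, 0.25, 0.25, 0.25])]
print("".join(filter_by_entropy(steps)))  # → 你
```

A uniform distribution over 4 symbols has entropy ln 4 ≈ 1.39 nats, above the illustrative threshold, while the peaked distribution sits near 0.17 nats.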
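The reported gain can be checked directly. Below is a minimal sketch of CER computation via Levenshtein edit distance, together with the arithmetic behind the quoted 4.5% relative reduction (helper names are illustrative, not from the thesis):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences (single-row DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # deletion
                        dp[j - 1] + 1,                          # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))      # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

# Relative CER reduction reported in the abstract:
baseline, proposed = 12.2, 11.65
relative_reduction = (baseline - proposed) / baseline
print(round(relative_reduction * 100, 1))  # → 4.5
```

Note that CER counts substitutions, insertions, and deletions alike, which is why the entropy-based suppression of noise-induced insertions can lower it.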

References


D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach, ser. Signals and Communication Technology. London: Springer-Verlag, 2015.
S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/Attention Architecture for End-to-End Speech Recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, Dec. 2017.
“企業數位轉型新利器 中華電信AI智能客服 [A new tool for enterprise digital transformation: Chunghwa Telecom AI intelligent customer service],” https://www.cht.com.tw/zh-tw/home/cht/success-case/enterprise-success-case-print/a5, Sep. 2021.
“聲控時代來臨 玉山銀行24小時智能客服升級「隨聲隨行」語音服務 [The voice-control era arrives: E.SUN Bank's 24-hour intelligent customer service upgraded with the「隨聲隨行」voice service],” https://www.esunbank.com.tw/bank/about/news-center?previewItemID=%7B046CBC3C-3C6A-486F-A2F0-BE11F9C05C8C%7D&filter=%7BD3D086FF-0713-46E5-A7D0-136259716415%7D&range=2000&previewType=news, Sep. 2021.
D. Wang, X. Wang, and S. Lv, “An Overview of End-to-End Automatic Speech Recognition,” Symmetry, vol. 11, no. 8, p. 1018, Aug. 2019.
