
Using BERT Semantic Embeddings for 3-Stage ASR

Advisor: 張智星

Abstract


This research simulates the process by which an infant learns a language and proposes a semantically oriented three-stage automatic speech recognition (ASR) architecture: an infant first learns the meaning of the sounds it hears, and only later, as it grows, learns the corresponding written characters. In the first stage, a conventional DNN-HMM acoustic model converts acoustic features into phonetic posteriorgrams (PPG). In the second stage, a Transformer-based end-to-end (E2E) model converts the PPG sequence into semantic character embeddings, with teacher forcing and scheduled sampling applied during training to effectively improve recognition accuracy. In the final stage, the character embeddings are converted into characters for subsequent human use. To suppress spurious characters generated from noise, noisy data are added to the training set and the entropy of the output distribution is additionally exploited. This research also proposes a data-augmentation method that reorganizes sentences, providing contexts with different semantics for the model to learn. Experimental results show that the proposed three-stage ASR architecture achieves a character error rate (CER) of 11.65% on the MATBN test set, a 4.5% relative reduction compared with the 12.2% CER of a hybrid CTC/attention E2E model.
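The abstract mentions using entropy to suppress characters generated from noise but does not detail the mechanism. A minimal sketch of one plausible reading is to gate each decoded character on the Shannon entropy of its output distribution, dropping steps where the model is diffusely uncertain; the threshold value and function names below are illustrative assumptions, not taken from the thesis:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    p = np.asarray(probs, dtype=float)
    return float(-(p * np.log(p + 1e-12)).sum())

def filter_by_entropy(char_probs, threshold=1.0):
    """Keep only decoding steps whose output distribution is confident.

    char_probs: list of (character, distribution) pairs, one per step.
    A near-uniform distribution (high entropy) suggests the step was
    driven by noise rather than speech, so its character is dropped.
    """
    return [ch for ch, dist in char_probs if entropy(dist) < threshold]

# A peaked distribution passes; a near-uniform one is rejected.
steps = [("你", [0.97, 0.01, 0.01, 0.01]),
         ("好", [0.25, 0.25, 0.25, 0.25])]
print("".join(filter_by_entropy(steps)))  # → 你
```

A uniform distribution over 4 symbols has entropy ln 4 ≈ 1.39 nats, above the illustrative threshold, while the peaked distribution sits near 0.17 nats.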
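The reported gain can be checked directly. Below is a minimal sketch of CER computation via Levenshtein edit distance, together with the arithmetic behind the quoted 4.5% relative reduction (helper names are illustrative, not from the thesis):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences (single-row DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # deletion
                        dp[j - 1] + 1,                          # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))      # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

# Relative CER reduction reported in the abstract:
baseline, proposed = 12.2, 11.65
relative_reduction = (baseline - proposed) / baseline
print(round(relative_reduction * 100, 1))  # → 4.5
```

Note that CER counts substitutions, insertions, and deletions alike, which is why the entropy-based suppression of noise-induced insertions can lower it.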

References


D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach, ser. Signals and Communication Technology. London: Springer-Verlag, 2015.
S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/Attention Architecture for End-to-End Speech Recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, Dec. 2017.
“企業數位轉型新利器 中華電信AI智能客服 [A new tool for enterprise digital transformation: Chunghwa Telecom AI intelligent customer service],” https://www.cht.com.tw/zh-tw/home/cht/success-case/enterprise-success-case-print/a5, Sep. 2021.
“聲控時代來臨 玉山銀行24小時智能客服升級「隨聲隨行」語音服務 [The voice-control era arrives: E.SUN Bank's 24-hour intelligent customer service upgraded with the「隨聲隨行」voice service],” https://www.esunbank.com.tw/bank/about/news-center?previewItemID=%7B046CBC3C-3C6A-486F-A2F0-BE11F9C05C8C%7D&filter=%7BD3D086FF-0713-46E5-A7D0-136259716415%7D&range=2000&previewType=news, Sep. 2021.
D. Wang, X. Wang, and S. Lv, “An Overview of End-to-End Automatic Speech Recognition,” Symmetry, vol. 11, no. 8, p. 1018, Aug. 2019.
