With the rapid development of deep learning, applying neural networks within the conventional pipelined architecture of automatic speech recognition has achieved considerable success. In the past two years, end-to-end speech recognition architectures have reached comparable performance, but they require very large training corpora and computing resources. This study approaches the problem from the perspective of an end-to-end language model: a 440-million-word Chinese text corpus accumulated by our laboratory is converted into syllable sequences by a proposed grapheme-to-phoneme (G2P) system and used as training data, and syllable-to-character language models are trained with a sequence-labeling method and a self-attention sequence-to-sequence model (Transformer), both common in natural language processing tasks. We find that syllable sequences also carry semantic information, and that deep neural networks help convert syllable sequences into the correct character (word) sequences. The Transformer syllable-to-character model achieves a lower character error rate than the baseline trigram model on our external test set.
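The trigram baseline mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a toy corpus and a hypothetical homophone table mapping each syllable to its candidate characters, scores candidate character sequences with an add-alpha-smoothed character trigram model, and decodes with beam search.

```python
# Sketch of a trigram syllable-to-character baseline (toy data, hypothetical
# homophone table; illustrative only, not the study's actual system).
import math
from collections import defaultdict

# Hypothetical homophone table: syllable -> candidate characters.
CANDIDATES = {"shi4": ["是", "事", "市"], "jian4": ["見", "件", "建"]}

class TrigramLM:
    """Character trigram language model with add-alpha smoothing."""
    def __init__(self, corpus, alpha=0.5):
        self.tri = defaultdict(int)
        self.bi = defaultdict(int)
        self.alpha = alpha
        self.vocab = set()
        for sent in corpus:
            chars = ["<s>", "<s>"] + list(sent)
            self.vocab.update(chars)
            for i in range(2, len(chars)):
                self.tri[tuple(chars[i - 2:i + 1])] += 1
                self.bi[tuple(chars[i - 2:i])] += 1

    def logp(self, h1, h2, c):
        # Smoothed log P(c | h1, h2).
        num = self.tri[(h1, h2, c)] + self.alpha
        den = self.bi[(h1, h2)] + self.alpha * max(len(self.vocab), 1)
        return math.log(num / den)

def decode(syllables, lm, beam=3):
    # Beam search over homophone candidates, scored by the trigram LM.
    beams = [(0.0, ["<s>", "<s>"])]
    for syl in syllables:
        nxt = []
        for score, hist in beams:
            for ch in CANDIDATES.get(syl, ["?"]):
                nxt.append((score + lm.logp(hist[-2], hist[-1], ch),
                            hist + [ch]))
        beams = sorted(nxt, reverse=True)[:beam]
    return "".join(beams[0][1][2:])  # strip the two <s> context symbols

lm = TrigramLM(["是件", "市建", "是件"])
print(decode(["shi4", "jian4"], lm))  # picks the most frequent reading: 是件
```

A neural syllable-to-character model replaces the count-based trigram scores with a learned sequence model (the sequence-labeling or Transformer networks described above), which is how the study obtains its lower character error rates.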