
語音辨識系統之聲學模型訓練研究

Acoustic Model Training for Speech Recognition

Advisor: 江振宇

Abstract


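Speech is the primary means of human communication and has underpinned human development since ancient times. Spoken communication is therefore woven into every human culture, and its ubiquity in human society is what makes the field most compelling to study.

This thesis investigates and develops acoustic model training for a speech recognition system. The acoustic model is the first step in building such a system, and basic model training allows a speech recognition engine to be applied more widely. Applications of speech recognition include, in particular:
1. Recognition of short words or keywords over the telephone.
2. Voice-interactive command and control systems with medium-sized vocabularies, for example IVR (interactive voice response) in telecommunications and voice-activated systems in automobiles, in banks, and for people with disabilities.
3. Limited-domain speech translation.

Training an acoustic model involves parameter estimation, that is, seeking the maximum likelihood given a string of words. Because speech recognition depends on the vocabulary, the language model, and the HMMs, the thesis adopts an approach that separates these different sources of information. Here, speech is represented by HMM states, a pronunciation dictionary (the TIMIT dictionary), and a training corpus (the TIMIT training data), and FST technology is used to build a network that represents these three sources of information.

The study has two motivations: an academic requirement, and a professional wish to contribute something of value. From the academic perspective, the first step in building a speech recognition system is to conceptualize model training in terms of algorithms, methods, and techniques (mathematical and computational). The results of this academic effort can be applied to similar ideas in industry. As a telecommunication engineer who has invested professionally in this research, I can envisage its direct industrial applications.

Speech communication has proved extensible to human-machine communication. Enabling a machine to interpret and respond to speech is a difficult task in speech recognition, and progress on it is owed to the community of speech researchers and professionals, whose work has broadened the domain of speech communication. Because of the economic and social convenience that human-machine communication brings, many businesses wish to adopt the technology, and the benefits are expected to be considerable. In the future, interacting with machines will be easy, and more people will use the technology, for example people with disabilities and elderly communities.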

Parallel Abstract


Speech is the primary means of communication among human beings. It has underlain the progress of human evolution since time immemorial and is interwoven in the fabric of every human culture. A compelling reason to study and work with speech is that it is the most common form of communication within the human community.

This thesis investigates and develops acoustic model training for a speech recognition system. Training is the first step in building such a system, and the resulting speech recognition engine has many applications, among them:
1. Small-vocabulary keyword recognition over dial-up telephone lines.
2. Medium-vocabulary voice-interactive command and control systems, e.g. IVR in telecommunications and voice-activated systems in automobiles, in banks, and for people with disabilities.
3. Limited-domain speech translation.

Training refers to the process of parameter estimation that seeks to maximize the likelihood of the observations given a string of words. A hierarchical approach is adopted in which the different sources of information are represented separately, motivated by the fact that speech recognition depends on the vocabulary, the language model, and the HMMs. In this thesis, speech is modeled by HMM states, the pronunciation dictionary is the TIMIT dictionary, and the training corpus is the TIMIT training data. A network representing these three sources of information is built using FST technology.

The motivation for this study is twofold: the academic requirements, and the professional wish to fulfill a long-standing desire to do something great.
From the academic perspective, algorithms, methods, and techniques (mathematical and computational) are sought to conceptualize training as the first step in building a speech recognition system. The success of this academic endeavor will culminate in the application of similar ideas in industry. As a telecommunication engineer devoted to this study, I can think of immediate applications of this research in industry.

Speech communication can be extended to human-machine communication. Making a machine understand and respond to speech is a pattern recognition task, which this thesis investigates. Success in this extension widens the domain of speech communication, and that success is attributable to the wider community of speech researchers and professionals. As the domain grows, businesses will also wish to embrace human-machine speech communication, driven by the economic and social conveniences it brings. The benefits of adopting the technology are expected to be substantial: interacting with a machine will be effortless, increasing the number of people who use the technology, e.g. people with disabilities and the elderly.
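
The hierarchy described in the abstract (words map to phones through a pronunciation dictionary, and phones map to HMM states) can be illustrated with a minimal sketch. This is a plain-Python stand-in for the chain of mappings, not the FST network built in the thesis. The lexicon entries, the three-states-per-phone topology, and the name `expand_to_states` are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict, List

# Hypothetical toy lexicon for illustration only; a real system would load
# the pronunciations from the TIMIT dictionary.
LEXICON: Dict[str, List[str]] = {
    "she": ["sh", "iy"],
    "had": ["hh", "ae", "d"],
}

# Assumption: each phone is modeled by a 3-state HMM. The abstract does not
# state the topology used in the thesis.
STATES_PER_PHONE = 3


def expand_to_states(
    words: List[str],
    lexicon: Dict[str, List[str]] = LEXICON,
    n_states: int = STATES_PER_PHONE,
) -> List[str]:
    """Expand a word string into the sequence of HMM state labels it implies.

    Word -> phones (via the lexicon) -> per-phone HMM states. Training would
    then estimate the parameters attached to these states so as to maximize
    the likelihood of the observed audio given the word string.
    """
    states: List[str] = []
    for word in words:
        for phone in lexicon[word]:  # raises KeyError for unknown words
            states.extend(f"{phone}_{i}" for i in range(n_states))
    return states


if __name__ == "__main__":
    print(expand_to_states(["she", "had"]))
```

An FST-based system expresses each of these mappings as a transducer and composes them into one network, which also handles alternative pronunciations and weighted paths that this flat expansion ignores.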
