Detection and Application of Specific Speech Attributes

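Automatic speech recognition (ASR) was first approached, from the 1960s to the 1970s, by designing the rules of intelligent recognition systems on the basis of the phonetic properties of speech. During the same period, CMU and IBM both proposed the Hidden Markov Model (HMM) technique, and from the 1970s onward a large body of HMM-based speech recognition research followed, for example different functional forms for estimating posterior probabilities, different HMM parameter estimation methods, and discriminative parameter estimation (MCE, MMI, and others). Applications progressed from isolated-word and connected-word tasks to continuous speech recognition, and the scope of study expanded from speaker-dependent systems to the more complex speaker-independent setting. Between the 1980s and the 2000s, HMM-based recognizers improved markedly faster than recognizers built on expert knowledge.

In recent years, however, the progress of HMM-based systems has leveled off, and some researchers have begun to look for new directions. C.H. Lee proposed Automatic Speech Attribute Transcription (ASAT), which aims to combine knowledge-based and data-driven recognition. The approach requires many different speech attribute detectors, each evaluating a different speech attribute; ASAT then integrates these attributes according to expert knowledge to produce higher-level speech evidence. Choosing suitable speech attributes for the ASAT detectors is therefore a key issue.

Phonetics classifies speech hierarchically, using the features that critically distinguish different kinds of sounds. These categories fall into three major parts. The first is voicing (voiced versus voiceless), determined by whether the two vocal folds at the glottis vibrate. The second is place of articulation, defined by where the sound is produced: /b/ is articulated at the lips, and /g/ at the soft palate at the upper back of the oral cavity. The third is manner of articulation, whose main classes are vowels, semivowels, nasals, stops, and fricatives. In a vowel such as /i/, air from the lungs passes through the oral cavity and resonates with very little obstruction and no frication noise, so the sound is comparatively clear. The semivowels /j/ and /w/, also called glides, are consonants that resemble vowels: they are consonants because the airflow is obstructed during articulation, and they are called semivowels because the obstruction is so slight that they sound like vowels. A stop is produced by blocking the airflow and then releasing it forcefully; /p/ and /t/ belong to this class. Fricatives such as /f/ and /s/ are produced by forcing air through a narrow channel: the oral cavity constricts the airflow without closing completely, which produces frication noise at the place of articulation.

This thesis focuses on speech attribute detectors for manner of articulation. We extract speech features other than Mel-frequency cepstral coefficients to train acoustic models, detect the information present at speech transitions, and integrate the two to build a speech attribute detector. The features are chosen according to the speech attributes that phonetics relies on to distinguish classes of sounds, and they include both time-domain and frequency-domain parameters. To classify speech data with these features, we use Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs) as classifiers and evaluate classifier performance with the extracted features.

In addition, when the manner of articulation changes, certain time-domain features change markedly, and we extract this information as an important cue for classification. The information detected in the time domain marks the boundary between two adjacent speech attributes and corresponds to temporal changes that last only a very short time. Because the HMM framework tends to overlook such short-lived speech properties, we modify the Viterbi algorithm with a knowledge-based decision scheme that exploits this boundary information, in order to identify manner-of-articulation attributes more accurately and to build a more accurate speech attribute detector.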
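As an illustration of the time-domain cues described above, the following is a minimal Python sketch, not the feature set actually used in the thesis. It assumes two common frame-level parameters, short-time log energy and zero-crossing rate, and marks a landmark wherever the log energy jumps sharply between consecutive frames. The function names, the frame sizes, and the threshold are all hypothetical choices made for the example.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_energy(frame):
    """Short-time energy in dB; the small floor avoids log(0) on silent frames."""
    return 10.0 * math.log10(sum(x * x for x in frame) / len(frame) + 1e-10)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def detect_landmarks(energies, threshold_db):
    """Frame indices where log energy changes by more than threshold_db
    relative to the previous frame (an assumed, simplified landmark rule)."""
    return [t for t in range(1, len(energies))
            if abs(energies[t] - energies[t - 1]) > threshold_db]

if __name__ == "__main__":
    # Synthetic signal: silence, then a vowel-like sinusoid, then a
    # noise-like alternating segment standing in for frication.
    silence = [0.0] * 400
    vowel_like = [math.sin(2 * math.pi * k / 50) for k in range(400)]
    fricative_like = [0.5 if k % 2 == 0 else -0.5 for k in range(400)]
    frames = frame_signal(silence + vowel_like + fricative_like, 100, 100)
    energies = [log_energy(f) for f in frames]
    print("landmark frames:", detect_landmarks(energies, 20.0))
    print("zero-crossing rates:", [round(zero_crossing_rate(f), 2) for f in frames])
```

In this toy signal the energy jump separates silence from the sonorant segment, while the zero-crossing rate separates the sonorant segment from the noise-like one, which is the kind of complementary evidence a manner-of-articulation detector can draw on.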

Keywords

speech attribute detector; automatic speech recognition

Parallel Abstract

Over the past four decades, the integration of the Hidden Markov Model (HMM) into automatic speech recognition (ASR) has brought great progress. A wide range of HMM-based speech recognition research has been developed, such as different definitions of observation probability functions and different methods for estimating model parameters. Recently, some researchers have looked at the ASR framework from a new angle. C.H. Lee proposed automatic speech attribute transcription (ASAT), which combines knowledge-based and data-driven methods. Diagnostic information is provided by integrating additional speech attribute detectors designed from acoustic-phonetic knowledge, and incorporating such knowledge is believed to be potentially beneficial to ASR. ASAT is based on speech attribute detection: a specific speech event is composed of several speech attributes, and ASAT can interpret different speech events as higher-level speech evidence. Effective speech attribute detectors have therefore become an important research issue. In this paper, we focus on detecting manner-of-articulation attributes, namely vowel, fricative, stop, sonorant consonant, and silence. We use knowledge-based features to detect manner of articulation. Support vector machine based classifiers for manner of articulation are also designed, using a set of knowledge-based features under a probabilistic framework. Besides frame-based event detectors, segment-based detectors can also be used. Speech landmarks are detected from temporal information, and these landmarks are integrated with HMM-based event detectors.
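The abstract does not spell out how landmarks are integrated with the HMM-based detectors, so the following Python sketch shows only one plausible way to do it and should not be read as the thesis's actual modification. It is a standard log-domain Viterbi decoder with one assumed rule added: at a frame flagged as a landmark, any transition between two different states receives a fixed bonus. The function name `viterbi_with_landmarks` and the `bonus` parameter are hypothetical.

```python
def viterbi_with_landmarks(log_emis, log_trans, log_init, landmarks, bonus=2.0):
    """Log-domain Viterbi decoding with an assumed landmark rule.

    log_emis[t][s]  : log-likelihood of frame t under state s
    log_trans[i][j] : log transition probability from state i to state j
    log_init[s]     : log initial probability of state s
    landmarks       : set of frame indices where a boundary was detected
    bonus           : value added to state-changing transitions at a landmark
                      (hypothetical; not taken from the thesis)
    """
    n_frames, n_states = len(log_emis), len(log_init)
    score = [log_init[s] + log_emis[0][s] for s in range(n_states)]
    back = []
    for t in range(1, n_frames):
        new_score, pointers = [], []
        for j in range(n_states):
            best_i, best_val = 0, float("-inf")
            for i in range(n_states):
                val = score[i] + log_trans[i][j]
                if t in landmarks and i != j:
                    val += bonus  # encourage a state change at a detected boundary
                if val > best_val:
                    best_i, best_val = i, val
            new_score.append(best_val + log_emis[t][j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With strongly self-looping transitions, a plain Viterbi pass tends to smooth over an event that lasts a single frame; adding boundary evidence at the event's onset and offset lets the decoder keep it. This mirrors the motivation stated in the abstract, under the assumptions named above.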

Parallel Keywords

speech attribute detector


