  • Dissertation


Blind Phone Segmentation Based on Spectral Change Detection

Advisors: 王小川, 鐘太郎




Phone segmentation partitions a continuous speech signal into discrete phone units. It is required in many areas of speech processing, such as acoustic-phonetic analysis, speech recognition, speaker recognition, speech synthesis, and speech corpus annotation. Manual phone segmentation is time-consuming, and its results can be inconsistent because different transcribers apply different subjective criteria, so an automatic method is desirable. A typical approach aligns the speech signal with the phone transcript of the utterance: when the transcript is available, forced alignment based on hidden Markov models locates the phone boundaries. This supervised method usually achieves high accuracy. In some applications, however, neither training speech nor transcripts are available, and unsupervised methods must be used. When no linguistic knowledge of the speech data (such as orthographic or phonetic transcripts) is given, phone segmentation must be performed blindly, and achieving high accuracy with a blind method is challenging.

This dissertation addresses the problem of blind phone segmentation. Band energies of the speech signal are computed as features, and four blind phone segmentation methods are proposed, based on (1) a delta spectral function, (2) a band-energy (BE) tracing technique, (3) a Gaussian function, and (4) Legendre polynomial approximation. The methods were evaluated on the English speech corpus TIMIT, and the experimental results showed that they were more accurate than previous methods. The BE tracing method was also evaluated on the Chinese speech corpus TCC300 to reveal issues related to language independence. The influence of noise was investigated for the methods based on the Gaussian function and Legendre polynomial approximation.
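The abstract does not specify how the four proposed methods are computed, so the following is only a generic sketch of the underlying idea: detecting spectral change from band energies. It is not the dissertation's algorithm. The function names, the equal-width bands, the frame and hop sizes, the Euclidean distance between adjacent frames, and the fixed threshold are all assumptions made for illustration.

```python
import cmath
import math


def band_energies(frame, n_bands):
    """Log energy in n_bands equal-width bands (assumed layout) of one frame."""
    n = len(frame)
    half = n // 2
    # Hann window to limit spectral leakage.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    # Naive DFT over the non-negative frequency bins (fine for short frames).
    power = []
    for k in range(half):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        power.append(abs(s) ** 2)
    width = half // n_bands
    # Small floor keeps the logarithm finite for empty bands.
    return [math.log(sum(power[b * width:(b + 1) * width]) + 1e-10)
            for b in range(n_bands)]


def spectral_change(signal, frame_len=64, hop=32, n_bands=8):
    """Euclidean distance between band-energy vectors of adjacent frames."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    feats = [band_energies(signal[s:s + frame_len], n_bands) for s in starts]
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(f0, f1)))
            for f0, f1 in zip(feats, feats[1:])]


def pick_boundaries(change, frame_len=64, hop=32, threshold=1.0):
    """Local maxima of the change curve above a threshold, as sample indices."""
    boundaries = []
    for i in range(1, len(change) - 1):
        is_peak = change[i] >= change[i - 1] and change[i] > change[i + 1]
        if is_peak and change[i] > threshold:
            # Midpoint between the centres of frame i and frame i + 1.
            boundaries.append(i * hop + (frame_len + hop) // 2)
    return boundaries
```

On a synthetic signal that switches from one sinusoid to another, the change curve in this sketch stays near zero inside each stationary stretch and peaks near the switch. Real speech would call for perceptually motivated bands and a more careful peak-picking rule than a single fixed threshold.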




