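Reliably detecting salient speech events in speech signals plays an important role in event-based speech recognition. Labeling speech events not only benefits phone recognition but also makes it easier to extract important phonetic information. This dissertation focuses on detecting the burst onset, the most representative speech event in stops and affricates; whether this event is present can be determined by detecting closure-burst transitions in the time-frequency space. Two-dimensional cepstral coefficients are adopted as feature parameters and combined with the random forest method to detect burst onsets in continuous speech. Building a random forest runs into the problem of imbalanced data, which strongly affects detection performance, so we propose an asymmetric bootstrap to overcome it. A series of experiments on the TIMIT English corpus shows that the proposed burst onset detection method is both highly efficient and highly accurate, and that part of the information from the detection process can complement Mel-frequency cepstral coefficients to raise the phone recognition rate of stops and affricates. Voice onset time is the time difference between the burst onset and the voicing onset of a stop. Among the many studies on voice onset time, how to measure it efficiently has long been a topic of concern. Manual labeling is feasible, but it is inevitably time-consuming for a large corpus. The second part of this dissertation addresses this problem by combining two techniques, hidden Markov model-based state-level forced alignment and random forest-based onset detection, to label voice onset time automatically. Forced alignment roughly marks where stops occur in continuous speech, and onset detection then searches the aligned stop segments for more precise burst onset and voicing onset times. The experimental data also come from the TIMIT corpus and comprise 2,344 word-initial stops and 1,440 word-medial stops. On average, the proposed method achieves cumulative accuracies of 57%, 83%, 93%, and 96% under error tolerances of 5 ms, 10 ms, 15 ms, and 20 ms, respectively. The results also indicate that the voice onset times of word-initial stops are easier to estimate than those of word-medial stops. Besides showing the accuracy of voice onset time estimation, the dissertation discusses factors that may affect the estimates, such as the place of articulation of the stop, its voicing status, and the quality of the following vowel.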
The reliable detection of salient acoustic-phonetic cues in speech signals plays an important role in landmark-based speech recognition. Locating speech landmarks not only assists phone recognition but also helps the extraction of phonetic information. This dissertation focuses on detecting the burst onset, which is the most prominent landmark in stop and affricate consonants. The chosen feature representation is the two-dimensional cepstral coefficients (TDCCs) computed from a spectro-temporal patch, which highlight the closure-burst transitions that indicate the presence of burst onsets. A random forest (RF), an ensemble of tree-structured classifiers, then uses these feature vectors to detect burst onsets in continuous speech. For the random forest construction, we also propose an asymmetric bootstrap to deal with the problem of imbalanced training data, which may degrade the performance of the resulting forest. A series of experiments conducted on TIMIT, a corpus of spoken English, demonstrates that the proposed detector provides an efficient and accurate means of detecting burst onsets. When the detection results are appended to MFCC vectors, the augmented feature vectors improve the recognition accuracy for stop and affricate consonants. The voice onset time (VOT) of a stop consonant is the interval between its burst onset and its voicing onset. Among the variety of research topics on VOT, a long-standing concern is how to measure VOT efficiently. Manual annotation is feasible, but it becomes time-consuming when the corpus is large. The second part of this dissertation proposes an automatic VOT estimation method that combines HMM-based state-level forced alignment with RF-based onset detection. The forced alignment roughly locates stop consonants in continuous speech. The onset detector then searches each aligned stop segment for the precise locations of the burst and voicing onsets to estimate the VOT.
The proposed onset detection can locate the onsets efficiently and accurately with only a small amount of training data. The evaluation data were extracted from the TIMIT corpus and comprise 2,344 word-initial and 1,440 word-medial stops in total. The experimental results show that, on average, 57%, 83%, 93%, and 96% of the estimates deviate by less than 5 ms, 10 ms, 15 ms, and 20 ms, respectively, from their manually labeled values. The results also reveal that the VOTs of word-initial stops are estimated more accurately than those of word-medial stops. In addition to the accuracy of the VOT estimates, factors that may influence that accuracy, such as the place of articulation of a stop, the voicing status of a stop, and the quality of the succeeding vowel, are also investigated.
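
To make the reported figures concrete, the following is a minimal sketch of how a VOT and the tolerance-based cumulative accuracy might be computed. It assumes onset times are available in milliseconds; the function names `vot_ms` and `cumulative_accuracy` and the sample values are hypothetical and are not taken from the dissertation.

```python
from typing import Dict, Sequence


def vot_ms(burst_onset_ms: float, voicing_onset_ms: float) -> float:
    """Return the VOT as the interval from burst onset to voicing onset (ms)."""
    return voicing_onset_ms - burst_onset_ms


def cumulative_accuracy(
    estimated_ms: Sequence[float],
    reference_ms: Sequence[float],
    tolerances_ms: Sequence[float] = (5, 10, 15, 20),
) -> Dict[float, float]:
    """Fraction of estimates deviating by less than each tolerance.

    `reference_ms` stands for the manually labeled VOTs. The comparison is
    strict ("less than"), matching the wording of the abstract.
    """
    if len(estimated_ms) != len(reference_ms) or len(estimated_ms) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Absolute deviation of each estimate from its manual label.
    errors = [abs(e - r) for e, r in zip(estimated_ms, reference_ms)]
    return {
        tol: sum(err < tol for err in errors) / len(errors)
        for tol in tolerances_ms
    }


if __name__ == "__main__":
    # Hypothetical VOT values in ms, for illustration only.
    estimated = [22.0, 48.0, 61.0, 15.0]
    reference = [20.0, 55.0, 60.0, 33.0]
    for tol, acc in cumulative_accuracy(estimated, reference).items():
        print(f"within {tol} ms: {acc:.0%}")
```

Because each tolerance counts every estimate below it, the resulting fractions are non-decreasing as the tolerance grows, which is the pattern seen in the 57%, 83%, 93%, and 96% sequence above.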