
Lip Optical-flow Driven Automatic Continuous Speech Syllabification in Mandarin

Advisor: 黃乾綱

Abstract


Signal-processing approaches in continuous speech recognition fall into two categories: whole-signal recognition and syllable-based recognition. The syllable-based approach reduces noise interference by analyzing only the signal regions where energy is concentrated. Its first step is to locate accurate syllable boundaries, but the co-articulation common in spoken Mandarin often prevents correct boundaries from being extracted from the audio alone. This study therefore detects syllable boundaries from lip images of continuous Mandarin speech, using the transitions between different lip shapes as the key cue. The proposed method first locates the face with SIFT, then applies dense optical flow to compute the lip-image change between adjacent frames; this change serves as the basis for boundary detection. Because the audio and video in our recordings were captured simultaneously and are synchronized, the image-based boundaries are merged into the audio-based boundary results to raise accuracy. Experiments show that among the syllables that cannot be segmented from the audio signal, more than half of the boundaries can be recovered from the lip-motion analysis. Combining the two detection sources strengthens the system's stability in noisy environments and lets it adapt to real-world noise conditions. This study also recorded a video database of continuous Mandarin digit sequences as experimental data: 40 speakers and 2,480 clips in total (covering both a reading stage and a natural-speed speaking stage). The database will be released to the academic community to promote research on Mandarin lip reading.
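The boundary-detection idea described above (treating frames where the inter-frame lip-motion magnitude spikes as candidate syllable boundaries) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function name, the synthetic motion trace, and the mean-plus-one-standard-deviation threshold are all assumptions made for the example.

```python
import numpy as np

def detect_boundaries(motion, threshold=None):
    """Return frame indices where lip motion is a local maximum above a threshold.

    `motion` is a per-frame scalar summary of the dense optical flow over the
    lip region (e.g. mean flow magnitude). The adaptive threshold below is
    illustrative, not a value from the thesis.
    """
    motion = np.asarray(motion, dtype=float)
    if threshold is None:
        threshold = motion.mean() + motion.std()
    peaks = []
    for i in range(1, len(motion) - 1):
        # A candidate boundary: motion exceeds the threshold and peaks locally.
        if motion[i] > threshold and motion[i] >= motion[i - 1] and motion[i] > motion[i + 1]:
            peaks.append(i)
    return peaks

# Synthetic trace: low motion within syllables, bursts at lip-shape transitions.
trace = [0.1, 0.2, 1.5, 0.3, 0.2, 0.1, 1.8, 0.4, 0.2]
print(detect_boundaries(trace))  # → [2, 6]
```

In practice the per-frame motion values would come from a dense optical-flow estimate (e.g. Farnebäck flow) averaged over the detected lip region; the peak-picking step stays the same.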

Parallel Abstract


Automatic speech recognition (ASR) research that uses video can be split into two categories based on the signal-processing method. One recognizes the signal of the whole streaming video; the other recognizes signals on a syllable basis. The latter approach analyzes the energy-concentrated regions to reduce noise interference. Recognizing syllables requires detecting syllable boundaries correctly in continuous speech signals. To achieve this goal, this study focuses on detecting the syllable boundaries contained in a continuous Mandarin corpus of lip images. The transitions between different lip shapes are the key information for detecting syllable boundaries. The proposed algorithm first locates the face position; dense optical flow is then adopted to calculate the lip-image variation between every two neighboring video frames, and this variation serves as the basis for detecting syllable boundaries in continuous video. Since the audio and video were recorded simultaneously in this study, it is reasonable to assume that the boundary between two adjacent syllables should also be visible in the image information. The experimental results show that more than half of the syllable boundaries can be extracted from the variation of the lip images when the audio signals of the syllables cannot be separated by their energy distribution. Using both the audio and video channels not only raises the stability of syllable-boundary detection but also makes the system robust in noisy environments. Furthermore, the database recorded for this study, which consists of 2,480 clips (both reading and spontaneous speaking) from 40 informants, will be made available for download to promote academic research on continuous Mandarin speech recognition.
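The audio-visual fusion step mentioned above (letting video-detected boundaries fill the gaps the audio segmenter misses) can be sketched as follows. The function name, the example boundary times, and the 50 ms matching tolerance are illustrative assumptions, not values from the thesis.

```python
def fuse_boundaries(audio_b, video_b, tol=0.05):
    """Merge audio- and video-detected syllable boundaries (times in seconds).

    A video boundary is accepted only if no audio boundary lies within `tol`
    of it; otherwise the audio boundary is kept and the video one is treated
    as a duplicate detection. The tolerance is an illustrative assumption.
    """
    fused = list(audio_b)
    for t in video_b:
        if all(abs(t - a) > tol for a in audio_b):
            fused.append(t)  # video supplies a boundary the audio could not find
    return sorted(fused)

audio = [0.30, 0.92]        # the boundary between syllables 2 and 3 is missing
video = [0.31, 0.60, 0.93]  # lip motion also finds the boundary near 0.60 s
print(fuse_boundaries(audio, video))  # → [0.3, 0.6, 0.92]
```

This simple union-with-deduplication is one plausible reading of "introducing the image-detected boundaries into the audio-detected results"; the thesis may weight or validate the two channels differently.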

