

Audio-Visual Speech Recognition and Person Identification

Abstract


This paper presents a novel multimodal technique for speech recognition and speaker identification that combines synchronized audio and lip-image signals to improve on what either single modality can achieve alone. Experiments show that a speech recognizer augmented with lip-image information outperforms one that uses audio alone. Moreover, the audio and lip-image signals are complementary, each compensating for the other's weaknesses: in particular, when the audio is corrupted by background noise, the lip images provide more stable and reliable information, improving the recognizer's noise robustness. For speaker identification, we combine three sources of information: audio, lip dynamics, and face images of the speaker. Even with a small amount of training data, integrating the three sources yields a significant improvement in identification rate, along with markedly better robustness to variation over time. Beyond laying a foundation for lip-reading recognition, this work also advances our capability to integrate multiple sensor signals.
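The three-way integration for speaker identification described above can be sketched as a weighted combination of per-speaker match scores. The function names, weights, and score values below are illustrative assumptions, not the paper's trained parameters:

```python
# Sketch of score-level fusion for person identification: each modality
# produces a log-likelihood score per enrolled speaker, and the three
# scores are combined with fixed weights (illustrative values only).

def fuse_scores(audio, lip, face, weights=(0.5, 0.25, 0.25)):
    """Weighted sum of per-speaker scores from audio, lip-motion, and face."""
    wa, wl, wf = weights
    return {spk: wa * audio[spk] + wl * lip[spk] + wf * face[spk]
            for spk in audio}

def identify(audio, lip, face, weights=(0.5, 0.25, 0.25)):
    """Return the enrolled speaker with the highest fused score."""
    fused = fuse_scores(audio, lip, face, weights)
    return max(fused, key=fused.get)

# Toy scores: audio alone prefers speaker "A", but the visual modalities
# correct the decision toward "B" after fusion.
audio = {"A": -10.0, "B": -11.0}
lip = {"A": -14.0, "B": -9.0}
face = {"A": -13.0, "B": -8.0}
print(identify(audio, lip, face))  # -> B
```

This illustrates the complementarity claim: a speaker the audio stream alone would misidentify can still be identified correctly once the lip and face scores are folded in.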

Parallel Abstract


In this paper we propose a novel audio-visual speech recognition (AVSR) technology. The AVSR system augments an audio-only speech recognizer with visual lip-reading information in order to improve the performance and robustness of the recognizer. The speech recognizer's variable-length audio segments are aligned with fixed-length video frames using segment-constrained Hidden Markov Modeling. A Viterbi search over the per-segment Hidden Markov Model resolves the variable asynchrony between the audio and video streams. The two streams are combined according to a relative weighting scheme, which is determined by optimization on a held-out data set. Experiments show that the AVSR system outperforms the audio-only recognizer. For person identification, we collect audio, lip, and face information from each speaker and combine them to achieve a better identification rate. Our results show that a complementary relationship exists between any two of the three signals, so performance can be boosted in a long-term experiment even with little training data. Audio and visual sources are clearly inter-related and must be consistent, since both capture time-varying information associated with the production of the speech signal. However, the information they contain is often complementary. Thus, the integration of these parallel information sources could lead to enhanced capabilities for human-computer interaction.
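The relative weighting scheme mentioned above can be sketched as a single stream weight tuned by grid search on held-out data. The `accuracy` helper, the grid, and the toy score dictionaries are hypothetical placeholders standing in for the paper's actual corpus and optimization:

```python
# Sketch of audio-visual stream weighting: fused score =
# lam * audio + (1 - lam) * video, with lam picked by grid search
# on a held-out set (toy data; not the paper's actual setup).

def combined_score(audio_ll, video_ll, lam):
    """lam = 1.0 trusts audio only; lam = 0.0 trusts video only."""
    return lam * audio_ll + (1.0 - lam) * video_ll

def accuracy(samples, lam):
    """Fraction of held-out samples classified correctly at weight lam."""
    correct = 0
    for audio, video, truth in samples:
        fused = {c: combined_score(audio[c], video[c], lam) for c in audio}
        if max(fused, key=fused.get) == truth:
            correct += 1
    return correct / len(samples)

def pick_weight(samples, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid search: return the lam with the best held-out accuracy."""
    return max(grid, key=lambda lam: accuracy(samples, lam))

# Toy held-out set where each stream alone misclassifies one sample,
# but an intermediate weight gets both right (the complementarity effect).
held_out = [
    ({"a": -1.0, "b": -5.0}, {"a": -3.0, "b": -2.0}, "a"),
    ({"a": -3.0, "b": -2.0}, {"a": -1.0, "b": -5.0}, "a"),
]
best = pick_weight(held_out)
```

In practice the weight would shift toward the video stream as acoustic noise increases, which is how the visual channel supplies the robustness the abstract describes.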
