
強健及分散式語音辨識系統中的動態量化技術

Dynamic Quantization for Robust and Distributed Speech Recognition

Advisor: Lin-shan Lee (李琳山)

Abstract


A Distributed Speech Recognition (DSR) system built over wireless networks splits conventional speech recognition between a handheld client and a server: the client extracts and compresses the speech feature parameters, and the compressed data are transmitted over a wireless channel to the server, where the features are reconstructed and recognized. Because a portable handheld device faces changing and unpredictable environments, environmental noise, the signal distortion introduced by compression, and transmission errors compound one another and severely degrade DSR performance.

To address the mismatch between the speech used to train the acoustic models and quantization codebooks and the speech actually being recognized, this dissertation proposes two robust dynamic quantization methods. The first is Histogram-based Quantization, in which the partition boundaries are dynamically adjusted according to the order statistics or histogram of a segment of the most recent past values of the parameter to be quantized. The codebook thus automatically follows the distribution of the input speech, solving the problem that, under the fixed-codebook constraint of conventional distance-based quantization, the codewords cannot effectively represent speech corrupted by different kinds of noise; the dynamic partition cells also make the quantization less sensitive to differing speaker characteristics. The dissertation further proposes a joint uncertainty decoding method based on quantization distortion and distribution shift which, without any additional transmission overhead, estimates both the quantization distortion of histogram-based quantization and the uncertainty of the speech features under noisy conditions, and takes both into account during recognizer decoding.

For a DSR system using histogram-based quantization, a three-stage error concealment framework is further developed. It combines the robustness of histogram-based quantization while jointly handling background noise at the speech input and wireless channel errors. The first stage detects errors at both the frame and subvector levels. The second stage reconstructs the erroneous feature vectors by maximum a posteriori (MAP) estimation, considering the statistics of the speech signal, the channel transition probabilities, and the reliability of the received speech coefficients. The third stage incorporates the uncertainty of the estimated features into Viterbi decoding, so that less reliable coefficients have less influence on the recognition result. At every stage, the error concealment techniques exploit the inherently robust nature of histogram-based quantization. Complete experiments were conducted on the Aurora 2 corpus, including channel error simulations for the GPRS communication system; the results show that the proposed methods effectively overcome environmental noise and transmission disturbances and significantly improve recognition accuracy.

The second robust dynamic quantization method proposed is context-dependent quantization. Unlike conventional quantization, which decides the quantized result frame by frame from the parameter value in a single frame, in context-dependent quantization each partition cell does not correspond to a single codeword during decoding; its representative codeword is determined dynamically according to the neighboring feature parameters. By exploiting the contextual dependencies of speech, this yields codewords more representative than those of single-frame quantization. We further build trigram models over the codewords of neighboring frames, so that the possible variations of speech parameters under noise are accounted for at model training time, and estimate the speech features under the minimum mean square error (MMSE) criterion. This method can be applied directly on top of any existing quantizer at the client, with no change at all to client-side computational complexity or transmitted bit rate: the contextual information of neighboring frames is added at the server, and a one-to-many decoding scheme increases the resolution of the quantized features. The dissertation also integrates context-dependent quantization with histogram-based quantization, so that both the partition boundaries and the representative codewords are defined dynamically, yielding strong robustness against both environmental noise and transmission errors.

The two robust quantization methods can also be applied to robust speech recognition in general. Transforming the speech features into their representative values through robust quantization can be regarded as a robust feature transformation: because the quantization itself is robust, part of the environmental disturbance is absorbed by the quantizer. Experimental results show that low-SNR environments and non-stationary noise can also be handled effectively.

Finally, the dissertation applies the concept of dynamic quantization to the quantization of image features. Most photos in a photo collection, even when annotated, carry only short captions that cannot fully represent the semantics of the whole photo. If "visual words" representative of a photo can be extracted from its image features, latent semantic models built from these visual words can aid photo retrieval. The dissertation treats the extraction of visual words as a quantization process, i.e., finding visual words that effectively represent the image feature parameters. Experimental results show that dynamic quantization can effectively extract visual words that capture image semantics, substantially improving photo retrieval.
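The second-stage MAP reconstruction above can be sketched as follows. This is a minimal, hypothetical illustration: `prior` and `channel_prob` stand in for the speech-source statistics and channel transition probabilities described in the dissertation, which are estimated offline and applied per subvector in the actual framework.

```python
import numpy as np

def map_reconstruct(received_idx, prior, channel_prob):
    """MAP estimate of the transmitted codeword index given a possibly
    erroneous received index. prior[i] is the source probability of
    codeword i; channel_prob[i, j] is the probability of receiving j
    when i was sent. The posterior is prior times channel likelihood."""
    posterior = prior * channel_prob[:, received_idx]
    return int(np.argmax(posterior))

# toy example: 3 codewords, strong prior on codeword 0,
# a symmetric channel with 0.8 probability of correct reception
prior = np.array([0.9, 0.05, 0.05])
channel_prob = np.array([[0.8, 0.1, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.1, 0.1, 0.8]])
# index 1 was received, but the strong source prior pulls the
# estimate back to codeword 0
estimate = map_reconstruct(1, prior, channel_prob)
```

Note how the estimate trades off the received index against the source prior: with a near-uniform prior the received index wins, while an unreliable reception combined with a strong prior is corrected.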

Parallel Abstract


Split Vector Quantization (SVQ) is widely used in the Distributed Speech Recognition (DSR) framework, in which the speech features are vector quantized and compressed at the client, transmitted over wireless networks, and recognized at the server. Recognition accuracy, however, is inevitably degraded by environmental noise at the input, quantization distortion, and transmission errors; these three sources of disturbance naturally mix with one another and further complicate the problem. The mismatch between the pre-trained VQ codebook and the constantly changing environmental conditions at the moving client is one of the major problems. In this dissertation, two dynamic quantization methods are proposed for robust and distributed speech recognition. The first approach, Histogram-based Quantization (HQ), is a novel approach in which the partition cells of the quantization are dynamically defined by the histogram, or order statistics, of a segment of the most recent past values of the parameter to be quantized. This dynamic quantization scheme, based on local signal order statistics, is shown to solve to a good degree many problems related to the mismatch with a fixed VQ codebook. The concept is extended to Histogram-based Vector Quantization (HVQ). A Joint Uncertainty Decoding (JUD) approach is further developed, in which the uncertainty caused by both environmental noise and quantization errors is jointly considered during Viterbi decoding. A three-stage error concealment (EC) framework based on HQ is also developed to handle transmission errors. The first stage detects erroneous feature parameters at both the frame and subvector levels. The second stage then reconstructs the detected erroneous subvectors by MAP estimation, considering the prior speech source statistics, the channel transition probabilities, and the reliability of the received subvectors.
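The core HQ idea can be sketched in scalar form as follows. This is a simplified illustration under stated assumptions: the encoder and decoder are assumed to maintain the same sliding window of past values, and the window length, bit allocation, and quantile rule are illustrative rather than the dissertation's exact scheme.

```python
import numpy as np

def hq_encode(x_t, window, bits=4):
    """Histogram-based Quantization (sketch): the partition boundaries
    are the empirical quantiles of a window of recent past values, so
    the cells track the current signal distribution."""
    levels = 2 ** bits
    # interior quantile edges define the dynamic partition cells
    edges = np.quantile(window, np.linspace(0, 1, levels + 1)[1:-1])
    return int(np.searchsorted(edges, x_t))  # cell containing x_t

def hq_decode(index, window, bits=4):
    """Representative value: the mid-quantile of the chosen cell,
    computed from the same window of recent values."""
    levels = 2 ** bits
    q = (index + 0.5) / levels
    return float(np.quantile(window, q))

# usage: both sides track the same window of recent values
rng = np.random.default_rng(0)
window = rng.normal(size=100)      # stand-in for recent feature values
x = 0.3
idx = hq_encode(x, window)         # index transmitted to the server
x_hat = hq_decode(idx, window)     # server-side reconstruction
```

Because the cells are defined by order statistics of the local signal rather than a fixed codebook, a shift in the feature distribution (e.g. from additive noise) moves the cells along with it, which is the robustness property exploited throughout.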
The third stage then considers the uncertainty of the estimated vectors during Viterbi decoding. At each stage, the EC techniques properly exploit the inherently robust nature of HQ. The second approach is context-dependent quantization, in which the representative parameter (whether a scalar or a vector) for a quantization partition cell is not fixed, but depends on the signal context on both sides; the context dependencies can be trained with a clean speech corpus or estimated from a noisy one. This yields a much finer quantization based on local signal characteristics, without using any extra bit rate. Context-dependent quantization can be integrated with the HQ proposed above, so that both partition cells and representative values are dynamically defined in the integrated dynamic quantization process. These two dynamic quantization techniques are useful not only for DSR, but also as feature transformation approaches for robust speech recognition outside a DSR environment. In the latter case, the feature parameters are simply replaced by their representative parameters after quantization. The robust nature of dynamic quantization is analyzed in detail. Because HQ performs the transformation using block-based order statistics, small disturbances of the feature parameters are absorbed by the histograms to a good extent; the proposed HQ scheme is thus useful for both robust and distributed speech recognition. For robust speech recognition, HQ serves as the front-end feature transformation and JUD as the enhancement approach at the back-end recognizer. For context-dependent quantization, exploiting the high correlation in speech signals also significantly improves robustness against transmission errors and environmental noise.
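The one-to-many decoding idea behind context-dependent quantization can be sketched as follows. This is a hypothetical simplification: the candidate fine codewords per cell are assumed trained offline, and choosing the candidate closest to the neighbouring frames' average stands in for the dissertation's trigram/MMSE machinery.

```python
import numpy as np

def context_dependent_decode(indices, coarse_codebook, fine_codewords):
    """Context-dependent decoding (sketch): each transmitted cell index
    maps to several candidate fine codewords at the server; for each
    interior frame, the candidate closest to the average of the two
    neighbouring frames' coarse values is selected. No extra bits are
    transmitted; the refinement happens entirely at the server."""
    # first pass: conventional coarse reconstruction, one value per cell
    coarse = np.array([coarse_codebook[i] for i in indices], dtype=float)
    decoded = coarse.copy()
    # second pass: refine interior frames using both neighbours
    for t in range(1, len(indices) - 1):
        context = 0.5 * (coarse[t - 1] + coarse[t + 1])
        cands = np.asarray(fine_codewords[indices[t]], dtype=float)
        decoded[t] = cands[np.argmin(np.abs(cands - context))]
    return decoded

# toy example: two coarse cells, each with three finer candidates
coarse_codebook = {0: 0.0, 1: 1.0}
fine_codewords = {0: [-0.2, 0.0, 0.2], 1: [0.8, 1.0, 1.2]}
# the middle frame's codeword is refined toward its neighbours
decoded = context_dependent_decode([0, 1, 1], coarse_codebook, fine_codewords)
```

The client still transmits one coarse index per frame; only the server-side mapping from index to representative value becomes one-to-many, which is why the bit rate and client complexity are unchanged.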
All the above claims about speech recognition have been verified by experiments in the Aurora 2 testing environment, and significant performance improvements over conventional approaches have been achieved for both robust and distributed speech recognition. In addition, we also apply the concept of dynamic quantization to image features for photograph retrieval. Quantization with dynamic partition cells reduces the mismatch in pixel value distributions between different cameras, so photos taken with different cameras are more easily retrieved. Quantization with dynamic representative codewords emphasizes the more important color bins and texture features, so differences between photos along the more discriminative feature dimensions are well preserved through the quantization process. Experimental results show that dynamic quantization of image features significantly improves photo retrieval.
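A minimal sketch of dynamic partition cells applied to image features is given below. The function name and parameters are illustrative, not the dissertation's implementation: per-image quantile binning is used so that the resulting "visual word" indices depend only on the rank of each feature value within the image, which makes them insensitive to monotonic shifts in the pixel value distribution between cameras.

```python
import numpy as np

def dynamic_visual_words(features, bits=2):
    """Dynamic quantization of image features (sketch): the partition
    cells are the per-image quantiles of the feature values, so each
    value is mapped to a 'visual word' index by its rank bucket rather
    than by fixed absolute thresholds."""
    levels = 2 ** bits
    edges = np.quantile(features, np.linspace(0, 1, levels + 1)[1:-1])
    return np.searchsorted(edges, features)

# usage: 16 evenly spaced feature values fall into 4 equal rank buckets
feats = np.arange(16, dtype=float)
words = dynamic_visual_words(feats, bits=2)
```

Because the cell boundaries are recomputed per image, two photos of the same scene whose pixel distributions differ only by a camera-dependent monotonic mapping produce the same word indices, which is the mismatch-reduction effect described above.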

