
小波理論於語音訊號增強及特徵壓縮

Wavelet Speech Enhancement and Feature Compression

Advisor: 蘇柏青

Abstract


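Speech is the simplest and most important means of communication between people and between people and machines. In practice, however, speech is easily corrupted by environmental noise, which prevents it from conveying information effectively. To improve the performance of communication interfaces in noisy environments, this thesis applies the discrete wavelet packet transform (DWPT) to two important speech signal processing techniques, speech enhancement (SE) and feature compression (FC), and examines system performance in two corresponding parts.

The first part uses the DWPT to design an advanced SE technique and compares it with conventional methods based on the short-time Fourier transform (STFT), in order to assess whether the DWPT is a viable alternative to the STFT. Conventional SE techniques use the STFT to obtain a sequence of spectral features and then enhance those features to obtain a cleaner spectrum. However, the phase of this enhanced spectrum is still taken from the noisy speech spectrum. In other words, the phase of the speech signal is left unprocessed, which limits the noise-reduction performance of the SE system. To address this phase distortion, the thesis extracts speech features with the DWPT and uses them as the input to the SE system, so that the system operates directly on time-domain signals. Two SE techniques are used to test the DWPT: nonnegative matrix factorization (NMF) and robust principal component analysis (RPCA). The proposed DWPT-based SE consists of three steps: (1) the DWPT decomposes a time-domain speech signal into several subbands, each of which is itself a time-domain signal and differs from the others only in its frequency content; (2) NMF or RPCA extracts the clean speech component of each subband; (3) the inverse DWPT synthesizes the enhanced speech signal. The method is evaluated on the Mandarin hearing in noise test (MHINT) corpus. Experimental results show that, compared with conventional STFT-based SE, the DWPT-based SE system provides better speech quality and intelligibility and reduces the signal distortion caused by the conventional methods.

The second part applies the DWPT to the compression of speech features, with the aim of improving the recognition rate of robust distributed speech recognition (DSR) systems in noisy environments. According to how the signal is processed, a DSR system is divided into a front-end client and a back-end server. The front end extracts and compresses speech features and sends them over a transmission channel to the back end for recognition. The thesis proposes a feature compression technique, suppression by selecting wavelets (SSW), which reduces memory usage and hardware requirements while maintaining or improving the recognition performance of the DSR system. SSW proceeds as follows: (1) on the client, the DWPT decomposes the speech feature sequence along the time axis into two sub-sequences, which contain the high-frequency and low-frequency components of the original features; (2) the low-frequency sub-sequence is kept and the high-frequency sub-sequence is discarded, which compresses the features; (3) the compressed features are sent to the back end for further processing; (4) on the server, the received compressed features are first normalized, then decompressed to the original feature size by the inverse wavelet transform, and finally compensated by a simple post-filter to reduce the over-smoothing effects that decompression may introduce. The DWPT involves filtering and down-sampling, so it can effectively resolve the temporal characteristics of the feature sequence, and the down-sampling further reduces the amount of data, which is what achieves the compression. The feature compression experiments were conducted on Aurora-4 and on the Mandarin Chinese news corpus (MATBN). The results show that, compared with conventional noise-robustness techniques, the proposed SSW algorithm effectively improves the recognition rate while providing a compression rate of about 50%, which confirms that the method is well suited to DSR systems.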
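
The three-step enhancement pipeline of the first part (subband decomposition, per-subband enhancement, resynthesis) can be illustrated with a minimal sketch. This is not the thesis implementation. It assumes a Haar wavelet and a full packet tree, neither of which the abstract specifies. The `enhance_band` callback is a hypothetical placeholder for the point where NMF or RPCA would be applied, and all function names are illustrative.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_split(x):
    """One Haar analysis step: low-pass and high-pass halves, each down-sampled by 2."""
    low = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return low, high

def haar_merge(low, high):
    """Exact inverse of haar_split."""
    x = []
    for a, d in zip(low, high):
        x.append((a + d) / SQRT2)
        x.append((a - d) / SQRT2)
    return x

def dwpt(x, levels):
    """Full wavelet packet tree: returns 2**levels time-domain subband signals.

    len(x) must be divisible by 2**levels. Subbands come out in tree order,
    which is not the same as increasing frequency order.
    """
    bands = [list(x)]
    for _ in range(levels):
        nxt = []
        for b in bands:
            nxt.extend(haar_split(b))
        bands = nxt
    return bands

def idwpt(bands):
    """Inverse packet transform: merge sibling subbands until one signal remains."""
    bands = [list(b) for b in bands]
    while len(bands) > 1:
        bands = [haar_merge(bands[i], bands[i + 1]) for i in range(0, len(bands), 2)]
    return bands[0]

def enhance(x, levels, enhance_band):
    """Steps (1)-(3): decompose, process each subband, resynthesize.

    enhance_band is a hypothetical stand-in for an NMF or RPCA enhancer.
    """
    return idwpt([enhance_band(b) for b in dwpt(x, levels)])

# Identity "enhancer": checks that analysis plus synthesis adds no distortion.
x = [math.sin(0.3 * n) for n in range(64)]
y = enhance(x, 3, lambda band: band)
err = max(abs(a - b) for a, b in zip(x, y))
```

With the identity callback, `err` is at the level of floating-point rounding, which reflects the property that the transform pair itself introduces no distortion. Any change in the output then comes only from the per-subband enhancer.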

Parallel Abstract


Speech is the most essential communication interface for human-human and human-computer interaction. In real-world scenarios, the effectiveness of this communication can be seriously degraded by environmental noise. To address this issue, this thesis investigates the use of the discrete wavelet packet transform (DWPT) for speech enhancement (SE) and feature compression (FC). In the first part of the thesis, we apply the DWPT to design an advanced SE approach. Most conventional SE methods use a sequence of spectral features as a compact representation of the raw waveform. One major problem with this approach is that the phase of the noisy speech is used directly as the phase of the enhanced speech when the enhanced waveform is reconstructed. Because the phase of the noisy speech and that of the clean speech can differ, this step can distort the reconstructed waveform. To address this issue, we propose using the DWPT to form a different type of feature representation for SE, and we apply it with two SE approaches: nonnegative matrix factorization (NMF) and robust principal component analysis (RPCA). In brief, the DWPT first splits a time-domain speech signal into a series of subband signals without introducing any distortion. NMF or RPCA is then used to extract the speech component of each subband. Finally, the enhanced subband signals are combined by the inverse DWPT to reconstruct a noise-reduced time-domain signal. We evaluate the proposed method on the Mandarin hearing in noise test (MHINT) task. Experimental results show that the method improves speech quality and intelligibility and outperforms conventional STFT-based methods.

In the second part of the thesis, we apply the DWPT to derive an advanced FC approach for robust distributed speech recognition (DSR). DSR splits the processing of data between a mobile device and a network server. At the front end, features are extracted and compressed for transmission over a wireless channel to a back-end server, where the incoming stream is received and reconstructed for recognition. We propose an FC algorithm termed suppression by selecting wavelets (SSW) for DSR, which minimizes memory and device requirements while maintaining or even improving recognition performance. At the client terminal, SSW first applies the DWPT to split the incoming speech feature sequence into two temporal sub-sequences. FC is achieved by keeping the low (modulation) frequency sub-sequence and discarding its high-frequency counterpart. The low-frequency sub-sequence is then transmitted across the network to the server, where feature statistics normalization is applied. Wavelets are well suited to resolving the temporal properties of the feature sequence, and the down-sampling step in the DWPT reduces the amount of data at the terminal before transmission, which can be interpreted as data compression. Once the compressed features arrive at the server, the feature sequence is enhanced by statistics normalization, reconstructed with the inverse DWPT, and compensated with a simple post-filter to alleviate over-smoothing effects from the compression stage. Results on a standard robustness task (Aurora-4) and on a Mandarin Chinese news corpus (MATBN) show that SSW outperforms conventional noise-robustness techniques while providing a compression rate of nearly 50% during the transmission stage of DSR systems.
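
The SSW steps described above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: it uses a single-level Haar wavelet, takes per-dimension mean and variance normalization as the "statistics normalization" (the abstract does not say which statistics are used), and omits the post-filter because its form is not given. The function names and the toy feature values are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def ssw_compress(frames):
    """Client side: Haar low-pass along the time axis, down-sampled by 2.

    frames is a list of feature vectors, one per time step. The high-frequency
    sub-sequence is never computed, since SSW discards it.
    """
    if len(frames) % 2:
        frames = frames + [frames[-1]]  # pad to even length by repeating the last frame
    return [[(a + b) / SQRT2 for a, b in zip(frames[t], frames[t + 1])]
            for t in range(0, len(frames), 2)]

def normalize(frames):
    """Server side: per-dimension mean and variance normalization (assumed choice)."""
    n, dims = len(frames), len(frames[0])
    out = [row[:] for row in frames]
    for d in range(dims):
        col = [row[d] for row in frames]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0  # guard constant columns
        for t in range(n):
            out[t][d] = (col[t] - mean) / std
    return out

def ssw_decompress(low):
    """Server side: inverse Haar step with the discarded high band set to zero.

    Each received frame expands into two identical frames, which is the
    over-smoothing that a post-filter would then have to compensate.
    """
    out = []
    for row in low:
        frame = [v / SQRT2 for v in row]
        out.append(frame)
        out.append(frame[:])
    return out

# Toy feature sequence: 4 frames, 2 feature dimensions (made-up values).
frames = [[1.0, 10.0], [3.0, 10.0], [5.0, 14.0], [7.0, 18.0]]
sent = ssw_compress(frames)                   # half as many frames are transmitted
restored = ssw_decompress(sent)               # back to the original number of frames
normalized = ssw_decompress(normalize(sent))  # full server-side path without post-filter
```

In this sketch the transmitted sequence has half as many frames as the input, which corresponds to the roughly 50% compression reported in the abstract. Each restored pair of frames equals the average of the original pair, which makes the over-smoothing effect visible.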

