Novel Methods and Feature Parameters for Unsupervised Audio Change Detection

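Audio segmentation can be divided into two parts, speech segmentation and environmental sound segmentation. Its goal is to cut an audio stream into segments, each containing only a single speaker or a single environmental sound. For speech segmentation, this thesis proposes a new concept that recasts conventional speech segmentation as a speaker verification problem. To address the shortage of training data, the models are trained with support vector machines (SVMs). Because SVM training is time-consuming, we first use the simpler generalized likelihood ratio (GLR) in a first stage to find possible change points, and in a second stage confirm them with our proposed SVM-based adjacent window similarity algorithm, thereby reducing computation time. Experimental results show that the proposed audio segmentation method outperforms the conventional Bayesian information criterion (BIC) algorithm. As for audio features, we use Mel-frequency cepstral coefficients (MFCCs) for speech; because environmental sounds vary more widely, we propose a non-uniform scale-frequency map feature, which decomposes the audio signal with the matching pursuit algorithm. Experimental results on environmental sound segmentation show that the proposed feature is more noise-robust and more discriminative than MFCCs.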
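The first-stage candidate search described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes each window is modeled by a single diagonal-covariance Gaussian, and the function names, window length, step size, and the mean-plus-one-standard-deviation threshold are all hypothetical choices made for illustration. The second-stage SVM confirmation is not shown.

```python
import math
import random


def _half_n_logvar(window):
    """Return (N/2) * sum over dimensions of the log ML variance of a window."""
    n = len(window)
    dims = len(window[0])
    total = 0.0
    for d in range(dims):
        col = [frame[d] for frame in window]
        mean = sum(col) / n
        # Small variance floor (an assumption) avoids log(0) on constant data.
        var = sum((v - mean) ** 2 for v in col) / n + 1e-8
        total += math.log(var)
    return 0.5 * n * total


def glr_distance(left, right):
    """Log GLR distance: one pooled Gaussian versus one Gaussian per window."""
    return (_half_n_logvar(left + right)
            - _half_n_logvar(left)
            - _half_n_logvar(right))


def glr_candidates(frames, win=100, step=10):
    """Return frame indices whose GLR score is a local peak above a threshold."""
    positions = list(range(win, len(frames) - win + 1, step))
    scores = [glr_distance(frames[t - win:t], frames[t:t + win])
              for t in positions]
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    threshold = mean + std  # heuristic threshold, not taken from the thesis
    candidates = []
    for i, s in enumerate(scores):
        prev_s = scores[i - 1] if i > 0 else -math.inf
        next_s = scores[i + 1] if i + 1 < len(scores) else -math.inf
        if s > threshold and s >= prev_s and s >= next_s:
            candidates.append(positions[i])
    return candidates


if __name__ == "__main__":
    # Synthetic 4-dimensional "features" whose mean shifts at frame 300.
    rng = random.Random(0)
    frames = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(300)]
    frames += [[rng.gauss(3.0, 1.0) for _ in range(4)] for _ in range(300)]
    print(glr_candidates(frames))
```

In a two-stage scheme of this kind, only the frames returned by `glr_candidates` would be passed on to the more expensive confirmation step, which is where the saving in computation comes from.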

Keywords

speaker segmentation; speaker change detection

Parallel Abstract

Audio segmentation can be divided into two categories: speech segmentation and environmental sound segmentation. It divides an audio stream into segments, each containing only one speaker or one environmental sound. For speaker segmentation, this thesis proposes a new concept that turns the traditional speaker change detection problem into a speaker verification problem. To address the problem of insufficient training data, we use support vector machines (SVMs) to train the speaker models. Because SVM training is computationally expensive, we adopt a two-stage search strategy. In the first stage, the generalized likelihood ratio (GLR) is used to find change point candidates. In the second stage, we confirm them with the proposed SVM-based adjacent window similarity criterion. In the experiments, the proposed SVM-based adjacent window similarity criterion performs better than the conventional Bayesian information criterion (BIC). As for acoustic features, we use Mel-frequency cepstral coefficients (MFCCs) for speaker segmentation. For environmental sound, we propose a feature set based on the non-uniform scale-frequency map (SFM). This feature is obtained by decomposing an audio signal with the matching pursuit algorithm. Experimental results demonstrate that the proposed non-uniform SFM-based feature set is more noise-robust than MFCCs in environmental sound segmentation.
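The matching pursuit decomposition mentioned above can be sketched in a few lines. This is only an illustration of the generic greedy algorithm: the cosine dictionary, the function names, and the iteration count are assumptions, and the abstract does not specify which atoms the thesis uses or how the non-uniform SFM feature is built from the selected atoms.

```python
import math


def cosine_dictionary(n_samples, n_atoms):
    """Build unit-norm cosine atoms (a hypothetical dictionary for illustration)."""
    atoms = []
    for k in range(1, n_atoms + 1):
        atom = [math.cos(2.0 * math.pi * k * n / n_samples)
                for n in range(n_samples)]
        norm = math.sqrt(sum(v * v for v in atom))
        atoms.append([v / norm for v in atom])
    return atoms


def matching_pursuit(signal, atoms, n_iter=5):
    """Greedily pick the atom most correlated with the residual, then subtract it."""
    residual = list(signal)
    decomposition = []  # list of (atom index, coefficient) pairs
    for _ in range(n_iter):
        projections = [sum(r * a for r, a in zip(residual, atom))
                       for atom in atoms]
        best = max(range(len(atoms)), key=lambda i: abs(projections[i]))
        coef = projections[best]
        residual = [r - coef * a for r, a in zip(residual, atoms[best])]
        decomposition.append((best, coef))
    return decomposition, residual


if __name__ == "__main__":
    atoms = cosine_dictionary(64, 8)
    # Toy signal built from two atoms of the dictionary.
    signal = [2.0 * a - 0.5 * b for a, b in zip(atoms[3], atoms[6])]
    decomposition, residual = matching_pursuit(signal, atoms, n_iter=2)
    print(decomposition)
```

Each iteration records which atom was selected and with what coefficient; a feature of the kind described would presumably be derived from the scale and frequency of those selected atoms.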

Parallel Keywords

speaker segmentation; speaker change detection

