Pitch Tracking Applied to Monaural Speech and Singing Voice Separation

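With the spread of smartphones, speech and multimedia applications have grown rapidly. For speech, removing background noise is an important application: noise degrades both speech quality and recognition results, so separating the target speech has become a necessary task. For multimedia, music retrieval is regarded as one of the important topics, for example lyric synchronization systems for karaoke. To that end, making good use of singing voice information is an important research direction. However, when an album is produced in a recording studio, the singing voice is usually mixed with the musical accompaniment and the clean vocal is generally unavailable, so the same separation problem arises. This thesis uses pitch extraction as the basis for both speech separation and singing voice separation. For speech separation, the scenario of speech mixed with speech is considered, and an algorithm for extracting two pitch values is proposed. For singing voice separation, the extracted singing pitch is used as the basis for separation. During pitch extraction, pitch tracking improves pitch accuracy by defining a cost function between frames. Separation experiments show that, for speech separation, the proposed algorithm outperforms the widely used Hu-Wang algorithm on male-female mixtures under both subjective and objective evaluation, whereas the Hu-Wang algorithm performs better on male-male mixtures; possible reasons and directions for improvement are also discussed. For singing voice separation, the proposed algorithm improves performance under objective evaluation compared with three recently proposed algorithms.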

Keywords

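monaural speech separation; monaural singing voice separation; pitch estimation and tracking; Viterbi algorithm; number of pitches; singing voice detection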

Parallel Abstract

Since smartphones are now ubiquitous, the demand for speech and multimedia applications is growing rapidly. For speech applications, noise reduction is one of the most in-demand techniques, because noise dramatically degrades both speech quality and speech recognition performance. Separating target speech from interference in a contaminated recording is therefore a popular research topic. For multimedia, music information retrieval is needed in many applications, for example the synchronization of singing voice with lyrics. However, singing voice is almost always mixed with background music when albums are produced in the studio, so vocal/music separation as a post-processing step is also in demand. In this thesis, pitch is used as the basic feature for both speech separation and singing voice separation. For speech separation, the scenario of speech mixed with speech is considered, and an algorithm that extracts two pitch values is proposed. For singing voice separation, a system that extracts the singing voice is proposed. During pitch extraction, temporal pitch tracking is also used to improve the accuracy of the pitch values estimated in each frame. Experimental results show that the proposed speech separation algorithm performs better than the Hu-Wang system on male-female speech mixtures under both objective and subjective performance measures, while the Hu-Wang system performs better on male-male mixtures. Experimental results also show that the proposed singing voice separation algorithm performs better than three other systems under an objective performance measure.

Parallel Keywords

monaural speech separation; monaural vocal/music separation; pitch estimation and tracking; Viterbi algorithm; pitch number; singing voice detection
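The abstract describes pitch tracking that uses a cost function between frames, and the keywords name the Viterbi algorithm. The sketch below illustrates that general idea only; it is not the thesis's actual method. The function name `track_pitch`, the log-frequency jump penalty, and the `jump_weight` parameter are all assumptions made for illustration, as are the candidate pitches and local costs in the example.

```python
import math


def track_pitch(candidates, local_costs, jump_weight=1.0):
    """Pick one pitch per frame by minimizing local cost plus inter-frame cost.

    candidates[t][i]  : i-th candidate pitch (Hz) in frame t
    local_costs[t][i] : cost of that candidate on its own (lower is better)
    jump_weight       : hypothetical weight on the inter-frame cost, which is
                        assumed here to be |log2(f_t / f_{t-1})| (octaves)
    """
    n_frames = len(candidates)
    if n_frames == 0:
        return []

    # best[t][i]: minimum total cost of any path ending at candidate i of frame t
    best = [list(local_costs[0])]
    # back[t][i]: index of the previous-frame candidate on that best path
    back = [[-1] * len(candidates[0])]

    for t in range(1, n_frames):
        row_cost, row_back = [], []
        for i, f_cur in enumerate(candidates[t]):
            best_prev, best_val = 0, math.inf
            for j, f_prev in enumerate(candidates[t - 1]):
                transition = jump_weight * abs(math.log2(f_cur / f_prev))
                value = best[t - 1][j] + transition
                if value < best_val:
                    best_prev, best_val = j, value
            row_cost.append(best_val + local_costs[t][i])
            row_back.append(best_prev)
        best.append(row_cost)
        back.append(row_back)

    # Backtrack from the cheapest candidate in the final frame.
    i = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for t in range(n_frames - 1, -1, -1):
        path.append(candidates[t][i])
        i = back[t][i]
    path.reverse()
    return path


# Made-up example: the middle frame's cheapest candidate is an octave error.
candidates = [[200, 100], [205, 410], [210, 105]]
local_costs = [[0.2, 0.5], [0.5, 0.2], [0.2, 0.5]]
smooth = track_pitch(candidates, local_costs, jump_weight=1.0)
framewise = track_pitch(candidates, local_costs, jump_weight=0.0)
```

With `jump_weight=0.0` the inter-frame cost vanishes and the tracker reduces to choosing the cheapest candidate in each frame independently, which keeps the octave jump in the middle frame. With a positive weight, the smoother contour wins because the jump penalty outweighs the small difference in local cost.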

