
Monaural Singing Voice Separation from Music Accompaniment

Advisor: 張智星 (Jyh-Shing Roger Jang)

Abstract


Separating the singing voice from monaural music is an extremely challenging problem. While pitch-based singing voice separation methods have made considerable progress, little attention has been paid to their inability to separate the unvoiced parts of the singing voice. Unlike the voiced parts, unvoiced sounds lack a harmonic structure, and their energy is usually weaker than that of the voiced parts, so they easily blend with the accompaniment and are difficult to separate. In this dissertation, we propose a method to detect and separate the unvoiced singing voice from the music accompaniment. We also propose a new singing pitch extraction algorithm to improve the separation of the voiced parts. The proposed method follows the framework of computational auditory scene analysis (CASA), which consists of a segmentation stage and a grouping stage. In the segmentation stage, the input song signal is decomposed into many small sensory elements at different time and frequency resolutions, and the sensory elements belonging to the unvoiced singing voice are identified with Gaussian mixture models. Experimental results show that the separated singing voice is significantly improved in its unvoiced parts.

On the other hand, since most of the singing voice is voiced, target pitch detection is the key technique that determines the separation performance of a CASA system. Unfortunately, robust target pitch detection is very difficult, especially against non-stationary background interference such as music accompaniment. This dissertation also proposes a tandem algorithm that detects the pitch and separates the voiced singing voice jointly. Rough pitches are first estimated, and the singing voice is separated by considering both spectral harmonicity and temporal continuity; the separated voice and the detected pitches are then used to improve each other iteratively. To further improve the tandem algorithm on musical recordings, we propose a pitch trend detection algorithm that estimates, frame by frame, the most plausible range of the singing pitch. This technique greatly reduces the number of pitch errors caused by instruments or by the overtones of the singing voice. Systematic evaluation shows that the tandem algorithm clearly outperforms previous algorithms in both pitch detection and singing voice separation. Combined with the unvoiced separation method, we obtain a complete CASA-based singing voice separation system.

Finally, to address the lack of a public, large-scale corpus for singing voice separation research, we constructed the MIR-1K (Multimedia Information Retrieval lab, 1000 song clips) corpus. In MIR-1K, the singing voice and the accompaniment of every recording are stored in separate channels, and every clip comes with manually labeled pitch values, the positions and types of unvoiced sounds, vocal/non-vocal segments, and lyrics, as well as speech recordings of the lyrics, which makes the corpus useful for a wide range of tasks.

Parallel Abstract (English)


Monaural singing voice separation is an extremely challenging problem. While pitch-based inference methods have led to considerable progress in separating the voiced singing voice, little attention has been paid to the inability of such methods to separate the unvoiced singing voice, owing to its inharmonic structure and weaker energy. In this dissertation we propose a systematic approach to identify and separate the unvoiced singing voice from the music accompaniment. The proposed system follows the framework of computational auditory scene analysis (CASA), which consists of a segmentation stage and a grouping stage. In the segmentation stage, the input song signal is decomposed into small sensory elements at different time-frequency resolutions. The unvoiced sensory elements are then identified by Gaussian mixture models. Experimental results demonstrate that the quality of the separated singing voice is improved for the unvoiced part.

On the other hand, target pitch detection is key to the performance of a CASA system, since most of the singing voice is voiced. Unfortunately, it is difficult to detect the target pitch robustly, especially in mixtures with non-stationary, harmonic interference such as music. This dissertation therefore investigates a tandem algorithm that estimates the singing pitch and separates the singing voice jointly and iteratively. Rough pitches are first estimated and then used to separate the target singer by considering harmonicity and temporal continuity. To enhance the performance of the tandem algorithm on musical recordings, we propose a trend estimation algorithm that detects the pitch range of the singing voice in each time frame.
The detected trend substantially reduces the difficulty of singing pitch detection by removing a large number of wrong pitch candidates produced either by musical instruments or by the overtones of the singing voice. Systematic evaluation shows that the tandem algorithm outperforms previous systems in both pitch extraction and singing voice separation. With the proposed voiced and unvoiced separation methods combined, we obtain a complete CASA system that separates the singing voice from the music accompaniment. Moreover, to address the lack of a publicly available dataset for singing voice separation, we have constructed a corpus called MIR-1K (Multimedia Information Retrieval lab, 1000 song clips), in which all singing voices and music accompaniments were recorded separately. Each song clip comes with human-labeled pitch values, unvoiced sounds, vocal/non-vocal segments, and lyrics, as well as a speech recording of the lyrics.
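The unvoiced-detection step described above can be illustrated with a minimal sketch. This is not the dissertation's implementation: the feature values and the single-component diagonal Gaussians below are hypothetical stand-ins for the multi-component GMMs trained on features of cochleagram units; only the likelihood-comparison idea carries over.

```python
import math

def gauss_loglik(x, mean, var):
    """Log-likelihood of feature vector x under a diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

# Hypothetical per-class models; the real system trains multi-component
# GMMs on features extracted from time-frequency (cochleagram) units.
unvoiced_model = {"mean": [2.0, 2.0], "var": [1.0, 1.0]}
accomp_model = {"mean": [-2.0, -2.0], "var": [1.0, 1.0]}

def is_unvoiced(feat):
    """Label a unit as unvoiced singing if it is more likely under the
    unvoiced model than under the accompaniment model."""
    return gauss_loglik(feat, **unvoiced_model) > gauss_loglik(feat, **accomp_model)

print(is_unvoiced([1.8, 2.2]))    # True: near the unvoiced cluster
print(is_unvoiced([-1.5, -2.0]))  # False: near the accompaniment cluster
```

Units labeled unvoiced would then be grouped into the singing-voice stream in the grouping stage.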
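The tandem idea, estimating pitch and separating the voice jointly and refining each with the other, can be sketched as a toy fixed-point loop. Every concrete detail here (the candidate frequencies, the median-based pitch picker, the 6% masking tolerance) is invented for illustration; the actual algorithm operates on time-frequency masks with harmonicity and temporal-continuity constraints.

```python
def estimate_pitch(frames):
    """Toy pitch tracker: take the median frequency candidate per frame."""
    return [sorted(cands)[len(cands) // 2] for cands in frames]

def separate(mixture, pitch):
    """Toy separation: keep only candidates within ~6% (about a semitone)
    of the current pitch track; fall back to all candidates if none survive."""
    return [[f for f in cands if abs(f - p) / p < 0.06] or cands
            for cands, p in zip(mixture, pitch)]

# Each frame holds a vocal pitch (~220-230 Hz) mixed with interfering
# candidates from instruments and vocal overtones (hypothetical numbers).
mixture = [[110.0, 220.0, 330.0],
           [112.0, 225.0, 440.0],
           [115.0, 230.0, 345.0]]

# Tandem loop: rough pitch -> separation -> refined pitch -> ...
pitch = estimate_pitch(mixture)
for _ in range(3):
    voice = separate(mixture, pitch)
    pitch = estimate_pitch(voice)

print(pitch)  # [220.0, 225.0, 230.0]
```

The trend estimation step plays a similar filtering role one level up: it restricts, per frame, which pitch candidates are plausible for a singing voice before the loop above ever sees them.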

