
Lyrics-to-Audio Alignment of Chinese Pop Songs and Rap Songs

Advisor: Shyh-Kang Jeng (鄭士康)

Abstract


The system developed in this thesis takes two inputs: a lyrics text file and a song wav file. Its goal is to automatically mark the lyrics with the times at which they are sung, so that the lyrics can be displayed in synchrony during playback. Forced alignment is the core of the system, built on Hidden Markov Models (HMMs): the HMMs are first trained on speech data and then adapted with Maximum a Posteriori (MAP) adaptation so that they better match the acoustic background of the test songs, that is, singing voice over music.

Before forced alignment can be run, the lyrics and the song must be preprocessed, and an initial set of HMMs is needed. For the lyrics, we first perform word segmentation and then look up each word's phone sequence in a lexicon, which yields the phone sequence of the whole song. For the audio, we use HTK to extract Mel-scale Frequency Cepstral Coefficients (MFCCs) directly from the wav file.

The initial models are trained on the studio anchor speech recorded from November 2001 to December 2002 in the Mandarin Chinese Broadcast News Corpus (MATBN), giving 151 HMMs: 112 initial models, 38 final models, and one silence model. The 112 initial and 38 final models together are called the speech model, and the combination of the speech model and the silence model is called the spoken voice model (SpoModel).
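As an illustration of the lyrics preprocessing described above, the following minimal Python sketch segments a line of lyrics into words and expands each word into phones through a lexicon. The jieba segmenter and the toy initial/final lexicon are assumptions made for the example; the thesis does not state which segmenter or lexicon format it used.

    # Minimal sketch of the lyrics preprocessing step: word segmentation,
    # then lexicon lookup to expand each word into its initial/final phone
    # sequence. jieba and the toy lexicon below are illustrative
    # assumptions, not the tools actually used in the thesis.
    import jieba

    # Hypothetical lexicon: word -> phone sequence (Mandarin initials/finals).
    LEXICON = {
        "我": ["uo"],                     # zero-initial syllable: final only
        "愛": ["ai"],
        "唱歌": ["ch", "ang", "g", "e"],
    }

    def lyrics_to_phones(line):
        """Segment one lyrics line into words, then concatenate the phone
        sequences of the words found in the lexicon."""
        phones = []
        for word in jieba.cut(line):
            if word in LEXICON:
                phones.extend(LEXICON[word])
            # Out-of-vocabulary words would need grapheme-to-phone rules.
        return phones

    # e.g. ['uo', 'ai', 'ch', 'ang', 'g', 'e'] if segmented as 我 / 愛 / 唱歌
    print(lyrics_to_phones("我愛唱歌"))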

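On the audio side, HTK extracts MFCCs with its HCopy tool, driven by a configuration file. The sketch below uses the common HTKBook tutorial settings (25 ms Hamming window, 10 ms frame shift, 12 cepstra plus energy, deltas, and accelerations); the thesis does not report its exact feature configuration, so these values are assumptions.

    # Sketch of MFCC extraction with HTK's HCopy. Time values are in
    # units of 100 ns (250000.0 = 25 ms window, 100000.0 = 10 ms shift).
    # All parameter values are assumed HTKBook tutorial defaults.
    import subprocess

    cfg_lines = [
        "SOURCEFORMAT = WAV",
        "TARGETKIND = MFCC_E_D_A",
        "TARGETRATE = 100000.0",   # 10 ms frame shift
        "WINDOWSIZE = 250000.0",   # 25 ms analysis window
        "USEHAMMING = T",
        "PREEMCOEF = 0.97",
        "NUMCHANS = 26",
        "NUMCEPS = 12",            # 12 cepstra (+E, +D, +A -> 39 dims)
        "CEPLIFTER = 22",
    ]
    with open("hcopy.cfg", "w") as f:
        f.write("\n".join(cfg_lines) + "\n")

    # The script file maps each source wav to a target feature file.
    with open("codetr.scp", "w") as f:
        f.write("song01.wav song01.mfc\n")

    # HCopy ships with HTK and must be on PATH.
    subprocess.run(["HCopy", "-C", "hcopy.cfg", "-S", "codetr.scp"], check=True)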


With the phone sequence of the lyrics, the MFCC feature vectors of the audio, and the initial set of HMMs, forced alignment can be performed with HTK. To make the models more robust against the background music of the test songs, we apply MAP adaptation to the initial models using labeled training songs of two types, Chinese pop songs and Chinese rap songs. This yields two sets of adapted models: the pop song model (PopModel) and the rap song model (RapModel). We then run cross forced-alignment experiments with the two sets of adapted models and test songs of both genres. Analysis of the experimental results shows that differences in genre strongly affect forced-alignment accuracy.
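MAP adaptation interpolates each Gaussian's prior (speech-trained) parameters with statistics collected from the adaptation songs. The numpy sketch below shows the standard MAP update for a single Gaussian mean; the prior weight tau and the choice to adapt only the means are assumptions, since the abstract does not report them.

    # Standard MAP update for one Gaussian mean:
    #   mu_new = (tau * mu_prior + sum_t gamma_t * o_t) / (tau + sum_t gamma_t)
    # where gamma_t is the occupation probability of this Gaussian at frame t.
    # tau (the prior weight) is an assumed value; the thesis does not report it.
    import numpy as np

    def map_update_mean(mu_prior, frames, gammas, tau=10.0):
        """frames: (T, D) adaptation features; gammas: (T,) occupancies."""
        occ = gammas.sum()
        weighted_sum = gammas @ frames           # sum_t gamma_t * o_t, shape (D,)
        return (tau * mu_prior + weighted_sum) / (tau + occ)

    # With little adaptation data the mean stays near the prior; with a
    # large total occupancy it moves toward the data mean.
    mu0 = np.zeros(39)                           # 39-dim MFCC_E_D_A mean
    obs = np.random.randn(500, 39) + 1.0         # stand-in adaptation frames
    post = np.full(500, 0.2)                     # stand-in occupancies
    print(map_update_mean(mu0, obs, post)[:3])

Forced alignment itself is typically run with HTK's HVite in alignment mode. The invocation below follows the pattern of the HTKBook tutorial; all file names are placeholders, and the exact options used in the thesis are not given in the abstract.

    # Sketch of forced alignment with HVite. Given the adapted HMMs, the
    # word-level lyrics transcription, and the MFCC files, HVite outputs
    # time-stamped phone boundaries. File names are placeholders.
    import subprocess

    subprocess.run([
        "HVite",
        "-a",                 # alignment mode: align against a transcription
        "-m",                 # include model (phone) boundaries in the output
        "-b", "silence",      # allow an optional silence word between words
        "-C", "hvite.cfg",    # feature configuration
        "-H", "hmmdefs",      # adapted HMM set (e.g. PopModel or RapModel)
        "-I", "lyrics.mlf",   # word-level transcription of the lyrics
        "-S", "songs.scp",    # list of MFCC feature files to align
        "-i", "aligned.mlf",  # output: time-stamped labels
        "dict",               # pronunciation lexicon
        "phonelist",          # list of HMM names
    ], check=True)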
