
以隱藏式馬可夫模型為基礎之哼唱轉譜演算法

A Humming Transcription Algorithm Based on Hidden Markov Models

Advisor: 陳中平
Co-advisor: 張智星 (Jyh-Shing Jang)

Abstract


The core problems of humming transcription are segmentation and labelling: segmentation divides the raw acoustic features into individual note segments, and labelling assigns the correct pitch to each segment. According to Ryynanen's classification, hidden Markov models (HMM) belong to the joint approaches, which decide boundaries and pitches at the same time, whereas interval-based segmentation (SiPTH; E. Molina, 2015) belongs to the cascade approaches, which decide the boundaries first and then assign the pitches. HMMs are built on training data and use probabilistic models to capture the complex idiomatic patterns of musical syntax; interval-based segmentation reasons at the level of notes, filtering out brief, drastic pitch changes to obtain better note boundaries. This study proposes a humming transcription system whose segmentation and labelling stages use an HMM trained on a corpus collected by our laboratory, combined with interval-based segmentation and a prior pitch estimate; in the experiments the system achieves a correct note detection rate of 55%. The main reason is not that interval-based segmentation finds the correct boundaries, but that the prior pitch assignment compresses the pitch range within each segment, which lowers the difficulty of the pitch-drift (out-of-tune singing) problem. For evaluation, we collected 140 humming recordings made by users without musical training and built a test corpus from them. The ground truth was produced semi-automatically: transcription experts recorded MIDI tracks, which were aligned to the original audio with dynamic time warping (DTW), and one expert then made the final manual corrections. During this process, the differences between the recordings and the expert answers highlighted how the pitch-drift phenomenon complicates pitch labelling. This study also reviews the related literature and proposes principles for pitch correction based on the tolerance for error propagation.
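To make the "joint" idea above concrete, here is a minimal sketch of frame-level Viterbi decoding over pitch states: the decoded state sequence yields both the note boundaries (where the state changes) and the pitch labels (the states themselves). The helper name viterbi_pitch_decode, the state range, the Gaussian emission width, and the self-transition probability are illustrative assumptions, not the model trained on the laboratory corpus.

# Illustrative sketch of the "joint" approach: a tiny frame-level HMM whose
# hidden states are MIDI pitches, decoded with Viterbi. State changes give the
# note boundaries and the states themselves give the labels, so segmentation
# and labelling happen together. Parameters are assumptions for illustration.
import numpy as np

def viterbi_pitch_decode(f0_midi, states=np.arange(55, 80),
                         sigma=0.7, p_stay=0.95):
    """Decode a frame-level pitch contour (fractional MIDI numbers) into a
    per-frame MIDI state sequence with a simple Viterbi pass."""
    n_states = len(states)
    # Log transition matrix: strongly prefer staying on the same pitch state.
    trans = np.full((n_states, n_states), (1.0 - p_stay) / (n_states - 1))
    np.fill_diagonal(trans, p_stay)
    log_trans = np.log(trans)
    # Log Gaussian emission likelihood of each observed frame under each state.
    obs = np.asarray(f0_midi, dtype=float)
    log_emit = -0.5 * ((obs[:, None] - states[None, :]) / sigma) ** 2
    # Viterbi recursion with backpointers.
    n_frames = len(obs)
    delta = np.zeros((n_frames, n_states))
    back = np.zeros((n_frames, n_states), dtype=int)
    delta[0] = log_emit[0]                       # uniform prior over states
    for t in range(1, n_frames):
        scores = delta[t - 1][:, None] + log_trans
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(n_states)] + log_emit[t]
    # Backtrack the best state path.
    path = np.zeros(n_frames, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(n_frames - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return states[path]

# Example: a slightly sharp A4 that wobbles up toward B4; the decoded states
# change only once, so the boundary and both labels come from the same pass.
contour = [69.2] * 15 + [70.9] * 15
print(viterbi_pitch_decode(contour))   # 15 frames of 69 followed by 15 frames of 71

In the actual system, the emission and transition probabilities are learned from the collected corpus, and the prior pitch further compresses the pitch range considered within each segment.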

Parallel Abstract (English)


Segmentation and labelling are the core problems in humming transcription. Based on features such as energy, voicing, and abrupt changes in the fundamental frequency (F0), the segmentation stage divides the whole recording into a note sequence with proper boundaries. Because the F0 sequence varies widely and is not tied to absolute tuning, the labelling stage assigns a pitch label, such as an integer MIDI note number, to each note. According to Ryynanen's classification, hidden Markov models (HMM) are among the methods that perform these two stages jointly, while SiPTH (Molina, 2015) belongs to the cascade systems, deciding boundaries and pitches sequentially. Based on corpus data, HMM methods use probability distributions to model the conventional syntax of music; viewing music as constituted of notes, SiPTH filters out unstable pitch changes within each note and obtains better note boundaries. We propose a humming transcription system in this thesis. In the segmentation and labelling stages, interval-based segmentation (SiPTH) first divides the recording into a set of notes; an HMM trained on the collected corpus then assigns a pitch label to each note. In the experiments, this method achieves a correct note rate of 55%. The main reason for this advantage is not the quality of the note boundaries but the prior pitch label: assigning a prior pitch shrinks the unstable pitch range within each note, which makes the tuning problem (out-of-tune singing) easier to handle. For the evaluation, we collected 140 songs recorded by non-professional users and produced a ground truth for each song. First, experts played and recorded the melodies on a MIDI keyboard; second, the MIDI files were aligned to the WAV files with the dynamic time warping (DTW) algorithm; finally, an expert corrected the remaining errors manually. While building the ground truth, the pitch differences between the MIDI and WAV files highlighted the tuning problem. After reviewing the related literature, we also propose principles for pitch correction based on the tolerance difference between singer and listener and on the error-propagation phenomenon.
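As a rough illustration of the semi-automatic ground-truth step, the sketch below aligns a frame-level expert pitch track to the sung pitch contour with a plain dynamic time warping implementation and reads a note onset off the warping path. The dtw_path helper, the toy pitch values, and the frame-level representation are assumptions for illustration; the thesis's actual procedure aligns the experts' MIDI recordings to the WAV files and finishes with a manual correction pass.

# Minimal sketch of DTW-based alignment between an expert pitch track and the
# sung F0 contour, both expressed as frame-level MIDI numbers. The warping path
# lets note boundaries from the expert track be mapped into the recording.
import numpy as np

def dtw_path(ref, query):
    """Classic DTW on two 1-D pitch sequences (in semitones).
    Returns the accumulated cost and the warping path as (ref_idx, query_idx) pairs."""
    n, m = len(ref), len(query)
    cost = np.abs(np.subtract.outer(ref, query))         # local cost matrix
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    # Backtrack from the end to recover the path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return acc[n, m], path[::-1]

# Toy example: the expert plays A4 then B4; the singer is slightly flat/sharp
# and holds the two notes for different lengths.
expert_midi = np.array([69.0] * 10 + [71.0] * 10)        # frame-level expert pitch
sung_midi = np.array([68.7] * 14 + [71.2] * 8)           # frame-level sung pitch
total_cost, path = dtw_path(expert_midi, sung_midi)
onset_in_sung = next(q for r, q in path if expert_midi[r] == 71.0)
print(total_cost, onset_in_sung)                         # B4 onset mapped into the recording

The absolute pitch difference is used here only as a simple local cost; any monotonic alignment cost would serve the same illustrative purpose.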

References


[3] B. Pardo, J. Shifrin, and W. Birmingham, “Name that tune: A pilot study in finding a melody from a sung query,” J. Amer. Soc. Inf. Sci. Technol., vol. 55, no. 4, pp. 283–300, 2004.
[9] E. Molina, I. Barbancho, E. Gomez, A. Barbancho, and L. Tardon, “Fundamental frequency alignment vs. note-based melodic similarity for singing voice assessment,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2013, pp. 744–748.
[13] W. Krige, T. Herbst, and T. Niesler, “Explicit transition modelling for automatic singing transcription,” J. New Music Res., vol. 37, no. 4, pp. 311–324, 2008.
[14] E. Gomez and J. Bonada, “Towards computer-assisted flamenco transcription: An experimental comparison of automatic transcription algorithms as applied to a cappella singing,” Comput. Music J., vol. 37, no. 2, pp. 73–90, 2013.
[15] E. Molina, L. J. Tardón, A. M. Barbancho, and I. Barbancho, “SiPTH: Singing transcription based on hysteresis defined on the pitch-time curve,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 2, pp. 252–263, Feb. 2015.
