
發掘具鑑別性特徵於音樂曲風/情緒分類之應用

Discovering discriminative features with applications to music genre/mood classification

Advisors: 張智星, 張俊盛

Abstract


A music piece is typically composed of a sequence of sound events, and these sound events carry both short-term and long-term temporal information. In research on automatic music genre classification, however, most text-categorization-based approaches extract only short-term temporal dependencies (such as statistics based on unigram and bigram occurrence counts) to represent the music content. In this dissertation, we propose using time-constrained sequential patterns (TSPs) as discriminative features for music genre classification. First, an automatic language identification technique converts each music piece into a sequence of hidden Markov model indices. TSP mining is then applied to these sequences to discover genre-specific TSPs, after which the occurrence frequency of each mined TSP in every song is computed. Finally, these occurrence frequencies are used to train support vector machines (SVMs) to perform the classification. Experiments on two datasets widely used in music genre classification, GTZAN and ISMIR2004Genre, show that the proposed method discovers more discriminative temporal structures and achieves better recognition accuracy than unigram- and bigram-based statistical methods.

In addition, we propose another music genre/mood classification system that combines short-term frame-based timbre features with long-term modulation spectral analysis of timbre features, using support vector machines as the classifier. The proposed system won first place in the MIREX 2011 music mood classification task. In that system, we applied conventional modulation spectral analysis to short-term timbre features to extract long-term modulation features. However, two steps in this analysis can smooth out useful modulation information and thus degrade classification performance. The first is averaging the modulation spectrograms extracted from texture windows (each composed of timbre features taken from hundreds of frames) to obtain a single representative modulation spectrogram for a piece of music. The second is computing the mean and standard deviation of the modulation spectral contrast/valley matrices (which are derived from the representative modulation spectrogram) to obtain a compact feature vector for each piece. To avoid smoothing out modulation information, this dissertation proposes extracting joint acoustic-modulation frequency features from a two-dimensional representation over acoustic frequency and modulation frequency. These joint frequency features, including the acoustic-modulation spectral contrast/valley (AMSC/AMSV) and the flatness and crest measures (AMSFM/AMSCM), are computed from the modulation spectrum of each joint acoustic-modulation frequency subband. By combining the proposed features with the modulation spectral analysis of MFCCs and statistical descriptors of short-term timbre features, this new feature set outperforms our MIREX 2011 approach on four other genre/mood datasets.
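As a rough illustration of the TSP-based pipeline above, the sketch below counts how often each mined pattern occurs in a tokenized piece under a maximum-gap constraint and feeds the resulting frequency vectors to an SVM. This is a minimal sketch, not the dissertation's implementation: the token sequences, the pattern list, the max_gap value, and the use of index gaps as a stand-in for the time constraint are all illustrative assumptions.

```python
# Hedged sketch: occurrence counting of mined time-constrained patterns in
# HMM-index sequences, followed by SVM training on the frequency vectors.
from sklearn.svm import SVC

def count_tsp(sequence, pattern, max_gap):
    """Count occurrences of `pattern` in `sequence`, requiring consecutive
    pattern elements to appear within `max_gap` positions of each other
    (index gap used here as a simple proxy for the time constraint)."""
    def search(seq_start, pat_idx):
        if pat_idx == len(pattern):
            return 1
        limit = len(sequence) if pat_idx == 0 else min(len(sequence), seq_start + max_gap)
        total = 0
        for i in range(seq_start, limit):
            if sequence[i] == pattern[pat_idx]:
                total += search(i + 1, pat_idx + 1)
        return total
    return search(0, 0)

# Illustrative data only: tokenized pieces, genre labels, and mined patterns.
sequences = [[3, 7, 7, 2, 3, 9], [1, 3, 7, 2, 2, 8], [5, 5, 9, 1, 7, 3]]
labels = [0, 0, 1]
mined_tsps = [(3, 7), (7, 2), (5, 9)]

# Occurrence-frequency feature vectors, then an SVM classifier.
X = [[count_tsp(seq, p, max_gap=2) for p in mined_tsps] for seq in sequences]
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(X))
```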

Abstract (English)


A music piece usually consists of a sequence of sound events that carry both short-term and long-term temporal information. However, in automatic music genre classification, most text-categorization-based approaches capture only local temporal dependencies (e.g., unigram- and bigram-based occurrence statistics) to represent music content. In this dissertation, we propose to use time-constrained sequential patterns (TSPs) as effective features for music genre classification. First, an automatic language identification technique is applied to tokenize each music piece into a sequence of hidden Markov model indices. TSP mining is then applied to the resulting sequences to discover genre-specific TSPs, followed by the computation of the occurrence frequency of each TSP in every music piece. Finally, these occurrence frequencies are fed into support vector machines (SVMs) to perform the classification task. Experiments conducted on two widely used datasets, GTZAN and ISMIR2004Genre, show that the proposed method can discover more discriminative temporal structures and achieve better recognition accuracy than the unigram- and bigram-based statistical approach.

In addition, we propose another music genre/mood classification system that combines short-term frame-based timbre features with long-term modulation spectral analysis of timbre features, using SVMs as the classifier. The proposed system won first place in the MIREX 2011 music mood classification task. In our submission, we performed modulation spectral analysis on short-term timbre features to extract long-term modulation features. However, two operations in this analysis are likely to smooth out useful modulation information, which may degrade classification performance. The first is averaging the modulation spectrograms extracted from texture windows (each of which is composed of timbre features extracted from hundreds of frames) to create a representative modulation spectrogram for a music clip. The second is computing the mean and standard deviation of the modulation spectral contrast/valley matrices (both computed from the representative modulation spectrogram) to obtain a compact feature vector for a music clip. To avoid smoothing out modulation information, we propose in this dissertation the use of a two-dimensional representation over acoustic frequency and modulation frequency to compute joint frequency features. These joint frequency features, including the acoustic-modulation spectral contrast/valley (AMSC/AMSV) and the flatness and crest measures (AMSFM/AMSCM), are computed from the modulation spectrum of each joint frequency subband. By combining the proposed features with the modulation spectral analysis of MFCCs and statistical descriptors of short-term timbre features, this new feature set outperforms our MIREX 2011 submission on four other genre/mood datasets.
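To make the joint acoustic-modulation frequency features more concrete, the following sketch computes contrast, valley, flatness, and crest values per joint subband from a matrix of per-frame subband energies. It is only a hedged illustration under assumed parameters: the input matrix, the number of modulation subbands (n_mod_bands), the quantile alpha, and the log/epsilon details are not taken from the dissertation.

```python
# Hedged sketch of joint acoustic-modulation frequency features in the spirit
# of AMSC/AMSV/AMSFM/AMSCM; parameters and input are illustrative assumptions.
import numpy as np

def joint_frequency_features(subband_energy, n_mod_bands=4, alpha=0.2):
    """subband_energy: (n_acoustic_bands, n_frames) short-term timbre trajectory.
    Returns one vector with contrast, valley, flatness, and crest values per
    joint acoustic-modulation frequency subband."""
    n_acoustic, _ = subband_energy.shape
    # Modulation spectrum: magnitude FFT along time for each acoustic band.
    mod_spec = np.abs(np.fft.rfft(subband_energy, axis=1))
    edges = np.linspace(0, mod_spec.shape[1], n_mod_bands + 1, dtype=int)
    feats = []
    for a in range(n_acoustic):
        for m in range(n_mod_bands):
            band = np.sort(mod_spec[a, edges[m]:edges[m + 1]])
            k = max(1, int(alpha * band.size))
            valley = np.log(band[:k].mean() + 1e-12)     # average of lowest values
            peak = np.log(band[-k:].mean() + 1e-12)      # average of highest values
            contrast = peak - valley                     # peak-valley difference
            flatness = np.exp(np.mean(np.log(band + 1e-12))) / (band.mean() + 1e-12)
            crest = band.max() / (band.mean() + 1e-12)
            feats.extend([contrast, valley, flatness, crest])
    return np.asarray(feats)

# Illustrative input: 8 acoustic subbands tracked over 512 frames.
rng = np.random.default_rng(0)
energies = rng.random((8, 512))
print(joint_frequency_features(energies).shape)  # (8 * 4 * 4,) = (128,)
```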

