  • Thesis

Audio Content Identification Based on Two-State Markov Model

Advisor: 簡福榮

Abstract


Audio content identification, or content-based music retrieval, is a novel and fascinating technology that associates an audio signal with a much shorter fingerprint and then uses the fingerprint to identify the signal. The framework consists of two phases: the database creation phase (training phase) and the identification phase (matching phase). In this thesis, a two-state Markov model (TSMM) is developed as the audio fingerprint for each audio signal, and each state is modeled by a set of Gaussian mixture probability density functions. Two audio features, the Mel-frequency cepstral coefficients (MFCC) and the spectral centroid (SC), are considered in the experiments. During the database creation phase, the fingerprints are collected and stored in a database; each database entry (fingerprint) can be linked to a tag or other metadata relevant to the corresponding audio signal. During the matching phase, an unknown audio clip is processed to extract its fingerprint, which is then compared against all the fingerprints in the database to find the one with the maximum likelihood. If a match is found, the tag or other metadata associated with the audio signal is retrieved from the database. The experimental results show that the TSMM scheme with MFCC features achieves the highest identification rates even under various distortions, such as MP3 compression, clipping, amplitude modification, and time-scale modification.
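The two-phase pipeline described above can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the spectral centroid serves as the only frame feature, each of the two states is modeled by a single Gaussian rather than a Gaussian mixture, and state decoding is greedy rather than a full forward/Viterbi pass. All function names (`spectral_centroid`, `train_fingerprint`, `identify`) are hypothetical.

```python
import numpy as np

def spectral_centroid(signal, frame_len=512):
    """Per-frame spectral centroid: the centre of mass of the magnitude spectrum."""
    window = np.hanning(frame_len)
    feats = []
    for i in range(len(signal) // frame_len):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        mag = np.abs(np.fft.rfft(frame * window))
        bins = np.arange(len(mag))
        feats.append((bins * mag).sum() / (mag.sum() + 1e-12))
    return np.array(feats)

def train_fingerprint(feats):
    """Database creation phase (simplified): split frames into two states at the
    median feature value, fit one Gaussian (mean, variance) per state, and count
    smoothed state-transition probabilities from consecutive frames."""
    states = (feats > np.median(feats)).astype(int)
    params = []
    for s in (0, 1):
        x = feats[states == s]
        if x.size == 0:                    # guard against a degenerate split
            x = feats
        params.append((x.mean(), max(x.var(), 1.0)))  # variance floor
    trans = np.full((2, 2), 1e-3)          # additive smoothing avoids log(0)
    for a, b in zip(states[:-1], states[1:]):
        trans[a, b] += 1.0
    trans /= trans.sum(axis=1, keepdims=True)
    return params, trans

def log_likelihood(feats, fingerprint):
    """Greedy decode: per frame, take the more likely state's Gaussian
    log-density and add the log transition probability from the previous state."""
    params, trans = fingerprint
    ll, prev = 0.0, None
    for x in feats:
        scores = [-0.5 * (np.log(2 * np.pi * v) + (x - m) ** 2 / v)
                  for m, v in params]
        s = int(np.argmax(scores))
        ll += scores[s]
        if prev is not None:
            ll += np.log(trans[prev, s])
        prev = s
    return ll

def identify(signal, database):
    """Matching phase: return the tag whose stored fingerprint gives the
    maximum likelihood for the unknown signal's features."""
    feats = spectral_centroid(signal)
    return max(database, key=lambda tag: log_likelihood(feats, database[tag]))
```

Because the spectral centroid is a ratio of spectral sums, it is unchanged by amplitude scaling, so this toy model already mirrors the thesis's robustness to amplitude modification; robustness to MP3 compression or time-scale modification requires the richer MFCC features and full mixture model described above.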

