Improving Audio Fingerprinting Based on Neural Networks and the Landmark Method

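Audio fingerprinting is a music retrieval technique for quickly identifying the matching piece of music from a recording. It extracts salient features from the recording and compares them against the features of the music in a database. Because recordings are often corrupted by noise, an audio fingerprint must be robust to environmental noise. Earlier approaches relied mainly on traditional algorithms, such as the landmark method proposed by Avery Wang; in recent years, approaches based on deep learning, such as Google's Now Playing, have gradually become mainstream. This study focuses on the neural network audio fingerprint proposed by Sungkyun Chang et al. The thesis first evaluates the neural network method and the landmark method on the MIREX audio fingerprinting dataset, showing that the neural network method is still less accurate than the traditional algorithm when tested on real-world recordings. It therefore proposes three methods for improving the neural network method: two-phase shuffling, improved data augmentation, and applying multiple time shifts to the query. Finally, a support vector machine (SVM) is used to combine the results of the landmark method and the neural network method. For reproducibility, the experiments use the public Free Music Archive dataset, generate query audio by adding noise, and compute retrieval accuracy separately for each noise level. The results show that the proposed improvements significantly raise the accuracy of the neural network under strong noise and make the neural network method outperform the landmark method on real-world recorded queries.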

Keywords

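music retrieval; audio fingerprinting; landmark method; contrastive learning; two-phase shuffling; data augmentation; support vector machine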

Parallel Abstract

Audio fingerprinting is a music information retrieval method for quickly recognizing the matching music from an audio recording. It first extracts salient features from the recording and then compares them with the features extracted from the music in a database. Since recordings are often contaminated with noise, an audio fingerprint must be robust to background noise. In the past, approaches to audio fingerprinting were usually based on traditional algorithms, such as the landmark method proposed by Avery Wang. Recently, methods based on deep learning, such as Google's Now Playing, have gradually become mainstream. This work focuses on the neural audio fingerprint proposed by Sungkyun Chang et al. We first evaluated the neural network method and the landmark method on the MIREX audio fingerprinting dataset and found that the neural network method is still less accurate than the traditional algorithm when tested with real-world recordings. We therefore propose three approaches to improve the neural network method: two-phase shuffling, extensive data augmentation, and applying multiple time shifts to the query. Finally, a Support Vector Machine (SVM) is used to integrate the results of the landmark method and the neural network method. To make our work reproducible, we use the public Free Music Archive dataset in our experiments and generate query audio by adding noise to this dataset. We then compute the query accuracy under different noise levels. Experiments show that our approaches significantly improve the accuracy of the neural network method under strong noise and make it perform better than the landmark method on real-world queries.

Parallel Keywords

music retrieval; audio fingerprinting; landmark method; contrastive learning; two-phase shuffling; data augmentation; SVM
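
The query-generation step mentioned in the abstract (adding noise to clean audio and grouping results by noise strength) can be illustrated with a small sketch. It assumes that noise strength is expressed as a signal-to-noise ratio (SNR) in decibels, which the abstract does not state; the function names `mix_at_snr` and `measured_snr_db`, the synthetic sine tone, the Gaussian noise, and the SNR values 0, 5, and 10 dB are all hypothetical choices made for illustration, not the thesis's actual setup.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean audio, scaled so the mixture has the requested SNR (dB)."""
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    p_clean, p_noise = power(clean), power(noise)
    if p_clean == 0 or p_noise == 0:
        raise ValueError("clean signal and noise must both have non-zero power")
    # Solve snr_db = 10 * log10(p_clean / (gain**2 * p_noise)) for gain.
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, mixture):
    """Recover the SNR of a mixture, given the clean signal it was built from."""
    residual = [m - c for m, c in zip(mixture, clean)]
    return 10 * math.log10(power(clean) / power(residual))


# Hypothetical stand-ins for a music excerpt and a noise recording:
# one second of a 440 Hz tone at 8 kHz, plus seeded Gaussian noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

# One query per assumed noise level; accuracy would be reported per level.
for snr_db in (0, 5, 10):
    query = mix_at_snr(clean, noise, snr_db)
    print(snr_db, round(measured_snr_db(clean, query), 3))
```

Lower SNR values correspond to stronger noise, so under this assumption the "strong noise" condition in the abstract would be the low end of the SNR range.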

