
Machine Learning for Source Separation of A Cappella Music

Advisor: 張智星


Abstract


In recent years, many studies have addressed the problem of speech separation, which aims to separate a recording of multiple people speaking simultaneously into the audio of each individual speaker. Audio source separation of multiple simultaneous singers, however, remains largely unexplored and challenging. This is mainly because singing voices tend to “blend” together far more than spoken ones, and multiple vocal lines often carry the same words, and potentially the same frequencies, in unison. To address these issues, we propose a new U-Net-based model designed specifically for a cappella singing separation of two singers and compare it against three state-of-the-art speech separation models. The results of our experiments vary widely. The U-Net-based network excels at separating music drawn from choir datasets, reaching a maximum mean SDR of 9.76 dB, but performs poorly on random combinations of singers. The best speech separation network separates random combinations of singers quite well, reaching a maximum mean SDR of 7.64 dB after fine-tuning, but is incapable of separating samples in which the singers sing the same lyrics simultaneously. This singing separation score is also well below the same model’s mean SDR of 9.04 dB for speech separation. These nuanced results show that singing separation is a distinct and, overall, more difficult task than speech separation. However, they also suggest that both a U-Net-based network and one built on contemporary speech separation architectures may be capable of performing well on it.
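Separation quality above is reported as mean SDR (signal-to-distortion ratio) in dB. As a rough illustration only, the NumPy sketch below computes a simplified, scale-dependent SDR and resolves the arbitrary output ordering of a two-singer separator before averaging; the thesis itself presumably uses a full BSS Eval style implementation, and all function names here are hypothetical, not the authors' code.

import numpy as np

def sdr(reference, estimate, eps=1e-8):
    # Simplified SDR in dB: ratio of reference energy to residual
    # (reference - estimate) energy; eps guards against division by zero.
    distortion = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps)
                           / (np.sum(distortion ** 2) + eps))

def mean_sdr_two_singers(refs, ests):
    # Separation networks emit the two voices in an arbitrary order, so we
    # score both assignments and keep the better one (permutation-invariant
    # evaluation), averaging over the two sources.
    direct = (sdr(refs[0], ests[0]) + sdr(refs[1], ests[1])) / 2.0
    swapped = (sdr(refs[0], ests[1]) + sdr(refs[1], ests[0])) / 2.0
    return max(direct, swapped)

# Toy usage: two sinusoidal "voices", with estimates returned swapped.
fs = 16000
t = np.arange(fs) / fs
refs = [np.sin(2 * np.pi * 220 * t), np.sin(2 * np.pi * 330 * t)]
ests = [refs[1] + 0.01 * np.random.randn(fs),
        refs[0] + 0.01 * np.random.randn(fs)]
print(f"mean SDR: {mean_sdr_two_singers(refs, ests):.2f} dB")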
