Singing voice separation aims to split a music recording into a vocal stem and an accompaniment stem. Separation can be performed in the time domain or the frequency domain; this research focuses on the latter. Deep learning has become indispensable in audio source separation. This work builds on the U-Net architecture of Ronneberger et al., which performs well on biomedical image segmentation, and trains it to segment spectrograms. First, drawing on ratio mask filter and Wiener filter theory, the existing U-Net model is modified so that abnormal spikes in the model output are corrected when they occur (accompaniment SDR, the signal-to-distortion ratio, rises from 13.805 to 14.288). Second, attention gates and self-attention are added to U-Net so that the model can learn sounds with regular rhythmic patterns (accompaniment SDR rises from 13.805 to 14.457). Third, following earlier work on spectral subtraction, the subtraction ratio of each frequency band is tuned to improve the model output; however, the ratios proposed here do not improve on those of the previous study (accompaniment SDR: baseline 13.805, previous study 14.031, this study 13.895). Fourth, model pruning is applied to U-Net while preserving as much performance as possible (model size falls from 118.9 MB to 59.8 MB; accompaniment SDR falls from 12.989 to 12.771). Fifth, model quantization parameters are tuned to limit the loss in performance (model size falls from 118.9 MB to 4.75 MB; accompaniment SDR falls from 12.989 to 11.184). The public datasets used in the experiments are MUSDB18, DSD100, MedleyDB, and iKala; the non-public dataset is Ke (Jiézòu studio).
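As a rough illustration of the ratio mask and Wiener filter ideas mentioned above, the following minimal sketch builds a soft mask for the accompaniment from two estimated magnitude spectrograms and applies it to the mixture. It is not the model or post-processing used in this research: the function names (`ratio_mask`, `apply_mask`, `sdr_db`), the `power` and `eps` parameters, and the list-of-lists spectrogram layout are all assumptions made for the example. The `sdr_db` helper is a simplified energy-ratio definition and may differ from the BSS Eval variant that evaluation toolkits typically compute.

```python
import math


def ratio_mask(vocal_mag, accomp_mag, power=1.0, eps=1e-10):
    """Soft accompaniment mask from two estimated magnitude spectrograms.

    Each spectrogram is a list of rows (frequency bins) of floats (time frames).
    power=1.0 gives a magnitude ratio mask; power=2.0 gives a Wiener-like mask.
    These names and parameters are hypothetical, chosen for illustration only.
    """
    mask = []
    for v_row, a_row in zip(vocal_mag, accomp_mag):
        mask.append([
            (a ** power) / (a ** power + v ** power + eps)
            for v, a in zip(v_row, a_row)
        ])
    return mask


def apply_mask(mixture_mag, mask):
    """Element-wise product of the mixture magnitudes and the mask."""
    return [
        [m * w for m, w in zip(m_row, w_row)]
        for m_row, w_row in zip(mixture_mag, mask)
    ]


def sdr_db(reference, estimate, eps=1e-10):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    signal = sum(s * s for s in reference)
    error = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (error + eps))
```

With `power=2.0` the mask weights each time-frequency bin by the estimated power ratio, so bins dominated by one source are pushed closer to 0 or 1 than with the magnitude ratio; either way the mask stays in the range 0 to 1, which bounds the masked output by the mixture magnitude.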