Audio source separation is a challenging research direction: it not only attracts many researchers to develop related techniques, but has also, since 2008, motivated the nearly annual Signal Separation Evaluation Campaign (SiSEC). After examining the 2018 participants and the campaign's analysis report, we found that most competitors, lacking a data compression technique that is both effective and efficient, discarded the higher-frequency content of the training data and the phase information of the spectrograms. Motivated by this observation, we developed a new deep learning model, OvertoneNet (OveNet), which introduces two new techniques: frequency 1x1 convolution layers and complex-spectrogram channels. These allow us to process full 44.1 kHz (high-resolution) audio and to exploit the overtone relations that pervade music to improve both the efficiency and the quality of model training, an advantage that other models cannot realize. Our experimental results show that our separation performance surpasses all SiSEC 2018 competitors in both objective and subjective tests, demonstrating the effectiveness of our method.
Audio source separation is a challenging topic that attracted various research teams to the Signal Separation Evaluation Campaign (SiSEC) in 2018. Most top-ranked competitors based on deep learning methods ignored the higher-frequency harmonic content and the phase information due to the lack of an efficient data compression method. We propose a new deep learning model named OvertoneNet (OveNet) that adopts two novel concepts, frequency 1x1 convolution layers and complex-spectrogram channels, to handle 44.1 kHz (Hi-Res) audio signals with a wide range of overtones. Our experimental results show that OveNet performs well in both objective and subjective evaluation of interference using limited training data from SiSEC 2018.
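To make the two named concepts concrete, the following is a minimal NumPy sketch of one plausible reading of them; the exact architecture is not specified in the abstract, so the STFT parameters, channel counts, and the interpretation of a "frequency 1x1 convolution" (a 1x1 kernel mixing channels independently at every time-frequency point, so every 44.1 kHz frequency bin is kept) are assumptions for illustration only:

```python
import numpy as np

def complex_spectrogram_channels(x, n_fft=2048, hop=512):
    """Sketch of complex-spectrogram channels: split a complex STFT into
    two real-valued channels (real, imag) so phase is preserved, instead
    of feeding magnitude only. n_fft/hop are illustrative choices."""
    n_frames = 1 + (len(x) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=-1)       # (frames, freq_bins), complex
    return np.stack([spec.real, spec.imag])   # (2 channels, frames, bins)

def freq_1x1_conv(feat, weight):
    """Assumed reading of a frequency 1x1 convolution: a 1x1 kernel that
    mixes input channels at each (frame, frequency) point independently,
    leaving the full frequency resolution untouched."""
    # feat: (C_in, frames, bins); weight: (C_out, C_in)
    return np.einsum('oc,ctf->otf', weight, feat)

# Toy 44.1 kHz signal: a 440 Hz tone plus its first overtone at 880 Hz.
sr = 44100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

feat = complex_spectrogram_channels(x)        # (2, 83, 1025)
out = freq_1x1_conv(feat, np.random.randn(8, 2))
print(feat.shape, out.shape)
```

Because the 1x1 kernel shares its weights across all frequency bins, overtone-related bins (such as the 440 Hz fundamental and its 880 Hz harmonic above) are transformed consistently, which is one way a model could exploit overtone structure without discarding high-frequency bins.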