Audio source separation is a challenging research direction: it not only attracts many researchers to develop related techniques, but has also, since 2008, motivated the nearly annual Signal Separation Evaluation Campaign (SiSEC). After examining the 2018 participants and the campaign's analysis report, we found that most competitors, lacking a data compression technique that is both effective and efficient, discarded the higher-frequency content of the training data and the phase information of the spectrograms. Motivated by this observation, we developed a new deep learning model, OvertoneNet (OveNet), which introduces two new techniques: frequency 1x1 convolution layers and complex-spectrogram channels. These allow us to process full 44.1 kHz (high-resolution) audio and to exploit the overtone relations that pervade music to improve both the efficiency and the quality of model training, an advantage that other models cannot realize. Our experimental results show that our separation performance surpasses all SiSEC 2018 competitors in both objective and subjective tests, demonstrating the effectiveness of our method.
Audio source separation is a challenging topic that attracted various research teams to the Signal Separation Evaluation Campaign (SiSEC) in 2018. Most top-ranked competitors based on deep learning methods ignored the higher-frequency harmonic content and the phase information due to the lack of an efficient data compression method. We propose a new deep learning model named OvertoneNet (OveNet) that adopts two novel concepts, frequency 1x1 convolution layers and complex-spectrogram channels, to handle 44.1 kHz (Hi-Res) audio signals with a wide range of overtones. Our experimental results show that OveNet performs well in both objective and subjective evaluation of interference using limited training data from SiSEC 2018.
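To make the two named concepts concrete, the following is a minimal NumPy sketch of one plausible reading of them; the exact architecture is not specified in the abstract, so the STFT parameters, channel counts, and the interpretation of a "frequency 1x1 convolution" (a 1x1 kernel mixing channels independently at every time-frequency point, so every 44.1 kHz frequency bin is kept) are assumptions for illustration only:

```python
import numpy as np

def complex_spectrogram_channels(x, n_fft=2048, hop=512):
    """Sketch of complex-spectrogram channels: split a complex STFT into
    two real-valued channels (real, imag) so phase is preserved, instead
    of feeding magnitude only. n_fft/hop are illustrative choices."""
    n_frames = 1 + (len(x) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=-1)       # (frames, freq_bins), complex
    return np.stack([spec.real, spec.imag])   # (2 channels, frames, bins)

def freq_1x1_conv(feat, weight):
    """Assumed reading of a frequency 1x1 convolution: a 1x1 kernel that
    mixes input channels at each (frame, frequency) point independently,
    leaving the full frequency resolution untouched."""
    # feat: (C_in, frames, bins); weight: (C_out, C_in)
    return np.einsum('oc,ctf->otf', weight, feat)

# Toy 44.1 kHz signal: a 440 Hz tone plus its first overtone at 880 Hz.
sr = 44100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

feat = complex_spectrogram_channels(x)        # (2, 83, 1025)
out = freq_1x1_conv(feat, np.random.randn(8, 2))
print(feat.shape, out.shape)
```

Because the 1x1 kernel shares its weights across all frequency bins, overtone-related bins (such as the 440 Hz fundamental and its 880 Hz harmonic above) are transformed consistently, which is one way a model could exploit overtone structure without discarding high-frequency bins.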