
Learning-Based Raw Waveform Autoencoder Denoising Method for Query by Humming Systems

Advisor: 丁建均

Abstract


The core idea of a query-by-humming (QBH) system is to retrieve a target piece of music from a database when only a fragment of the music or melody is known. The query is typically produced by the human voice, either hummed or sung, and most QBH systems pursue accuracy above all; a well-known example is Google's Hum to Search. Given a hummed fragment, that system searches the streaming platform YouTube and returns a ranked list of songs matching the input humming. With the prevalence of embedded devices such as phones, in-car systems, and tablets, QBH systems face different environments and usage requirements depending on the platform; a phone user may, for example, hear a tune in a restaurant and search for it by humming. Whereas traditional QBH systems pursue accuracy alone, they must be improved to work under environmental noise while keeping a short response time; Google's Hum to Search responds in about 3 seconds. A traditional QBH system consists of onset detection, fundamental-frequency analysis, and sequence matching. This thesis improves recent onset-detection and fundamental-frequency techniques and adds a speech-denoising stage, aiming to integrate a noise-robust QBH system with a short response time.

Speech denoising methods fall broadly into robust speech-feature extraction, such as Wiener filtering, and model-adaptation approaches based on deep learning, such as the ideal ratio mask and autoencoder variants. Most models either process the raw waveform in the time domain, suppress noise in the frequency domain, or separate speech from noise via time-frequency analysis. This thesis uses an autoencoder operating on the raw waveform to achieve denoising in a short time; the denoiser serves as the speech preprocessor for the whole QBH system. In practice, humming and ordinary speech data differ considerably, so this thesis also examines the training process; the results show that a denoising model trained on a speech corpus and tested on a humming data set is easier to train.

QBH systems are used in everyday environments such as streets and offices. A given melody is therefore accompanied by unwanted environmental signals besides the hummed melody; these signals are collectively called noise. Onset detection judges whether the current instant is a note onset from the sharp energy change at the beginning of a note, and the resulting onset positions support the subsequent matching stage. Noise, however, produces incorrect onset positions and thus system misjudgments. In this thesis, an autoencoder learns the characteristics of humming and noise, attenuates the noise energy, and separates out the humming; combined with multiple onset-detection models, this yields noise-robust onset detection.

The main task of fundamental-frequency (F0) analysis is to find the pitch of each note. Common techniques include the autocorrelation function, the harmonic spectrum and its variants, and the cepstrum. With the development of deep learning in recent years, many state-of-the-art methods estimate F0 with neural networks, but they take longer than traditional F0 analysis. In a traditional QBH system, F0 analysis assigns a pitch to each detected onset; under environmental noise, the estimated F0 may land at twice or even three times the true frequency, degrading QBH accuracy. This thesis proposes an F0 extraction method that is less sensitive to environmental noise: it combines existing methods with a deep-learning approach and takes the median, maintaining accuracy in quiet conditions with a short running time while reducing the F0 error under environmental noise.

For database matching, this thesis relies on existing methods and studies how speech denoising, onset detection, F0 analysis, and the QBH system improve together. Taking the QBH system in a quiet environment as the baseline, the noise-robust QBH system is evaluated under the environmental noise types and noise levels that users are likely to encounter.
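The evaluation described above mixes humming with environmental noise at controlled SNR levels. A minimal NumPy sketch of such mixing; the helper name `mix_at_snr` and its interface are illustrative assumptions, not code from the thesis:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that adding it to `clean` yields the target SNR in dB."""
    # Tile/truncate the noise to match the clean signal's length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Want p_clean / (scale^2 * p_noise) = 10^(snr_db / 10).
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

By construction, the power ratio between the clean component and the scaled noise component of the returned mixture equals the requested SNR exactly.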

Parallel Abstract (English)


The main concept of a query-by-humming (QBH) system is to find the target music in a database when only a fragment of music or melody is given. The query is mainly hummed or sung by the human voice. Most QBH systems strive for accuracy; a well-known example is Google's Hum to Search. Given a hummed fragment, that system searches the streaming platform YouTube and outputs a ranking of the songs that match the input humming. Because of the popularity of embedded systems (e.g., cell phones, in-car systems, tablets), QBH systems are used in different environments and under different requirements depending on the platform. For example, a cell phone user may hear a song in a restaurant and search for it with a QBH system. A traditional QBH system seeks accuracy only; it must be improved to meet the needs of operating under noise while responding in a short time. We follow the roughly 3-second response time of Google's Hum to Search. A traditional QBH system is divided into onset detection, fundamental frequency analysis, and sequence matching. In this thesis, we study recent onset-detection and fundamental-frequency techniques, modify them, and add a speech enhancement (SE) model to integrate a QBH system with noise immunity and a short response time. Speech enhancement methods are broadly divided into robust speech-feature extraction and speech-model adaptation. Wiener filtering is one of the robust feature-extraction methods, while model adaptation is based on deep learning, e.g., an LSTM with an ideal ratio mask (IRM) or an autoencoder and its variants. Most models denoise the time-domain waveform, suppress noise on a spectrogram, or separate speech from noise via time-frequency analysis.
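As a concrete illustration of the ideal ratio mask mentioned above: the oracle IRM assigns each time-frequency bin the fraction of magnitude belonging to speech, and a network is trained to predict that mask from the noisy input alone. A minimal NumPy sketch on magnitude spectrograms, using one common magnitude-ratio definition of the IRM; the function names are illustrative, not from the thesis:

```python
import numpy as np

def ideal_ratio_mask(mag_speech, mag_noise, eps=1e-8):
    """Oracle IRM: per time-frequency bin, the fraction of magnitude
    belonging to speech (one common definition; others use power ratios)."""
    return mag_speech / (mag_speech + mag_noise + eps)

def apply_mask(mag_mixture, mask):
    """Attenuate each bin of the mixture magnitude by its mask value."""
    return mask * mag_mixture
```

Bins dominated by speech get a mask near 1 and pass through; bins dominated by noise get a mask near 0 and are suppressed.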
In this thesis, we use an autoencoder with the raw waveform as input to achieve both a short response time and denoising; the SE model serves as the speech preprocessor for the whole QBH system. The results show that it is easier to train the denoising model on speech data and test it on the humming data set. A QBH system is used in common environments such as streets and offices, so a given melody is accompanied by unwanted environmental signals apart from the hummed melody; all of these unwanted signals are called noise. Onset detection examines the energy change to decide whether a frame index is a note onset, and the detected onsets enable melody matching between query and target. However, the presence of noise can produce incorrect onset positions, resulting in QBH system misjudgments. In this thesis, an autoencoder model is used to learn the difference between humming and noise, reduce the noise energy, and separate out the humming; combined with multiple onset-detection models, this achieves noise-resistant onset detection. Fundamental frequency analysis finds the pitch of each onset. Common techniques include the autocorrelation function, the harmonic spectrum and its variants, the cepstrum, and so on. Due to the development of deep learning in recent years, many state-of-the-art (SOTA) methods estimate the fundamental frequency with neural networks, but they take longer than traditional fundamental-frequency analysis. Pitch estimation in a traditional QBH system assigns a pitch to each detected onset. Under ambient noise, the estimated fundamental frequency may occur at twice or even three times the true frequency, affecting the accuracy of the QBH system. In this thesis, we propose a pitch-extraction method that is not easily affected by noise.
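The energy-based onset detection described above can be sketched as a half-wave-rectified energy novelty with a simple adaptive threshold. This is a generic single-feature illustration, not the thesis's multi-model detector; all parameter values are assumptions:

```python
import numpy as np

def energy_onsets(x, frame_len=512, hop=256, threshold=2.0):
    """Detect onsets as frames whose short-time energy rises sharply.

    Returns frame indices whose energy increase exceeds `threshold` times
    the mean positive increase (a simple adaptive threshold) and is a
    local maximum of the novelty curve."""
    n_frames = 1 + (len(x) - frame_len) // hop
    energy = np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2)
                       for i in range(n_frames)])
    novelty = np.maximum(np.diff(energy), 0.0)  # half-wave rectification
    if novelty.max() == 0.0:
        return np.array([], dtype=int)
    thresh = threshold * novelty.mean()
    peaks = [i + 1 for i in range(len(novelty))
             if novelty[i] > thresh
             and novelty[i] == novelty[max(0, i - 2):i + 3].max()]
    return np.array(peaks, dtype=int)
```

A tone starting after a stretch of silence produces one large positive energy step, so the detector should report a single onset near the tone's first frame.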
The method combines traditional methods with a deep-learning method and extracts the median frequency, maintaining accuracy with a short running time in a quiet environment while reducing the error between the predicted and ground-truth pitch under ambient noise. Sequence matching in this thesis is based on an existing method; together with the improvements to speech enhancement, onset detection, and fundamental frequency analysis, the relationship between the three improved modules and the QBH system is investigated accordingly. The performance of the robust QBH system is tested with 17 types of ambient noise at SNR levels from -5 dB to 30 dB and compared against the QBH system in a clean environment as the baseline.
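The benefit of the median can be illustrated with a classical autocorrelation pitch tracker: taking the median of frame-wise F0 estimates discards isolated octave-error frames instead of averaging them in. A NumPy sketch under assumed parameters; the thesis combines traditional and deep-learning estimators, which are not reproduced here:

```python
import numpy as np

def frame_pitch_acf(frame, fs, fmin=80.0, fmax=1000.0):
    """Estimate one frame's F0 from the autocorrelation peak whose lag
    corresponds to a frequency in [fmin, fmax]."""
    frame = frame - frame.mean()
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(acf[lo:hi])
    return fs / lag

def note_pitch(x, fs, frame_len=1024, hop=512):
    """Median of frame-wise F0 estimates: a few frames that jump to the
    octave shift the mean, but leave the median unchanged."""
    n = 1 + (len(x) - frame_len) // hop
    f0s = [frame_pitch_acf(x[i * hop:i * hop + frame_len], fs)
           for i in range(n)]
    return float(np.median(f0s))
```

On a clean 220 Hz tone, every frame's autocorrelation peak falls near the true period, so the median lands within a few Hz of 220.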
