
混合人聲之聲音場景辨識

Classification of Acoustic Scenes with Mixtures of Human Voice and Background Audio

Advisor: 廖文宏

Abstract


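Sounds in everyday environments never occur as isolated events; multiple sources overlap, which makes environmental sound recognition challenging. This study builds on the audio data provided for Task 1 of the DCASE2016 challenge, which comprises environmental recordings of 15 scenes such as beach and tram, and mixes them with the voices of 16 speakers in order to analyze and recognize scenes that contain human voice. Features are extracted as log-Mel spectrograms, which are widely used in sound recognition, so as to retain as much acoustic information as possible, and a convolutional neural network (CNN) is used to classify the overlapping sound scenes. The overall average recognition rate is 79%, and the recognition rate for the car class reaches 93%. We hope the approach can serve as a preprocessing stage for voiceprint recognition in online identity verification.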
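The mixing of a voice recording into a scene recording can be illustrated with a minimal sketch. The abstract does not say how the voices were scaled or aligned, so the function name `mix_at_snr`, the `snr_db` parameter, and the choice to loop the voice to the scene length are all assumptions made for illustration, not the procedure used in the thesis.

```python
import numpy as np


def mix_at_snr(scene, voice, snr_db):
    """Overlay `voice` on `scene` at a chosen voice-to-scene power ratio (dB).

    Hypothetical helper: the scaling rule and the looping of the voice
    are illustrative assumptions, not taken from the thesis.
    """
    scene = np.asarray(scene, dtype=np.float64)
    voice = np.asarray(voice, dtype=np.float64)

    # Repeat the voice clip and trim it so it covers the whole scene.
    reps = int(np.ceil(len(scene) / len(voice)))
    voice = np.tile(voice, reps)[: len(scene)]

    scene_power = np.mean(scene ** 2)
    voice_power = np.mean(voice ** 2)
    if scene_power == 0.0 or voice_power == 0.0:
        # A silent input gives no meaningful ratio; add the voice unscaled.
        return scene + voice

    # Gain that makes (gain * voice) power / scene power equal 10^(snr_db / 10).
    gain = np.sqrt(scene_power * 10.0 ** (snr_db / 10.0) / voice_power)

    # No clipping protection is applied here; rescale before writing to
    # a fixed-point audio file if the peak exceeds full scale.
    return scene + gain * voice
```

Sweeping `snr_db` over a range of values would give mixtures in which the voice is more or less dominant, which is one plausible way to vary the difficulty of the classification task.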

Parallel Abstract


Sounds in everyday environments are never isolated events but consist of overlapping audio sources, which makes environmental sound recognition challenging. This research uses the audio data provided for Task 1 of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE2016) challenge, which includes environmental recordings of 15 scenes such as beach and tram. These recordings are mixed with 16 human voices to create a new dataset. Acoustic features are extracted as Log-Mel spectrograms, which are commonly used in sound recognition, to retain as much acoustic information as possible. A convolutional neural network (CNN) is employed to distinguish these overlapping sound scenes. We achieve an overall accuracy of 79%, and 93% accuracy for the ‘car’ scene. We expect the outcome to be applied as the pre-processing stage of voice-based online identity verification.
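For readers unfamiliar with the feature, the following is a minimal NumPy sketch of a log-Mel spectrogram: a short-time power spectrum is pooled by triangular filters spaced evenly on the mel scale, then converted to decibels. The function names and every parameter value (`sr`, `n_fft`, `hop`, `n_mels`, the Hann window, the 2595/700 mel formula) are illustrative assumptions; the abstract does not state the settings used in the thesis.

```python
import numpy as np


def hz_to_mel(f):
    # A common mel-scale formula (assumed here, not taken from the thesis).
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)

    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, centre, right = hz_points[i : i + 3]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_spectrogram(signal, sr=44100, n_fft=2048, hop=1024, n_mels=40):
    """Return an (n_mels, n_frames) array in dB; needs len(signal) >= n_fft."""
    signal = np.asarray(signal, dtype=np.float64)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop

    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft//2 + 1)
    mel_power = power @ mel_filterbank(sr, n_fft, n_mels).T  # (n_frames, n_mels)

    # A small constant avoids log(0) on silent frames.
    return 10.0 * np.log10(mel_power + 1e-10).T
```

The resulting two-dimensional array can be treated like a single-channel image, which is what makes it a natural input for a CNN classifier.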

