Speech recognition can be applied in many fields, and researchers have found that segmenting audio by speaker before performing recognition on each part yields better results; speaker recognition has therefore attracted growing attention. Within this area, speaker diarization is the ultimate goal: the task is to determine "who spoke when", i.e., to automatically partition a recording into segments and label which segments belong to the same speaker, even when little prior speaker or speech information is available. Diarization applies to many scenarios in which multiple speakers are present, such as meetings, live broadcasts, and telephone recordings. With recent technological advances, a variety of methods and model architectures have been developed in this field; however, recognition results still tend to be erroneous when speech overlaps, so researchers have proposed various post-correction methods for handling overlapping speech segments. In this thesis, the LibriSpeech corpus is used, and Simulated Room Impulse Responses together with MUSAN noise are employed to approximate real acoustic environments. The experiments primarily use an end-to-end architecture as the pre-trained model, combined with the DiaCorrect post-correction method. Because the noisier training set leads to higher error rates in the pre-training results, the post-correction experiments retain more of the training data to train the decoder model, and the results obtained with different amounts of training data are compared. In addition, the model provided by Pyannote 2.1, which performs remarkably well across corpora, is used as the pre-trained model; by combining Pyannote 2.1 with the DiaCorrect method, the weaker portions of its output are improved, yielding better speaker diarization results.
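To make the "who spoke when" formulation concrete, a diarization output can be represented as a list of `(start, end, speaker)` segments, and the overlapping-speech regions that the abstract identifies as error-prone are simply the time spans where two or more segments are active at once. The sketch below is illustrative only (the segment representation and function name are assumptions, not part of the thesis); it finds overlap regions with a boundary sweep:

```python
from typing import List, Tuple

# Hypothetical representation: (start_sec, end_sec, speaker_id)
Segment = Tuple[float, float, str]

def overlap_regions(segments: List[Segment]) -> List[Tuple[float, float]]:
    """Return time regions where two or more speakers are active.

    Sweeps over segment boundaries, keeping a running count of
    active speakers; a region opens when the count reaches 2 and
    closes when it drops below 2 again.
    """
    events = []
    for start, end, _speaker in segments:
        events.append((start, 1))   # speaker becomes active
        events.append((end, -1))    # speaker becomes inactive
    # Sorting (time, delta) processes an end (-1) before a start (+1)
    # at the same instant, so touching segments do not count as overlap.
    events.sort()

    regions: List[Tuple[float, float]] = []
    active = 0
    region_start = None
    for t, delta in events:
        prev = active
        active += delta
        if prev < 2 <= active:
            region_start = t
        elif prev >= 2 > active and region_start is not None:
            if t > region_start:
                regions.append((region_start, t))
            region_start = None
    return regions
```

For example, with speaker A on 0–5 s, B on 3–8 s, and C on 7–10 s, the overlapping regions are 3–5 s and 7–8 s; on such spans a single-speaker-per-frame system necessarily mislabels at least one active speaker, which is the failure mode post-correction methods such as DiaCorrect target.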
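The data-simulation step described above (convolving clean LibriSpeech speech with a simulated room impulse response, then mixing in MUSAN noise at some signal-to-noise ratio) can be sketched as follows. This is a minimal illustration, not the thesis pipeline: the function name and the fixed-SNR mixing scheme are assumptions, and a real setup would draw RIRs and noise clips from the actual corpora.

```python
import numpy as np

def simulate_noisy_reverberant(speech: np.ndarray,
                               rir: np.ndarray,
                               noise: np.ndarray,
                               snr_db: float) -> np.ndarray:
    """Convolve speech with a room impulse response and add noise
    at a target SNR (in dB), measured against the reverberant speech."""
    # Reverberation: linear convolution with the RIR, truncated
    # back to the original utterance length.
    reverberant = np.convolve(speech, rir)[: len(speech)]

    # Tile or trim the noise clip to match the utterance length.
    reps = int(np.ceil(len(reverberant) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverberant)]

    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(reverberant ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return reverberant + scale * noise
```

Training on such artificially degraded audio is what the abstract refers to as a "noisier training set"; it raises the pre-training error rate but better matches real recording conditions.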