Speaker diarization aims to automatically partition an audio recording into speaker-homogeneous segments without any prior information, and to determine which segments belong to the same speaker. The task applies to scenarios in which multiple speakers converse within a single recording, such as meetings, online lectures, and live broadcasts, labeling each speech segment with its speaker. In recent years, advances in hardware and the resulting growth in computing power have brought breakthrough improvements to this task. Nevertheless, current speaker diarization systems still fail to achieve good performance on long-form, highly overlapping corpora such as the Augmented Multi-Party Interaction (AMI) corpus. In view of this, this thesis studies the two major diarization architectures, stage-wise and end-to-end, on the AMI corpus, and proposes corresponding improvements for each. For the stage-wise architecture, considering how differences in speakers' wording habits affect speaker recognition, phone posteriorgrams generated by automatic speech recognition are used as auxiliary information alongside the speaker features, improving the system's ability to distinguish speakers. For the end-to-end architecture, whereas the network in a traditional stage-wise architecture only needs to process short segments, an end-to-end network must analyze the entire recording at once; this modeling burden degrades its performance on long-form corpora. Moreover, unlike the stage-wise architecture, the end-to-end architecture must additionally judge whether a given stretch of audio is non-speech, and is easily misled by noise, which sharply increases the difficulty of identification. This study therefore follows the stage-wise architecture in introducing oracle speech activity detection, reducing the neural network's burden in distinguishing speakers and shortening the audio it must process. Finally, this study adopts DOVER-Lap (Diarization Output Voting Error Reduction Overlap) to fuse multiple experimental results from the two architectures, combining the accuracy of the stage-wise architecture with the end-to-end architecture's ability to handle overlapping speech to achieve superior performance.
Speaker diarization solves the “who spoke when” question by partitioning an audio file into homogeneous segments according to the speaker’s identity. It can be applied to conversations in which multiple speakers are present at the same time. However, current diarization systems cannot achieve good performance on long-form, highly overlapping corpora such as the Augmented Multi-Party Interaction (AMI) corpus. This paper studies the two major diarization architectures, i.e. stage-wise and end-to-end (E2E), on the AMI corpus, and proposes improved methods corresponding to each. In the stage-wise architecture, to account for the influence of different speakers’ wording habits on speaker recognition, phone posteriorgrams generated by automatic speech recognition are used as auxiliary information, improving the ability to distinguish different speakers. Different from the stage-wise architecture, in which only segmented information is processed, the E2E architecture must analyze the entire audio at once. This becomes an obstacle to modeling and decreases its performance on long-form corpora. In addition, the E2E architecture needs an additional judgement on whether an audio segment is a speech segment. Such judgement is easily affected by noise, which makes the difficulty of identification increase sharply. Therefore, this research follows the stage-wise architecture and introduces oracle speech activity detection, reducing the burden on the neural network when distinguishing speakers. Finally, the DOVER-Lap mechanism is used to integrate multiple experimental results from these two architectures. By combining the high accuracy of the stage-wise architecture with the ability of the E2E architecture to tackle overlapping speech, a more pronounced performance with a diarization error rate (DER) of 16.9% is obtained.
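The auxiliary-information idea for the stage-wise architecture can be illustrated with a minimal sketch: a segment-level speaker embedding is augmented with a summary of the segment's phone posteriorgram before clustering. The function name, dimensions, and the simple mean-pooling of posteriors below are illustrative assumptions, not the thesis's exact method.

```python
import numpy as np

def fuse_features(speaker_embedding, phone_posteriorgram):
    """Concatenate a segment-level speaker embedding with a summary of
    the segment's phone posteriorgram, yielding one fused vector.

    speaker_embedding: (d,) array, e.g. an x-vector for the segment.
    phone_posteriorgram: (T, P) array of per-frame phone posteriors
        produced by an ASR acoustic model (rows sum to 1).
    """
    # Average the per-frame phone posteriors into a segment-level
    # phone-usage profile -- a crude proxy for wording habits.
    ppg_summary = phone_posteriorgram.mean(axis=0)
    return np.concatenate([speaker_embedding, ppg_summary])

# Toy example: a 4-dim embedding and 10 frames over 6 phone classes.
emb = np.random.randn(4)
ppg = np.random.rand(10, 6)
ppg = ppg / ppg.sum(axis=1, keepdims=True)  # normalize rows to sum to 1
fused = fuse_features(emb, ppg)
print(fused.shape)  # (10,)
```

The fused vectors would then feed the usual stage-wise clustering step in place of the plain speaker embeddings.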
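The fusion step can be pictured with a heavily simplified vote over frame-level labels. The real DOVER-Lap algorithm first maps each hypothesis's speaker labels into a common namespace and applies rank-weighted voting with overlap handling; the sketch below assumes the label mapping is already done and uses a plain unweighted majority vote.

```python
from collections import Counter

def majority_vote(hypotheses):
    """Frame-level majority vote over several diarization hypotheses.

    hypotheses: list of equal-length label sequences, one per system,
    with speaker labels already mapped to a common namespace.
    """
    fused = []
    for frame_labels in zip(*hypotheses):
        # Keep the label most systems agree on for this frame.
        fused.append(Counter(frame_labels).most_common(1)[0][0])
    return fused

hyps = [
    ["A", "A", "B", "B", "-"],   # "-" marks non-speech
    ["A", "B", "B", "B", "-"],
    ["A", "A", "B", "-", "-"],
]
print(majority_vote(hyps))  # ['A', 'A', 'B', 'B', '-']
```

Even this toy version shows why fusing stage-wise and E2E outputs can help: a system that mislabels a frame is outvoted wherever the other systems agree.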