Speaker identification systems deployed over the public telephone network typically suffer from handset mismatch and from insufficient identification data. To improve the robustness of speaker identification, we propose a framework that fuses low-level acoustic and high-level prosodic information. First, (1) maximum likelihood a priori knowledge interpolation (ML-AKI) is used to estimate and compensate for the acoustic characteristics of the handset; (2) minimum classification error (MCE) discriminative training is then applied to the speaker models to enlarge the score margins between different speakers and thus obtain more accurate models; (3) eigen-prosody analysis (EPA) is adopted as an auxiliary cue, projecting all speakers into a compact eigen-prosody space in which inter-speaker distances are measured (a sketch of this projection follows below); and finally (4) linear regression is used to fuse the acoustic and prosodic model scores to produce the identification result. The proposed methods were evaluated with a leave-one-out scheme on the HTIMIT corpus from the Linguistic Data Consortium (LDC), which covers ten different handsets. With the conventional MAP-GMM/CMS approach as the baseline, the average speaker identification rate was 60.2%; combining ML-AKI, MCE, and EPA with MAP-GMM/CMS raised it to 79.3%. On the unseen-handset portion alone, the average identification rate improved from 58.3% to 74.6%. These results show that, compared with the conventional MAP-GMM/CMS method, the combination of ML-AKI, MCE/GPD, EPA, and MAP-GMM yields effective improvements under both seen- and unseen-handset conditions.
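To make the EPA step concrete, here is a minimal sketch of projecting per-speaker prosodic feature vectors into a compact eigen-prosody space via PCA (computed with an SVD) and measuring inter-speaker distances there. This is an illustration under assumed conventions, not the paper's implementation: the function names, the choice of prosodic features (e.g., pitch-contour statistics), and the number of retained axes are all hypothetical.

import numpy as np

def epa_project(prosody_vectors, n_eigen=4):
    """Project speaker prosody vectors onto the top eigen-prosody axes.

    prosody_vectors: (n_speakers, n_features) array, one row per speaker
    (e.g., pitch-contour statistics; illustrative assumption).
    Returns (n_speakers, n_eigen) coordinates in the eigen-prosody space.
    """
    mean = prosody_vectors.mean(axis=0)
    centered = prosody_vectors - mean
    # SVD of the centered data yields the principal (eigen-prosody) axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_eigen].T

def epa_distances(test_coord, speaker_coords):
    """Euclidean distance from one test point to every enrolled speaker."""
    return np.linalg.norm(speaker_coords - test_coord, axis=1)

In this view, a smaller distance in the eigen-prosody space indicates a prosodically closer speaker, which is the auxiliary cue fused with the acoustic score in step (4).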
Unseen handset mismatch is the major source of performance degradation for closed-set speaker identification in telecommunication environments. To compensate for handset mismatch with little available training/test data, a maximum likelihood a priori knowledge interpolation (ML-AKI) approach and an eigen-prosody analysis (EPA) approach are proposed and fused for robust speaker identification. Experimental results on HTIMIT show that the ML-AKI+EPA+MCE+MAP-GMM/CMS fusion approach achieves 79.3% average speaker identification accuracy, much better than the traditional MAP-GMM/CMS baseline (60.2%). Moreover, over the nine unseen-handset turns of the leave-one-out experiment, the average speaker identification rate increases from 58.3% (MAP-GMM/CMS) to 74.6% (ML-AKI+EPA+MCE+MAP-GMM/CMS). The proposed ML-AKI and EPA fusion method is therefore a promising approach to robust speaker identification under both seen and unseen handset distortion.
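The score-fusion step can be sketched as ordinary least-squares linear regression over the two per-trial scores, as stated in the abstract. The framing below (fitting weights against 0/1 target labels on held-out trials, then picking the speaker with the highest fused score) is an assumed setup for illustration; the actual regression targets and normalization used in the paper may differ.

import numpy as np

def fit_fusion_weights(acoustic_scores, prosody_scores, labels):
    """Learn linear-regression fusion weights on held-out trials.

    acoustic_scores, prosody_scores: (n_trials,) float arrays of scores
    (e.g., GMM log-likelihoods and negated eigen-prosody distances;
    an illustrative assumption).
    labels: (n_trials,) array, 1.0 for target-speaker trials, else 0.0.
    Returns (w0, w1, w2) for: fused = w0 + w1*acoustic + w2*prosody.
    """
    X = np.column_stack([np.ones_like(acoustic_scores),
                         acoustic_scores, prosody_scores])
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    return w

def identify(w, acoustic_per_speaker, prosody_per_speaker):
    """Closed-set decision: the speaker with the highest fused score wins."""
    fused = w[0] + w[1] * acoustic_per_speaker + w[2] * prosody_per_speaker
    return int(np.argmax(fused))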