Unsupervised Discovery of Structured Acoustic Tokens and Speech Features with Applications to Spoken Term Detection

Advisor: Lin-shan Lee (李琳山)

Abstract


In the era of big data, large amounts of raw speech data are easy to obtain, but speech data with corresponding text annotations remain difficult to acquire. This makes unsupervised learning, which requires no text annotations, increasingly important; one of its most obvious applications is query-by-example spoken term detection. In the speech literature, supervised learning has grown together with advances in automatic speech recognition, while unsupervised learning remains a relatively less explored area. In this thesis, we propose two unsupervised learning approaches, the Hierarchical Paradigm and the Multi-granularity Paradigm, for discovering structured acoustic token models directly from speech corpora. The Hierarchical Paradigm jointly learns representations at two levels, corresponding to acoustic token models and words. The Multi-granularity Paradigm attempts to capture all available information with multiple sets of token models of different granularities. In addition, we propose a method for extracting unsupervised speech features using multiple sets of token models together with a deep neural network architecture. We also present a theoretical framework that unifies the two paradigms, and validate our models with query-by-example spoken term detection experiments. The acoustic token models and speech features proposed in this thesis are compared against state-of-the-art methods in the literature on standard corpora using well-defined metrics, and the results show that they are highly competitive.
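To make the idea of multi-granularity token discovery more concrete, the following is a minimal Python sketch, illustrative only and not the procedure used in the thesis: it extracts MFCC features with librosa and clusters frames with scikit-learn K-means once per granularity, as a simple stand-in for the sets of token models of different granularities described above. The file name utterance.wav and the granularity settings (50, 100, 300 tokens) are assumptions made for the example.

# Illustrative sketch: frame clustering as a stand-in for multi-granularity
# acoustic tokenization. In practice, frames from a whole corpus would be pooled
# and the token models would be richer than a single K-means per granularity.
import librosa
from sklearn.cluster import KMeans

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Return a (frames, n_mfcc) matrix of MFCC features for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T

def tokenize_multi_granularity(frames, granularities=(50, 100, 300)):
    """Cluster frames into token labels once per granularity (token-set size)."""
    token_sets = {}
    for n_tokens in granularities:
        km = KMeans(n_clusters=n_tokens, n_init=10, random_state=0).fit(frames)
        token_sets[n_tokens] = km.labels_  # one unsupervised token label per frame
    return token_sets

if __name__ == "__main__":
    frames = extract_mfcc("utterance.wav")  # hypothetical file name
    token_sets = tokenize_multi_granularity(frames)
    for n_tokens, labels in token_sets.items():
        print(n_tokens, labels[:20])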

Abstract (English)


In the era of big data, huge quantities of raw speech data are easy to obtain, but annotated speech data remain hard to acquire. This leads to the increased importance of unsupervised learning scenarios where annotated data are not required, a typical application of which is Query-by-Example Spoken Term Detection (QbE-STD). With the dominant paradigm of automatic speech recognition (ASR) technologies being supervised learning, such a scenario is still a relatively underexplored area. In this thesis, we present the Hierarchical Paradigm and the Multi-granularity Paradigm for unsupervised discovery of structured acoustic tokens directly from speech corpora. The Hierarchical Paradigm attempts to jointly learn two levels of representation that are correlated with phonemes and words. The Multi-granularity Paradigm makes no assumptions on which set of tokens to select, and seeks to capture all available information with multiple sets of tokens of different model granularities. Furthermore, unsupervised speech features can be extracted from the multi-granular acoustic tokens with a framework which we call the Multi-granular Acoustic Tokenizing Deep Neural Network (MAT-DNN). We unify the two paradigms in a single theoretical framework and perform query-by-example spoken term detection experiments on the token sets and frame-level features. The theories and principles on acoustic tokens and frame-level features proposed in this thesis are supported by competitive results against strong baselines on standard corpora using well-defined metrics.
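To illustrate the query-by-example spoken term detection task itself, below is a minimal subsequence-DTW matcher in Python, a common baseline for QbE-STD rather than the exact matching pipeline of this thesis. The feature matrices stand in for frame-level representations such as MFCCs or the unsupervised features mentioned above, and the query-length normalization is an assumption of this sketch.

# Illustrative sketch: subsequence dynamic time warping for QbE-STD.
import numpy as np

def subsequence_dtw_score(query, utterance):
    """Return a detection score: lower means the query is more likely to occur.

    query:     (n, d) feature matrix of the spoken query
    utterance: (m, d) feature matrix of the searched utterance
    """
    # Pairwise Euclidean frame distances, shape (n, m).
    cost = np.linalg.norm(query[:, None, :] - utterance[None, :, :], axis=-1)
    n, m = cost.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = cost[0, :]  # the match may start at any utterance frame
    for i in range(1, n):
        for j in range(m):
            best_prev = acc[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i, j] + best_prev
    # The match may end at any utterance frame; normalize by query length.
    return acc[-1, :].min() / n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query = rng.normal(size=(40, 13))       # stand-in for real query features
    utterance = rng.normal(size=(500, 13))  # stand-in for real utterance features
    print(subsequence_dtw_score(query, utterance))

In this setup, candidate utterances are ranked by ascending score, so lower scores indicate a more likely occurrence of the query term.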
