
Author: 李鴻欣
Hung-Shin Lee
Thesis Title: 基於分類錯誤之線性鑑別式特徵轉換應用於大詞彙連續語音辨識
Classification Error-based Linear Discriminative Feature Transformation for Large Vocabulary Continuous Speech Recognition
Advisor: 陳柏琳
Chen, Berlin
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of Publication: 2009
Academic Year of Graduation: 97
Language: Chinese
Number of Pages: 107
Chinese Keywords: 語音辨識, 鑑別分析, 特徵擷取, 特徵轉換
English Keywords: speech recognition, discriminant analysis, feature extraction, feature transformation
Thesis Type: Academic thesis
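  • Linear discriminant analysis (LDA) aims to find a linear transformation that projects the original data onto a lower-dimensional feature space while preserving the geometric separability between classes. However, LDA cannot always guarantee higher classification accuracy. One possible reason is that the LDA objective function is not directly linked to the classification error rate, so it does not necessarily suit the classification rule governed by a particular classifier; automatic speech recognition (ASR) is a good example. In this thesis, we extend conventional LDA by exploring the relationship between the empirical classification error rate and the Mahalanobis distance for each pair of easily confused phone classes, and we modify the original between-class scatter matrix from an estimate based on the Euclidean distance between each pair of classes to their pairwise empirical classification accuracy. The new method retains the lightweight solvability of the original LDA and requires no assumption about the probability distribution of the data.
    In addition, we propose a novel linear discriminative feature extraction method called generalized likelihood ratio discriminant analysis (GLRDA), which uses the concept of the likelihood ratio test to seek a lower-dimensional feature space. GLRDA takes into account the heteroscedasticity of the data as a whole, that is, the covariance matrices of all classes may flexibly be treated as different. For classification, it obtains a lower-dimensional feature subspace that helps improve classification by minimizing the probability of occurrence of the most confusing situation among classes, which is described by the null hypothesis. We also prove that LDA and heteroscedastic linear discriminant analysis (HLDA) can be regarded as two special cases of GLRDA. Moreover, to improve the robustness of speech features, GLRDA can be further combined with the empirical confusion information provided by the recognizer.
    Experimental results show that, in a Mandarin Chinese large vocabulary continuous speech recognition system, the proposed methods all outperform LDA and other existing improved methods such as HLDA.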

    The goal of linear discriminant analysis (LDA) is to seek a linear transformation that projects the original data set onto a lower-dimensional feature subspace while retaining geometric class separability. However, LDA cannot always guarantee better classification accuracy. One possible reason is that its criterion is not directly associated with the classification error rate, so it does not necessarily suit the allocation rule governed by a given classifier, such as that employed in automatic speech recognition (ASR). In this thesis, we extend classical LDA by leveraging the relationship between the empirical phone classification error rate and the Mahalanobis distance for each phone class pair. To this end, we modify the original between-class scatter from a measure of the Euclidean distance to the pairwise empirical classification accuracy for each class pair, while preserving the lightweight solvability of LDA and, like LDA, making no distributional assumption.
    Furthermore, we present a novel discriminative linear feature transformation, named generalized likelihood ratio discriminant analysis (GLRDA), based on the likelihood ratio test (LRT). It seeks a lower-dimensional feature subspace by making the most confusing situation, described by the null hypothesis, as unlikely as possible, without imposing the homoscedastic assumption on the class distributions. We also show that classical LDA and its well-known extension, heteroscedastic linear discriminant analysis (HLDA), are two special cases of the proposed method. The empirical class confusion information can be further incorporated into GLRDA for better recognition performance.
    Experimental results demonstrate that our approaches yield moderate improvements over LDA and other existing methods, such as HLDA, on the Chinese large vocabulary continuous speech recognition (LVCSR) task.
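    The pairwise view of the between-class scatter described in the abstract can be illustrated with a minimal sketch. The code below is not the thesis implementation: the function names (`class_statistics`, `pairwise_between_scatter`, `weighted_lda`) and the `weights` matrix are hypothetical, and the abstract does not give the formula that turns empirical pairwise classification accuracy into weights, so the weights are left as a caller-supplied placeholder. With all weights equal to one, the pairwise sum reduces to the classical LDA between-class scatter.

```python
import numpy as np


def class_statistics(X, y):
    """Return class labels, priors, class means and the pooled within-class scatter."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    d = X.shape[1]
    S_w = np.zeros((d, d))
    for p, c, m in zip(priors, classes, means):
        centered = X[y == c] - m
        # Prior-weighted covariance of class c.
        S_w += p * (centered.T @ centered) / centered.shape[0]
    return classes, priors, means, S_w


def pairwise_between_scatter(means, priors, weights=None):
    """Between-class scatter written as a sum over class pairs.

    `weights[i, j]` is a hypothetical per-pair weight (for example, one derived
    from a confusion matrix); unit weights give the classical LDA scatter.
    """
    K, d = means.shape
    if weights is None:
        weights = np.ones((K, K))
    S_b = np.zeros((d, d))
    for i in range(K):
        for j in range(i + 1, K):
            diff = means[i] - means[j]
            S_b += weights[i, j] * priors[i] * priors[j] * np.outer(diff, diff)
    return S_b


def weighted_lda(X, y, weights=None, n_components=2):
    """Return a (d, n_components) projection maximising weighted class separation."""
    _, priors, means, S_w = class_statistics(X, y)
    S_b = pairwise_between_scatter(means, priors, weights)
    # Whiten with S_w^{-1/2} (assumes S_w is positive definite), then solve a
    # symmetric eigenproblem for the whitened between-class scatter.
    w_vals, w_vecs = np.linalg.eigh(S_w)
    whiten = w_vecs @ np.diag(w_vals ** -0.5) @ w_vecs.T
    b_vals, b_vecs = np.linalg.eigh(whiten @ S_b @ whiten)
    order = np.argsort(b_vals)[::-1][:n_components]
    return whiten @ b_vecs[:, order]
```

    Because only the between-class scatter changes, the solution is still obtained from a single eigendecomposition, which is the sense in which such weighted variants keep the lightweight solvability of LDA. Projected features are obtained as `X @ weighted_lda(X, y, weights)`.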

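    Chapter 1 Research Objectives and Methodology
      1.1 Basic Objectives and Methods
      1.2 Research Foundation: Linear Discriminant Analysis
      1.3 Contributions of the Thesis
      1.4 Organization of the Thesis
    Chapter 2 Background
      2.1 Statistical Speech Recognition
      2.2 Acoustic Feature Extraction
        2.2.1 Spectral Shaping
        2.2.2 Spectral Analysis
        2.2.3 Parametric Transformation
        2.2.4 An Alternative Framework: Multi-Vector Input
      2.3 Linear Discriminant Analysis (LDA)
        2.3.1 Objective Function
        2.3.2 Meaning and Analysis of Geometric Separability
        2.3.3 Limitations and Improvements: Heteroscedasticity
        2.3.4 Limitations and Improvements: Relevance to Classification
    Chapter 3 Linear Discriminant Analysis Based on Empirical Information
      3.1 Weighted Linear Discriminant Analysis
      3.2 Weighted Linear Discriminant Analysis Based on Confusion Information
        3.2.1 Weighted Linear Discriminant Analysis Based on Empirical Error Rates
        3.2.2 Distance-Error Coupled Weighted Linear Discriminant Analysis
        3.2.3 Approximate Pairwise Empirical Accuracy Criterion
        3.2.4 Comparison of aPTAC and aPEAC
      3.3 Within-Class Covariance Matrix Based on Empirical Error Rates
    Chapter 4 Generalized Likelihood Ratio Discriminant Analysis
      4.1 Likelihood Ratio Test
      4.2 Generalized Likelihood Ratio Discriminant Analysis
        4.2.1 Homoscedasticity
        4.2.2 Heteroscedasticity
        4.2.3 Discussion and Comparison
      4.3 Extension with Confusion Information
    Chapter 5 Experimental Setup and Results
      5.1 Experimental Corpus
      5.2 The NTNU Mandarin Large Vocabulary Continuous Speech Recognition System
        5.2.1 Front-End Processing
        5.2.2 Acoustic Models
        5.2.3 Lexicon Construction and Language Model Training
        5.2.4 Word-Tree Copy Search
        5.2.5 Evaluation Method
        5.2.6 Multi-Vector Input (Spectro-Temporal Feature Extraction)
      5.3 Experimental Results
        5.3.1 Further Discussion on Class Definition
        5.3.2 Baseline Results
        5.3.3 Results for Weighted Linear Discriminant Analysis Based on Confusion Information
        5.3.4 Generalized Likelihood Ratio Discriminant Analysis Experiments
        5.3.5 Minimum Phone Error (MPE) Experiments
    Chapter 6 Conclusions and Future Work
    Chapter 7 Appendix
      7.1 Important Vector Differentiation Formulas
      7.2 Some Proofs and Derivations
        7.2.1 Proof of Equation (2.37)
        7.2.2 Proofs Concerning the Gaussian Distribution and Likelihood
    References
    Author's Related Publications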

