Improving Results of Recognizing Speech Digits Using Neural Network by Processing the Raw Data and the Feature Sets

This report aims to classify spoken digits using neural networks, focusing mainly on improving the Mel-scale Filter Cepstral Coefficients (MFCC) feature set. To reduce the effect of poor-quality data on the classification system, we constructed a binary neural network (BNN). Placing the BNN before the DNN classification system discards roughly 3 to 4 percent of the 2,300 samples, which the BNN deems too poor to be worth classifying; in exchange, the PEG value drops by over 95 percent for both the English and Chinese databases. In the second part, the study attempts to improve the classification accuracy of the neural network by improving the quality of the raw data and its features. The main strategies are stretching the effective signal samples, applying multiple energy thresholds and filtering, and duplicating the segments that may contain more information after segmentation. This report also proposes the Frequency Masking Filter to improve the MFCC; applying it improves the classification result by up to 5 percent. The fourth part focuses on finding an optimum gain filter, one whose gain is not so large that it boosts the noise and yields a poor feature set, by adding two vectors, n and f0, when pre-processing speech signals. The results show a 20 percent improvement for MFCC and STFT on both the English and Chinese databases.
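The two-stage arrangement described above, in which a binary gate rejects poor samples before digit classification, can be sketched roughly as follows. This is a minimal illustration and not the report's implementation: the function name `gate_then_classify`, the per-sample `quality_scores`, the `reject_below` threshold, and the use of `-1` to mark rejected samples are all assumptions made for the example.

```python
import numpy as np

def gate_then_classify(features, quality_scores, classify, reject_below=0.5):
    """Hypothetical two-stage pipeline: binary quality gate, then digit classifier.

    features       : array of shape (n_samples, n_features)
    quality_scores : assumed output of a binary network, one score per sample
    classify       : callable mapping kept features to integer digit labels
    reject_below   : assumed threshold; samples scoring lower are discarded
    """
    features = np.asarray(features)
    quality_scores = np.asarray(quality_scores, dtype=float)

    # Stage 1: the binary gate decides which samples are worth classifying.
    keep = quality_scores >= reject_below

    # Stage 2: classify only the kept samples; -1 marks a rejected sample.
    labels = np.full(len(features), -1, dtype=int)
    if keep.any():
        labels[keep] = classify(features[keep])

    # Fraction of samples discarded by the gate.
    discard_rate = 1.0 - keep.mean()
    return labels, discard_rate


# Toy usage: a stand-in classifier that picks the largest feature index.
toy_features = np.eye(4, 10)            # 4 samples, 10 features
toy_scores = [0.9, 0.2, 0.8, 0.7]       # the second sample is gated out
toy_labels, toy_rate = gate_then_classify(
    toy_features, toy_scores, lambda x: x.argmax(axis=1)
)
```

The discard rate is reported alongside the labels because the trade-off in the abstract is stated in exactly those terms: a small fraction of samples is given up in exchange for a lower error measure on the rest.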

Keywords

Spoken digits classification; Neural network; Mel-scale Filter Cepstral Coefficients; Signal processing; Frequency Masking Filter
