A Study of Speech-Recognition-Assisted Methods for Collecting Taiwanese Corpora

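A corpus is the foundation of language technology, yet for a marginalized language such as Taiwanese, collecting one is not as easy as it is for dominant Mandarin. This thesis explores using speech recognition to assist Taiwanese corpus collection, covering both speech and text corpora. Given Taiwanese recordings and the corresponding Taiwanese text, we have a chance of quickly obtaining four kinds of corpora: Taiwanese speech, text, phonetic annotation, and tone sandhi. We call this the Taiwanese-text-Taiwanese-speech problem. A second setting, which we call the Mandarin-text-Taiwanese-speech problem, assumes Mandarin text together with speech of its Taiwanese translation. In addition to the four corpora above, it also yields Taiwanese-Mandarin parallel sentences, which are fundamental to translation between the two languages. Because speech recognition accuracy is not yet perfect, tailoring the recognition network to each specific sentence and recording, and reducing its complexity, can improve recognition. One aim of this thesis is therefore to explore how, given a particular Taiwanese or Mandarin sentence, to obtain the simplest recognition network that still contains the correct Taiwanese phonetic sequence. Decoding actually produces two results: (1) the best syllable sequence permitted by the recognition network (the maximum-likelihood sequence), and (2) the time occupied by each syllable in that sequence. How to use these two results to find possible errors in the corpus, and thereby raise corpus quality, is the other aim of this thesis.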
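The idea of a per-sentence recognition network can be illustrated with a minimal sketch. It assumes a "sausage" network, meaning one slot per word with each slot holding that word's candidate readings. The abstract does not say which network shape the thesis uses, so this shape is an assumption. The lexicon entries and the names `LEXICON`, `build_sausage_network`, `network_size`, and `enumerate_paths` are hypothetical and serve only as illustration.

```python
from itertools import product

# Hypothetical toy lexicon: each Taiwanese word maps to candidate romanized
# readings (numeric-tone romanization). Entries are illustrative only.
LEXICON = {
    "台語": ["tai5-gi2", "tai5-gu2"],
    "語料": ["gi2-liau7", "gu2-liau7"],
    "庫": ["khoo3"],
}


def build_sausage_network(words, lexicon):
    """Return one slot per word; each slot lists that word's candidate readings."""
    missing = [w for w in words if w not in lexicon]
    if missing:
        raise KeyError(f"no pronunciation for: {missing}")
    return [list(lexicon[w]) for w in words]


def network_size(network):
    """Number of distinct syllable sequences the network allows."""
    size = 1
    for slot in network:
        size *= len(slot)
    return size


def enumerate_paths(network):
    """List every syllable sequence the network allows, one string per path."""
    return [" ".join(path) for path in product(*network)]


if __name__ == "__main__":
    net = build_sausage_network(["台語", "語料", "庫"], LEXICON)
    print(network_size(net))
    for path in enumerate_paths(net):
        print(path)
```

Restricting the recognizer to the readings of the words actually present in the sentence keeps the search space small, which is the kind of complexity reduction the abstract argues for. A real system would hand such a network to the decoder rather than enumerate its paths.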

Keywords

Corpus collection; Speech recognition

Parallel Abstract

Corpora are fundamental to computational linguistics, but for a marginalized language such as Taiwanese, corpus collection is not as easy as it is for Chinese. This thesis explores using speech recognition technology to help collect Taiwanese text and speech corpora with various annotations. Given a Taiwanese sentence and its corresponding recorded speech, we might semi-automatically obtain its phonetic annotation and tone sandhi. This gives a total of four corpus contents: text, speech, phonetic annotation, and tone sandhi. We call this the Taiwanese-text-Taiwanese-speech (TTTS) problem. A similar setup is the Mandarin-text-Taiwanese-speech (MTTS) problem. In the MTTS case, in addition to the four corpus contents, we might also obtain Taiwanese-Mandarin parallel sentences. A parallel corpus is essential to research on Taiwanese-Mandarin translation. Since current automatic speech recognition systems are not yet perfect even for well-resourced languages such as English and Chinese, it is sensible to manipulate the recognition network to reduce its complexity. Using a TTTS corpus and an MTTS corpus, this thesis explores ways of constructing the recognition network on a per-sentence basis, both for Taiwanese text and for Mandarin text. Current hidden Markov model based speech recognition systems give two kinds of results: the best path in the recognition network, in the maximum-likelihood sense, and the occupation time of each syllable. These results can be used to spot possible errors in the corpus.
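A minimal sketch follows of how the two decoder outputs, the best path and the per-syllable occupation time, might be used to spot possible corpus errors. The abstract does not state the thesis's actual criteria. The duration bounds, the per-frame likelihood threshold, the 10 ms frame shift, and the names `AlignedSyllable` and `flag_suspicious` are all assumptions made for illustration.

```python
from dataclasses import dataclass

FRAME_SHIFT = 0.01  # seconds per frame; an assumed, commonly used value


@dataclass
class AlignedSyllable:
    label: str             # syllable on the best path
    start: float           # segment start time in seconds
    end: float             # segment end time in seconds
    log_likelihood: float  # total acoustic log-likelihood of the segment


def flag_suspicious(alignment, min_dur=0.05, max_dur=0.60, min_avg_ll=-80.0):
    """Return (label, reasons) for syllables with implausible duration or score.

    All thresholds are hypothetical and would need tuning on real data.
    """
    flagged = []
    for syl in alignment:
        duration = syl.end - syl.start
        frames = max(1, round(duration / FRAME_SHIFT))
        avg_ll = syl.log_likelihood / frames  # per-frame log-likelihood
        reasons = []
        if duration < min_dur:
            reasons.append("too short")
        if duration > max_dur:
            reasons.append("too long")
        if avg_ll < min_avg_ll:
            reasons.append("low likelihood")
        if reasons:
            flagged.append((syl.label, reasons))
    return flagged


if __name__ == "__main__":
    # Made-up alignment values, not taken from the thesis.
    alignment = [
        AlignedSyllable("tai5", 0.00, 0.22, -1300.0),
        AlignedSyllable("gi2", 0.22, 0.25, -200.0),
        AlignedSyllable("khoo3", 0.25, 1.10, -9000.0),
    ]
    for label, reasons in flag_suspicious(alignment):
        print(label, reasons)
```

Flagged syllables are only candidates for manual review: an implausible duration or a low score may point to a transcription error, a mispronunciation, or simply a misalignment.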

Parallel Keywords

No data


Cited By

林駿羽 (2014). 台語聲調辨識 [Taiwanese Tone Recognition] [Master's thesis, National Tsing Hua University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0016-2912201413492246
