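High-quality linguistic context information is a key part of speech synthesis. Traditionally, this context information depends on natural language processing (NLP), that is, on using a parser to analyze the text. However, parsers are difficult to design and cannot be tailored specifically to speech synthesis. We therefore aim to build an end-to-end speech synthesis system that takes characters directly as its processing unit. Under this idea, we instead use character-level word2vec and a recurrent neural network to convert the input character sequence directly into hidden feature vectors that serve as the context information for speech synthesis. Finally, we test this idea on a Chinese-English code-mixed speech synthesis system; the speech synthesis experiments show that the proposed approach indeed performs better than the traditional parser-based approach.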
High-quality linguistic features are the key to the success of speech synthesis. Traditional linguistic feature extraction methods usually rely on a word-level natural language processing (NLP) parser. Because a good parser requires a lot of feature engineering to build, it is usually a general-purpose one and is often not specially designed for speech synthesis. To avoid these difficulties, we propose to replace the conventional NLP parser with a character embedding and a character-level recurrent neural network language model (RNNLM) module that directly converts input character sequences, character by character, into latent linguistic feature vectors. Experimental results on a Chinese-English speech synthesis system showed that the proposed approach achieved performance comparable to that of traditional NLP parser-based methods.
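The core idea, mapping each input character to an embedding and passing the sequence through a recurrent network to obtain one latent feature vector per character, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `build_vocab` and `CharRNNEncoder`, the dimensions, and the simple Elman-style recurrence are all assumptions made for illustration, and the weights are random rather than trained as a language model.

```python
import math
import random


def build_vocab(texts):
    """Map every distinct character to an integer id (id 0 is reserved for unknowns)."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 1 for i, ch in enumerate(chars)}


class CharRNNEncoder:
    """Hypothetical, untrained Elman-style RNN over character embeddings."""

    def __init__(self, vocab, embed_dim=8, hidden_dim=16, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            # Small random weights; a real system would learn these from data.
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.vocab = vocab
        self.hidden_dim = hidden_dim
        self.embedding = matrix(len(vocab) + 1, embed_dim)  # row 0: unknown character
        self.w_in = matrix(hidden_dim, embed_dim)
        self.w_rec = matrix(hidden_dim, hidden_dim)
        self.bias = [0.0] * hidden_dim

    def encode(self, text):
        """Return one hidden-state vector per input character."""
        hidden = [0.0] * self.hidden_dim
        states = []
        for ch in text:
            x = self.embedding[self.vocab.get(ch, 0)]
            # h_t = tanh(W_in x_t + W_rec h_{t-1} + b)
            hidden = [
                math.tanh(
                    sum(w * xi for w, xi in zip(self.w_in[j], x))
                    + sum(w * hi for w, hi in zip(self.w_rec[j], hidden))
                    + self.bias[j]
                )
                for j in range(self.hidden_dim)
            ]
            states.append(hidden)
        return states


# Mixed Chinese-English input is handled uniformly, one character at a time.
vocab = build_vocab(["我想買iPhone", "今天天氣很好"])
encoder = CharRNNEncoder(vocab)
features = encoder.encode("我買iPad")
print(len(features), len(features[0]))  # 6 16
```

Because the unit is the character, no word segmentation or parsing step is needed before encoding, and Chinese and Latin characters share one vocabulary. In a full system, the per-character vectors would be passed on to the synthesis back end in place of parser-derived features.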