
Improving of Segmental LMR-Mapping Based Voice Conversion Method

Abstract


Spectral over-smoothing is still observable in the converted spectral envelope when spectrum mapping based on linear multivariate regression (LMR) is used for voice conversion. In this paper, we therefore study placing a histogram equalization (HEQ) module immediately before segmental LMR-based mapping and a target frame selection (TFS) module immediately after it, with the aim of improving the quality of the converted voice. HEQ processing consists of two steps: (a) transforming discrete cepstral coefficients (DCC) into principal component analysis (PCA) coefficients, and (b) transforming the PCA coefficients into cumulative distribution function (CDF) coefficients. For TFS, an input frame is first processed to obtain its converted DCC vector and its segment-class number. The group of target-speaker frames collected for the same segment class is then searched for a target frame whose DCC vector is close to the converted one, and the converted DCC vector is replaced by the DCC vector of the target frame found, which avoids over-smoothing of the spectral envelope. In the experimental evaluation, outside parallel sentences (not used in model-parameter training) are used to measure the average cepstral distance (ACD) between the converted DCC and the target DCC. Adding the HEQ module increases the ACD slightly, and adding the TFS module increases it considerably more. Nevertheless, the measured variance ratio (VR) values and the scores of subjective listening tests point in the opposite direction: the quality of the converted voice improves somewhat when HEQ is added and improves markedly when TFS is added. We looked for the cause of this inconsistency between the measured ACD values and the perceived voice quality, and one possible explanation is given in the body of the paper.
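The CDF step of HEQ, the TFS search, and the ACD measurement described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the distance measure, the exact selection rule, or the CDF estimator, so the sketch assumes Euclidean distance, picks the single nearest target frame, and uses a rank-based empirical CDF. All function names and the toy data are hypothetical, and the PCA step is omitted.

```python
import math
from bisect import bisect_right


def empirical_cdf(training_values):
    """Build a rank-based empirical CDF from training values of one coefficient.

    Assumption: the paper's CDF estimator is not given in the abstract.
    """
    sorted_vals = sorted(training_values)
    n = len(sorted_vals)

    def cdf(x):
        # Count of training values <= x, offset so the result stays in (0, 1).
        rank = bisect_right(sorted_vals, x)
        return (rank + 0.5) / (n + 1)

    return cdf


def euclidean(a, b):
    """Euclidean distance between two DCC vectors (assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_target_frame(converted_dcc, segment_class, target_frames_by_class):
    """Replace an LMR-mapped DCC vector by the nearest target-speaker frame
    collected for the same segment class.

    Assumption: "close" is taken to mean the nearest frame. If no frames were
    collected for the class, the converted vector is returned unchanged.
    """
    candidates = target_frames_by_class.get(segment_class)
    if not candidates:
        return converted_dcc
    return min(candidates, key=lambda frame: euclidean(frame, converted_dcc))


def average_cepstral_distance(converted_frames, target_frames):
    """Mean per-frame distance between converted and target DCC vectors
    (assumed definition of ACD)."""
    total = sum(euclidean(c, t) for c, t in zip(converted_frames, target_frames))
    return total / len(converted_frames)


# Toy data: two segment classes with a few 2-dimensional "DCC" vectors.
target_frames_by_class = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[5.0, 5.0]]}
selected = select_target_frame([0.6, 0.3], 0, target_frames_by_class)
print(selected)
```

Because TFS returns an actual target-speaker frame rather than a regression output, the selected vector can lie farther from the true parallel target frame than the smoothed LMR output does, while keeping the natural variance of real frames.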

