Self-supervised learning (SSL) has become an important technique in deep learning in recent years. This training framework has achieved excellent results across domains such as computer vision, natural language processing, and speech. A speech signal carries both textual and non-textual information; the non-textual part mainly consists of speaker information, which in turn contains prosodic information. Although SSL models can already capture non-textual information, the mechanism behind this ability is not well understood.

This thesis takes speaker information and prosodic information as its two entry points. In the study of speaker information, we find that SSL models store speaker information in the output feature frames that correspond to the silent parts of the input speech signal, and our experiments show that this finding lets us improve existing SSL models without adding any computation time.

For prosodic information, using 15 speech SSL models and 3 prosody-related tasks, we verify that SSL models embed prosodic information in their speech features. The experiments further show that the models tend to store prosodic information in their earlier layers, and that SSL models can handle the prosodic information of languages unseen during pre-training.

In summary, this thesis provides experimental evidence of how SSL models handle non-textual information and, based on the observed mechanisms, offers concrete suggestions for improving these models.
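To make the two findings above more concrete, the following is a minimal sketch, assuming a generic SSL upstream that returns per-layer features (here replaced by random tensors as a stand-in). The helper names silent_frame_mask, pool_silent_frames, and build_probes, as well as the hop size and energy threshold, are illustrative choices and not the exact setup used in the thesis. The sketch shows (1) how features aligned with silent regions of the waveform could be pooled into an utterance-level summary, and (2) how a separate linear probe per layer can reveal which layers carry speaker or prosodic information.

```python
# Illustrative sketch only, not the thesis' implementation:
# (1) pool SSL features over frames aligned with silent input regions,
# (2) train one linear probe per layer to locate speaker/prosodic information.

import torch
import torch.nn as nn


def silent_frame_mask(wav: torch.Tensor, hop: int = 320, threshold: float = 1e-4) -> torch.Tensor:
    """Mark feature frames whose corresponding waveform chunk has near-zero energy."""
    n_frames = wav.shape[-1] // hop
    chunks = wav[..., : n_frames * hop].reshape(-1, n_frames, hop)
    energy = chunks.pow(2).mean(dim=-1)          # (batch, n_frames)
    return energy < threshold                    # True where the input is (almost) silent


def pool_silent_frames(layer_feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average a layer's features over silent frames only (denominator clamped to avoid /0)."""
    mask = mask[:, : layer_feats.shape[1]].unsqueeze(-1).float()
    denom = mask.sum(dim=1).clamp(min=1.0)
    return (layer_feats * mask).sum(dim=1) / denom


def build_probes(num_layers: int, feat_dim: int, num_classes: int) -> nn.ModuleList:
    """One tiny linear classifier per SSL layer, trained on a downstream label
    (speaker identity or a prosody-related target)."""
    return nn.ModuleList([nn.Linear(feat_dim, num_classes) for _ in range(num_layers)])


if __name__ == "__main__":
    # Stand-in for a real SSL upstream: 13 layers of 768-dim features at ~50 frames/s.
    wav = torch.randn(2, 16000) * torch.tensor([1.0, 0.0]).unsqueeze(-1)  # second sample is silent
    fake_layers = [torch.randn(2, 50, 768) for _ in range(13)]

    mask = silent_frame_mask(wav)
    probes = build_probes(num_layers=len(fake_layers), feat_dim=768, num_classes=10)
    for layer_idx, (feats, probe) in enumerate(zip(fake_layers, probes)):
        pooled = pool_silent_frames(feats, mask)   # (batch, 768) utterance-level summary
        logits = probe(pooled)                     # per-layer probe accuracy shows where info lives
        print(layer_idx, logits.shape)
```

Comparing probe accuracy across layers (and between silent-frame pooling and pooling over all frames) is one simple way to reproduce the kind of layer-wise analysis summarized above.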