
Investigating Positional Encoding in Pre-Trained Language Models

Advisor: Yun-Nung Chen (陳縕儂)

Abstract


Since the introduction of the Transformer model, Transformer-based pre-trained language models have achieved continual breakthroughs on a wide range of natural language tasks. Position information is an indispensable component of the Transformer, yet recent research on pre-trained Transformers has largely focused on designing pre-training objectives and improving the attention mechanism, and rarely examines the influence of positional encoding in depth. In the first two parts of this thesis, we conduct an empirical study of the positional encodings in mainstream pre-trained Transformers, investigating two questions: 1) do the encoding functions learned by pre-trained position embeddings capture correct position information, and 2) how do the position embeddings of different pre-trained Transformers affect performance on natural language tasks? Through feature-level analysis and empirical experiments on multiple natural language tasks, we offer new perspectives on pre-trained position embeddings, which can help future Transformer research design more suitable positional encoding functions. In the third part of this thesis, we propose a turn-aware dialogue Transformer for task-oriented dialogue systems. By learning a new dialogue-turn embedding, the turn-aware dialogue Transformer addresses the lack of long-term position information in masked language models and achieves a clear improvement on dialogue state tracking.
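For readers unfamiliar with the two families of positional encoding discussed above, the following is a minimal sketch (PyTorch, with illustrative sizes) contrasting the fixed sinusoidal encoding of the original Transformer with the learned absolute position embeddings used by BERT-style pre-trained models. It is an illustration only, not code from the thesis.

```python
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal encoding from the original Transformer:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    position = torch.arange(max_len).unsqueeze(1).float()          # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float()
        * (-torch.log(torch.tensor(10000.0)) / d_model)
    )                                                               # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Learned position embeddings, as used by BERT-style pre-trained models:
# a trainable lookup table indexed by absolute position.
learned_pe = nn.Embedding(num_embeddings=512, embedding_dim=768)

# Both map a position index to a d_model-dimensional vector that is added
# to the token embedding before the first self-attention layer.
fixed = sinusoidal_positional_encoding(max_len=512, d_model=768)
positions = torch.arange(16)               # positions of a 16-token input
token_pos_fixed = fixed[positions]         # (16, 768), fixed encoding
token_pos_learned = learned_pe(positions)  # (16, 768), learned embedding
```

The thesis studies the latter kind: what the vectors in the trained lookup table actually encode after pre-training, and how they affect downstream tasks.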

Abstract (English)


In recent years, pre-trained Transformers have dominated the majority of NLP benchmark tasks. Many variants of pre-trained Transformers have emerged, most of them focused on designing different pre-training objectives or variants of self-attention. Embedding position information in the self-attention mechanism is also an indispensable part of a Transformer, yet it is rarely discussed in depth. In the first and second parts of this thesis, we carry out an empirical study on the position embeddings of mainstream pre-trained Transformers, focusing on two questions: 1) Do position embeddings really learn the meaning of positions? 2) How do these different learned position embeddings affect Transformers on NLP tasks? In these two parts, we provide new insights into pre-trained position embeddings through feature-level analysis and empirical experiments on a range of representative NLP tasks. We believe our experimental results can guide future work in choosing a suitable positional encoding function for a specific task, given its application properties. In the third part of this thesis, we propose a Turn-Aware Dialogue Transformer for task-oriented dialogue. Our method embeds dialogue turn information in masked language models to address the lack of long-term position information. With the proposed method, the model learns better position information in long dialogues and achieves a significant improvement on the MWOZ 2.1 dialogue state tracking dataset.
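The abstract does not specify how the dialogue turn information is injected into the masked language model. The sketch below is a hypothetical illustration of one natural reading, in which a learned turn embedding is added to the usual token and position embeddings at the input layer; the class name, sizes, and layer normalization are assumptions, not details from the thesis.

```python
import torch
import torch.nn as nn

class TurnAwareInputEmbedding(nn.Module):
    """Hypothetical input layer: adds a per-turn embedding to the usual
    token + position embeddings of a masked language model, so that tokens
    from different dialogue turns receive distinct long-range position
    signals. All sizes are illustrative, not taken from the thesis."""

    def __init__(self, vocab_size=30522, max_len=512, max_turns=32, d_model=768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.turn_emb = nn.Embedding(max_turns, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, token_ids: torch.Tensor, turn_ids: torch.Tensor) -> torch.Tensor:
        # token_ids, turn_ids: (batch, seq_len); turn_ids[i, j] is the dialogue
        # turn index of token j, e.g. 0 for the first user turn, 1 for the reply.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions) + self.turn_emb(turn_ids)
        return self.norm(x)

# Example: a 2-turn dialogue of 8 tokens, the first 5 tokens from turn 0.
tokens = torch.randint(0, 30522, (1, 8))
turns = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1]])
embeddings = TurnAwareInputEmbedding()(tokens, turns)   # (1, 8, 768)
```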

