

Robustly Learning from Noisy Labels in Text Classification Tasks

Advisor: 廖世偉

Abstract


Deep learning models have achieved excellent performance on tasks across many domains, but they require large amounts of accurately labeled training data. In practice, having domain experts annotate data precisely is extremely costly in both money and time. To reduce annotation costs, many alternative labeling methods have been proposed in recent years; while these methods lower the cost, they also introduce the risk of labeling errors. Prior research has shown that noisy labels degrade the generalization ability of deep neural networks. Learning from noisy labels focuses on training a robust model from data that contains annotation errors, and recent work in this area has achieved impressive results.

However, most of this work targets image classification; studies on natural language processing tasks such as text classification remain scarce. This thesis therefore proposes two methods that bring these techniques to text classification. First, by moving the Mixup mechanism from the data preprocessing stage into the model architecture, we successfully extend a state-of-the-art model in this field to text classification. Second, to examine whether the Mixup mechanism is actually necessary, we remove it and modify the loss function on unlabeled data to achieve the same goal. Our methods perform more stably on text classification than recent comparable work: across five experimental settings, they achieve the smallest average gap between best and last accuracy, and the highest average last accuracy, in four of the five settings.

English Abstract


Deep Neural Networks (DNNs) achieve remarkable success in many machine learning tasks thanks to massive amounts of carefully annotated data, which are time-consuming and expensive to obtain. In recent years, inexpensive alternative methods have been proposed to lower annotation costs, but they inevitably yield samples whose labels are corrupted from the ground truth. Recent research shows that such noisy labels hurt the generalization performance of DNNs, and several studies have been conducted to address this problem. However, most work on learning with noisy labels focuses on computer vision, while the corresponding progress in natural language processing remains limited. In this thesis, we propose two approaches that extend state-of-the-art methods for learning from noisy labels, previously applied only to image classification, to text classification. First, we migrate the mixup mechanism from data preprocessing into the model architecture; this approach achieves our goal and is more stable than recent work on text classification. Second, to study whether the mixup mechanism is necessary, we also remove it and modify the loss on unlabeled data, obtaining a slightly less stable result. In 4 of 5 settings, our proposed methods achieve the smallest average gap between the last and best accuracy, and the highest average last accuracy.
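The mixup mechanism referred to above can be illustrated with a minimal NumPy sketch. This shows standard mixup (convex combinations of sample pairs and their labels, with a Beta-distributed mixing coefficient) applied to dense representations, which is the usual adaptation for text since interpolating raw token ids is not meaningful. It is an illustration only, not the thesis's actual architecture-level implementation; the function name `mixup_batch` and the MixMatch-style `max(lam, 1 - lam)` choice are assumptions.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.75, rng=None):
    """Mixup: train on convex combinations of sample pairs and their labels.

    x: (batch, dim) dense features; for text, sentence or hidden-layer
       embeddings rather than raw token ids.
    y: (batch, num_classes) one-hot or soft labels.
    """
    rng = np.random.default_rng(rng)
    lam = rng.beta(alpha, alpha)      # mixing coefficient: lambda ~ Beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)         # keep the mix closer to the first sample
    perm = rng.permutation(len(x))    # random partner for each sample in the batch
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix
```

Because each mixed label is a convex combination of two label vectors, soft labels remain valid probability distributions, which is what lets the technique be moved from a preprocessing step into a layer of the model that operates on hidden representations.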

