Learning from Imbalanced Data: From Traditional Sample Synthesis to Deep Learning Data Augmentation

From SMOTE to Mixup for Deep Imbalanced Classification

Advisor: 林軒田 (Hsuan-Tien Lin)

Abstract


When deep learning models are trained on data with an imbalanced class distribution, they typically overfit the minority classes. Traditionally, the synthetic minority oversampling technique (SMOTE) has been used to address this generalization problem. However, it is unclear whether this traditional oversampling technique remains effective with modern deep learning. In this thesis, we first examine why the original SMOTE is insufficient for modern deep learning methods, and then enhance it with soft labels to improve its performance within deep learning frameworks. The idea of soft labels leads us to Mixup, a data augmentation method commonly used in modern deep learning. A careful experimental study shows that Mixup improves the generalization of deep models by achieving uneven margins. Based on this finding, we propose Margin-Aware Mixup, a new method that further addresses the problems caused by training on imbalanced data. Large-scale experiments show that the proposed method matches or even surpasses state-of-the-art techniques, and achieves superior performance on extremely imbalanced datasets.
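To make the contrast between the two augmentation schemes concrete, the following minimal Python sketch (illustrative only, not code from the thesis; all function names and parameters are assumptions) interpolates a minority example toward a minority-class neighbor as SMOTE does while keeping the hard label, and mixes two arbitrary examples together with their one-hot labels as Mixup does, producing the soft labels mentioned in the abstract.

# Minimal sketch (not code from the thesis): SMOTE-style interpolation keeps a
# hard minority label, whereas Mixup also interpolates the labels (soft labels).
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(x_minority, k=5):
    """Interpolate a minority point toward one of its k nearest minority neighbors."""
    i = rng.integers(len(x_minority))
    d = np.linalg.norm(x_minority - x_minority[i], axis=1)   # distances to all minority points
    neighbors = np.argsort(d)[1:k + 1]                       # skip the point itself
    j = rng.choice(neighbors)
    lam = rng.uniform(0, 1)
    return x_minority[i] + lam * (x_minority[j] - x_minority[i])  # label stays the hard minority label

def mixup_sample(x1, y1_onehot, x2, y2_onehot, alpha=0.2):
    """Mix two arbitrary examples; the label is mixed the same way (soft label)."""
    lam = rng.beta(alpha, alpha)
    x_new = lam * x1 + (1 - lam) * x2
    y_new = lam * y1_onehot + (1 - lam) * y2_onehot
    return x_new, y_new

# Toy usage with 2-D features and a 2-class one-hot encoding.
x_min = rng.normal(size=(20, 2))                              # minority-class points
x_syn = smote_sample(x_min)                                   # synthetic minority point, hard label
x_mix, y_mix = mixup_sample(x_min[0], np.array([0., 1.]),
                            rng.normal(size=2), np.array([1., 0.]))
print(x_syn, x_mix, y_mix)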

Parallel Abstract


Given imbalanced data, it is hard to train a good classifier using deep learning because of the poor generalization of minority classes. Traditionally, the well-known synthetic minority oversampling technique (SMOTE) for data augmentation, a data mining approach for imbalanced learning, has been used to improve this generalization. However, it is unclear whether SMOTE also benefits deep learning. In this work, we study why the original SMOTE is insufficient for deep learning, and enhance SMOTE using soft labels. Connecting the resulting soft SMOTE with Mixup, a modern data augmentation technique, leads to a unified framework that puts traditional and modern data augmentation techniques under the same umbrella. A careful study within this framework shows that Mixup improves generalization by implicitly achieving uneven margins between majority and minority classes. We then propose a novel margin-aware Mixup technique that more explicitly achieves uneven margins. Extensive experimental results demonstrate that our proposed technique yields state-of-the-art performance on deep imbalanced classification while achieving superior performance on extremely imbalanced data.
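The "uneven margins" observation above can be illustrated with a class-dependent margin inside a softmax cross-entropy loss. The sketch below is an illustration of that idea only, not the Margin-Aware Mixup method proposed in the thesis: the true-class logit of a rarer class is reduced by a larger margin, so minority classes must be separated with more room to spare. The margin schedule proportional to n_c^(-1/4) and all names are hypothetical choices made for the example.

# Illustrative sketch of an uneven (class-dependent) margin, assuming a
# per-class margin that grows as the class gets rarer. This is NOT the
# thesis's Margin-Aware Mixup; it only shows how uneven margins can be
# encoded in a softmax cross-entropy loss.
import numpy as np

def class_margins(class_counts, scale=0.5):
    """Larger margin for rarer classes, here proportional to n_c^(-1/4)."""
    m = class_counts ** -0.25
    return scale * m / m.max()

def margin_cross_entropy(logits, y, margins):
    """Cross-entropy where the true-class logit is reduced by its class margin,
    forcing rarer classes to be classified with a larger margin."""
    z = logits.copy()
    z[np.arange(len(y)), y] -= margins[y]
    z -= z.max(axis=1, keepdims=True)                         # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-softmax
    return -log_p[np.arange(len(y)), y].mean()

# Toy usage: 3 classes with counts 1000 / 100 / 10.
counts = np.array([1000., 100., 10.])
margins = class_margins(counts)
logits = np.array([[2.0, 0.5, 0.1], [0.3, 1.5, 0.2]])
labels = np.array([0, 2])
print(margins, margin_cross_entropy(logits, labels, margins))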
