The Imbalanced Data Problem: Deep Discriminative Feature Learning and Sampling

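The imbalanced data problem arises in many application domains and is regarded as a difficult and challenging problem in machine learning and data mining. Common remedies are oversampling and undersampling: oversampling may lead to overfitting, while undersampling may discard representative data samples. Moreover, most synthetic-data resampling methods learn only from the minority class and ignore the distribution and characteristics of the majority class. We design and propose a new algorithmic framework that combines feature embedding learning from deep learning with discriminative loss functions to generate synthetic data. In contrast to previous work, the proposed method considers both the majority and minority classes when learning feature embeddings and uses appropriate loss functions to make the embeddings as discriminative as possible. Because the proposed method is a framework, different feature extractors can be plugged in, so it can be applied to different domains and even to different data types. Our experiments cover binary classification on eight numerical datasets and multiclass classification on one image dataset. The results show that the proposed method achieves significant improvements over previous methods and produces more stable results. In addition, we conduct a thorough experimental study of the framework and use visualization techniques to examine why the proposed method generates better synthetic data samples.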

Keywords

Imbalanced data problem; synthetic data samples; feature embedding space; center loss; triplet loss

Parallel Abstract

The imbalanced data problem occurs in many application domains and is considered a challenging problem in machine learning and data mining. Oversampling may lead to overfitting, while undersampling may discard representative data samples. Additionally, most resampling methods that generate synthetic data focus on the minority class without considering the data distribution of the majority classes. This paper presents an algorithm that combines feature embedding with loss functions from discriminative feature learning in deep learning to generate synthetic data samples. In contrast to previous work, the proposed method considers both majority and minority classes when learning feature embeddings and uses appropriate loss functions to make the feature embeddings as discriminative as possible. The proposed method is a general framework, so different feature extractors can be used for different domains and data types. We conduct experiments on eight numerical datasets for binary classification and on one image dataset for multiclass classification. The experimental results indicate that the proposed method improves on previous methods and gives more stable results. Additionally, we thoroughly investigate the proposed framework and use a visualization technique to examine why it generates good synthetic data samples.
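
The keywords name center loss and triplet loss as the discriminative losses. The abstract does not give the exact formulation used, so the following is only a minimal sketch of the two losses in their commonly cited forms, written with NumPy. The function names, the use of per-batch class means as centers (rather than learned centers), and the default margin are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def center_loss(embeddings, labels):
    """Half the squared distance of each embedding to its class center,
    averaged over all samples.

    Assumption: centers are the per-class means of the given batch,
    a simplification of centers that are learned during training.
    """
    total = 0.0
    for c in np.unique(labels):
        members = embeddings[labels == c]
        center = members.mean(axis=0)
        total += 0.5 * np.sum((members - center) ** 2)
    return total / len(embeddings)


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on squared distances: pull the positive closer to the anchor
    than the negative by at least `margin` (hypothetical default 1.0)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


# Toy 2-D embeddings: two majority-class points and one minority-class point.
toy_embeddings = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0]])
toy_labels = np.array([0, 0, 1])

toy_center = center_loss(toy_embeddings, toy_labels)

# The negative is far from the anchor, so the hinge is inactive.
toy_triplet_far = triplet_loss(toy_embeddings[0], toy_embeddings[1],
                               toy_embeddings[2])

# A negative close to the anchor violates the margin and is penalized.
toy_triplet_near = triplet_loss(toy_embeddings[0], toy_embeddings[1],
                                np.array([1.0, 0.0]))
```

Both losses push embeddings of the same class together, and the triplet loss additionally pushes different classes apart, which is the sense in which the abstract calls the learned embedding discriminative.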