
基於共享模型改善資料不平衡之聯邦學習

Federated Learning with Enhanced Data Imbalance via Shared Model

Advisor: 鍾武君
This thesis will be available for download on 2028/08/29.

Abstract


In recent years, as research on deep learning and privacy has grown, publicly available datasets from major platforms have significantly lowered the barrier to entry for research. Because of privacy-protection requirements, client data in traditional machine-learning pipelines cannot easily be transferred across clients for centralized training. To address this, the federated learning framework, introduced in 2016, enables collaborative training without transmitting raw data, avoiding the privacy risks of direct data transfer. However, data distributions across clients are often imbalanced, which directly degrades training efficiency. This thesis proposes FedISM, a new federated learning method that addresses data imbalance in practical applications through shared-model training and a novel data assessment mechanism. The study first improves the training procedure for shared data, proposing an innovative federated learning workflow that reduces reliance on transmitting shared data. It then examines how, once shared data is removed, the data assessment mechanism can effectively select the most suitable client candidates for shared-model training. The study further highlights the importance of identifying a "balance" indicator in the data distribution for the shared model; experiments confirm that this assessment mechanism improves the training efficiency of the shared model. Experimental results show that even with only 5% shared data, shared-model training significantly improves accuracy under extreme Non-IID data distributions: on the CIFAR-10 dataset, accuracy rises by 40%, and on the COVID-19 dataset, by 25%. Moreover, even without the shared-data assumption, the shared model combined with the data assessment mechanism still improves accuracy under imbalanced Dirichlet distributions, gaining 6% to 7% on CIFAR-10 and roughly 4% to 6% on COVID-19. FedISM thus avoids sharing any raw data while providing a data assessment mechanism, substantially mitigating the accuracy loss caused by data imbalance.
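The federated setting described above — each client trains locally and only model parameters travel to the server — can be sketched as a FedAvg-style communication round. This is a minimal illustration only: the abstract does not specify FedISM's shared-model or aggregation details, so the logistic-regression clients, function names, and hyperparameters below are all hypothetical.

```python
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    """One client's local training: a few epochs of logistic-regression
    gradient descent. Raw data (X, y) never leaves the client; only the
    updated weight vector is returned to the server."""
    w = w.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)      # average-gradient step
    return w

def fedavg_round(w_global, clients):
    """Server side: collect client updates and average them,
    weighted by each client's local dataset size."""
    updates = [(local_update(w_global, X, y), len(y)) for X, y in clients]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

rng = np.random.default_rng(0)
# Two hypothetical clients with differently distributed toy data.
clients = [
    (rng.normal(0, 1, (50, 3)), (rng.random(50) > 0.5).astype(float)),
    (rng.normal(1, 1, (80, 3)), (rng.random(80) > 0.3).astype(float)),
]
w = np.zeros(3)
for _ in range(10):                            # ten communication rounds
    w = fedavg_round(w, clients)
```

Size-weighted averaging is the standard FedAvg choice; under the imbalanced distributions discussed in this thesis, plain averaging is exactly where the degradation the abstract describes tends to appear.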

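The "balance" indicator and the Dirichlet-partitioned Non-IID setting mentioned in the abstract can be illustrated together. The thesis's actual assessment formula is not given here, so the normalized label-entropy score below is only one plausible stand-in; the Dirichlet split mirrors the common way Non-IID client data is simulated in federated-learning experiments.

```python
import numpy as np

def balance_score(labels, num_classes):
    """Normalized Shannon entropy of a client's label histogram:
    1.0 = perfectly balanced classes, 0.0 = a single class only.
    (Illustrative stand-in; FedISM's actual metric is not specified.)"""
    counts = np.bincount(labels, minlength=num_classes)
    if counts.sum() == 0:
        return 0.0
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(num_classes))

def dirichlet_partition(labels, num_clients, alpha, rng):
    """Split sample indices across clients using per-class Dirichlet
    proportions. Smaller alpha -> more extreme Non-IID imbalance."""
    num_classes = labels.max() + 1
    client_idx = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for cid, part in enumerate(np.split(idx, cuts)):
            client_idx[cid].extend(part.tolist())
    return client_idx

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, 5000)             # 10-class toy labels
parts = dirichlet_partition(labels, num_clients=8, alpha=0.1, rng=rng)
scores = [balance_score(labels[idx], 10) for idx in parts]
# Rank clients by balance; the most balanced clients would be the
# natural candidates for shared-model training under this heuristic.
ranking = np.argsort(scores)[::-1]
```

With a small concentration parameter (alpha = 0.1 here), most clients end up dominated by a few classes, which is the imbalanced Dirichlet regime in which the abstract reports FedISM's 4% to 7% accuracy gains.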

