
運用卷積神經網路於聯邦式學習與Spark叢集運算之研究

Research on Federated Learning and Spark Cluster Computing using Convolutional Neural Network

Advisor: 張家瑋
This thesis will be available for download on 2026/02/13.

Abstract


In the era of big data, the volume of information keeps growing. To meet the demand for large-scale data storage and computation, distributed learning, in which multiple nodes jointly perform machine learning, has emerged as a new paradigm. As distributed learning technology developed, federated learning, which protects user privacy, grew out of the earlier Spark cluster computing paradigm. Spark cluster computing scatters data across different nodes for computation, whereas in federated learning each node trains on its own local data and uploads only the results to a server for aggregation. These two forms of distributed computing have complementary properties, and a preliminary proof of concept in a framework named Swarm has demonstrated the feasibility of combining them. This study therefore trains convolutional neural networks on the MNIST and CIFAR-10 datasets and compares the training speed and accuracy of centralized learning, Spark cluster computing, and the PySyft federated learning framework. The results show that the federated learning framework takes far longer than the other two, with Spark cluster computing being the fastest. Under federated learning, increasing the number of nodes has little effect on accuracy; however, when the dataset is partitioned into IID and Non-IID distributions, a Non-IID distribution severely degrades federated learning accuracy, while the same distribution does not affect accuracy under Spark cluster computing. Based on these observations, when sensitive data does not affect the training result, it can be removed and training can proceed with Spark cluster computing to improve speed; when training on sensitive data is unavoidable, federated learning must be used.
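The "train locally, upload only results to a server for aggregation" idea described above can be sketched as federated averaging in plain Python. This is a hypothetical minimal illustration, not the thesis's implementation: the study uses PySyft with CNNs on MNIST/CIFAR-10, while here each node runs one gradient step of a toy 1-D linear regression and the server averages the weights by sample count.

```python
# Minimal sketch of federated averaging (FedAvg): raw data never leaves a
# node; only the locally updated model weight reaches the server.

def local_step(w, data, lr=0.01):
    """One gradient step of 1-D linear regression y ~ w*x on a node's data."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def fed_avg(w, node_datasets, rounds=50):
    """Server loop: broadcast w, let each node train, average by sample count."""
    total = sum(len(d) for d in node_datasets)
    for _ in range(rounds):
        local_ws = [local_step(w, d) for d in node_datasets]
        w = sum(lw * len(d) for lw, d in zip(local_ws, node_datasets)) / total
    return w

# Three nodes whose private data all follow y = 3x.
nodes = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)], [(4.0, 12.0), (5.0, 15.0)]]
w = fed_avg(0.0, nodes)
print(round(w, 2))  # converges toward 3.0
```

Because the three toy nodes here share the same underlying distribution (an IID split), the averaged model converges to the global optimum; the Non-IID degradation reported in the abstract arises when the nodes' local distributions diverge.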

Abstract (English)


In the era of big data, the amount of information is increasing, and the demand for data storage and computing has grown dramatically. This led to distributed learning, a new paradigm that uses multiple nodes for machine learning. Building on Spark cluster computing, federated learning emerged as a technology that protects user privacy. Spark cluster computing scatters data across different nodes for computation, whereas in federated learning each user's data stays on its owner's node and is processed locally; only the results are uploaded to a server for aggregation. These two types of distributed computing have complementary properties, and a proof of concept in a framework called Swarm has demonstrated the possibility of combining them. In this study, we use the MNIST and CIFAR-10 datasets to train convolutional neural networks and compare the training speed and accuracy of centralized learning, Spark cluster computing, and the PySyft federated learning framework. We find that the federated learning framework takes much longer than the other two, and that Spark cluster computing is the fastest. Under federated learning, increasing the number of nodes has little impact on accuracy. However, when the dataset is split into IID and Non-IID distributions, a Non-IID distribution severely degrades federated learning accuracy, while the same distribution does not affect accuracy in Spark cluster computing. Thus, when sensitive data does not affect the training result, it can be removed and training can be performed with Spark cluster computing to improve speed; when training on sensitive data is unavoidable, federated learning should be used.
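The IID vs. Non-IID split that drives the accuracy gap can be illustrated with the common label-sorted sharding trick. This is a hypothetical sketch, not the thesis's exact partitioning code: the sample counts and node counts here are illustrative.

```python
# Sketch of IID vs. Non-IID data partitioning across federated nodes.
import random

random.seed(0)  # make the shuffle reproducible

def split_iid(samples, n_nodes):
    """Shuffle, then deal samples round-robin: every node sees most labels."""
    shuffled = samples[:]
    random.shuffle(shuffled)
    return [shuffled[i::n_nodes] for i in range(n_nodes)]

def split_non_iid(samples, n_nodes):
    """Sort by label, cut contiguous shards: each node sees only a few labels."""
    ordered = sorted(samples, key=lambda s: s[1])  # s = (features, label)
    shard = len(ordered) // n_nodes
    return [ordered[i * shard:(i + 1) * shard] for i in range(n_nodes)]

# Toy dataset: 100 samples over 10 labels, features elided as None.
data = [(None, label) for label in range(10) for _ in range(10)]

iid = split_iid(data, 5)
non_iid = split_non_iid(data, 5)
print(len({lbl for _, lbl in iid[0]}))      # typically close to 10 labels
print(len({lbl for _, lbl in non_iid[0]}))  # exactly 2 labels
```

Each Non-IID node trains on only two of the ten classes, so its local gradients pull the shared model toward those classes; averaging such conflicting updates is what degrades federated learning accuracy, while a Spark job that sees the full dataset is unaffected by how the partitions were cut.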

