
運用卷積神經網路於聯邦式學習與Spark叢集運算之研究

Research on Federated Learning and Spark Cluster Computing using Convolutional Neural Network

Advisor: 張家瑋
This thesis will be available for download on 2026/02/13.

Abstract


In the era of big data, the volume of information keeps growing. To meet the demand for large-scale data storage and computation, distributed learning, in which multiple nodes jointly perform machine learning, has emerged as a new paradigm. As distributed learning technology developed, federated learning, which protects user privacy, grew out of the earlier Spark cluster computing paradigm. Spark cluster computing scatters data across different nodes for computation, whereas in federated learning each node trains on its own local data and uploads only the results to a server for aggregation. These two forms of distributed computing have complementary properties, and a preliminary proof of concept in a framework named Swarm has demonstrated the feasibility of combining them. This study therefore trains convolutional neural networks on the MNIST and CIFAR-10 datasets and compares the training speed and accuracy of centralized learning, Spark cluster computing, and the PySyft federated learning framework. The results show that the federated learning framework takes far longer than the other two, with Spark cluster computing being the fastest. Under federated learning, increasing the number of nodes has little effect on accuracy; however, when the dataset is partitioned into IID and Non-IID distributions, a Non-IID distribution severely degrades federated learning accuracy, while the same distribution does not affect accuracy under Spark cluster computing. Based on these observations, when sensitive data does not affect the training result, it can be removed and training can proceed with Spark cluster computing to improve speed; when training on sensitive data is unavoidable, federated learning must be used.
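The "train locally, upload only results to a server for aggregation" idea described above can be sketched as federated averaging in plain Python. This is a hypothetical minimal illustration, not the thesis's implementation: the study uses PySyft with CNNs on MNIST/CIFAR-10, while here each node runs one gradient step of a toy 1-D linear regression and the server averages the weights by sample count.

```python
# Minimal sketch of federated averaging (FedAvg): raw data never leaves a
# node; only the locally updated model weight reaches the server.

def local_step(w, data, lr=0.01):
    """One gradient step of 1-D linear regression y ~ w*x on a node's data."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def fed_avg(w, node_datasets, rounds=50):
    """Server loop: broadcast w, let each node train, average by sample count."""
    total = sum(len(d) for d in node_datasets)
    for _ in range(rounds):
        local_ws = [local_step(w, d) for d in node_datasets]
        w = sum(lw * len(d) for lw, d in zip(local_ws, node_datasets)) / total
    return w

# Three nodes whose private data all follow y = 3x.
nodes = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)], [(4.0, 12.0), (5.0, 15.0)]]
w = fed_avg(0.0, nodes)
print(round(w, 2))  # converges toward 3.0
```

Because the three toy nodes here share the same underlying distribution (an IID split), the averaged model converges to the global optimum; the Non-IID degradation reported in the abstract arises when the nodes' local distributions diverge.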

Abstract (English)


In the era of big data, the amount of information is increasing, and the demand for data storage and computing has grown dramatically. This led to distributed learning, a new paradigm that uses multiple nodes for machine learning. Building on Spark cluster computing, federated learning emerged as a technology that protects user privacy. Spark cluster computing scatters data across different nodes for computation, whereas in federated learning each user's data stays on its owner's node and is processed locally; only the results are uploaded to a server for aggregation. These two types of distributed computing have complementary properties, and a proof of concept in a framework called Swarm has demonstrated the possibility of combining them. In this study, we use the MNIST and CIFAR-10 datasets to train convolutional neural networks and compare the training speed and accuracy of centralized learning, Spark cluster computing, and the PySyft federated learning framework. We find that the federated learning framework takes much longer than the other two, and that Spark cluster computing is the fastest. Under federated learning, increasing the number of nodes has little impact on accuracy. However, when the dataset is split into IID and Non-IID distributions, a Non-IID distribution severely degrades federated learning accuracy, while the same distribution does not affect accuracy in Spark cluster computing. Thus, when sensitive data does not affect the training result, it can be removed and training can be performed with Spark cluster computing to improve speed; when training on sensitive data is unavoidable, federated learning should be used.
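The IID vs. Non-IID split that drives the accuracy gap can be illustrated with the common label-sorted sharding trick. This is a hypothetical sketch, not the thesis's exact partitioning code: the sample counts and node counts here are illustrative.

```python
# Sketch of IID vs. Non-IID data partitioning across federated nodes.
import random

random.seed(0)  # make the shuffle reproducible

def split_iid(samples, n_nodes):
    """Shuffle, then deal samples round-robin: every node sees most labels."""
    shuffled = samples[:]
    random.shuffle(shuffled)
    return [shuffled[i::n_nodes] for i in range(n_nodes)]

def split_non_iid(samples, n_nodes):
    """Sort by label, cut contiguous shards: each node sees only a few labels."""
    ordered = sorted(samples, key=lambda s: s[1])  # s = (features, label)
    shard = len(ordered) // n_nodes
    return [ordered[i * shard:(i + 1) * shard] for i in range(n_nodes)]

# Toy dataset: 100 samples over 10 labels, features elided as None.
data = [(None, label) for label in range(10) for _ in range(10)]

iid = split_iid(data, 5)
non_iid = split_non_iid(data, 5)
print(len({lbl for _, lbl in iid[0]}))      # typically close to 10 labels
print(len({lbl for _, lbl in non_iid[0]}))  # exactly 2 labels
```

Each Non-IID node trains on only two of the ten classes, so its local gradients pull the shared model toward those classes; averaging such conflicting updates is what degrades federated learning accuracy, while a Spark job that sees the full dataset is unaffected by how the partitions were cut.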

