A GPU Scheduling Platform for Elastic Deep Learning Training in Kubernetes Clusters

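Elastic deep learning, which dynamically adjusts the resources allocated to a set of training jobs, can substantially improve the utilization of hardware accelerators, which is critical for training today's large-scale deep learning models. Although many scheduling algorithms for elastic training have been proposed, they lack an easy-to-use and efficient platform on which to run. In this thesis we introduce Voda, a GPU scheduling platform for elastic training. Unlike earlier work that relies on parameter servers for elastic training, Voda is designed for AllReduce communication, which is more efficient but also more complicated to adjust. Voda is built on Kubernetes and consists of a set of loosely coupled components that collect runtime information, dynamically change resource allocations, and decide job placement according to the communication cost between the underlying GPUs in order to optimize job execution. Using Voda, we compare four elastic scheduling algorithms, three existing ones and one newly proposed, under different workloads, job distributions, and arrival times. The experimental results show that no single algorithm dominates on every performance metric, such as makespan, average job completion time, or average running time. Some algorithms, however, do perform better than others for particular workloads and job distributions. The experiments also show that job placement is critical to the performance of a GPU cluster, and that the proposed job placement algorithm effectively optimizes the communication cost among a job's workers.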

Keywords

Deep learning platform; elastic training; cluster scheduling; distributed computing

Parallel Abstract

Elastic deep learning, which dynamically adjusts the resource allocation for a group of training jobs, can effectively enhance the utilization of accelerators, which are essential for training large-scale deep learning models today. Although many scheduling algorithms for elastic training have been proposed, there is no easy-to-use yet efficient platform on which to run them. In this thesis, we present Voda, a GPU scheduling platform for elastic training. Unlike previous work that uses parameter servers for elastic training, Voda is designed for AllReduce communication, which is more efficient but also more complicated to adjust. Voda is built on top of Kubernetes and consists of a set of loosely coupled components that collect run-time information, dynamically alter the resource allocation, and decide on a job placement that optimizes job execution based on the communication cost among the underlying GPUs. We compare four elastic scheduling algorithms on Voda, three existing methods and one newly proposed, under different workloads, job distributions, and arrival times. The experimental results show that no algorithm dominates on all performance metrics, such as makespan, average job completion time, or average running time. However, some algorithms do work better than others for certain workloads and job distributions. The experiments also show that job placement is critical to performance on GPU clusters, and that the proposed job placement algorithm can effectively optimize the communication cost among the different workers of a job.

Parallel Keywords

Deep Learning Platform; Elastic Training; Cluster Scheduling; Distributed Computing; Kubernetes
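
The abstract compares schedulers by makespan, average job completion time, and average running time without defining them. As a rough illustration only, the sketch below computes the three metrics from per-job timestamps using their common definitions. The `JobRecord` type and the `summarize` function are hypothetical names and are not part of Voda, and the thesis may define the metrics differently (for example, in how paused time of an elastic job is counted).

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Timestamps for one training job, in seconds (hypothetical schema)."""
    arrival: float  # when the job was submitted
    start: float    # when the job first began running
    finish: float   # when the job completed


def summarize(jobs):
    """Compute scheduling metrics under commonly used definitions (assumed here)."""
    if not jobs:
        raise ValueError("at least one job is required")
    n = len(jobs)
    # Makespan: time from the earliest arrival to the last completion.
    makespan = max(j.finish for j in jobs) - min(j.arrival for j in jobs)
    # Job completion time: finish minus arrival, so it includes queueing delay.
    avg_jct = sum(j.finish - j.arrival for j in jobs) / n
    # Running time: finish minus first start, so it excludes the initial wait.
    avg_running = sum(j.finish - j.start for j in jobs) / n
    return {
        "makespan": makespan,
        "avg_jct": avg_jct,
        "avg_running": avg_running,
    }


jobs = [
    JobRecord(arrival=0, start=0, finish=10),
    JobRecord(arrival=2, start=6, finish=12),
    JobRecord(arrival=4, start=12, finish=20),
]
metrics = summarize(jobs)
```

Under these definitions a scheduler can shorten the makespan while lengthening the average job completion time, or the reverse, which is one way that no single algorithm ends up best on every metric.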

Further Reading

童俊森 (2022). 基於GPU共享與零碎資源再利用的作業調度方法 [A job scheduling method based on GPU sharing and reuse of fragmented resources] [master's thesis, Chung Yuan Christian University]. Airiti Library. https://doi.org/10.6840/cycu202201566
Lo, Y. J. (2011). GPU平台下的高度平行化繞線演算法與其應用 [Highly parallel routing algorithms on GPU platforms and their applications] [master's thesis, National Chiao Tung University]. Airiti Library. https://doi.org/10.6842/NCTU.2011.00237
呂理樺 (2016). 使用GPU平行刪減三角網格演算法應用於五軸工具機 [Applying a GPU-parallel triangle mesh decimation algorithm to five-axis machine tools] [master's thesis, National Chung Cheng University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0033-2110201614060217
Li, H. Y. (2019). Elastic TensorFlow: A Novel Network Overlay Design and Implementation to Support Elastic Deep Learning Computing [master's thesis, National Tsing Hua University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0016-0206202016172446
Yeh, T. A. (2020). KubeShare: A Framework to Support GPUs as First-Class and Shared Resources in Container Clouds for Maximizing System Throughput and Resource Utilization [master's thesis, National Tsing Hua University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0016-2502202114211799
