
在Kubernetes集群中用於深度學習彈性訓練的GPU調度平台

Voda: A GPU Scheduling Platform for Elastic Deep Learning in Kubernetes Cluster

Advisor: 李哲榮

Abstract

Elastic deep learning, which dynamically adjusts the resource allocation of a group of training jobs, can effectively improve the utilization of hardware accelerators, which are essential for training large-scale deep learning models today. Although many scheduling algorithms for elastic training have been proposed, they lack an easy-to-use yet efficient platform on which to run. In this thesis, we present Voda, a GPU scheduling platform for elastic training. Unlike previous work that uses a parameter server for elastic training, Voda is designed for AllReduce communication, which is more effective but also more complicated to adjust. Voda is built on top of Kubernetes and consists of a set of loosely coupled components that collect run-time information, dynamically alter the resource allocation, and decide job placement based on the communication cost among the underlying GPUs in order to optimize job execution. On Voda, we compared four elastic scheduling algorithms, three existing ones and one newly proposed, under different workloads, job distributions, and arrival times. Experimental results show that no single algorithm dominates on all performance metrics, such as makespan, average job completion time, and average running time. However, some algorithms do work better than others for certain workloads and job distributions. The experiments also show that job placement is critical to performance on GPU clusters, and that the proposed job placement algorithm can effectively optimize the communication cost among the workers of a job.
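
The abstract names three performance metrics without defining them. The sketch below shows one common way to compute them from per-job timestamps. It is not taken from Voda: the `JobRecord` schema and the function names are hypothetical, and the definitions are assumptions (makespan as earliest arrival to latest completion, job completion time as completion minus arrival, running time as completion minus first start).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class JobRecord:
    """Timestamps of one finished training job, in seconds (hypothetical schema)."""
    arrival: float  # when the job was submitted
    start: float    # when the job first received GPUs
    finish: float   # when the job completed


def makespan(jobs: list[JobRecord]) -> float:
    """Assumed definition: time from the earliest arrival to the latest completion."""
    return max(j.finish for j in jobs) - min(j.arrival for j in jobs)


def average_job_completion_time(jobs: list[JobRecord]) -> float:
    """Assumed definition: mean of (finish - arrival), so queueing time is included."""
    return mean(j.finish - j.arrival for j in jobs)


def average_running_time(jobs: list[JobRecord]) -> float:
    """Assumed definition: mean of (finish - start), so queueing time is excluded."""
    return mean(j.finish - j.start for j in jobs)


# Illustrative records only; these are not measurements from the thesis.
example_jobs = [
    JobRecord(arrival=0, start=0, finish=100),
    JobRecord(arrival=10, start=20, finish=60),
    JobRecord(arrival=30, start=60, finish=150),
]
```

Under these definitions the metrics can pull in different directions: a schedule that shortens the makespan may leave short jobs queued for longer and so raise the average job completion time, which is consistent with the finding that no single algorithm dominates on every metric.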

