
A Job Scheduling Method Based on GPU Sharing and the Reuse of Fragmented Resources

A Heuristic Approach with Fine-Grained Scheduling Based on GPU Sharing for Deep Learning Jobs

Advisor: 鍾武君
The full text of this thesis will be available for download on 2027/08/30.

Abstract


Current scheduling methods for multiple deep learning training jobs sharing a GPU cluster rarely address the scheduling design of GPU sharing, and algorithms that depend on performance prediction models suffer from system overhead. Moreover, state-of-the-art algorithms cannot schedule jobs at a fine-grained level, so idle GPU resources go underutilized; existing solutions therefore still leave room for improvement. Building on a suspend-and-resume mechanism that preserves model training state and supports migration, this thesis proposes a lightweight sampling-and-analysis method to predict the completion time of each job, and, under the premise of GPU sharing, resolves the starvation of large jobs caused by massive submissions of heterogeneous jobs, thereby reusing fragmented resources. Based on traces from real Microsoft Philly clusters, benchmark data obtained with the TF-Slim tool, and a simulated deep learning training environment, this thesis evaluates the average GPU utilization and job time of four image classification models. The experiments use three random seeds to generate 100 simulated jobs, arriving either at one-second intervals or according to a Poisson distribution, and compare two methods without GPU sharing against five methods based on GPU-sharing techniques. Simulation results show that, compared with sequential scheduling without GPU sharing, the proposed method improves resource utilization by about 4.1 times and reduces the makespan by about 3.6 times.
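To make the lightweight sampling idea concrete, here is a minimal Python sketch, not the thesis's actual implementation: a job runs a short warm-up, a few iterations are timed, and the measured per-iteration cost is extrapolated over the remaining iterations. The names estimate_completion_time, train_step, warmup, and sample_iters are hypothetical.

    import time

    def estimate_completion_time(train_step, total_iters, warmup=5, sample_iters=20):
        # Warm-up absorbs one-time costs (graph construction, memory
        # allocation) so they do not skew the per-iteration estimate.
        for _ in range(warmup):
            train_step()
        start = time.perf_counter()
        for _ in range(sample_iters):
            train_step()
        per_iter = (time.perf_counter() - start) / sample_iters
        # Extrapolate the measured cost over the remaining iterations.
        return per_iter * (total_iters - warmup - sample_iters)

    # Example with a dummy step standing in for a real training iteration.
    eta_seconds = estimate_completion_time(lambda: time.sleep(0.01), total_iters=10_000)

Because only a few dozen iterations are measured, rather than a full performance model being trained, such a predictor adds negligible overhead to each job.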

Parallel Abstract


Current scheduling methods for multiple Deep Learning Training (DLT) jobs on GPU clusters rarely discuss the scheduling design of GPU sharing, and algorithms that rely on performance prediction models incur system overhead. Additionally, current approaches cannot schedule resources for DLT jobs at a fine-grained level, which prevents the cluster from effectively utilizing idle GPU resources. As a result, existing solutions still have room for improvement. Based on suspend-and-resume mechanisms, this thesis proposes a lightweight sampling and analysis method to predict the completion time of DLT jobs. Under the premise of GPU sharing, it also solves the starvation problem that large jobs suffer when many heterogeneous jobs are submitted, thereby reusing resource fragments. Experiments are simulated using traces collected from real Microsoft Philly clusters and benchmark data obtained with the TF-Slim tool. Three random seeds are used to randomly generate 100 jobs for the simulation. Performance is compared against two methods without GPU sharing and five methods based on GPU sharing, under both one-second arrival intervals and Poisson-distributed arrivals. Results show that our approach improves resource utilization by around 4.1 times and reduces completion time by around 3.6 times compared to sequential scheduling without GPU sharing.
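As a rough illustration of the two arrival patterns used in the simulation, the sketch below generates 100 job arrival times at a fixed one-second interval and under a Poisson process with three seeds; the rate value of 0.5 jobs per second and the function names are assumptions for illustration only.

    import numpy as np

    def poisson_arrivals(n_jobs=100, rate=0.5, seed=0):
        # A Poisson arrival process has exponentially distributed
        # inter-arrival gaps; a cumulative sum yields arrival times.
        rng = np.random.default_rng(seed)
        return np.cumsum(rng.exponential(scale=1.0 / rate, size=n_jobs))

    def fixed_arrivals(n_jobs=100, interval=1.0):
        # Jobs arriving back to back at one-second intervals.
        return np.arange(n_jobs) * interval

    # Three seeds mirror the three randomized trials in the experiments.
    traces = {s: poisson_arrivals(seed=s) for s in (0, 1, 2)}

Feeding the same seeded traces to every scheduler under test keeps the comparison between the GPU-sharing and non-sharing methods fair.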

