
Building a Container-based Cloud Platform for Deep Learning from Service Deployment to Runtime Management

Advisor: 周志遠

Abstract


With the fast-growing trend of deep-learning-driven AI services over the past decade, deep learning, especially resource-intensive and time-consuming training jobs, has become one of the main workloads in today's production clusters. However, due to the complex workload characteristics of deep learning and the dynamic nature of shared resource environments, managing the resource allocation and execution lifecycle of distributed training jobs in a cluster is challenging. This work addresses these issues by developing and implementing a scheduling and scaling controller that dynamically manages distributed training jobs on a Kubernetes (K8s) cluster, a platform widely used for managing containerized workloads and services. Our proposed approach enhances K8s with three capabilities: (1) task-dependency-aware gang scheduling to avoid idle resources; (2) locality-aware task placement to minimize communication overhead; and (3) load-aware job scaling to improve cost efficiency. The approach is evaluated on a real testbed and in a simulator using a set of TensorFlow jobs. Compared to the default K8s scheduler, it improves resource utilization by 20% ∼ 30% and reduces job elapsed time by over 65%.
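To make the first two capabilities concrete, the following is a minimal, self-contained sketch of the gang-scheduling idea: a distributed training job is admitted only when every one of its tasks can be placed at once, so a partially started job never holds GPUs idle while waiting for its peers, and tasks of one job are packed onto as few nodes as possible to reduce cross-node communication. This is an illustration only, not the thesis's K8s controller; the node names, GPU counts, and the greedy most-free-GPUs placement policy are hypothetical simplifications.

```python
# Illustrative sketch (not the thesis implementation): gang scheduling with
# a simple locality-leaning placement. A job's tasks are placed all-or-nothing.

def gang_schedule(job_tasks, node_free_gpus):
    """Return a {task: node} placement if every task fits, else None.

    job_tasks: {task_name: gpus_needed}; node_free_gpus: {node: free_gpus}.
    node_free_gpus is updated only if the whole gang is admitted.
    """
    free = dict(node_free_gpus)            # tentative copy; commit on success
    placement = {}
    # Place the largest tasks first; always pick the node with the most free
    # GPUs, which tends to pack one job's tasks onto few nodes (locality).
    for task, demand in sorted(job_tasks.items(), key=lambda t: -t[1]):
        node = max(free, key=free.get)     # node with most free GPUs
        if free[node] < demand:
            return None                    # one task cannot fit -> reject gang
        free[node] -= demand
        placement[task] = node
    node_free_gpus.update(free)            # commit: all tasks fit
    return placement

nodes = {"node-a": 4, "node-b": 2}
# A 3-task job needing 2+2+1 GPUs is admitted only as a complete gang.
print(gang_schedule({"ps-0": 1, "worker-0": 2, "worker-1": 2}, nodes))
print(gang_schedule({"worker-0": 4}, nodes))  # rejected: no node has 4 free GPUs
```

A real controller would additionally watch job queues and per-job load to grow or shrink the number of workers (the thesis's third capability, load-aware scaling), which this sketch omits.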

