
Building a Container-based Cloud Platform for Deep Learning from Service Deployment to Runtime Management

Advisor: 周志遠

Abstract


With the fast-growing trend of deep-learning-driven AI services over the past decade, deep learning, especially resource-intensive and time-consuming training jobs, has become one of the main workloads in today's production clusters. However, due to the complex workload characteristics of deep learning and the dynamic nature of shared resource environments, managing the resource allocation and execution lifecycle of distributed training jobs in a cluster is challenging. This work addresses these issues by developing and implementing a scheduling and scaling controller that dynamically manages distributed training jobs on a Kubernetes (K8s) cluster, a platform widely used for managing containerized workloads and services. Our proposed approach enhances K8s with three capabilities: (1) task-dependency-aware gang scheduling to avoid idle resources; (2) locality-aware task placement to minimize communication overhead; and (3) load-aware job scaling to improve cost efficiency. The approach is evaluated on a real testbed and in a simulator using a set of TensorFlow jobs. Compared to the default K8s scheduler, it improves resource utilization by 20% ∼ 30% and reduces job elapsed time by over 65%.
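To make the first two capabilities concrete, the following is a minimal, self-contained sketch of the gang-scheduling idea: a distributed training job is admitted only when every one of its tasks can be placed at once, so a partially started job never holds GPUs idle while waiting for its peers, and tasks of one job are packed onto as few nodes as possible to reduce cross-node communication. This is an illustration only, not the thesis's K8s controller; the node names, GPU counts, and the greedy most-free-GPUs placement policy are hypothetical simplifications.

```python
# Illustrative sketch (not the thesis implementation): gang scheduling with
# a simple locality-leaning placement. A job's tasks are placed all-or-nothing.

def gang_schedule(job_tasks, node_free_gpus):
    """Return a {task: node} placement if every task fits, else None.

    job_tasks: {task_name: gpus_needed}; node_free_gpus: {node: free_gpus}.
    node_free_gpus is updated only if the whole gang is admitted.
    """
    free = dict(node_free_gpus)            # tentative copy; commit on success
    placement = {}
    # Place the largest tasks first; always pick the node with the most free
    # GPUs, which tends to pack one job's tasks onto few nodes (locality).
    for task, demand in sorted(job_tasks.items(), key=lambda t: -t[1]):
        node = max(free, key=free.get)     # node with most free GPUs
        if free[node] < demand:
            return None                    # one task cannot fit -> reject gang
        free[node] -= demand
        placement[task] = node
    node_free_gpus.update(free)            # commit: all tasks fit
    return placement

nodes = {"node-a": 4, "node-b": 2}
# A 3-task job needing 2+2+1 GPUs is admitted only as a complete gang.
print(gang_schedule({"ps-0": 1, "worker-0": 2, "worker-1": 2}, nodes))
print(gang_schedule({"worker-0": 4}, nodes))  # rejected: no node has 4 free GPUs
```

A real controller would additionally watch job queues and per-job load to grow or shrink the number of workers (the thesis's third capability, load-aware scaling), which this sketch omits.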

