Containers have emerged as a new technology that replaces traditional virtual machines for deploying and operating distributed applications in the cloud. As cloud computing continues to grow, GPU-dependent workloads such as deep learning and high-performance computing applications have also moved to the cloud, and efficiently allocating GPU resources has become an important issue. GPU virtualization for virtual machines is mature, but GPU virtualization for containers has received far less attention. Sharing a GPU among multiple containers on today's cloud platforms still faces many unresolved problems, and exclusive GPU assignment wastes GPU resources and lowers utilization for applications that cannot fully use a GPU on their own. To overcome these issues, we designed and implemented KubeShare, which extends Kubernetes to support GPU sharing among containers with fine-grained resource allocation; KubeShare is also the first solution that makes GPUs first-class resources in Kubernetes. Using real deep learning workloads, we show that KubeShare significantly improves GPU utilization, roughly doubles overall system throughput, and incurs less than 10% performance overhead during container initialization and execution.
Containers have emerged as a new technology in clouds to replace virtual machines (VMs) for distributed application deployment and operation. As an increasing number of new cloud applications, such as deep learning and high-performance computing workloads, rely on the high computing throughput of GPUs, efficiently supporting GPUs in container clouds becomes essential. While GPU virtualization has been extensively studied for VMs, limited work has been done for containers. One of the key challenges is the lack of support for GPU sharing among multiple concurrent containers. This limitation leads to low resource utilization when a GPU device cannot be fully utilized by a single application due to the burstiness of GPU workloads and limited memory bandwidth. To overcome this issue, we designed and implemented KubeShare, which extends Kubernetes to enable GPU sharing with fine-grained allocation. KubeShare is the first solution for Kubernetes that makes GPU devices first-class resources for scheduling and allocation. Using real deep learning workloads, we demonstrate that KubeShare significantly increases GPU utilization and improves overall system throughput by around 2x, with less than 10% performance overhead during container initialization and execution.
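To make the idea of fine-grained GPU allocation concrete, the sketch below shows, in Go, how a fractional GPU request might be attached to a pod-like specification so that two containers can share one physical device. This is only an illustration of the concept described in the abstract; the type and field names (GPURequest, SharedPodSpec, ComputeFraction) are hypothetical and are not the actual KubeShare or Kubernetes API.

```go
package main

import "fmt"

// Hypothetical sketch: a fractional GPU request carried by a pod-like spec,
// mirroring the kind of fine-grained allocation the abstract describes.
// These names do not correspond to the real KubeShare API.
type GPURequest struct {
	ComputeFraction float64 // fraction of one GPU's compute (e.g. 0.5)
	MemoryBytes     int64   // GPU memory reserved for the container
	DeviceID        string  // bound physical GPU, filled in by a scheduler
}

type SharedPodSpec struct {
	Name string
	GPU  GPURequest
}

func main() {
	// Two pods sharing one physical GPU, each with half the compute
	// and a fixed slice of device memory.
	pods := []SharedPodSpec{
		{Name: "train-a", GPU: GPURequest{ComputeFraction: 0.5, MemoryBytes: 4 << 30}},
		{Name: "train-b", GPU: GPURequest{ComputeFraction: 0.5, MemoryBytes: 4 << 30}},
	}
	for _, p := range pods {
		fmt.Printf("%s requests %.0f%% of a GPU and %d GiB of GPU memory\n",
			p.Name, p.GPU.ComputeFraction*100, p.GPU.MemoryBytes>>30)
	}
}
```

The point of such a spec is that the scheduler, rather than the application, decides which physical GPU backs each fractional request, which is what allows the GPU to be treated as a first-class, shareable resource instead of an exclusively assigned device.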