Containers have emerged as a new technology that replaces traditional virtual machines for deploying and operating distributed applications in the cloud. As cloud computing continues to grow, GPU-dependent workloads such as deep learning and high-performance computing applications have also moved to the cloud, and efficiently allocating GPU resources has become an important issue. GPU virtualization for virtual machines is mature, but GPU virtualization for containers has received far less attention. Sharing a GPU among multiple containers on today's cloud platforms still faces many unresolved problems, and exclusive GPU assignment wastes GPU resources and lowers utilization for applications that cannot fully use a GPU on their own. To overcome these issues, we designed and implemented KubeShare, which extends Kubernetes to support GPU sharing among containers with fine-grained resource allocation; KubeShare is also the first solution that makes GPUs first-class resources in Kubernetes. Using real deep learning workloads, we show that KubeShare significantly improves GPU utilization, roughly doubles overall system throughput, and incurs less than 10% performance overhead during container initialization and execution.
Containers have emerged as a new technology in clouds to replace virtual machines (VMs) for distributed application deployment and operation. As an increasing number of new cloud applications, such as deep learning and high-performance computing workloads, rely on the high computing throughput of GPUs, efficiently supporting GPUs in container clouds becomes essential. While GPU virtualization has been extensively studied for VMs, limited work has been done for containers. One of the key challenges is the lack of support for GPU sharing among multiple concurrent containers. This limitation leads to low resource utilization when a GPU device cannot be fully utilized by a single application due to the burstiness of GPU workloads and limited memory bandwidth. To overcome this issue, we designed and implemented KubeShare, which extends Kubernetes to enable GPU sharing with fine-grained allocation. KubeShare is the first solution for Kubernetes that makes GPU devices first-class resources for scheduling and allocation. Using real deep learning workloads, we demonstrate that KubeShare significantly increases GPU utilization and improves overall system throughput by around 2x, with less than 10% performance overhead during container initialization and execution.
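To make the idea of fine-grained GPU allocation concrete, the sketch below shows, in Go, how a fractional GPU request might be attached to a pod-like specification so that two containers can share one physical device. This is only an illustration of the concept described in the abstract; the type and field names (GPURequest, SharedPodSpec, ComputeFraction) are hypothetical and are not the actual KubeShare or Kubernetes API.

```go
package main

import "fmt"

// Hypothetical sketch: a fractional GPU request carried by a pod-like spec,
// mirroring the kind of fine-grained allocation the abstract describes.
// These names do not correspond to the real KubeShare API.
type GPURequest struct {
	ComputeFraction float64 // fraction of one GPU's compute (e.g. 0.5)
	MemoryBytes     int64   // GPU memory reserved for the container
	DeviceID        string  // bound physical GPU, filled in by a scheduler
}

type SharedPodSpec struct {
	Name string
	GPU  GPURequest
}

func main() {
	// Two pods sharing one physical GPU, each with half the compute
	// and a fixed slice of device memory.
	pods := []SharedPodSpec{
		{Name: "train-a", GPU: GPURequest{ComputeFraction: 0.5, MemoryBytes: 4 << 30}},
		{Name: "train-b", GPU: GPURequest{ComputeFraction: 0.5, MemoryBytes: 4 << 30}},
	}
	for _, p := range pods {
		fmt.Printf("%s requests %.0f%% of a GPU and %d GiB of GPU memory\n",
			p.Name, p.GPU.ComputeFraction*100, p.GPU.MemoryBytes>>30)
	}
}
```

The point of such a spec is that the scheduler, rather than the application, decides which physical GPU backs each fractional request, which is what allows the GPU to be treated as a first-class, shareable resource instead of an exclusively assigned device.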