
Integrating Airflow and Kubeflow to Design an MLOps Architecture with GPU Dynamic Scheduling

Advisor: 陳弘明
Co-advisor: 盧永豐 (Yung-Feng Lu)

Abstract


MLOps enables continuous integration (CI) and continuous deployment (CD) throughout the development, deployment, and maintenance of deep learning (DL) models, and it establishes a standardized machine learning (ML) pipeline that reduces communication costs between teams. Beyond the development and deployment process, however, the training time of artificial intelligence (AI) models also strongly affects how long it takes to develop an AI product. This is especially true for DL models: their artificial neural network architectures achieve high accuracy at the cost of a large increase in computation, so these computationally intensive models must be trained on GPU resources to finish in reasonable time. Unfortunately, common open-source MLOps frameworks focus on building a general, efficient development and deployment pipeline and pay little attention to managing computing resources or allocating them according to the computational needs of DL tasks. With the rise of parallel computing, which can assign multiple computing resources to a DL task to accelerate its execution, efficiently allocating computing resources to DL tasks has become crucial. To address these problems, this research designs an MLOps architecture with GPU dynamic scheduling that combines Airflow and Kubeflow. Kubeflow, which is built on Kubernetes, serves as the platform for running containerized AI tasks and provides the infrastructure for model development, training, parallel computing, and deployment. To maximize use of the cluster's GPU computing resources, a GPU monitoring and scheduling mechanism is designed that tracks GPU resources and dynamically allocates them according to the current resource state, improving the cluster's GPU utilization while enabling parallel computing.
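The scheduling mechanism summarized above, monitor the cluster's GPU state and then size and launch training jobs against what is currently free, can be sketched in a few lines of Python. The following is a minimal illustration of that idea, not the thesis's implementation: it assumes the NVIDIA device plugin exposes the nvidia.com/gpu resource on each node, and the container image, the "kubeflow" namespace, and the 4-GPU cap are hypothetical placeholders.

from kubernetes import client, config


def free_gpus_per_node() -> dict:
    """Return {node_name: free GPU count}: allocatable minus GPUs requested by running pods."""
    v1 = client.CoreV1Api()
    free = {
        node.metadata.name: int(node.status.allocatable.get("nvidia.com/gpu", "0"))
        for node in v1.list_node().items
    }
    # Subtract GPUs already claimed by running pods on each node.
    for pod in v1.list_pod_for_all_namespaces(field_selector="status.phase=Running").items:
        if pod.spec.node_name not in free:
            continue
        for c in pod.spec.containers:
            limits = (c.resources and c.resources.limits) or {}
            free[pod.spec.node_name] -= int(limits.get("nvidia.com/gpu", "0"))
    return free


def launch_training_job(gpus, namespace="kubeflow"):
    """Submit a Kubernetes Job requesting `gpus` GPUs (image name is a placeholder)."""
    container = client.V1Container(
        name="trainer",
        image="registry.example.com/dl-train:latest",  # hypothetical training image
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": str(gpus)}),
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(generate_name="gpu-train-"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(containers=[container], restart_policy="Never")
            )
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)


if __name__ == "__main__":
    config.load_kube_config()  # use config.load_incluster_config() when run inside the cluster
    available = sum(n for n in free_gpus_per_node().values() if n > 0)
    if available > 0:
        # Example policy: size the job to current free capacity, capped at 4 GPUs.
        launch_training_job(min(available, 4))

In the proposed architecture, logic of this kind would plausibly run as an Airflow task that gates each containerized Kubeflow training step on current GPU availability, queueing or retrying when the cluster is saturated.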

Keywords

MLOps, Kubernetes, Deep Learning, GPU, Airflow, Kubeflow

