
Integrating Airflow and Kubeflow to Design an MLOps Architecture with GPU Dynamic Scheduling

Advisor: 陳弘明
Co-advisor: 盧永豐 (Yung-Feng Lu)

Abstract


MLOps enables continuous integration (CI) and continuous deployment (CD) throughout the development, deployment, and maintenance of deep learning (DL) models, and it establishes a standardized machine learning (ML) pipeline that reduces communication costs between teams. Beyond the development and deployment process, however, the training time of artificial intelligence (AI) models also strongly affects how long it takes to develop an AI product. This is especially true for DL models: their artificial neural network architectures achieve high accuracy at the cost of a large increase in computation, so these computationally intensive models must be trained on GPU resources to finish in reasonable time. Unfortunately, common open-source MLOps frameworks focus on building a general, efficient development and deployment pipeline and pay little attention to managing computing resources or allocating them according to the computational needs of DL tasks. With the rise of parallel computing, which can assign multiple computing resources to a DL task to accelerate its execution, efficiently allocating computing resources to DL tasks has become crucial. To address these problems, this research designs an MLOps architecture with GPU dynamic scheduling that combines Airflow and Kubeflow. Kubeflow, which is built on Kubernetes, serves as the platform for running containerized AI tasks and provides the infrastructure for model development, training, parallel computing, and deployment. To maximize use of the cluster's GPU computing resources, a GPU monitoring and scheduling mechanism is designed that tracks GPU resources and dynamically allocates them according to the current resource state, improving the cluster's GPU utilization while enabling parallel computing.
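The scheduling mechanism summarized above, monitor the cluster's GPU state and then size and launch training jobs against what is currently free, can be sketched in a few lines of Python. The following is a minimal illustration of that idea, not the thesis's implementation: it assumes the NVIDIA device plugin exposes the nvidia.com/gpu resource on each node, and the container image, the "kubeflow" namespace, and the 4-GPU cap are hypothetical placeholders.

from kubernetes import client, config


def free_gpus_per_node() -> dict:
    """Return {node_name: free GPU count}: allocatable minus GPUs requested by running pods."""
    v1 = client.CoreV1Api()
    free = {
        node.metadata.name: int(node.status.allocatable.get("nvidia.com/gpu", "0"))
        for node in v1.list_node().items
    }
    # Subtract GPUs already claimed by running pods on each node.
    for pod in v1.list_pod_for_all_namespaces(field_selector="status.phase=Running").items:
        if pod.spec.node_name not in free:
            continue
        for c in pod.spec.containers:
            limits = (c.resources and c.resources.limits) or {}
            free[pod.spec.node_name] -= int(limits.get("nvidia.com/gpu", "0"))
    return free


def launch_training_job(gpus, namespace="kubeflow"):
    """Submit a Kubernetes Job requesting `gpus` GPUs (image name is a placeholder)."""
    container = client.V1Container(
        name="trainer",
        image="registry.example.com/dl-train:latest",  # hypothetical training image
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": str(gpus)}),
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(generate_name="gpu-train-"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(containers=[container], restart_policy="Never")
            )
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)


if __name__ == "__main__":
    config.load_kube_config()  # use config.load_incluster_config() when run inside the cluster
    available = sum(n for n in free_gpus_per_node().values() if n > 0)
    if available > 0:
        # Example policy: size the job to current free capacity, capped at 4 GPUs.
        launch_training_job(min(available, 4))

In the proposed architecture, logic of this kind would plausibly run as an Airflow task that gates each containerized Kubeflow training step on current GPU availability, queueing or retrying when the cluster is saturated.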

Keywords

MLOps, Kubernetes, Deep Learning, GPU, Airflow, Kubeflow

