
Co-designing Artificial Intelligence and High-performance Computing Systems

Advisor: 洪士灝
The full text will be available for download on 2024/08/13.

Abstract


Artificial intelligence (AI) and deep learning have been applied successfully in many domains, from image recognition to smart agriculture to financial technology, and AI is likely to become a major development trend in these fields. In practice, however, the adoption of AI applications can be hampered by performance problems such as high computational complexity and massive data transfers. High-performance computing (HPC) technologies can mitigate such problems through state-of-the-art hardware and software techniques, and the performance of a deep learning application can be optimized through hardware-software co-design of the HPC system. In fact, a complete AI application may contain tasks beyond the deep learning model itself, so the software stack associated with AI programs running on HPC systems usually consists of a group of libraries that enables various tasks to work in tandem across many layers of system components. To fully realize the benefits of AI-HPC co-design, one must understand not only the functionality of the AI application and the characteristics of the HPC system, but also this complex software stack. We therefore inevitably face two main problems in AI-HPC co-design: Big Performance Data and Big Configuration Space.

In this dissertation, we propose two approaches and build tools to address these two challenges: SOFA (Swarm-oriented Function Call Analysis), which facilitates the collection and analysis of big performance data, and APOML (Automatic Performance Optimization for Machine Learning), which helps users tune system configurations and software tunables in the big configuration space. Briefly: (1) SOFA is a profiling framework for the deep software stack of AI-HPC systems. By integrating several existing performance tools to profile deep learning systems, SOFA provides a comprehensive view of the target system. In particular, SOFA can efficiently uncover hidden bottlenecks by inspecting easy-to-observe "function swarms": groups of function-call traces, resource accesses, and system calls formed by caller/callee relationships, membership in the same software module, or process/thread synchronization. SOFA can also explore the relationships among function swarms, or between a function swarm and the usage of a specific system resource. (2) APOML is an automatic performance-optimization platform that leverages SOFA's performance reports to automatically explore the correlation between application performance and HPC hardware/software configurations.

Combining SOFA and APOML, we carried out advanced AI-HPC co-design studies. In our experiments, APOML automatically suggested appropriate hardware interconnect architectures and software stacks, yielding speedups ranging from 1.2x to 2.8x. This dissertation makes the following contributions: (1) a new performance-profiling approach for deep software stacks that characterizes program behavior using temporal patterns (iterative execution) and spatial patterns (function-call addresses); (2) an AI-HPC co-design automatic optimization platform for performance analysis and performance projection, which then aggregates the required software/hardware resources to achieve performance optimization; and (3) applications of SOFA/APOML to various real-world AI optimization cases, such as fully exploiting modern interconnect architectures and exploring the potential speedups of distributed training tasks. These works are shown to advance the software stack and thereby foster AI-HPC co-design and innovation.
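The "function swarm" idea, grouping function calls that are related through caller/callee links, shared modules, or synchronization, can be illustrated with a minimal sketch. SOFA's actual clustering algorithm is not described in the abstract; the union-find grouping and the trace events below are invented for illustration only.

```python
# Toy sketch (not SOFA's real algorithm): group trace records into "swarms"
# by connecting functions that appear in a caller/callee relationship.
from collections import defaultdict


class UnionFind:
    """Minimal disjoint-set structure over arbitrary hashable names."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb


def group_function_swarms(call_events):
    """call_events: list of (caller, callee) pairs from a profiler trace.

    Returns a list of sets; each set is one connected "swarm" of functions."""
    uf = UnionFind()
    for caller, callee in call_events:
        uf.union(caller, callee)
    swarms = defaultdict(set)
    for caller, callee in call_events:
        swarms[uf.find(caller)].update((caller, callee))
    return list(swarms.values())


# Hypothetical trace: two independent call chains form two separate swarms.
events = [("main", "load_data"), ("load_data", "decode"),
          ("train_loop", "matmul"), ("matmul", "cuda_kernel")]
print(group_function_swarms(events))
```

In a real profile, the grouping would also incorporate module boundaries and synchronization edges; the same union-find skeleton extends to those relations by adding more `union` calls.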

Parallel Abstract


Recently, the power of artificial intelligence (AI), especially deep learning, has been demonstrated in many application domains ranging from image recognition to agriculture to financial applications. In practice, the application of AI can be hampered by issues such as huge computational complexity and huge data transfers, which cause performance problems. High-performance computing (HPC) technologies can be used to mitigate such problems via state-of-the-art hardware and software techniques. The performance of a deep learning application can be optimized via a collaborative design of the software and hardware for an HPC system. In fact, a complete AI application may consist of tasks other than deep learning models. Thus, the associated software stack for AI programs running on an HPC system usually consists of a group of libraries that enables various tasks to work in tandem across many levels of system components. To facilitate the concept of AI-HPC co-design, one needs not only to understand the features of the AI application and the characteristics of the HPC system but also to work with the aforementioned complex software stack. Therefore, we inevitably face two main problems for AI-HPC co-design: Big Performance Data and Big Configuration Space.

In this dissertation, we propose two approaches and build tools to address these two challenges. SOFA (Swarm-oriented Function Call Analysis) facilitates the collection and analysis of big performance data, and APOML (Automatic Performance Optimization for Machine Learning) assists the user in tuning the system configurations and software tunables in the big configuration space. SOFA is a profiling framework for the deep software stack of AI-HPC systems. By integrating several existing performance tools to profile deep learning systems, SOFA is able to provide a comprehensive view of the target system. More importantly, SOFA can efficiently uncover hidden bottlenecks by inspecting easy-to-observe function swarms: groups of function-call traces, resource accesses, and system calls formed by caller/callee relationships, shared software modules, or process/thread synchronization. SOFA also enables exploring the relationships among function swarms, or between a function swarm and the usage of a specific system resource. Second, APOML is an automatic performance-optimization platform that automates exploring the correlation between application performance and HPC hardware/software configurations by leveraging performance reports from SOFA.

In our experiments, combining SOFA and APOML to suggest the appropriate hardware interconnection and software stack led to speedups ranging from 1.2x to 2.8x. This dissertation makes the following contributions: (1) a new approach to performance profiling of deep software stacks that characterizes a program's behavior based on temporal patterns (iterative execution) and spatial patterns (function-call addresses); (2) an AI-HPC co-design automatic optimization platform for performance analysis and performance projection, followed by aggregating the required software/hardware resources to achieve performance improvements; and (3) experimental evaluations that show the advantages of using SOFA/APOML in various situations, such as spotting bottlenecks, analyzing hardware performance, and exploring the potential speedup of distributed training tasks using modern interconnections. In the end, these works are shown to advance the software stack and thereby boost AI-HPC co-design and innovation.
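The abstract describes APOML as exploring the correlation between application performance and hardware/software configurations. As a minimal sketch of what searching a configuration space means, assuming a simple exhaustive search over a small grid (APOML's actual strategy is not detailed here, and the tunables `batch_size` and `interconnect` plus the cost model are hypothetical stand-ins for real benchmark measurements):

```python
# Toy sketch (not APOML's real algorithm): measure every candidate
# configuration and keep the one with the lowest runtime.
import itertools


def explore_config_space(space, measure):
    """space: dict mapping tunable name -> list of candidate values.
    measure: function taking a config dict and returning a runtime in seconds."""
    keys = list(space)
    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        t = measure(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time


# Hypothetical tunables; a synthetic cost model stands in for measured runtimes.
space = {"batch_size": [32, 64, 128], "interconnect": ["pcie", "nvlink"]}


def fake_runtime(cfg):
    base = 10.0 / cfg["batch_size"] ** 0.5           # larger batches amortize cost
    return base * (0.6 if cfg["interconnect"] == "nvlink" else 1.0)


best, t = explore_config_space(space, fake_runtime)
print(best, round(t, 3))
```

A real configuration space is far too large to enumerate exhaustively, which is why the dissertation frames this as the Big Configuration Space problem; the sketch only illustrates the measure-and-compare loop that any such search builds on.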

