
Auto-Tuning Framework for CUDA Unified Memory

Advisor: 郭斯彥

Abstract


A Graphics Processing Unit (GPU) is a massively parallel computing device that processes large volumes of data far more efficiently than a Central Processing Unit (CPU). However, the heterogeneous memory architecture between the GPU and the CPU incurs data-movement costs that degrade GPU performance. In CUDA 6.0, NVIDIA introduced Unified Memory, under which CPU and GPU memory are treated as a single memory space. This technique relieves programmers of the burden of manually managing tedious data transfers. Unified Memory can nonetheless perform poorly, and although NVIDIA provides advanced APIs that let programmers pass memory advice to the Unified Memory driver, choosing the right advice is difficult. In this thesis, we propose a lightweight auto-tuning framework that finds the most suitable Unified Memory advice for each application. Experimental results show that our approach improves overall GPU performance by up to 12x.
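The programming model described above can be sketched with standard CUDA runtime calls; this is a minimal illustration of Unified Memory (not code from the thesis), in which a single `cudaMallocManaged` allocation is visible to both CPU and GPU with no explicit `cudaMemcpy`:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *data;
    // Unified Memory: one allocation shared by host and device;
    // the UM driver migrates pages on demand.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // CPU writes directly

    scale<<<(n + 255) / 256, 256>>>(data, n);    // GPU uses the same pointer
    cudaDeviceSynchronize();                     // wait before the CPU reads again

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

Without Unified Memory, the same program would need separate host and device buffers plus explicit `cudaMemcpy` calls in both directions, which is exactly the manual management the technique removes.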

Abstract (English)


A Graphics Processing Unit (GPU) is a highly parallel computing device that processes large blocks of data more efficiently than a Central Processing Unit (CPU). However, the memory heterogeneity between GPU and CPU can lead to significant data-movement overhead. In CUDA (Compute Unified Device Architecture) 6.0, NVIDIA introduced the Unified Memory (UM) method, under which CPU and GPU memory are viewed as a single memory space. This technique relieves programmers of the burden of manually managing tedious data movement between CPU and GPU. However, UM can cause severe performance degradation. Although NVIDIA provides advanced APIs that allow programmers to pass memory hints to the UM driver, choosing the right UM advice remains difficult. In this thesis, we propose a lightweight auto-tuning framework that finds the optimal UM advice for each application. Results show that our approach achieves up to a 12x speedup in overall GPU performance on the nw application.
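The memory hints mentioned above correspond to the CUDA `cudaMemAdvise` and `cudaMemPrefetchAsync` APIs. The sketch below shows how such advice might be applied to a managed buffer; which combination of advice is best for a given access pattern is precisely what an auto-tuning framework like the one proposed here searches over, and the helper function name is illustrative only:

```cuda
#include <cuda_runtime.h>

// Sketch: applying UM advice to a managed allocation (buf must come
// from cudaMallocManaged). The right advice depends on the
// application's access pattern.
void apply_hints(float *buf, size_t bytes, int device) {
    // Hint that the buffer is mostly read, allowing the driver to keep
    // read-duplicated copies instead of migrating pages back and forth.
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetReadMostly, device);
    // Set the preferred physical location of the pages to this GPU.
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, device);
    // Proactively migrate the pages before kernel launch, avoiding
    // on-demand page faults during execution.
    cudaMemPrefetchAsync(buf, bytes, device, 0);
}
```

Other standard advice values include `cudaMemAdviseSetAccessedBy`, plus the corresponding `Unset` variants, so even a single buffer admits a sizable search space of hint combinations.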

