
Compiler-assisted Resource Management for CUDA Programs

Advisor: 游逸平

Abstract


In recent years, to meet the demands of high performance and low power consumption, system design has moved toward multi-core architectures. Heterogeneous multi-core systems are the current trend, with CPU+GPU architectures drawing the most attention. However, as the number of cores grows, systems have become highly complex and programmers face greater challenges. CUDA (Compute Unified Device Architecture) emerged in response to the increasingly popular GPGPU computing environment: it defines a programming interface that simplifies how developers partition work and communicate between the CPU and GPU, and it helps exploit thread-level parallelism to achieve high computational efficiency. In the CUDA architecture, resource management, which covers global memory, shared memory, and registers, is a key factor in program performance. Many studies have proposed memory-management techniques to improve performance, but none has managed registers effectively to the same end. On a GPU, per-thread register usage strongly affects the number of concurrently executing threads because the register file is of limited size; if each thread's register usage can be reduced to a certain degree, more threads can execute concurrently, global memory access latency can be hidden more effectively, and better program performance follows. In this thesis we propose a register-management framework comprising a model that determines how far register usage must be reduced to increase the number of concurrent threads, together with two register-optimization schemes: one rematerializes register values with recomputation instructions, and the other stores register values in shared memory. Applying these two schemes lowers register usage to the target level and thereby improves program performance. Comparing CUDA programs compiled with and without the framework, the results show a geometric-mean performance improvement of 14.8% for kernel code and 5.5% for whole programs.

English Abstract


CUDA allows programmers to write code for both CPUs and GPUs. In general, GPUs require high thread-level parallelism (TLP) to reach their maximal performance, and the TLP of a CUDA program is deeply affected by the resource allocation of GPUs. Several studies have focused on the management of memory allocation for performance enhancement, but none has proposed an effective approach to speed up programs in which TLP is limited by insufficient registers. In this thesis, we propose a TLP-aware register-pressure reduction framework that reduces the register usage of a CUDA kernel to a desired degree so as to allow more threads to be active and thereby hide long-latency global memory accesses. The framework includes a cost model that determines the desired degree of register usage and two register-pressure reduction schemes, rematerialization and spilling. The experimental results show that the framework improves kernel performance by a geometric mean of 14.8%, and overall CUDA program performance by a geometric mean of 5.5%.
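The two register-pressure reduction schemes named in the abstract, rematerialization and spilling to shared memory, can be sketched at the source level as follows. This is a hypothetical kernel written for illustration only; the thesis applies these transformations inside the compiler, not in user source code:

```cuda
// Hypothetical kernel illustrating the two schemes at source level.
__global__ void saxpy_like(const float *x, const float *y, float *out,
                           float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // (1) Rematerialization: instead of keeping a * b live in a register
    // across a long region, recompute it from its still-live operands at
    // each use, freeing that register in between.
    float t = x[i] * (a * b);       // first use
    // ... long computation during which no register holds a * b ...
    out[i] = t + y[i] * (a * b);    // recomputed rather than kept live

    // (2) Spilling to shared memory: a value that is expensive to recompute
    // is stored in fast on-chip shared memory instead of a register.
    __shared__ float spill[256];    // one slot per thread in a 256-thread block
    spill[threadIdx.x] = t;         // spill: register -> shared memory
    // ... region in which the register holding t can be reused ...
    out[i] += spill[threadIdx.x];   // reload: shared memory -> register
}
```

Rematerialization trades a few extra arithmetic instructions for a lower register count, while shared-memory spilling trades shared-memory capacity for registers; the framework's cost model decides how much of each to apply.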

