Managing Registers on NVIDIA GPUs to Increase Thread-Level Parallelism

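Graphics processing units (GPUs) contain a large number of arithmetic processors that execute in a single-instruction, multiple-data (SIMD) fashion, so a GPU can perform trillions of floating-point operations per second, tens or even hundreds of times the throughput of a central processing unit. General-purpose GPUs (GPGPUs) rely on massive numbers of threads to hide off-chip memory latency, which costs 400 to 800 cycles. However, the number of threads that can run in parallel is limited above all by the number of registers each thread uses. In this thesis, we therefore propose a framework that reduces register pressure in order to optimize thread-level parallelism: it lowers the number of registers used per thread so that more threads can run concurrently. The framework includes two methods for reducing register usage: (1) register rematerialization and (2) spilling registers to on-chip memory. Experimental results show that the framework is effective, reducing execution time by 5.7% on average and by up to 27%.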
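The dependence of thread count on register usage can be illustrated with a little arithmetic. The following is a minimal sketch, not the thesis's model: the function name `resident_threads` is hypothetical, the default limits (32768 registers and 1536 threads per multiprocessor) are assumed values typical of one older NVIDIA generation, and hardware allocation granularity and shared-memory limits are ignored.

```python
def resident_threads(regs_per_thread, threads_per_block,
                     regs_per_sm=32768, max_threads_per_sm=1536):
    """Threads that fit on one multiprocessor, counted in whole blocks.

    regs_per_sm and max_threads_per_sm are assumed hardware limits,
    not figures taken from the thesis.
    """
    regs_per_block = regs_per_thread * threads_per_block
    # A block is only scheduled if all of its registers fit at once.
    blocks_by_regs = regs_per_sm // regs_per_block
    blocks_by_threads = max_threads_per_sm // threads_per_block
    return min(blocks_by_regs, blocks_by_threads) * threads_per_block


if __name__ == "__main__":
    for regs in (32, 24, 20):
        print(regs, "registers/thread ->",
              resident_threads(regs, 256), "resident threads")
```

Under these assumed limits, cutting a kernel from 32 to 20 registers per thread raises the number of resident threads from 1024 to 1536, which is the kind of gain a register-pressure reduction framework aims for.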

Keywords

Compiler optimization; register allocation; thread-level parallelism; graphics processing unit; OpenCL; CUDA

Parallel Abstract

Graphics processing units (GPUs) are equipped with a large number of arithmetic processors running in a single-instruction, multiple-data fashion, producing a throughput of tera floating-point operations per second, which is tens or even hundreds of times higher than the throughput of central processing units. GPUs rely on massive hardware multithreading to hide off-chip memory latencies, which are approximately 400–800 cycles. However, the number of parallel threads running on a GPU is tightly restricted by the resource requirements of each thread, especially its register requirement. In this thesis, we propose a thread-level parallelism-aware register-pressure reduction framework that reduces the register usage of threads on GPGPUs, thereby increasing thread-level parallelism. The framework includes two register-pressure reduction methods: (1) register rematerialization and (2) spilling registers to on-chip memory. The experimental results demonstrate that the proposed framework is effective, improving the performance of OpenCL kernel programs by up to 27% and by 5.7% on average.
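To make the first method concrete, here is a toy illustration of how rematerialization can lower register pressure. It is a sketch under simplifying assumptions (straight-line code, one register per value, a hypothetical `max_pressure` helper) and is not the allocation algorithm used in the thesis. Each instruction is written as a pair of a destination and its source operands.

```python
def max_pressure(code, live_out=()):
    """Peak number of simultaneously live values in straight-line code.

    code is a list of (dest, srcs) pairs; live_out names the values
    still needed after the last instruction.
    """
    live = set(live_out)
    peak = len(live)
    # Walk backwards: a value dies at its definition and its operands
    # become live just before the instruction that reads them.
    for dest, srcs in reversed(code):
        live.discard(dest)
        live.update(srcs)
        peak = max(peak, len(live))
    return peak


# The constant c is defined early and read again at the very end,
# so it occupies a register across the whole sequence.
original = [
    ("c", ()), ("x", ()), ("y", ("x", "c")),
    ("a", ()), ("b", ()), ("t", ("a", "b")),
    ("u", ("t", "y")), ("r", ("u", "c")),
]

# Rematerialized: recompute the constant as c2 right before its last
# use, so c no longer has to stay live in between.
rematerialized = [
    ("c", ()), ("x", ()), ("y", ("x", "c")),
    ("a", ()), ("b", ()), ("t", ("a", "b")),
    ("u", ("t", "y")), ("c2", ()), ("r", ("u", "c2")),
]

if __name__ == "__main__":
    print("original:", max_pressure(original, live_out=("r",)))
    print("rematerialized:", max_pressure(rematerialized, live_out=("r",)))
```

In this toy sequence the peak drops from four live values to three at the cost of one extra instruction. Spilling shortens a live range in a similar way, but with a store after the definition and a load before the later use instead of a recomputation.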

Parallel Keywords

Compiler optimization; register allocation; thread-level parallelism; GPU; OpenCL; CUDA

