
TLB-Aware Block Scheduler: Improving GPU Address Translation Performance for HSA Platform

Advisor: 楊佳玲

Abstract


Many processor vendors have begun to embrace the Heterogeneous System Architecture (HSA). For integrated CPU-GPU architectures, a key component is a shared address space between the two processors, which allows system memory to be used efficiently and provides the programmability benefits of virtual addressing. However, although GPUs can tolerate latency, several studies have shown that the performance impact of virtual-to-physical address translation cannot be ignored. In this thesis, we characterize the performance of a design with per-core private L1 Translation Lookaside Buffers (TLBs) and a shared L2 TLB. We find that the current block scheduler tends to spread blocks that use the same TLB entries across different cores, resulting in a high L1 TLB miss rate. Experimental results show that some workloads do not gain performance even with a large TLB. Therefore, to improve GPU address translation performance, we design a TLB-aware block scheduler. With our proposed hardware-software co-design, the block scheduler knows in advance which TLB entries a block will use, so it can assign suitable blocks to each core and thereby increase TLB reuse. The results show that the TLB-aware block scheduler reduces the global TLB miss rate by 21% on average and improves performance by 10% on average, with a maximum of 22%.
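
To make the "software support" idea above concrete, the following is a minimal sketch, not the thesis's actual interface: assuming a 1D kernel whose block reads a contiguous slice of one array and 4 KB pages, the host can derive, before dispatch, the set of virtual pages (and hence TLB entries) that block will touch. The function name and parameters are hypothetical.

#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper: block `blockId` reads elements
// [blockId*blockSize, (blockId+1)*blockSize) of a contiguous array at `base`.
// Returns the virtual page numbers it will touch (one TLB entry per page).
std::set<uint64_t> blockPageFootprint(uint64_t base, int blockId, int blockSize,
                                      size_t elemSize, size_t pageSize = 4096) {
    std::set<uint64_t> pages;
    uint64_t first = base + (uint64_t)blockId * blockSize * elemSize;
    uint64_t last  = first + (uint64_t)blockSize * elemSize - 1;
    for (uint64_t page = first / pageSize; page <= last / pageSize; ++page)
        pages.insert(page);  // every 4 KB page the block's accesses cross
    return pages;
}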

Abstract (English)


Processor vendors have already embraced heterogeneous systems. For integrated CPU-GPU architectures, a key component is a shared, unified address space, which makes it possible to use system memory efficiently and to obtain the programmability benefits of virtual memory. However, even though GPUs are latency-tolerant, several studies have shown that the performance of virtual-to-physical address translation is still critical. In this thesis, we characterize the performance of a TLB design with per-core L1 Translation Lookaside Buffers (TLBs) and a shared L2 TLB. We find that the current block scheduler tends to allocate blocks that use the same TLB entry to different SMs, causing a high L1 TLB miss rate. Experimental results show that some workloads cannot be improved even with a large TLB. Therefore, we design a TLB-aware block scheduler to improve GPU address translation performance. With our proposed software and hardware support, the block scheduler knows in advance which TLB entries a block will use, so it can assign appropriate blocks to an SM and exploit TLB reuse opportunities. The results show that the TLB-aware block scheduler reduces the global TLB miss rate by 21% on average, and improves performance by 10% on average and by up to 22%.
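
The assignment heuristic the abstract describes can be illustrated with the host-side sketch below. It is only an illustration under assumed data structures (a per-SM set of recently used pages and a per-block page set such as the one computed above); the real scheduler in the thesis is a hardware unit, and the greedy policy shown here is one possible instance of "assigning proper blocks to an SM", not necessarily the exact algorithm used.

#include <cstdint>
#include <set>
#include <vector>

struct SM {
    std::set<uint64_t> cachedPages;  // pages assumed resident in this SM's L1 TLB
    int load = 0;                    // blocks assigned so far (tie-breaking only)
};

// Greedy, TLB-aware assignment: send the block to the SM that already holds
// the most of the block's pages; break ties by picking the least-loaded SM.
int pickSM(const std::vector<SM>& sms, const std::set<uint64_t>& blockPages) {
    int best = 0;
    int bestOverlap = -1;
    for (int i = 0; i < (int)sms.size(); ++i) {
        int overlap = 0;
        for (uint64_t p : blockPages)
            overlap += (int)sms[i].cachedPages.count(p);
        if (overlap > bestOverlap ||
            (overlap == bestOverlap && sms[i].load < sms[best].load)) {
            best = i;
            bestOverlap = overlap;
        }
    }
    return best;
}

After a block is dispatched, its page set would be merged into the chosen SM's cachedPages (with older entries evicted to model TLB capacity). A baseline round-robin scheduler ignores blockPages entirely, which is how blocks sharing TLB entries end up scattered across SMs in the first place.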
