A CPU/GPU Execution Model That Uses the Cache Coherent Interconnect to Accelerate Data Transfer

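Modern heterogeneous system architectures provide a cache coherent interconnect that connects the CPU and the GPGPU, offering low-latency, energy-efficient on-chip data transfer. However, the conventional CPU-GPGPU execution model cannot use the cache coherent interconnect efficiently, because the CPU first finishes preparing all the data and only then launches the GPGPU. As a result, by the time the GPGPU starts executing, the data prepared by the CPU has already spilled to external memory. In this master's thesis, we propose a new coordinated CPU-GPGPU execution model in which the GPGPU processes data that is already prepared while the CPU continues preparing the rest. This reduces the amount of prepared data that spills to external memory and makes more effective use of the cache coherent interconnect. With the newly proposed APIs, we can control CPU data preparation and GPGPU execution at a finer granularity, so that prepared data moves from the CPU cache to the GPGPU cache over the cache coherent interconnect. Experimental results show that, compared with the conventional execution model, the proposed coordinated model reduces external memory accesses by 64%, improves GPGPU performance by 11%, and improves total execution time by 58%.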

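Keywords

Heterogeneous system architecture; cache coherent interconnect; GPGPU; block scheduler; high-performance computing; mobile system-on-chip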

Parallel Abstract

Modern heterogeneous system architectures (HSAs) provide a cache coherent interconnect between the CPU and the GPU for low-latency, energy-efficient on-chip data movement. However, the conventional CPU-GPU execution model uses this interconnect inefficiently: because the CPU finishes data preparation before the GPU kernel starts, much of the prepared data is evicted from the cache before the GPU can read it. In this thesis, we propose a coordinated CPU-GPU execution model in which the CPU prepares data at a finer granularity while the GPU executes the kernel concurrently, making better use of the coherent interconnect. New APIs control data preparation and kernel execution so that data moves from the CPU cache to the GPU cache in fine-grained pieces. Evaluations show that, on average, the proposed scheme reduces external memory accesses by 64%, improves GPU kernel time by 11%, and improves total execution time by 58% over the conventional model.
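The eviction argument above can be illustrated with a toy model. The sketch below is not the thesis's API or its simulator: `SharedCache`, `conventional`, and `coordinated` are hypothetical names, and the on-chip caches are collapsed into a single LRU buffer that the CPU writes and the GPU reads. It only shows why preparing everything first pushes data out to external memory, while chunk-by-chunk preparation keeps it on chip. Because the model is idealized, it does not reproduce the percentages reported in the abstract.

```python
from collections import OrderedDict


class SharedCache:
    """Toy LRU cache standing in for on-chip storage (an assumption of this sketch)."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()
        self.external_reads = 0  # GPU reads that had to go to external memory

    def write(self, addr):
        # CPU prepares one line; the least recently used line spills if the cache is full.
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)

    def read(self, addr):
        # GPU consumes one line; a miss counts as one external memory access.
        if addr in self.lines:
            self.lines.move_to_end(addr)
        else:
            self.external_reads += 1
            self.write(addr)  # fill the line on a miss


def conventional(total_lines, capacity_lines):
    """CPU prepares all data, then the GPU kernel reads all of it."""
    cache = SharedCache(capacity_lines)
    for addr in range(total_lines):
        cache.write(addr)
    for addr in range(total_lines):
        cache.read(addr)
    return cache.external_reads


def coordinated(total_lines, capacity_lines, chunk_lines):
    """CPU prepares one chunk, the GPU reads that chunk, then the next chunk follows."""
    cache = SharedCache(capacity_lines)
    for start in range(0, total_lines, chunk_lines):
        end = min(start + chunk_lines, total_lines)
        for addr in range(start, end):
            cache.write(addr)
        for addr in range(start, end):
            cache.read(addr)
    return cache.external_reads


if __name__ == "__main__":
    print("conventional:", conventional(1024, 256))
    print("coordinated: ", coordinated(1024, 256, 64))
```

In this model, a chunk that fits in the cache is always read back without an external access, and a chunk as large as the whole data set degenerates into the conventional case. A real system would overlap the two phases in time rather than alternate them, which this sequential sketch does not attempt to capture.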

Parallel Keywords

HSA; Cache Coherent Interconnect; GPGPU; Block (CTA) Scheduler; High Performance Computing; Mobile SoC

