
在多加速器架構下藉由快取遞送消除輸入輸出記憶體管理單元的定址轉換

Eliminate IOMMU Address Translation for Accelerator-rich Architecture via Cache Forwarding

Advisor: 楊佳玲

Abstract


Emerging accelerator-rich architectures integrate conventional processors with many different customized accelerators on the same die. As more and more customized accelerators come into use, a unified virtual address space between the CPU and the accelerators has been proposed to ease the programmer's burden. Prior studies introduced an I/O memory management unit (IOMMU) to provide accelerators with this unified virtual address space. However, the slow IOMMU cannot deliver efficient page walks, which diminishes the benefits of customized accelerators. Moreover, the highly associative I/O translation lookaside buffer (IOTLB) consumes a non-negligible amount of power during accelerator execution. Related work proposed offloading page walks to the CPU's memory management unit to speed up IOMMU address translation. However, the address translation still remains and hurts overall performance and power consumption. In this work, instead of having the accelerator fetch data through DMA with address translation, we propose having the CPU's L1 cache proactively forward data to the accelerator's scratchpad. Our evaluation shows that the proposed mechanism improves execution time by 14.8% and 8% over the baseline and the related work, respectively, and achieves a 22.1% power saving on average.

Parallel Abstract


Emerging accelerator-rich architectures combine conventional processors with multiple customized accelerators on the same die. Prior studies have introduced an IOMMU to enable a unified virtual address space for accelerators. However, the slow IOMMU is not capable of delivering efficient page walks and diminishes the gain of customized accelerators. Moreover, the highly associative IOTLB accounts for non-negligible power consumption. Related work presents an offload page walker that speeds up IOMMU address translation by utilizing the CPU's MMU page-walk cache. However, the IOMMU address translation still exists and harms both performance and power. In this work, instead of letting the DMA engine fetch data through IOMMU address translation, we make the CPU's L1 data cache forward the data directly to the accelerator's scratchpad, avoiding IOMMU address translation altogether. Evaluations show that our mechanism improves execution time by 14.8% and 8% over the baseline and the state-of-the-art offload page walker, respectively, and achieves a 22.1% power reduction on average.
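To make the contrast described in the abstract concrete, below is a minimal, self-contained C++ sketch. It is not the thesis implementation or its simulator: it only models, in toy form, a baseline in which every DMA access is translated through an IOTLB/IOMMU versus the proposed idea of forwarding already-translated data from the CPU's L1 cache into the accelerator's scratchpad. All structure names, cycle counts, and the identity page mapping are illustrative assumptions.

// Illustrative sketch (not the thesis implementation): contrasts a baseline
// DMA path that translates every accelerator access through an IOMMU/IOTLB
// with the idea of having the CPU-side L1 cache forward data directly into
// the accelerator's scratchpad. Latencies and structures are hypothetical.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

constexpr int kPageShift = 12;         // 4 KiB pages
constexpr int kIotlbLookupCycles = 2;  // assumed IOTLB hit latency
constexpr int kPageWalkCycles = 100;   // assumed page-walk latency on an IOTLB miss
constexpr int kForwardCycles = 4;      // assumed L1-to-scratchpad forwarding latency

// Baseline: every DMA request is translated by the IOMMU before the data
// can be placed into the accelerator's scratchpad.
uint64_t dma_through_iommu(const std::vector<uint64_t>& vaddrs) {
  std::unordered_map<uint64_t, uint64_t> iotlb;  // virtual page -> physical page
  uint64_t cycles = 0;
  for (uint64_t va : vaddrs) {
    uint64_t vpn = va >> kPageShift;
    cycles += kIotlbLookupCycles;
    if (iotlb.find(vpn) == iotlb.end()) {
      cycles += kPageWalkCycles;  // IOTLB miss: walk the page table
      iotlb[vpn] = vpn;           // toy identity mapping
    }
  }
  return cycles;
}

// Proposed direction: the CPU's L1 data cache already holds translated lines,
// so it can push them into the scratchpad and the accelerator never issues a
// virtual address that would need IOMMU translation.
uint64_t l1_cache_forwarding(const std::vector<uint64_t>& vaddrs) {
  uint64_t cycles = 0;
  for (size_t i = 0; i < vaddrs.size(); ++i) {
    cycles += kForwardCycles;  // forward one line from L1 to the scratchpad
  }
  return cycles;
}

int main() {
  // 1024 accesses touching 64 distinct pages.
  std::vector<uint64_t> trace;
  for (int i = 0; i < 1024; ++i) {
    trace.push_back(static_cast<uint64_t>(i % 64) << kPageShift);
  }
  std::cout << "baseline DMA+IOMMU cycles:  " << dma_through_iommu(trace) << "\n";
  std::cout << "L1 cache forwarding cycles: " << l1_cache_forwarding(trace) << "\n";
}

The point of the sketch is only to show where the translation cost disappears: in the forwarding path no IOTLB lookup or page walk is charged, because the data movement is initiated on the CPU side where the translation has already been done.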

