Compiler Optimizations for Memory Access in High-Level Synthesis from OpenCL to FPGA

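To keep pace with rapidly changing applications and shorter time-to-market, high-level synthesis (HLS) has recently become a popular and worthwhile research area. Engineers can design at a higher level of abstraction, such as the behavioral level, and leave implementation details such as timing control and signal transmission to the HLS tool. The main reason these tools were not widely adopted in the past is that hardware/software partitioning of a sequential input language is very difficult; parallel programming frameworks such as the Open Computing Language (OpenCL) offer a better solution to this problem. In our approach, OpenCL kernels are synthesized into customized hardware for acceleration. This paper proposes three optimizations for our HLS platform, Register Decision, Reordering, and Burst Transfer, which target the bottleneck on field-programmable gate arrays (FPGAs), namely memory access. These optimizations group OpenCL work-items and exploit data reuse between work-items to reduce the number of memory accesses; the degree of improvement depends on how much the memory accesses of the work-items overlap.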

Keywords

High-level synthesis; Open Computing Language; field-programmable gate array; hardware acceleration; optimization

Abstract (English)

OpenCL provides an approach to specifying parallel computation across hardware and software. Customized hardware can be synthesized from OpenCL kernels for acceleration. In this paper, we propose three optimization techniques, called Register Decision, Reordering, and Burst Transfer, for memory accesses on FPGA platforms. Register Decision determines a suitable register declaration for holding reused data and uses shift registers to simplify the replacement of register contents. Reordering gathers instructions that access the same memory location and finds the most efficient loop unrolling and instruction scheduling. Burst Transfer assembles all data movements from global to local memory and decides the burst count and the number of bursts for the transfer. Based on work-item grouping, these techniques leverage data sharing and reuse between work-items. Performance is gained by reducing the number of memory accesses and transfer requests. The optimization relies on overlap in the memory access patterns: the greater the overlap, the greater the improvement.
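The data-reuse idea behind Register Decision can be illustrated with a small software model. The sketch below is not the paper's implementation: it assumes a hypothetical 3-tap sliding-window kernel in which each work-item sums `taps` neighboring elements, and it only counts simulated global-memory loads. The function names and the load counter are illustrative assumptions.

```python
def stencil_naive(data, taps):
    """Each work-item loads its whole window from (simulated) global memory."""
    loads = 0
    out = []
    for i in range(len(data) - taps + 1):  # one iteration per work-item
        window = []
        for k in range(taps):
            window.append(data[i + k])  # every element is re-fetched
            loads += 1
        out.append(sum(window))
    return out, loads


def stencil_shift_register(data, taps):
    """Grouped work-items share a shift register; one new load per work-item."""
    loads = 0
    out = []
    regs = []
    # Preload the first taps-1 values once for the whole group.
    for k in range(taps - 1):
        regs.append(data[k])
        loads += 1
    for i in range(len(data) - taps + 1):
        regs.append(data[i + taps - 1])  # the only new load for this work-item
        loads += 1
        out.append(sum(regs))
        regs.pop(0)  # shift: discard the oldest value
    return out, loads


data = list(range(10))
naive_out, naive_loads = stencil_naive(data, 3)
reg_out, reg_loads = stencil_shift_register(data, 3)
print(naive_loads, reg_loads)
```

Under these assumptions, a group of G work-items with a window of w elements needs G·w loads without reuse and G + w − 1 loads with the shift register, so the saving grows with the overlap between neighboring work-items.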

Keywords (English)

HLS; OpenCL; FPGA; Hardware Acceleration; Optimization
