
Optimization Techniques for Reducing Data Communication Costs on GPU Systems

Advisor: 鍾葉青

Abstract


For applications that rely on GPUs for computation, when computation on the GPU is far faster than data transfer, the key to overall performance lies in reducing communication costs. This dissertation proposes two techniques and investigates their feasibility for reducing data transfer costs. The first is data compression combined with data streaming, which reduces the number of memory accesses and hides the decompression and transfer costs within GPU computation time. The second is artificial barrier synchronization, which reduces contention in the GPU memory system and thereby improves communication efficiency. Both techniques trade additional computation for lower communication costs, improving overall performance. The proposed techniques are applied to improve the performance of six computational problems: radix sort, box intersection, sensor deployment, vector addition, scalar product, and sequence alignment. Theoretical and experimental analyses confirm that the two techniques effectively improve application performance on GPUs.
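To make the first technique concrete, the following is a minimal CUDA sketch of chunked data streaming, not the dissertation's actual implementation: decompress_kernel and compute_kernel are hypothetical placeholders, and NSTREAMS and CHUNK are assumed sizes. It only illustrates the general overlap pattern, where copies, decompression, and computation issued to different streams can proceed concurrently.

```cuda
#include <cuda_runtime.h>

#define NSTREAMS 4
#define CHUNK    (1 << 20)   /* elements per chunk (assumed size) */

/* Placeholder "decompression": expands one input byte into one float. */
__global__ void decompress_kernel(const unsigned char *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)in[i];
}

/* Placeholder computation on the decompressed chunk. */
__global__ void compute_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int nchunks = 8;     /* compressed input would fill h_in in practice */
    unsigned char *h_in, *d_in;
    float *d_out;

    /* Pinned host memory is required for truly asynchronous copies. */
    cudaMallocHost((void **)&h_in, (size_t)nchunks * CHUNK);
    cudaMalloc((void **)&d_in,  (size_t)NSTREAMS * CHUNK);
    cudaMalloc((void **)&d_out, (size_t)NSTREAMS * CHUNK * sizeof(float));

    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    dim3 block(256);
    dim3 grid((CHUNK + block.x - 1) / block.x);
    for (int c = 0; c < nchunks; ++c) {
        int s = c % NSTREAMS;  /* in-stream ordering makes buffer reuse safe */
        /* Copy, decompress, and compute in the same stream; chunks issued to
           different streams overlap transfers with kernels of other chunks. */
        cudaMemcpyAsync(d_in + (size_t)s * CHUNK, h_in + (size_t)c * CHUNK,
                        CHUNK, cudaMemcpyHostToDevice, streams[s]);
        decompress_kernel<<<grid, block, 0, streams[s]>>>(
            d_in + (size_t)s * CHUNK, d_out + (size_t)s * CHUNK, CHUNK);
        compute_kernel<<<grid, block, 0, streams[s]>>>(
            d_out + (size_t)s * CHUNK, CHUNK);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_in);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Because operations within a stream execute in order, reusing only NSTREAMS device buffer slots is safe, while chunks in different streams overlap host-to-device copies with decompression and computation on the GPU.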

Parallel Abstract (English)


As computational speed on GPUs increases faster than communication bandwidth, reducing communication costs will remain crucial to the performance of GPU-accelerated applications. This dissertation presents two techniques for reducing communication costs on GPU systems and investigates their feasibility. The first technique is data compression combined with data streaming, which limits the number of memory accesses and overlaps the decompression overhead and the communication cost with GPU computation. The second technique is artificial barrier synchronization, which improves communication efficiency by reducing contention for the memory system. Both techniques trade a small amount of additional computation for a reduction in communication cost, resulting in an overall performance gain. This study demonstrates how the proposed techniques improve the performance of six computational kernels: radix sort, box intersection, sensor deployment, vector addition, scalar product, and sequence alignment. We conducted theoretical analyses and extensive experiments for both techniques; the analytical and experimental results confirm their effectiveness in improving GPU application performance.
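The abstract does not specify how or where the artificial barriers are placed, so the following is only a minimal sketch of the general idea under that assumption: a block-level __syncthreads() inserted into a grid-stride vector-addition kernel (a hypothetical kernel chosen because vector addition is one of the six studied kernels), where the barrier is not needed for correctness but keeps a block's threads issuing memory requests in phase.

```cuda
#include <cuda_runtime.h>

__global__ void vector_add_phased(const float *a, const float *b,
                                  float *c, int n) {
    int stride = blockDim.x * gridDim.x;
    /* Grid-stride loop: every thread of a block runs the same number of
       iterations, so the barrier below is reached uniformly (no divergence). */
    for (int base = blockIdx.x * blockDim.x; base < n; base += stride) {
        int i = base + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
        /* Artificial barrier: trades a small amount of extra synchronization
           for less memory-system contention, since the block's threads keep
           touching nearby addresses at roughly the same time. */
        __syncthreads();
    }
}
```

Whether such a barrier pays off depends on the access pattern and the hardware; the dissertation evaluates this trade-off analytically and experimentally.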

