
Optimization Techniques for Reducing Data Communication Costs on GPU Systems

Advisor: 鍾葉青

Abstract


For applications that rely on GPUs for computation, when computation on the GPU is far faster than data transfer, the key to overall performance lies in reducing communication costs. This dissertation proposes two techniques and investigates their feasibility for reducing data transfer costs. The first is data compression combined with data streaming, which reduces the number of memory accesses and hides the decompression and transfer costs within GPU computation time. The second is artificial barrier synchronization, which reduces contention in the GPU memory system and thereby improves communication efficiency. Both techniques trade additional computation for lower communication costs, improving overall performance. The proposed techniques are applied to improve the performance of six computational problems: radix sort, box intersection, sensor deployment, vector addition, scalar product, and sequence alignment. Theoretical and experimental analyses confirm that the two techniques effectively improve application performance on GPUs.
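To make the first technique concrete, the following is a minimal CUDA sketch of chunked data streaming, not the dissertation's actual implementation: decompress_kernel and compute_kernel are hypothetical placeholders, and NSTREAMS and CHUNK are assumed sizes. It only illustrates the general overlap pattern, where copies, decompression, and computation issued to different streams can proceed concurrently.

```cuda
#include <cuda_runtime.h>

#define NSTREAMS 4
#define CHUNK    (1 << 20)   /* elements per chunk (assumed size) */

/* Placeholder "decompression": expands one input byte into one float. */
__global__ void decompress_kernel(const unsigned char *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)in[i];
}

/* Placeholder computation on the decompressed chunk. */
__global__ void compute_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int nchunks = 8;     /* compressed input would fill h_in in practice */
    unsigned char *h_in, *d_in;
    float *d_out;

    /* Pinned host memory is required for truly asynchronous copies. */
    cudaMallocHost((void **)&h_in, (size_t)nchunks * CHUNK);
    cudaMalloc((void **)&d_in,  (size_t)NSTREAMS * CHUNK);
    cudaMalloc((void **)&d_out, (size_t)NSTREAMS * CHUNK * sizeof(float));

    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    dim3 block(256);
    dim3 grid((CHUNK + block.x - 1) / block.x);
    for (int c = 0; c < nchunks; ++c) {
        int s = c % NSTREAMS;  /* in-stream ordering makes buffer reuse safe */
        /* Copy, decompress, and compute in the same stream; chunks issued to
           different streams overlap transfers with kernels of other chunks. */
        cudaMemcpyAsync(d_in + (size_t)s * CHUNK, h_in + (size_t)c * CHUNK,
                        CHUNK, cudaMemcpyHostToDevice, streams[s]);
        decompress_kernel<<<grid, block, 0, streams[s]>>>(
            d_in + (size_t)s * CHUNK, d_out + (size_t)s * CHUNK, CHUNK);
        compute_kernel<<<grid, block, 0, streams[s]>>>(
            d_out + (size_t)s * CHUNK, CHUNK);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_in);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Because operations within a stream execute in order, reusing only NSTREAMS device buffer slots is safe, while chunks in different streams overlap host-to-device copies with decompression and computation on the GPU.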

Parallel Abstract (English)


As computational speed on GPUs increases faster than communication bandwidth, reducing communication costs will remain crucial to the performance of GPU-accelerated applications. This dissertation presents two techniques for reducing communication costs on GPU systems and investigates their feasibility. The first technique is data compression combined with data streaming, which limits the number of memory accesses and overlaps the decompression overhead and the communication cost with GPU computation. The second technique is artificial barrier synchronization, which improves communication efficiency by reducing contention for the memory system. Both techniques trade a small amount of additional computation for a reduction in communication cost, resulting in an overall performance gain. This study demonstrates how the proposed techniques improve the performance of six computational kernels: radix sort, box intersection, sensor deployment, vector addition, scalar product, and sequence alignment. We conducted theoretical analyses and extensive experiments for both techniques; the analytical and experimental results confirm their effectiveness in improving GPU application performance.
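The abstract does not specify how or where the artificial barriers are placed, so the following is only a minimal sketch of the general idea under that assumption: a block-level __syncthreads() inserted into a grid-stride vector-addition kernel (a hypothetical kernel chosen because vector addition is one of the six studied kernels), where the barrier is not needed for correctness but keeps a block's threads issuing memory requests in phase.

```cuda
#include <cuda_runtime.h>

__global__ void vector_add_phased(const float *a, const float *b,
                                  float *c, int n) {
    int stride = blockDim.x * gridDim.x;
    /* Grid-stride loop: every thread of a block runs the same number of
       iterations, so the barrier below is reached uniformly (no divergence). */
    for (int base = blockIdx.x * blockDim.x; base < n; base += stride) {
        int i = base + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
        /* Artificial barrier: trades a small amount of extra synchronization
           for less memory-system contention, since the block's threads keep
           touching nearby addresses at roughly the same time. */
        __syncthreads();
    }
}
```

Whether such a barrier pays off depends on the access pattern and the hardware; the dissertation evaluates this trade-off analytically and experimentally.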

