
A Cycle-Accurate Simulator for Modern GPU (現代繪圖晶片之精確週期模擬器)

Advisor: 楊佳玲

Abstract


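Modern GPUs offer far more parallelism and computing power than general-purpose CPUs, and they are steadily gaining attention in academic research. To deliver highly parallel computation, a GPU partitions its threads into groups and uses a single instruction to schedule and execute the threads of a group together. This massively parallel execution style reduces the cost of control hardware, so that more of the budget can go to computational hardware. Existing GPU studies neither consider nor implement hardware micro-architectural details, such as the micro-architecture of the programmable processor: the instruction unit, the scheduling unit, the texture pipeline, and so on. Their experiments also concentrate on general-purpose computing applications rather than on graphics applications, which matter most because graphics is the fundamental purpose of GPU design. By collecting many papers, patents, and public materials, this thesis is the first study to assemble a plausible micro-architecture for the programmable processor of an NVIDIA-like GPU and to implement a simulator framework targeting the OpenGL ES application platform. The framework comprises the GPU simulator and a front-end toolchain (capturer, compiler, and driver) that converts OpenGL ES programs into a form the simulator can run.

Using this framework, we implement five kernel applications that represent important rendering techniques in 3D graphics, and we use them to explore the design space of the GPU execution model. We find that simply adding compute capability to raise parallelism mostly benefits compute-heavy programs, but it also aggravates the effect of branch divergence, which lowers hardware utilization and hurts performance, so performance does not scale in proportion to compute capability. In addition, the share of texture operations in a 3D graphics program and the number of thread groups the GPU can execute concurrently determine how much memory access latency can be hidden. Each set of program characteristics therefore corresponds to a suitable ratio of compute capability to the number of concurrent thread groups, one that maximizes computational parallelism and latency hiding at the same time.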
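The relationship between concurrent thread groups and hidden memory latency can be illustrated with a rough first-order model. This is only a sketch and not the simulator described in this thesis: it assumes every thread group alternates a fixed number of compute cycles with one texture fetch of fixed latency, and that the scheduler switches between groups at no cost. The function names and the numbers used are hypothetical.

```python
def alu_utilization(num_groups, compute_cycles, texture_latency):
    """Estimated fraction of cycles in which the ALUs have work.

    Assumption (not from the thesis): each thread group repeats
    `compute_cycles` of ALU work followed by one texture fetch that
    takes `texture_latency` cycles, with zero-cost group switching.
    """
    period = compute_cycles + texture_latency
    return min(1.0, num_groups * compute_cycles / period)


def groups_to_hide_latency(compute_cycles, texture_latency):
    """Smallest number of groups that keeps the ALUs busy in this model."""
    period = compute_cycles + texture_latency
    return -(-period // compute_cycles)  # ceiling division on integers


if __name__ == "__main__":
    # A texture-heavy shader: little compute between long fetches.
    print(alu_utilization(4, 10, 90))
    print(groups_to_hide_latency(10, 90))
```

Under these assumptions, a program with fewer compute cycles per texture fetch needs proportionally more concurrent groups before the fetch latency is fully covered, which is the kind of ratio the abstract refers to.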

Parallel Abstract


Modern Graphics Processing Units (GPUs) have attracted considerable attention recently because they provide orders of magnitude more computing power than CPUs. GPUs adopt the SIMT (Single Instruction, Multiple Threads) execution model, which groups several threads into a warp for scheduling and execution. Current studies on GPUs are often based on an abstract model of SIMT execution and focus only on non-graphics applications. Micro-architectural details of the fetch logic, warp scheduling units, and texture pipelines are not modeled. This thesis attempts to sketch a complete GPU micro-architecture that supports SIMT execution. Because major GPU vendors release only limited information, we derive an NVIDIA-like GPU architecture through an extensive patent search. We also develop a cycle-accurate simulator. We evaluate SIMT execution efficiency on a set of 3D graphics kernels with complicated shading effects. We explore the design space of the SIMT pipeline width, the number of concurrent warps, and the trade-off between these two factors. We make the following observations. First, a wider SIMT pipeline benefits compute-bound workloads, but its performance is limited by branch divergence. Second, texture performance plays an important role in 3D graphics applications. For applications with many texture accesses, the number of concurrent warps is critical to performance.
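The interaction between SIMT width and branch divergence can be illustrated with a small model. This is only a sketch and not the simulator developed in this thesis: it assumes a single if/else region, that a warp whose threads disagree on the branch executes both paths one after the other with inactive lanes masked off, and that each path has a fixed instruction count. The function name and the example branch pattern are hypothetical.

```python
def simt_utilization(taken, warp_width, then_len, else_len):
    """Fraction of issued lane-cycles that do useful work.

    taken      -- per-thread branch outcome (True = 'then' path)
    warp_width -- number of threads issued together
    then_len, else_len -- instruction counts of the two paths (assumed)
    """
    useful = 0
    issued = 0
    for start in range(0, len(taken), warp_width):
        warp = taken[start:start + warp_width]
        n_then = sum(warp)
        n_else = len(warp) - n_then
        # A divergent warp pays for both paths; a uniform warp for one.
        cycles = (then_len if n_then else 0) + (else_len if n_else else 0)
        useful += n_then * then_len + n_else * else_len
        issued += warp_width * cycles
    return useful / issued if issued else 0.0


if __name__ == "__main__":
    # Branch outcome flips every 8 threads (hypothetical pattern).
    taken = [i % 16 < 8 for i in range(64)]
    for width in (8, 16, 32):
        print(width, simt_utilization(taken, width, 4, 4))
```

With this pattern, warps of 8 threads never diverge, while warps of 16 or 32 threads always contain both outcomes and spend half of their lane-cycles masked off. This matches the first observation above only qualitatively; real workloads have more complex control flow.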

