HTranslator: An LLVM-Based GPU Compiler for a Heterogeneous-Core Emulator

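Heterogeneous System Architecture (HSA) is an industry standard defined by the HSA Foundation, whose members include many major application-processor vendors such as AMD, ARM, MediaTek, Samsung, and Qualcomm. Based on an emulator developed to this standard, this thesis describes the design and implementation of the compiler for the emulator's GPU part, which generates Single Instruction Multiple Data (SIMD) instructions as an optimization. When GPU execution under HSA is emulated, a CPU has far fewer threads than a physical GPU. If each GPU execution is handed to a single CPU thread, every CPU thread is assigned several of the physical GPU's work units and runs them one after another. In most cases the GPU executes the same instructions on different data, so with SIMD instructions the hardware can process several data items within a single instruction. One thread then completes work that previously required several threads, which improves the emulator's performance and more closely matches how a GPU actually operates. When conditional branch instructions are present, different GPU threads may branch to different target addresses, so they cannot be emulated directly with SIMD instructions. Before generating machine code, the compiler must therefore reconstruct the program's control flow so that every instruction in the block at any target address is executed. To keep results correct, a bitmap records each GPU thread's conditional-branch outcome: when a conditional branch occurs, whether each GPU thread branches is written into the bitmap, and for GPU threads that should not execute the instructions at a given target address, the bitmap is used to mask out their results.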

Keywords

heterogeneous system architecture; SIMD; GPU; emulator; compiler

Parallel Abstract

Heterogeneous System Architecture (HSA) is an open industry standard formulated by the HSA Foundation. Many application-processor vendors, such as AMD, ARM, MediaTek, Samsung, and Qualcomm, are members. This thesis focuses on an emulator based on this standard and describes the design of its GPU compiler. In addition, it adds Single Instruction Multiple Data (SIMD) instructions to speed up the emulator's execution. When GPU execution is simulated under the heterogeneous system architecture, the number of threads on the CPU is far smaller than on a physical GPU. If the emulator assigns each physical GPU task to a CPU thread, each thread receives more than one task and completes them one after another. In most situations, physical GPUs execute the same instructions on different data. In these cases, SIMD instructions can be added to speed up execution. With the help of the hardware, the emulator can handle several data items at once within a single SIMD instruction, so one thread completes tasks that were previously assigned to several threads. This improves the emulator's performance and brings it much closer to the execution of a physical GPU. When conditional branch instructions exist, different GPU threads may jump to different target addresses and cannot be simulated directly with SIMD instructions. To resolve this, before the compiler generates target code it must reconstruct the program's control flow to make sure that every instruction in the blocks pointed to by the target addresses is executed. To ensure that results remain correct after SIMD instructions are added and the control flow is reconstructed, a bitmap is needed to record the conditional-jump result of each GPU thread. When a conditional jump instruction is encountered, the result for each GPU thread is written into the bitmap. For GPU threads that should not execute the instructions in a block, the emulator uses the bitmap to mask their results.
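The masking scheme described above can be illustrated with a small sketch. This is not HTranslator's code: the kernel, the lane count LANES, and the function names scalar_kernel, branch_bitmap, and simd_kernel are hypothetical, and plain Python lists stand in for SIMD registers. It only shows the idea of running both branch targets on every lane and using a bitmap of branch outcomes to choose which result each lane keeps.

```python
LANES = 4  # assumed SIMD width for illustration; real hardware widths differ


def scalar_kernel(x):
    """Reference: one GPU thread with an ordinary conditional branch."""
    if x > 0:
        return x * 2
    return -x


def branch_bitmap(xs):
    """Pack each lane's branch outcome into one integer (bit i = lane i)."""
    bitmap = 0
    for lane, x in enumerate(xs):
        if x > 0:
            bitmap |= 1 << lane
    return bitmap


def simd_kernel(xs):
    """Execute both branch targets on every lane, then mask with the bitmap."""
    bitmap = branch_bitmap(xs)
    taken = [x * 2 for x in xs]    # block at the branch target, all lanes
    not_taken = [-x for x in xs]   # fall-through block, all lanes
    return [
        taken[lane] if (bitmap >> lane) & 1 else not_taken[lane]
        for lane in range(len(xs))
    ]


data = [3, -1, 0, 5]               # one value per emulated GPU thread
assert len(data) == LANES
result = simd_kernel(data)
print(bin(branch_bitmap(data)), result)
```

Each lane ends up with the same value the scalar kernel would produce for it, even though every lane computed both blocks. On real SIMD hardware the per-lane selection would typically be a single blend or select instruction rather than a loop.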

Parallel Keywords

heterogeneous system architecture; SIMD; GPU; emulator; compiler

