  • Conference paper
  • OpenAccess

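Design of a Hyper-Scalar Processor Architecture with a Loop Unrolling Mechanism Using an Instruction Tagging Method Based on Semantic Analysis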

Abstract


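Most mainstream machine learning, image processing, and encryption programs rely heavily on loops, whose instructions are executed repeatedly and in large numbers. Data dependencies between loop instructions, instruction-flow stalls caused by mispredicted loop branches, and basic blocks bounded by branch instructions all keep processors that support instruction-level parallelism (ILP) from achieving good parallelism on such loops. VLIW processors take a different approach: the compiler schedules the instructions before they are sent to the processor, which raises instruction parallelism. When a program becomes too complex, however, it still has to be scheduled by hand, an approach that lacks flexibility and cannot adapt dynamically. This thesis proposes an instruction tagging method, built into the compiler and based on semantic analysis, that statically analyzes program instructions to find the loop structures in a program. Because different types of loop structure are laid out with different instruction arrangements, distinct loop semantics can be derived. The compiler uses these loop semantics to detect whether the instructions contain a loop structure and to find every branch instruction in it. Each loop instruction that matches a loop semantic is marked with a one-byte instruction tag according to its loop type and instruction type. On the hyper-scalar processor, this thesis builds a loop unrolling mechanism driven by the instruction tags. The loop unroller has three parts: (1) a loop instruction collector, (2) a loop instruction unroller, and (3) a loop instruction dependency tag generator. Given the defined tags, the loop instruction collector needs only a decoder and comparators to decode the tag and store loop instructions by loop type and instruction type. The loop instruction unroller fetches and unrolls the loop instructions and builds a Branch Flush Table to avoid instruction-flow stalls caused by mispredicted loop branches. The loop instruction dependency tag generator produces data dependency tags and rearranges the instruction dispatch order, raising instruction parallelism and execution performance. The ARM assembly used in this thesis is generated by compiling C with the Keil uVision5 compiler. The instructions and tags are fed into a simulator for verification, using odd-even sum, bubble sort, and matrix multiplication programs. With the loop unrolling mechanism added to an eight-core hyper-scalar processor, performance improves by a factor of 1.2 to 4.1 depending on the test program, ILP reaches 4.77 to 5.87, and ILP improves by a factor of 1.3 to 1.7.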
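The compiler-side tagging step can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the tag layout (high nibble for loop type, low nibble for instruction role), the names `Instr` and `tag_loops`, and the rule "a conditional branch to an earlier label closes a loop" are all assumptions made for illustration. The abstract states only that the tag is one byte and encodes the loop type and the instruction type.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical one-byte tag layout (an assumption, not the thesis's encoding):
# high nibble = loop type, low nibble = role of the instruction in the loop.
LOOP_COUNTED = 0x1
ROLE_BODY, ROLE_BRANCH = 0x1, 0x2
COND_BRANCHES = {"BNE", "BEQ", "BLT", "BGT", "BLE", "BGE"}

@dataclass
class Instr:
    label: Optional[str]  # label attached to this instruction, if any
    text: str             # ARM-style assembly text
    tag: int = 0x00       # 0x00 means "not part of a detected loop"

def tag_loops(program: List[Instr]) -> List[Instr]:
    """Tag every instruction between a label and a later conditional
    branch that jumps back to that label (a backward branch)."""
    label_at = {ins.label: i for i, ins in enumerate(program) if ins.label}
    for i, ins in enumerate(program):
        parts = ins.text.split()
        op, target = parts[0], parts[-1]
        if op in COND_BRANCHES and label_at.get(target, i + 1) <= i:
            for j in range(label_at[target], i):
                # Keep branch tags already assigned by an inner loop.
                if program[j].tag & 0x0F != ROLE_BRANCH:
                    program[j].tag = (LOOP_COUNTED << 4) | ROLE_BODY
            ins.tag = (LOOP_COUNTED << 4) | ROLE_BRANCH
    return program

program = [
    Instr(None, "MOV r0, #0"),
    Instr(None, "MOV r1, #10"),
    Instr("loop", "ADD r0, r0, r1"),
    Instr(None, "SUBS r1, r1, #1"),
    Instr(None, "BNE loop"),
    Instr(None, "BX lr"),
]
tags = [ins.tag for ins in tag_loops(program)]
```

Because each tag fits in a single byte with fixed fields, hardware reading it would need only to split the byte and compare the fields, which is consistent with the abstract's remark that the collector requires just a decoder and comparators.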

Parallel Abstract


Mainstream programs for machine learning, image processing, and encryption currently make heavy use of loops, and the instructions in these loops are executed repeatedly and in large numbers. Data dependencies between loop instructions, instruction-flow stalls caused by branch misprediction, and basic blocks bounded by branch instructions all lead to poor instruction parallelism when an ILP processor executes these loops. In a VLIW processor, by contrast, instructions are scheduled by the compiler and then sent to the processor for execution to improve instruction parallelism. If the program is too complicated, however, manual scheduling is still required, and this method is inflexible and cannot be adjusted dynamically. This thesis proposes an instruction labeling method based on semantic analysis, implemented in the compiler, that statically analyzes program instructions to find the loop structures in a program. Different types of loop structure have different instruction patterns, which can be summarized into different loop instruction semantics. Based on these semantics, the compiler detects whether the instructions contain a loop structure and finds all of the branch instructions in that structure. Each loop instruction that conforms to the loop semantics is given a one-byte instruction tag according to its loop type and instruction type. In this thesis, a loop unrolling mechanism based on the instruction tag is built on a hyper-scalar processor. The mechanism is divided into three parts: the loop instruction collector, the loop instruction unroller, and the loop instruction dependency tag generator. Using the instruction tag, the loop instruction collector needs only a decoder and a comparator to store each loop instruction according to its loop type and instruction type. The loop instruction unroller fetches and unrolls the loop instructions and establishes the Branch Flush Table to avoid instruction-flow stalls caused by branch misprediction. The loop instruction dependency tag generator generates instruction data dependency tags and rearranges the instruction dispatch order to increase instruction parallelism and execution performance. For verification, the Keil uVision5 compiler compiles C code into ARM assembly language, and the instructions and tags are fed into the hyper-scalar processor simulator. The loop unrolling mechanism is added to the eight-core hyper-scalar processor. Across different test programs, performance improves by a factor of 1.2 to 4.1, ILP ranges from 4.77 to 5.87, and ILP improves by a factor of 1.3 to 1.7.
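To see why unrolling combined with dependency-aware dispatch can raise ILP, consider the following minimal Python sketch. It is a toy model under stated assumptions, not the thesis's simulator: every instruction has unit latency, only read-after-write dependencies are tracked (as if registers were renamed), loop iterations are assumed to carry no dependencies between them, and the function names `unroll` and `schedule_ilp` are hypothetical. The figures it produces are unrelated to the results reported in the abstract.

```python
def unroll(body, times):
    """Replicate a loop body `times` times, giving each copy its own
    register names (suffix _k). Assumes no loop-carried dependencies."""
    out = []
    for k in range(times):
        for dest, srcs in body:
            out.append((f"{dest}_{k}", tuple(f"{s}_{k}" for s in srcs)))
    return out

def schedule_ilp(instrs, width=8):
    """Place each instruction in the earliest cycle in which all of its
    source values are ready and an issue slot is free, then return
    instructions per cycle."""
    ready_cycle = {}  # register -> first cycle in which its value is usable
    issued = {}       # cycle -> number of instructions already issued
    last = 0
    for dest, srcs in instrs:
        cycle = max([ready_cycle.get(s, 0) for s in srcs], default=0)
        while issued.get(cycle, 0) >= width:
            cycle += 1  # issue width exhausted, try the next cycle
        issued[cycle] = issued.get(cycle, 0) + 1
        ready_cycle[dest] = cycle + 1  # unit latency (assumption)
        last = max(last, cycle)
    return len(instrs) / (last + 1)

# One iteration of a hypothetical loop body: two loads, a multiply, and an
# accumulate, written as (destination, sources).
body = [("a", ("pa",)), ("b", ("pb",)), ("c", ("a", "b")), ("m", ("c", "pc"))]

ilp_rolled = schedule_ilp(body)               # 4 instructions in 3 cycles
ilp_unrolled = schedule_ilp(unroll(body, 4))  # 16 instructions in 3 cycles
```

In this model a single iteration is limited by its three-step dependency chain, while four unrolled copies fill the otherwise idle issue slots in the same three cycles. Real loops with loop-carried dependencies or branch penalties would gain less.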
