A Study on Enabling Software Pipelining on VLIW Parallel Architecture Cores and Digital Signal Processors

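Software pipelining is a powerful compiler technique. It overlaps the execution of adjacent loop iterations and thereby improves program performance. To achieve this, however, the scheduler must take a number of constraints into account. Many software pipelining algorithms have been proposed. We propose a method suited to the architecture of the PAC processor. Our work builds on the ORC compiler, which originally targets the 64-bit Itanium architecture, so ORC has to be modified. The PAC processor differs from the 64-bit Itanium architecture in three main ways. First, the PAC processor is a clustered architecture: instructions must be assigned to different clusters, and communication between instructions in different clusters must be handled. Second, the PAC processor has no hardware support for rotating registers. ORC's original code generation is therefore unsuitable for the PAC processor and must be modified; otherwise the generated program would execute incorrectly. Following earlier research, we use Modulo Variable Expansion to solve this problem. Third, access to the global registers of the PAC processor is subject to special restrictions that the 64-bit Itanium does not have. We use our own data structures to account for these restrictions and modify the scheduling part of the original software pipeliner. In the experiments, we use the instruction set simulator of the PAC processor with DSPstone as the test programs. We compare results across optimization levels, across different numbers of clusters with software pipelining enabled, and with and without software pipelining. The results show that, on average, code with software pipelining enabled at optimization level O1 is more than twice as fast as code without software pipelining at optimization level O0.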
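To make the role of Modulo Variable Expansion more concrete, the following is a minimal sketch of how the kernel unroll factor can be derived when no rotating registers are available. It follows the general idea described in Lam's paper rather than the thesis's actual ORC modification, which the abstract does not detail. The function name, the variable names, and the example lifetimes are hypothetical.

```python
def mve_unroll_factor(lifetimes, ii):
    """Sketch of Modulo Variable Expansion bookkeeping (names are hypothetical).

    lifetimes: dict mapping a loop variable to its lifetime in cycles
    ii:        initiation interval of the software-pipelined loop
    Returns (unroll, regs): the kernel unroll factor and the number of
    register copies given to each variable.
    """
    # A value that lives longer than II cycles overlaps with the value produced
    # by the next iteration, so it needs ceil(lifetime / II) distinct registers.
    copies = {v: max(1, -(-life // ii)) for v, life in lifetimes.items()}

    # Unroll the kernel just enough for the most demanding variable.
    unroll = max(copies.values())

    # Each variable gets the smallest divisor of the unroll factor that still
    # covers its required copies, so its register names repeat cleanly.
    regs = {
        v: next(d for d in range(q, unroll + 1) if unroll % d == 0)
        for v, q in copies.items()
    }
    return unroll, regs


# Hypothetical lifetimes for three loop variables with II = 2.
unroll, regs = mve_unroll_factor({"a": 5, "b": 2, "c": 3}, ii=2)
```

With these assumed inputs, variable `a` needs three copies, so the kernel is unrolled three times; `c` needs only two copies but is rounded up to three because two does not divide the unroll factor.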

Keywords

Software pipelining; ORC; 64-bit Itanium; cluster; rotating register; Modulo Variable Expansion; instruction set simulator; DSPstone

Parallel Abstract

Software pipelining is a powerful loop optimization technique in compilers. It overlaps the execution of adjacent loop iterations to improve performance. To achieve this, however, many constraints have to be considered in the scheduling phase. Many software pipelining algorithms have already been proposed; we propose a method for a clustered VLIW DSP processor known as the PAC platform. We enable software pipelining on the PAC platform using ORC. ORC was originally built for IA-64 architectures and does not support the PAC platform, so we need to modify it to fit the PAC architecture. There are three main differences between the PAC and IA-64 architectures. First, the VLIW data paths of the PAC architecture are clustered. We have to assign instructions to appropriate clusters and handle communication between clusters. Second, the PAC architecture has no hardware support for rotating registers. The code generation of ORC must therefore be modified; otherwise it may produce incorrect code. We draw on previous work, modulo variable expansion, to solve this problem. Third, there are ping-pong constraints on access to the global register files of the PAC architecture. We use our own data structures to account for these constraints and modify the modulo scheduling phase of software pipelining accordingly. We run the experiments on the Instruction Set Simulator for the PAC DSP architecture and use the DSPstone suite as our benchmark. We compare the results of software pipelining at different optimization levels and with different numbers of clusters. The results show a speedup of at least 2x over -O0 code generation for every benchmark case when our scheme is applied.
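As an illustration of the kind of bookkeeping a modulo scheduler performs when instructions must also be assigned to clusters, here is a minimal sketch of a generic modulo reservation table keyed by (cluster, functional unit). It is not the thesis's data structure, and it does not model the PAC ping-pong register constraints, whose details the abstract does not give. The class name, cluster names, and unit names are all hypothetical.

```python
class ModuloReservationTable:
    """Generic modulo reservation table (illustrative sketch, not ORC's).

    A resource used at cycle t in one iteration is also used at cycles
    t + II, t + 2*II, ... by later iterations, so usage is tracked per
    slot t mod II.
    """

    def __init__(self, ii, units):
        # units: dict mapping (cluster, unit) to how many instructions that
        # resource can accept per cycle.
        self.ii = ii
        self.capacity = dict(units)
        self.used = [dict.fromkeys(units, 0) for _ in range(ii)]

    def can_place(self, cycle, resource):
        # Check the slot that this cycle folds onto in the kernel.
        return self.used[cycle % self.ii][resource] < self.capacity[resource]

    def place(self, cycle, resource):
        # Reserve the resource if free; report whether the placement succeeded.
        if not self.can_place(cycle, resource):
            return False
        self.used[cycle % self.ii][resource] += 1
        return True


# Hypothetical machine: two clusters with one ALU each, II = 2.
mrt = ModuloReservationTable(2, {("c0", "alu"): 1, ("c1", "alu"): 1})
first = mrt.place(0, ("c0", "alu"))    # slot 0 on cluster c0 is free
clash = mrt.place(2, ("c0", "alu"))    # cycle 2 folds onto slot 0: conflict
moved = mrt.place(2, ("c1", "alu"))    # the other cluster is still free
```

In this sketch a scheduler that hits a conflict on one cluster can either try another cluster, as above, or try a later cycle; a real clustered scheduler would also have to account for the cost of inter-cluster communication.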

Parallel Keywords

Software Pipelining; ORC; Itanium Architecture (64-bit); Cluster; Rotating Register; Modulo Variable Expansion; Instruction Set Simulator; DSPstone

References

[3] B. Rau. "Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops." MICRO-27, 1994, pp. 63-74.

[4] B. Rau, M. Schlansker, and P. Tirumalai. "Code Generation Schemas for Modulo Scheduled DO-Loops and WHILE-Loops." MICRO-25, Dec. 1992.

[5] M. Lam. "Software Pipelining: An Effective Scheduling Technique for VLIW Machines." Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, 1988.

[6] M. E. Wolf and M. S. Lam. "A Loop Transformation Theory and an Algorithm to Maximize Parallelism." IEEE Transactions on Parallel and Distributed Systems, 1991.

[7] Yung-Chia Lin, Yi-Ping You, and Jenq Kuen Lee. "Register Allocation for VLIW DSP Processors with Irregular Register Files." Compiler for Parallel Computing, 2006.
