A Block-Based Loop Buffer with Innermost-Loop Control for Improved Energy Savings

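A simple buffer memory, called a loop buffer, can be placed between the central processing unit and the instruction cache. Unlike a general-purpose cache, it stores only innermost loops, so it is small and fast. When the buffer is large enough to hold an entire loop, the main benefit is that the loop's instructions need to be read from the main cache only once; afterwards the buffer supplies them to the processor core at very low energy. In earlier studies, innermost-loop detection usually starts at the fetch stage, where whether a branch is treated as taken or not taken depends on the branch predictor. Whenever a backward branch, or a forward branch inside the loop, is mispredicted, the instructions already stored in the buffer must be flushed and a new loop detected. When forward branches are highly variable, the predictor is of limited use, and the result is wasted energy spent fetching useless instructions. Here we try to apply the idea of a trace cache to the loop buffer. A trace cache is large and complex; using one directly as a loop buffer would not save energy and would instead lower overall performance because of the long instruction-fetch latency. In this thesis we propose two methods: (1) detecting the innermost loop at the commit stage and then, at the fetch stage, filling the buffer with the instructions fetched from the instruction cache; and (2) placing the branch instructions inside the loop body into the buffer and packing the stored instructions on a basic-block basis.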

Keywords

Loop buffer; Innermost loop; Trace cache; Basic block

Parallel Abstract

A loop buffer is a small memory located between the CPU and the level-one instruction cache (IL1 hereafter). It differs from a dedicated instruction cache in that it keeps only instructions in sequence, so it is smaller and faster than the main cache. The instruction fetch unit benefits most when the loop buffer is large enough to hold every instruction in a loop: the instructions are fetched from the cache only once, and the buffer then delivers them to the CPU core at very low energy. In previous work, the controller begins detecting the innermost loop at the fetch stage, where whether a branch is taken or not taken depends mainly on the branch predictor. Once a backward branch or a forward branch in the loop is mispredicted, the controller has to flush the instructions in the buffer, then detect and refill a new loop from the main cache. Forward branches in particular can be so unstable that the predictor offers little benefit, and this behavior wastes fetch power. Here we attempt to bring the concept of a trace cache, which is quite bulky and complicated, into the architecture of the loop buffer. Using a trace cache itself as a loop buffer would not save energy; on the contrary, it would degrade overall performance because of the long latency at the fetch stage. We therefore propose (1) performing innermost-loop detection at the commit stage and filling and activating the buffer at the fetch stage; and (2) helping the loop buffer store innermost loops that contain forward branches by packing the instructions captured from the instruction cache as basic blocks. With these modifications, we aim to strengthen the loop buffer so that it improves performance and reduces power further.
Results with SPEC2000 indicate reductions in instruction fetch power of up to 45% (integer benchmarks) and 55% (floating-point benchmarks) compared with a design without a loop buffer. Furthermore, we obtain a power improvement of 3% (integer benchmarks) and 2% (floating-point benchmarks) over a loop buffer design that handles loops at the fetch stage.
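As a rough illustration of method (1), the following Python sketch models a detector that watches only committed branches, so mispredicted (speculative) branches never disturb the buffer state. It is a simplified behavioral model, not the controller described in the thesis: the class name `CommitStageLoopDetector`, its fields, the fixed 4-byte instruction width, and the rule "a loop becomes active once the same taken backward branch commits twice" are all assumptions made for illustration. Taken forward branches whose targets stay inside the loop body leave the state untouched, which loosely reflects the idea in method (2) of keeping such branches in the buffer.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CommitStageLoopDetector:
    """Hypothetical behavioral model of commit-stage innermost-loop detection."""
    capacity: int                                # buffer size in instructions (assumed)
    instr_bytes: int = 4                         # fixed-width ISA (assumed)
    candidate: Optional[Tuple[int, int]] = None  # (loop start pc, loop end pc)
    active: bool = False                         # True: fetch may be served by the buffer

    def _reset(self) -> None:
        self.candidate = None
        self.active = False

    def on_commit(self, pc: int, is_branch: bool, taken: bool, target: int = 0) -> None:
        """Update the state for one committed (non-speculative) instruction."""
        if not is_branch:
            return
        if taken and target <= pc:
            # A taken backward branch closes a loop spanning target..pc.
            loop = (target, pc)
            length = (pc - target) // self.instr_bytes + 1
            if length > self.capacity:
                self._reset()                    # loop does not fit in the buffer
            elif loop == self.candidate:
                self.active = True               # same loop committed again
            else:
                self.candidate = loop            # new loop: fill on the next iteration
                self.active = False
        elif self.candidate is not None:
            start, end = self.candidate
            leaves_loop = taken and not (start <= target <= end)
            falls_out = (not taken) and pc == end
            if leaves_loop or falls_out:
                self._reset()                    # control flow has left the loop
```

In this model, a caller would invoke `on_commit` once per retired instruction and consult `active` at the fetch stage to decide whether to read from the buffer or from IL1.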

Parallel Keywords

Loop buffer; Innermost loop; Trace cache; Basic block

