
Dynamic Fetch Engine Design for Simultaneous Multithreaded Processors

Advisor: 謝忠健

Abstract


Simultaneous multithreading (SMT) is a processor design that combines the hardware features of superscalar and multithreaded processors. By sharing processor resources, it exploits not only instruction-level parallelism but also thread-level parallelism to achieve higher performance. In this architecture the fetch unit is regarded as one of the main performance bottlenecks, and prior work has proposed many fetch schemes to improve instruction-fetch efficiency. Among them, the ICOUNT scheme proposed by Tullsen et al. is considered an excellent scheme: it not only improves performance but is also convenient to implement. It assigns each thread a priority according to the number of that thread's instructions in the decode unit, register-renaming unit, and instruction queues. The main reason for ICOUNT's success is that it favors threads that move quickly through the processor pipeline and can therefore use processor resources very efficiently. We believe it is better to let threads that tend to have more long-latency instructions gain priority at appropriate times, because long-latency instructions are very likely to lie on the critical path. We propose a dynamic scheme that gives such threads higher priority when utilization of the register update unit (RUU) and the load/store queue (LSQ) is low. Our motivation is to use processor resources not only more efficiently but also with the urgency of instructions in mind. The proposed scheme improves the utilization of processor resources such as the RUU and LSQ, and thereby improves performance. Experiments show that, compared with the ICOUNT scheme, our scheme achieves up to a 17% speedup. Moreover, it is easy to implement.
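As a minimal illustrative sketch (not taken from the thesis), the ICOUNT selection rule described above can be expressed as a sort over per-thread in-flight instruction counts; the function name and data layout here are assumptions for illustration only.

```python
# Hypothetical sketch of ICOUNT-style fetch priority (names are assumed,
# not from the thesis). Each cycle the fetch unit favors the thread with
# the FEWEST instructions sitting in the pre-issue stages (decode,
# register rename, instruction queues), i.e. the thread that is moving
# fastest through the pipeline.

def icount_priority(front_end_counts):
    """Return thread ids ordered from highest to lowest fetch priority.

    front_end_counts: dict mapping thread id -> number of that thread's
    instructions currently in decode, rename, and the instruction queues.
    """
    return sorted(front_end_counts, key=lambda tid: front_end_counts[tid])

# Thread 2 has the fewest pre-issue instructions, so it fetches first.
order = icount_priority({0: 12, 1: 7, 2: 3})
```

Because the count is already tracked per pipeline stage, this priority can be computed with a few small comparators, which is why the scheme is cheap in hardware.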

Parallel Abstract (English)


Simultaneous multithreading (SMT) is a processor design that combines the hardware features of superscalar and multithreaded processors, gaining performance by dynamically sharing processor resources to exploit thread-level parallelism along with instruction-level parallelism. The fetch unit has been identified as one of the major bottlenecks of this architecture, and several fetch schemes have been proposed by prior work to enhance fetching efficiency. Among these schemes, ICOUNT, proposed by Tullsen et al., which assigns each thread a priority according to the number of its instructions in the decode unit, register-renaming unit, and instruction queues, is considered an excellent scheme, both for the performance it achieves and for its ease of implementation. The ICOUNT scheme works mainly because it favors threads that move quickly through the pipeline and thus uses resources effectively. We believe it is better to let threads that tend to have more long-latency instructions gain priority at appropriate times, since long-latency instructions are very likely to lie on the program's critical path. We propose a dynamic fetch scheme that gives long-latency-bound threads higher priority while the RUU or LSQ is under low usage. Our motivation is to gain further performance by using resources not only effectively but also according to the urgency of the instructions. The proposed scheme aggressively improves LSQ and RUU utilization, further exploiting the shared processor resources and achieving overall performance improvements. Experiments show that our scheme achieves up to a 17% speedup compared to the ICOUNT scheme. Furthermore, it is easy to implement.
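The dynamic policy described above can be sketched as a two-mode selector: when RUU or LSQ occupancy is low, favor the thread with the most long-latency instructions; otherwise fall back to ICOUNT-style ordering. The watermark values, function names, and data layout below are assumptions for illustration, not the thesis implementation.

```python
# Hypothetical sketch of the proposed dynamic fetch policy (thresholds and
# names are assumed). When shared resources are underused, prioritize the
# thread with the MOST long-latency instructions, since those are likely
# on the critical path; otherwise behave like ICOUNT.

RUU_LOW_WATERMARK = 0.5   # assumed: "low usage" means below 50% occupancy
LSQ_LOW_WATERMARK = 0.5

def fetch_order(threads, ruu_usage, lsq_usage):
    """Return thread ids from highest to lowest fetch priority.

    threads: dict mapping thread id -> (front_end_count, long_latency_count)
    ruu_usage, lsq_usage: fractional occupancy of the RUU and LSQ (0.0-1.0).
    """
    if ruu_usage < RUU_LOW_WATERMARK or lsq_usage < LSQ_LOW_WATERMARK:
        # Resources are underused: favor long-latency-bound threads so
        # critical-path work enters the pipeline early.
        return sorted(threads, key=lambda tid: -threads[tid][1])
    # Resources are busy: fall back to ICOUNT (fewest in-flight wins).
    return sorted(threads, key=lambda tid: threads[tid][0])
```

The fallback branch keeps ICOUNT's behavior when the machine is already well utilized, so the policy only deviates when there is slack in the RUU or LSQ to absorb long-latency work.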

