Exploiting Asymmetric SIMD Register Configurations in Cross-ISA Dynamic Binary Translation

Advisor: 徐慰中 (Wei-Chung Hsu)

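Abstract


Over recent decades, the performance and energy-efficiency advantages of single instruction, multiple data (SIMD) execution have led hardware vendors to adopt it widely in their processors. Meanwhile, SIMD register configurations, meaning the number and width of the registers, have evolved and diverged rapidly across processor architectures as new instruction set extensions were introduced. However, when dynamic binary translation is used to port existing or commercial applications already optimized for a guest instruction set architecture (ISA) to a host ISA that has fewer but wider SIMD registers, the asymmetry between the guest and host SIMD register configurations has not yet been properly exploited. Underusing the host's SIMD parallelism and register capacity severely limits performance, which then falls well short of the host's theoretical maximum. In this thesis, we propose a new dynamic binary translation technique called Spill-aware Superword Level Parallelism (saSLP), which transforms binary loops already optimized for the guest ISA so that they fully exploit the host's SIMD parallelism and register capacity. The proposed saSLP algorithm merges several shorter guest SIMD instructions and registers into a single longer host counterpart, fully using the host's SIMD parallelism while also reducing register spilling. Experiments on real hardware show that, across several real-world applications and standard benchmarks, the proposed algorithm achieves speedups of 1.6x and 2.3x for ARMv8 NEON to x86 AVX2 and AVX-512 dynamic binary translation, respectively, while reducing register spilling by 97% and 99%.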

Parallel Abstract


Processor manufacturers have embraced single instruction, multiple data (SIMD) execution for decades because of its superior performance and power efficiency. SIMD register configurations (i.e., the number and width of the registers) have evolved and diverged rapidly through various ISA extensions on different architectures. However, using binary translation to migrate legacy or proprietary applications optimized for a guest ISA to a host ISA that has fewer but longer SIMD registers raises the issue of asymmetric SIMD register configurations. To date, this issue has been overlooked. As a result, only a small fraction of the potential performance gain is realized, because the host's SIMD parallelism and register capacity are underutilized. In this paper, we present a novel dynamic binary translation technique called spill-aware SLP (saSLP), which transforms binary loops optimized for a guest ISA to exploit longer host registers in terms of both data parallelism and register capacity. The proposed saSLP combines short guest SIMD instructions and registers to fully utilize the host's parallelism and to minimize register spilling. Experimental results show that saSLP improves performance by 1.6x (2.3x) across a number of benchmarks and reduces spilling by 97% (99%) for ARMv8 NEON to x86 AVX2 (AVX-512) translation.
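
The core idea of combining short guest SIMD operations into one longer host operation can be illustrated with a small model. The sketch below is not the saSLP algorithm itself, which the abstract does not specify in detail. It only assumes that registers can be modeled as Python lists of 32-bit lanes, and every function name in it (`simd_add`, `translate_one_to_one`, `translate_packed`, `host_registers_needed`) is hypothetical.

```python
import math

GUEST_LANES = 4  # a 128-bit guest register viewed as four 32-bit lanes
HOST_LANES = 8   # a 256-bit host register viewed as eight 32-bit lanes


def simd_add(a, b, lanes):
    """Model one SIMD add instruction over `lanes` lanes."""
    assert len(a) == len(b) == lanes
    return [x + y for x, y in zip(a, b)]


def translate_one_to_one(v2, v3, v4, v5):
    """Naive translation: each guest add stays a separate 4-lane add,
    so only half of a host register's width is used.
    Returns (v0, v1, number_of_host_adds)."""
    v0 = simd_add(v2, v4, GUEST_LANES)
    v1 = simd_add(v3, v5, GUEST_LANES)
    return v0, v1, 2


def translate_packed(v2, v3, v4, v5):
    """Packed translation: two independent guest adds share one 8-lane
    host add. Guest pairs (v2, v3) and (v4, v5) each occupy one host
    register. Returns (v0, v1, number_of_host_adds)."""
    lhs = v2 + v3  # list concatenation models packing two guest registers
    rhs = v4 + v5
    packed = simd_add(lhs, rhs, HOST_LANES)
    return packed[:GUEST_LANES], packed[GUEST_LANES:], 1


def host_registers_needed(guest_regs, guest_bits, host_bits):
    """Host registers needed to hold all guest registers when each host
    register is filled with as many guest registers as fit."""
    per_host = host_bits // guest_bits
    return math.ceil(guest_regs / per_host)


v2, v3 = [1, 2, 3, 4], [5, 6, 7, 8]
v4, v5 = [10, 20, 30, 40], [50, 60, 70, 80]
naive = translate_one_to_one(v2, v3, v4, v5)
packed = translate_packed(v2, v3, v4, v5)
```

In this model both translations produce the same lane values, but the packed version issues one host add instead of two. The capacity helper shows why packing also bears on spilling: thirty-two 128-bit guest registers need 32 host registers when mapped one to one, but only 16 when packed in pairs into 256-bit registers, and only 8 when packed in fours into 512-bit registers.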
