Dynamic Performance Tuning for Applications Using Restricted Transactional Memory

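Transactional Synchronization Extensions (TSX) is the transactional memory implemented on Intel's fourth-generation processors. It provides two programming interfaces: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM). HLE is easier to program with and is backward compatible, so HLE code can run on hardware without TSX support; RTM offers greater flexibility and scalability. Previous research has shown that critical sections protected by RTM, combined with a well-designed retry mechanism, can often outperform HLE. In short, although more parallel applications may use HLE because it is easy to use, switching to RTM may yield better performance. We propose a framework implemented on QEMU that converts HLE instructions into RTM code fragments at run time and tunes performance dynamically. Compared with the original HLE execution, our implementation based on dynamic binary translation achieves an average speedup of 1.15x with four threads and 1.56x with eight threads. Because of RTM's scalability, the performance gain becomes more pronounced as the number of threads grows.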

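Keywords

Hardware Transactional Memory; Transactional Synchronization Extensions; Dynamic Performance Tuning; Retry Mechanism; Dynamic Binary Translation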

Abstract (English)

Transactional Synchronization Extensions (TSX) provide hardware Transactional Memory (TM) on Intel 4th generation Core processors. Two programming interfaces, Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM), support software development with TSX. HLE is easy to use and remains backward compatible with processors that lack TSX support, while RTM is more flexible and scalable. Previous research has shown that critical sections protected by RTM, with a well-designed retry mechanism in front of the fallback code path, can often achieve better performance than HLE. More parallel programs may be written with HLE because of its ease of use, but using RTM may yield better performance. To combine the productivity of HLE with the performance of RTM, we present a framework built on QEMU that dynamically transforms HLE instructions in an application binary into RTM code fragments and tunes them adaptively at run time. Compared to HLE execution, our prototype achieves an average speedup of 1.15x with 4 threads and 1.56x with 8 threads. Because RTM scales better, the speedup becomes more pronounced as the number of threads increases.
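The retry-then-fallback pattern mentioned above can be illustrated with a small hardware-independent model. This is only a sketch under stated assumptions, not the framework's actual code: `run_elided`, `make_simulated_hardware`, and the `COMMITTED` marker are hypothetical names introduced here, and the simulated hardware merely stands in for a real RTM begin/commit sequence. The only detail taken from the real interface is the retry-hint bit of the abort status.

```python
import threading

COMMITTED = -1     # hypothetical marker: the transaction ran and committed
ABORT_RETRY = 0x2  # models the "retry may succeed" bit of an RTM abort status


def run_elided(lock, body, try_transaction, max_retries=3):
    """Run body speculatively; take the lock after too many aborts.

    Returns (path, attempts), where path is "transactional" or "fallback".
    """
    attempts = 0
    while attempts < max_retries:
        attempts += 1
        status = try_transaction(body)
        if status == COMMITTED:
            return ("transactional", attempts)
        if not (status & ABORT_RETRY):
            break  # abort status carries no retry hint, so stop speculating
    # Fallback path: serialize on the real lock. A real RTM version must also
    # read the lock inside the transaction so that speculation aborts while
    # another thread holds it.
    with lock:
        body()
    return ("fallback", attempts)


def make_simulated_hardware(aborts, status):
    """Build a stand-in for hardware that aborts `aborts` times, then commits."""
    remaining = [aborts]

    def try_transaction(body):
        if remaining[0] > 0:
            remaining[0] -= 1
            return status  # aborted: body has no visible effect
        body()
        return COMMITTED

    return try_transaction
```

In this model the retry budget (`max_retries`) is the knob that a dynamic tuner could adjust at run time; how the framework itself chooses that value is not described in the abstract.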

Keywords (English)

Hardware Transactional Memory; Intel Transactional Synchronization Extensions; Dynamic Tuning; Retry Mechanism; Dynamic Binary Translation
