
Efficient and Retargetable Dynamic Binary Translation

Advisor: 鍾葉青 (Yeh-Ching Chung)

Abstract


Dynamic binary translation (DBT) is a core technology for many important applications, such as system virtualization, dynamic binary instrumentation, and security. However, several factors often impede its performance: (1) emulation overhead, (2) translation overhead, and (3) translated code quality. A further issue is retargetability: supporting guest applications from different instruction-set architectures (ISAs) on host machines that also have different ISAs, an important feature for system virtualization. An investigation of JIT compiler design decisions reveals that a lightweight, template-based code emitter is inadequate for generating optimal host instructions, whereas a heavyweight, aggressive optimizer imposes too much translation overhead on the running applications. Emulation overheads from region transitions, helper function invocation, and thread synchronization also impede building an efficient DBT system.

To address the dual issue of good translated code quality and low translation overhead, we take advantage of ubiquitous multicore platforms and use a multithreaded approach to implement DBT. Running the translator and the dynamic binary optimizer on different cores with different threads off-loads the overhead that DBT imposes on the target applications, and thus allows DBT to afford more sophisticated optimization techniques as well as retargetability. Using QEMU and LLVM as building blocks, we demonstrate with a multithreaded DBT prototype, called HQEMU (Hybrid-QEMU), that this framework can benefit both short-running and long-running applications.

A study of translation granularity reveals that considerable overhead is incurred from code region transitions. Two region formation approaches, HPM-based and software-based trace merging, are designed to improve existing trace selection algorithms. The novel HPM-based trace merging technique detects and merges separated traces based on information provided by the on-chip hardware performance monitor (HPM). The software-based region formation combines potentially separate traces early in program execution and is helpful for emulating short-running applications. Both approaches can eliminate region transition overhead and significantly improve overall code performance.

We also address the performance scalability of multithreaded applications across ISAs. We identify two major impediments to performance scalability in QEMU: (1) coarse-grained locks used to protect shared data structures, and (2) inefficient emulation of atomic instructions across ISAs. Two techniques are proposed to mitigate these problems: using IBTC to avoid frequent accesses to locks, and using lightweight memory transactions to emulate atomic instructions across ISAs.

Finally, a distributed DBT based on the client/server model is proposed for embedded systems: a thin translator runs on each thin client, and an aggressive optimizer on the server services optimization requests from the thin clients. This design successfully off-loads the optimization overhead of thin clients to the server. Moreover, the proposed asynchronous translation model can tolerate network disruption and hide the optimization overhead and network latency.
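The lock-avoidance idea behind IBTC can be illustrated with a minimal Python sketch. It assumes that IBTC refers to an indirect branch translation cache, and that each emulation thread keeps a small private cache in front of a shared, lock-protected table of translated code. The abstract does not describe the actual data structures, so the class names, the direct-mapped layout, and the `lock_acquisitions` counter are hypothetical and are not HQEMU's implementation.

```python
import threading


class GlobalCodeCache:
    """Hypothetical shared map from guest PC to translated code, guarded by one coarse lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._table = {}
        self.lock_acquisitions = 0  # how many times the coarse lock was taken

    def lookup_or_translate(self, guest_pc):
        with self._lock:
            self.lock_acquisitions += 1
            if guest_pc not in self._table:
                # Stand-in for real translation: store a label instead of host code.
                self._table[guest_pc] = f"host_code@{guest_pc:#x}"
            return self._table[guest_pc]


class ThreadLocalIBTC:
    """Hypothetical direct-mapped cache owned by a single emulation thread."""

    def __init__(self, shared, size=256):
        self._shared = shared
        self._size = size
        self._entries = [None] * size  # each slot holds (guest_pc, host_code) or None

    def resolve(self, guest_pc):
        slot = guest_pc % self._size
        entry = self._entries[slot]
        if entry is not None and entry[0] == guest_pc:
            return entry[1]  # hit: the shared lock is never touched
        # Miss: fall back to the lock-protected table, then refill the slot.
        host_code = self._shared.lookup_or_translate(guest_pc)
        self._entries[slot] = (guest_pc, host_code)
        return host_code


if __name__ == "__main__":
    shared = GlobalCodeCache()
    ibtc = ThreadLocalIBTC(shared)
    for _ in range(1000):
        ibtc.resolve(0x4000)  # the same indirect-branch target, resolved repeatedly
    print("lock acquisitions:", shared.lock_acquisitions)
```

In this sketch only a miss in the private cache reaches the shared lock, so a branch target that is resolved repeatedly costs one lock acquisition rather than one per execution.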

