Dynamic binary translation (DBT) is a core technology for many important applications such as system virtualization, dynamic binary instrumentation and security. However, several factors often impede its performance: (1) emulation overhead, (2) translation overhead, and (3) translated code quality. A further issue is retargetability, that is, supporting guest applications from different instruction-set architectures (ISAs) on host machines that also have different ISAs, which is an important feature for system virtualization. An investigation of JIT compiler design decisions reveals that a lightweight, template-based code emitter is inadequate for generating optimal host instructions, while a heavyweight, aggressive optimizer imposes too much translation overhead on the running applications. Emulation overheads from region transitions, helper function invocation and thread synchronization also impede building an efficient DBT system. To address the dual issue of good translated code quality and low translation overhead, we take advantage of ubiquitous multicore platforms and use a multi-threaded approach to implement DBT. By running the translator and the dynamic binary optimizer in different threads on different cores, we offload the overhead that DBT incurs on the target applications, which makes more sophisticated optimization techniques, as well as retargetability, affordable. Using QEMU and LLVM as our building blocks, we demonstrate with a multi-threaded DBT prototype, called HQEMU (Hybrid-QEMU), that this framework can benefit both short-running and long-running applications. A study of translation granularity reveals that code region transitions incur considerable overhead. Two region formation approaches, one based on hardware performance monitoring (HPM) and one software-based, are designed to improve existing trace selection algorithms through trace merging.
The novel HPM-based trace merging technique detects and merges separated traces based on information provided by the on-chip HPM. The software-based region formation combines potentially separate traces early in program execution, which helps when emulating short-running applications. Both approaches can eliminate region transition overhead and significantly improve overall code performance. We also address the performance scalability of multi-threaded applications across ISAs. We identify two major impediments to performance scalability in QEMU: (1) coarse-grained locks used to protect shared data structures, and (2) inefficient emulation of atomic instructions across ISAs. We propose two techniques to mitigate these problems: using an indirect branch translation cache (IBTC) to avoid frequent lock accesses, and using lightweight memory transactions to emulate atomic instructions across ISAs. Finally, we propose a distributed DBT based on the client/server model for embedded systems: a thin translator runs on each thin client, and an aggressive optimizer on the server services the optimization requests from the thin clients. This design successfully offloads the optimization overhead of the thin clients to the server. Moreover, the proposed asynchronous translation model can tolerate network disruption and hide both the optimization overhead and the network latency.
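
To illustrate how an IBTC can reduce lock accesses, the following is a minimal sketch, not HQEMU's implementation. It assumes a per-thread, direct-mapped table that is consulted before falling back to a shared code cache guarded by one coarse lock. All names (`SharedCodeCache`, `IBTC`, `resolve`, `lock_acquisitions`), the modulo hash, and the string stand-in for translated code are hypothetical and chosen only for illustration.

```python
import threading


class SharedCodeCache:
    """Shared guest-PC -> translated-code map behind one coarse lock (assumed layout)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._table = {}
        self.lock_acquisitions = 0  # counts how often the coarse lock is taken

    def lookup_or_translate(self, guest_pc):
        with self._lock:
            self.lock_acquisitions += 1
            if guest_pc not in self._table:
                # Stand-in for real translation: a label naming the guest PC.
                self._table[guest_pc] = f"host_code@{guest_pc:#x}"
            return self._table[guest_pc]


class IBTC:
    """Per-thread, direct-mapped cache of indirect-branch targets (hypothetical)."""

    def __init__(self, shared, size=256):
        self._shared = shared
        self._size = size
        self._entries = [None] * size  # each slot holds (guest_pc, host_code)

    def resolve(self, guest_pc):
        slot = guest_pc % self._size  # simple illustrative hash
        entry = self._entries[slot]
        if entry is not None and entry[0] == guest_pc:
            return entry[1]  # hit: resolved without touching the shared lock
        # Miss: take the slow path through the locked shared cache, then refill.
        host_code = self._shared.lookup_or_translate(guest_pc)
        self._entries[slot] = (guest_pc, host_code)
        return host_code
```

In this sketch each emulation thread would own its own `IBTC` instance, so repeated indirect branches to the same guest target are served from thread-private state, and only misses or slot conflicts reach the lock.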