QEMU internals

A series of posts about QEMU internals:

View the Project on GitHub airbus-seclab/qemu_blog

A deep dive into QEMU: The execution loop

This is the beginning of the second part of the blog post series. We will go deeper into QEMU internals this time to give insights to hack into core components. Let’s look at the virtual CPU execution loop.

The Big picture

In the very first blog post we explained how accelerators were started, through qemu_init_vcpu(). Suppose we run QEMU with a single threaded TCG and no hardware assisted virtualization backend, we will end up running our virtual CPU in a dedicated thread:

static void qemu_tcg_init_vcpu(CPUState *cpu)
{
...
    qemu_thread_create(cpu->thread, thread_name,
                       qemu_tcg_rr_cpu_thread_fn,
                       cpu, QEMU_THREAD_JOINABLE);
...
}

static void *qemu_tcg_rr_cpu_thread_fn(void *arg)
{
...
    while (1) {
        while (cpu && !cpu->queued_work_first && !cpu->exit_request) {

            qemu_clock_enable(QEMU_CLOCK_VIRTUAL, ...);
            
            if (cpu_can_run(cpu)) {
                r = tcg_cpu_exec(cpu);

                if (r == EXCP_DEBUG) {
                    cpu_handle_guest_debug(cpu);
                    break;
                }
            }
            cpu = CPU_NEXT(cpu);
        }
    }
}

This is a very simplified view but we can see the big picture. If the vCPU is in a runnable state then we execute instructions via the TCG. We will detail how it handles asynchronous events such as interrupts and exceptions, but we can already see there is a special handling for EXCP_DEBUG in the previous code excerpt.

There is nothing architecture dependent at this level, we are still in a generic part of the QEMU engine. The debug exception special treament here is usually triggered by underlying architecture dependent events (ie. breakpoints) and require particular attention from QEMU to be forwarded to other subsystems such as a GDB server stub out of the context of the VM. We will also cover breakpoints handling in a dedicated post.

Entering the TCG execution loop

The interesting function to start with is tcg_cpu_exec and more specifically the cpu_exec one. We will cover (definitely) in a future blog post the internals of the TCG engine, but for now we only give an overview of the VM execution. Simplified, it looks like:

int cpu_exec(CPUState *cpu)
{
    cc->cpu_exec_enter(cpu);

    /* prepare setjmp context for exception handling */
    sigsetjmp(cpu->jmp_env, 0);

    /* if an exception is pending, we execute it here */
    while (!cpu_handle_exception(cpu, &ret)) {
        while (!cpu_handle_interrupt(cpu, &last_tb)) {
            tb = tb_find(cpu, last_tb, tb_exit, cflags);
            cpu_loop_exec_tb(cpu, tb, &last_tb, &tb_exit);
        }
    }

    cc->cpu_exec_exit(cpu);
}

QEMU makes use of setjmp/longjmp C library feature to implement exception handling. This allows to get out of deep and complex TCG translation functions whenever an event has been triggered, such as a CPU interrupt or exception. The corresponding functions to exit the CPU execution loop are cpu_loop_exit_xxx:

void cpu_loop_exit(CPUState *cpu)
{
    /* Undo the setting in cpu_tb_exec.  */
    cpu->can_do_io = 1;
    siglongjmp(cpu->jmp_env, 1);
}

The vCPU thread code execution goes back to the point it called sigsetjmp. Then QEMU tries to deal with the event as soon as possible. But if there is no pending one, it executes the so-called Translated Blocks (TB).

A primer on Translated Blocks

The TCG engine is a JIT compiler, this means it dynamically translates the target architecture instructions set to the host architecture instruction set. For those not familiar with the concept please refer to this and have a look at an introduction to the QEMU TCG engine here. The translation is done in two steps:

QEMU first tries to look for existing TBs, with tb_find. If no one exists for the current location, it generates a new one with tb_gen_code:

static inline TranslationBlock *tb_find(CPUState *cpu,
                                        TranslationBlock *last_tb,
                                        int tb_exit, uint32_t cf_mask)
{
...
    tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, cf_mask);
    if (tb == NULL) {
        tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
...
}

When a TB is available, QEMU runs it with cpu_loop_exec_tb which in short calls cpu_tb_exec and then tcg_qemu_tb_exec. At this point the target (VM) code has been translated to host code, QEMU can run it directly on the host CPU. If we look at the definition of this last function:

#define tcg_qemu_tb_exec(env, tb_ptr) \
    ((uintptr_t (*)(void *, void *))tcg_ctx->code_gen_prologue)(env, tb_ptr)

The translation buffer receiving generated opcodes is casted to a function pointer and called with arguments.

In the TCG dedicated blog post, we will see the TCG strategy in detail and present various helpers for system instructions, memory access and things which can’t be translated from an architecture to the other.

Back to events handling

When an hardware interrupt (IRQ) or exception is raised, QEMU helps the vCPU redirects execution to the appropriate handler. These mechanisms are very specific to the target architecture, consequently hardly translatable. The answer comes from helpers which are tiny wrappers written in C, built with QEMU for a target architecture and natively callable on the host architecture directly from the translated blocks. Again, we will cover them in details later.

For instance for the PPC target (VM), the helpers backend to inform QEMU that an exception is being raised is located into excp_helper.c:

void raise_exception(CPUPPCState *env, uint32_t exception)
{
    raise_exception_err_ra(env, exception, 0, 0);
}

void raise_exception_err_ra(CPUPPCState *env, uint32_t exception,
                            uint32_t error_code, uintptr_t raddr)
{
    CPUState *cs = env_cpu(env);

    cs->exception_index = exception;
    env->error_code = error_code;
    cpu_loop_exit_restore(cs, raddr);
}

Notice the call to cpu_loop_exit_restore to get back to the main cpu loop execution context and enter cpu_handle_exception:

static inline bool cpu_handle_exception(CPUState *cpu, int *ret)
{
    if (cpu->exception_index >= EXCP_INTERRUPT) {
        /* exit request from the cpu execution loop */
        *ret = cpu->exception_index;
        if (*ret == EXCP_DEBUG) {
            cpu_handle_debug_exception(cpu);
        }
        cpu->exception_index = -1;
        return true;
    } else {
...
        /* deal with exception/interrupt */
        CPUClass *cc = CPU_GET_CLASS(cpu);
        cc->do_interrupt(cpu);
...
    }
}

There is once again a specific handling on debug exceptions, but in essence if there is a pending exception in cpu->exception_index it will be managed by cc->do_interrupt which is architecture dependent.

The exception_index field can hold the real hardware exception but is also used for meta information (QEMU debug event, halt instruction, VMEXIT for nested virtualization on x86).

The CPUClass type has several function pointers to be initialized by specific target code. For an i386 target we may find it at x86_cpu_common_class_init:

static void x86_cpu_common_class_init(ObjectClass *oc, void *data)
{
    X86CPUClass *xcc = X86_CPU_CLASS(oc);
    CPUClass *cc = CPU_CLASS(oc);
    DeviceClass *dc = DEVICE_CLASS(oc);
...
#ifdef CONFIG_TCG
    cc->do_interrupt = x86_cpu_do_interrupt;
    cc->cpu_exec_interrupt = x86_cpu_exec_interrupt;
#endif
    cc->dump_state = x86_cpu_dump_state;
    cc->get_crash_info = x86_cpu_get_crash_info;
    cc->set_pc = x86_cpu_set_pc;
    cc->synchronize_from_tb = x86_cpu_synchronize_from_tb;
    cc->gdb_read_register = x86_cpu_gdb_read_register;
    cc->gdb_write_register = x86_cpu_gdb_write_register;
    cc->get_arch_id = x86_cpu_get_arch_id;
    cc->get_paging_enabled = x86_cpu_get_paging_enabled;
...
}

The underlying x86_cpu_do_interrupt is a place holder for various cases (userland, system emulation or nested virtualization). In basic system emulation mode it will call do_interrupt_all which implements low level x86 specific interrupt handling.