A series of posts about QEMU internals:
This blog post details how the different running states of the virtual machine (guest) are handled internally.
As you may imagine, a virtual machine, and especially its virtual CPU, goes through different running states during its lifetime:
/* from QAPI types */
typedef enum RunState {
    RUN_STATE_DEBUG,
    RUN_STATE_INMIGRATE,
    RUN_STATE_INTERNAL_ERROR,
    RUN_STATE_IO_ERROR,
    RUN_STATE_PAUSED,
    RUN_STATE_POSTMIGRATE,
    RUN_STATE_PRELAUNCH,
    RUN_STATE_FINISH_MIGRATE,
    RUN_STATE_RESTORE_VM,
    RUN_STATE_RUNNING,
    RUN_STATE_SAVE_VM,
    RUN_STATE_SHUTDOWN,
    RUN_STATE_SUSPENDED,
    RUN_STATE_WATCHDOG,
    RUN_STATE_GUEST_PANICKED,
    RUN_STATE_COLO,
    RUN_STATE_PRECONFIG,
    RUN_STATE__MAX,
} RunState;
These RunState values are handled in the QEMU main_loop, which executes in the QEMU startup thread, not in the threads dedicated to virtual CPUs. The function simply waits for event requests, which are processed in main_loop_should_exit. These requests are generally raised from the virtual CPU threads:
static void main_loop(void)
{
    ...
    while (!main_loop_should_exit()) {
        main_loop_wait(false);
    }
}

static bool main_loop_should_exit(void)
{
    RunState r;
    ...
    if (qemu_debug_requested()) {
        vm_stop(RUN_STATE_DEBUG);
    }
    if (qemu_suspend_requested()) {
        qemu_system_suspend();
    }
    ...
    if (qemu_powerdown_requested()) {
        qemu_system_powerdown();
    }
    if (qemu_vmstop_requested(&r)) {
        vm_stop(r);
    }
    ...
}
If you remember the breakpoints handling post, while handling the raised debug exception, QEMU prepared a debug request for the main loop from cpu_handle_guest_debug:
static void cpu_handle_guest_debug(CPUState *cpu)
{
    gdb_set_stop_cpu(cpu);
    qemu_system_debug_request();
    cpu->stopped = true;
}
The qemu_system_debug_request function sets a flag and triggers an event notification to wake up the main loop:
void qemu_system_debug_request(void)
{
    debug_requested = 1;
    qemu_notify_event();
}
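The request side of this handshake can be modelled outside of QEMU. The sketch below is a hypothetical single-file model, not QEMU code: every rq_ name is invented for illustration, a counter stands in for qemu_notify_event, and it assumes that checking the request also clears it, as a one-shot request flag typically would.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical two-state model of the VM run state. */
typedef enum { RQ_RUNNING, RQ_DEBUG } RqState;

static atomic_int rq_debug_requested;  /* set by the requester */
static atomic_int rq_wakeups;          /* stand-in for qemu_notify_event() */
static RqState rq_state = RQ_RUNNING;

/* Requester side: may be called from any thread. It only records the
 * request and wakes the main loop; it never changes the run state. */
void rq_system_debug_request(void)
{
    atomic_store(&rq_debug_requested, 1);
    atomic_fetch_add(&rq_wakeups, 1);
}

/* Main loop side: test-and-clear, so one request is handled once. */
bool rq_debug_requested_and_clear(void)
{
    return atomic_exchange(&rq_debug_requested, 0) != 0;
}

/* One pass of the exit check: the state transition happens here,
 * in the main loop, and nowhere else. */
RqState rq_main_loop_iteration(void)
{
    if (rq_debug_requested_and_clear()) {
        rq_state = RQ_DEBUG;
    }
    return rq_state;
}

int rq_wakeup_count(void)
{
    return atomic_load(&rq_wakeups);
}
```

Calling rq_system_debug_request and then rq_main_loop_iteration moves the model to RQ_DEBUG; a second check finds the request already consumed.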
Back in the main loop, QEMU checks for a debug event with qemu_debug_requested and, in that case, changes the virtual machine running state to the one matching the event with vm_stop:
static bool main_loop_should_exit(void)
{
    ...
    if (qemu_debug_requested()) {
        vm_stop(RUN_STATE_DEBUG);
    }
    ...
}

int vm_stop(RunState state)
{
    if (qemu_in_vcpu_thread()) {
        qemu_system_vmstop_request_prepare();
        qemu_system_vmstop_request(state);
        /*
         * FIXME: should not return to device code in case
         * vm_stop() has been requested.
         */
        cpu_stop_current();
        return 0;
    }
    return do_vm_stop(state, true);
}
The vm_stop function checks where it is called from. If, for instance, a virtual CPU calls it while emulating an instruction, a stop request is raised instead of performing the run state transition directly, because state transitions only happen in the QEMU main loop thread.
Naturally, the opposite service exists to start or resume a VM: vm_start.
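The "request instead of transition" rule can be illustrated with a small stand-alone model. This is only a sketch under simplifying assumptions: it is single-threaded, a boolean flag replaces the real thread identity check done by qemu_in_vcpu_thread, and all vs_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { VS_RUNNING, VS_PAUSED, VS_NONE } VsState;

static VsState vs_current = VS_RUNNING;
static VsState vs_pending = VS_NONE;   /* requested, not yet applied */
static bool vs_in_vcpu_context;        /* stand-in for qemu_in_vcpu_thread() */

/* The real transition: only legal from the main loop context. */
static int vs_do_vm_stop(VsState state)
{
    assert(!vs_in_vcpu_context);
    if (vs_current == VS_RUNNING) {
        vs_current = state;
    }
    return 0;
}

/* Entry point usable from anywhere: defers when called from a vCPU. */
int vs_vm_stop(VsState state)
{
    if (vs_in_vcpu_context) {
        vs_pending = state;            /* only record the request */
        return 0;
    }
    return vs_do_vm_stop(state);
}

/* Model of a vCPU emulating one instruction. */
void vs_vcpu_run(void (*emulate)(void))
{
    vs_in_vcpu_context = true;
    emulate();
    vs_in_vcpu_context = false;
}

/* Model of the main loop picking up a pending stop request. */
void vs_main_loop_check(void)
{
    if (vs_pending != VS_NONE) {
        VsState r = vs_pending;
        vs_pending = VS_NONE;
        vs_vm_stop(r);
    }
}

VsState vs_state(void) { return vs_current; }
bool vs_stop_pending(void) { return vs_pending != VS_NONE; }

/* Hypothetical instruction whose emulation asks for a pause. */
void vs_insn_requesting_pause(void) { vs_vm_stop(VS_PAUSED); }
```

After vs_vcpu_run(vs_insn_requesting_pause) the model is still VS_RUNNING with a request pending; the state only changes once vs_main_loop_check runs.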
The real state transition service is do_vm_stop, and we can see that it implements all the low-level mechanics:
static int do_vm_stop(RunState state, bool send_stop)
{
    int ret = 0;

    if (runstate_is_running()) {
        cpu_disable_ticks();
        pause_all_vcpus();
        runstate_set(state);
        vm_state_notify(0, state);
        if (send_stop) {
            qapi_event_send_stop();
        }
    }

    bdrv_drain_all();
    ret = bdrv_flush_all();

    return ret;
}
QEMU stops tick counting, the vCPUs and the associated virtual clocks. It also calls a special service, vm_state_notify, with the new running state of the VM as argument. Interestingly, we can register callbacks to be notified of every VM running state change with qemu_add_vm_change_state_handler. The code below shows its priority-aware variant, qemu_add_vm_change_state_handler_prio:
VMChangeStateEntry *qemu_add_vm_change_state_handler_prio(
        VMChangeStateHandler *cb, void *opaque, int priority)
{
    VMChangeStateEntry *e;
    VMChangeStateEntry *other;

    e = g_malloc0(sizeof(*e));
    e->cb = cb;
    e->opaque = opaque;
    e->priority = priority;

    /* Keep list sorted in ascending priority order */
    QTAILQ_FOREACH(other, &vm_change_state_head, entries) {
        if (priority < other->priority) {
            QTAILQ_INSERT_BEFORE(other, e, entries);
            return e;
        }
    }

    QTAILQ_INSERT_TAIL(&vm_change_state_head, e, entries);
    return e;
}

void vm_state_notify(int running, RunState state)
{
    VMChangeStateEntry *e, *next;

    trace_vm_state_notify(running, state, RunState_str(state));

    if (running) {
        QTAILQ_FOREACH_SAFE(e, &vm_change_state_head, entries, next) {
            e->cb(e->opaque, running, state);
        }
    } else {
        QTAILQ_FOREACH_REVERSE_SAFE(e, &vm_change_state_head, entries, next) {
            e->cb(e->opaque, running, state);
        }
    }
}
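The ordering rule in these two functions is worth spelling out: handlers are kept sorted by ascending priority, called in that order when the VM starts running, and called in reverse order when it stops. The sketch below reproduces only that rule with a hand-rolled doubly linked list instead of QEMU's QTAILQ macros; all cs_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void CsHandler(void *opaque, int running);

typedef struct CsEntry {
    CsHandler *cb;
    void *opaque;
    int priority;
    struct CsEntry *prev, *next;
} CsEntry;

static CsEntry *cs_head, *cs_tail;

/* Insert keeping the list sorted by ascending priority. Entries with
 * equal priority keep their registration order. */
CsEntry *cs_add_handler(CsHandler *cb, void *opaque, int priority)
{
    CsEntry *e = calloc(1, sizeof(*e));
    CsEntry *other;

    assert(e != NULL);
    e->cb = cb;
    e->opaque = opaque;
    e->priority = priority;

    /* Find the first entry with a strictly higher priority. */
    for (other = cs_head; other != NULL; other = other->next) {
        if (priority < other->priority) {
            break;
        }
    }
    if (other == NULL) {               /* append at the tail */
        e->prev = cs_tail;
        if (cs_tail) { cs_tail->next = e; } else { cs_head = e; }
        cs_tail = e;
    } else {                           /* insert before 'other' */
        e->next = other;
        e->prev = other->prev;
        if (other->prev) { other->prev->next = e; } else { cs_head = e; }
        other->prev = e;
    }
    return e;
}

/* Ascending order when starting, descending order when stopping. */
void cs_notify(int running)
{
    CsEntry *e;

    if (running) {
        for (e = cs_head; e != NULL; e = e->next) { e->cb(e->opaque, running); }
    } else {
        for (e = cs_tail; e != NULL; e = e->prev) { e->cb(e->opaque, running); }
    }
}

/* Demo handler: records the integer its opaque pointer refers to. */
static int cs_log[16];
static int cs_log_len;

void cs_record(void *opaque, int running)
{
    (void)running;
    assert(cs_log_len < 16);
    cs_log[cs_log_len++] = *(int *)opaque;
}
```

The mirrored order behaves like nested setup and teardown: whatever is brought up first on start is torn down last on stop.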
This is extremely convenient. As an example, the GDB server stub registers a callback to intercept debug events and check for GDB client breakpoints:
int gdbserver_start(const char *device)
{
    ...
    qemu_add_vm_change_state_handler(gdb_vm_state_change, NULL);
    ...
}

static void gdb_vm_state_change(void *opaque, int running, RunState state)
{
    ...
    switch (state) {
    case RUN_STATE_DEBUG:
        ...
        ret = GDB_SIGNAL_TRAP;
        break;
    ...
    }
    ...
}
We have seen that we can't change the running state from everywhere; we should rather request a change.
This is especially true for some events. Suppose you implement a clock device that uses QEMU internal timers. Virtual clock timer expiration is processed in the QEMU main loop, and the associated callbacks are called from that thread.
In that context, we should not try to stop the VM directly, because we have seen that the do_vm_stop service will try to disable all vCPU-associated virtual clocks. Disabling the clocks waits for the related timer lists to stop, as the code documentation states. Since our callback is itself running from one of those timer lists, this leads to a deadlock.
The correct way to proceed is to request a state transition from the timer callback itself:
void my_user_timeout_cb(void *opaque)
{
    debug("--> vm timeout()\n");
    qemu_system_vmstop_request_prepare();
    qemu_system_vmstop_request(RUN_STATE_PAUSED);
}
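To see why the request form is safe while a direct stop is not, here is a hypothetical stand-alone model of the situation described above. It is a sketch, not QEMU code: all tm_ names are invented, and where real code would block forever waiting for the timer dispatch to finish, the model returns -1 so the problem can be observed.

```c
#include <assert.h>
#include <stdbool.h>

typedef void TmCallback(void);

static bool tm_dispatching;        /* timer callbacks are currently running */
static bool tm_stop_requested;
static bool tm_vm_running = true;
static int tm_bad_result;

/* Disabling the clock must wait for timer dispatch to finish. Called
 * from inside a timer callback, that wait could never end. */
int tm_clock_disable(void)
{
    if (tm_dispatching) {
        return -1;                 /* real code would deadlock here */
    }
    return 0;
}

/* Direct stop: needs the clocks disabled first. */
int tm_do_vm_stop(void)
{
    if (tm_clock_disable() != 0) {
        return -1;
    }
    tm_vm_running = false;
    return 0;
}

/* Request form: just leaves a note for the main loop. */
void tm_vmstop_request(void) { tm_stop_requested = true; }

/* Main loop runs expired timers, then handles pending requests. */
int tm_main_loop_iteration(TmCallback *expired)
{
    assert(!tm_dispatching);
    tm_dispatching = true;
    expired();
    tm_dispatching = false;

    if (tm_stop_requested) {
        tm_stop_requested = false;
        return tm_do_vm_stop();    /* dispatch is over: safe now */
    }
    return 0;
}

/* Wrong: tries to stop the VM from inside the timer callback. */
void tm_bad_cb(void) { tm_bad_result = tm_do_vm_stop(); }

/* Right: only requests the transition. */
void tm_good_cb(void) { tm_vmstop_request(); }

bool tm_is_running(void) { return tm_vm_running; }
int tm_last_bad_result(void) { return tm_bad_result; }
```

With tm_bad_cb the stop attempt fails inside the callback and the VM keeps running; with tm_good_cb the stop is applied right after the dispatch ends.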
And in your VM change state handler, deal with that state asynchronously thanks to async_run_on_cpu:
void my_vm_state_change(void *opaque, int running, RunState state)
{
    debug("vm state %s\n", RunState_str(state));
    if (state == RUN_STATE_PAUSED) {
        /* cpu and arg are assumed to come from your own device state (not shown) */
        async_run_on_cpu(cpu, my_async_timeout_vm, RUN_ON_CPU_HOST_PTR(arg));
        return;
    }
}
This way, the my_async_timeout_vm function is added to the given cpu work queue as a new qemu_work_item and will be called outside of the main loop context. At that point, it is safe to consider your VM to be in the requested state (PAUSED) and, for instance, to resume it with vm_start.
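The deferred-work idea can be sketched as a plain FIFO of function pointers attached to a CPU object. This is a simplified, hypothetical model: the wq_ names are invented, there is no locking or cross-thread wakeup as a real implementation would need, and the drain function stands in for whatever the vCPU thread does when it processes its queued work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void WqFunc(void *data);

typedef struct WqItem {
    WqFunc *func;
    void *data;
    struct WqItem *next;
} WqItem;

typedef struct WqCPU {
    WqItem *first, *last;          /* FIFO of pending work items */
} WqCPU;

/* Queue work for later: returns immediately, nothing runs here. */
void wq_async_run_on_cpu(WqCPU *cpu, WqFunc *func, void *data)
{
    WqItem *wi = calloc(1, sizeof(*wi));

    assert(wi != NULL);
    wi->func = func;
    wi->data = data;
    if (cpu->last) { cpu->last->next = wi; } else { cpu->first = wi; }
    cpu->last = wi;
}

/* Run by the CPU side later on: drains the queue in FIFO order and
 * returns how many items were executed. */
int wq_process_queued_work(WqCPU *cpu)
{
    int done = 0;

    while (cpu->first != NULL) {
        WqItem *wi = cpu->first;

        cpu->first = wi->next;
        if (cpu->first == NULL) {
            cpu->last = NULL;
        }
        wi->func(wi->data);
        free(wi);
        done++;
    }
    return done;
}

/* Demo work function: counts how many times it ran. */
void wq_count_call(void *data) { *(int *)data += 1; }
```

The caller of wq_async_run_on_cpu never executes the work itself, which is what lets a state change handler return quickly and leave the follow-up action to another context.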