rate inside the limit. This leads to more steady reading performance during
live migration and can aid in improving large guest responsiveness.
-Postcopy
-========
-
-'Postcopy' migration is a way to deal with migrations that refuse to converge
-(or take too long to converge) its plus side is that there is an upper bound on
-the amount of migration traffic and time it takes, the down side is that during
-the postcopy phase, a failure of *either* side causes the guest to be lost.
-
-In postcopy the destination CPUs are started before all the memory has been
-transferred, and accesses to pages that are yet to be transferred cause
-a fault that's translated by QEMU into a request to the source QEMU.
-
-Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
-doesn't finish in a given time the switch is made to postcopy.
-
-Enabling postcopy
------------------
-
-To enable postcopy, issue this command on the monitor (both source and
-destination) prior to the start of migration:
-
-``migrate_set_capability postcopy-ram on``
-
-The normal commands are then used to start a migration, which is still
-started in precopy mode. Issuing:
-
-``migrate_start_postcopy``
-
-will now cause the transition from precopy to postcopy.
-It can be issued immediately after migration is started or any
-time later on. Issuing it after the end of a migration is harmless.
-
-Blocktime is a postcopy live migration metric, intended to show how
-long the vCPU was in state of interruptible sleep due to pagefault.
-That metric is calculated both for all vCPUs as overlapped value, and
-separately for each vCPU. These values are calculated on destination
-side. To enable postcopy blocktime calculation, enter following
-command on destination monitor:
-
-``migrate_set_capability postcopy-blocktime on``
-
-Postcopy blocktime can be retrieved by query-migrate qmp command.
-postcopy-blocktime value of qmp command will show overlapped blocking
-time for all vCPU, postcopy-vcpu-blocktime will show list of blocking
-time per vCPU.
-
-.. note::
- During the postcopy phase, the bandwidth limits set using
- ``migrate_set_parameter`` is ignored (to avoid delaying requested pages that
- the destination is waiting for).
-
-Postcopy device transfer
-------------------------
-
-Loading of device data may cause the device emulation to access guest RAM
-that may trigger faults that have to be resolved by the source, as such
-the migration stream has to be able to respond with page data *during* the
-device load, and hence the device data has to be read from the stream completely
-before the device load begins to free the stream up. This is achieved by
-'packaging' the device data into a blob that's read in one go.
-
-Source behaviour
-----------------
-
-Until postcopy is entered the migration stream is identical to normal
-precopy, except for the addition of a 'postcopy advise' command at
-the beginning, to tell the destination that postcopy might happen.
-When postcopy starts the source sends the page discard data and then
-forms the 'package' containing:
-
- - Command: 'postcopy listen'
- - The device state
-
- A series of sections, identical to the precopy streams device state stream
- containing everything except postcopiable devices (i.e. RAM)
- - Command: 'postcopy run'
-
-The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
-contents are formatted in the same way as the main migration stream.
-
-During postcopy the source scans the list of dirty pages and sends them
-to the destination without being requested (in much the same way as precopy),
-however when a page request is received from the destination, the dirty page
-scanning restarts from the requested location. This causes requested pages
-to be sent quickly, and also causes pages directly after the requested page
-to be sent quickly in the hope that those pages are likely to be used
-by the destination soon.
-
-Destination behaviour
----------------------
-
-Initially the destination looks the same as precopy, with a single thread
-reading the migration stream; the 'postcopy advise' and 'discard' commands
-are processed to change the way RAM is managed, but don't affect the stream
-processing.
-
-::
-
- ------------------------------------------------------------------------------
- 1 2 3 4 5 6 7
- main -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN )
- thread | |
- | (page request)
- | \___
- v \
- listen thread: --- page -- page -- page -- page -- page --
-
- a b c
- ------------------------------------------------------------------------------
-
-- On receipt of ``CMD_PACKAGED`` (1)
-
- All the data associated with the package - the ( ... ) section in the diagram -
- is read into memory, and the main thread recurses into qemu_loadvm_state_main
- to process the contents of the package (2) which contains commands (3,6) and
- devices (4...)
-
-- On receipt of 'postcopy listen' - 3 -(i.e. the 1st command in the package)
-
- a new thread (a) is started that takes over servicing the migration stream,
- while the main thread carries on loading the package. It loads normal
- background page data (b) but if during a device load a fault happens (5)
- the returned page (c) is loaded by the listen thread allowing the main
- threads device load to carry on.
-
-- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)
-
- letting the destination CPUs start running. At the end of the
- ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
- is no longer used by migration, while the listen thread carries on servicing
- page data until the end of migration.
-
-Postcopy Recovery
------------------
-
-Comparing to precopy, postcopy is special on error handlings. When any
-error happens (in this case, mostly network errors), QEMU cannot easily
-fail a migration because VM data resides in both source and destination
-QEMU instances. On the other hand, when issue happens QEMU on both sides
-will go into a paused state. It'll need a recovery phase to continue a
-paused postcopy migration.
-
-The recovery phase normally contains a few steps:
-
- - When network issue occurs, both QEMU will go into PAUSED state
-
- - When the network is recovered (or a new network is provided), the admin
- can setup the new channel for migration using QMP command
- 'migrate-recover' on destination node, preparing for a resume.
-
- - On source host, the admin can continue the interrupted postcopy
- migration using QMP command 'migrate' with resume=true flag set.
-
- - After the connection is re-established, QEMU will continue the postcopy
- migration on both sides.
-
-During a paused postcopy migration, the VM can logically still continue
-running, and it will not be impacted from any page access to pages that
-were already migrated to destination VM before the interruption happens.
-However, if any of the missing pages got accessed on destination VM, the VM
-thread will be halted waiting for the page to be migrated, it means it can
-be halted until the recovery is complete.
-
-The impact of accessing missing pages can be relevant to different
-configurations of the guest. For example, when with async page fault
-enabled, logically the guest can proactively schedule out the threads
-accessing missing pages.
-
-Postcopy states
----------------
-
-Postcopy moves through a series of states (see postcopy_state) from
-ADVISE->DISCARD->LISTEN->RUNNING->END
-
- - Advise
-
- Set at the start of migration if postcopy is enabled, even
- if it hasn't had the start command; here the destination
- checks that its OS has the support needed for postcopy, and performs
- setup to ensure the RAM mappings are suitable for later postcopy.
- The destination will fail early in migration at this point if the
- required OS support is not present.
- (Triggered by reception of POSTCOPY_ADVISE command)
-
- - Discard
-
- Entered on receipt of the first 'discard' command; prior to
- the first Discard being performed, hugepages are switched off
- (using madvise) to ensure that no new huge pages are created
- during the postcopy phase, and to cause any huge pages that
- have discards on them to be broken.
-
- - Listen
-
- The first command in the package, POSTCOPY_LISTEN, switches
- the destination state to Listen, and starts a new thread
- (the 'listen thread') which takes over the job of receiving
- pages off the migration stream, while the main thread carries
- on processing the blob. With this thread able to process page
- reception, the destination now 'sensitises' the RAM to detect
- any access to missing pages (on Linux using the 'userfault'
- system).
-
- - Running
-
- POSTCOPY_RUN causes the destination to synchronise all
- state and start the CPUs and IO devices running. The main
- thread now finishes processing the migration package and
- now carries on as it would for normal precopy migration
- (although it can't do the cleanup it would do as it
- finishes a normal migration).
-
- - Paused
-
- Postcopy can run into a paused state (normally on both sides when
- happens), where all threads will be temporarily halted mostly due to
- network errors. When reaching paused state, migration will make sure
- the qemu binary on both sides maintain the data without corrupting
- the VM. To continue the migration, the admin needs to fix the
- migration channel using the QMP command 'migrate-recover' on the
- destination node, then resume the migration using QMP command 'migrate'
- again on source node, with resume=true flag set.
-
- - End
-
- The listen thread can now quit, and perform the cleanup of migration
- state, the migration is now complete.
-
-Source side page map
---------------------
-
-The 'migration bitmap' in postcopy is basically the same as in the precopy,
-where each of the bit to indicate that page is 'dirty' - i.e. needs
-sending. During the precopy phase this is updated as the CPU dirties
-pages, however during postcopy the CPUs are stopped and nothing should
-dirty anything any more. Instead, dirty bits are cleared when the relevant
-pages are sent during postcopy.
-
-Postcopy with hugepages
------------------------
-
-Postcopy now works with hugetlbfs backed memory:
-
- a) The linux kernel on the destination must support userfault on hugepages.
- b) The huge-page configuration on the source and destination VMs must be
- identical; i.e. RAMBlocks on both sides must use the same page size.
- c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
- RAM if it doesn't have enough hugepages, triggering (b) to fail.
- Using ``-mem-prealloc`` enforces the allocation using hugepages.
- d) Care should be taken with the size of hugepage used; postcopy with 2MB
- hugepages works well, however 1GB hugepages are likely to be problematic
- since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
- and until the full page is transferred the destination thread is blocked.
-
-Postcopy with shared memory
----------------------------
-
-Postcopy migration with shared memory needs explicit support from the other
-processes that share memory and from QEMU. There are restrictions on the type of
-memory that userfault can support shared.
-
-The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
-(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
-for hugetlbfs which may be a problem in some configurations).
-
-The vhost-user code in QEMU supports clients that have Postcopy support,
-and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
-to support postcopy.
-
-The client needs to open a userfaultfd and register the areas
-of memory that it maps with userfault. The client must then pass the
-userfaultfd back to QEMU together with a mapping table that allows
-fault addresses in the clients address space to be converted back to
-RAMBlock/offsets. The client's userfaultfd is added to the postcopy
-fault-thread and page requests are made on behalf of the client by QEMU.
-QEMU performs 'wake' operations on the client's userfaultfd to allow it
-to continue after a page has arrived.
-
-.. note::
- There are two future improvements that would be nice:
- a) Some way to make QEMU ignorant of the addresses in the clients
- address space
- b) Avoiding the need for QEMU to perform ufd-wake calls after the
- pages have arrived
-
-Retro-fitting postcopy to existing clients is possible:
- a) A mechanism is needed for the registration with userfault as above,
- and the registration needs to be coordinated with the phases of
- postcopy. In vhost-user extra messages are added to the existing
- control channel.
- b) Any thread that can block due to guest memory accesses must be
- identified and the implication understood; for example if the
- guest memory access is made while holding a lock then all other
- threads waiting for that lock will also be blocked.
-
-Postcopy Preemption Mode
-------------------------
-
-Postcopy preempt is a new capability introduced in 8.0 QEMU release, it
-allows urgent pages (those got page fault requested from destination QEMU
-explicitly) to be sent in a separate preempt channel, rather than queued in
-the background migration channel. Anyone who cares about latencies of page
-faults during a postcopy migration should enable this feature. By default,
-it's not enabled.
-
--- /dev/null
+Postcopy
+========
+
+'Postcopy' migration is a way to deal with migrations that refuse to converge
+(or take too long to converge). Its plus side is that there is an upper bound
+on the amount of migration traffic and the time it takes; the downside is that
+during the postcopy phase, a failure of *either* side causes the guest to be
+lost.
+
+In postcopy the destination CPUs are started before all the memory has been
+transferred, and accesses to pages that are yet to be transferred cause
+a fault that's translated by QEMU into a request to the source QEMU.
+
+Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
+doesn't finish in a given time the switch is made to postcopy.
+
+Enabling postcopy
+-----------------
+
+To enable postcopy, issue this command on the monitor (both source and
+destination) prior to the start of migration:
+
+``migrate_set_capability postcopy-ram on``
+
+The normal commands are then used to start a migration, which is still
+started in precopy mode. Issuing:
+
+``migrate_start_postcopy``
+
+will now cause the transition from precopy to postcopy.
+It can be issued immediately after migration is started or any
+time later on. Issuing it after the end of a migration is harmless.
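+
+For reference, a rough QMP equivalent of this flow (the destination URI is a
+placeholder) is::
+
+  -> { "execute": "migrate-set-capabilities",
+       "arguments": { "capabilities": [
+         { "capability": "postcopy-ram", "state": true } ] } }
+  <- { "return": {} }
+  -> { "execute": "migrate",
+       "arguments": { "uri": "tcp:dest.example.com:4444" } }
+  <- { "return": {} }
+  -> { "execute": "migrate-start-postcopy" }
+  <- { "return": {} }
+
+(As with the HMP command above, the capability must be set on both source
+and destination.)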
+
+Blocktime is a postcopy live migration metric, intended to show how
+long the vCPU was in a state of interruptible sleep due to a page fault.
+The metric is calculated both for all vCPUs as an overlapped value, and
+separately for each vCPU. These values are calculated on the destination
+side. To enable postcopy blocktime calculation, enter the following
+command on the destination monitor:
+
+``migrate_set_capability postcopy-blocktime on``
+
+Postcopy blocktime can be retrieved by the ``query-migrate`` QMP command.
+The ``postcopy-blocktime`` value will show the overlapped blocking time
+for all vCPUs, while ``postcopy-vcpu-blocktime`` will show the list of
+blocking times per vCPU.
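+
+For example, a response might include fields like this (values are
+illustrative and other fields are omitted)::
+
+  -> { "execute": "query-migrate" }
+  <- { "return": { "status": "postcopy-active",
+                   "postcopy-blocktime": 3,
+                   "postcopy-vcpu-blocktime": [ 3, 0 ],
+                   ... } }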
+
+.. note::
+   During the postcopy phase, the bandwidth limits set using
+   ``migrate_set_parameter`` are ignored (to avoid delaying requested pages
+   that the destination is waiting for).
+
+Postcopy device transfer
+------------------------
+
+Loading of device data may cause the device emulation to access guest RAM,
+which may trigger faults that have to be resolved by the source. As such,
+the migration stream has to be able to respond with page data *during* the
+device load, and hence the device data has to be read from the stream
+completely before the device load begins, to free the stream up. This is
+achieved by 'packaging' the device data into a blob that's read in one go.
+
+Source behaviour
+----------------
+
+Until postcopy is entered the migration stream is identical to normal
+precopy, except for the addition of a 'postcopy advise' command at
+the beginning, to tell the destination that postcopy might happen.
+When postcopy starts the source sends the page discard data and then
+forms the 'package' containing:
+
+ - Command: 'postcopy listen'
+ - The device state
+
+     A series of sections, identical to the precopy stream's device state
+     stream, containing everything except postcopiable devices (i.e. RAM)
+ - Command: 'postcopy run'
+
+The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
+contents are formatted in the same way as the main migration stream.
+
+During postcopy the source scans the list of dirty pages and sends them
+to the destination without being requested (in much the same way as precopy);
+however, when a page request is received from the destination, the dirty page
+scanning restarts from the requested location. This causes requested pages
+to be sent quickly, and also causes pages directly after the requested page
+to be sent quickly in the hope that those pages are likely to be used
+by the destination soon.
+
+Destination behaviour
+---------------------
+
+Initially the destination looks the same as precopy, with a single thread
+reading the migration stream; the 'postcopy advise' and 'discard' commands
+are processed to change the way RAM is managed, but don't affect the stream
+processing.
+
+::
+
+ ------------------------------------------------------------------------------
+ 1 2 3 4 5 6 7
+ main -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN )
+ thread | |
+ | (page request)
+ | \___
+ v \
+ listen thread: --- page -- page -- page -- page -- page --
+
+ a b c
+ ------------------------------------------------------------------------------
+
+- On receipt of ``CMD_PACKAGED`` (1)
+
+  All the data associated with the package - the ( ... ) section in the
+  diagram - is read into memory, and the main thread recurses into
+  ``qemu_loadvm_state_main`` to process the contents of the package (2),
+  which contains commands (3,6) and devices (4...).
+
+- On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package)
+
+  a new thread (a) is started that takes over servicing the migration stream,
+  while the main thread carries on loading the package. It loads normal
+  background page data (b), but if a fault happens during a device load (5),
+  the returned page (c) is loaded by the listen thread, allowing the main
+  thread's device load to carry on.
+
+- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)
+
+ letting the destination CPUs start running. At the end of the
+ ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
+ is no longer used by migration, while the listen thread carries on servicing
+ page data until the end of migration.
+
+Postcopy Recovery
+-----------------
+
+Compared to precopy, postcopy is special in its error handling. When any
+error happens (in this case, mostly network errors), QEMU cannot easily
+fail a migration because VM data resides in both the source and destination
+QEMU instances. Instead, when an issue happens, QEMU on both sides will go
+into a paused state, and a recovery phase is needed to continue the paused
+postcopy migration.
+
+The recovery phase normally contains a few steps:
+
+  - When a network issue occurs, both QEMU instances will go into the
+    PAUSED state
+
+  - When the network is recovered (or a new network is provided), the admin
+    can set up the new channel for migration using the QMP command
+    'migrate-recover' on the destination node, preparing for a resume.
+
+  - On the source host, the admin can continue the interrupted postcopy
+    migration using the QMP command 'migrate' with the resume=true flag set
+    (see the sketch after this list).
+
+ - After the connection is re-established, QEMU will continue the postcopy
+ migration on both sides.
+
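+A rough sketch of the QMP traffic for such a recovery (the URIs are
+placeholders) is::
+
+  # On the destination:
+  -> { "execute": "migrate-recover",
+       "arguments": { "uri": "tcp:0:4444" } }
+  <- { "return": {} }
+
+  # On the source:
+  -> { "execute": "migrate",
+       "arguments": { "uri": "tcp:dest.example.com:4444", "resume": true } }
+  <- { "return": {} }
+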
+During a paused postcopy migration, the VM can logically still continue
+running, and it will not be impacted by accesses to pages that were
+already migrated to the destination VM before the interruption happened.
+However, if any of the missing pages is accessed on the destination VM,
+the VM thread will be halted waiting for the page to be migrated; this
+means it can remain halted until the recovery is complete.
+
+The impact of accessing missing pages depends on the configuration of
+the guest. For example, with async page fault enabled, the guest can
+proactively schedule out the threads accessing missing pages.
+
+Postcopy states
+---------------
+
+Postcopy moves through a series of states (see ``postcopy_state``; a rough
+sketch follows the list below) from ADVISE->DISCARD->LISTEN->RUNNING->END:
+
+ - Advise
+
+ Set at the start of migration if postcopy is enabled, even
+ if it hasn't had the start command; here the destination
+ checks that its OS has the support needed for postcopy, and performs
+ setup to ensure the RAM mappings are suitable for later postcopy.
+ The destination will fail early in migration at this point if the
+ required OS support is not present.
+     (Triggered by reception of the POSTCOPY_ADVISE command)
+
+ - Discard
+
+ Entered on receipt of the first 'discard' command; prior to
+ the first Discard being performed, hugepages are switched off
+ (using madvise) to ensure that no new huge pages are created
+ during the postcopy phase, and to cause any huge pages that
+ have discards on them to be broken.
+
+ - Listen
+
+ The first command in the package, POSTCOPY_LISTEN, switches
+ the destination state to Listen, and starts a new thread
+ (the 'listen thread') which takes over the job of receiving
+ pages off the migration stream, while the main thread carries
+ on processing the blob. With this thread able to process page
+ reception, the destination now 'sensitises' the RAM to detect
+ any access to missing pages (on Linux using the 'userfault'
+ system).
+
+ - Running
+
+ POSTCOPY_RUN causes the destination to synchronise all
+ state and start the CPUs and IO devices running. The main
+ thread now finishes processing the migration package and
+ now carries on as it would for normal precopy migration
+ (although it can't do the cleanup it would do as it
+ finishes a normal migration).
+
+ - Paused
+
+     Postcopy can run into a paused state (normally on both sides when it
+     happens), where all threads will be temporarily halted, mostly due to
+     network errors. When reaching the paused state, migration will make
+     sure the QEMU binaries on both sides maintain the data without
+     corrupting the VM. To continue the migration, the admin needs to fix
+     the migration channel using the QMP command 'migrate-recover' on the
+     destination node, then resume the migration using the QMP command
+     'migrate' again on the source node, with the resume=true flag set.
+
+ - End
+
+     The listen thread can now quit and perform the cleanup of migration
+     state; the migration is now complete.
+
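+For orientation, the destination-side ``postcopy_state`` enum is roughly of
+this shape (a paraphrased sketch, not the verbatim source; the Paused state
+above is tracked as a migration status rather than in this enum)::
+
+  typedef enum {
+      POSTCOPY_INCOMING_NONE = 0,  /* postcopy not in use */
+      POSTCOPY_INCOMING_ADVISE,    /* advise seen, OS support checked */
+      POSTCOPY_INCOMING_DISCARD,   /* discard commands being processed */
+      POSTCOPY_INCOMING_LISTENING, /* listen thread servicing page data */
+      POSTCOPY_INCOMING_RUNNING,   /* destination CPUs running */
+      POSTCOPY_INCOMING_END        /* listen thread quit, cleanup done */
+  } PostcopyState;
+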
+Source side page map
+--------------------
+
+The 'migration bitmap' in postcopy is basically the same as in precopy,
+where each bit indicates that a page is 'dirty' - i.e. needs
+sending. During the precopy phase this is updated as the CPU dirties
+pages, however during postcopy the CPUs are stopped and nothing should
+dirty anything any more. Instead, dirty bits are cleared when the relevant
+pages are sent during postcopy, as in the sketch below.
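+
+A minimal sketch of that rule, assuming a helper with
+``test_and_clear_bit`` semantics (the real logic lives in the RAM migration
+code and is considerably more involved)::
+
+  /* Send a page only if its dirty bit is set, clearing the bit as we go,
+   * so that each page is sent at most once during postcopy. */
+  static bool postcopy_send_if_dirty(unsigned long *bitmap, long page_nr,
+                                     void (*send_page)(long page_nr))
+  {
+      if (test_and_clear_bit(page_nr, bitmap)) {
+          send_page(page_nr);   /* page was dirty: send it */
+          return true;
+      }
+      return false;             /* already sent (or never dirtied) */
+  }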
+
+Postcopy with hugepages
+-----------------------
+
+Postcopy now works with hugetlbfs backed memory:
+
+  a) The Linux kernel on the destination must support userfault on hugepages.
+ b) The huge-page configuration on the source and destination VMs must be
+ identical; i.e. RAMBlocks on both sides must use the same page size.
+ c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
+ RAM if it doesn't have enough hugepages, triggering (b) to fail.
+ Using ``-mem-prealloc`` enforces the allocation using hugepages.
+  d) Care should be taken with the size of hugepage used; postcopy with 2MB
+     hugepages works well, however 1GB hugepages are likely to be problematic
+     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link
+     (1GB is 8Gbit, and 8Gbit / 10Gbps = 0.8s), and until the full page is
+     transferred the destination thread is blocked.
+
+Postcopy with shared memory
+---------------------------
+
+Postcopy migration with shared memory needs explicit support from the other
+processes that share memory and from QEMU. There are restrictions on the
+types of shared memory that userfault can support.
+
+The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
+(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
+for hugetlbfs which may be a problem in some configurations).
+
+The vhost-user code in QEMU supports clients that have Postcopy support,
+and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
+to support postcopy.
+
+The client needs to open a userfaultfd and register the areas
+of memory that it maps with userfault, as sketched below. The client must
+then pass the userfaultfd back to QEMU together with a mapping table that
+allows fault addresses in the client's address space to be converted back to
+RAMBlock/offsets. The client's userfaultfd is added to the postcopy
+fault-thread and page requests are made on behalf of the client by QEMU.
+QEMU performs 'wake' operations on the client's userfaultfd to allow it
+to continue after a page has arrived.
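+
+A minimal sketch of the registration step, using the Linux userfaultfd API
+directly (the function name is illustrative and error handling is omitted)::
+
+  #define _GNU_SOURCE
+  #include <fcntl.h>
+  #include <linux/userfaultfd.h>
+  #include <stdint.h>
+  #include <sys/ioctl.h>
+  #include <sys/syscall.h>
+  #include <unistd.h>
+
+  /* Register an already-mapped shared region with userfault; the
+   * returned fd is what gets handed back to QEMU. */
+  static int register_with_userfault(void *addr, size_t len)
+  {
+      int ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+
+      struct uffdio_api api = { .api = UFFD_API };
+      ioctl(ufd, UFFDIO_API, &api);        /* version/feature handshake */
+
+      struct uffdio_register reg = {
+          .range = { .start = (uintptr_t)addr, .len = len },
+          .mode  = UFFDIO_REGISTER_MODE_MISSING, /* fault on missing pages */
+      };
+      ioctl(ufd, UFFDIO_REGISTER, &reg);
+
+      return ufd;
+  }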
+
+.. note::
+ There are two future improvements that would be nice:
+   a) Some way to make QEMU ignorant of the addresses in the client's
+      address space
+ b) Avoiding the need for QEMU to perform ufd-wake calls after the
+ pages have arrived
+
+Retro-fitting postcopy to existing clients is possible:
+ a) A mechanism is needed for the registration with userfault as above,
+ and the registration needs to be coordinated with the phases of
+ postcopy. In vhost-user extra messages are added to the existing
+ control channel.
+ b) Any thread that can block due to guest memory accesses must be
+ identified and the implication understood; for example if the
+ guest memory access is made while holding a lock then all other
+ threads waiting for that lock will also be blocked.
+
+Postcopy Preemption Mode
+------------------------
+
+Postcopy preempt is a capability introduced in the QEMU 8.0 release. It
+allows urgent pages (those whose page faults were explicitly requested by
+the destination QEMU) to be sent in a separate preempt channel, rather than
+queued in the background migration channel. Anyone who cares about latencies
+of page faults during a postcopy migration should enable this feature. By
+default, it is not enabled. It is enabled like any other capability, as
+shown below.
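+
+For example (a sketch; as with other capabilities, set it on both sides
+before the migration starts)::
+
+  -> { "execute": "migrate-set-capabilities",
+       "arguments": { "capabilities": [
+         { "capability": "postcopy-preempt", "state": true } ] } }
+  <- { "return": {} }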