From: Alex Bennée
Date: Thu, 9 Jul 2020 14:13:15 +0000 (+0100)
Subject: docs/devel: convert and update MTTCG design document
X-Git-Url: http://git.maquefel.me/?a=commitdiff_plain;h=c8c06e520d389dcde5963cc5a73d5ecbaf6b8e55;p=qemu.git

docs/devel: convert and update MTTCG design document

Do a light conversion to .rst and clean up some of the language at the
start now that MTTCG has been merged for a while.

Signed-off-by: Alex Bennée
Reviewed-by: Richard Henderson
Message-Id: <20200709141327.14631-2-alex.bennee@linaro.org>
---

diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index bb8238c5d6..4ecaea3643 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -23,6 +23,7 @@ Contents:
    decodetree
    secure-coding-practices
    tcg
+   multi-thread-tcg
    tcg-plugins
    bitops
    reset

diff --git a/docs/devel/multi-thread-tcg.rst b/docs/devel/multi-thread-tcg.rst
new file mode 100644
index 0000000000..42158b77c7
--- /dev/null
+++ b/docs/devel/multi-thread-tcg.rst
@@ -0,0 +1,372 @@
..
  Copyright (c) 2015-2020 Linaro Ltd.

  This work is licensed under the terms of the GNU GPL, version 2 or
  later. See the COPYING file in the top-level directory.

Introduction
============

This document outlines the design for multi-threaded TCG (a.k.a. MTTCG)
system-mode emulation. User-mode emulation has always mirrored the
thread structure of the translated executable, although some of the
changes done for MTTCG system emulation have improved the stability of
linux-user emulation.

The original system-mode TCG implementation was single threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as systems
being emulated gained additional cores and per-core performance gains
for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread.
This is enabled by default for all FE/BE +combinations where the host memory model is able to accommodate the +guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the +guest has had the required work done to support this safely +(TARGET_SUPPORTS_MTTCG). + +System emulation will fall back to the original round robin approach +if: + +* forced by --accel tcg,thread=single +* enabling --icount mode +* 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST) + +In the general case of running translated code there should be no +inter-vCPU dependencies and all vCPUs should be able to run at full +speed. Synchronisation will only be required while accessing internal +shared data structures or when the emulated architecture requires a +coherent representation of the emulated machine state. + +Shared Data Structures +====================== + +Main Run Loop +------------- + +Even when there is no code being generated there are a number of +structures associated with the hot-path through the main run-loop. +These are associated with looking up the next translation block to +execute. These include: + + tb_jmp_cache (per-vCPU, cache of recent jumps) + tb_ctx.htable (global hash table, phys address->tb lookup) + +As TB linking only occurs when blocks are in the same page this code +is critical to performance as looking up the next TB to execute is the +most common reason to exit the generated code. + +DESIGN REQUIREMENT: Make access to lookup structures safe with +multiple reader/writer threads. Minimise any lock contention to do it. + +The hot-path avoids using locks where possible. The tb_jmp_cache is +updated with atomic accesses to ensure consistent results. The fall +back QHT based hash table is also designed for lockless lookups. Locks +are only taken when code generation is required or TranslationBlocks +have their block-to-block jumps patched. 
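The lockless hot-path described above can be sketched in miniature. The
following is a simplified illustration, not QEMU's actual code: the names
``JmpCache``, ``jc_lookup``, ``jc_insert`` and the cache geometry are
invented for the example. It shows why a racy-but-atomic jump cache is
safe: a stale or empty slot only forces the slower shared-table lookup,
never a wrong result.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature of a per-vCPU jump cache: a direct-mapped
 * array of TB pointers indexed by a hash of the guest PC.  Readers and
 * writers never take a lock; acquire/release atomics are enough because
 * a stale or NULL entry only costs a fall-back lookup. */
#define JMP_CACHE_BITS 6
#define JMP_CACHE_SIZE (1u << JMP_CACHE_BITS)

typedef struct TB {
    uint64_t pc;                /* guest PC this block starts at */
} TB;

typedef struct {
    _Atomic(TB *) cache[JMP_CACHE_SIZE];
} JmpCache;

static inline unsigned jc_hash(uint64_t pc)
{
    return (pc >> 2) & (JMP_CACHE_SIZE - 1);
}

/* Hot path: try the cache; on a miss the caller would fall back to the
 * shared QHT hash table under its own lockless protocol. */
static TB *jc_lookup(JmpCache *jc, uint64_t pc)
{
    TB *tb = atomic_load_explicit(&jc->cache[jc_hash(pc)],
                                  memory_order_acquire);
    if (tb && tb->pc == pc) {
        return tb;
    }
    return NULL;                /* miss: consult the global table */
}

static void jc_insert(JmpCache *jc, TB *tb)
{
    atomic_store_explicit(&jc->cache[jc_hash(tb->pc)], tb,
                          memory_order_release);
}
```

A real implementation also clears entries on TB invalidation and flush;
the only guarantee exercised here is that a lookup either hits the exact
block or misses cleanly.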
Global TCG State
----------------

User-mode emulation
~~~~~~~~~~~~~~~~~~~

We need to protect the entire code generation cycle including any post
generation patching of the translated code. This also implies a shared
translation buffer which contains code running on all cores. Any
execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held
on a per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Code generation is serialised with mmap_lock().

!User-mode emulation
~~~~~~~~~~~~~~~~~~~~

Each vCPU has its own TCG context and associated TCG region, thereby
requiring no locking during translation.

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which when full will force a flush of all translations and start from
scratch again. Some operations also force a full flush of translations
including:

 - debugging operations (breakpoint insertion/removal)
 - some CPU helper functions
 - linux-user spawning its first thread

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.

More granular translation invalidation events are typically due
to a change of the state of a physical page:

 - code modification (self-modifying code, patching code)
 - page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path there are a number of other
book-keeping structures that need to be safely cleared.
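The invalid-flag half of this can be illustrated with a small sketch.
The names here are hypothetical and C11 atomics stand in for QEMU's own
atomic helpers: the flag is published with a single atomic store, and
every hot-path lookup re-checks it before a block may be executed, so a
freshly invalidated block is never (re)entered even before the rest of
the book-keeping has been cleaned up.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the invalid flag on a TranslationBlock. */
typedef struct TB {
    uint64_t pc;
    atomic_bool invalid;
} TB;

static void tb_mark_invalid(TB *tb)
{
    atomic_store_explicit(&tb->invalid, true, memory_order_release);
}

/* Returns the block only if it is still safe to execute; a stale or
 * invalidated candidate sends the caller down the slow path. */
static TB *tb_lookup_checked(TB *candidate, uint64_t pc)
{
    if (!candidate || candidate->pc != pc) {
        return NULL;
    }
    if (atomic_load_explicit(&candidate->invalid, memory_order_acquire)) {
        return NULL;            /* invalidated: take the slow path */
    }
    return candidate;
}
```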
Any TranslationBlocks which have been patched to jump directly to the
now invalid blocks need the jump patches reversing so they will return
to the C code.

There are a number of look-up caches that need to be properly updated
including:

 - the jump lookup cache
 - the physical-to-tb lookup hash table
 - the global page table

The global page table (l1_map) provides a multi-level look-up for
PageDesc structures which contain pointers to the start of a linked
list of all Translation Blocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
 - safely patch/revert direct jumps
 - remove central PageDesc lookup entries
 - ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are done under the protection of tb->jmp_lock,
where tb is the destination block of a jump. Each origin block keeps a
pointer to its destinations so that the appropriate lock can be acquired before
iterating over a jump list.

The global page table is a lockless radix tree; cmpxchg is used
to atomically insert new elements.

The lookup caches are updated atomically and the lookup hash uses QHT
which is designed for concurrent safe lookup.

Parallel code generation is supported. QHT is used at insertion time
as the synchronization point across threads, thereby ensuring that we only
keep track of a single TranslationBlock for each guest code block.

Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code.
This is handled with a per-vCPU TLB structure which once populated
will allow a series of accesses to the page to occur without exiting
the translated code. It is possible to set flags in the TLB address
which will ensure the slow-path is taken for each access. This can be
done to support:

 - Memory regions (dividing up access to PIO, MMIO and RAM)
 - Dirty page tracking (for code gen, SMC detection, migration and display)
 - Virtual TLB (for translating guest address->real address)

When the TLB tables are updated by a vCPU thread other than its own
we need to ensure it is done in a safe way so no inconsistent state is
seen by the vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

 - TLB Flush All/Page
   - can be across-vCPUs
   - cross vCPU TLB flush may need other vCPU brought to halt
   - change may need to be visible to the calling vCPU immediately
 - TLB Flag Update
   - usually cross-vCPU
   - want change to be visible as soon as possible
 - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
   - This is a per-vCPU table - by definition can't race
   - updated by its own thread when the slow-path is forced

(Current solution)

We have updated cputlb.c to defer cross-vCPU operations with
async_run_on_cpu(), which ensures each vCPU sees a coherent state
when it next runs its work (in a few instructions' time).

A new set of operations (tlb_flush_*_all_cpus) take an additional flag
which when set will force synchronisation by setting the source vCPU's
work as "safe work" and exiting the cpu run loop. This ensures that by
the time execution restarts all flush operations have completed.

TLB flag updates are all done atomically and are also protected by the
corresponding page lock.
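The deferral pattern can be reduced to a sketch. This is a
single-threaded illustration with invented names (``VCPU``,
``queue_work``), not cputlb.c itself: the requesting context only
queues a work item, and the target vCPU applies the flush from the top
of its own execution loop, which is what makes the TLB update race-free
by construction.

```c
#include <stdbool.h>

/* Hypothetical sketch of deferring a cross-vCPU TLB flush: the TLB of
 * a vCPU is only ever modified from that vCPU's own loop. */
#define TLB_ENTRIES 8
#define MAX_WORK    4

typedef struct VCPU VCPU;
typedef void (*WorkFn)(VCPU *cpu);

struct VCPU {
    bool tlb_valid[TLB_ENTRIES];
    WorkFn work[MAX_WORK];      /* in QEMU this is a properly locked list */
    int nwork;
};

/* Called from another vCPU's context: just queue, don't touch the TLB. */
static void queue_work(VCPU *cpu, WorkFn fn)
{
    cpu->work[cpu->nwork++] = fn;   /* real code takes a lock here */
}

static void tlb_flush_work(VCPU *cpu)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        cpu->tlb_valid[i] = false;
    }
}

/* Run at the top of the vCPU loop, before executing more TBs. */
static void process_queued_work(VCPU *cpu)
{
    for (int i = 0; i < cpu->nwork; i++) {
        cpu->work[i](cpu);
    }
    cpu->nwork = 0;
}
```

The "safe work" variant additionally forces the *source* vCPU out of its
run loop until all queued flushes have completed.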
(Known limitation)

Not really a limitation but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently, thanks to KVM work, any access to IO memory is automatically
protected by the global iothread mutex, also known as the BQL (Big
QEMU Lock). Any IO region that doesn't use the global mutex is expected
to do its own locking.

However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL, or global iothread mutex, is shared across the system we
push the use of the lock as far down into the TCG code as possible to
minimise contention.

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
often be cross-vCPU.

Memory Consistency
==================

Between emulated guests and host systems there are a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest wants to.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to any memory operations as well as just loads or stores.
The Linux kernel has an excellent `write-up
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
on the various forms of memory barrier and the guarantees they can
provide.

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring for example
a change to a signal flag will only be visible once the changes to
payload are.

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 load/stores come with
fairly strong guarantees of sequential consistency whereas Arm has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
 - host systems with stronger implied guarantees can skip some barriers
 - merge consecutive barriers to the strongest one

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required.
So far the following front-ends have been updated to emit fences when
required:

 - target-i386
 - target-arm
 - target-aarch64
 - target-alpha
 - target-mips

Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour these instructions
are often seen when code modification has taken place to ensure the
changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a simple atomic instruction which will guarantee
some sort of test and conditional store will be truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which together
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is Arm's ldrex/strex
pair where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:
 - Support classic atomic instructions
 - Support load/store exclusive (or load link/store conditional) pairs
 - Generic enough infrastructure to support all guest architectures

CURRENT OPEN QUESTIONS:
 - How problematic is the ABA problem in general?
(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
Arm's ldrex/strex instructions. While they are susceptible to the ABA
problem, so far common guests have not implemented patterns where
this may be a problem - typically presenting a locking ABI which
assumes cmpxchg-like semantics.

The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.

While the atomic helpers look good enough for now there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.

diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt
deleted file mode 100644
index 3c85ac0eab..0000000000
--- a/docs/devel/multi-thread-tcg.txt
+++ /dev/null
@@ -1,358 +0,0 @@
-Copyright (c) 2015-2016 Linaro Ltd.
-
-This work is licensed under the terms of the GNU GPL, version 2 or
-later. See the COPYING file in the top-level directory.
-
-Introduction
-============
-
-This document outlines the design for multi-threaded TCG system-mode
-emulation. The current user-mode emulation mirrors the thread
-structure of the translated executable. Some of the work will be
-applicable to both system and linux-user emulation.
-
-The original system-mode TCG implementation was single threaded and
-dealt with multiple CPUs with simple round-robin scheduling. This
-simplified a lot of things but became increasingly limited as systems
-being emulated gained additional cores and per-core performance gains
-for host systems started to level off.
-
-vCPU Scheduling
-===============
-
-We introduce a new running mode where each vCPU will run on its own
-user-space thread.
This will be enabled by default for all FE/BE -combinations that have had the required work done to support this -safely. - -In the general case of running translated code there should be no -inter-vCPU dependencies and all vCPUs should be able to run at full -speed. Synchronisation will only be required while accessing internal -shared data structures or when the emulated architecture requires a -coherent representation of the emulated machine state. - -Shared Data Structures -====================== - -Main Run Loop -------------- - -Even when there is no code being generated there are a number of -structures associated with the hot-path through the main run-loop. -These are associated with looking up the next translation block to -execute. These include: - - tb_jmp_cache (per-vCPU, cache of recent jumps) - tb_ctx.htable (global hash table, phys address->tb lookup) - -As TB linking only occurs when blocks are in the same page this code -is critical to performance as looking up the next TB to execute is the -most common reason to exit the generated code. - -DESIGN REQUIREMENT: Make access to lookup structures safe with -multiple reader/writer threads. Minimise any lock contention to do it. - -The hot-path avoids using locks where possible. The tb_jmp_cache is -updated with atomic accesses to ensure consistent results. The fall -back QHT based hash table is also designed for lockless lookups. Locks -are only taken when code generation is required or TranslationBlocks -have their block-to-block jumps patched. - -Global TCG State ----------------- - -### User-mode emulation -We need to protect the entire code generation cycle including any post -generation patching of the translated code. This also implies a shared -translation buffer which contains code running on all cores. Any -execution path that comes to the main run loop will need to hold a -mutex for code generation. This also includes times when we need flush -code or entries from any shared lookups/caches. 
Structures held on a -per-vCPU basis won't need locking unless other vCPUs will need to -modify them. - -DESIGN REQUIREMENT: Add locking around all code generation and TB -patching. - -(Current solution) - -Code generation is serialised with mmap_lock(). - -### !User-mode emulation -Each vCPU has its own TCG context and associated TCG region, thereby -requiring no locking. - -Translation Blocks ------------------- - -Currently the whole system shares a single code generation buffer -which when full will force a flush of all translations and start from -scratch again. Some operations also force a full flush of translations -including: - - - debugging operations (breakpoint insertion/removal) - - some CPU helper functions - -This is done with the async_safe_run_on_cpu() mechanism to ensure all -vCPUs are quiescent when changes are being made to shared global -structures. - -More granular translation invalidation events are typically due -to a change of the state of a physical page: - - - code modification (self modify code, patching code) - - page changes (new page mapping in linux-user mode) - -While setting the invalid flag in a TranslationBlock will stop it -being used when looked up in the hot-path there are a number of other -book-keeping structures that need to be safely cleared. - -Any TranslationBlocks which have been patched to jump directly to the -now invalid blocks need the jump patches reversing so they will return -to the C code. - -There are a number of look-up caches that need to be properly updated -including the: - - - jump lookup cache - - the physical-to-tb lookup hash table - - the global page table - -The global page table (l1_map) which provides a multi-level look-up -for PageDesc structures which contain pointers to the start of a -linked list of all Translation Blocks in that page (see page_next). - -Both the jump patching and the page cache involve linked lists that -the invalidated TranslationBlock needs to be removed from. 
- -DESIGN REQUIREMENT: Safely handle invalidation of TBs - - safely patch/revert direct jumps - - remove central PageDesc lookup entries - - ensure lookup caches/hashes are safely updated - -(Current solution) - -The direct jump themselves are updated atomically by the TCG -tb_set_jmp_target() code. Modification to the linked lists that allow -searching for linked pages are done under the protection of tb->jmp_lock, -where tb is the destination block of a jump. Each origin block keeps a -pointer to its destinations so that the appropriate lock can be acquired before -iterating over a jump list. - -The global page table is a lockless radix tree; cmpxchg is used -to atomically insert new elements. - -The lookup caches are updated atomically and the lookup hash uses QHT -which is designed for concurrent safe lookup. - -Parallel code generation is supported. QHT is used at insertion time -as the synchronization point across threads, thereby ensuring that we only -keep track of a single TranslationBlock for each guest code block. - -Memory maps and TLBs --------------------- - -The memory handling code is fairly critical to the speed of memory -access in the emulated system. The SoftMMU code is designed so the -hot-path can be handled entirely within translated code. This is -handled with a per-vCPU TLB structure which once populated will allow -a series of accesses to the page to occur without exiting the -translated code. It is possible to set flags in the TLB address which -will ensure the slow-path is taken for each access. This can be done -to support: - - - Memory regions (dividing up access to PIO, MMIO and RAM) - - Dirty page tracking (for code gen, SMC detection, migration and display) - - Virtual TLB (for translating guest address->real address) - -When the TLB tables are updated by a vCPU thread other than their own -we need to ensure it is done in a safe way so no inconsistent state is -seen by the vCPU thread. 
- -Some operations require updating a number of vCPUs TLBs at the same -time in a synchronised manner. - -DESIGN REQUIREMENTS: - - - TLB Flush All/Page - - can be across-vCPUs - - cross vCPU TLB flush may need other vCPU brought to halt - - change may need to be visible to the calling vCPU immediately - - TLB Flag Update - - usually cross-vCPU - - want change to be visible as soon as possible - - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs) - - This is a per-vCPU table - by definition can't race - - updated by its own thread when the slow-path is forced - -(Current solution) - -We have updated cputlb.c to defer operations when a cross-vCPU -operation with async_run_on_cpu() which ensures each vCPU sees a -coherent state when it next runs its work (in a few instructions -time). - -A new set up operations (tlb_flush_*_all_cpus) take an additional flag -which when set will force synchronisation by setting the source vCPUs -work as "safe work" and exiting the cpu run loop. This ensure by the -time execution restarts all flush operations have completed. - -TLB flag updates are all done atomically and are also protected by the -corresponding page lock. - -(Known limitation) - -Not really a limitation but the wait mechanism is overly strict for -some architectures which only need flushes completed by a barrier -instruction. This could be a future optimisation. - -Emulated hardware state ------------------------ - -Currently thanks to KVM work any access to IO memory is automatically -protected by the global iothread mutex, also known as the BQL (Big -Qemu Lock). Any IO region that doesn't use global mutex is expected to -do its own locking. - -However IO memory isn't the only way emulated hardware state can be -modified. Some architectures have model specific registers that -trigger hardware emulation features. Generally any translation helper -that needs to update more than a single vCPUs of state should take the -BQL. 
- -As the BQL, or global iothread mutex is shared across the system we -push the use of the lock as far down into the TCG code as possible to -minimise contention. - -(Current solution) - -MMIO access automatically serialises hardware emulation by way of the -BQL. Currently Arm targets serialise all ARM_CP_IO register accesses -and also defer the reset/startup of vCPUs to the vCPU context by way -of async_run_on_cpu(). - -Updates to interrupt state are also protected by the BQL as they can -often be cross vCPU. - -Memory Consistency -================== - -Between emulated guests and host systems there are a range of memory -consistency models. Even emulating weakly ordered systems on strongly -ordered hosts needs to ensure things like store-after-load re-ordering -can be prevented when the guest wants to. - -Memory Barriers ---------------- - -Barriers (sometimes known as fences) provide a mechanism for software -to enforce a particular ordering of memory operations from the point -of view of external observers (e.g. another processor core). They can -apply to any memory operations as well as just loads or stores. - -The Linux kernel has an excellent write-up on the various forms of -memory barrier and the guarantees they can provide [1]. - -Barriers are often wrapped around synchronisation primitives to -provide explicit memory ordering semantics. However they can be used -by themselves to provide safe lockless access by ensuring for example -a change to a signal flag will only be visible once the changes to -payload are. - -DESIGN REQUIREMENT: Add a new tcg_memory_barrier op - -This would enforce a strong load/store ordering so all loads/stores -complete at the memory barrier. On single-core non-SMP strongly -ordered backends this could become a NOP. - -Aside from explicit standalone memory barrier instructions there are -also implicit memory ordering semantics which comes with each guest -memory access instruction. 
For example all x86 load/stores come with -fairly strong guarantees of sequential consistency whereas Arm has -special variants of load/store instructions that imply acquire/release -semantics. - -In the case of a strongly ordered guest architecture being emulated on -a weakly ordered host the scope for a heavy performance impact is -quite high. - -DESIGN REQUIREMENTS: Be efficient with use of memory barriers - - host systems with stronger implied guarantees can skip some barriers - - merge consecutive barriers to the strongest one - -(Current solution) - -The system currently has a tcg_gen_mb() which will add memory barrier -operations if code generation is being done in a parallel context. The -tcg_optimize() function attempts to merge barriers up to their -strongest form before any load/store operations. The solution was -originally developed and tested for linux-user based systems. All -backends have been converted to emit fences when required. So far the -following front-ends have been updated to emit fences when required: - - - target-i386 - - target-arm - - target-aarch64 - - target-alpha - - target-mips - -Memory Control and Maintenance ------------------------------- - -This includes a class of instructions for controlling system cache -behaviour. While QEMU doesn't model cache behaviour these instructions -are often seen when code modification has taken place to ensure the -changes take effect. - -Synchronisation Primitives --------------------------- - -There are two broad types of synchronisation primitives found in -modern ISAs: atomic instructions and exclusive regions. - -The first type offer a simple atomic instruction which will guarantee -some sort of test and conditional store will be truly atomic w.r.t. -other cores sharing access to the memory. The classic example is the -x86 cmpxchg instruction. 
- -The second type offer a pair of load/store instructions which offer a -guarantee that a region of memory has not been touched between the -load and store instructions. An example of this is Arm's ldrex/strex -pair where the strex instruction will return a flag indicating a -successful store only if no other CPU has accessed the memory region -since the ldrex. - -Traditionally TCG has generated a series of operations that work -because they are within the context of a single translation block so -will have completed before another CPU is scheduled. However with -the ability to have multiple threads running to emulate multiple CPUs -we will need to explicitly expose these semantics. - -DESIGN REQUIREMENTS: - - Support classic atomic instructions - - Support load/store exclusive (or load link/store conditional) pairs - - Generic enough infrastructure to support all guest architectures -CURRENT OPEN QUESTIONS: - - How problematic is the ABA problem in general? - -(Current solution) - -The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which -can be used directly or combined to emulate other instructions like -Arm's ldrex/strex instructions. While they are susceptible to the ABA -problem so far common guests have not implemented patterns where -this may be a problem - typically presenting a locking ABI which -assumes cmpxchg like semantics. - -The code also includes a fall-back for cases where multi-threaded TCG -ops can't work (e.g. guest atomic width > host atomic width). In this -case an EXCP_ATOMIC exit occurs and the instruction is emulated with -an exclusive lock which ensures all emulation is serialised. - -While the atomic helpers look good enough for now there may be a need -to look at solutions that can more closely model the guest -architectures semantics. - -========== - -[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt