git.maquefel.me Git - linux.git/commit

bcachefs: Deadlock cycle detector

We've outgrown our own deadlock avoidance strategy.

The btree iterator API provides an interface where the user doesn't need
to concern themselves with lock ordering - different btree iterators can
be traversed in any order. Without special care, this will lead to
deadlocks.

Our previous strategy was to define a lock ordering internally, and
whenever we attempt to take a lock and trylock() fails, we'd check if
the current btree transaction is holding any locks that cause a lock
ordering violation. If so, we'd issue a transaction restart, and then
bch2_trans_begin() would re-traverse all previously used iterators, but
in the correct order.

That approach had some issues, though.
- Sometimes we'd issue transaction restarts unnecessarily, when no
   deadlock would have actually occured. Lock ordering restarts have
   become our primary cause of transaction restarts, on some workloads
   totally 20% of actual transaction commits.

- To avoid deadlock or livelock, we'd often have to take intent locks
   when we only wanted a read lock: with the lock ordering approach, it
   is actually illegal to hold _any_ read lock while blocking on an intent
   lock, and this has been causing us unnecessary lock contention.

- It was getting fragile - the various lock ordering rules are not
   trivial, and we'd been seeing occasional livelock issues related to
   this machinery.

So, since bcachefs is already a relational database masquerading as a
filesystem, we're stealing the next traditional database technique and
switching to a cycle detector for avoiding deadlocks.

When we block taking a btree lock, after adding ourself to the waitlist
but before sleeping, we do a DFS of btree transactions waiting on other
btree transactions, starting with the current transaction and walking
our held locks, and transactions blocking on our held locks.

If we find a cycle, we emit a transaction restart. Occasionally (e.g.
the btree split path) we can not allow the lock() operation to fail, so
if necessary we'll tell another transaction that it has to fail.

Result: trans_restart_would_deadlock events are reduced by a factor of
10 to 100, and we'll be able to delete a whole bunch of grotty, fragile
code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

author	Kent Overstreet <kent.overstreet@gmail.com>
	Mon, 22 Aug 2022 17:23:47 +0000 (13:23 -0400)
committer	Kent Overstreet <kent.overstreet@linux.dev>
	Sun, 22 Oct 2023 21:09:41 +0000 (17:09 -0400)
commit	33bd5d068603f9e81e0b73dbe50e9b88b2e56d0d
tree	6fff6e218b381e0fa2cd4580da3a2e919d18ccd8	tree \| snapshot
parent	62448afee714354a26db8a0f3c644f58628f0792	commit \| diff

fs/bcachefs/bcachefs_format.h		diff \| blob \| history
fs/bcachefs/btree_iter.c		diff \| blob \| history
fs/bcachefs/btree_iter.h		diff \| blob \| history
fs/bcachefs/btree_locking.c		diff \| blob \| history
fs/bcachefs/btree_locking.h		diff \| blob \| history
fs/bcachefs/btree_types.h		diff \| blob \| history
fs/bcachefs/debug.c		diff \| blob \| history
fs/bcachefs/errcode.h		diff \| blob \| history
fs/bcachefs/trace.h		diff \| blob \| history