Kent Overstreet [Mon, 18 Mar 2024 04:13:34 +0000 (00:13 -0400)]
bcachefs: Fix lost transaction restart error
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 17 Mar 2024 01:22:47 +0000 (21:22 -0400)]
bcachefs: Don't corrupt journal keys gap buffer when dropping alloc info
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 16 Mar 2024 23:36:11 +0000 (19:36 -0400)]
bcachefs: fix for building in userspace
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 18 Mar 2024 00:39:42 +0000 (20:39 -0400)]
bcachefs: bch2_snapshot_is_ancestor() now safe to call in early recovery
this fixes an assertion pop in
bch2_check_snapshot_trees() ->
check_snapshot_tree() ->
bch2_snapshot_tree_master_subvol() ->
bch2_snapshot_is_ancestor()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 18 Mar 2024 00:32:36 +0000 (20:32 -0400)]
bcachefs: Fix nested transaction restart handling in bch2_bucket_gens_init()
Nested transaction restart handling is typically best avoided; when the
inner context handles a transaction restart it invalidates the outer
transaction context, so we need to make sure to return a
transaction_restart_nested error.
This code wasn't doing that, and hit the assertion in
for_each_btree_key() that checks for that via trans->restart_count.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 18 Mar 2024 00:27:00 +0000 (20:27 -0400)]
bcachefs: Improve sysfs internal/btree_updates
Print out the function that launched the btree update.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 18 Mar 2024 00:25:39 +0000 (20:25 -0400)]
bcachefs: Split out btree_node_rewrite_worker
This fixes a deadlock due to using btree_interior_update_worker for non
interior updates - async btree node rewrites were blocking, and then
blocking other interior updates.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 17 Mar 2024 02:45:46 +0000 (22:45 -0400)]
bcachefs: Fix locking in bch2_alloc_write_key()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 17 Mar 2024 01:22:24 +0000 (21:22 -0400)]
bcachefs: Avoid extent entry type assertions in .invalid()
After keys have passed bkey_ops.key_invalid we should never see invalid
extent entry types - but .key_invalid itself needs to cope with them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 14 Mar 2024 23:33:56 +0000 (19:33 -0400)]
bcachefs: Fix spurious -BCH_ERR_transaction_restart_nested
We only need to return transaction_restart_nested when we're inside a
context that's handling transaction restarts.
Also, add a missing check_subdir_count() call.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 16 Mar 2024 18:24:29 +0000 (14:24 -0400)]
bcachefs: Fix check_key_has_snapshot() call
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 14 Mar 2024 17:26:26 +0000 (13:26 -0400)]
bcachefs: Change "accounting overran journal reservation" to a warning
This doesn't need to be a BUG_ON(); the actual serious "things break"
condition is if the whole journal write overruns the available space,
and that has a fatal error, not a BUG_ON(). This check indicates we
screwed something up, but it should be a warning.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Darrick J. Wong [Mon, 5 Feb 2024 21:48:21 +0000 (13:48 -0800)]
bcachefs: time_stats: shrink time_stat_buffer for better alignment
Shrink this percpu object by one array element so that the object size
becomes exactly 512 bytes. This will lead to more efficient memory use,
hopefully.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Darrick J. Wong [Thu, 1 Feb 2024 20:41:42 +0000 (12:41 -0800)]
bcachefs: time_stats: split stats-with-quantiles into a separate structure
Currently, struct time_stats has the optional ability to quantize the
information that it collects. This is /probably/ useful for callers who
want to see quantized information, but it more than doubles the size of
the structure from 224 bytes to 464. For users who don't care about
that (e.g. upcoming xfs patches) and want to avoid wasting 240 bytes per
counter, split the two into separate pieces.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Darrick J. Wong [Thu, 8 Feb 2024 23:33:35 +0000 (18:33 -0500)]
bcachefs: mean_and_variance: put struct mean_and_variance_weighted on a diet
The only caller of this code (time_stats) always knows the weights and
whether or not any information has been collected. Pass this
information into the mean and variance code so that it doesn't have to
store that information. This reduces the structure size from 24 to 16
bytes, which shrinks each time_stats counter to 192 bytes from 208.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Darrick J. Wong [Mon, 5 Feb 2024 18:50:15 +0000 (10:50 -0800)]
bcachefs: time_stats: add larger units
Filesystems can stay mounted for a very long time, so add some larger
units.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 14 Mar 2024 00:16:40 +0000 (20:16 -0400)]
bcachefs: pull out time_stats.[ch]
prep work for lifting out of fs/bcachefs/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 12 Mar 2024 01:15:26 +0000 (21:15 -0400)]
bcachefs: reconstruct_alloc cleanup
Now that we've got the errors_silent mechanism, we don't have to check
if the reconstruct_alloc option is set all over the place.
Also - users no longer have to explicitly select fsck and fix_errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 11 Mar 2024 03:34:19 +0000 (23:34 -0400)]
bcachefs: fix bch_folio_sector padding
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 11 Mar 2024 00:53:17 +0000 (20:53 -0400)]
bcachefs: Fix btree key cache coherency during replay
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 11 Mar 2024 03:00:23 +0000 (23:00 -0400)]
bcachefs: Always flush write buffer in delete_dead_inodes()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 10 Mar 2024 20:29:06 +0000 (16:29 -0400)]
bcachefs: Fix order of gc_done passes
gc_stripes_done() and gc_reflink_done() may do alloc btree updates (i.e.
when deleting an indirect extent) - we need bucket gens to be fixed by
then.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 10 Mar 2024 20:24:16 +0000 (16:24 -0400)]
bcachefs: fix deletion of indirect extents in btree_gc
we need to run the normal extent update path on deletion -
bch2_bkey_make_mut() is incorrect when key type is changing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Erick Archer [Sun, 10 Mar 2024 11:02:26 +0000 (12:02 +0100)]
bcachefs: Prefer struct_size over open coded arithmetic
This is an effort to get rid of all multiplications from allocation
functions in order to prevent integer overflows [1][2].
As the "op" variable is a pointer to "struct promote_op" and this
structure ends in a flexible array:
struct promote_op {
[...]
struct bio_vec bi_inline_vecs[];
};
and the "t" variable is a pointer to "struct journal_seq_blacklist_table"
and this structure also ends in a flexible array:
struct journal_seq_blacklist_table {
[...]
struct journal_seq_blacklist_table_entry {
u64 start;
u64 end;
bool dirty;
} entries[];
};
the preferred way in the kernel is to use the struct_size() helper to
do the arithmetic instead of the argument "size + size * count" in the
kzalloc() functions.
This way, the code is more readable and safer.
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
Link: https://github.com/KSPP/linux/issues/160
Signed-off-by: Erick Archer <erick.archer@gmx.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 9 Mar 2024 00:00:25 +0000 (19:00 -0500)]
bcachefs: Kill unused flags argument to btree_split()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 8 Mar 2024 21:10:08 +0000 (16:10 -0500)]
bcachefs: Check for writing superblocks with nonsense member seq fields
We're seeing some unmountable filesystems due to split brain detection
going awry; it seems we somehow wrote out superblocks where we updated
the superblock seq without updating any member seq fields.
A given device's superblock should always have the main seq equal to
it's member seq field, so this is easy to check for.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 8 Mar 2024 18:51:38 +0000 (13:51 -0500)]
bcachefs: fix bch2_journal_buf_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 8 Mar 2024 03:32:06 +0000 (22:32 -0500)]
lib/generic-radix-tree.c: Make nodes more reasonably sized
this code originally used the page allocator directly, but most code
shouldn't do that - PAGE_SIZE varies with architecture, and slab is
faster.
4k is also on the large side for typical usage, 512 bytes is a better
choice for typical usage that might be somewhat sparse.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 2 Mar 2024 20:30:33 +0000 (15:30 -0500)]
bcachefs: copy_(to|from)_user_errcode()
we've got some helpers that return errors sanely, move them to a more
common location for use in fs-ioctl.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 1 Mar 2024 23:49:09 +0000 (18:49 -0500)]
bcachefs: Split out bkey_types.h
We're going to need bkey_types.h in bcachefs_ioctl.h in a future patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Fri, 1 Mar 2024 17:49:24 +0000 (12:49 -0500)]
bcachefs: fix lost journal buf wakeup due to improved pipelining
The journal_write_done() handler was reworked into a loop in commit
746a33c96b7a ("bcachefs: better journal pipelining"). As part of this,
the journal buffer wake was factored into a post-loop branch that
executes if at least one journal buffer has completed.
The journal buffer processing loop iterates on the journal buffer
pointer, however. This means that w refers to the last buffer processed
by the loop, which may or may not be done. This also means that if
multiple buffers are processed by the loop, only the last is awoken.
This lost wakeup behavior has lead to stalling problems in various CI
and fstests, such as generic/703.
Lift the wake into the loop so each done buffer sees a wake call as
it is processed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Hongbo Li [Fri, 1 Mar 2024 06:38:33 +0000 (14:38 +0800)]
bcachefs: intercept mountoption value for bool type
For mount option with bool type, the value must be 0 or 1 (See
bch2_opt_parse). But this seems does not well intercepted cause
for other value(like 2...), it returns the unexpect return code
with error message printed.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Hongbo Li [Fri, 1 Mar 2024 03:17:45 +0000 (11:17 +0800)]
bcachefs: avoid returning private error code in bch2_xattr_bcachefs_set
Avoid the private error code return to caller. The error code
should be transformed into genernal error code.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 Feb 2024 23:30:22 +0000 (18:30 -0500)]
bcachefs: Buffered write path now can avoid the inode lock
Non append, non extending buffered writes can now avoid taking the inode
lock.
To ensure atomicity of writes w.r.t. other writes, we lock every folio
that we'll be writing to, and if this fails we fall back to taking the
inode lock.
Extensive comments are provided as to corner cases.
Link: https://lore.kernel.org/linux-fsdevel/Zdkxfspq3urnrM6I@bombadil.infradead.org/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 Feb 2024 23:28:48 +0000 (18:28 -0500)]
fs: file_remove_privs_flags()
Rename and export __file_remove_privs(); for a buffered write path that
doesn't take the inode lock we need to be able to check if the operation
needs to do work first.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Kent Overstreet [Thu, 29 Feb 2024 02:56:57 +0000 (21:56 -0500)]
bcachefs: Fix bch2_journal_noflush_seq()
Improved journal pipelining broke journal_noflush_seq(); it implicitly
assumed only the oldest outstanding journal buf could be in flight, but
that's no longer true.
Make this more straightforward by just setting buf->must_flush whenever
we know a journal buf is going to be flush.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Hongbo Li [Mon, 19 Feb 2024 12:24:32 +0000 (20:24 +0800)]
bcachefs: fix the error code when mounting with incorrect options.
When mount with incorrect options such as:
"mount -t bcachefs -o errors=back /dev/loop1 /mnt/bcachefs/".
It rebacks the error "mount: /mnt/bcachefs: permission denied."
cause bch2_parse_mount_opts returns -1 and bch2_mount throws
it up. This is unreasonable.
The real error message should be like this:
"mount: /mnt/bcachefs: wrong fs type, bad option, bad
superblock on /dev/loop1, missing codepage or helper program,
or other error."
Adding three private error codes for mounting error. Here are:
- BCH_ERR_mount_option as the parent class for option error.
- BCH_ERR_option_name represents the invalid option name.
- BCH_ERR_option_value represents the invalid option value.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 25 Feb 2024 23:48:21 +0000 (18:48 -0500)]
bcachefs: split out ignore_blacklisted, ignore_not_dirty
prep work for replaying the journal backwards
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 24 Feb 2024 03:43:24 +0000 (22:43 -0500)]
bcachefs: improve move_gap()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 24 Feb 2024 05:19:09 +0000 (00:19 -0500)]
bcachefs: journal_keys now uses darray helpers
nice bit of code cleanup
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 24 Feb 2024 05:15:56 +0000 (00:15 -0500)]
bcachefs: Rename journal_keys.d -> journal_keys.data
This will let us use some darray helpers in the next patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 24 Feb 2024 03:46:35 +0000 (22:46 -0500)]
bcachefs: jset_entry for loops declare loop iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 Feb 2024 03:10:09 +0000 (22:10 -0500)]
bcachefs: Errcode tracepoint, documentation
Add a tracepoint for downcasting private errors to standard errors, so
they can be recovered even when not logged; also, add some
documentation.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Colin Ian King [Wed, 21 Feb 2024 11:52:03 +0000 (11:52 +0000)]
bcachefs: remove redundant assignment to variable ret
Variable ret is being assigned a value that is never read, it is
being re-assigned a couple of statements later on. The assignment
is redundant and can be removed.
Cleans up clang scan build warning:
fs/bcachefs/super-io.c:806:2: warning: Value stored to 'ret' is
never read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Calvin Owens [Mon, 19 Feb 2024 07:36:08 +0000 (23:36 -0800)]
bcachefs: Silence gcc warnings about arm arch ABI drift
32-bit arm builds emit a lot of spam like this:
fs/bcachefs/backpointers.c: In function ‘extent_matches_bp’:
fs/bcachefs/backpointers.c:15:13: note: parameter passing for argument of type ‘struct bch_backpointer’ changed in GCC 9.1
Apply the change from commit
ebcc5928c5d9 ("arm64: Silence gcc warnings
about arch ABI drift") to fs/bcachefs/ to silence them.
Signed-off-by: Calvin Owens <jcalvinowens@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 18 Feb 2024 00:56:19 +0000 (19:56 -0500)]
bcachefs: Add journal.blocked to journal_debug_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 17 Feb 2024 22:54:39 +0000 (17:54 -0500)]
bcachefs: Fix journal_buf bitfield accesses
All jounal_buf bitfield updates must happen under the journal lock -
perhaps we should just switch these to atomic bit flags.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 16 Feb 2024 06:08:25 +0000 (01:08 -0500)]
bcachefs: Split out discard fastpath
Buckets usually can't be discarded until the transaction that made them
empty has been committed in the journal.
Tracing has indicated that we're queuing the discard worker excessively,
only for it to skip over many buckets that are still waiting on a
journal commit, discarding only one or two buckets per iteration.
We want to switch to only queuing the discard worker after a journal
flush write, but there's an important optimization we need to preserve:
if a bucket becomes empty and it was never committed in the journal
while it was in use, we want to discard it and reuse it right away -
since overwriting it before the previous writes are flushed from the
device cache eans those writes only cost bus bandwidth.
So, this patch implements a fast path for buckets that can be discarded
right away. We need new locking between the two discard workers; the new
list of buckets being discarded provides that locking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 17 Feb 2024 08:26:19 +0000 (03:26 -0500)]
bcachefs: improve bch2_journal_buf_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 17 Feb 2024 04:50:05 +0000 (23:50 -0500)]
bcachefs: Drop redundant btree_path_downgrade()s
If a path doesn't have any active references, we shouldn't downgrade it;
it'll either be reused, possibly with intent refs again, or dropped at
bch2_trans_begin() time.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Daniel Hill [Thu, 18 Jan 2024 11:27:44 +0000 (00:27 +1300)]
bcachefs: rebalance_status now shows correct units
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 17 Feb 2024 01:03:12 +0000 (20:03 -0500)]
bcachefs: more informative write path error message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 16 Feb 2024 04:59:05 +0000 (23:59 -0500)]
bcachefs: check_path() now only needs to walk up to subvolume root
Now that checking subvolume structure is a separate pass, the main
check_directory_connectivity() pass only needs to walk up to a given
inode's subvolume root.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 16 Feb 2024 03:50:42 +0000 (22:50 -0500)]
bcachefs: bch2_check_subvolume_structure()
Now that we've got bch_subvolume.fs_path_parent, it's easy to write
subvolume
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Thomas Bertschinger [Fri, 16 Feb 2024 02:44:21 +0000 (19:44 -0700)]
bcachefs: omit alignment attribute on big endian struct bkey
This is needed for building Rust bindings on big endian architectures
like s390x. Currently this is only done in userspace, but it might
happen in-kernel in the future. When creating a Rust binding for struct
bkey, the "packed" attribute is needed to get a type with the correct
member offsets in the big endian case. However, rustc does not allow
types to have both a "packed" and "align" attribute. Thus, in order to
get a Rust type compatible with the C type, we must omit the "aligned"
attribute in C.
This does not affect the struct's size or member offsets, only its
toplevel alignment, which should be an acceptable impact.
The little endian version can have the "align" attribute because the
"packed" attr is redundant, and rust-bindgen will omit the "packed" attr
when an "align" attr is present and it can do so without changing a
type's layout
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 16 Feb 2024 02:42:10 +0000 (21:42 -0500)]
bcachefs: bch2_trigger_alloc() handles state changes better
bch2_trigger_alloc() kicks off certain tasks on bucket state changes;
e.g. triggering the bucket discard worker and the invalidate worker.
We've observed the discard worker running too often - most runs it
doesn't do any work, according to the tracepoint - so clearly, we're
kicking it off too often.
This adds an explicit statechange() macro to make these checks more
precise.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 12 Feb 2024 22:15:29 +0000 (17:15 -0500)]
bcachefs: bch2_print_opts()
Make sure early error messages get redirected, for
kernel-fsck-from-userland.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 12 Feb 2024 20:19:22 +0000 (15:19 -0500)]
bcachefs: Improve error messages in device remove path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 12 Feb 2024 20:17:14 +0000 (15:17 -0500)]
bcachefs: Use kvzalloc() when dynamically allocating btree paths
THis silences a mm/page_alloc.c warning about allocating more than a
page with GFP_NOFAIL - and there's no reason for this to not have a
vmalloc fallback anyways.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 10 Feb 2024 01:16:41 +0000 (20:16 -0500)]
bcachefs: Track iter->ip_allocated at bch2_trans_copy_iter()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 10 Feb 2024 01:15:03 +0000 (20:15 -0500)]
bcachefs: Save key_cache_path in peek_slot()
When bch2_btree_iter_peek_slot() clones the iterator to search for the
next key, and then discovers that the key from the cloned iterator is
the key we want to return - we also want to save the
iter->key_cache_path as well, for the update path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 23 Jan 2024 05:01:07 +0000 (00:01 -0500)]
bcachefs: Pin btree cache in ram for random access in fsck
Various phases of fsck involve checking references from one btree to
another: this means doing a sequential scan of one btree, and then
mostly random access into the second.
This is particularly painful for checking extents <-> backpointers; we
can prefetch btree node access on the sequential scan, but not on the
random access portion, and this is particularly painful on spinning
rust, where we'd like to keep the pipeline fairly full of btree node
reads so that the elevator can reduce seeking.
This patch implements prefetching and pinning of the portion of the
btree that we'll be doing random access to. We already calculate how
much of the random access btree will fit in memory so it's a fairly
straightforward change.
This will put more pressure on system memory usage, so we introduce a
new option, fsck_memory_usage_percent, which is the percentage of total
system ram that fsck is allowed to pin.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 10 Feb 2024 02:01:04 +0000 (21:01 -0500)]
bcachefs: Check for subvolume children when deleting subvolumes
Recursively destroying subvolumes isn't allowed yet.
Fixes: https://github.com/koverstreet/bcachefs/issues/634
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 21 Jan 2024 11:00:07 +0000 (06:00 -0500)]
bcachefs: BTREE_ID_subvolume_children
Add a btree to record a parent -> child subvolume relationships,
according to the filesystem heirarchy.
The subvolume_children btree is a bitset btree: if a bit is set at pos
p, that means p.offset is a child of subvolume p.inode.
This will be used for efficiently listing subvolumes, as well as
recursive deletion.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 8 Feb 2024 23:39:42 +0000 (18:39 -0500)]
bcachefs: bch_subvolume::fs_path_parent
Record the filesystem path heirarchy for subvolumes in bch_subvolume
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 9 Feb 2024 00:23:56 +0000 (19:23 -0500)]
bcachefs: bch2_btree_bit_mod()
Provide a non-write buffer version of bch2_btree_bit_mod_buffered(), for
the subvolume children btree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 9 Feb 2024 00:10:19 +0000 (19:10 -0500)]
bcachefs: bch2_btree_bit_mod -> bch2_btree_bit_mod_buffered
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 9 Feb 2024 21:04:50 +0000 (16:04 -0500)]
bcachefs: Correctly reattach subvolumes
Subvolumes need special handling to reattach - we always reattach them
in the root subvolume's lost+found, and they need a slightly different
kind of dirent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 9 Feb 2024 04:08:21 +0000 (23:08 -0500)]
bcachefs: check_path() now prints full inode when reattaching
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 9 Feb 2024 03:52:40 +0000 (22:52 -0500)]
bcachefs: Pass inode bkey to check_path()
prep work for improving logging/error messages
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 9 Feb 2024 00:52:37 +0000 (19:52 -0500)]
bcachefs: Fix path where dirent -> subvol missing and we don't fix
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 22 Jan 2024 20:12:28 +0000 (15:12 -0500)]
bcachefs: bch_subvolume::parent -> creation_parent
bit of renaming, prep for adding a fs path parent
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 21 Jan 2024 19:57:58 +0000 (14:57 -0500)]
bcachefs: Repair subvol dirents that point to non subvols
when repair switches d_type to or from DT_SUBVOL, we need to update the
target accordingly
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 7 Feb 2024 05:45:09 +0000 (00:45 -0500)]
bcachefs: check dirent->d_parent_subvol
Check that d_parent_subvol makes sense - the dirent's snapshot must be
visible in d_parent_subvol (i.e. an ancestor of d_parent_subvol's
snapshot) in order to be visible.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 7 Feb 2024 05:23:25 +0000 (00:23 -0500)]
bcachefs: check inode->bi_parent_subvol against dirent
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 7 Feb 2024 05:06:14 +0000 (00:06 -0500)]
bcachefs: delete duplicated checks in check_dirent_to_subvol()
these were already checked in check_subvol()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 7 Feb 2024 04:51:23 +0000 (23:51 -0500)]
bcachefs: simplify check_dirent_inode_dirent()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 7 Feb 2024 04:41:46 +0000 (23:41 -0500)]
bcachefs: check bi_parent_subvol in check_inode()
check for inodes with a nonzero bi_parent_subvol field that aren't
actually subvolume roots
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 8 Feb 2024 21:02:08 +0000 (16:02 -0500)]
bcachefs: better log message in lookup_inode_for_snapshot()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 7 Feb 2024 04:39:08 +0000 (23:39 -0500)]
bcachefs: check_inode_dirent_inode()
check that if an inode has a backpointer, the dirent it points to points
back to it.
We do this in check_dirent_inode_dirent(), but only for inodes that have
dirents that point to them - we also have to do the check starting from
the inode to catch inodes that don't have dirents that point to them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 6 Feb 2024 03:30:51 +0000 (22:30 -0500)]
bcachefs: Check subvol <-> inode pointers in check_inode()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 6 Feb 2024 03:20:12 +0000 (22:20 -0500)]
bcachefs: Check subvol <-> inode pointers in check_subvol()
Subvolumes and subvolume root inodes point to each other: this verifies
the subvolume -> inode -> subvolme path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 6 Feb 2024 22:24:18 +0000 (17:24 -0500)]
bcachefs: Kill more -EIO error codes
This converts -EIOs related to btree node errors to private error codes,
which will help with some ongoing debugging by giving us better error
messages.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 18 Feb 2024 01:49:11 +0000 (20:49 -0500)]
bcachefs: thread_with_file: add f_ops.flush
Add a flush op, to return the exit code via close().
Also update bcachefs usage to use this to return fsck exit codes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 14 Feb 2024 01:26:09 +0000 (20:26 -0500)]
bcachefs: thread_with_file: Fix missing va_end()
Fixes: https://lore.kernel.org/linux-bcachefs/202402131603.E953E2CF@keescook/T/#u
Reported-by: coverity scan
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Darrick J. Wong [Sat, 10 Feb 2024 19:32:20 +0000 (11:32 -0800)]
bcachefs: thread_with_file: allow ioctls against these files
Make it so that a thread_with_stdio user can handle ioctls against the
file descriptor.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Darrick J. Wong [Sat, 10 Feb 2024 19:23:01 +0000 (11:23 -0800)]
bcachefs: thread_with_file: create ops structure for thread_with_stdio
Create an ops structure so we can add more file-based functionality in
the next few patches.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Darrick J. Wong [Wed, 7 Feb 2024 19:39:03 +0000 (11:39 -0800)]
bcachefs: thread_with_file: fix various printf problems
Experimentally fix some problems with stdio_redirect_vprintf by creating
a MOO variant with which we can experiment. We can't do a GFP_KERNEL
allocation while holding the spinlock, and I don't like how the printf
function can silently truncate the output if memory allocation fails.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Darrick J. Wong [Wed, 7 Feb 2024 19:43:32 +0000 (11:43 -0800)]
bcachefs: thread_with_file: allow creation of readonly files
Create a new run_thread_with_stdout function that opens a file in
O_RDONLY mode so that the kernel can write things to userspace but
userspace cannot write to the kernel. This will be used to convey xfs
health event information to userspace.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 9 Feb 2024 01:41:34 +0000 (20:41 -0500)]
bcachefs: thread_with_stdio: suppress hung task warning
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 9 Feb 2024 06:04:38 +0000 (01:04 -0500)]
kernel/hung_task.c: export sysctl_hung_task_timeout_secs
needed for thread_with_file; also rare but not unheard of to need this
in module code, when blocking on user input.
one workaround used by some code is wait_event_interruptible() - but
that can be buggy if the outer context isn't expecting unwinding.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: fuyuanli <fuyuanli@didiglobal.com>
Kent Overstreet [Fri, 9 Feb 2024 01:27:06 +0000 (20:27 -0500)]
bcachefs: thread_with_stdio: Mark completed in ->release()
This fixes stdio_redirect_read() getting stuck, not noticing that the
pipe has been closed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 3 Feb 2024 20:43:16 +0000 (15:43 -0500)]
bcachefs: Thread with file documentation
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 5 Feb 2024 03:56:16 +0000 (22:56 -0500)]
bcachefs: thread_with_stdio: fix bch2_stdio_redirect_readline()
This fixes a bug where we'd return data without waiting for a newline,
if data was present but a newline was not.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 5 Feb 2024 03:49:34 +0000 (22:49 -0500)]
bcachefs: thread_with_stdio: kill thread_with_stdio_done()
Move the cleanup code to a wrapper function, where we can call it after
the thread_with_stdio fn exits.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 5 Feb 2024 03:20:40 +0000 (22:20 -0500)]
bcachefs: thread_with_stdio: convert to darray
- eliminate the dependency on printbufs, so that we can lift
thread_with_file for use in xfs
- add a nonblocking parameter to stdio_redirect_printf(), and either
block if the buffer is full or drop it on the floor - don't buffer
infinitely
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 5 Feb 2024 01:19:49 +0000 (20:19 -0500)]
bcachefs: thread_with_stdio: eliminate double buffering
The output buffer lock has to be a spinlock so that we can write to it
from interrupt context, so we can't use a direct copy_to_user; this
switches thread_with_file_read() to use fault_in_writeable() and
copy_to_user_nofault(), similar to how thread_with_file_write() works.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 1 Feb 2024 11:35:46 +0000 (06:35 -0500)]
bcachefs: kill kvpmalloc()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 1 Feb 2024 11:28:41 +0000 (06:28 -0500)]
mempool: kvmalloc pool
Add mempool_init_kvmalloc_pool() and mempool_create_kvmalloc_pool(),
which wrap kvmalloc() instead of kmalloc() - kmalloc() with a vmalloc()
fallback.
This is part of a bcachefs cleanup - dropping an internal kvpmalloc()
helper (which predates kvmalloc()) along with mempool helpers; this
replaces the bcachefs-private kvpmalloc_pool.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-mm@kvack.org
Kent Overstreet [Thu, 25 Jan 2024 17:36:37 +0000 (12:36 -0500)]
bcachefs: bch2_lookup() gives better error message on inode not found
When a dirent points to a missing inode, we really should print out the
dirent.
This requires quite a bit of refactoring, but there's some other
benefits: we now do the entire looup (dirent and inode) in a single
btree transaction, and copy to the VFS inode with btree locks still
held, like the create path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>