Kent Overstreet [Mon, 17 Jul 2023 04:56:29 +0000 (00:56 -0400)]
bcachefs: bcachefs_metadata_version_deleted_inodes
Add a new bitset btree for inodes pending deletion; this means we no
longer have to scan the full inodes btree after an unclean shutdown.
Specifically, this adds:
- a trigger to update the deleted_inodes btree based on changes to the
inodes btree
- a new recovery pass
- and check_inodes is now only a fsck pass.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
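Illustration for the deleted_inodes commit above (not code from the patch): a minimal userspace sketch of the idea, assuming a flat in-memory bitmap in place of the bitset btree. The names deleted_inodes_mod(), inode_trigger() and deleted_inodes_collect(), and the MAX_INODES limit, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INODES 1024u /* hypothetical fixed size for the sketch */

/* Stand-in for the bitset btree: one bit per inode pending deletion. */
struct deleted_inodes {
	uint64_t bits[MAX_INODES / 64];
};

/* Set or clear the bit for one inode number. */
static void deleted_inodes_mod(struct deleted_inodes *d, uint32_t inum, bool set)
{
	assert(inum < MAX_INODES);
	if (set)
		d->bits[inum / 64] |= UINT64_C(1) << (inum % 64);
	else
		d->bits[inum / 64] &= ~(UINT64_C(1) << (inum % 64));
}

/* "Trigger": keep the bitset in sync whenever an inode's unlinked state changes. */
static void inode_trigger(struct deleted_inodes *d, uint32_t inum,
			  bool old_unlinked, bool new_unlinked)
{
	if (old_unlinked != new_unlinked)
		deleted_inodes_mod(d, inum, new_unlinked);
}

/* "Recovery pass": collect only inodes whose bit is set; returns how many were found. */
static size_t deleted_inodes_collect(const struct deleted_inodes *d,
				     uint32_t *out, size_t max)
{
	size_t n = 0;

	for (uint32_t inum = 0; inum < MAX_INODES && n < max; inum++)
		if (d->bits[inum / 64] & (UINT64_C(1) << (inum % 64)))
			out[n++] = inum;
	return n;
}
```

The point of the sketch is that recovery only has to look at the set bits, instead of visiting every inode to find the ones that were unlinked but not yet deleted.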
Kent Overstreet [Thu, 3 Aug 2023 07:29:42 +0000 (03:29 -0400)]
bcachefs: Fix folio leak in folio_hole_offset()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Jul 2023 02:42:26 +0000 (22:42 -0400)]
bcachefs: Fix overlapping extent repair
A number of smallish fixes for overlapping extent repair, and (part of)
a new unit test. This fixes all the issues turned up by bhzhu203, in his
filesystem image from running mongodb + snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 3 Aug 2023 00:19:58 +0000 (20:19 -0400)]
bcachefs: In debug mode, run fsck again after fixing errors
We want to ensure that fsck actually fixed all the errors it found - the
second fsck run should be clean.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 2 Aug 2023 23:49:24 +0000 (19:49 -0400)]
bcachefs: recovery_types.h
Move some code out of bcachefs.h, which is too much of an everything
header.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 2 Aug 2023 16:51:51 +0000 (12:51 -0400)]
bcachefs: Handle weird opt string from sys_fsconfig()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 2 Aug 2023 00:06:45 +0000 (20:06 -0400)]
bcachefs: Assorted fixes for clang
clang had a few more warnings about enum conversion, and also didn't
like the opts.c initializer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Jul 2023 07:20:08 +0000 (03:20 -0400)]
bcachefs: Move fsck_inode_rm() to inode.c
Prep work for the new deleted inodes btree
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Jul 2023 09:38:45 +0000 (05:38 -0400)]
bcachefs: Consolidate btree id properties
This refactoring centralizes defining per-btree properties.
bch2_key_types_allowed was also about to overflow a u32, so expand that
to a u64.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Jul 2023 04:27:19 +0000 (00:27 -0400)]
bcachefs: bch2_trans_update_extent_overwrite()
Factor out a new helper, to be used when fsck has to repair overlapping
extents.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Jul 2023 03:13:43 +0000 (23:13 -0400)]
bcachefs: Fix minor memory leak on invalid bkey
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Jul 2023 03:14:05 +0000 (23:14 -0400)]
bcachefs: Move some declarations to the correct header
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Jul 2023 02:47:59 +0000 (22:47 -0400)]
bcachefs: Fix btree iter leak in __bch2_insert_snapshot_whiteouts()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Jul 2023 23:30:53 +0000 (19:30 -0400)]
bcachefs: Fix a null ptr deref in check_xattr()
We were attempting to initialize inode hash info when no inodes were
found.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 04:56:07 +0000 (00:56 -0400)]
bcachefs: bch2_btree_bit_mod()
New helper for bitset btrees.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 04:41:48 +0000 (00:41 -0400)]
bcachefs: move inode triggers to inode.c
bit of reorg
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 04:12:58 +0000 (00:12 -0400)]
bcachefs: fsck: delete dead code
Delete the old, now reimplemented overlapping extent check/repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 03:19:49 +0000 (23:19 -0400)]
bcachefs: Make topology repair a normal recovery pass
This adds bch2_run_explicit_recovery_pass(), for rewinding recovery and
explicitly running a specific recovery pass - this is a more general
replacement for how we were running topology repair before.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
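Illustration for the commit above (not code from the patch): a rough sketch of what "rewinding recovery" to run an explicitly requested pass could look like. The pass names, struct fs, ERR_RESTART_RECOVERY and the rule that topology repair only runs when requested are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>

enum { PASS_scan, PASS_topology_repair, PASS_check_a, PASS_check_b, PASS_NR };

#define ERR_RESTART_RECOVERY 1 /* hypothetical "rewind" return code */

struct fs {
	int      curr_pass;
	unsigned explicit_passes;  /* bitmask of explicitly requested passes */
	bool     topology_damaged;
	int      runs[PASS_NR];    /* how often each pass ran, for the demo */
};

/* Request a pass; if recovery is already at or past it, rewind to it. */
static int run_explicit_pass(struct fs *c, int pass)
{
	assert(pass >= 0 && pass < PASS_NR);
	c->explicit_passes |= 1u << pass;
	if (c->curr_pass >= pass) {
		c->curr_pass = pass;
		return ERR_RESTART_RECOVERY;
	}
	return 0;
}

static int run_one_pass(struct fs *c, int pass)
{
	c->runs[pass]++;
	switch (pass) {
	case PASS_topology_repair:
		c->topology_damaged = false;
		return 0;
	case PASS_check_b:
		/* A later pass notices damage and asks for an earlier pass. */
		return c->topology_damaged
			? run_explicit_pass(c, PASS_topology_repair)
			: 0;
	default:
		return 0;
	}
}

static void run_recovery(struct fs *c)
{
	c->curr_pass = 0;
	while (c->curr_pass < PASS_NR) {
		int pass = c->curr_pass;
		/* In this sketch topology repair runs only when requested. */
		bool wanted = pass != PASS_topology_repair ||
			      (c->explicit_passes & (1u << pass));

		if (wanted && run_one_pass(c, pass) == ERR_RESTART_RECOVERY)
			continue; /* curr_pass was rewound; do not advance */
		c->curr_pass++;
	}
}
```

With topology_damaged set on entry, the sketch runs topology repair once and runs both check passes twice; on a healthy filesystem the repair pass is never run.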
Kent Overstreet [Mon, 17 Jul 2023 03:21:17 +0000 (23:21 -0400)]
bcachefs: bch2_run_explicit_recovery_pass()
This introduces bch2_run_explicit_recovery_pass() and uses it for when
fsck detects that we need to re-run dead snapshots cleanup, and makes
dead snapshot cleanup more like a normal recovery pass.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Jul 2023 22:09:26 +0000 (18:09 -0400)]
bcachefs: Print version, options earlier in startup path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Wed, 19 Jul 2023 12:53:06 +0000 (08:53 -0400)]
bcachefs: use prejournaled key updates for write buffer flushes
The write buffer mechanism journals keys twice in certain
situations. A key is always journaled on write buffer insertion, and
is potentially journaled again if a write buffer flush falls into
either of the slow btree insert paths. This has been shown to cause
journal recovery ordering problems in the event of an untimely
crash.
For example, consider if a key is inserted into index 0 of a write
buffer, the active write buffer switches to index 1, the key is
deleted in index 1, and then index 0 is flushed. If the original key
is rejournaled in the btree update from the index 0 flush, the (now
deleted) key is journaled in a seq buffer ahead of the latest
version of the key (which was journaled when the key was deleted in
index 1). If the fs crashes while this is still observable in the
log, recovery sees the key from the btree update after the delete
key from the write buffer insert, which is the incorrect order. This
problem is occasionally reproduced by generic/388 and generally
manifests as one or more backpointer entry inconsistencies.
To avoid this problem, never rejournal write buffered key updates to
the associated btree. Instead, use prejournaled key updates to pass
the journal seq of the write buffer insert down to the btree insert,
which updates the btree leaf pin to reflect the seq of the key.
Note that tracking the seq is required instead of just using
NOJOURNAL here because otherwise we lose protection of the write
buffer pin when the buffer is flushed, which means the key can fall
off the tail of the on-disk journal before the btree leaf is flushed
and lead to similar recovery inconsistencies.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
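Illustration for the commit above (not code from the patch): a toy model of the ordering problem in the example. It assumes journal replay applies entries in seq order and that the last entry for a key wins; the struct, the function names and the seq numbers 1 to 3 are made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct jentry {
	uint64_t seq;
	int      key;
	bool     deleted;
};

/* Replay entries in seq order; the last entry for a key wins. */
static bool replay_key_exists(const struct jentry *j, size_t nr, int key)
{
	bool exists = false;

	for (size_t i = 0; i < nr; i++)
		if (j[i].key == key)
			exists = !j[i].deleted;
	return exists;
}

/* Build the journal for the scenario above; returns the number of entries. */
static size_t build_journal(struct jentry *j, size_t cap, bool rejournal_on_flush)
{
	size_t nr = 0;

	assert(cap >= 3);
	/* key inserted via write buffer index 0 */
	j[nr++] = (struct jentry) { .seq = 1, .key = 42, .deleted = false };
	/* key deleted via write buffer index 1 */
	j[nr++] = (struct jentry) { .seq = 2, .key = 42, .deleted = true };
	/* flush of index 0 through the slow path rejournals the stale key */
	if (rejournal_on_flush)
		j[nr++] = (struct jentry) { .seq = 3, .key = 42, .deleted = false };
	return nr;
}
```

With rejournal_on_flush the replayed result has the key present even though it was deleted; without the extra entry, which is the effect of the prejournaled update, the delete is the last word.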
Brian Foster [Wed, 19 Jul 2023 12:53:05 +0000 (08:53 -0400)]
bcachefs: support btree updates of prejournaled keys
Introduce support for prejournaled key updates. This allows a
transaction to commit an update for a key that already exists (and
is pinned) in the journal. This is required for btree write buffer
updates as the current scheme of journaling both on write buffer
insertion and write buffer (slow path) flush is unsafe in certain
crash recovery scenarios.
Create a small trans update wrapper to pass along the seq where the
key resides into the btree_insert_entry. From there, trans commit
passes the seq into the btree insert path where it is used to manage
the journal pin for the associated btree leaf.
Note that this patch only introduces the underlying mechanism and
otherwise includes no functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Wed, 19 Jul 2023 12:53:04 +0000 (08:53 -0400)]
bcachefs: fold bch2_trans_update_by_path_trace() into callers
There is only one other caller so eliminate some boilerplate.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Wed, 19 Jul 2023 12:53:03 +0000 (08:53 -0400)]
bcachefs: remove unnecessary btree_insert_key_leaf() wrapper
This is in preparation to support prejournaled keys. We want the
ability to optionally pass a seq stored in the btree update rather
than the seq of the committing transaction.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Wed, 19 Jul 2023 12:53:02 +0000 (08:53 -0400)]
bcachefs: remove duplicate code between backpointer update paths
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Thu, 20 Jul 2023 13:00:33 +0000 (09:00 -0400)]
MAINTAINERS: add Brian Foster as a reviewer for bcachefs
Brian has been playing with bcachefs for several months now and has
offered to commit time to patch review.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 02:31:19 +0000 (22:31 -0400)]
bcachefs: Suppress various error messages in no_data_io mode
We commonly use no_data_io mode when debugging filesystem metadata
dumps, where data checksum/compression errors are expected and
unimportant - this patch suppresses these.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 01:56:18 +0000 (21:56 -0400)]
bcachefs: Fix lookup_inode_for_snapshot()
This fixes a use-after-free.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 01:09:37 +0000 (21:09 -0400)]
bcachefs: need_snapshot_cleanup shouldn't be a fsck error
We currently don't track whether snapshot cleanup still needs to finish
(aside from running a full fsck), so it shouldn't be a fsck error yet -
fsck -n after fsck has successfully completed shouldn't error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 22:15:01 +0000 (18:15 -0400)]
bcachefs: Improve key_visible_in_snapshot()
Delete a redundant bch2_snapshot_is_ancestor() check, and convert some
assertions to debug assertions.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 19:12:25 +0000 (15:12 -0400)]
bcachefs: Refactor overlapping extent checks
Make the overlapping extent check/repair code more self contained.
This is prep work for hopefully reducing key_visible_in_snapshot() usage
here as well, and also includes a nice performance optimization to not
check ref_visible2() unless the extents potentially overlap.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 18:55:33 +0000 (14:55 -0400)]
bcachefs: check_extent(): don't use key_visible_in_snapshot()
This changes the main part of check_extents(), that checks the extent
against the corresponding inode, to not use key_visible_in_snapshot().
key_visible_in_snapshot() has to iterate over the list of ancestor
overwrites repeatedly calling bch2_snapshot_is_ancestor(), so this is a
significant performance improvement.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 18:45:23 +0000 (14:45 -0400)]
bcachefs: check_extent() refactoring
More prep work for reducing key_visible_in_snapshot() usage - this
rearranges how KEY_TYPE_whiteout keys are handled, so that they can be
marked off in inode_walker->inode->seen_this_pos.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 18:19:08 +0000 (14:19 -0400)]
bcachefs: fsck: walk_inode() now takes is_whiteout
We only want to synthesize an inode for the current snapshot ID for non
whiteouts - this refactoring lets us call walk_inode() earlier and clean
up some control flow.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Jul 2023 05:41:02 +0000 (01:41 -0400)]
bcachefs: Simplify check_extent()
Minor refactoring/dead code deletion, prep work for reworking
check_extent() to avoid key_visible_in_snapshot().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Jul 2023 07:11:16 +0000 (03:11 -0400)]
bcachefs: overlapping_extents_found()
This improves the repair path for overlapping extents - we now verify
that we find in the btree the overlapping extents that the algorithm
detected, and fail the fsck run with a more useful error if it doesn't
match.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 18:24:36 +0000 (14:24 -0400)]
bcachefs: fsck: inode_walker: last_pos, seen_this_pos
Prep work for changing check_extent() to avoid
key_visible_in_snapshot() - this adds the state to track whether an
inode has seen an extent at this pos.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 18:33:57 +0000 (14:33 -0400)]
bcachefs: check_extents(): make sure to check i_sectors for last inode
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 19:59:40 +0000 (15:59 -0400)]
bcachefs: Inline bch2_snapshot_is_ancestor() fast path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 01:03:26 +0000 (21:03 -0400)]
bcachefs: Upgrade path fixes
Some minor fixes to not print errors that are actually due to a version
upgrade.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Jul 2023 06:43:29 +0000 (02:43 -0400)]
bcachefs: is_ancestor bitmap
Further optimization for bch2_snapshot_is_ancestor(). We add a small
inline bitmap to snapshot_t, which indicates which of the next 128
snapshot IDs are ancestors of the current id - eliminating the last few
iterations of the loop in bch2_snapshot_is_ancestor().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
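Illustration for the commit above (not code from the patch): a sketch of an ancestor check with a 128-bit inline bitmap. It assumes that a parent always has a higher ID than its children, that a snapshot counts as its own ancestor, and a small fixed table; struct snapshot, table[] and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_SNAPSHOTS  512u /* hypothetical table size */
#define ANCESTOR_BITS 128u

struct snapshot {
	uint32_t parent;                          /* 0 = none; parent ID > own ID */
	uint64_t is_ancestor[ANCESTOR_BITS / 64]; /* bit i: ID + 1 + i is an ancestor */
};

static struct snapshot table[NR_SNAPSHOTS];

/* Record the parent and derive the bitmap; parents must be added first. */
static void snapshot_set_parent(uint32_t id, uint32_t parent)
{
	assert(id && id < NR_SNAPSHOTS && parent < NR_SNAPSHOTS);
	assert(!parent || parent > id);

	table[id].parent = parent;
	table[id].is_ancestor[0] = table[id].is_ancestor[1] = 0;
	for (uint32_t a = parent; a && a - id - 1 < ANCESTOR_BITS; a = table[a].parent) {
		uint32_t bit = a - id - 1;

		table[id].is_ancestor[bit / 64] |= UINT64_C(1) << (bit % 64);
	}
}

static bool snapshot_is_ancestor(uint32_t id, uint32_t ancestor)
{
	/* Too far for the bitmap: walk parent pointers until in range. */
	while (id && id + ANCESTOR_BITS < ancestor)
		id = table[id].parent;

	if (id && id < ancestor) {
		uint32_t bit = ancestor - id - 1;

		/* The bitmap answers the last stretch without further walking. */
		return (table[id].is_ancestor[bit / 64] & (UINT64_C(1) << (bit % 64))) != 0;
	}
	return id == ancestor;
}
```

This sketch walks plain parent pointers for the long-distance part; only the final lookup, where the candidate is within 128 IDs, uses the bitmap.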
Mikulas Patocka [Thu, 13 Jul 2023 16:00:28 +0000 (18:00 +0200)]
bcachefs: mark bch_inode_info and bkey_cached as reclaimable
Mark these caches as reclaimable, so that available memory is correctly
reported when there are a lot of cached inodes.
Note that more work is needed - you should add __GFP_RECLAIMABLE to some
of the kmalloc calls, so that they are allocated from the "kmalloc-rcl-*"
caches.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Jul 2023 02:27:16 +0000 (22:27 -0400)]
bcachefs: Compression levels
This allows including a compression level when specifying a compression
type, e.g.
compression=zstd:15
Values from 1 through 15 indicate compression levels, 0 or unspecified
indicates the default.
For LZ4, values 3-15 specify that the HC algorithm should be used.
Note that for compatibility, extents themselves only include the
compression type, not the compression level. This means that specifying
the same compression algorithm but different compression levels for the
compression and background_compression options will have no effect.
XXX: perhaps we could add a warning for this
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
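Illustration for the commit above (not code from the patch): a sketch of parsing a "type:level" option string such as zstd:15. The type list, struct compression_opt and parse_compression() are hypothetical; only the 0 to 15 range and the LZ4 rule (3 and up selects HC) come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type list for this sketch. */
enum compression_type {
	COMPRESSION_none,
	COMPRESSION_lz4,
	COMPRESSION_gzip,
	COMPRESSION_zstd,
	COMPRESSION_NR
};

static const char * const compression_names[COMPRESSION_NR] = {
	"none", "lz4", "gzip", "zstd"
};

struct compression_opt {
	enum compression_type type;
	unsigned              level; /* 0 = default, 1..15 = explicit level */
};

/* Parse "type" or "type:level"; returns 0 on success, -1 on error. */
static int parse_compression(const char *s, struct compression_opt *out)
{
	const char *colon = strchr(s, ':');
	size_t name_len = colon ? (size_t)(colon - s) : strlen(s);
	unsigned long level = 0;

	assert(out);
	if (colon) {
		char *end;

		if (colon[1] < '0' || colon[1] > '9')
			return -1; /* empty or non-numeric level */
		level = strtoul(colon + 1, &end, 10);
		if (*end != '\0' || level > 15)
			return -1;
	}

	for (int t = 0; t < COMPRESSION_NR; t++) {
		const char *name = compression_names[t];

		if (strlen(name) == name_len && !strncmp(s, name, name_len)) {
			out->type = (enum compression_type)t;
			out->level = (unsigned)level;
			return 0;
		}
	}
	return -1; /* unknown compression type */
}

/* For LZ4, levels 3-15 select the HC algorithm. */
static bool lz4_uses_hc(unsigned level)
{
	return level >= 3;
}
```

In this sketch "lz4" with no suffix parses to level 0, which stands for the default.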
Kent Overstreet [Thu, 13 Jul 2023 02:06:37 +0000 (22:06 -0400)]
bcachefs: Extend sb compression type fields to 8 bits
The upper 4 bits are for compression level.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
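Illustration for the commit above (not code from the patch): a sketch of an 8-bit field with the level in the upper 4 bits. Putting the type in the lower 4 bits is an assumption here, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a compression type (assumed low nibble) and level (high nibble). */
static uint8_t compression_field_pack(unsigned type, unsigned level)
{
	assert(type < 16 && level < 16);
	return (uint8_t)(type | (level << 4));
}

static unsigned compression_field_type(uint8_t v)
{
	return v & 0x0f;
}

static unsigned compression_field_level(uint8_t v)
{
	return v >> 4;
}
```

Under this layout a field that only ever held a type value below 16 decodes with level 0.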
Kent Overstreet [Thu, 13 Jul 2023 02:06:11 +0000 (22:06 -0400)]
bcachefs: bcachefs_format.h should be using __u64
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 12 Jul 2023 03:47:29 +0000 (23:47 -0400)]
bcachefs: fix_errors option is now a proper enum
Before, it was parsed as a bool but internally it was really an enum:
this lets us pass in all the possible values.
But we special case the option parsing: no supplied value is parsed as
FSCK_FIX_yes, to match the previous behaviour.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
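Illustration for the commit above (not code from the patch): a sketch of the special case in option parsing. Only FSCK_FIX_yes is named in the commit message; the other enum values, the name table and parse_fix_errors() are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* FSCK_FIX_yes is from the commit message; the rest are assumed. */
enum fsck_fix {
	FSCK_FIX_exit,
	FSCK_FIX_yes,
	FSCK_FIX_no,
	FSCK_FIX_ask,
	FSCK_FIX_NR
};

static const char * const fsck_fix_names[FSCK_FIX_NR] = {
	"exit", "yes", "no", "ask"
};

static const char *fsck_fix_name(int v)
{
	assert(v >= 0 && v < FSCK_FIX_NR);
	return fsck_fix_names[v];
}

/*
 * val == NULL or "" means the option was given without a value,
 * e.g. a bare "fix_errors". Returns the enum value, or -1 if unknown.
 */
static int parse_fix_errors(const char *val)
{
	if (!val || !*val)
		return FSCK_FIX_yes; /* matches the old bool behaviour */

	for (int i = 0; i < FSCK_FIX_NR; i++)
		if (!strcmp(val, fsck_fix_name(i)))
			return i;
	return -1;
}
```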
Kent Overstreet [Thu, 13 Jul 2023 01:48:32 +0000 (21:48 -0400)]
bcachefs: bch_opt_fn
Minor refactoring to get rid of some unneeded token pasting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 12 Jul 2023 17:55:03 +0000 (13:55 -0400)]
bcachefs: Convert snapshot table to RCU array
This switches the generic radix tree for the in-memory table of snapshot
nodes to a simple rcu array. This means we have to add new locking to
deal with reallocations, but is faster than traversing the radix tree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 12 Jul 2023 15:43:03 +0000 (11:43 -0400)]
bcachefs: Add a race_fault() for write buffer slowpath
We haven't hooked up dynamic fault injection quite yet, but we will soon.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 11 Jul 2023 00:30:04 +0000 (20:30 -0400)]
bcachefs: Add buffered IO fallback for userspace
In userspace, we want to be able to switch to buffered IO when we're
dealing with an image on a filesystem/device that doesn't support the
blocksize the filesystem was formatted with.
This plumbs through !opts.direct_io -> FMODE_BUFFERED, which will be
supported by the shim version of blkdev_get_by_path() in -tools, and it
adds a fallback to disable direct IO and retry for userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 10 Jul 2023 02:28:08 +0000 (22:28 -0400)]
bcachefs: Fallocate now checks page cache
Previously, fallocate would only check the state of the extents btree
when determining if we need to create a reservation.
But the page cache might already have dirty data or a disk reservation.
This changes __bchfs_fallocate() to call bch2_seek_pagecache_hole() to
check for this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 10 Jul 2023 21:23:59 +0000 (17:23 -0400)]
bcachefs: Don't start copygc until recovery is finished
With "bcachefs: Snapshot depth, skiplist fields", we now can't run data
move operations until after bch2_check_snapshots() is complete.
Ideally we'd have the copygc (and rebalance) threads wait until
c->curr_recovery_pass has advanced, but the waitlist handling is tricky
- so for now, move starting copygc back to read_write_late().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 10 Jul 2023 19:56:05 +0000 (15:56 -0400)]
bcachefs: Fix build error on weird gcc
fixes
./include/linux/stddef.h:8:14: error: positional initialization of field in ‘struct’ declared with ‘designated_init’ attribute [-Werror=designated-init]
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 25 Jun 2023 22:04:46 +0000 (18:04 -0400)]
bcachefs: Snapshot depth, skiplist fields
This extends KEY_TYPE_snapshot to include some new fields:
- depth, to indicate depth of this particular node from the root
- skip[3], skiplist entries for quickly walking back up to the root
These are to improve bch2_snapshot_is_ancestor(), making it O(ln(n))
instead of O(n) in the snapshot tree depth.
Skiplist nodes are picked at random from the set of ancestor nodes, not
some fixed fraction.
This introduces bcachefs_metadata_version 1.1, snapshot_skiplists.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
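Illustration for the commit above (not code from the patch): a sketch of an ancestor walk that uses depth and three skiplist entries. It assumes parents have higher IDs than their children and that a snapshot is its own ancestor; the table, the use of rand() to pick skip targets and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_SNAPSHOTS 256u /* hypothetical table size */

struct snapshot {
	uint32_t parent;  /* 0 = root; otherwise parent ID > own ID */
	uint32_t depth;   /* distance from the root */
	uint32_t skip[3]; /* randomly chosen ancestors, sorted ascending */
};

static struct snapshot table[NR_SNAPSHOTS];

/* Pick a random node from the set {parent, parent's ancestors}. */
static uint32_t random_ancestor(uint32_t parent)
{
	uint32_t steps = (uint32_t)rand() % (table[parent].depth + 1);

	while (steps--)
		parent = table[parent].parent;
	return parent;
}

static void sort3(uint32_t *v)
{
	for (int i = 0; i < 2; i++)
		for (int j = 0; j < 2 - i; j++)
			if (v[j] > v[j + 1]) {
				uint32_t t = v[j];

				v[j] = v[j + 1];
				v[j + 1] = t;
			}
}

/* Parents must be created before their children. */
static void snapshot_create(uint32_t id, uint32_t parent)
{
	struct snapshot *s = &table[id];

	assert(id && id < NR_SNAPSHOTS && parent < NR_SNAPSHOTS);
	assert(!parent || parent > id);

	s->parent = parent;
	s->depth = parent ? table[parent].depth + 1 : 0;
	for (int i = 0; i < 3; i++)
		s->skip[i] = parent ? random_ancestor(parent) : 0;
	sort3(s->skip);
}

/* Take the longest jump that does not go past the candidate ancestor. */
static uint32_t ancestor_below(uint32_t id, uint32_t ancestor)
{
	const struct snapshot *s = &table[id];

	for (int i = 2; i >= 0; i--)
		if (s->skip[i] && s->skip[i] <= ancestor)
			return s->skip[i];
	return s->parent;
}

static bool snapshot_is_ancestor(uint32_t id, uint32_t ancestor)
{
	while (id && id < ancestor)
		id = ancestor_below(id, ancestor);
	return id && id == ancestor;
}
```

The answer does not depend on which ancestors rand() happens to pick; the skip entries only change how many steps the walk takes.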
Kent Overstreet [Mon, 10 Jul 2023 17:42:26 +0000 (13:42 -0400)]
bcachefs: Version table now lists required recovery passes
Now that we've got forward compatibility sorted out, we should be doing
more frequent version upgrades in the future.
To avoid having to run a full fsck for every version upgrade, this
improves the BCH_METADATA_VERSIONS() table to explicitly specify a
bitmask of recovery passes to run when upgrading to or past a given
version.
This means we can also delete PASS_UPGRADE().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
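Illustration for the commit above (not code from the patch): a sketch of a version table that carries a bitmask of recovery passes per version. The version numbers, pass names and upgrade_recovery_passes() are invented for this example and do not reflect the real BCH_METADATA_VERSIONS() contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical recovery passes, one bit each. */
#define PASS_check_snapshots (UINT64_C(1) << 0)
#define PASS_check_inodes    (UINT64_C(1) << 1)
#define PASS_check_extents   (UINT64_C(1) << 2)

struct version_entry {
	uint32_t version;
	uint64_t recovery_passes; /* run when upgrading to or past this version */
};

/* Hypothetical table, in ascending version order. */
static const struct version_entry version_table[] = {
	{ 10, 0 },
	{ 11, PASS_check_snapshots },
	{ 12, PASS_check_inodes | PASS_check_extents },
	{ 13, 0 },
};

/* Union of the passes for every version in (old_version, new_version]. */
static uint64_t upgrade_recovery_passes(uint32_t old_version, uint32_t new_version)
{
	uint64_t passes = 0;

	assert(old_version <= new_version);
	for (size_t i = 0; i < sizeof(version_table) / sizeof(version_table[0]); i++)
		if (version_table[i].version > old_version &&
		    version_table[i].version <= new_version)
			passes |= version_table[i].recovery_passes;
	return passes;
}
```

An upgrade that crosses no version with a nonzero mask needs no extra passes, which is what avoids a full fsck on every upgrade.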
Kent Overstreet [Mon, 10 Jul 2023 16:23:01 +0000 (12:23 -0400)]
bcachefs: bch2_sb_maybe_downgrade(), bch2_sb_upgrade()
Add some new helpers, and fix upgrade/downgrade in bch2_fs_initialize().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 10 Jul 2023 15:17:56 +0000 (11:17 -0400)]
bcachefs: Fix a write buffer flush deadlock
We're not supposed to block if BTREE_INSERT_JOURNAL_RECLAIM && watermark
!= BCH_WATERMARK_reclaim.
This should really be a separate BTREE_INSERT_NONBLOCK flag - add some
comments to that effect, it's not important for this patch.
btree write buffer flush depends on this behaviour though - the first
loop tries to flush sequentially, which doesn't free up space in the
journal optimally. If that can't proceed we bail out and flush in
journal order - that won't work if we're blocked instead of returning an
error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 Jun 2023 02:09:35 +0000 (22:09 -0400)]
bcachefs: bcachefs_metadata_version_major_minor
This introduces major/minor versioning to the superblock version number.
Major version number changes indicate incompatible releases; we can move
forward to a new major version number, but not backwards. Minor version
numbers indicate compatible changes - these add features, but can still
be mounted and used by old versions.
With the recent patches that make it possible to roll out new btrees and
key types without breaking compatibility, we should be able to roll out
most new features without incompatible changes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
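Illustration for the commit above (not code from the patch): a sketch of a major/minor split of a single version number. The 10-bit minor field and the function names are assumptions, and the compatibility rule is a simplification of the description above (it ignores any minimum supported version).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VERSION_MINOR_BITS 10u /* assumption: low bits hold the minor number */

static uint32_t version_make(uint32_t major, uint32_t minor)
{
	assert(minor < (1u << VERSION_MINOR_BITS));
	return (major << VERSION_MINOR_BITS) | minor;
}

static uint32_t version_major(uint32_t v)
{
	return v >> VERSION_MINOR_BITS;
}

static uint32_t version_minor(uint32_t v)
{
	return v & ((1u << VERSION_MINOR_BITS) - 1);
}

/*
 * A filesystem is usable by the running code if its major version is not
 * newer; a newer minor version alone does not block mounting.
 */
static bool version_compatible(uint32_t on_disk, uint32_t running)
{
	return version_major(on_disk) <= version_major(running);
}
```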
Kent Overstreet [Sun, 9 Jul 2023 19:13:30 +0000 (15:13 -0400)]
bcachefs: Add new assertions for shutdown path
We've been seeing assertions pop that indicate the btree node cache or
key cache have dirty items when we just did a clean shutdown.
Add some more assertions so we can catch this when we're dirtying items.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Jul 2023 18:18:28 +0000 (14:18 -0400)]
bcachefs: bch2_xattr_set() now updates ctime
Fixes fstests generic/728
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Jul 2023 18:12:58 +0000 (14:12 -0400)]
bcachefs: Kill bch2_xattr_get()
Inline it into the only caller
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Jul 2023 17:49:34 +0000 (13:49 -0400)]
bcachefs: Fix try_decrease_writepoints()
We were freeing open buckets on the writepoint list, but forgetting to
take them off the writepoint list - whoops
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Jul 2023 17:20:29 +0000 (13:20 -0400)]
bcachefs: Mark as EXPERIMENTAL
As discussed on list, bcachefs is going to be marked as experimental for
a few releases, until the inevitable tide of new bug reports subsides.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 7 Jul 2023 06:42:28 +0000 (02:42 -0400)]
bcachefs: Enumerate recovery passes
Recovery and fsck have many different passes/jobs to do, which always
run in the same order - but not all of them run all the time. Some are
for fsck, some for unclean shutdown, some for version upgrades.
This adds some new structure: a defined list of recovery passes that we
can run in a loop, as well as consolidating the log messages.
The main benefit is consolidating the "should run this recovery pass"
logic, as well as cleaning up the "this recovery pass has finished"
state; instead of having a bunch of ad-hoc state bits in c->flags, we've
now got c->curr_recovery_pass.
By consolidating the "should run this recovery pass" logic, in the
future on disk format upgrades will be able to say "upgrading to this
version requires x passes to run", instead of forcing all of fsck to
run.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
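Illustration for the commit above (not code from the patch): a sketch of a defined list of passes run in a loop, with a single curr_recovery_pass field tracking progress. The pass names, the WHEN_* conditions and struct fs are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical conditions under which a pass should run. */
enum pass_when {
	WHEN_ALWAYS  = 1 << 0,
	WHEN_UNCLEAN = 1 << 1, /* after an unclean shutdown */
	WHEN_FSCK    = 1 << 2, /* when fsck was requested */
};

/* One line per pass, in execution order; names are illustrative. */
#define RECOVERY_PASSES()					\
	x(alloc_read,         WHEN_ALWAYS)			\
	x(journal_replay,     WHEN_ALWAYS)			\
	x(check_inodes,       WHEN_FSCK)			\
	x(delete_dead_inodes, WHEN_UNCLEAN | WHEN_FSCK)

enum recovery_pass {
#define x(name, when) PASS_##name,
	RECOVERY_PASSES()
#undef x
	PASS_NR
};

static const unsigned pass_conditions[PASS_NR] = {
#define x(name, when) [PASS_##name] = (when),
	RECOVERY_PASSES()
#undef x
};

struct fs {
	unsigned curr_recovery_pass; /* replaces ad-hoc "pass X done" flags */
	unsigned passes_run;         /* bitmask, recorded for the demo */
};

/* Run every pass whose condition matches; returns how many ran. */
static size_t run_recovery_passes(struct fs *c, unsigned conditions)
{
	size_t nr = 0;

	c->passes_run = 0;
	for (c->curr_recovery_pass = 0;
	     c->curr_recovery_pass < PASS_NR;
	     c->curr_recovery_pass++) {
		unsigned p = c->curr_recovery_pass;

		if (pass_conditions[p] & (WHEN_ALWAYS | conditions)) {
			c->passes_run |= 1u << p; /* a real pass would do work here */
			nr++;
		}
	}
	assert(c->curr_recovery_pass == PASS_NR);
	return nr;
}
```

The "should this pass run" decision lives in one table lookup, so adding a pass or a new condition means touching only the list.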
Kent Overstreet [Sun, 9 Jul 2023 02:33:29 +0000 (22:33 -0400)]
bcachefs: Stash journal replay params in bch_fs
For the upcoming enumeration of recovery passes, we need all recovery
passes to be called the same way - including journal replay.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Jul 2023 02:27:03 +0000 (22:27 -0400)]
bcachefs: Kill bch2_bucket_gens_read()
This folds bch2_bucket_gens_read() into bch2_alloc_read(), doing the
version check there.
This is prep work for enumerating all recovery passes: we need some
cleanup first to make calling all the recovery passes consistent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Jul 2023 02:21:45 +0000 (22:21 -0400)]
bcachefs: Fix error path in bch2_journal_flush_device_pins()
We need to always call bch2_replicas_gc_end() after we've called
bch2_replicas_gc_start(), else we leave state around that needs to be
cleaned up.
Partial fix for: https://github.com/koverstreet/bcachefs/issues/560
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 Jun 2023 03:34:02 +0000 (23:34 -0400)]
bcachefs: version_upgrade is now an enum
The version_upgrade parameter is now an enum, not a bool, and it's
persistent in the superblock:
- compatible (default): upgrade to the latest compatible version
- incompatible: upgrade to latest incompatible version
- none
Currently all upgrades are incompatible upgrades, but the next release
will introduce major:minor versions.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 Jun 2023 23:59:56 +0000 (19:59 -0400)]
bcachefs: BCH_SB_VERSION_UPGRADE_COMPLETE()
Version upgrades are not atomic operations: when we do a version upgrade
we need to update the superblock before we start using new features, and
then when the upgrade completes we need to update the superblock again.
This adds a new superblock field so we can detect and handle incomplete
version upgrades.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 7 Jul 2023 21:09:26 +0000 (17:09 -0400)]
bcachefs: Convert more -EROFS to private error codes
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 7 Jul 2023 08:38:29 +0000 (04:38 -0400)]
bcachefs: Delete redundant log messages
Now that we have distinct error codes for different memory allocation
failures, the early init log messages are no longer needed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 7 Jul 2023 01:16:10 +0000 (21:16 -0400)]
bcachefs: Change check for invalid key types
As part of the forward compatibility patch series, we need to allow for
new key types without complaining loudly when running an old version.
This patch changes the flags parameter of bkey_invalid to an enum, and
adds a new flag to indicate we're being called from the transaction
commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 7 Jul 2023 02:47:42 +0000 (22:47 -0400)]
bcachefs: Assorted sparse fixes
- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 7 Jul 2023 00:11:36 +0000 (20:11 -0400)]
bcachefs: Refactor bch_sb_field_ops handling
This changes bch_sb_field_ops lookup to match how bkey_ops now works;
for an unknown field type we return an empty ops struct.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 6 Jul 2023 23:23:27 +0000 (19:23 -0400)]
bcachefs: Allow for unknown key types
This adds a new helper for looking up bkey_ops for a given key type, which
returns a null bkey_ops for unknown key types; various bkey_ops users
are tweaked as well to handle unknown key types.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
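Illustration for the commit above (not code from the patch): a sketch of an ops lookup that returns an empty ops struct for unknown key types, so callers skip the missing hooks instead of failing. struct key, struct key_ops, the two key types and the toy validation rule are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct key {
	uint8_t  type;
	uint32_t size;
};

struct key_ops {
	const char *name;
	int (*invalid)(const struct key *k); /* 0 = valid, nonzero = invalid */
};

/* Toy rule for the demo: extents must be non-empty. */
static int extent_invalid(const struct key *k)
{
	return k->size ? 0 : -1;
}

enum { KEY_TYPE_deleted, KEY_TYPE_extent, KEY_TYPE_NR }; /* types this version knows */

static const struct key_ops known_ops[KEY_TYPE_NR] = {
	[KEY_TYPE_deleted] = { .name = "deleted" },
	[KEY_TYPE_extent]  = { .name = "extent", .invalid = extent_invalid },
};

/* Returned for key types written by a newer version. */
static const struct key_ops null_ops = { .name = NULL, .invalid = NULL };

static const struct key_ops *key_type_ops(unsigned type)
{
	return type < KEY_TYPE_NR ? &known_ops[type] : &null_ops;
}

/* Callers check each hook before using it, so unknown types pass through. */
static int key_invalid(const struct key *k)
{
	const struct key_ops *ops;

	assert(k);
	ops = key_type_ops(k->type);
	return ops->invalid ? ops->invalid(k) : 0;
}
```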
Kent Overstreet [Thu, 29 Jun 2023 02:09:13 +0000 (22:09 -0400)]
bcachefs: Allow for unknown btree IDs
We need to allow filesystems with metadata from newer versions to be
mountable and usable by older versions.
This patch enables us to roll out new btrees without a new major version
number; we can now handle btree roots for unknown btree types.
The unknown btree roots will be retained, and fsck (including
backpointers) will check them, the same as other btree types.
We add a dynamic array for the extra, unknown btree roots, in addition
to the fixed size btree root array, and add new helpers for looking up
btree roots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
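Illustration for the commit above (not code from the patch): a sketch of a fixed array for known btree roots plus a dynamic array for unknown ones, with lookup helpers. BTREE_ID_NR, struct btree_root, struct fs and the helper names are hypothetical, and realloc() stands in for whatever allocator the real code uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BTREE_ID_NR 4u /* btree types this version knows about (hypothetical) */

struct btree_root {
	uint64_t root_ptr;
};

struct fs {
	struct btree_root  known[BTREE_ID_NR];
	struct btree_root *extra;    /* roots for IDs >= BTREE_ID_NR */
	size_t             nr_extra;
};

/* Look up a root without allocating; NULL if this ID has never been seen. */
static struct btree_root *btree_root_find(struct fs *c, unsigned id)
{
	assert(c);
	if (id < BTREE_ID_NR)
		return &c->known[id];
	return id - BTREE_ID_NR < c->nr_extra
		? &c->extra[id - BTREE_ID_NR]
		: NULL;
}

/* Look up a root, growing the dynamic array if needed; NULL on allocation failure. */
static struct btree_root *btree_root_get(struct fs *c, unsigned id)
{
	assert(c);
	if (id >= BTREE_ID_NR && id - BTREE_ID_NR >= c->nr_extra) {
		size_t nr = id - BTREE_ID_NR + 1;
		struct btree_root *n = realloc(c->extra, nr * sizeof(*n));

		if (!n)
			return NULL;
		/* Zero the newly added slots. */
		memset(n + c->nr_extra, 0, (nr - c->nr_extra) * sizeof(*n));
		c->extra = n;
		c->nr_extra = nr;
	}
	return btree_root_find(c, id);
}
```

Known IDs never touch the allocator; only a root for an ID at or beyond BTREE_ID_NR makes the extra array grow.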
Brian Foster [Fri, 30 Jun 2023 17:09:46 +0000 (13:09 -0400)]
bcachefs: flush journal to avoid invalid dev usage entries on recovery
A crash immediately after device removal can result in an
unmountable filesystem due to recovery failure. The following
command reliably reproduces on a multi-device fs:
bcachefs device remove <dev> && xfs_io -xc shutdown <mnt>
The post-crash mount fails with an error similar to the following,
reported by fsck:
invalid journal entry dev_usage at offset 7994/8034 seq 12: bad dev, fixing
This refers to a device usage entry in the journal that refers to
the index of the just removed device. Recovery considers this an
invalid entry and fails to proceed.
Device usage entries are added to journal buffer writes via
bch_journal_write() -> bch2_journal_super_entries_add_common(),
which means any journal buffer write has content that refers to
member devices at the time of the journal write.
The device remove sequence already removes metadata references to
the device being removed. It then flushes any pins that refer to the
device, clears replica entries, removes the in-memory device object
and lastly updates the superblock to reflect that the device is no
longer present. The problem is that any journal writes that occur
during this sequence will include a dev usage entry so long as the
device is present. To avoid this problem, we can flush the journal
once more after the device entry is removed from the in-core
structures, but before the superblock is updated to fully remove the
device on-disk.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Fri, 30 Jun 2023 14:51:46 +0000 (10:51 -0400)]
bcachefs: mark active journal devices on journal replicas gc
A simple device evacuate, remove, add test loop with concurrent
shutdowns occasionally reproduces a problem where the filesystem
fails to mount. The mount failure occurs because the filesystem was
uncleanly shut down, yet no member device is marked for journal data
in the superblock. An fsck detects the problem, restores the mark
and allows the mount to proceed without further consistency issues.
The reason for the lack of journal data marks is the gc mechanism
invoked via bch2_journal_flush_device_pins() runs while the journal
happens to be empty. This results in garbage collection of all journal
replicas entries. Once the updated replicas table is written to the
superblock, the filesystem is put in a transiently unrecoverable state
until further journal data is written, because journal recovery expects
to find at least one marked journal device whenever the filesystem is
not otherwise marked clean (i.e. as on clean unmount).
To fix this problem, update the journal replicas gc algorithm to always
mark currently active journal replicas entries by writing to the
journal. This ensures that only entries for devices that are no longer
used for journaling are garbage collected, not just those that don't
happen to currently hold journal data. This preserves the journal
recovery invariant above and avoids putting the fs into a transiently
unrecoverable state.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 29 Jun 2023 00:27:07 +0000 (20:27 -0400)]
bcachefs: bch2_version_compatible()
This adds a new helper for checking if an on-disk version is compatible
with the running version of bcachefs - prep work for introducing
major:minor version numbers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 Jun 2023 23:53:05 +0000 (19:53 -0400)]
bcachefs: bch2_version_to_text()
Add a new helper for printing out metadata versions in a standard
format.
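
As a rough illustration of what a single formatting helper buys, the sketch below prints a packed version as "major.minor". The output format, the 10-bit minor field and version_to_text() are assumptions for this example; the commit does not specify the actual format.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define VERSION_MINOR_BITS 10 /* hypothetical field width */

/*
 * Format a packed version into buf as "major.minor" (hypothetical format).
 * Returns the number of characters snprintf() would have written.
 */
static int version_to_text(char *buf, size_t len, uint16_t version)
{
    assert(buf && len > 0);

    unsigned major = version >> VERSION_MINOR_BITS;
    unsigned minor = version & ((1U << VERSION_MINOR_BITS) - 1);

    return snprintf(buf, len, "%u.%u", major, minor);
}
```

Routing every caller through one helper keeps log messages and superblock dumps consistent if the format changes later.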
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Jun 2023 21:32:48 +0000 (17:32 -0400)]
bcachefs: Kill BTREE_INSERT_USE_RESERVE
Now that we have journal watermarks and alloc watermarks unified,
BTREE_INSERT_USE_RESERVE is redundant and can be deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 Jun 2023 04:01:19 +0000 (00:01 -0400)]
bcachefs: Fix a null ptr deref in bch2_fs_alloc() error path
This fixes a null ptr deref in bch2_free_pending_node_rewrites() when
the list head wasn't initialized.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 Jun 2023 03:28:17 +0000 (23:28 -0400)]
bcachefs: Fix a format string warning
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Jun 2023 21:32:38 +0000 (17:32 -0400)]
bcachefs: Kill JOURNAL_WATERMARK
This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards
specifying watermarks once in the transaction commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Jun 2023 21:29:20 +0000 (17:29 -0400)]
bcachefs: BCH_WATERMARK_reclaim
Add another watermark for journal reclaim - this is needed for the next
patches, which unify BCH_WATERMARK with JOURNAL_WATERMARK.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Jun 2023 23:02:17 +0000 (19:02 -0400)]
bcachefs: struct bch_extent_rebalance
This adds the extent entry for extents that rebalance needs to do
something with.
We're adding this ahead of the main rebalance_work patchset, because
adding new extent entries can't be done in a forwards-compatible way.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Jun 2023 22:01:09 +0000 (18:01 -0400)]
bcachefs: Expand BTREE_NODE_ID
We now have 20 bits for the btree ID in the on-disk format - sufficient
for 1 million distinct btrees.
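
The "1 million" figure follows from 2^20 = 1,048,576. A minimal sketch of the arithmetic, where BTREE_ID_BITS takes the width from the message above and the helper names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BTREE_ID_BITS 20 /* width stated in the commit message */

static_assert(BTREE_ID_BITS < 32, "shift below must not overflow 32 bits");

/* Number of distinct IDs representable in BTREE_ID_BITS bits. */
static uint32_t btree_id_count(void)
{
    return UINT32_C(1) << BTREE_ID_BITS;
}

/* True if id can be stored in a BTREE_ID_BITS-wide field. */
static bool btree_id_fits(uint32_t id)
{
    return id < btree_id_count();
}
```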
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Jun 2023 23:10:24 +0000 (19:10 -0400)]
bcachefs: Fix btree node write error message
Error messages should include the error code, when available.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 25 Jun 2023 20:35:49 +0000 (16:35 -0400)]
bcachefs: fsck: Break walk_inode() up into multiple functions
Some refactoring, prep work for algorithm improvements related to
snapshots.
We need to add a bitmap to the list of inodes for "seen this snapshot";
for this bitmap to correctly be available, we'll need to gather the list
of inodes first, and later look up the inode for a given snapshot.
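
The two-phase shape described above (gather first, look up later, record what was seen) can be sketched as follows. This is a simplified model, not the fsck code: struct inode_list, MAX_INODES and both helpers are hypothetical, and the lookup matches snapshot IDs exactly instead of walking snapshot ancestry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INODES 64 /* hypothetical cap so the bitmap fits in one word */

struct inode_list {
    uint32_t snapshot[MAX_INODES]; /* snapshot ID of each gathered inode */
    size_t   nr;
    uint64_t seen;                 /* bit i set: entry i has been looked up */
};

/* Phase 1: gather every version of the inode before any lookup happens. */
static void inode_list_add(struct inode_list *l, uint32_t snapshot)
{
    assert(l->nr < MAX_INODES);
    l->snapshot[l->nr++] = snapshot;
}

/*
 * Phase 2: find the entry for a snapshot and record it in the bitmap.
 * Returns the entry index, or -1 if no entry matches.
 */
static int inode_list_lookup(struct inode_list *l, uint32_t snapshot)
{
    for (size_t i = 0; i < l->nr; i++)
        if (l->snapshot[i] == snapshot) {
            l->seen |= UINT64_C(1) << i;
            return (int)i;
        }
    return -1;
}
```

Because the list is complete before the first lookup, the bitmap indices stay stable for the whole walk.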
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Jun 2023 20:20:05 +0000 (16:20 -0400)]
bcachefs: Fix leak in backpointers fsck
We were forgetting to exit a printbuf - whoops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Jun 2023 03:31:49 +0000 (23:31 -0400)]
bcachefs: unregister_shrinker() now safe on not-registered shrinker
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Jun 2023 03:10:21 +0000 (23:10 -0400)]
bcachefs: Add a missing rhashtable_destroy() call
Fixes https://lore.kernel.org/linux-bcachefs/784c3e6a-75bd-e6ca-535a-43b3e1daf643@kernel.dk/T/#mbf7caf005f960018eba23b58795d06c06c947411
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 26 Jun 2023 22:36:24 +0000 (18:36 -0400)]
bcachefs: Improve bch2_bkey_make_mut()
bch2_bkey_make_mut() now takes the bkey_s_c by reference and points it
at the new, mutable key.
This helps in some fsck paths that may have multiple repair operations
on the same key.
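
The calling convention can be illustrated with a reduced model. The types and key_make_mut() below are hypothetical stand-ins for bkey_s_c and bch2_bkey_make_mut(), and the caller supplies the storage here, which is an assumption made only to keep the sketch self-contained.

```c
#include <assert.h>

struct key {
    int val;
};

/* Read-only reference to a key, standing in for bkey_s_c. */
struct key_ref {
    const struct key *k;
};

/*
 * Copy the referenced key into caller-provided storage and repoint the
 * reference at the copy, so later readers of *ref see the mutable key.
 */
static struct key *key_make_mut(struct key_ref *ref, struct key *storage)
{
    assert(ref && ref->k && storage);

    *storage = *ref->k;
    ref->k = storage;
    return storage;
}
```

After the call, a second repair step that reads through the same reference sees the first step's modification instead of the stale original key.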
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Jun 2023 02:26:04 +0000 (22:26 -0400)]
bcachefs: Reduce stack frame size of bch2_check_alloc_info()
Excessive inlining may (on some versions of gcc?) cause excessive stack
usage; this turns off some inlining in bch2_check_alloc_info().
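
The general technique is to mark helpers with large locals as non-inlined, so their stack usage is not merged into the caller's frame. A minimal sketch using the GCC/Clang noinline attribute; check_one() and check_all() are hypothetical and unrelated to the real function bodies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Helper with a large local buffer. Keeping it out of line means the 512
 * bytes are only on the stack while this function is actually running.
 */
static __attribute__((noinline)) size_t check_one(const char *name)
{
    char scratch[512];
    size_t len = strlen(name);

    assert(len < sizeof(scratch));
    memcpy(scratch, name, len + 1);
    return strlen(scratch);
}

/* Caller whose own frame stays small because check_one() is not inlined. */
static size_t check_all(const char *const *names, size_t nr)
{
    size_t total = 0;

    for (size_t i = 0; i < nr; i++)
        total += check_one(names[i]);
    return total;
}
```

When several such helpers are inlined into one function, the compiler may reserve stack space for all of their locals at once, which is the effect the commit works around.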
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 25 Jun 2023 05:34:45 +0000 (01:34 -0400)]
bcachefs: fsck needs BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE
A few fsck paths weren't using BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE -
oops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 25 Jun 2023 03:22:20 +0000 (23:22 -0400)]
bcachefs: Improve error message for overlapping extents
We now print out the full previous extent we overlap with, to aid in
debugging and searching through the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 25 Jun 2023 03:20:39 +0000 (23:20 -0400)]
bcachefs: Fix check_pos_snapshot_overwritten()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 24 Jun 2023 23:30:10 +0000 (19:30 -0400)]
bcachefs: Rename enum alloc_reserve -> bch_watermark
This is prep work for consolidating with JOURNAL_WATERMARK.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 24 Jun 2023 19:59:03 +0000 (15:59 -0400)]
bcachefs: BCH_ERR_fsck -> EINVAL
When we return errors outside of bcachefs, we need to return a standard
error code - fix this for BCH_ERR_fsck.
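
The idea of translating private error codes at the subsystem boundary can be sketched like this. The numbering, ERR_fsck and err_to_errno() are hypothetical; only the fsck to EINVAL mapping comes from the message above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical private error codes, numbered above the standard errnos. */
enum {
    ERR_PRIVATE_START = 2048,
    ERR_fsck          = ERR_PRIVATE_START,
};

/*
 * Translate a negative error code for callers outside the filesystem.
 * Private codes are mapped to a standard errno; standard codes and 0
 * pass through unchanged.
 */
static int err_to_errno(int err)
{
    assert(err <= 0);

    switch (-err) {
    case ERR_fsck:
        return -EINVAL;
    default:
        return err;
    }
}
```

Internal callers can keep using the fine-grained private code, while code outside the filesystem only ever sees values it knows how to interpret.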
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 24 Jun 2023 16:17:57 +0000 (12:17 -0400)]
bcachefs: bch2_trans_mark_pointer() refactoring
bch2_bucket_backpointer_mod() doesn't need to update the alloc key, so
we can exit the alloc iter earlier.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>