linux.git
12 months agobcachefs: BTREE_ID_subvolume_children
Kent Overstreet [Sun, 21 Jan 2024 11:00:07 +0000 (06:00 -0500)]
bcachefs: BTREE_ID_subvolume_children

Add a btree to record a parent -> child subvolume relationships,
according to the filesystem heirarchy.

The subvolume_children btree is a bitset btree: if a bit is set at pos
p, that means p.offset is a child of subvolume p.inode.

This will be used for efficiently listing subvolumes, as well as
recursive deletion.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: bch_subvolume::fs_path_parent
Kent Overstreet [Thu, 8 Feb 2024 23:39:42 +0000 (18:39 -0500)]
bcachefs: bch_subvolume::fs_path_parent

Record the filesystem path heirarchy for subvolumes in bch_subvolume

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: bch2_btree_bit_mod()
Kent Overstreet [Fri, 9 Feb 2024 00:23:56 +0000 (19:23 -0500)]
bcachefs: bch2_btree_bit_mod()

Provide a non-write buffer version of bch2_btree_bit_mod_buffered(), for
the subvolume children btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: bch2_btree_bit_mod -> bch2_btree_bit_mod_buffered
Kent Overstreet [Fri, 9 Feb 2024 00:10:19 +0000 (19:10 -0500)]
bcachefs: bch2_btree_bit_mod -> bch2_btree_bit_mod_buffered

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Correctly reattach subvolumes
Kent Overstreet [Fri, 9 Feb 2024 21:04:50 +0000 (16:04 -0500)]
bcachefs: Correctly reattach subvolumes

Subvolumes need special handling to reattach - we always reattach them
in the root subvolume's lost+found, and they need a slightly different
kind of dirent.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: check_path() now prints full inode when reattaching
Kent Overstreet [Fri, 9 Feb 2024 04:08:21 +0000 (23:08 -0500)]
bcachefs: check_path() now prints full inode when reattaching

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Pass inode bkey to check_path()
Kent Overstreet [Fri, 9 Feb 2024 03:52:40 +0000 (22:52 -0500)]
bcachefs: Pass inode bkey to check_path()

prep work for improving logging/error messages

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Fix path where dirent -> subvol missing and we don't fix
Kent Overstreet [Fri, 9 Feb 2024 00:52:37 +0000 (19:52 -0500)]
bcachefs: Fix path where dirent -> subvol missing and we don't fix

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: bch_subvolume::parent -> creation_parent
Kent Overstreet [Mon, 22 Jan 2024 20:12:28 +0000 (15:12 -0500)]
bcachefs: bch_subvolume::parent -> creation_parent

bit of renaming, prep for adding a fs path parent

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Repair subvol dirents that point to non subvols
Kent Overstreet [Sun, 21 Jan 2024 19:57:58 +0000 (14:57 -0500)]
bcachefs: Repair subvol dirents that point to non subvols

when repair switches d_type to or from DT_SUBVOL, we need to update the
target accordingly

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: check dirent->d_parent_subvol
Kent Overstreet [Wed, 7 Feb 2024 05:45:09 +0000 (00:45 -0500)]
bcachefs: check dirent->d_parent_subvol

Check that d_parent_subvol makes sense - the dirent's snapshot must be
visible in d_parent_subvol (i.e. an ancestor of d_parent_subvol's
snapshot) in order to be visible.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: check inode->bi_parent_subvol against dirent
Kent Overstreet [Wed, 7 Feb 2024 05:23:25 +0000 (00:23 -0500)]
bcachefs: check inode->bi_parent_subvol against dirent

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: delete duplicated checks in check_dirent_to_subvol()
Kent Overstreet [Wed, 7 Feb 2024 05:06:14 +0000 (00:06 -0500)]
bcachefs: delete duplicated checks in check_dirent_to_subvol()

these were already checked in check_subvol()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: simplify check_dirent_inode_dirent()
Kent Overstreet [Wed, 7 Feb 2024 04:51:23 +0000 (23:51 -0500)]
bcachefs: simplify check_dirent_inode_dirent()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: check bi_parent_subvol in check_inode()
Kent Overstreet [Wed, 7 Feb 2024 04:41:46 +0000 (23:41 -0500)]
bcachefs: check bi_parent_subvol in check_inode()

check for inodes with a nonzero bi_parent_subvol field that aren't
actually subvolume roots

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: better log message in lookup_inode_for_snapshot()
Kent Overstreet [Thu, 8 Feb 2024 21:02:08 +0000 (16:02 -0500)]
bcachefs: better log message in lookup_inode_for_snapshot()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: check_inode_dirent_inode()
Kent Overstreet [Wed, 7 Feb 2024 04:39:08 +0000 (23:39 -0500)]
bcachefs: check_inode_dirent_inode()

check that if an inode has a backpointer, the dirent it points to points
back to it.

We do this in check_dirent_inode_dirent(), but only for inodes that have
dirents that point to them - we also have to do the check starting from
the inode to catch inodes that don't have dirents that point to them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Check subvol <-> inode pointers in check_inode()
Kent Overstreet [Tue, 6 Feb 2024 03:30:51 +0000 (22:30 -0500)]
bcachefs: Check subvol <-> inode pointers in check_inode()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Check subvol <-> inode pointers in check_subvol()
Kent Overstreet [Tue, 6 Feb 2024 03:20:12 +0000 (22:20 -0500)]
bcachefs: Check subvol <-> inode pointers in check_subvol()

Subvolumes and subvolume root inodes point to each other: this verifies
the subvolume -> inode -> subvolme path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Kill more -EIO error codes
Kent Overstreet [Tue, 6 Feb 2024 22:24:18 +0000 (17:24 -0500)]
bcachefs: Kill more -EIO error codes

This converts -EIOs related to btree node errors to private error codes,
which will help with some ongoing debugging by giving us better error
messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: thread_with_file: add f_ops.flush
Kent Overstreet [Sun, 18 Feb 2024 01:49:11 +0000 (20:49 -0500)]
bcachefs: thread_with_file: add f_ops.flush

Add a flush op, to return the exit code via close().

Also update bcachefs usage to use this to return fsck exit codes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: thread_with_file: Fix missing va_end()
Kent Overstreet [Wed, 14 Feb 2024 01:26:09 +0000 (20:26 -0500)]
bcachefs: thread_with_file: Fix missing va_end()

Fixes: https://lore.kernel.org/linux-bcachefs/202402131603.E953E2CF@keescook/T/#u
Reported-by: coverity scan
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: thread_with_file: allow ioctls against these files
Darrick J. Wong [Sat, 10 Feb 2024 19:32:20 +0000 (11:32 -0800)]
bcachefs: thread_with_file: allow ioctls against these files

Make it so that a thread_with_stdio user can handle ioctls against the
file descriptor.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: thread_with_file: create ops structure for thread_with_stdio
Darrick J. Wong [Sat, 10 Feb 2024 19:23:01 +0000 (11:23 -0800)]
bcachefs: thread_with_file: create ops structure for thread_with_stdio

Create an ops structure so we can add more file-based functionality in
the next few patches.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: thread_with_file: fix various printf problems
Darrick J. Wong [Wed, 7 Feb 2024 19:39:03 +0000 (11:39 -0800)]
bcachefs: thread_with_file: fix various printf problems

Experimentally fix some problems with stdio_redirect_vprintf by creating
a MOO variant with which we can experiment.  We can't do a GFP_KERNEL
allocation while holding the spinlock, and I don't like how the printf
function can silently truncate the output if memory allocation fails.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: thread_with_file: allow creation of readonly files
Darrick J. Wong [Wed, 7 Feb 2024 19:43:32 +0000 (11:43 -0800)]
bcachefs: thread_with_file: allow creation of readonly files

Create a new run_thread_with_stdout function that opens a file in
O_RDONLY mode so that the kernel can write things to userspace but
userspace cannot write to the kernel.  This will be used to convey xfs
health event information to userspace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: thread_with_stdio: suppress hung task warning
Kent Overstreet [Fri, 9 Feb 2024 01:41:34 +0000 (20:41 -0500)]
bcachefs: thread_with_stdio: suppress hung task warning

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agokernel/hung_task.c: export sysctl_hung_task_timeout_secs
Kent Overstreet [Fri, 9 Feb 2024 06:04:38 +0000 (01:04 -0500)]
kernel/hung_task.c: export sysctl_hung_task_timeout_secs

needed for thread_with_file; also rare but not unheard of to need this
in module code, when blocking on user input.

one workaround used by some code is wait_event_interruptible() - but
that can be buggy if the outer context isn't expecting unwinding.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: fuyuanli <fuyuanli@didiglobal.com>
12 months agobcachefs: thread_with_stdio: Mark completed in ->release()
Kent Overstreet [Fri, 9 Feb 2024 01:27:06 +0000 (20:27 -0500)]
bcachefs: thread_with_stdio: Mark completed in ->release()

This fixes stdio_redirect_read() getting stuck, not noticing that the
pipe has been closed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Thread with file documentation
Kent Overstreet [Sat, 3 Feb 2024 20:43:16 +0000 (15:43 -0500)]
bcachefs: Thread with file documentation

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: thread_with_stdio: fix bch2_stdio_redirect_readline()
Kent Overstreet [Mon, 5 Feb 2024 03:56:16 +0000 (22:56 -0500)]
bcachefs: thread_with_stdio: fix bch2_stdio_redirect_readline()

This fixes a bug where we'd return data without waiting for a newline,
if data was present but a newline was not.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: thread_with_stdio: kill thread_with_stdio_done()
Kent Overstreet [Mon, 5 Feb 2024 03:49:34 +0000 (22:49 -0500)]
bcachefs: thread_with_stdio: kill thread_with_stdio_done()

Move the cleanup code to a wrapper function, where we can call it after
the thread_with_stdio fn exits.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: thread_with_stdio: convert to darray
Kent Overstreet [Mon, 5 Feb 2024 03:20:40 +0000 (22:20 -0500)]
bcachefs: thread_with_stdio: convert to darray

 - eliminate the dependency on printbufs, so that we can lift
   thread_with_file for use in xfs
 - add a nonblocking parameter to stdio_redirect_printf(), and either
   block if the buffer is full or drop it on the floor - don't buffer
   infinitely

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: thread_with_stdio: eliminate double buffering
Kent Overstreet [Mon, 5 Feb 2024 01:19:49 +0000 (20:19 -0500)]
bcachefs: thread_with_stdio: eliminate double buffering

The output buffer lock has to be a spinlock so that we can write to it
from interrupt context, so we can't use a direct copy_to_user; this
switches thread_with_file_read() to use fault_in_writeable() and
copy_to_user_nofault(), similar to how thread_with_file_write() works.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: kill kvpmalloc()
Kent Overstreet [Thu, 1 Feb 2024 11:35:46 +0000 (06:35 -0500)]
bcachefs: kill kvpmalloc()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agomempool: kvmalloc pool
Kent Overstreet [Thu, 1 Feb 2024 11:28:41 +0000 (06:28 -0500)]
mempool: kvmalloc pool

Add mempool_init_kvmalloc_pool() and mempool_create_kvmalloc_pool(),
which wrap kvmalloc() instead of kmalloc() - kmalloc() with a vmalloc()
fallback.

This is part of a bcachefs cleanup - dropping an internal kvpmalloc()
helper (which predates kvmalloc()) along with mempool helpers; this
replaces the bcachefs-private kvpmalloc_pool.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-mm@kvack.org
12 months agobcachefs: bch2_lookup() gives better error message on inode not found
Kent Overstreet [Thu, 25 Jan 2024 17:36:37 +0000 (12:36 -0500)]
bcachefs: bch2_lookup() gives better error message on inode not found

When a dirent points to a missing inode, we really should print out the
dirent.

This requires quite a bit of refactoring, but there's some other
benefits: we now do the entire looup (dirent and inode) in a single
btree transaction, and copy to the VFS inode with btree locks still
held, like the create path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: bch2_inode_insert()
Kent Overstreet [Fri, 26 Jan 2024 01:25:49 +0000 (20:25 -0500)]
bcachefs: bch2_inode_insert()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agomm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN
Kent Overstreet [Fri, 26 Jan 2024 00:00:24 +0000 (19:00 -0500)]
mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN

Introduce PF_MEMALLOC_* equivalents of some GFP_ flags:

PF_MEMALLOC_NORECLAIM -> GFP_NOWAIT
PF_MEMALLOC_NOWARN -> __GFP_NOWARN

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: linux-mm@kvack.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agomm: introduce memalloc_flags_{save,restore}
Kent Overstreet [Fri, 26 Jan 2024 00:00:24 +0000 (19:00 -0500)]
mm: introduce memalloc_flags_{save,restore}

Our proliferation of memalloc_*_{save,restore} APIs is getting a bit
silly, this adds a generic version and converts the existing
save/restore functions to wrappers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: linux-mm@kvack.org
Acked-by: Vlastimil Babka <vbabka@suse.cz>
12 months agobcachefs: factor out check_inode_backpointer()
Kent Overstreet [Tue, 6 Feb 2024 00:38:19 +0000 (19:38 -0500)]
bcachefs: factor out check_inode_backpointer()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Factor out check_subvol_dirent()
Kent Overstreet [Mon, 5 Feb 2024 08:22:29 +0000 (03:22 -0500)]
bcachefs: Factor out check_subvol_dirent()

Going to be adding more code here for checking subvol structure.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Kill some -EINVALs
Kent Overstreet [Tue, 6 Feb 2024 02:44:23 +0000 (21:44 -0500)]
bcachefs: Kill some -EINVALs

Repurposing standard error codes in bcachefs code is banned in new code,
and we need to get rid of the remaining ones - private error codes give
us much better error messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: bump max_active on btree_interior_update_worker
Kent Overstreet [Tue, 6 Feb 2024 00:28:03 +0000 (19:28 -0500)]
bcachefs: bump max_active on btree_interior_update_worker

WQ_UNBOUND with max_active 1 means ordered workqueue, but we don't
actually need or want ordered semantics - and probably want a higher
concurrency limit anyways.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: move fsck_write_inode() to inode.c
Kent Overstreet [Thu, 1 Feb 2024 12:35:28 +0000 (07:35 -0500)]
bcachefs: move fsck_write_inode() to inode.c

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Initialize super_block->s_uuid
Kent Overstreet [Sat, 3 Feb 2024 20:23:07 +0000 (15:23 -0500)]
bcachefs: Initialize super_block->s_uuid

Need to fix this oversight for the new FS_IOC_(GET|SET)UUID ioctls.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Switch to uuid_to_fsid()
Kent Overstreet [Sat, 3 Feb 2024 20:05:17 +0000 (15:05 -0500)]
bcachefs: Switch to uuid_to_fsid()

switch the statfs code from something horrible and open coded to the
more standard uuid_to_fsid()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Subvolumes may now be renamed
Kent Overstreet [Sun, 21 Jan 2024 21:46:45 +0000 (16:46 -0500)]
bcachefs: Subvolumes may now be renamed

Files within a subvolume cannot be renamed into another subvolume, but
subvolumes themselves were intended to be.

This implements subvolume renaming - we need to ensure that there's only
a single dirent that points to a subvolume key (not multiple versions in
different snapshots), and we need to ensure that dirent.d_parent_subol
and inode.bi_parent_subvol are updated.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: btree node prefetching in check_topology
Kent Overstreet [Mon, 22 Jan 2024 19:25:00 +0000 (14:25 -0500)]
bcachefs: btree node prefetching in check_topology

btree_and_journal_iter is old code that we want to get rid of, but we're
not ready to yet.

lack of btree node prefetching is, it turns out, a real performance
issue for fsck on spinning rust, so - add it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: btree_and_journal_iter.trans
Kent Overstreet [Mon, 22 Jan 2024 19:37:42 +0000 (14:37 -0500)]
bcachefs: btree_and_journal_iter.trans

we now always have a btree_trans when using a btree_and_journal_iter;
prep work for adding prefetching to btree_and_journal_iter

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: better journal pipelining
Kent Overstreet [Wed, 31 Jan 2024 19:26:15 +0000 (14:26 -0500)]
bcachefs: better journal pipelining

Recently a severe performance regression was discovered, which bisected
to

  a6548c8b5eb5 bcachefs: Avoid flushing the journal in the discard path

It turns out the old behaviour, which issued excessive journal flushes,
worked around a performance issue where queueing delays would cause the
journal to not be able to write quickly enough and stall.

The journal flushes masked the issue because they periodically flushed
the device write cache, reducing write latency for non flushes.

This patch reworks the journalling code to allow more than one
(non-flush) write to be in flight at a time. With this patch, doing 4k
random writes and an iodepth of 128, we are now able to hit 560k iops to
a Samsung 970 EVO Plus - previously, we were stuck in the ~200k range.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: closure per journal buf
Kent Overstreet [Wed, 31 Jan 2024 18:42:48 +0000 (13:42 -0500)]
bcachefs: closure per journal buf

Prep work for having multiple journal writes in flight.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: bio per journal buf
Kent Overstreet [Wed, 31 Jan 2024 18:20:28 +0000 (13:20 -0500)]
bcachefs: bio per journal buf

Prep work for having multiple journal writes in flight.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: jset_entry_datetime
Kent Overstreet [Sat, 27 Jan 2024 15:16:15 +0000 (10:16 -0500)]
bcachefs: jset_entry_datetime

This gives us a way to record the date and time every journal entry was
written - useful for debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: improve journal entry read fsck error messages
Kent Overstreet [Sat, 27 Jan 2024 15:01:23 +0000 (10:01 -0500)]
bcachefs: improve journal entry read fsck error messages

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: convert journal replay ptrs to darray
Kent Overstreet [Sat, 27 Jan 2024 05:05:03 +0000 (00:05 -0500)]
bcachefs: convert journal replay ptrs to darray

Eliminates some error paths - no longer have a hardcoded
BCH_REPLICAS_MAX limit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Cleanup bch2_dirent_lookup_trans()
Kent Overstreet [Thu, 25 Jan 2024 17:35:06 +0000 (12:35 -0500)]
bcachefs: Cleanup bch2_dirent_lookup_trans()

Drop an unnecessary bch2_subvolume_get_snapshot() call, and drop the __
from the name - this is a normal interface.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: bch2_hash_set_snapshot() -> bch2_hash_set_in_snapshot()
Kent Overstreet [Fri, 26 Jan 2024 00:57:26 +0000 (19:57 -0500)]
bcachefs: bch2_hash_set_snapshot() -> bch2_hash_set_in_snapshot()

Minor renaming for clarity, bit of refactoring.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Workqueues should be WQ_HIGHPRI
Kent Overstreet [Tue, 23 Jan 2024 01:55:08 +0000 (20:55 -0500)]
bcachefs: Workqueues should be WQ_HIGHPRI

Most bcachefs workqueues are used for completions, and should be
WQ_HIGHPRI - this helps reduce queuing delays, we want to complete
quickly once we can no longer signal backpressure by blocking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Improve bch2_dirent_to_text()
Kent Overstreet [Sun, 21 Jan 2024 22:46:14 +0000 (17:46 -0500)]
bcachefs: Improve bch2_dirent_to_text()

For DT_SUBVOL, we now print both parent and child subvol IDs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: fixup for building in userspace
Kent Overstreet [Tue, 16 Jan 2024 22:29:15 +0000 (17:29 -0500)]
bcachefs: fixup for building in userspace

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Avoid taking journal lock unnecessarily
Kent Overstreet [Wed, 31 Jan 2024 16:28:13 +0000 (11:28 -0500)]
bcachefs: Avoid taking journal lock unnecessarily

Previously, any time we failed to get a journal reservation we'd retry,
with the journal lock held; but this isn't necessary given
wait_event()/wake_up() ordering.

This avoids performance cliffs when the journal starts to get backed up
and lock contention shoots up.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Journal writes should be REQ_SYNC|REQ_META
Kent Overstreet [Wed, 31 Jan 2024 16:25:46 +0000 (11:25 -0500)]
bcachefs: Journal writes should be REQ_SYNC|REQ_META

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Avoid setting j->write_work unnecessarily
Kent Overstreet [Wed, 31 Jan 2024 16:24:37 +0000 (11:24 -0500)]
bcachefs: Avoid setting j->write_work unnecessarily

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Split out journal workqueue
Kent Overstreet [Wed, 31 Jan 2024 16:21:46 +0000 (11:21 -0500)]
bcachefs: Split out journal workqueue

We don't want journal write completions to be blocked behind btree
transactions - io_complete_wq is used for btree updates after data and
metadata writes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Kill unnecessary wakeups in journal reclaim
Kent Overstreet [Wed, 31 Jan 2024 16:06:59 +0000 (11:06 -0500)]
bcachefs: Kill unnecessary wakeups in journal reclaim

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: skip invisible entries in empty subvolume checking
Guoyu Ou [Tue, 13 Feb 2024 08:20:04 +0000 (16:20 +0800)]
bcachefs: skip invisible entries in empty subvolume checking

When we are checking whether a subvolume is empty in the specified snapshot,
entries that do not belong to this subvolume should be skipped.

This fixes the following case:

    $ bcachefs subvolume create ./sub
    $ cd sub
    $ bcachefs subvolume create ./sub2
    $ bcachefs subvolume snapshot . ./snap
    $ ls -a snap
    . ..
    $ rmdir snap
    rmdir: failed to remove 'snap': Directory not empty

As Kent suggested, we pass 0 in may_delete_deleted_inode() to ignore subvols
in the subvol we are checking, because inode.bi_subvol is only set on
subvolume roots, and we can't go through every inode in the subvolume and
change bi_subvol when taking a snapshot. It makes the check less strict, but
that's ok, the rest of fsck will still catch it.

Signed-off-by: Guoyu Ou <benogy@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: fix split brain message
Kent Overstreet [Sat, 27 Jan 2024 05:31:13 +0000 (00:31 -0500)]
bcachefs: fix split brain message

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Set path->uptodate when no node at level
Kent Overstreet [Wed, 24 Jan 2024 21:32:12 +0000 (16:32 -0500)]
bcachefs: Set path->uptodate when no node at level

We were failing to set path->uptodate when reaching the end of a btree
node iterator, causing the new prefetch code for backpointers gc to go
into an infinite loop.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Correctly validate k->u64s in btree node read path
Kent Overstreet [Fri, 8 Mar 2024 19:53:03 +0000 (14:53 -0500)]
bcachefs: Correctly validate k->u64s in btree node read path

validate_bset_keys() never properly validated k->u64s; it checked if it
was 0, but not if it was smaller than keys for the given packed format;
this fixes that small oversight.

This patch was backported, so it's adding quite a few error enums so
that they don't get renumbered and we don't have confusing gaps.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Fix degraded mode fsck
Kent Overstreet [Sun, 10 Mar 2024 18:54:09 +0000 (14:54 -0400)]
bcachefs: Fix degraded mode fsck

We don't know where the superblock and journal lives on offline devices;
that means if a device is offline fsck can't check those buckets.

Previously, fsck would incorrectly clear bucket data types for those
buckets on offline devices; now we just use the previous state.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Fix journal replay with unreadable btree roots
Kent Overstreet [Sat, 9 Mar 2024 00:57:22 +0000 (19:57 -0500)]
bcachefs: Fix journal replay with unreadable btree roots

When a btree root is unreadable, we still might be able to get some data
back by replaying what's in the journal. Previously though, we got
confused when journal replay would attempt to replay a key for a level
that didn't exist.

This adds bch2_btree_increase_depth(), so that journal replay can handle
this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: fix check_inode_deleted_list()
Kent Overstreet [Tue, 27 Feb 2024 12:38:50 +0000 (07:38 -0500)]
bcachefs: fix check_inode_deleted_list()

check_inode_deleted_list() returns true if the inode is on the deleted
list; check_inode() was checking the return code incorrectly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: no_splitbrain_check option
Kent Overstreet [Fri, 8 Mar 2024 21:03:19 +0000 (16:03 -0500)]
bcachefs: no_splitbrain_check option

This adds an option to disable kicking out devices when splitbrain is
detected - it seems there's some issues with splitbrain detection and
we're kicking out devices erronously.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: extent_entry_next_safe()
Kent Overstreet [Fri, 8 Mar 2024 20:25:27 +0000 (15:25 -0500)]
bcachefs: extent_entry_next_safe()

We need to be able to iterate over extent ptrs that may be corrupted in
order to print them - this fixes a bug where we'd pop an assert in
bch2_bkey_durability_safe().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: journal_seq_blacklist_add() now handles entries being added out of order
Kent Overstreet [Thu, 7 Mar 2024 17:30:49 +0000 (12:30 -0500)]
bcachefs: journal_seq_blacklist_add() now handles entries being added out of order

bch2_journal_seq_blacklist_add() was bugged when the new entry
overlapped with multiple existing entries, and it also assumed new
entries are being added in increasing order.

This is true on any sane filesystem, but when trying to recover from
very badly mangled filesystems we might end up with the journal sequence
number rewinding vs. what the blacklist list knows about - easiest to
just handle that here.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 months agobcachefs: Fix null-ptr-deref in bch2_fs_alloc()
Li Zetao [Mon, 4 Mar 2024 03:22:03 +0000 (11:22 +0800)]
bcachefs: Fix null-ptr-deref in bch2_fs_alloc()

There is a null-ptr-deref issue reported by kasan:

  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
  Call Trace:
    <TASK>
    bch2_fs_alloc+0x1092/0x2170 [bcachefs]
    bch2_fs_open+0x683/0xe10 [bcachefs]
    ...

When initializing the name of bch_fs, it needs to dynamically alloc memory
to meet the length of the name. However, when name allocation failed, it
will cause a null-ptr-deref access exception in subsequent string copy.

Fix this issue by checking if name allocation is successful.

Fixes: 401ec4db6308 ("bcachefs: Printbuf rework")
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 months agoLinux 6.8-rc6
Linus Torvalds [Sun, 25 Feb 2024 23:46:06 +0000 (15:46 -0800)]
Linux 6.8-rc6

13 months agoMerge tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs
Linus Torvalds [Sun, 25 Feb 2024 23:31:57 +0000 (15:31 -0800)]
Merge tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Some more mostly boring fixes, but some not

  User reported ones:

   - the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty
     performance bug; user reported an untar initially taking two
     seconds and then ~2 minutes

   - kill a __GFP_NOFAIL in the buffered read path; this was a leftover
     from the trickier fix to kill __GFP_NOFAIL in readahead, where we
     can't return errors (and have to silently truncate the read
     ourselves).

     bcachefs can't use GFP_NOFAIL for folio state unlike iomap based
     filesystems because our folio state is just barely too big, 2MB
     hugepages cause us to exceed the 2 page threshhold for GFP_NOFAIL.

     additionally, the flags argument was just buggy, we weren't
     supplying GFP_KERNEL previously (!)"

* tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: fix bch2_save_backtrace()
  bcachefs: Fix check_snapshot() memcpy
  bcachefs: Fix bch2_journal_flush_device_pins()
  bcachefs: fix iov_iter count underflow on sub-block dio read
  bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree
  bcachefs: Kill __GFP_NOFAIL in buffered read path
  bcachefs: fix backpointer_to_text() when dev does not exist

13 months agobcachefs: fix bch2_save_backtrace()
Kent Overstreet [Sun, 25 Feb 2024 20:45:34 +0000 (15:45 -0500)]
bcachefs: fix bch2_save_backtrace()

Missed a call in the previous fix.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 months agoMerge tag 'docs-6.8-fixes3' of git://git.lwn.net/linux
Linus Torvalds [Sun, 25 Feb 2024 18:58:12 +0000 (10:58 -0800)]
Merge tag 'docs-6.8-fixes3' of git://git.lwn.net/linux

Pull two documentation build fixes from Jonathan Corbet:

 - The XFS online fsck documentation uses incredibly deeply nested
   subsection and list nesting; that broke the PDF docs build. Tweak a
   parameter to tell LaTeX to allow the deeper nesting.

 - Fix a 6.8 PDF-build regression

* tag 'docs-6.8-fixes3' of git://git.lwn.net/linux:
  docs: translations: use attribute to store current language
  docs: Instruct LaTeX to cope with deeper nesting

13 months agoMerge tag 'usb-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Sun, 25 Feb 2024 18:41:57 +0000 (10:41 -0800)]
Merge tag 'usb-6.8-rc6' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here are some small USB fixes for 6.8-rc6 to resolve some reported
  problems. These include:

   - regression fixes with typec tpcm code as reported by many

   - cdnsp and cdns3 driver fixes

   - usb role setting code bugfixes

   - build fix for uhci driver

   - ncm gadget driver bugfix

   - MAINTAINERS entry update

  All of these have been in linux-next all week with no reported issues
  and there is at least one fix in here that is in Thorsten's regression
  list that is being tracked"

* tag 'usb-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: typec: tpcm: Fix issues with power being removed during reset
  MAINTAINERS: Drop myself as maintainer of TYPEC port controller drivers
  usb: gadget: ncm: Avoid dropping datagrams of properly parsed NTBs
  Revert "usb: typec: tcpm: reset counter when enter into unattached state after try role"
  usb: gadget: omap_udc: fix USB gadget regression on Palm TE
  usb: dwc3: gadget: Don't disconnect if not started
  usb: cdns3: fix memory double free when handle zero packet
  usb: cdns3: fixed memory use after free at cdns3_gadget_ep_disable()
  usb: roles: don't get/set_role() when usb_role_switch is unregistered
  usb: roles: fix NULL pointer issue when put module's reference
  usb: cdnsp: fixed issue with incorrect detecting CDNSP family controllers
  usb: cdnsp: blocked some cdns3 specific code
  usb: uhci-grlib: Explicitly include linux/platform_device.h

13 months agoMerge tag 'tty-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Linus Torvalds [Sun, 25 Feb 2024 18:35:41 +0000 (10:35 -0800)]
Merge tag 'tty-6.8-rc6' of git://git./linux/kernel/git/gregkh/tty

Pull tty/serial driver fixes from Greg KH:
 "Here are three small serial/tty driver fixes for 6.8-rc6 that resolve
  the following reported errors:

   - riscv hvc console driver fix that was reported by many

   - amba-pl011 serial driver fix for RS485 mode

   - stm32 serial driver fix for RS485 mode

  All of these have been in linux-next all week with no reported
  problems"

* tag 'tty-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  serial: amba-pl011: Fix DMA transmission in RS485 mode
  serial: stm32: do not always set SER_RS485_RX_DURING_TX if RS485 is enabled
  tty: hvc: Don't enable the RISC-V SBI console by default

13 months agoMerge tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 25 Feb 2024 18:22:21 +0000 (10:22 -0800)]
Merge tag 'x86_urgent_for_v6.8_rc6' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Make sure clearing CPU buffers using VERW happens at the latest
   possible point in the return-to-userspace path, otherwise memory
   accesses after the VERW execution could cause data to land in CPU
   buffers again

* tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM/VMX: Move VERW closer to VMentry for MDS mitigation
  KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH
  x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
  x86/entry_32: Add VERW just before userspace transition
  x86/entry_64: Add VERW just before userspace transition
  x86/bugs: Add asm helpers for executing VERW

13 months agoMerge tag 'irq_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 25 Feb 2024 18:14:12 +0000 (10:14 -0800)]
Merge tag 'irq_urgent_for_v6.8_rc6' of git://git./linux/kernel/git/tip/tip

Pull irq fixes from Borislav Petkov:

 - Make sure GICv4 always gets initialized to prevent a kexec-ed kernel
   from silently failing to set it up

 - Do not call bus_get_dev_root() for the mbigen irqchip as it always
   returns NULL - use NULL directly

 - Fix hardware interrupt number truncation when assigning MSI
   interrupts

 - Correct sending end-of-interrupt messages to disabled interrupts
   lines on RISC-V PLIC

* tag 'irq_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-v3-its: Do not assume vPE tables are preallocated
  irqchip/mbigen: Don't use bus_get_dev_root() to find the parent
  PCI/MSI: Prevent MSI hardware interrupt number truncation
  irqchip/sifive-plic: Enable interrupt if needed before EOI

13 months agoMerge tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 25 Feb 2024 17:53:13 +0000 (09:53 -0800)]
Merge tag 'erofs-for-6.8-rc6-fixes' of git://git./linux/kernel/git/xiang/erofs

Pull erofs fix from Gao Xiang:

 - Fix page refcount leak when looking up specific inodes
   introduced by metabuf reworking

* tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix refcount on the metabuf used for inode lookup

13 months agoMerge tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 25 Feb 2024 17:29:05 +0000 (09:29 -0800)]
Merge tag 'pull-fixes.pathwalk-rcu-2' of git://git./linux/kernel/git/viro/vfs

Pull RCU pathwalk fixes from Al Viro:
 "We still have some races in filesystem methods when exposed to RCU
  pathwalk. This series is a result of code audit (the second round of
  it) and it should deal with most of that stuff.

  Still pending: ntfs3 ->d_hash()/->d_compare() and ceph_d_revalidate().
  Up to maintainers (a note for NTFS folks - when documentation says
  that a method may not block, it *does* imply that blocking allocations
  are to be avoided. Really)"

[ More explanations for people who aren't familiar with the vagaries of
  RCU path walking: most of it is hidden from filesystems, but if a
  filesystem actively participates in the low-level path walking it
  needs to make sure the fields involved in that walk are RCU-safe.

  That "actively participate in low-level path walking" includes things
  like having its own ->d_hash()/->d_compare() routines, or by having
  its own directory permission function that doesn't just use the common
  helpers.  Having a ->d_revalidate() function will also have this issue.

  Note that instead of making everything RCU safe you can also choose to
  abort the RCU pathwalk if your operation cannot be done safely under
  RCU, but that obviously comes with a performance penalty. One common
  pattern is to allow the simple cases under RCU, and abort only if you
  need to do something more complicated.

  So not everything needs to be RCU-safe, and things like the inode etc
  that the VFS itself maintains obviously already are. But these fixes
  tend to be about properly RCU-delaying things like ->s_fs_info that
  are maintained by the filesystem and that got potentially released too
  early.   - Linus ]

* tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ext4_get_link(): fix breakage in RCU mode
  cifs_get_link(): bail out in unsafe case
  fuse: fix UAF in rcu pathwalks
  procfs: make freeing proc_fs_info rcu-delayed
  procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()
  nfs: fix UAF on pathwalk running into umount
  nfs: make nfs_set_verifier() safe for use in RCU pathwalk
  afs: fix __afs_break_callback() / afs_drop_open_mmap() race
  hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info
  exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper
  affs: free affs_sb_info with kfree_rcu()
  rcu pathwalk: prevent bogus hard errors from may_lookup()
  fs/super.c: don't drop ->s_user_ns until we free struct super_block itself

13 months agoMerge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Sun, 25 Feb 2024 17:17:15 +0000 (09:17 -0800)]
Merge tag 'pull-fixes' of git://git./linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
 "A couple of fixes - revert of regression from this cycle and a fix for
  erofs failure exit breakage (had been there since way back)"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  erofs: fix handling kern_mount() failure
  Revert "get rid of DCACHE_GENOCIDE"

13 months agoext4_get_link(): fix breakage in RCU mode
Al Viro [Sat, 3 Feb 2024 06:17:34 +0000 (01:17 -0500)]
ext4_get_link(): fix breakage in RCU mode

1) errors from ext4_getblk() should not be propagated to caller
unless we are really sure that we would've gotten the same error
in non-RCU pathwalk.
2) we leak buffer_heads if ext4_getblk() is successful, but bh is
not uptodate.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
13 months agocifs_get_link(): bail out in unsafe case
Al Viro [Wed, 20 Sep 2023 02:28:16 +0000 (22:28 -0400)]
cifs_get_link(): bail out in unsafe case

->d_revalidate() bails out there, anyway.  It's not enough
to prevent getting into ->get_link() in RCU mode, but that
could happen only in a very contrieved setup.  Not worth
trying to do anything fancy here unless ->d_revalidate()
stops kicking out of RCU mode at least in some cases.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
13 months agofuse: fix UAF in rcu pathwalks
Al Viro [Thu, 28 Sep 2023 04:19:39 +0000 (00:19 -0400)]
fuse: fix UAF in rcu pathwalks

->permission(), ->get_link() and ->inode_get_acl() might dereference
->s_fs_info (and, in case of ->permission(), ->s_fs_info->fc->user_ns
as well) when called from rcu pathwalk.

Freeing ->s_fs_info->fc is rcu-delayed; we need to make freeing ->s_fs_info
and dropping ->user_ns rcu-delayed too.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
13 months agoprocfs: make freeing proc_fs_info rcu-delayed
Al Viro [Wed, 20 Sep 2023 04:12:00 +0000 (00:12 -0400)]
procfs: make freeing proc_fs_info rcu-delayed

makes proc_pid_ns() safe from rcu pathwalk (put_pid_ns()
is still synchronous, but that's not a problem - it does
rcu-delay everything that needs to be)

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
13 months agoprocfs: move dropping pde and pid from ->evict_inode() to ->free_inode()
Al Viro [Wed, 20 Sep 2023 03:52:58 +0000 (23:52 -0400)]
procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()

that keeps both around until struct inode is freed, making access
to them safe from rcu-pathwalk

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
13 months agonfs: fix UAF on pathwalk running into umount
Al Viro [Thu, 28 Sep 2023 02:11:26 +0000 (22:11 -0400)]
nfs: fix UAF on pathwalk running into umount

NFS ->d_revalidate(), ->permission() and ->get_link() need to access
some parts of nfs_server when called in RCU mode:
server->flags
server->caps
*(server->io_stats)
and, worst of all, call
server->nfs_client->rpc_ops->have_delegation
(the last one - as NFS_PROTO(inode)->have_delegation()).  We really
don't want to RCU-delay the entire nfs_free_server() (it would have
to be done with schedule_work() from RCU callback, since it can't
be made to run from interrupt context), but actual freeing of
nfs_server and ->io_stats can be done via call_rcu() just fine.
nfs_client part is handled simply by making nfs_free_client() use
kfree_rcu().

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
13 months agonfs: make nfs_set_verifier() safe for use in RCU pathwalk
Al Viro [Thu, 28 Sep 2023 01:50:25 +0000 (21:50 -0400)]
nfs: make nfs_set_verifier() safe for use in RCU pathwalk

nfs_set_verifier() relies upon dentry being pinned; if that's
the case, grabbing ->d_lock stabilizes ->d_parent and guarantees
that ->d_parent points to a positive dentry.  For something
we'd run into in RCU mode that is *not* true - dentry might've
been through dentry_kill() just as we grabbed ->d_lock, with
its parent going through the same just as we get to into
nfs_set_verifier_locked().  It might get to detaching inode
(and zeroing ->d_inode) before nfs_set_verifier_locked() gets
to fetching that; we get an oops as the result.

That can happen in nfs{,4} ->d_revalidate(); the call chain in
question is nfs_set_verifier_locked() <- nfs_set_verifier() <-
nfs_lookup_revalidate_delegated() <- nfs{,4}_do_lookup_revalidate().
We have checked that the parent had been positive, but that's
done before we get to nfs_set_verifier() and it's possible for
memory pressure to pick our dentry as eviction candidate by that
time.  If that happens, back-to-back attempts to kill dentry and
its parent are quite normal.  Sure, in case of eviction we'll
fail the ->d_seq check in the caller, but we need to survive
until we return there...

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
13 months agoafs: fix __afs_break_callback() / afs_drop_open_mmap() race
Al Viro [Sat, 30 Sep 2023 00:24:34 +0000 (20:24 -0400)]
afs: fix __afs_break_callback() / afs_drop_open_mmap() race

In __afs_break_callback() we might check ->cb_nr_mmap and if it's non-zero
do queue_work(&vnode->cb_work).  In afs_drop_open_mmap() we decrement
->cb_nr_mmap and do flush_work(&vnode->cb_work) if it reaches zero.

The trouble is, there's nothing to prevent __afs_break_callback() from
seeing ->cb_nr_mmap before the decrement and do queue_work() after both
the decrement and flush_work().  If that happens, we might be in trouble -
vnode might get freed before the queued work runs.

__afs_break_callback() is always done under ->cb_lock, so let's make
sure that ->cb_nr_mmap can change from non-zero to zero while holding
->cb_lock (the spinlock component of it - it's a seqlock and we don't
need to mess with the counter).

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
13 months agohfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info
Al Viro [Wed, 20 Sep 2023 00:18:59 +0000 (20:18 -0400)]
hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info

->d_hash() and ->d_compare() use those, so we need to delay freeing
them.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
13 months agoexfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper
Al Viro [Tue, 19 Sep 2023 19:53:32 +0000 (15:53 -0400)]
exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper

That stuff can be accessed by ->d_hash()/->d_compare(); as it is, we have
a hard-to-hit UAF if rcu pathwalk manages to get into ->d_hash() on a filesystem
that is in process of getting shut down.

Besides, having nls and upcase table cleanup moved from ->put_super() towards
the place where sbi is freed makes for simpler failure exits.

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
13 months agoaffs: free affs_sb_info with kfree_rcu()
Al Viro [Tue, 19 Sep 2023 23:36:07 +0000 (19:36 -0400)]
affs: free affs_sb_info with kfree_rcu()

one of the flags in it is used by ->d_hash()/->d_compare()

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
13 months agorcu pathwalk: prevent bogus hard errors from may_lookup()
Al Viro [Sat, 30 Sep 2023 01:11:41 +0000 (21:11 -0400)]
rcu pathwalk: prevent bogus hard errors from may_lookup()

If lazy call of ->permission() returns a hard error, check that
try_to_unlazy() succeeds before returning it.  That both makes
life easier for ->permission() instances and closes the race
in ENOTDIR handling - it is possible that positive d_can_lookup()
seen in link_path_walk() applies to the state *after* unlink() +
mkdir(), while nd->inode matches the state prior to that.

Normally seeing e.g. EACCES from permission check in rcu pathwalk
means that with some timings non-rcu pathwalk would've run into
the same; however, running into a non-executable regular file
in the middle of a pathname would not get to permission check -
it would fail with ENOTDIR instead.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>