Christoph Hellwig [Wed, 20 Dec 2023 06:35:02 +0000 (07:35 +0100)]
xfs: remove xfs_attr_sf_hdr_t
Remove the last two users of the typedef.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Wed, 20 Dec 2023 06:35:01 +0000 (07:35 +0100)]
xfs: remove struct xfs_attr_shortform
sparse complains about struct xfs_attr_shortform because it embeds a
structure with a variable sized array in a variable sized array.
Given that xfs_attr_shortform is not a very useful structure, and the
dir2 equivalent has been removed a long time ago, remove it as well.
Provide a xfs_attr_sf_firstentry helper that returns the first
xfs_attr_sf_entry behind a xfs_attr_sf_hdr to replace the structure
dereference.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Wed, 20 Dec 2023 06:35:00 +0000 (07:35 +0100)]
xfs: use xfs_attr_sf_findname in xfs_attr_shortform_getvalue
xfs_attr_shortform_getvalue duplicates the logic in xfs_attr_sf_findname.
Use the helper instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Wed, 20 Dec 2023 06:34:59 +0000 (07:34 +0100)]
xfs: remove xfs_attr_shortform_lookup
xfs_attr_shortform_lookup is only used by xfs_attr_shortform_addname,
which is much better served by calling xfs_attr_sf_findname. Switch
it over and remove xfs_attr_shortform_lookup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Wed, 20 Dec 2023 06:34:58 +0000 (07:34 +0100)]
xfs: simplify xfs_attr_sf_findname
xfs_attr_sf_findname has the simple job of finding a xfs_attr_sf_entry in
the attr fork, but the convoluted calling convention obfuscates that.
Return the found entry as the return value instead of an pointer
argument, as the -ENOATTR/-EEXIST can be trivally derived from that, and
remove the basep argument, as it is equivalent of the offset of sfe in
the data for if an sfe was found, or an offset of totsize if not was
found. To simplify the totsize computation add a xfs_attr_sf_endptr
helper that returns the imaginative xfs_attr_sf_entry at the end of
the current attrs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Wed, 20 Dec 2023 06:34:57 +0000 (07:34 +0100)]
xfs: move the xfs_attr_sf_lookup tracepoint
trace_xfs_attr_sf_lookup is currently only called by
xfs_attr_shortform_lookup, which despit it's name is a simple helper for
xfs_attr_shortform_addname, which has it's own tracing. Move the
callsite to xfs_attr_shortform_getvalue, which is the closest thing to
a high level lookup we have for the Linux xattr API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Wed, 20 Dec 2023 06:34:56 +0000 (07:34 +0100)]
xfs: return if_data from xfs_idata_realloc
Many of the xfs_idata_realloc callers need to set a local pointer to the
just reallocated if_data memory. Return the pointer to simplify them a
bit and use the opportunity to re-use krealloc for freeing if_data if the
size hits 0.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Wed, 20 Dec 2023 06:34:55 +0000 (07:34 +0100)]
xfs: make if_data a void pointer
The xfs_ifork structure currently has a union of the if_root void pointer
and the if_data char pointer. In either case it is an opaque pointer
that depends on the fork format. Replace the union with a single if_data
void pointer as that is what almost all callers want. Only the symlink
NULL termination code in xfs_init_local_fork actually needs a new local
variable now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:37 +0000 (05:57 +0100)]
xfs: fold xfs_rtallocate_extent into xfs_bmap_rtalloc
There isn't really much left in xfs_rtallocate_extent now, fold it into
the only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:36 +0000 (05:57 +0100)]
xfs: simplify and optimize the RT allocation fallback cascade
There are currently multiple levels of fall back if an RT allocation
can not be satisfied:
1) xfs_rtallocate_extent extends the minlen and reduces the maxlen due
to the extent size hint. If that can't be done, it return -ENOSPC
and let's xfs_bmap_rtalloc retry, which then not only drops the
extent size hint based alignment, but also the minlen adjustment
2) if xfs_rtallocate_extent gets -ENOSPC from the underlying functions,
it only drops the extent size hint based alignment and retries
3) if that still does not succeed, xfs_rtallocate_extent drops the
extent size hint (which is a complex no-op at this point) and the
minlen using the same code as (1) above
4) if that still doesn't success and the caller wanted an allocation
near a blkno, drop that blkno hint.
The handling in 1 is rather inefficient as we could just drop the
alignment and continue, and 2/3 interact in really weird ways due to
the duplicate policy.
Move aligning the min and maxlen out of xfs_rtallocate_extent and into
a helper called directly by xfs_bmap_rtalloc. This allows just
continuing with the allocation if we have to drop the alignment instead
of going through the retry loop and also dropping the perfectly usable
minlen adjustment that didn't cause the problem, and then just use
a single retry that drops both the minlen and alignment requirement
when we really are out of space, thus consolidating cases (2) and (3)
above.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:35 +0000 (05:57 +0100)]
xfs: reorder the minlen and prod calculations in xfs_bmap_rtalloc
xfs_bmap_rtalloc is a bit of a mess in terms of calculating the locally
need variables. Reorder them a bit so that related code is located
next to each other - the raminlen calculation moves up next to where
the maximum len is calculated, and all the prod calculation is move
into a single place and rearranged so that the real prod calculation
only happens when it actually is needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:34 +0000 (05:57 +0100)]
xfs: remove XFS_RTMIN/XFS_RTMAX
Use the kernel min/max helpers instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:33 +0000 (05:57 +0100)]
xfs: remove rt-wrappers from xfs_format.h
xfs_format.h has a bunch odd wrappers for helper functions and mount
structure access using RT* prefixes. Replace them with their open coded
versions (for those that weren't entirely unused) and remove the wrappers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:32 +0000 (05:57 +0100)]
xfs: factor out a xfs_rtalloc_sumlevel helper
xfs_rtallocate_extent_size has two loops with nearly identical logic
in them. Split that logic into a separate xfs_rtalloc_sumlevel helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:31 +0000 (05:57 +0100)]
xfs: tidy up xfs_rtallocate_extent_exact
Use common code for both xfs_rtallocate_range calls by moving
the !isfree logic into the non-default branch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:30 +0000 (05:57 +0100)]
xfs: merge the calls to xfs_rtallocate_range in xfs_rtallocate_block
Use a goto to use a common tail for the case of being able to allocate
an extent.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:29 +0000 (05:57 +0100)]
xfs: reflow the tail end of xfs_rtallocate_extent_block
Change polarity of a check so that the successful case of being able to
allocate an extent is in the main path of the function and error handling
is on a branch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:28 +0000 (05:57 +0100)]
xfs: invert a check in xfs_rtallocate_extent_block
Doing a break in the else side of a conditional is rather silly. Invert
the check, break ASAP and unindent the other leg.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:27 +0000 (05:57 +0100)]
xfs: split xfs_rtmodify_summary_int
Inline the logic of xfs_rtmodify_summary_int into xfs_rtmodify_summary
and xfs_rtget_summary instead of having a somewhat awkward helper to
share a little bit of code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:26 +0000 (05:57 +0100)]
xfs: move xfs_rtget_summary to xfs_rtbitmap.c
xfs_rtmodify_summary_int is only used inside xfs_rtbitmap.c and to
implement xfs_rtget_summary. Move xfs_rtget_summary to xfs_rtbitmap.c
as the exported API and mark xfs_rtmodify_summary_int static.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:25 +0000 (05:57 +0100)]
xfs: cleanup picking the start extent hint in xfs_bmap_rtalloc
Clean up the logical in xfs_bmap_rtalloc that tries to find a rtextent
to start the search from by using a separate variable for the hint, not
calling xfs_bmap_adjacent when we want to ignore the locality and avoid
an extra roundtrip converting between block numbers and RT extent
numbers.
As a side-effect this doesn't pointlessly call xfs_rtpick_extent and
increment the start rtextent hint if we are going to ignore the result
anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:24 +0000 (05:57 +0100)]
xfs: indicate if xfs_bmap_adjacent changed ap->blkno
Add a return value to xfs_bmap_adjacent to indicate if it did change
ap->blkno or not.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:23 +0000 (05:57 +0100)]
xfs: reflow the tail end of xfs_bmap_rtalloc
Reorder the tail end of xfs_bmap_rtalloc so that the successfully
allocation is in the main path, and the error handling is on a branch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:22 +0000 (05:57 +0100)]
xfs: return -ENOSPC from xfs_rtallocate_*
Just return -ENOSPC instead of returning 0 and setting the return rt
extent number to NULLRTEXTNO. This is turn removes all users of
NULLRTEXTNO, so remove that as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:21 +0000 (05:57 +0100)]
xfs: move xfs_bmap_rtalloc to xfs_rtalloc.c
xfs_bmap_rtalloc is currently in xfs_bmap_util.c, which is a somewhat
odd spot for it, given that is only called from xfs_bmap.c and calls
into xfs_rtalloc.c to do the actual work. Move xfs_bmap_rtalloc to
xfs_rtalloc.c and mark xfs_rtpick_extent xfs_rtallocate_extent and
xfs_rtallocate_extent static now that they aren't called from outside
of xfs_rtalloc.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:20 +0000 (05:57 +0100)]
xfs: also use xfs_bmap_btalloc_accounting for RT allocations
Make xfs_bmap_btalloc_accounting more generic by handling the RT quota
reservations and then also use it from xfs_bmap_rtalloc instead of
open coding the accounting logic there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:19 +0000 (05:57 +0100)]
xfs: remove the xfs_alloc_arg argument to xfs_bmap_btalloc_accounting
xfs_bmap_btalloc_accounting only uses the len field from args, but that
has just been propagated to ap->length field by the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:18 +0000 (05:57 +0100)]
xfs: turn the xfs_trans_mod_dquot_byino stub into an inline function
Without this upcoming change can cause an unused variable warning,
when adding a local variable for the fields field passed to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 18 Dec 2023 04:57:17 +0000 (05:57 +0100)]
xfs: consider minlen sized extents in xfs_rtallocate_extent_block
minlen is the lower bound on the extent length that the caller can
accept, and maxlen is at this point the maximal available length.
This means a minlen extent is perfectly fine to use, so do it. This
matches the equivalent logic in xfs_rtallocate_extent_exact that also
accepts a minlen sized extent.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Wang Jinchao [Fri, 15 Dec 2023 10:24:34 +0000 (18:24 +0800)]
xfs/health: cleanup, remove duplicated including
remove the second ones:
\#include "xfs_trans_resv.h"
\#include "xfs_mount.h"
Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Long Li [Fri, 15 Dec 2023 08:22:34 +0000 (16:22 +0800)]
xfs: fix perag leak when growfs fails
During growfs, if new ag in memory has been initialized, however
sb_agcount has not been updated, if an error occurs at this time it
will cause perag leaks as follows, these new AGs will not been freed
during umount , because of these new AGs are not visible(that is
included in mp->m_sb.sb_agcount).
unreferenced object 0xffff88810be40200 (size 512):
comm "xfs_growfs", pid 857, jiffies
4294909093
hex dump (first 32 bytes):
00 c0 c1 05 81 88 ff ff 04 00 00 00 00 00 00 00 ................
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc
381741e2):
[<
ffffffff8191aef6>] __kmalloc+0x386/0x4f0
[<
ffffffff82553e65>] kmem_alloc+0xb5/0x2f0
[<
ffffffff8238dac5>] xfs_initialize_perag+0xc5/0x810
[<
ffffffff824f679c>] xfs_growfs_data+0x9bc/0xbc0
[<
ffffffff8250b90e>] xfs_file_ioctl+0x5fe/0x14d0
[<
ffffffff81aa5194>] __x64_sys_ioctl+0x144/0x1c0
[<
ffffffff83c3d81f>] do_syscall_64+0x3f/0xe0
[<
ffffffff83e00087>] entry_SYSCALL_64_after_hwframe+0x62/0x6a
unreferenced object 0xffff88810be40800 (size 512):
comm "xfs_growfs", pid 857, jiffies
4294909093
hex dump (first 32 bytes):
20 00 00 00 00 00 00 00 57 ef be dc 00 00 00 00 .......W.......
10 08 e4 0b 81 88 ff ff 10 08 e4 0b 81 88 ff ff ................
backtrace (crc
bde50e2d):
[<
ffffffff8191b43a>] __kmalloc_node+0x3da/0x540
[<
ffffffff81814489>] kvmalloc_node+0x99/0x160
[<
ffffffff8286acff>] bucket_table_alloc.isra.0+0x5f/0x400
[<
ffffffff8286bdc5>] rhashtable_init+0x405/0x760
[<
ffffffff8238dda3>] xfs_initialize_perag+0x3a3/0x810
[<
ffffffff824f679c>] xfs_growfs_data+0x9bc/0xbc0
[<
ffffffff8250b90e>] xfs_file_ioctl+0x5fe/0x14d0
[<
ffffffff81aa5194>] __x64_sys_ioctl+0x144/0x1c0
[<
ffffffff83c3d81f>] do_syscall_64+0x3f/0xe0
[<
ffffffff83e00087>] entry_SYSCALL_64_after_hwframe+0x62/0x6a
Factor out xfs_free_unused_perag_range() from xfs_initialize_perag(),
used for freeing unused perag within a specified range in error handling,
included in the error path of the growfs failure.
Fixes: 1c1c6ebcf528 ("xfs: Replace per-ag array with a radix tree")
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Long Li [Fri, 15 Dec 2023 08:22:33 +0000 (16:22 +0800)]
xfs: add lock protection when remove perag from radix tree
Take mp->m_perag_lock for deletions from the perag radix tree in
xfs_initialize_perag to prevent racing with tagging operations.
Lookups are fine - they are RCU protected so already deal with the
tree changing shape underneath the lookup - but tagging operations
require the tree to be stable while the tags are propagated back up
to the root.
Right now there's nothing stopping radix tree tagging from operating
while a growfs operation is progress and adding/removing new entries
into the radix tree.
Hence we can have traversals that require a stable tree occurring at
the same time we are removing unused entries from the radix tree which
causes the shape of the tree to change.
Likely this hasn't caused a problem in the past because we are only
doing append addition and removal so the active AG part of the tree
is not changing shape, but that doesn't mean it is safe. Just making
the radix tree modifications serialise against each other is obviously
correct.
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Chandan Babu R [Sat, 16 Dec 2023 03:23:55 +0000 (08:53 +0530)]
Merge tag 'repair-quota-6.8_2023-12-15' of https://git./linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeB
xfs: online repair of quota and rt metadata files
XFS stores quota records and free space bitmap information in files.
Add the necessary infrastructure to enable repairing metadata inodes and
their forks, and then make it so that we can repair the file metadata
for the rtbitmap. Repairing the bitmap contents (and the summary file)
is left for subsequent patchsets.
We also add the ability to repair file metadata the quota files. As
part of these repairs, we also reinitialize the ondisk dquot records as
necessary to get the incore dquots working. We can also correct
obviously bad dquot record attributes, but we leave checking the
resource usage counts for the next patchsets.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'repair-quota-6.8_2023-12-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: repair quotas
xfs: improve dquot iteration for scrub
xfs: check dquot resource timers
xfs: check the ondisk space mapping behind a dquot
Chandan Babu R [Sat, 16 Dec 2023 03:20:01 +0000 (08:50 +0530)]
Merge tag 'repair-rtbitmap-6.8_2023-12-15' of https://git./linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeB
xfs: online repair of rt bitmap file
Add in the necessary infrastructure to check the inode and data forks of
metadata files, then apply that to the realtime bitmap file. We won't
be able to reconstruct the contents of the rtbitmap file until rmapbt is
added for realtime volumes, but we can at least get the basics started.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'repair-rtbitmap-6.8_2023-12-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: online repair of realtime bitmaps
xfs: create a new inode fork block unmap helper
xfs: repair the inode core and forks of a metadata inode
xfs: always check the rtbitmap and rtsummary files
xfs: check rt summary file geometry more thoroughly
xfs: check rt bitmap file geometry more thoroughly
Chandan Babu R [Sat, 16 Dec 2023 03:14:55 +0000 (08:44 +0530)]
Merge tag 'repair-file-mappings-6.8_2023-12-15' of https://git./linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeB
xfs: online repair of file fork mappings
In this series, online repair gains the ability to rebuild data and attr
fork mappings from the reverse mapping information. It is at this point
where we reintroduce the ability to reap file extents.
Repair of CoW forks is a little different -- on disk, CoW staging
extents are owned by the refcount btree and cannot be mapped back to
individual files. Hence we can only detect staging extents that don't
quite look right (missing reverse mappings, shared staging extents) and
replace them with fresh allocations.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'repair-file-mappings-6.8_2023-12-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: repair problems in CoW forks
xfs: create a ranged query function for refcount btrees
xfs: refactor repair forcing tests into a repair.c helper
xfs: repair inode fork block mapping data structures
xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents
Chandan Babu R [Sat, 16 Dec 2023 03:07:00 +0000 (08:37 +0530)]
Merge tag 'repair-inodes-6.8_2023-12-15' of https://git./linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeB
xfs: online repair of inodes and forks
In this series, online repair gains the ability to repair inode records.
To do this, we must repair the ondisk inode and fork information enough
to pass the iget verifiers and hence make the inode igettable again.
Once that's done, we can perform higher level repairs on the incore
inode. The fstests counterpart of this patchset implements stress
testing of repair.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'repair-inodes-6.8_2023-12-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: skip the rmapbt search on an empty attr fork unless we know it was zapped
xfs: abort directory parent scrub scans if we encounter a zapped directory
xfs: zap broken inode forks
xfs: repair inode records
xfs: set inode sick state flags when we zap either ondisk fork
xfs: dont cast to char * for XFS_DFORK_*PTR macros
xfs: add missing nrext64 inode flag check to scrub
xfs: try to attach dquots to files before repairing them
xfs: disable online repair quota helpers when quota not enabled
Chandan Babu R [Sat, 16 Dec 2023 03:02:05 +0000 (08:32 +0530)]
Merge tag 'repair-ag-btrees-6.8_2023-12-15' of https://git./linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeB
xfs: online repair of AG btrees
Now that we've spent a lot of time reworking common code in online fsck,
we're ready to start rebuilding the AG space btrees. This series
implements repair functions for the free space, inode, and refcount
btrees. Rebuilding the reverse mapping btree is much more intense and
is left for a subsequent patchset. The fstests counterpart of this
patchset implements stress testing of repair.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'repair-ag-btrees-6.8_2023-12-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: repair refcount btrees
xfs: repair inode btrees
xfs: repair free space btrees
xfs: remove trivial bnobt/inobt scrub helpers
xfs: roll the scrub transaction after completing a repair
xfs: move the per-AG datatype bitmaps to separate files
xfs: create separate structures and code for u32 bitmaps
Chandan Babu R [Sat, 16 Dec 2023 02:57:42 +0000 (08:27 +0530)]
Merge tag 'repair-prep-for-bulk-loading-6.8_2023-12-15' of https://git./linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeB
xfs: prepare repair for bulk loading
Before we start merging the online repair functions, let's improve the
bulk loading code a bit. First, we need to fix a misinteraction between
the AIL and the btree bulkloader wherein the delwri at the end of the
bulk load fails to queue a buffer for writeback if it happens to be on
the AIL list.
Second, we introduce a defer ops barrier object so that the process of
reaping blocks after a repair cannot queue more than two extents per EFI
log item. This increases our exposure to leaking blocks if the system
goes down during a reap, but also should prevent transaction overflows,
which result in the system going down.
Third, we change the bulkloader itself to copy multiple records into a
block if possible, and add some debugging knobs so that developers can
control the slack factors, just like they can do for xfs_repair.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'repair-prep-for-bulk-loading-6.8_2023-12-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: constrain dirty buffers while formatting a staged btree
xfs: move btree bulkload record initialization to ->get_record implementations
xfs: add debug knobs to control btree bulk load slack factors
xfs: read leaf blocks when computing keys for bulkloading into node blocks
xfs: set XBF_DONE on newly formatted btree block that are ready for writing
xfs: force all buffers to be written during btree bulk load
Darrick J. Wong [Fri, 15 Dec 2023 18:03:45 +0000 (10:03 -0800)]
xfs: repair quotas
Fix anything that causes the quota verifiers to fail.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:45 +0000 (10:03 -0800)]
xfs: improve dquot iteration for scrub
Upon a closer inspection of the quota record scrubber, I noticed that
dqiterate wasn't actually walking all possible dquots for the mapped
blocks in the quota file. This is due to xfs_qm_dqget_next skipping all
XFS_IS_DQUOT_UNINITIALIZED dquots.
For a fsck program, we really want to look at all the dquots, even if
all counters and limits in the dquot record are zero. Rewrite the
implementation to do this, as well as switching to an iterator paradigm
to reduce the number of indirect calls.
This enables removal of the old broken dqiterate code from xfs_dquot.c.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:44 +0000 (10:03 -0800)]
xfs: check dquot resource timers
For each dquot resource, ensure either (a) the resource usage is over
the soft limit and there is a nonzero timer; or (b) usage is at or under
the soft limit and the timer is unset. (a) is redundant with the dquot
buffer verifier, but (b) isn't checked anywhere.
Found by fuzzing xfs/426 and noticing that diskdq.btimer = add didn't
trip any kind of warning for having a timer set even with no limits.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:44 +0000 (10:03 -0800)]
xfs: check the ondisk space mapping behind a dquot
Each xfs_dquot object caches the file offset and daddr of the ondisk
block that backs the dquot. Make sure these cached values are the same
as the bmapi data, and that the block state is written.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:43 +0000 (10:03 -0800)]
xfs: online repair of realtime bitmaps
Fix all the file metadata surrounding the realtime bitmap file, which
includes the rt geometry, file size, forks, and space mappings. The
bitmap contents themselves cannot be fixed without rt rmap, so that will
come later.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:43 +0000 (10:03 -0800)]
xfs: create a new inode fork block unmap helper
Create a new helper to unmap blocks from an inode's fork.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:42 +0000 (10:03 -0800)]
xfs: repair the inode core and forks of a metadata inode
Add a helper function to repair the core and forks of a metadata inode,
so that we can get move onto the task of repairing higher level metadata
that lives in an inode.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:42 +0000 (10:03 -0800)]
xfs: always check the rtbitmap and rtsummary files
XFS filesystems always have a realtime bitmap and summary file, even if
there has never been a realtime volume attached. Always check them.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:41 +0000 (10:03 -0800)]
xfs: check rt summary file geometry more thoroughly
I forgot that the xfs_mount tracks the size and number of levels in the
realtime summary file, and that the rt summary file can have more blocks
mapped to the data fork than m_rsumsize implies if growfsrt fails.
So. Add to the rtsummary scrubber an explicit check that all the
summary geometry values are correct, then adjust the rtsummary i_size
checks to allow for the growfsrt failure case. Finally, flag post-eof
blocks in the summary file.
While we're at it, split the extent map checking so that we only call
xfs_bmapi_read once per extent instead of once per rtsummary block.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:41 +0000 (10:03 -0800)]
xfs: check rt bitmap file geometry more thoroughly
I forgot that the superblock tracks the number of blocks that are in the
realtime bitmap, and that the rt bitmap file can have more blocks mapped
to the data fork than sb_rbmblocks if growfsrt fails.
So. Add to the rtbitmap scrubber an explicit check that sb_rextents and
sb_rbmblocks are correct, then adjust the rtbitmap i_size checks to
allow for the growfsrt failure case. Finally, flag post-eof blocks in
the rtbitmap.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:40 +0000 (10:03 -0800)]
xfs: repair problems in CoW forks
Try to repair errors that we see in file CoW forks so that we don't do
stupid things like remap garbage into a file. There's not a lot we can
do with the COW fork -- the ondisk metadata record only that the COW
staging extents are owned by the refcount btree, which effectively means
that we can't reconstruct this incore structure from scratch.
Actually, this is even worse -- we can't touch written extents, because
those map space that are actively under writeback, and there's not much
to do with delalloc reservations. Hence we can only detect crosslinked
unwritten extents and fix them by punching out the problematic parts and
replacing them with delalloc extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:40 +0000 (10:03 -0800)]
xfs: create a ranged query function for refcount btrees
Implement ranged queries for refcount records. The next patch will use
this to scan refcount data.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:39 +0000 (10:03 -0800)]
xfs: refactor repair forcing tests into a repair.c helper
There are a couple of conditions that userspace can set to force repairs
of metadata. These really belong in the repair code and not open-coded
into the check code, so refactor them into a helper.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:39 +0000 (10:03 -0800)]
xfs: repair inode fork block mapping data structures
Use the reverse-mapping btree information to rebuild an inode block map.
Update the btree bulk loading code as necessary to support inode rooted
btrees and fix some bitrot problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:38 +0000 (10:03 -0800)]
xfs: skip the rmapbt search on an empty attr fork unless we know it was zapped
The attribute fork scrubber can optionally scan the reverse mapping
records of the filesystem to determine if the fork is missing mappings
that it should have. However, this is a very expensive operation, so we
only want to do this if we suspect that the fork is missing records.
For attribute forks the criteria for suspicion is that the attr fork is
in EXTENTS format and has zero extents.
However, there are several ways that a file can end up in this state
through regular filesystem usage. For example, an LSM can set a
s_security hook but then decide not to set an ACL; or an attr set can
create the attr fork but then the actual set operation fails with
ENOSPC; or we can delete all the attrs on a file whose data fork is in
btree format, in which case we do not delete the attr fork. We don't
want to run the expensive check for any case that can be arrived at
through regular operations.
However.
When online inode repair decides to zap an attribute fork, it cannot
determine if it is zapping ACL information. As a precaution it removes
all the discretionary access control permissions and sets the user and
group ids to zero. Check these three additional conditions to decide if
we want to scan the rmap records.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:38 +0000 (10:03 -0800)]
xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents
Back in commit
a55e07308831b ("xfs: only allow reaping of per-AG
blocks in xrep_reap_extents"), we removed from the reaping code the
ability to handle bmbt blocks. At the time, the reaping code only
walked single blocks, didn't correctly detect crosslinked blocks, and
the special casing made the function hard to understand. It was easier
to remove unneeded functionality prior to fixing all the bugs.
Now that we've fixed the problems, we want again the ability to reap
file metadata blocks. Reintroduce the per-file reaping functionality
atop the current implementation. We require that sc->sa is
uninitialized, so that we can use it to hold all the per-AG context for
a given extent.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:37 +0000 (10:03 -0800)]
xfs: abort directory parent scrub scans if we encounter a zapped directory
In a previous patch, we added some code to perform sufficient repairs
to an ondisk inode record such that the inode cache would be willing to
load the inode. If the broken inode was a shortform directory, it will
reset the directory to something plausible, which is to say an empty
subdirectory of the root. The telltale signs that something is
seriously wrong is the broken link count.
Such directories look clean, but they shouldn't participate in a
filesystem scan to find or confirm a directory parent pointer. Create a
predicate that identifies such directories and abort the scrub.
Found by fuzzing xfs/1554 with multithreaded xfs_scrub enabled and
u3.bmx[0].startblock = zeroes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:37 +0000 (10:03 -0800)]
xfs: zap broken inode forks
Determine if inode fork damage is responsible for the inode being unable
to pass the ifork verifiers in xfs_iget and zap the fork contents if
this is true. Once this is done the fork will be empty but we'll be
able to construct an in-core inode, and a subsequent call to the inode
fork repair ioctl will search the rmapbt to rebuild the records that
were in the fork.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:36 +0000 (10:03 -0800)]
xfs: repair inode records
If an inode is so badly damaged that it cannot be loaded into the cache,
fix the ondisk metadata and try again. If there /is/ a cached inode,
fix any problems and apply any optimizations that can be solved incore.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:35 +0000 (10:03 -0800)]
xfs: set inode sick state flags when we zap either ondisk fork
In a few patches, we'll add some online repair code that tries to
massage the ondisk inode record just enough to get it to pass the inode
verifiers so that we can continue with more file repairs. Part of that
massaging can include zapping the ondisk forks to clear errors. After
that point, the bmap fork repair functions will rebuild the zapped
forks.
Christoph asked for stronger protections against online repair zapping a
fork to get the inode to load vs. other threads trying to access the
partially repaired file. Do this by adding a special "[DA]FORK_ZAPPED"
inode health flag whenever repair zaps a fork, and sprinkling checks for
that flag into the various file operations for things that don't like
handling an unexpected zero-extents fork.
In practice xfs_scrub will scrub and fix the forks almost immediately
after zapping them, so the window is very small. However, if a crash or
unmount should occur, we can still detect these zapped inode forks by
looking for a zero-extents fork when data was expected.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:35 +0000 (10:03 -0800)]
xfs: dont cast to char * for XFS_DFORK_*PTR macros
Code in the next patch will assign the return value of XFS_DFORK_*PTR
macros to a struct pointer. gcc complains about casting char* strings
to struct pointers, so let's fix the macro's cast to void* to shut up
the warnings.
While we're at it, fix one of the scrub tests that uses PTR to use BOFF
instead for a simpler integer comparison, since other linters whine
about char* and void* comparisons.
Can't satisfy all these dman bots.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:34 +0000 (10:03 -0800)]
xfs: add missing nrext64 inode flag check to scrub
Add this missing check that the superblock nrext64 flag is set if the
inode flag is set.
Fixes: 9b7d16e34bbeb ("xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:34 +0000 (10:03 -0800)]
xfs: try to attach dquots to files before repairing them
Inode resource usage is tracked in the quota metadata. Repairing a file
might change the resources used by that file, which means that we need
to attach dquots to the file that we're examining before accessing
anything in the file protected by the ILOCK.
However, there's a twist: a dquot cache miss requires the dquot to be
read in from the quota file, during which we drop the ILOCK on the file
being examined. This means that we *must* try to attach the dquots
before taking the ILOCK.
Therefore, dquots must be attached to files in the scrub setup function.
If doing so yields corruption errors (or unknown dquot errors), we
instead clear the quotachecked status, which will cause a quotacheck on
next mount. A future series will make this trigger live quotacheck.
While we're here, change the xrep_ino_dqattach function to use the
unlocked dqattach functions so that we avoid cycling the ILOCK if the
inode already has dquots attached. This makes the naming and locking
requirements consistent with the rest of the filesystem.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:33 +0000 (10:03 -0800)]
xfs: repair refcount btrees
Reconstruct the refcount data from the rmap btree.
Link: https://docs.kernel.org/filesystems/xfs-online-fsck-design.html#case-study-rebuilding-the-space-reference-counts
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:33 +0000 (10:03 -0800)]
xfs: disable online repair quota helpers when quota not enabled
Don't compile the quota helper functions if quota isn't being built into
the XFS module.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:32 +0000 (10:03 -0800)]
xfs: repair inode btrees
Use the rmapbt to find inode chunks, query the chunks to compute hole
and free masks, and with that information rebuild the inobt and finobt.
Refer to the case study in
Documentation/filesystems/xfs-online-fsck-design.rst for more details.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:32 +0000 (10:03 -0800)]
xfs: repair free space btrees
Rebuild the free space btrees from the gaps in the rmap btree. Refer to
the case study in Documentation/filesystems/xfs-online-fsck-design.rst
for more details.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:31 +0000 (10:03 -0800)]
xfs: remove trivial bnobt/inobt scrub helpers
Christoph Hellwig complained about awkward code in the next two repair
patches such as:
sc->sm->sm_type = XFS_SCRUB_TYPE_BNOBT;
error = xchk_bnobt(sc);
This is a little silly, so let's export the xchk_{,i}allocbt functions
to the dispatch table in scrub.c directly and get rid of the helpers.
Originally I had planned each btree gets its own separate entry point,
but since repair doesn't work that way, it no longer makes sense to
complicate the call chain that way.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:31 +0000 (10:03 -0800)]
xfs: roll the scrub transaction after completing a repair
When we've finished repairing an AG header, roll the scrub transaction.
This ensure that any failures caused by defer ops failing are captured
by the xrep_done tracepoint and that any stacktraces that occur will
point to the repair code that caused it, instead of xchk_teardown.
Going forward, repair functions should commit the transaction if they're
going to return success. Usually the space reaping functions that run
after a successful atomic commit of the new metadata will take care of
that for us.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:30 +0000 (10:03 -0800)]
xfs: move the per-AG datatype bitmaps to separate files
Move struct xagb_bitmap to its own pair of C and header files per
request of Christoph.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:30 +0000 (10:03 -0800)]
xfs: create separate structures and code for u32 bitmaps
Create a version of the xbitmap that handles 32-bit integer intervals
and adapt the xfs_agblock_t bitmap to use it. This reduces the size of
the interval tree nodes from 48 to 36 bytes and enables us to use a more
efficient slab (:
0000040 instead of :
0000048) which allows us to pack
more nodes into a single slab page (102 vs 85).
As a side effect, the users of these bitmaps no longer have to convert
between u32 and u64 quantities just to use the bitmap; and the hairy
overflow checking code in xagb_bitmap_test goes away.
Later in this patchset we're going to add bitmaps for xfs_agino_t,
xfs_rgblock_t, and xfs_dablk_t, so the increase in code size (5622 vs.
9959 bytes) seems worth it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:29 +0000 (10:03 -0800)]
xfs: constrain dirty buffers while formatting a staged btree
Constrain the number of dirty buffers that are locked by the btree
staging code at any given time by establishing a threshold at which we
put them all on the delwri queue and push them to disk. This limits
memory consumption while writing out new btrees.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:29 +0000 (10:03 -0800)]
xfs: move btree bulkload record initialization to ->get_record implementations
When we're performing a bulk load of a btree, move the code that
actually stores the btree record in the new btree block out of the
generic code and into the individual ->get_record implementations.
This is preparation for being able to store multiple records with a
single indirect call.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:28 +0000 (10:03 -0800)]
xfs: add debug knobs to control btree bulk load slack factors
Add some debug knobs so that we can control the leaf and node block
slack when rebuilding btrees.
For developers, it might be useful to construct btrees of various
heights by crafting a filesystem with a certain number of records and
then using repair+knobs to rebuild the index with a certain shape.
Practically speaking, you'd only ever do that for extreme stress
testing of the runtime code or the btree generator.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:28 +0000 (10:03 -0800)]
xfs: read leaf blocks when computing keys for bulkloading into node blocks
When constructing a new btree, xfs_btree_bload_node needs to read the
btree blocks for level N to compute the keyptrs for the blocks that will
be loaded into level N+1. The level N blocks must be formatted at that
point.
A subsequent patch will change the btree bulkloader to write new btree
blocks in 256K chunks to moderate memory consumption if the new btree is
very large. As a consequence of that, it's possible that the buffers
for lower level blocks might have been reclaimed by the time the node
builder comes back to the block.
Therefore, change xfs_btree_bload_node to read the lower level blocks
to handle the reclaimed buffer case. As a side effect, the read will
increase the LRU refs, which will bias towards keeping new btree buffers
in memory after the new btree commits.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:27 +0000 (10:03 -0800)]
xfs: set XBF_DONE on newly formatted btree block that are ready for writing
The btree bulkloading code calls xfs_buf_delwri_queue_here when it has
finished formatting a new btree block and wants to queue it to be
written to disk. Once the new btree root has been committed, the blocks
(and hence the buffers) will be accessible to the rest of the
filesystem. Mark each new buffer as DONE when adding it to the delwri
list so that the next btree traversal can skip reloading the contents
from disk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 15 Dec 2023 18:03:27 +0000 (10:03 -0800)]
xfs: force all buffers to be written during btree bulk load
While stress-testing online repair of btrees, I noticed periodic
assertion failures from the buffer cache about buffers with incorrect
DELWRI_Q state. Looking further, I observed this race between the AIL
trying to write out a btree block and repair zapping a btree block after
the fact:
AIL: Repair0:
pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list
stale buf X:
clear DELWRI_Q
does not clear b_list
free space X
commit
delwri_submit # oops
Worse yet, I discovered that running the same repair over and over in a
tight loop can result in a second race that cause data integrity
problems with the repair:
AIL: Repair0: Repair1:
pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list
stale buf X:
clear DELWRI_Q
does not clear b_list
free space X
commit
find free space X
get buffer
rewrite buffer
delwri_queue:
set DELWRI_Q
already on a list, do not add
commit
BAD: committed tree root before all blocks written
delwri_submit # too late now
I traced this to my own misunderstanding of how the delwri lists work,
particularly with regards to the AIL's buffer list. If a buffer is
logged and committed, the buffer can end up on that AIL buffer list. If
btree repairs are run twice in rapid succession, it's possible that the
first repair will invalidate the buffer and free it before the next time
the AIL wakes up. Marking the buffer stale clears DELWRI_Q from the
buffer state without removing the buffer from its delwri list. The
buffer doesn't know which list it's on, so it cannot know which lock to
take to protect the list for a removal.
If the second repair allocates the same block, it will then recycle the
buffer to start writing the new btree block. Meanwhile, if the AIL
wakes up and walks the buffer list, it will ignore the buffer because it
can't lock it, and go back to sleep.
When the second repair calls delwri_queue to put the buffer on the
list of buffers to write before committing the new btree, it will set
DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
buffer list, it won't add it to the bulkload buffer's list.
This is incorrect, because the bulkload caller relies on delwri_submit
to ensure that all the buffers have been sent to disk /before/
committing the new btree root pointer. This ordering requirement is
required for data consistency.
Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
drop it, so the next thread to walk through the btree will trip over a
debug assertion on that flag.
To fix this, create a new function that waits for the buffer to be
removed from any other delwri lists before adding the buffer to the
caller's delwri list. By waiting for the buffer to clear both the
delwri list and any potential delwri wait list, we can be sure that
repair will initiate writes of all buffers and report all write errors
back to userspace instead of committing the new structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner [Thu, 14 Dec 2023 21:40:35 +0000 (08:40 +1100)]
xfs: initialise di_crc in xfs_log_dinode
Alexander Potapenko report that KMSAN was issuing these warnings:
kmalloc-ed xlog buffer of size 512 :
ffff88802fc26200
kmalloc-ed xlog buffer of size 368 :
ffff88802fc24a00
kmalloc-ed xlog buffer of size 648 :
ffff88802b631000
kmalloc-ed xlog buffer of size 648 :
ffff88802b632800
kmalloc-ed xlog buffer of size 648 :
ffff88802b631c00
xlog_write_iovec: copying 12 bytes from
ffff888017ddbbd8 to
ffff88802c300400
xlog_write_iovec: copying 28 bytes from
ffff888017ddbbe4 to
ffff88802c30040c
xlog_write_iovec: copying 68 bytes from
ffff88802fc26274 to
ffff88802c300428
xlog_write_iovec: copying 188 bytes from
ffff88802fc262bc to
ffff88802c30046c
=====================================================
BUG: KMSAN: uninit-value in xlog_write_iovec fs/xfs/xfs_log.c:2227
BUG: KMSAN: uninit-value in xlog_write_full fs/xfs/xfs_log.c:2263
BUG: KMSAN: uninit-value in xlog_write+0x1fac/0x2600 fs/xfs/xfs_log.c:2532
xlog_write_iovec fs/xfs/xfs_log.c:2227
xlog_write_full fs/xfs/xfs_log.c:2263
xlog_write+0x1fac/0x2600 fs/xfs/xfs_log.c:2532
xlog_cil_write_chain fs/xfs/xfs_log_cil.c:918
xlog_cil_push_work+0x30f2/0x44e0 fs/xfs/xfs_log_cil.c:1263
process_one_work kernel/workqueue.c:2630
process_scheduled_works+0x1188/0x1e30 kernel/workqueue.c:2703
worker_thread+0xee5/0x14f0 kernel/workqueue.c:2784
kthread+0x391/0x500 kernel/kthread.c:388
ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
Uninit was created at:
slab_post_alloc_hook+0x101/0xac0 mm/slab.h:768
slab_alloc_node mm/slub.c:3482
__kmem_cache_alloc_node+0x612/0xae0 mm/slub.c:3521
__do_kmalloc_node mm/slab_common.c:1006
__kmalloc+0x11a/0x410 mm/slab_common.c:1020
kmalloc ./include/linux/slab.h:604
xlog_kvmalloc fs/xfs/xfs_log_priv.h:704
xlog_cil_alloc_shadow_bufs fs/xfs/xfs_log_cil.c:343
xlog_cil_commit+0x487/0x4dc0 fs/xfs/xfs_log_cil.c:1574
__xfs_trans_commit+0x8df/0x1930 fs/xfs/xfs_trans.c:1017
xfs_trans_commit+0x30/0x40 fs/xfs/xfs_trans.c:1061
xfs_create+0x15af/0x2150 fs/xfs/xfs_inode.c:1076
xfs_generic_create+0x4cd/0x1550 fs/xfs/xfs_iops.c:199
xfs_vn_create+0x4a/0x60 fs/xfs/xfs_iops.c:275
lookup_open fs/namei.c:3477
open_last_lookups fs/namei.c:3546
path_openat+0x29ac/0x6180 fs/namei.c:3776
do_filp_open+0x24d/0x680 fs/namei.c:3809
do_sys_openat2+0x1bc/0x330 fs/open.c:1440
do_sys_open fs/open.c:1455
__do_sys_openat fs/open.c:1471
__se_sys_openat fs/open.c:1466
__x64_sys_openat+0x253/0x330 fs/open.c:1466
do_syscall_x64 arch/x86/entry/common.c:51
do_syscall_64+0x4f/0x140 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b arch/x86/entry/entry_64.S:120
Bytes 112-115 of 188 are uninitialized
Memory access of size 188 starts at
ffff88802fc262bc
This is caused by the struct xfs_log_dinode not having the di_crc
field initialised. Log recovery never uses this field (it is only
present these days for on-disk format compatibility reasons) and so
it's value is never checked so nothing in XFS has caught this.
Further, none of the uninitialised memory access warning tools have
caught this (despite catching other uninit memory accesses in the
struct xfs_log_dinode back in 2017!) until recently. Alexander
annotated the XFS code to get the dump of the actual bytes that were
detected as uninitialised, and from that report it took me about 30s
to realise what the issue was.
The issue was introduced back in 2016 and every inode that is logged
fails to initialise this field. This is no actual bad behaviour
caused by this issue - I find it hard to even classify it as a
bug...
Reported-and-tested-by: Alexander Potapenko <glider@google.com>
Fixes: f8d55aa0523a ("xfs: introduce inode log format object")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Darrick J. Wong [Thu, 14 Dec 2023 21:38:45 +0000 (13:38 -0800)]
xfs: fix an off-by-one error in xreap_agextent_binval
Overall, this function tries to find and invalidate all buffers for a
given extent of space on the data device. The inner for loop in this
function tries to find all xfs_bufs for a given daddr. The lengths of
all possible cached buffers range from 1 fsblock to the largest needed
to contain a 64k xattr value (~17fsb). The scan is capped to avoid
looking at anything buffer going past the given extent.
Unfortunately, the loop continuation test is wrong -- max_fsbs is the
largest size we want to scan, not one past that. Put another way, this
loop is actually 1-indexed, not 0-indexed. Therefore, the continuation
test should use <=, not <.
As a result, online repairs of btree blocks fails to stale any buffers
for btrees that are being torn down, which causes later assertions in
the buffer cache when another thread creates a different-sized buffer.
This happens in xfs/709 when allocating an inode cluster buffer:
------------[ cut here ]------------
WARNING: CPU: 0 PID:
3346128 at fs/xfs/xfs_message.c:104 assfail+0x3a/0x40 [xfs]
CPU: 0 PID:
3346128 Comm: fsstress Not tainted 6.7.0-rc4-djwx #rc4
RIP: 0010:assfail+0x3a/0x40 [xfs]
Call Trace:
<TASK>
_xfs_buf_obj_cmp+0x4a/0x50
xfs_buf_get_map+0x191/0xba0
xfs_trans_get_buf_map+0x136/0x280
xfs_ialloc_inode_init+0x186/0x340
xfs_ialloc_ag_alloc+0x254/0x720
xfs_dialloc+0x21f/0x870
xfs_create_tmpfile+0x1a9/0x2f0
xfs_rename+0x369/0xfd0
xfs_vn_rename+0xfa/0x170
vfs_rename+0x5fb/0xc30
do_renameat2+0x52d/0x6e0
__x64_sys_renameat2+0x4b/0x60
do_syscall_64+0x3b/0xe0
entry_SYSCALL_64_after_hwframe+0x46/0x4e
A later refactoring patch in the online repair series fixed this by
accident, which is why I didn't notice this until I started testing only
the patches that are likely to end up in 6.8.
Fixes: 1c7ce115e521 ("xfs: reap large AG metadata extents when possible")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Eric Sandeen [Thu, 14 Dec 2023 19:28:08 +0000 (13:28 -0600)]
xfs: short circuit xfs_growfs_data_private() if delta is zero
Although xfs_growfs_data() doesn't call xfs_growfs_data_private()
if in->newblocks == mp->m_sb.sb_dblocks, xfs_growfs_data_private()
further massages the new block count so that we don't i.e. try
to create a too-small new AG.
This may lead to a delta of "0" in xfs_growfs_data_private(), so
we end up in the shrink case and emit the EXPERIMENTAL warning
even if we're not changing anything at all.
Fix this by returning straightaway if the block delta is zero.
(nb: in older kernels, the result of entering the shrink case
with delta == 0 may actually let an -ENOSPC escape to userspace,
which is confusing for users.)
Fixes: fb2fc1720185 ("xfs: support shrinking unused space in the last AG")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Wed, 13 Dec 2023 09:06:33 +0000 (10:06 +0100)]
xfs: pass the defer ops directly to xfs_defer_add
Pass a pointer to the xfs_defer_op_type structure to xfs_defer_add and
remove the indirection through the xfs_defer_ops_type enum and a global
table of all possible operations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Thu, 14 Dec 2023 05:16:32 +0000 (06:16 +0100)]
xfs: pass the defer ops instead of type to xfs_defer_start_recovery
xfs_defer_start_recovery is only called from xlog_recover_intent_item,
and the callers of that all have the actual xfs_defer_ops_type operation
vector at hand. Pass that directly instead of looking it up from the
defer_op_types table.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Wed, 13 Dec 2023 09:06:31 +0000 (10:06 +0100)]
xfs: store an ops pointer in struct xfs_defer_pending
The dfp_type field in struct xfs_defer_pending is only used to either
look up the operations associated with the pending word or in trace
points. Replace it with a direct pointer to the operations vector,
and store a pretty name in the vector for tracing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Wed, 13 Dec 2023 09:06:30 +0000 (10:06 +0100)]
xfs: move xfs_attr_defer_type up in xfs_attr_item.c
We'll reference it directly in xlog_recover_attri_commit_pass2, so move
it up a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Wed, 13 Dec 2023 09:06:29 +0000 (10:06 +0100)]
xfs: consolidate the xfs_attr_defer_* helpers
Consolidate the xfs_attr_defer_* helpers into a single xfs_attr_defer_add
one that picks the right dela_state based on the passed in operation.
Also move to a single trace point as the actual operation is visible
through the flags in the delta_state passed to the trace point.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Chandan Babu R [Thu, 14 Dec 2023 05:20:35 +0000 (10:50 +0530)]
Merge tag 'fix-growfsrt-failures-6.8_2023-12-13' of https://git./linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeB
xfs: fix growfsrt failure during rt volume attach
One more series to fix a transaction reservation overrun while
trying to attach a very large rt volume to a filesystem.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'fix-growfsrt-failures-6.8_2023-12-13' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: recompute growfsrtfree transaction reservation while growing rt volume
Darrick J. Wong [Mon, 11 Dec 2023 18:41:51 +0000 (10:41 -0800)]
xfs: recompute growfsrtfree transaction reservation while growing rt volume
While playing with growfs to create a 20TB realtime section on a
filesystem that didn't previously have an rt section, I noticed that
growfs would occasionally shut down the log due to a transaction
reservation overflow.
xfs_calc_growrtfree_reservation uses the current size of the realtime
summary file (m_rsumsize) to compute the transaction reservation for a
growrtfree transaction. The reservations are computed at mount time,
which means that m_rsumsize is zero when growfs starts "freeing" the new
realtime extents into the rt volume. As a result, the transaction is
undersized and fails.
Fix this by recomputing the transaction reservations every time we
change m_rsumsize.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Mon, 4 Dec 2023 20:07:19 +0000 (21:07 +0100)]
xfs: move xfs_ondisk.h to libxfs/
Move xfs_ondisk.h to libxfs so that we can do the struct sanity checks
in userspace libxfs as well. This should allow us to retire the
somewhat fragile xfs/122 test on xfstests.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 4 Dec 2023 20:07:18 +0000 (21:07 +0100)]
xfs: use static_assert to check struct sizes and offsets
Use the compiler-provided static_assert built-in from C11 instead of
the kernel-specific BUILD_BUG_ON_MSG for the structure size and offset
checks in xfs_ondisk. This not only gives slightly nicer error messages
in case things go south, but can also be trivially used as-is in
userspace.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Zhang Tianci [Tue, 5 Dec 2023 05:59:00 +0000 (13:59 +0800)]
xfs: extract xfs_da_buf_copy() helper function
This patch does not modify logic.
xfs_da_buf_copy() will copy one block from src xfs_buf to
dst xfs_buf, and update the block metadata in dst directly.
Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Zhang Tianci [Tue, 5 Dec 2023 05:58:59 +0000 (13:58 +0800)]
xfs: update dir3 leaf block metadata after swap
xfs_da3_swap_lastblock() copy the last block content to the dead block,
but do not update the metadata in it. We need update some metadata
for some kinds of type block, such as dir3 leafn block records its
blkno, we shall update it to the dead block blkno. Otherwise,
before write the xfs_buf to disk, the verify_write() will fail in
blk_hdr->blkno != xfs_buf->b_bn, then xfs will be shutdown.
We will get this warning:
XFS (dm-0): Metadata corruption detected at xfs_dir3_leaf_verify+0xa8/0xe0 [xfs], xfs_dir3_leafn block 0x178
XFS (dm-0): Unmount and run xfs_repair
XFS (dm-0): First 128 bytes of corrupted metadata buffer:
00000000e80f1917: 00 80 00 0b 00 80 00 07 3d ff 00 00 00 00 00 00 ........=.......
000000009604c005: 00 00 00 00 00 00 01 a0 00 00 00 00 00 00 00 00 ................
000000006b6fb2bf: e4 44 e3 97 b5 64 44 41 8b 84 60 0e 50 43 d9 bf .D...dDA..`.PC..
00000000678978a2: 00 00 00 00 00 00 00 83 01 73 00 93 00 00 00 00 .........s......
00000000b28b247c: 99 29 1d 38 00 00 00 00 99 29 1d 40 00 00 00 00 .).8.....).@....
000000002b2a662c: 99 29 1d 48 00 00 00 00 99 49 11 00 00 00 00 00 .).H.....I......
00000000ea2ffbb8: 99 49 11 08 00 00 45 25 99 49 11 10 00 00 48 fe .I....E%.I....H.
0000000069e86440: 99 49 11 18 00 00 4c 6b 99 49 11 20 00 00 4d 97 .I....Lk.I. ..M.
XFS (dm-0): xfs_do_force_shutdown(0x8) called from line 1423 of file fs/xfs/xfs_buf.c. Return address =
00000000c0ff63c1
XFS (dm-0): Corruption of in-memory data detected. Shutting down filesystem
XFS (dm-0): Please umount the filesystem and rectify the problem(s)
>From the log above, we know xfs_buf->b_no is 0x178, but the block's hdr record
its blkno is 0x1a0.
Fixes: 24df33b45ecf ("xfs: add CRC checking to dir2 leaf blocks")
Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Jiachen Zhang [Tue, 5 Dec 2023 05:58:58 +0000 (13:58 +0800)]
xfs: ensure logflagsp is initialized in xfs_bmap_del_extent_real
In the case of returning -ENOSPC, ensure logflagsp is initialized by 0.
Otherwise the caller __xfs_bunmapi will set uninitialized illegal
tmp_logflags value into xfs log, which might cause unpredictable error
in the log recovery procedure.
Also, remove the flags variable and set the *logflagsp directly, so that
the code should be more robust in the long run.
Fixes: 1b24b633aafe ("xfs: move some more code into xfs_bmap_del_extent_real")
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 4 Dec 2023 17:40:57 +0000 (18:40 +0100)]
xfs: clean up xfs_fsops.h
Use struct types instead of typedefs so that the header can be included
with pulling in the headers that define the typedefs, and remove the
pointless externs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 4 Dec 2023 17:40:56 +0000 (18:40 +0100)]
xfs: clean up the xfs_reserve_blocks interface
xfs_reserve_blocks has a very odd interface that can only be explained
by it directly deriving from the IRIX fcntl handler back in the day.
Split reporting out the reserved blocks out of xfs_reserve_blocks into
the only caller that cares. This means that the value reported from
XFS_IOC_SET_RESBLKS isn't atomically sampled in the same critical
section as when it was set anymore, but as the values could change
right after setting them anyway that does not matter. It does
provide atomic sampling of both values for XFS_IOC_GET_RESBLKS now,
though.
Also pass a normal scalar integer value for the requested value instead
of the pointless pointer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 4 Dec 2023 17:40:55 +0000 (18:40 +0100)]
xfs: clean up the XFS_IOC_FSCOUNTS handler
Split XFS_IOC_FSCOUNTS out of the main xfs_file_ioctl function, and
merge the xfs_fs_counts helper into the ioctl handler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Mon, 4 Dec 2023 17:40:54 +0000 (18:40 +0100)]
xfs: clean up the XFS_IOC_{GS}ET_RESBLKS handler
The XFS_IOC_GET_RESBLKS and XFS_IOC_SET_RESBLKS already share a fair
amount of code, and will share even more soon. Move the logic for both
of them out of the main xfs_file_ioctl function into a
xfs_ioctl_getset_resblocks helper to share the code and prepare for
additional changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Bagas Sanjaya [Wed, 29 Nov 2023 12:39:47 +0000 (19:39 +0700)]
Documentation: xfs: consolidate XFS docs into its own subdirectory
XFS docs are currently in upper-level Documentation/filesystems.
Although these are currently 4 docs, they are already outstanding as
a group and can be moved to its own subdirectory.
Consolidate them into Documentation/filesystems/xfs/.
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Shiyang Ruan [Mon, 23 Oct 2023 07:20:46 +0000 (15:20 +0800)]
mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
Now, if we suddenly remove a PMEM device(by calling unbind) which
contains FSDAX while programs are still accessing data in this device,
e.g.:
```
$FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
# $FSX_PROG -N
1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
```
it could come into an unacceptable state:
1. device has gone but mount point still exists, and umount will fail
with "target is busy"
2. programs will hang and cannot be killed
3. may crash with NULL pointer dereference
To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
are going to remove the whole device, and make sure all related processes
could be notified so that they could end up gracefully.
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1]. With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
on it to unmap all files in use, and notify processes who are using
those files.
Call trace:
trigger unbind
-> unbind_store()
-> ... (skip)
-> devres_release_all()
-> kill_dax()
-> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
-> xfs_dax_notify_failure()
`-> freeze_super() // freeze (kernel call)
`-> do xfs rmap
` -> mf_dax_kill_procs()
` -> collect_procs_fsdax() // all associated processes
` -> unmap_and_kill()
` -> invalidate_inode_pages2_range() // drop file's cache
`-> thaw_super() // thaw (both kernel & user call)
Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
new dax mapping from being created. Do not shutdown filesystem directly
if configuration is not supported, or if failure range includes metadata
area. Make sure all files and processes(not only the current progress)
are handled correctly. Also drop the cache of associated files before
pmem is removed.
[1]: https://lore.kernel.org/linux-mm/
161604050314.
1463742.
14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/
169116275623.
3187159.
16862410128731457358.stg-ugh@frogsfrogsfrogs/
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Chandan Babu R [Thu, 7 Dec 2023 08:48:33 +0000 (14:18 +0530)]
Merge tag 'repair-auto-reap-space-reservations-6.8_2023-12-06' of https://git./linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA
xfs: reserve disk space for online repairs
Online repair fixes metadata structures by writing a new copy out to
disk and atomically committing the new structure into the filesystem.
For this to work, we need to reserve all the space we're going to need
ahead of time so that the atomic commit transaction is as small as
possible. We also require the reserved space to be freed if the system
goes down, or if we decide not to commit the repair, or if we reserve
too much space.
To keep the atomic commit transaction as small as possible, we would
like to allocate some space and simultaneously schedule automatic
reaping of the reserved space, even on log recovery. EFIs are the
mechanism to get us there, but we need to use them in a novel manner.
Once we allocate the space, we want to hold on to the EFI (relogging as
necessary) until we can commit or cancel the repair. EFIs for written
committed blocks need to go away, but unwritten or uncommitted blocks
can be freed like normal.
Earlier versions of this patchset directly manipulated the log items,
but Dave thought that to be a layering violation. For v27, I've
modified the defer ops handling code to be capable of pausing a deferred
work item. Log intent items are created as they always have been, but
paused items are pushed onto a side list when finishing deferred work
items, and pushed back onto the transaction after that. Log intent done
item are not created for paused work.
The second part adds a "stale" flag to the EFI so that the repair
reservation code can dispose of an EFI the normal way, but without the
space actually being freed.
This has been lightly tested with fstests. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'repair-auto-reap-space-reservations-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: force small EFIs for reaping btree extents
xfs: log EFIs for all btree blocks being used to stage a btree
xfs: implement block reservation accounting for btrees we're staging
xfs: remove unused fields from struct xbtree_ifakeroot
xfs: automatic freeing of freshly allocated unwritten space
xfs: remove __xfs_free_extent_later
xfs: allow pausing of pending deferred work items
xfs: don't append work items to logged xfs_defer_pending objects
Chandan Babu R [Thu, 7 Dec 2023 08:45:26 +0000 (14:15 +0530)]
Merge tag 'scrub-livelock-prevention-6.8_2023-12-06' of https://git./linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA
xfs: prevent livelocks in xchk_iget
Prevent scrub from live locking in xchk_iget if there's a cycle in the
inobt by allocating an empty transaction.
This has been lightly tested with fstests. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'scrub-livelock-prevention-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: make xchk_iget safer in the presence of corrupt inode btrees
Chandan Babu R [Thu, 7 Dec 2023 08:40:55 +0000 (14:10 +0530)]
Merge tag 'defer-elide-create-done-6.8_2023-12-06' of https://git./linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA
xfs: elide defer work ->create_done if no intent
Christoph pointed out that the defer ops machinery doesn't need to call
->create_done if the deferred work item didn't generate a log intent
item in the first place. Let's clean that up and save an indirect call
in the non-logged xattr update call path.
This has been lightly tested with fstests. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'defer-elide-create-done-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: elide ->create_done calls for unlogged deferred work
xfs: document what LARP means
Chandan Babu R [Thu, 7 Dec 2023 08:33:11 +0000 (14:03 +0530)]
Merge tag 'fix-rtmount-overflows-6.8_2023-12-06' of https://git./linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA
xfs: fix realtime geometry integer overflows
While reading through the realtime geometry support code in xfsprogs, I
noticed a discrepancy between the sb_rextslog computation used when
writing out the superblock during mkfs and the validation code used in
xfs_repair. This discrepancy would lead to system failure for a runt rt
volume having more than 1 rt block but zero rt extents in length. Most
people aren't going to configure a 1M extent size for their 360k rt
floppy disk volume, but I did!
In the process of studying that code, it occurred to me that there is a
second bug in the computation -- the use of highbit32 for a 64-bit
value means that the upper 32 bits are not considered in the search for
a high bit. This causes the creation of a realtime summary file that is
the wrong length. If rextents is a multiple of U32_MAX then this will
appear to work fine because highbit32 returns -1 for an input of 0; but
for all other cases the rt summary is undersized, leading to failures.
Fix the first problem by standardizing the computation with a helper in
libxfs; and the second problem by correcting the computation. This will
cause any existing rt volumes larger than 2^32 blocks to fail validation
but they probably were already crashing anyway.
This has been lightly tested with fstests. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'fix-rtmount-overflows-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: don't allow overly small or large realtime volumes
xfs: fix 32-bit truncation in xfs_compute_rextslog
xfs: make rextslog computation consistent with mkfs