btrfs: raid56: extra debugging for raid6 syndrome generation
[BUG]
I have got at least two crash report for RAID6 syndrome generation, no
matter if it's AVX2 or SSE2, they all seems to have a similar
calltrace with corrupted RAX:
BUG: kernel NULL pointer dereference, address:
0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
Workqueue: btrfs-rmw rmw_rbio_work [btrfs]
RIP: 0010:raid6_sse21_gen_syndrome+0x9e/0x130 [raid6_pq]
RAX:
0000000000000000 RBX:
0000000000001000 RCX:
ffffa0ff4cfa3248
RDX:
0000000000000000 RSI:
ffffa0f74cfa3238 RDI:
0000000000000000
Call Trace:
<TASK>
rmw_rbio+0x5c8/0xa80 [btrfs]
process_one_work+0x1c7/0x3d0
worker_thread+0x4d/0x380
kthread+0xf3/0x120
ret_from_fork+0x2c/0x50
</TASK>
[CAUSE]
The cause is not known. Recently I also hit this in AVX512 path, and
that's even in v5.15 backport, which doesn't have any of my RAID56
rework.
Furthermore according to the registers:
RAX:
0000000000000000 RBX:
0000000000001000 RCX:
ffffa0ff4cfa3248
The RAX register is showing the number of stripes (including PQ), which
is not correct (0). But the remaining two registers are all sane.
- RBX is the sectorsize
For x86_64 it should always be 4K and matches the output.
- RCX is the pointers array
Which is from rbio->finish_pointers, and it looks like a sane
kernel address.
[WORKAROUND]
For now, I can only add extra debug ASSERT()s before we call raid6
gen_syndrome() helper and hopes to catch the problem.
The debug requires both CONFIG_BTRFS_DEBUG and CONFIG_BTRFS_ASSERT
enabled.
My current guess is some use-after-free, but every report is only having
corrupted RAX but seemingly valid pointers doesn't make much sense.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>