Tom Rix [Mon, 27 Mar 2023 13:23:24 +0000 (09:23 -0400)]
md/raid5: remove unused working_disks variable
clang with W=1 reports
drivers/md/raid5.c:7719:6: error: variable 'working_disks'
set but not used [-Werror,-Wunused-but-set-variable]
int working_disks = 0;
^
This variable is not used so remove it.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230327132324.1769595-1-trix@redhat.com
Yu Kuai [Tue, 14 Mar 2023 01:22:58 +0000 (09:22 +0800)]
md/raid10: don't call bio_start_io_acct twice for bio which experienced read error
handle_read_error() will resumit r10_bio by raid10_read_request(), which
will call bio_start_io_acct() again, while bio_end_io_acct() will only
be called once.
Fix the problem by don't account io again from handle_read_error().
Fixes: 528bc2cf2fcc ("md/raid10: enable io accounting")
Suggested-by: Song Liu <song@kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230314012258.2395894-1-yukuai1@huaweicloud.com
Yu Kuai [Fri, 10 Mar 2023 07:38:55 +0000 (15:38 +0800)]
md/raid10: fix memleak of md thread
In raid10_run(), if setup_conf() succeed and raid10_run() failed before
setting 'mddev->thread', then in the error path 'conf->thread' is not
freed.
Fix the problem by setting 'mddev->thread' right after setup_conf().
Fixes: 43a521238aca ("md-cluster: choose correct label when clustered layout is not supported")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-7-yukuai1@huaweicloud.com
Yu Kuai [Fri, 10 Mar 2023 07:38:54 +0000 (15:38 +0800)]
md/raid10: fix memleak for 'conf->bio_split'
In the error path of raid10_run(), 'conf' need be freed, however,
'conf->bio_split' is missed and memory will be leaked.
Since there are 3 places to free 'conf', factor out a helper to fix the
problem.
Fixes: fc9977dd069e ("md/raid10: simplify the splitting of requests.")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-6-yukuai1@huaweicloud.com
Yu Kuai [Fri, 10 Mar 2023 07:38:53 +0000 (15:38 +0800)]
md/raid10: fix leak of 'r10bio->remaining' for recovery
raid10_sync_request() will add 'r10bio->remaining' for both rdev and
replacement rdev. However, if the read io fails, recovery_request_write()
returns without issuing the write io, in this case, end_sync_request()
is only called once and 'remaining' is leaked, cause an io hang.
Fix the problem by decreasing 'remaining' according to if 'bio' and
'repl_bio' is valid.
Fixes: 24afd80d99f8 ("md/raid10: handle recovery of replacement devices.")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-5-yukuai1@huaweicloud.com
Yu Kuai [Fri, 10 Mar 2023 07:38:52 +0000 (15:38 +0800)]
md/raid10: don't BUG_ON() in raise_barrier()
If raise_barrier() is called the first time in raid10_sync_request(), which
means the first non-normal io is handled, raise_barrier() should wait for
all dispatched normal io to be done. This ensures that normal io won't
starve.
However, BUG_ON() if this is broken is too aggressive. This patch replace
BUG_ON() with WARN and fall back to not force.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-4-yukuai1@huaweicloud.com
Yu Kuai [Fri, 10 Mar 2023 07:38:51 +0000 (15:38 +0800)]
md: fix soft lockup in status_resync
status_resync() will calculate 'curr_resync - recovery_active' to show
user a progress bar like following:
[============>........] resync = 61.4%
'curr_resync' and 'recovery_active' is updated in md_do_sync(), and
status_resync() can read them concurrently, hence it's possible that
'curr_resync - recovery_active' can overflow to a huge number. In this
case status_resync() will be stuck in the loop to print a large amount
of '=', which will end up soft lockup.
Fix the problem by setting 'resync' to MD_RESYNC_ACTIVE in this case,
this way resync in progress will be reported to user.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-3-yukuai1@huaweicloud.com
Mariusz Tkaczyk [Mon, 6 Mar 2023 13:03:17 +0000 (14:03 +0100)]
md: add error_handlers for raid0 and linear
After the commit
9631abdbf406c("md: Set MD_BROKEN for RAID1 and RAID10")
MD_BROKEN must be set if array is failed because state_store() checks it.
If it is set then -EBUSY is returned to userspace.
For raid0 and linear MD_BROKEN is not set by error_handler(). As a result
mdadm is unable to trigger clean-up actions. It is a regression.
This patch adds appropriate error_handler for raid0 and linear. The
error handler sets MD_BROKEN for this device.
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230306130317.3418-1-mariusz.tkaczyk@linux.intel.com
Jon Derrick [Fri, 24 Feb 2023 18:33:23 +0000 (11:33 -0700)]
md: Use optimal I/O size for last bitmap page
If the bitmap space has enough room, size the I/O for the last bitmap
page write to the optimal I/O size for the storage device. The expanded
write is checked that it won't overrun the data or metadata.
The drive this was tested against has higher latencies when there are
sub-4k writes due to device-side read-mod-writes of its atomic 4k write
unit. This change helps increase performance by sizing the last bitmap
page I/O for the device's preferred write unit, if it is given.
Example Intel/Solidigm P5520
Raid10, Chunk-size 64M, bitmap-size 57228 bits
$ mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/nvme{0,1,2,3}n1
--assume-clean --bitmap=internal --bitmap-chunk=64M
$ fio --name=test --direct=1 --filename=/dev/md0 --rw=randwrite --bs=4k --runtime=60
Without patch:
write: IOPS=1676, BW=6708KiB/s (6869kB/s)(393MiB/60001msec); 0 zone resets
With patch:
write: IOPS=15.7k, BW=61.4MiB/s (64.4MB/s)(3683MiB/60001msec); 0 zone resets
Biosnoop:
Without patch:
Time Process PID Device LBA Size Lat
1.410377 md0_raid10 6900 nvme0n1 W 16 4096 0.02
1.410387 md0_raid10 6900 nvme2n1 W 16 4096 0.02
1.410374 md0_raid10 6900 nvme3n1 W 16 4096 0.01
1.410381 md0_raid10 6900 nvme1n1 W 16 4096 0.02
1.410411 md0_raid10 6900 nvme1n1 W
115346512 4096 0.01
1.410418 md0_raid10 6900 nvme0n1 W
115346512 4096 0.02
1.410915 md0_raid10 6900 nvme2n1 W 24 3584 0.43 <--
1.410935 md0_raid10 6900 nvme3n1 W 24 3584 0.45 <--
1.411124 md0_raid10 6900 nvme1n1 W 24 3584 0.64 <--
1.411147 md0_raid10 6900 nvme0n1 W 24 3584 0.66 <--
1.411176 md0_raid10 6900 nvme3n1 W
2019022184 4096 0.01
1.411189 md0_raid10 6900 nvme2n1 W
2019022184 4096 0.02
With patch:
Time Process PID Device LBA Size Lat
5.747193 md0_raid10 727 nvme0n1 W 16 4096 0.01
5.747192 md0_raid10 727 nvme1n1 W 16 4096 0.02
5.747195 md0_raid10 727 nvme3n1 W 16 4096 0.01
5.747202 md0_raid10 727 nvme2n1 W 16 4096 0.02
5.747229 md0_raid10 727 nvme3n1 W
1196223704 4096 0.02
5.747224 md0_raid10 727 nvme0n1 W
1196223704 4096 0.01
5.747279 md0_raid10 727 nvme0n1 W 24 4096 0.01 <--
5.747279 md0_raid10 727 nvme1n1 W 24 4096 0.02 <--
5.747284 md0_raid10 727 nvme3n1 W 24 4096 0.02 <--
5.747291 md0_raid10 727 nvme2n1 W 24 4096 0.02 <--
5.747314 md0_raid10 727 nvme2n1 W
2234636712 4096 0.01
5.747317 md0_raid10 727 nvme1n1 W
2234636712 4096 0.02
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230224183323.638-4-jonathan.derrick@linux.dev
Jon Derrick [Fri, 24 Feb 2023 18:33:22 +0000 (11:33 -0700)]
md: Fix types in sb writer
Page->index is a pgoff_t and multiplying could cause overflows on a
32-bit architecture. In the sb writer, this is used to calculate and
verify the sector being used, and is multiplied by a sector value. Using
sector_t will cast it to a u64 type and is the more appropriate type for
the unit. Additionally, the integer size unit is converted to a sector
unit in later calculations, and is now corrected to be an unsigned type.
Finally, clean up the calculations using variable aliases to improve
readabiliy.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230224183323.638-3-jonathan.derrick@linux.dev
Jon Derrick [Fri, 24 Feb 2023 18:33:21 +0000 (11:33 -0700)]
md: Move sb writer loop to its own function
Preparatory patch for optimal I/O size calculation. Move the sb writer
loop routine into its own function for clarity.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230224183323.638-2-jonathan.derrick@linux.dev
Jiangshan Yi [Tue, 14 Feb 2023 06:40:13 +0000 (14:40 +0800)]
md/raid10: Fix typo in comment (replacment -> replacement)
Replace replacment with replacement.
Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230214064013.2373851-1-yijiangshan@kylinos.cn
Thomas Weißschuh [Tue, 14 Feb 2023 03:19:22 +0000 (03:19 +0000)]
md: make kobj_type structures constant
Since commit
ee6d3dd4ed48 ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.
Take advantage of this to constify the structure definitions to prevent
modification at runtime.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230214-kobj_type-md-v1-1-d6853f707f11@weissschuh.net
Li Nan [Wed, 22 Feb 2023 04:10:00 +0000 (12:10 +0800)]
md/raid10: fix null-ptr-deref in raid10_sync_request
init_resync() inits mempool and sets conf->have_replacemnt at the beginning
of sync, close_sync() frees the mempool when sync is completed.
After [1] recovery might be skipped and init_resync() is called but
close_sync() is not. null-ptr-deref occurs with r10bio->dev[i].repl_bio.
The following is one way to reproduce the issue.
1) create a array, wait for resync to complete, mddev->recovery_cp is set
to MaxSector.
2) recovery is woken and it is skipped. conf->have_replacement is set to
0 in init_resync(). close_sync() not called.
3) some io errors and rdev A is set to WantReplacement.
4) a new device is added and set to A's replacement.
5) recovery is woken, A have replacement, but conf->have_replacemnt is
0. r10bio->dev[i].repl_bio will not be alloced and null-ptr-deref
occurs.
Fix it by not calling init_resync() if recovery skipped.
[1] commit
7e83ccbecd60 ("md/raid10: Allow skipping recovery when clean arrays are assembled")
Fixes: 7e83ccbecd60 ("md/raid10: Allow skipping recovery when clean arrays are assembled")
Cc: stable@vger.kernel.org
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230222041000.3341651-3-linan666@huaweicloud.com
Li Nan [Wed, 22 Feb 2023 04:09:59 +0000 (12:09 +0800)]
md/raid10: fix task hung in raid10d
commit
fe630de009d0 ("md/raid10: avoid deadlock on recovery.") allowed
normal io and sync io to exist at the same time. Task hung will occur as
below:
T1 T2 T3 T4
raid10d
handle_read_error
allow_barrier
conf->nr_pending--
-> 0
//submit sync io
raid10_sync_request
raise_barrier
->will not be blocked
...
//submit to drivers
raid10_read_request
wait_barrier
conf->nr_pending++
-> 1
//retry read fail
raid10_end_read_request
reschedule_retry
add to retry_list
conf->nr_queued++
-> 1
//sync io fail
end_sync_read
__end_sync_read
reschedule_retry
add to retry_list
conf->nr_queued++
-> 2
...
handle_read_error
get form retry_list
conf->nr_queued--
freeze_array
wait nr_pending == nr_queued+1
->1 ->2
//task hung
retry read and sync io will be added to retry_list(nr_queued->2) if they
fails. raid10d() called handle_read_error() and hung in freeze_array().
nr_queued will not decrease because raid10d is blocked, nr_pending will
not increase because conf->barrier is not released.
Fix it by moving allow_barrier() after raid10_read_request().
raise_barrier() will wait for nr_waiting to become 0. Therefore, sync io
and regular io will not be issued at the same time.
Also remove the check of nr_queued in stop_waiting_barrier. It can be 0
but don't need to be blocking. Remove the check for MD_RECOVERY_RUNNING as
the check is redundent.
Fixes: fe630de009d0 ("md/raid10: avoid deadlock on recovery.")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230222041000.3341651-2-linan666@huaweicloud.com
Akinobu Mita [Mon, 27 Mar 2023 14:37:33 +0000 (23:37 +0900)]
block: null_blk: make fault-injection dynamically configurable per device
The null_blk driver has multiple driver-specific fault injection
mechanisms. Each fault injection configuration can only be specified by a
module parameter and cannot be reconfigured without reloading the driver.
Also, each configuration is common to all devices and is initialized every
time a new device is added.
This change adds the following subdirectories for each null_blk device.
/sys/kernel/config/nullb/<disk>/timeout_inject
/sys/kernel/config/nullb/<disk>/requeue_inject
/sys/kernel/config/nullb/<disk>/init_hctx_fault_inject
Each fault injection attribute can be dynamically set per device by a
corresponding file in these directories.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Link: https://lore.kernel.org/r/20230327143733.14599-3-akinobu.mita@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Akinobu Mita [Mon, 27 Mar 2023 14:37:32 +0000 (23:37 +0900)]
fault-inject: allow configuration via configfs
This provides a helper function to allow configuration of fault-injection
for configfs-based drivers.
The config items created by this function have the same interface as the
one created under debugfs by fault_create_debugfs_attr().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Link: https://lore.kernel.org/r/20230327143733.14599-2-akinobu.mita@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:06:51 +0000 (08:06 +0200)]
blk-mq: remove __blk_mq_run_hw_queue
__blk_mq_run_hw_queue just contains a WARN_ON_ONCE for calls from
interrupt context and a blk_mq_run_dispatch_ops-protected call to
blk_mq_sched_dispatch_requests. Open code the call to
blk_mq_sched_dispatch_requests in both callers, and move the WARN_ON_ONCE
to blk_mq_run_hw_queue where it can be extended to all !async calls,
while the other call is from workqueue context and thus obviously does
not need the assert.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:06:50 +0000 (08:06 +0200)]
blk-mq: move the !async handling out of __blk_mq_delay_run_hw_queue
Only blk_mq_run_hw_queue can call __blk_mq_delay_run_hw_queue with
async=false, so move the handling there.
With this __blk_mq_delay_run_hw_queue can be merged into
blk_mq_delay_run_hw_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:06:49 +0000 (08:06 +0200)]
blk-mq: move the blk_mq_hctx_stopped check in __blk_mq_delay_run_hw_queue
For the in-context dispatch, blk_mq_hctx_stopped is alredy checked in
blk_mq_sched_dispatch_requests under blk_mq_run_dispatch_ops() protection.
For the async dispatch case having a check before scheduling the work
still makes sense to avoid needless workqueue scheduling, so just keep it
for that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:06:48 +0000 (08:06 +0200)]
blk-mq: remove the blk_mq_hctx_stopped check in blk_mq_run_work_fn
blk_mq_hctx_stopped is already checked in blk_mq_sched_dispatch_requests
under blk_mq_run_dispatch_ops() protection, so remove the duplicate check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:06:47 +0000 (08:06 +0200)]
blk-mq: cleanup __blk_mq_sched_dispatch_requests
__blk_mq_sched_dispatch_requests currently has duplicated logic
for the cases where requests are on the hctx dispatch list or not.
Merge the two with a new need_dispatch variable and remove a few
pointless local variables.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:57 +0000 (08:40 +0200)]
blk-mq: pass a flags argument to blk_mq_add_to_requeue_list
Replace the boolean at_head argument with the same flags that are already
passed to blk_mq_insert_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-21-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:56 +0000 (08:40 +0200)]
blk-mq: pass a flags argument to elevator_type->insert_requests
Instead of passing a bool at_head, pass down the full flags from the
blk_mq_insert_request interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:55 +0000 (08:40 +0200)]
blk-mq: pass a flags argument to blk_mq_request_bypass_insert
Replace the boolean at_head argument with the same flags that are already
passed to blk_mq_insert_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:54 +0000 (08:40 +0200)]
blk-mq: pass a flags argument to blk_mq_insert_request
Replace the at_head bool with a flags argument that so far only contains
a single BLK_MQ_INSERT_AT_HEAD value. This makes it much easier to grep
for head insertions into the blk-mq dispatch queues.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:53 +0000 (08:40 +0200)]
blk-mq: don't kick the requeue_list in blk_mq_add_to_requeue_list
blk_mq_add_to_requeue_list takes a bool parameter to control how to kick
the requeue list at the end of the function. Move the call to
blk_mq_kick_requeue_list to the callers that want it instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:52 +0000 (08:40 +0200)]
blk-mq: don't run the hw_queue from blk_mq_request_bypass_insert
blk_mq_request_bypass_insert takes a bool parameter to control how to run
the queue at the end of the function. Move the blk_mq_run_hw_queue call
to the callers that want it instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:51 +0000 (08:40 +0200)]
blk-mq: don't run the hw_queue from blk_mq_insert_request
blk_mq_insert_request takes two bool parameters to control how to run
the queue at the end of the function. Move the blk_mq_run_hw_queue call
to the callers that want it instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:50 +0000 (08:40 +0200)]
blk-mq: fold __blk_mq_try_issue_directly into its two callers
Due to the wildly different behavior based on the bypass_insert argument,
not a whole lot of code in __blk_mq_try_issue_directly is actually shared
between blk_mq_try_issue_directly and blk_mq_request_issue_directly.
Remove __blk_mq_try_issue_directly and fold the code into the two callers
instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:49 +0000 (08:40 +0200)]
blk-mq: factor out a blk_mq_get_budget_and_tag helper
Factor out a helper from __blk_mq_try_issue_directly in preparation
of folding that function into its two callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:48 +0000 (08:40 +0200)]
blk-mq: refactor the DONTPREP/SOFTBARRIER andling in blk_mq_requeue_work
Split the RQF_DONTPREP and RQF_SOFTBARRIER in separate branches to make
the code more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:47 +0000 (08:40 +0200)]
blk-mq: refactor passthrough vs flush handling in blk_mq_insert_request
While both passthrough and flush requests call directly into
blk_mq_request_bypass_insert, the parameters aren't the same.
Split the handling into two separate conditionals and turn the whole
function into an if/elif/elif/else flow instead of the gotos.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:46 +0000 (08:40 +0200)]
blk-mq: remove blk_flush_queue_rq
Just call blk_mq_add_to_requeue_list directly from the two callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:45 +0000 (08:40 +0200)]
blk-mq: fold __blk_mq_insert_req_list into blk_mq_insert_request
Remove this very small helper and fold it into the only caller.
Note that this moves the trace_block_rq_insert out of ctx->lock, matching
the other calls to this tracepoint.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:44 +0000 (08:40 +0200)]
blk-mq: fold __blk_mq_insert_request into blk_mq_insert_request
There is no good point in keeping the __blk_mq_insert_request around
for two function calls and a singler caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:43 +0000 (08:40 +0200)]
blk-mq: move blk_mq_sched_insert_request to blk-mq.c
blk_mq_sched_insert_request is the main request insert helper and not
directly I/O scheduler related. Move blk_mq_sched_insert_request to
blk-mq.c, rename it to blk_mq_insert_request and mark it static.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:42 +0000 (08:40 +0200)]
blk-mq: fold blk_mq_sched_insert_requests into blk_mq_dispatch_plug_list
blk_mq_dispatch_plug_list is the only caller of
blk_mq_sched_insert_requests, and it makes sense to just fold it there
as blk_mq_sched_insert_requests isn't specific to I/O schedulers despite
the name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:41 +0000 (08:40 +0200)]
blk-mq: move more logic into blk_mq_insert_requests
Move all logic related to the direct insert (including the call to
blk_mq_run_hw_queue) into blk_mq_insert_requests to streamline the code
flow up a bit, and to allow marking blk_mq_try_issue_list_directly
static.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:40 +0000 (08:40 +0200)]
blk-mq: include <linux/blk-mq.h> in block/blk-mq.h
block/blk-mq.h needs various definitions from <linux/blk-mq.h>,
include it there instead of relying on the source files to include
both.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:39 +0000 (08:40 +0200)]
blk-mq: remove blk-mq-tag.h
blk-mq-tag.h is always included by blk-mq.h, and causes recursive
inclusion hell with further changes. Just merge it into blk-mq.h
instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Apr 2023 06:40:38 +0000 (08:40 +0200)]
blk-mq: don't plug for head insertions in blk_execute_rq_nowait
Plugs never insert at head, so don't plug for head insertions.
Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chengming Zhou [Thu, 13 Apr 2023 06:28:05 +0000 (14:28 +0800)]
blk-throttle: only enable blk-stat when BLK_DEV_THROTTLING_LOW
blk_throtl_register() will unconditionally enable blk-stat for gendisk
when register, even when we have no BLK_DEV_THROTTLING_LOW config.
Since the kernel always has only BLK_DEV_THROTTLING config and the
BLK_DEV_THROTTLING_LOW config is still in EXPERIMENTAL state, we can
just skip blk-stat when !BLK_DEV_THROTTLING_LOW.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230413062805.2081970-2-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chengming Zhou [Thu, 13 Apr 2023 06:28:04 +0000 (14:28 +0800)]
blk-stat: fix QUEUE_FLAG_STATS clear
We need to set QUEUE_FLAG_STATS for two cases:
1. blk_stat_enable_accounting()
2. blk_stat_add_callback()
So we should clear it only when ((q->stats->accounting == 0) &&
list_empty(&q->stats->callbacks)).
blk_stat_disable_accounting() only check if q->stats->accounting
is 0 before clear the flag, this patch fix it.
Also add list_empty(&q->stats->callbacks)) check when enable, or
the flag is already set.
The bug can be reproduced on kernel without BLK_DEV_THROTTLING
(since it unconditionally enable accounting, see the next patch).
# cat /sys/block/sr0/queue/scheduler
none mq-deadline [bfq]
# cat /sys/kernel/debug/block/sr0/state
SAME_COMP|IO_STAT|INIT_DONE|STATS|REGISTERED|NOWAIT|30
# echo none > /sys/block/sr0/queue/scheduler
# cat /sys/kernel/debug/block/sr0/state
SAME_COMP|IO_STAT|INIT_DONE|REGISTERED|NOWAIT
# cat /sys/block/sr0/queue/wbt_lat_usec
75000
We can see that after changing elevator from "bfq" to "none",
"STATS" flag is lost even though WBT callback still need it.
Fixes: 68497092bde9 ("block: make queue stat accounting a reference")
Cc: <stable@vger.kernel.org> # v5.17+
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230413062805.2081970-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Thu, 13 Apr 2023 00:06:49 +0000 (14:06 -1000)]
blk-iolatency: Make initialization lazy
Other rq_qos policies such as wbt and iocost are lazy-initialized when they
are configured for the first time for the device but iolatency is
initialized unconditionally from blkcg_init_disk() during gendisk init. Lazy
init is beneficial because rq_qos policies add runtime overhead when
initialized as every IO has to walk all registered rq_qos callbacks.
This patch switches iolatency to lazy initialization too so that it only
registered its rq_qos policy when it is first configured.
Note that there is a known race condition between blkcg config file writes
and del_gendisk() and this patch makes iolatency susceptible to it by
exposing the init path to race against the deletion path. However, that
problem already exists in iocost and is being worked on.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20230413000649.115785-5-tj@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Thu, 13 Apr 2023 00:06:48 +0000 (14:06 -1000)]
blk-iolatency: s/blkcg_rq_qos/iolat_rq_qos/
The name was too generic given that there are multiple blkcg rq-qos
policies.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20230413000649.115785-4-tj@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Thu, 13 Apr 2023 00:06:47 +0000 (14:06 -1000)]
blkcg: Restructure blkg_conf_prep() and friends
We want to support lazy init of rq-qos policies so that iolatency is enabled
lazily on configuration instead of gendisk initialization. The way blkg
config helpers are structured now is a bit awkward for that. Let's
restructure:
* blkcg_conf_open_bdev() is renamed to blkg_conf_open_bdev(). The blkcg_
prefix was used because the bdev opening step is blkg-independent.
However, the distinction is too subtle and confuses more than helps. Let's
switch to blkg prefix so that it's consistent with the type and other
helper names.
* struct blkg_conf_ctx now remembers the original input string and is always
initialized by the new blkg_conf_init().
* blkg_conf_open_bdev() is updated to take a pointer to blkg_conf_ctx like
blkg_conf_prep() and can be called multiple times safely. Instead of
modifying the double pointer to input string directly,
blkg_conf_open_bdev() now sets blkg_conf_ctx->body.
* blkg_conf_finish() is renamed to blkg_conf_exit() for symmetry and now
must be called on all blkg_conf_ctx's which were initialized with
blkg_conf_init().
Combined, this allows the users to either open the bdev first or do it
altogether with blkg_conf_prep() which will help implementing lazy init of
rq-qos policies.
blkg_conf_init/exit() will also be used implement synchronization against
device removal. This is necessary because iolat / iocost are configured
through cgroupfs instead of one of the files under /sys/block/DEVICE. As
cgroupfs operations aren't synchronized with block layer, the lazy init and
other configuration operations may race against device removal. This patch
makes blkg_conf_init/exit() used consistently for all cgroup-orginating
configurations making them a good place to implement explicit
synchronization.
Users are updated accordingly. No behavior change is intended by this patch.
v2: bfq wasn't updated in v1 causing a build error. Fixed.
v3: Update the description to include future use of blkg_conf_init/exit() as
synchronization points.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Yu Kuai <yukuai1@huaweicloud.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230413000649.115785-3-tj@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Thu, 13 Apr 2023 00:06:46 +0000 (14:06 -1000)]
blkcg: Drop unnecessary RCU read [un]locks from blkg_conf_prep/finish()
Now that all RCU flavors have been combined either holding a spin lock,
disabling irq or disabling preemption implies RCU read lock, so there's no
need to use rcu_read_[un]lock() explicitly while holding queue_lock. This
shouldn't cause any behavior changes.
v2: Description updated. Leave __acquires/release on queue_lock alone.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230413000649.115785-2-tj@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stefan Haberland [Wed, 5 Apr 2023 14:20:17 +0000 (16:20 +0200)]
s390/dasd: fix hanging blockdevice after request requeue
The DASD driver does not kick the requeue list when requeuing IO requests
to the blocklayer. This might lead to hanging blockdevice when there is
no other trigger for this.
Fix by automatically kick the requeue list when requeuing DASD requests
to the blocklayer.
Fixes: e443343e509a ("s390/dasd: blk-mq conversion")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Link: https://lore.kernel.org/r/20230405142017.2446986-8-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stefan Haberland [Wed, 5 Apr 2023 14:20:16 +0000 (16:20 +0200)]
s390/dasd: add autoquiesce event for start IO error
Add a check for errors in the start_io function that signal a not
working device. Trigger an autoquiesce event in that case.
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Link: https://lore.kernel.org/r/20230405142017.2446986-7-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stefan Haberland [Wed, 5 Apr 2023 14:20:15 +0000 (16:20 +0200)]
s390/dasd: add aq_timeouts autoquiesce trigger
Add a sysfs attribute aq_timeouts that controls after how many
timeouts a autoquiesce event might be triggered.
The default value is 32768 which is the maximum number of retries
for the DASD device driver DASD_RETRIES_MAX. This means that the
timeout trigger will never happen.
The default value for DASD retries is 255.
Setting the value to below 255 will trigger the timeout autoquiesce
event before an IO error is generated.
Also add the check for the configured amount of timeouts and trigger
an autoquiesce event if exceeded.
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Link: https://lore.kernel.org/r/20230405142017.2446986-6-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stefan Haberland [Wed, 5 Apr 2023 14:20:14 +0000 (16:20 +0200)]
s390/dasd: add aq_requeue sysfs attribute
Add a sysfs attribute to control if all IO requests will be requeued to
the blocklayer in case of an autoquiesce event or not.
A value of 1 means that in case of an autoquiesce event all IO requests
will be requeued to the blocklayer.
A value of 0 means that the device will only be stopped.
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Link: https://lore.kernel.org/r/20230405142017.2446986-5-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stefan Haberland [Wed, 5 Apr 2023 14:20:13 +0000 (16:20 +0200)]
s390/dasd: add aq_mask sysfs attribute
Add sysfs attribute that controls the DASD autoquiesce feature.
The autoquiesce is disabled when 0 is echoed to the attribute.
A value greater than 0 will enable the feature.
The aq_mask attribute will accept an unsigned integer and the value
will be interpreted as bitmask defining the trigger events that will
lead to an automatic quiesce.
The following autoquiesce triggers will currently be available:
DASD_EER_FATALERROR 1 - any final I/O error
DASD_EER_NOPATH 2 - no remaining paths for the device
DASD_EER_STATECHANGE 3 - a state change interrupt occurred
DASD_EER_PPRCSUSPEND 4 - the device is PPRC suspended
DASD_EER_NOSPC 5 - there is no space remaining on an ESE device
DASD_EER_TIMEOUT 6 - a certain amount of timeouts occurred
DASD_EER_STARTIO 7 - the IO start function encountered an error
The currently supported maximum value is 255.
Bit 31 is reserved for internal usage.
Bit 0 is not used.
Example:
- deactivate autoquiesce
$ echo 0 > /sys/bus/ccw/0.0.1234/aq_mask
- enable autoquiesce for FATALERROR, NOPATH and TIMEOUT
(0000 0000 0000 0000 0000 0000 0100 0110 => 70)
$ echo 70 > /sys/bus/ccw/0.0.1234/aq_mask
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Link: https://lore.kernel.org/r/20230405142017.2446986-4-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stefan Haberland [Wed, 5 Apr 2023 14:20:12 +0000 (16:20 +0200)]
s390/dasd: add autoquiesce feature
Add the internal logic to check for autoquiesce triggers and handle
them.
Quiesce and resume are functions that tell Linux to stop/resume
issuing I/Os to a specific DASD.
The DASD driver allows a manual quiesce/resume via ioctl.
Autoquiesce will define an amount of triggers that will lead to
an automatic quiesce if a certain event occurs.
There is no automatic resume.
All events will be reported via DASD Extended Error Reporting (EER)
if configured.
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Link: https://lore.kernel.org/r/20230405142017.2446986-3-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stefan Haberland [Wed, 5 Apr 2023 14:20:11 +0000 (16:20 +0200)]
s390/dasd: remove unused DASD EER defines
Remove definitions that have never been used.
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Link: https://lore.kernel.org/r/20230405142017.2446986-2-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chengming Zhou [Thu, 6 Apr 2023 14:50:50 +0000 (22:50 +0800)]
blk-cgroup: delete cpd_init_fn of blkcg_policy
blkcg_policy cpd_init_fn() is used to just initialize some default
fields of policy data, which is enough to do in cpd_alloc_fn().
This patch delete the only user bfq_cpd_init(), and remove cpd_init_fn
from blkcg_policy.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230406145050.49914-4-zhouchengming@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chengming Zhou [Thu, 6 Apr 2023 14:50:49 +0000 (22:50 +0800)]
blk-cgroup: delete cpd_bind_fn of blkcg_policy
cpd_bind_fn is just used for update default weight when block
subsys attached to a hierarchy. No any policy need it anymore.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230406145050.49914-3-zhouchengming@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chengming Zhou [Thu, 6 Apr 2023 14:50:48 +0000 (22:50 +0800)]
block, bfq: remove BFQ_WEIGHT_LEGACY_DFL
BFQ_WEIGHT_LEGACY_DFL is the same as CGROUP_WEIGHT_DFL, which means
we don't need cpd_bind_fn() callback to update default weight when
attached to a hierarchy.
This patch remove BFQ_WEIGHT_LEGACY_DFL and cpd_bind_fn().
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230406145050.49914-2-zhouchengming@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ondrej Kozina [Wed, 5 Apr 2023 11:12:23 +0000 (13:12 +0200)]
sed-opal: Add command to read locking range parameters.
It returns following attributes:
locking range start
locking range length
read lock enabled
write lock enabled
lock state (RW, RO or LK)
It can be retrieved by user authority provided the authority
was added to locking range via prior IOC_OPAL_ADD_USR_TO_LR
ioctl command. The command was extended to add user in ACE that
allows to read attributes listed above.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Link: https://lore.kernel.org/r/20230405111223.272816-6-okozina@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ondrej Kozina [Wed, 5 Apr 2023 11:12:22 +0000 (13:12 +0200)]
sed-opal: add helper to get multiple columns at once.
Refactors current code querying single column to use the
new helper. Real multi column usage will be added later.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230405111223.272816-5-okozina@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ondrej Kozina [Wed, 5 Apr 2023 11:12:21 +0000 (13:12 +0200)]
sed-opal: allow user authority to get locking range attributes.
Extend ACE set of locking range attributes accessible to user
authority. This patch allows user authority to get following
locking range attribues when user get added to locking range via
IOC_OPAL_ADD_USR_TO_LR:
locking range start
locking range end
read lock enabled
write lock enabled
read locked
write locked
lock on reset
active key
Note: Admin1 authority always remains in the ACE. Otherwise
it breaks current userspace expecting Admin1 in the ACE (sedutils).
See TCG OPAL2 s.4.3.1.7 "ACE_Locking_RangeNNNN_Get_RangeStartToActiveKey".
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230405111223.272816-4-okozina@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ondrej Kozina [Wed, 5 Apr 2023 11:12:20 +0000 (13:12 +0200)]
sed-opal: add helper for adding user authorities in ACE.
Move ACE construction away from add_user_to_lr routine
and refactor it to be used also in later code.
Also adds boolean operators defines from TCG Core
specification.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Link: https://lore.kernel.org/r/20230405111223.272816-3-okozina@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ondrej Kozina [Wed, 5 Apr 2023 11:12:19 +0000 (13:12 +0200)]
sed-opal: do not add same authority twice in boolean ace.
While adding user authority in boolean ace value
of uid OPAL_LOCKINGRANGE_ACE_WRLOCKED or
OPAL_LOCKINGRANGE_ACE_RDLOCKED, it was added twice.
It seemed redundant when only single authority was added
in the set method aka { authority1, authority1, OR }:
TCG Storage Architecture Core Specification, 5.1.3.3 ACE_expression
"This is an alternative type where the options are either a uidref to an
Authority object or one of the boolean_ACE (AND = 0 and OR = 1) options.
This type is used within the AC_element list to form a postfix Boolean
expression of Authorities."
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230405111223.272816-2-okozina@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 30 Mar 2023 11:36:24 +0000 (19:36 +0800)]
block: ublk_drv: cleanup 'struct ublk_map_data'
'struct ublk_map_data' is passed to ublk_copy_user_pages()
for copying data between userspace buffer and request pages.
Here what matters is userspace buffer address/len and 'struct request',
so replace ->io field with user buffer address, and rename max_bytes
as len.
Meantime remove 'ubq' field from ublk_map_data, since it isn't used
any more.
Then code becomes more readable.
Reviewed-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 30 Mar 2023 11:36:23 +0000 (19:36 +0800)]
block: ublk_drv: clean up several helpers
Convert the following pattern in several helpers
if (Z)
return true
return false
into:
return Z;
Reviewed-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 30 Mar 2023 11:36:22 +0000 (19:36 +0800)]
block: ublk_drv: add two helpers to clean up map/unmap request
Add two helpers for checking if map/unmap is needed, since we may have
passthrough request which needs map or unmap in future, such as for
supporting report zones.
Meantime don't mark ublk_copy_user_pages as inline since this function
is a bit fat now.
Reviewed-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 30 Mar 2023 11:36:21 +0000 (19:36 +0800)]
block: ublk_drv: don't consider flush request in map/unmap io
There isn't data in request of REQ_OP_FLUSH always, so don't consider
it in both ublk_map_io() and ublk_unmap_io().
Reviewed-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 30 Mar 2023 11:36:20 +0000 (19:36 +0800)]
block: ublk_drv: add common exit handling
Simplify exit handling a bit, and prepare for supporting fused command.
Reviewed-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Thu, 30 Mar 2023 10:27:44 +0000 (12:27 +0200)]
drbd: Pass a peer device to the resync and online verify functions
Originally-from: Andreas Grünbacher <agruen@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230330102744.2128122-2-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Thu, 30 Mar 2023 10:27:43 +0000 (12:27 +0200)]
drbd: pass drbd_peer_device to __req_mod
In preparation to support multiple connections, we need to know which
one we need to modify the request state for.
Originally-from: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230330102744.2128122-2-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Thu, 30 Mar 2023 10:27:42 +0000 (12:27 +0200)]
drbd: drbd_uuid_compare: pass a peer_device
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230330102744.2128122-2-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Andreas Gruenbacher [Thu, 30 Mar 2023 10:27:41 +0000 (12:27 +0200)]
drbd: INFO_bm_xfer_stats(): Pass a peer device argument
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230330102744.2128122-2-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Andreas Gruenbacher [Thu, 30 Mar 2023 10:27:40 +0000 (12:27 +0200)]
drbd: Add peer device parameter to whole-bitmap I/O handlers
Pass a peer device parameter through the bitmap I/O functions to the I/O
handlers. In after_state_ch(), set that parameter when queuing the
drbd_send_bitmap operation so that this operation knows where to send the
bitmap.
Signed-off-by: Andreas Gruenbacher <agruen@kernel.org>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230330102744.2128122-2-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Andreas Gruenbacher [Thu, 30 Mar 2023 10:27:39 +0000 (12:27 +0200)]
drbd: Rip out the ERR_IF_CNT_IS_NEGATIVE macro
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230330102744.2128122-2-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Thu, 30 Mar 2023 10:27:38 +0000 (12:27 +0200)]
genetlink: make _genl_cmd_to_str static
Primarily to silence warnings like:
warning: no previous prototype for 'xxx_genl_cmd_to_str' [-Wmissing-prototypes]
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230330102744.2128122-2-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Thu, 30 Mar 2023 18:49:26 +0000 (11:49 -0700)]
null_blk: use kmap_local_page() and kunmap_local()
Replace the deprecated API kmap_atomic() and kunmap_atomic() with
kmap_local_page() and kunmap_local() in null_flush_cache_page().
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230330184926.64209-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Thu, 30 Mar 2023 18:49:25 +0000 (11:49 -0700)]
null_blk: use non-deprecated lib functions
Use library helper memcpy_page() to copy source page into destination
instead of having duplicate code in copy_to_nullb() & copy_from_nullb().
In copy_from_nullb() also replace the memset call with zero_user().
Use library helper memset_page() to set the buffer to 0xFF instead of
having duplicate code.
This also removes deprecated API kmap_atomic() from copy_to_nullb()
copy_from_nullb() and nullb_fill_pattern(),
from :include/linux/highmem.h:
"kmap_atomic - Atomically map a page for temporary usage - Deprecated!"
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230330184926.64209-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Mon, 27 Mar 2023 07:34:27 +0000 (00:34 -0700)]
block: open code __blk_account_io_done()
There is only one caller for __blk_account_io_done(), the function
is small enough to fit in its caller blk_account_io_done().
Remove the function and opencode in the its caller
blk_account_io_done().
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230327073427.4403-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Mon, 27 Mar 2023 07:34:26 +0000 (00:34 -0700)]
block: open code __blk_account_io_start()
There is only one caller for __blk_account_io_start(), the function
is small enough to fit in its caller blk_account_io_start().
Remove the function and opencode in the its caller
blk_account_io_start().
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230327073427.4403-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Mon, 20 Mar 2023 19:49:26 +0000 (12:49 -0700)]
blk-mq: remove hybrid polling
io_uring provides the only way user space can poll completions, and that
always sets BLK_POLL_NOSLEEP. This effectively makes hybrid polling dead
code, so remove it and everything supporting it.
Hybrid polling was effectively killed off with
9650b453a3d4b1, "block:
ignore RWF_HIPRI hint for sync dio", but still potentially reachable
through io_uring until
d729cf9acb93119, "io_uring: don't sleep when
polling for I/O", but hybrid polling probably should not have been
reachable through that async interface from the beginning.
Fixes: 9650b453a3d4 ("block: ignore RWF_HIPRI hint for sync dio")
Fixes: d729cf9acb93 ("io_uring: don't sleep when polling for I/O")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20230320194926.3353144-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eric Biggers [Wed, 15 Mar 2023 18:39:07 +0000 (11:39 -0700)]
blk-crypto: drop the NULL check from blk_crypto_put_keyslot()
Now that all callers of blk_crypto_put_keyslot() check for NULL before
calling it, there is no need for blk_crypto_put_keyslot() to do the NULL
check itself.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eric Biggers [Wed, 15 Mar 2023 18:39:06 +0000 (11:39 -0700)]
blk-mq: return actual keyslot error in blk_insert_cloned_request()
To avoid hiding information, pass on the error code from
blk_crypto_rq_get_keyslot() instead of always using BLK_STS_IOERR.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eric Biggers [Wed, 15 Mar 2023 18:39:05 +0000 (11:39 -0700)]
blk-crypto: remove blk_crypto_insert_cloned_request()
blk_crypto_insert_cloned_request() is the same as
blk_crypto_rq_get_keyslot(), so just use that directly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eric Biggers [Wed, 15 Mar 2023 18:39:04 +0000 (11:39 -0700)]
blk-crypto: make blk_crypto_evict_key() more robust
If blk_crypto_evict_key() sees that the key is still in-use (due to a
bug) or that ->keyslot_evict failed, it currently just returns while
leaving the key linked into the keyslot management structures.
However, blk_crypto_evict_key() is only called in contexts such as inode
eviction where failure is not an option. So actually the caller
proceeds with freeing the blk_crypto_key regardless of the return value
of blk_crypto_evict_key().
These two assumptions don't match, and the result is that there can be a
use-after-free in blk_crypto_reprogram_all_keys() after one of these
errors occurs. (Note, these errors *shouldn't* happen; we're just
talking about what happens if they do anyway.)
Fix this by making blk_crypto_evict_key() unlink the key from the
keyslot management structures even on failure.
Also improve some comments.
Fixes: 1b2628397058 ("block: Keyslot Manager for Inline Encryption")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eric Biggers [Wed, 15 Mar 2023 18:39:03 +0000 (11:39 -0700)]
blk-crypto: make blk_crypto_evict_key() return void
blk_crypto_evict_key() is only called in contexts such as inode eviction
where failure is not an option. So there is nothing the caller can do
with errors except log them. (dm-table.c does "use" the error code, but
only to pass on to upper layers, so it doesn't really count.)
Just make blk_crypto_evict_key() return void and log errors itself.
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eric Biggers [Wed, 15 Mar 2023 18:39:02 +0000 (11:39 -0700)]
blk-mq: release crypto keyslot before reporting I/O complete
Once all I/O using a blk_crypto_key has completed, filesystems can call
blk_crypto_evict_key(). However, the block layer currently doesn't call
blk_crypto_put_keyslot() until the request is being freed, which happens
after upper layers have been told (via bio_endio()) the I/O has
completed. This causes a race condition where blk_crypto_evict_key()
can see 'slot_refs != 0' without there being an actual bug.
This makes __blk_crypto_evict_key() hit the
'WARN_ON_ONCE(atomic_read(&slot->slot_refs) != 0)' and return without
doing anything, eventually causing a use-after-free in
blk_crypto_reprogram_all_keys(). (This is a very rare bug and has only
been seen when per-file keys are being used with fscrypt.)
There are two options to fix this: either release the keyslot before
bio_endio() is called on the request's last bio, or make
__blk_crypto_evict_key() ignore slot_refs. Let's go with the first
solution, since it preserves the ability to report bugs (via
WARN_ON_ONCE) where a key is evicted while still in-use.
Fixes: a892c8d52c02 ("block: Inline encryption support for blk-mq")
Cc: stable@vger.kernel.org
Reviewed-by: Nathan Huckleberry <nhuck@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jakub Kicinski [Fri, 24 Feb 2023 02:13:01 +0000 (18:13 -0800)]
nbd: use the structured req attr check
Use the macro for checking presence of required attributes.
It has the advantage of reporting to the user which attr
was missing in a machine-readable format (extack).
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230224021301.1630703-2-kuba@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jakub Kicinski [Fri, 24 Feb 2023 02:13:00 +0000 (18:13 -0800)]
nbd: allow genl access outside init_net
NBD doesn't have much to do with networking, allow users outside
init_net to access the family.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230224021301.1630703-1-kuba@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 12 Mar 2023 23:36:44 +0000 (16:36 -0700)]
Linux 6.3-rc2
Hector Martin [Sat, 11 Mar 2023 14:19:14 +0000 (23:19 +0900)]
wifi: cfg80211: Partial revert "wifi: cfg80211: Fix use after free for wext"
This reverts part of commit
015b8cc5e7c4 ("wifi: cfg80211: Fix use after
free for wext")
This commit broke WPA offload by unconditionally clearing the crypto
modes for non-WEP connections. Drop that part of the patch.
Signed-off-by: Hector Martin <marcan@marcan.st>
Reported-by: Ilya <me@0upti.me>
Reported-and-tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Eric Curtin <ecurtin@redhat.com>
Fixes: 015b8cc5e7c4 ("wifi: cfg80211: Fix use after free for wext")
Cc: stable@kernel.org
Link: https://lore.kernel.org/linux-wireless/ZAx0TWRBlGfv7pNl@kroah.com/T/#m11e6e0915ab8fa19ce8bc9695ab288c0fe018edf
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 12 Mar 2023 23:15:36 +0000 (16:15 -0700)]
Merge tag 'tpm-v6.3-rc3' of git://git./linux/kernel/git/jarkko/linux-tpmdd
Pull tpm fixes from Jarkko Sakkinen:
"Two additional bug fixes for v6.3"
* tag 'tpm-v6.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
tpm: disable hwrng for fTPM on some AMD designs
tpm/eventlog: Don't abort tpm_read_log on faulty ACPI address
Mario Limonciello [Tue, 28 Feb 2023 02:44:39 +0000 (20:44 -0600)]
tpm: disable hwrng for fTPM on some AMD designs
AMD has issued an advisory indicating that having fTPM enabled in
BIOS can cause "stuttering" in the OS. This issue has been fixed
in newer versions of the fTPM firmware, but it's up to system
designers to decide whether to distribute it.
This issue has existed for a while, but is more prevalent starting
with kernel 6.1 because commit
b006c439d58db ("hwrng: core - start
hwrng kthread also for untrusted sources") started to use the fTPM
for hwrng by default. However, all uses of /dev/hwrng result in
unacceptable stuttering.
So, simply disable registration of the defective hwrng when detecting
these faulty fTPM versions. As this is caused by faulty firmware, it
is plausible that such a problem could also be reproduced by other TPM
interactions, but this hasn't been shown by any user's testing or reports.
It is hypothesized to be triggered more frequently by the use of the RNG
because userspace software will fetch random numbers regularly.
Intentionally continue to register other TPM functionality so that users
that rely upon PCR measurements or any storage of data will still have
access to it. If it's found later that another TPM functionality is
exacerbating this problem a module parameter it can be turned off entirely
and a module parameter can be introduced to allow users who rely upon
fTPM functionality to turn it on even though this problem is present.
Link: https://www.amd.com/en/support/kb/faq/pa-410
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216989
Link: https://lore.kernel.org/all/20230209153120.261904-1-Jason@zx2c4.com/
Fixes: b006c439d58d ("hwrng: core - start hwrng kthread also for untrusted sources")
Cc: stable@vger.kernel.org
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Tested-by: reach622@mailcuk.com
Tested-by: Bell <1138267643@qq.com>
Co-developed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Morten Linderud [Wed, 15 Feb 2023 09:25:52 +0000 (10:25 +0100)]
tpm/eventlog: Don't abort tpm_read_log on faulty ACPI address
tpm_read_log_acpi() should return -ENODEV when no eventlog from the ACPI
table is found. If the firmware vendor includes an invalid log address
we are unable to map from the ACPI memory and tpm_read_log() returns -EIO
which would abort discovery of the eventlog.
Change the return value from -EIO to -ENODEV when acpi_os_map_iomem()
fails to map the event log.
The following hardware was used to test this issue:
Framework Laptop (Pre-production)
BIOS: INSYDE Corp, Revision: 3.2
TPM Device: NTC, Firmware Revision: 7.2
Dump of the faulty ACPI TPM2 table:
[000h 0000 4] Signature : "TPM2" [Trusted Platform Module hardware interface Table]
[004h 0004 4] Table Length :
0000004C
[008h 0008 1] Revision : 04
[009h 0009 1] Checksum : 2B
[00Ah 0010 6] Oem ID : "INSYDE"
[010h 0016 8] Oem Table ID : "TGL-ULT"
[018h 0024 4] Oem Revision :
00000002
[01Ch 0028 4] Asl Compiler ID : "ACPI"
[020h 0032 4] Asl Compiler Revision :
00040000
[024h 0036 2] Platform Class : 0000
[026h 0038 2] Reserved : 0000
[028h 0040 8] Control Address :
0000000000000000
[030h 0048 4] Start Method : 06 [Memory Mapped I/O]
[034h 0052 12] Method Parameters : 00 00 00 00 00 00 00 00 00 00 00 00
[040h 0064 4] Minimum Log Length :
00010000
[044h 0068 8] Log Address :
000000004053D000
Fixes: 0cf577a03f21 ("tpm: Fix handling of missing event log")
Tested-by: Erkki Eilonen <erkki@bearmetal.eu>
Signed-off-by: Morten Linderud <morten@linderud.pw>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Linus Torvalds [Sun, 12 Mar 2023 16:47:08 +0000 (09:47 -0700)]
Merge tag 'xfs-6.3-fixes-1' of git://git./fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- Fix a crash if mount time quotacheck fails when there are inodes
queued for garbage collection.
- Fix an off by one error when discarding folios after writeback
failure.
* tag 'xfs-6.3-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix off-by-one-block in xfs_discard_folio()
xfs: quotacheck failure can race with background inode inactivation
Linus Torvalds [Sun, 12 Mar 2023 16:17:30 +0000 (09:17 -0700)]
Merge tag 'staging-6.3-rc2' of git://git./linux/kernel/git/gregkh/staging
Pull staging driver fixes and removal from Greg KH:
"Here are four small staging driver fixes, and one big staging driver
deletion for 6.3-rc2.
The fixes are:
- rtl8192e driver fixes for where the driver was attempting to
execute various programs directly from the disk for unknown reasons
- rtl8723bs driver fixes for issues found by Hans in testing
The deleted driver is the removal of the r8188eu wireless driver as
now in 6.3-rc1 we have a "real" wifi driver for one that includes
support for many many more devices than this old driver did. So it's
time to remove it as it is no longer needed. The maintainers of this
driver all have acked its removal. Many thanks to them over the years
for working to clean it up and keep it working while the real driver
was being developed.
All of these have been in linux-next this week with no reported
problems"
* tag 'staging-6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: r8188eu: delete driver
staging: rtl8723bs: Pass correct parameters to cfg80211_get_bss()
staging: rtl8723bs: Fix key-store index handling
staging: rtl8192e: Remove call_usermodehelper starting RadioPower.sh
staging: rtl8192e: Remove function ..dm_check_ac_dc_power calling a script
Linus Torvalds [Sun, 12 Mar 2023 16:12:03 +0000 (09:12 -0700)]
Merge tag 'x86_urgent_for_v6.3_rc2' of git://git./linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
"A single erratum fix for AMD machines:
- Disable XSAVES on AMD Zen1 and Zen2 machines due to an erratum. No
impact to anything as those machines will fallback to XSAVEC which
is equivalent there"
* tag 'x86_urgent_for_v6.3_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Disable XSAVES on AMD family 0x17
Linus Torvalds [Sun, 12 Mar 2023 16:04:28 +0000 (09:04 -0700)]
Merge tag 'kernel.fork.v6.3-rc2' of gitolite.pub/scm/linux/kernel/git/brauner/linux
Pull clone3 fix from Christian Brauner:
"A simple fix for the clone3() system call.
The CLONE_NEWTIME allows the creation of time namespaces. The flag
reuses a bit from the CSIGNAL bits that are used in the legacy clone()
system call to set the signal that gets sent to the parent after the
child exits.
The clone3() system call doesn't rely on CSIGNAL anymore as it uses a
dedicated .exit_signal field in struct clone_args. So we blocked all
CSIGNAL bits in clone3_args_valid(). When CLONE_NEWTIME was introduced
and reused a CSIGNAL bit we forgot to adapt clone3_args_valid()
causing CLONE_NEWTIME with clone3() to be rejected. Fix this"
* tag 'kernel.fork.v6.3-rc2' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
selftests/clone3: test clone3 with CLONE_NEWTIME
fork: allow CLONE_NEWTIME in clone3 flags
Linus Torvalds [Sun, 12 Mar 2023 16:00:54 +0000 (09:00 -0700)]
Merge tag 'vfs.misc.v6.3-rc2' of git://git./linux/kernel/git/vfs/idmapping
Pull vfs fixes from Christian Brauner:
- When allocating pages for a watch queue failed, we didn't return an
error causing userspace to proceed even though all subsequent
notifcations would be lost. Make sure to return an error.
- Fix a misformed tree entry for the idmapping maintainers entry.
- When setting file leases from an idmapped mount via
generic_setlease() we need to take the idmapping into account
otherwise taking a lease would fail from an idmapped mount.
- Remove two redundant assignments, one in splice code and the other in
locks code, that static checkers complained about.
* tag 'vfs.misc.v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
filelocks: use mount idmapping for setlease permission check
fs/locks: Remove redundant assignment to cmd
splice: Remove redundant assignment to ret
MAINTAINERS: repair a malformed T: entry in IDMAPPED MOUNTS
watch_queue: fix IOC_WATCH_QUEUE_SET_SIZE alloc error paths
Linus Torvalds [Sun, 12 Mar 2023 15:55:55 +0000 (08:55 -0700)]
Merge tag 'ext4_for_linus_stable' of git://git./linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Bug fixes and regressions for ext4, the most serious of which is a
potential deadlock during directory renames that was introduced during
the merge window discovered by a combination of syzbot and lockdep"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: zero i_disksize when initializing the bootloader inode
ext4: make sure fs error flag setted before clear journal error
ext4: commit super block if fs record error when journal record without error
ext4, jbd2: add an optimized bmap for the journal inode
ext4: fix WARNING in ext4_update_inline_data
ext4: move where set the MAY_INLINE_DATA flag is set
ext4: Fix deadlock during directory rename
ext4: Fix comment about the 64BIT feature
docs: ext4: modify the group desc size to 64
ext4: fix another off-by-one fsmap error on 1k block filesystems
ext4: fix RENAME_WHITEOUT handling for inline directories
ext4: make kobj_type structures constant
ext4: fix cgroup writeback accounting with fs-layer encryption
Linus Torvalds [Sun, 12 Mar 2023 15:52:03 +0000 (08:52 -0700)]
cpumask: relax sanity checking constraints
The cpumask_check() was unnecessarily tight, and causes problems for the
users of cpumask_next().
We have a number of users that take the previous return value of one of
the bit scanning functions and subtract one to keep it in "range". But
since the scanning functions end up returning up to 'small_cpumask_bits'
instead of the tighter 'nr_cpumask_bits', the range really needs to be
using that widened form.
[ This "previous-1" behavior is also the reason we have all those
comments about /* -1 is a legal arg here. */ and separate checks for
that being ok. So we could have just made "small_cpumask_bits-1"
be a similar special "don't check this" value.
Tetsuo Handa even suggested a patch that only does that for
cpumask_next(), since that seems to be the only actual case that
triggers, but that all makes it even _more_ magical and special. So
just relax the check ]
One example of this kind of pattern being the 'c_start()' function in
arch/x86/kernel/cpu/proc.c, but also duplicated in various forms on
other architectures.
Reported-by: syzbot+96cae094d90877641f32@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=96cae094d90877641f32
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Link: https://lore.kernel.org/lkml/c1f4cc16-feea-b83c-82cf-1a1f007b7eb9@I-love.SAKURA.ne.jp/
Fixes: 596ff4a09b89 ("cpumask: re-introduce constant-sized cpumask optimizations")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>