Ye Bin [Mon, 29 Nov 2021 01:26:59 +0000 (09:26 +0800)]
 
block: Fix fsync always failed if once failed
We do test with inject error fault base on v4.19, after test some time we found
sync /dev/sda always failed.
[root@localhost] sync /dev/sda
sync: error syncing '/dev/sda': Input/output error
scsi log as follows:
[19069.812296] sd 0:0:0:0: [sda] tag#64 Send: scmd 0x00000000d03a0b6b
[19069.812302] sd 0:0:0:0: [sda] tag#64 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[19069.812533] sd 0:0:0:0: [sda] tag#64 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[19069.812536] sd 0:0:0:0: [sda] tag#64 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[19069.812539] sd 0:0:0:0: [sda] tag#64 scsi host busy 1 failed 0
[19069.812542] sd 0:0:0:0: Notifying upper driver of completion (result 0)
[19069.812546] sd 0:0:0:0: [sda] tag#64 sd_done: completed 0 of 0 bytes
[19069.812549] sd 0:0:0:0: [sda] tag#64 0 sectors total, 0 bytes done.
[19069.812564] print_req_error: I/O error, dev sda, sector 0
ftrace log as follows:
 rep-306069 [007] .... 19654.923315: block_bio_queue: 8,0 FWS 0 + 0 [rep]
 rep-306069 [007] .... 19654.923333: block_getrq: 8,0 FWS 0 + 0 [rep]
 kworker/7:1H-250   [007] .... 19654.923352: block_rq_issue: 8,0 FF 0 () 0 + 0 [kworker/7:1H]
 <idle>-0     [007] ..s. 19654.923562: block_rq_complete: 8,0 FF () 
18446744073709551615 + 0 [0]
 <idle>-0     [007] d.s. 19654.923576: block_rq_complete: 8,0 WS () 0 + 0 [-5]
As 
8d6996630c03 introduce 'fq->rq_status', this data only update when 'flush_rq'
reference count isn't zero. If flush request once failed and record error code
in 'fq->rq_status'. If there is no chance to update 'fq->rq_status',then do fsync
will always failed.
To address this issue reset 'fq->rq_status' after return error code to upper layer.
Fixes: 8d6996630c03("block: fix null pointer dereference in blk_mq_rq_timed_out()")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211129012659.1553733-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 12:18:02 +0000 (13:18 +0100)]
 
scsi: remove the gendisk argument to scsi_ioctl
Now that blk_execute_rq does not take a gendisk argument there is no need
to pass it through the scsi_ioctl callchain either.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 12:18:01 +0000 (13:18 +0100)]
 
block: remove the gendisk argument to blk_execute_rq
Remove the gendisk aregument to blk_execute_rq and blk_execute_rq_nowait
given that it is unused now.  Also convert the boolean at_head parameter
to actually use the bool type while touching the prototype.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 12:18:00 +0000 (13:18 +0100)]
 
block: remove the ->rq_disk field in struct request
Just use the disk attached to the request_queue instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 12:17:59 +0000 (13:17 +0100)]
 
block: don't check ->rq_disk in merges
There is a 1:1 relationship between request_queues and gendisks now, so
no need for these extra checks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 12:17:58 +0000 (13:17 +0100)]
 
mtd_blkdevs: remove the sector out of range check in do_blktrans_request
The block layer already performs this check, no need to duplicate it in
the driver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Colin Ian King [Fri, 26 Nov 2021 23:06:52 +0000 (23:06 +0000)]
 
block: Remove redundant initialization of variable ret
The variable ret is being initialized with a value that is never
read, it is being updated later on. The assignment is redundant and
can be removed.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20211126230652.1175636-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:17 +0000 (12:58 +0100)]
 
block: simplify ioc_lookup_icq
Remove the ioc argument as it always points to current->io_context.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:16 +0000 (12:58 +0100)]
 
block: simplify ioc_create_icq
Remove the ioc and gfp_mask argument, which are hard coded by the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:15 +0000 (12:58 +0100)]
 
block: return the io_context from create_task_io_context
Grab a reference to the newly allocated or existing io_context in
create_task_io_context and return it.  This simplifies the callers and
removes the need for double lookups.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:14 +0000 (12:58 +0100)]
 
block: use alloc_io_context in __copy_io
In __copy_io we know that the newly allocate task_struct does not have
an I/O context yet and is not exiting.  So just allocate the I/O context
struct and install it directly.  There is no need to lock the task
either as it is just being created.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:13 +0000 (12:58 +0100)]
 
block: factor out a alloc_io_context helper
Factor out a helper that just allocate an I/O context.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:12 +0000 (12:58 +0100)]
 
block: remove get_io_context_active
Fold it into it's only caller, and remove a lof of the debug checks
that are not needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:11 +0000 (12:58 +0100)]
 
block: move the remaining elv.icq handling to the I/O scheduler
After the prepare side has been moved to the only I/O scheduler that
cares, do the same for the cleanup and the NULL initialization.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:10 +0000 (12:58 +0100)]
 
block: move blk_mq_sched_assign_ioc to blk-ioc.c
Move blk_mq_sched_assign_ioc so that many interfaces from the file can
be marked static.  Rename the function to ioc_find_get_icq as well and
return the icq to simplify the interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:09 +0000 (12:58 +0100)]
 
block: mark put_io_context_active static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:08 +0000 (12:58 +0100)]
 
Revert "block: Provide blk_mq_sched_get_icq()"
This reverts commit 
4896c4e64ba5d5d5acdbcf68c5910dd4f6d8fa62.
The helper is not needed any more.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:07 +0000 (12:58 +0100)]
 
bfq: use bfq_bic_lookup in bfq_limit_depth
No need to create a new I/O context if there is none present yet in
->limit_depth.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:06 +0000 (12:58 +0100)]
 
bfq: simplify bfq_bic_lookup
Remove the unused bfqd argument, and hardcode ioc to current->io_context.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:05 +0000 (12:58 +0100)]
 
fork: move copy_io to block/blk-ioc.c
Move the copying of the I/O context to the block layer as that is where
we can use the proper low-level interfaces.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:04 +0000 (12:58 +0100)]
 
RDMA/qib: rename copy_io to qib_copy_io
Add the proper module prefix to avoid conflicts with a function
in the scheduler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 26 Nov 2021 16:19:43 +0000 (00:19 +0800)]
 
blk-mq: use bio->bi_opf after bio is checked
bio->bi_opf isn't finalized before checking the bio, so use it after
submit_bio_checks() returns.
Fixes: 5b13bc8a3fd5 ("blk-mq: cleanup request allocation")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 25 Nov 2021 13:36:41 +0000 (14:36 +0100)]
 
bfq: Do not let waker requests skip proper accounting
Commit 
7cc4ffc55564 ("block, bfq: put reqs of waker and woken in
dispatch list") added a condition to bfq_insert_request() which added
waker's requests directly to dispatch list. The rationale was that
completing waker's IO is needed to get more IO for the current queue.
Although this rationale is valid, there is a hole in it. The waker does
not necessarily serve the IO only for the current queue and maybe it's
current IO is not needed for current queue to make progress. Furthermore
injecting IO like this completely bypasses any service accounting within
bfq and thus we do not properly track how much service is waker's queue
getting or that the waker is actually doing any IO. Depending on the
conditions this can result in the waker getting too much or too few
service.
Consider for example the following job file:
[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync
[slowwriter]
numjobs=1
prioclass=2
prio=7
fsync=200
[fastwriter]
numjobs=1
prioclass=2
prio=0
fsync=200
Despite processes have very different IO priorities, they get the same
about of service. The reason is that bfq identifies these processes as
having waker-wakee relationship and once that happens, IO from
fastwriter gets injected during slowwriter's time slice. As a result bfq
is not aware that fastwriter has any IO to do and constantly schedules
only slowwriter's queue. Thus fastwriter is forced to compete with
slowwriter's IO all the time instead of getting its share of time based
on IO priority.
Drop the special injection condition from bfq_insert_request(). As a
result, requests will be tracked and queued in a normal way and on next
dispatch bfq_select_queue() can decide whether the waker's inserted
requests should be injected during the current queue's timeslice or not.
Fixes: 7cc4ffc55564 ("block, bfq: put reqs of waker and woken in dispatch list")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 25 Nov 2021 13:36:40 +0000 (14:36 +0100)]
 
bfq: Log waker detections
Waker - wakee relationships are important in deciding whether one queue
can preempt the other one. Print information about detected waker-wakee
relationships so that scheduling decisions can be better understood from
block traces.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-7-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 25 Nov 2021 13:36:39 +0000 (14:36 +0100)]
 
bfq: Provide helper to generate bfqq name
Instead of having helper formating bfqq pid, provide a helper to
generate full bfqq name as used in the traces. It saves some code
duplication and will save more in the coming tracepoints.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-6-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 25 Nov 2021 13:36:38 +0000 (14:36 +0100)]
 
bfq: Limit waker detection in time
Currently, when process A starts issuing requests shortly after process
B has completed some IO three times in a row, we decide that B is a
"waker" of A meaning that completing IO of B is needed for A to make
progress and generally stop separating A's and B's IO much. This logic
is useful to avoid unnecessary idling and thus throughput loss for cases
where workload needs to switch e.g. between the process and the
journaling thread doing IO. However the detection heuristic tends to
frequently give false positives when A and B are fighting IO bandwidth
and other processes aren't doing much IO as we are basically deemed to
eventually accumulate three occurences of a situation where one process
starts issuing requests after the other has completed some IO. To reduce
these false positives, cancel the waker detection also if we didn't
accumulate three detected wakeups within given timeout. The rationale is
that if wakeups are really rare, the pointless idling doesn't hurt
throughput that much anyway.
This significantly reduces false waker detection for workload like:
[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync
[slowwriter]
numjobs=1
fsync=200
[fastwriter]
numjobs=1
fsync=200
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-5-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 25 Nov 2021 13:36:37 +0000 (14:36 +0100)]
 
bfq: Limit number of requests consumed by each cgroup
When cgroup IO scheduling is used with BFQ it does not really provide
service differentiation if the cgroup drives a big IO depth. That for
example happens with writeback which asynchronously submits lots of IO
but it can happen with AIO as well. The problem is that if we have two
cgroups that submit IO with different weights, the cgroup with higher
weight properly gets more IO time and is able to dispatch more IO.
However this causes lower weight cgroup to accumulate more requests
inside BFQ and eventually lower weight cgroup consumes most of IO
scheduler tags. At that point higher weight cgroup stops getting better
service as it is mostly blocked waiting for a scheduler tag while its
queues inside BFQ are empty and thus lower weight cgroup gets served.
Check how many requests submitting cgroup has allocated in
bfq_limit_depth() and if it consumes more requests than what would
correspond to its weight limit available depth to 1 so that the cgroup
cannot consume many more requests. With this limitation the higher
weight cgroup gets proper service even with writeback.
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 25 Nov 2021 13:36:36 +0000 (14:36 +0100)]
 
bfq: Store full bitmap depth in bfq_data
Store bitmap depth shift inside bfq_data so that we can use it in
bfq_limit_depth() for proportioning when limiting number of available
request tags for a cgroup.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 25 Nov 2021 13:36:35 +0000 (14:36 +0100)]
 
bfq: Track number of allocated requests in bfq_entity
When we want to limit number of requests used by each bfqq and also
cgroup, we need to track also number of requests used by each cgroup.
So track number of allocated requests for each bfq_entity.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 25 Nov 2021 13:36:34 +0000 (14:36 +0100)]
 
block: Provide blk_mq_sched_get_icq()
Currently we lookup ICQ only after the request is allocated. However BFQ
will want to decide how many scheduler tags it allows a given bfq queue
(effectively a process) to consume based on cgroup weight. So provide a
function blk_mq_sched_get_icq() so that BFQ can lookup ICQ earlier.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sebastian Andrzej Siewior [Mon, 25 Oct 2021 07:06:58 +0000 (09:06 +0200)]
 
mmc: core: Use blk_mq_complete_request_direct().
The completion callback for the sdhci-pci device is invoked from a
kworker.
I couldn't identify in which context is mmc_blk_mq_req_done() invoke but
the remaining caller are from invoked from preemptible context. Here it
would make sense to complete the request directly instead scheduling
ksoftirqd for its completion.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Link: https://lore.kernel.org/r/20211025070658.1565848-3-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sebastian Andrzej Siewior [Mon, 25 Oct 2021 07:06:57 +0000 (09:06 +0200)]
 
blk-mq: Add blk_mq_complete_request_direct()
Add blk_mq_complete_request_direct() which completes the block request
directly instead deferring it to softirq for single queue devices.
This is useful for devices which complete the requests in preemptible
context and raising softirq from means scheduling ksoftirqd.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211025070658.1565848-2-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eric Biggers [Wed, 24 Nov 2021 01:37:33 +0000 (17:37 -0800)]
 
blk-crypto: remove blk_crypto_unregister()
This function is trivial and is only used in one place.  Having this
function is misleading because it implies that blk_crypto_register()
needs to be paired with blk_crypto_unregister(), which is not the case.
Just set disk->queue->crypto_profile to NULL directly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211124013733.347612-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 24 Nov 2021 06:28:56 +0000 (07:28 +0100)]
 
blk-mq: cleanup request allocation
Refactor the request alloction so that blk_mq_get_cached_request tries
to find a cached request first, and the entirely separate and now
self contained blk_mq_get_new_requests allocates one or more requests
if that is not possible.
There is a small change in behavior as submit_bio_checks is called
twice now if a cached request is present but can't be used, but that
is a small price to pay for unwinding this code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211124062856.1444266-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 23 Nov 2021 18:53:12 +0000 (19:53 +0100)]
 
block: don't include <linux/part_stat.h> in blk.h
Not needed, shift it into the source files that need it instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 23 Nov 2021 18:53:11 +0000 (19:53 +0100)]
 
block: don't include <linux/idr.h> in blk.h
Not needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 23 Nov 2021 18:53:10 +0000 (19:53 +0100)]
 
block: don't include <linux/blk-mq.h> in blk.h
Not needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 23 Nov 2021 18:53:09 +0000 (19:53 +0100)]
 
block: don't include blk-mq.h in blk.h
No needed, shift a blk-stat.h include into the source file that needs it
instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 23 Nov 2021 18:53:08 +0000 (19:53 +0100)]
 
block: don't include blk-mq-sched.h in blk.h
No needed, shift it into the source files that need it instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 23 Nov 2021 18:53:07 +0000 (19:53 +0100)]
 
block: remove the e argument to elevator_exit
All callers pass q->elevator.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 23 Nov 2021 18:53:06 +0000 (19:53 +0100)]
 
block: remove elevator_exit
Open code elevator_exit in it's only caller, and rename __elevator_exit to
elevator_exit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 23 Nov 2021 18:53:05 +0000 (19:53 +0100)]
 
block: move blk_get_flush_queue to blk-flush.c
blk_get_flush_queue is only used in blk-flush.c, so move it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guo Zhengkui [Tue, 23 Nov 2021 06:33:40 +0000 (14:33 +0800)]
 
blk_mq: remove repeated includes
Remove a repeated "#include<linux/sched/sysctl.h>".
Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211123063340.25882-1-guozhengkui@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 13 Nov 2021 18:18:32 +0000 (11:18 -0700)]
 
block: move io_context creation into where it's needed
The only user of the io_context for IO is BFQ, yet we put the checking
and logic of it into the normal IO path.
Put the creation into blk_mq_sched_assign_ioc(), and have BFQ use that
helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 13 Nov 2021 21:03:26 +0000 (14:03 -0700)]
 
block: only allocate poll_stats if there's a user of them
This is essentially never used, yet it's about 1/3rd of the total
queue size. Allocate it when needed, and don't embed it in the queue.
Kill the queue flag for this while at it, since we can just check the
assigned pointer now.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 13 Nov 2021 20:37:38 +0000 (13:37 -0700)]
 
blk-ioprio: don't set bio priority if not needed
We don't need to write to the bio if:
1) No ioprio value has ever been assigned to the blkcg
2) We wouldn't anyway, depending on bio and blkcg IO priority
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 23 Nov 2021 16:04:42 +0000 (17:04 +0100)]
 
blk-mq: move more plug handling from blk_mq_submit_bio into blk_add_rq_to_plug
Keep all the functionality for adding a request to a plug in a single place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123160443.1315598-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 23 Nov 2021 16:04:41 +0000 (17:04 +0100)]
 
blk-mq: simplify the plug handling in blk_mq_submit_bio
blk_mq_submit_bio has two different plug cases, one that uses full
plugging and a limited plugging one.
The limited plugging case is only used for a corner case that does
not matter in real life:
 - no ->commit_rqs (so not NVMe)
 - no shared tags (so not SCSI)
 - not rotational (so no old disk or floppy driver)
 - must have multiple queues (so no eMMC)
Remove the limited merging case and all the related junk to simplify
blk_mq_submit_bio and the functions called from it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123160443.1315598-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 22 Nov 2021 13:06:25 +0000 (14:06 +0100)]
 
sr: set GENHD_FL_REMOVABLE earlier
Set up GENHD_FL_REMOVABLE together with the rest of the gendisk fields.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 22 Nov 2021 13:06:24 +0000 (14:06 +0100)]
 
block: cleanup the GENHD_FL_* definitions
Switch to an enum and tidy up the documentation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 22 Nov 2021 13:06:23 +0000 (14:06 +0100)]
 
block: don't set GENHD_FL_NO_PART for hidden gendisks
Hidden gendisks can't be opened using blkdev_get_*, so we can't really
reach any of the partition scanning paths or partitioning ioctls except
for the initial partition scan from add_disk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 22 Nov 2021 13:06:22 +0000 (14:06 +0100)]
 
block: remove GENHD_FL_EXT_DEVT
All modern drivers can support extra partitions using the extended
dev_t.  In fact except for the ioctl method drivers never even see
partitions in normal operation.
So remove the GENHD_FL_EXT_DEVT and allow extra partitions for all
block devices that do support partitions, and require those that
do not support partitions to explicit disallow them using
GENHD_FL_NO_PART.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 22 Nov 2021 13:06:21 +0000 (14:06 +0100)]
 
block: remove GENHD_FL_SUPPRESS_PARTITION_INFO
This flag is not set directly anywhere and only inherited from
GENHD_FL_HIDDEN.  Just check for GENHD_FL_HIDDEN instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 22 Nov 2021 13:06:20 +0000 (14:06 +0100)]
 
mmc: don't set GENHD_FL_SUPPRESS_PARTITION_INFO
This manually reverts 
07b652cdbec3 ("mmc: card: Don't show eMMC RPMB and
BOOT areas in /proc/partitions").  Based on the commit description that
change was purely cosmetic.  mmc is the last driver that sets this
flag and thus prevents it from being removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20211122130625.1136848-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 22 Nov 2021 13:06:19 +0000 (14:06 +0100)]
 
null_blk: don't suppress partitioning information
This manually reverts commit 
27290b469051 ("null_blk: suppress invalid
partition info").  The message in that commit log can't appearch as
the flag is never checked during probing, and there is no good reason
to treat null_blk special in /proc/partitions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 29 Nov 2021 13:38:04 +0000 (06:38 -0700)]
 
block: remove the GENHD_FL_HIDDEN check in blkdev_get_no_open
Hidden gendisks never hash the block device inode, so this can't happen.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 22 Nov 2021 13:06:17 +0000 (14:06 +0100)]
 
block: rename GENHD_FL_NO_PART_SCAN to GENHD_FL_NO_PART
The GENHD_FL_NO_PART_SCAN controls more than just partitions canning,
so rename it to GENHD_FL_NO_PART.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20211122130625.1136848-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 22 Nov 2021 13:06:16 +0000 (14:06 +0100)]
 
block: merge disk_scan_partitions and blkdev_reread_part
Unify the functionality that implements a partition rescan for a
gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 22 Nov 2021 13:06:15 +0000 (14:06 +0100)]
 
block: remove a dead check in show_partition
disk_max_parts never returns 0 given that ->minors for devices not using
the extended dev_t must be non-zero, and disk_max_parts always returns
DISK_MAX_PARTS for the latter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 22 Nov 2021 13:06:14 +0000 (14:06 +0100)]
 
block: remove GENHD_FL_CD
GENHD_FL_CD marks a gendisk as a vaguely CD-ROM like device.
Besides being used internally inside of sunvdc.c an xen-blkfront it
is used by xen-blkback as a hint to claim a device exported to a
guest is a CD-ROM like device.  Just check for disk->cdi instead
which is the right indicator for "real" CD-ROM or DVD drivers.  This
will miss the paravirtualized guest drivers, but those make little
sense to report anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 22 Nov 2021 13:06:13 +0000 (14:06 +0100)]
 
block: move GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE to disk->event_flags
GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE is all about the event reporting
mechanism, so move it to the event_flags field.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 22 Nov 2021 13:06:12 +0000 (14:06 +0100)]
 
block: move GENHD_FL_NATIVE_CAPACITY to disk->state
The flag to indicate an unlocked native capacity is dynamic state,
not a driver capability flag, so move it to disk->state.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 17 Nov 2021 06:14:04 +0000 (07:14 +0100)]
 
block: don't include blk-mq headers in blk-core.c
All request based code is in the blk-mq files now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 17 Nov 2021 06:14:03 +0000 (07:14 +0100)]
 
block: move blk_print_req_error to blk-mq.c
This function is only used by the request completion path.  Factor out
a blk_status_to_str to keep blk_errors private in blk-core.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 17 Nov 2021 06:14:02 +0000 (07:14 +0100)]
 
block: move blk_dump_rq_flags to blk-mq.c
blk_dump_rq_flags deals with a request, so move it to blk-mq.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 17 Nov 2021 06:14:01 +0000 (07:14 +0100)]
 
block: move blk_account_io_{start,done} to blk-mq.c
These are only used for request based I/O, so move them where they are
used.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 17 Nov 2021 06:14:00 +0000 (07:14 +0100)]
 
block: move blk_steal_bios to blk-mq.c
Keep all the request based code together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 17 Nov 2021 06:13:59 +0000 (07:13 +0100)]
 
block: move blk_rq_init to blk-mq.c
blk_rq_init deals with a request structure, so move it to blk-mq.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 17 Nov 2021 06:13:58 +0000 (07:13 +0100)]
 
block: move request based cloning helpers to blk-mq.c
Keep all the request based code together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 17 Nov 2021 06:13:57 +0000 (07:13 +0100)]
 
blk-mq: move blk_mq_flush_plug_list
Move blk_mq_flush_plug_list and blk_mq_plug_issue_direct down in blk-mq.c
to prepare for marking blk_mq_request_issue_directly static without the
need of a forward declaration.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 17 Nov 2021 06:13:56 +0000 (07:13 +0100)]
 
block: remove blk-exec.c
All this code is tightly coupled to the blk-mq core, so move it
there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-4-hch@lst.de
[axboe: remove doc generation for blk-exec.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 17 Nov 2021 06:13:55 +0000 (07:13 +0100)]
 
block: remove rq_flush_dcache_pages
This function is trivial, and flush_dcache_page is always defined, so
just open code it in the 2.5 callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 17 Nov 2021 06:13:54 +0000 (07:13 +0100)]
 
block: move blk_rq_err_bytes to scsi
blk_rq_err_bytes is only used by the scsi midlayer, so move it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 28 Nov 2021 22:09:19 +0000 (14:09 -0800)]
 
Linux 5.16-rc3
Linus Torvalds [Sun, 28 Nov 2021 19:58:52 +0000 (11:58 -0800)]
 
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost
Pull vhost,virtio,vdpa bugfixes from Michael Tsirkin:
 "Misc fixes all over the place.
  Revert of virtio used length validation series: the approach taken
  does not seem to work, breaking too many guests in the process. We'll
  need to do length validation using some other approach"
[ This merge also ends up reverting commit 
f7a36b03a732 ("vsock/virtio:
  suppress used length validation"), which came in through the
  networking tree in the meantime, and was part of that whole used
  length validation series   - Linus ]
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vdpa_sim: avoid putting an uninitialized iova_domain
  vhost-vdpa: clean irqs before reseting vdpa device
  virtio-blk: modify the value type of num in virtio_queue_rq()
  vhost/vsock: cleanup removing `len` variable
  vhost/vsock: fix incorrect used length reported to the guest
  Revert "virtio_ring: validate used buffer length"
  Revert "virtio-net: don't let virtio core to validate used length"
  Revert "virtio-blk: don't let virtio core to validate used length"
  Revert "virtio-scsi: don't let virtio core to validate used buffer length"
Linus Torvalds [Sun, 28 Nov 2021 17:24:50 +0000 (09:24 -0800)]
 
Merge tag 'x86-urgent-2021-11-28' of git://git./linux/kernel/git/tip/tip
Pull x86 build fix from Thomas Gleixner:
 "A single fix for a missing __init annotation of prepare_command_line()"
* tag 'x86-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Mark prepare_command_line() __init
Linus Torvalds [Sun, 28 Nov 2021 17:15:34 +0000 (09:15 -0800)]
 
Merge tag 'sched-urgent-2021-11-28' of git://git./linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "A single scheduler fix to ensure that there is no stale KASAN shadow
  state left on the idle task's stack when a CPU is brought up after it
  was brought down before"
* tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/scs: Reset task stack state in bringup_cpu()
Linus Torvalds [Sun, 28 Nov 2021 17:10:54 +0000 (09:10 -0800)]
 
Merge tag 'perf-urgent-2021-11-28' of git://git./linux/kernel/git/tip/tip
Pull perf fix from Thomas Gleixner:
 "A single fix for perf to prevent it from sending SIGTRAP to another
  task from a trace point event as it's not possible to deliver a
  synchronous signal to a different task from there"
* tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Ignore sigtrap for tracepoints destined for other tasks
Linus Torvalds [Sun, 28 Nov 2021 17:04:41 +0000 (09:04 -0800)]
 
Merge tag 'locking-urgent-2021-11-28' of git://git./linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
 "Two regression fixes for reader writer semaphores:
   - Plug a race in the lock handoff which is caused by inconsistency of
     the reader and writer path and can lead to corruption of the
     underlying counter.
   - down_read_trylock() is suboptimal when the lock is contended and
     multiple readers trylock concurrently. That's due to the initial
     value being read non-atomically which results in at least two
     compare exchange loops. Making the initial readout atomic reduces
     this significantly. Whith 40 readers by 11% in a benchmark which
     enforces contention on mmap_sem"
* tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Optimize down_read_trylock() under highly contended case
  locking/rwsem: Make handoff bit handling more consistent
Linus Torvalds [Sun, 28 Nov 2021 16:50:53 +0000 (08:50 -0800)]
 
Merge tag 'trace-v5.16-rc2-3' of git://git./linux/kernel/git/rostedt/linux-trace
Pull another tracing fix from Steven Rostedt:
 "Fix the fix of pid filtering
  The setting of the pid filtering flag tested the "trace only this pid"
  case twice, and ignored the "trace everything but this pid" case.
  The 5.15 kernel does things a little differently due to the new sparse
  pid mask introduced in 5.16, and as the bug was discovered running the
  5.15 kernel, and the first fix was initially done for that kernel,
  that fix handled both cases (only pid and all but pid), but the
  forward port to 5.16 created this bug"
* tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Test the 'Do not trace this pid' case in create event
Linus Torvalds [Sun, 28 Nov 2021 15:17:38 +0000 (07:17 -0800)]
 
Merge tag 'iommu-fixes-v5.16-rc2' of git://git./linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
 - Intel VT-d fixes:
     - Remove unused PASID_DISABLED
     - Fix RCU locking
     - Fix for the unmap_pages call-back
 - Rockchip RK3568 address mask fix
 - AMD IOMMUv2 log message clarification
* tag 'iommu-fixes-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/vt-d: Fix unmap_pages support
  iommu/vt-d: Fix an unbalanced rcu_read_lock/rcu_read_unlock()
  iommu/rockchip: Fix PAGE_DESC_HI_MASKs for RK3568
  iommu/amd: Clarify AMD IOMMUv2 initialization messages
  iommu/vt-d: Remove unused PASID_DISABLED
Linus Torvalds [Sat, 27 Nov 2021 22:49:35 +0000 (14:49 -0800)]
 
Merge tag '5.16-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull ksmbd fixes from Steve French:
 "Five ksmbd server fixes, four of them for stable:
   - memleak fix
   - fix for default data stream on filesystems that don't support xattr
   - error logging fix
   - session setup fix
   - minor doc cleanup"
* tag '5.16-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd:
  ksmbd: fix memleak in get_file_stream_info()
  ksmbd: contain default data stream even if xattr is empty
  ksmbd: downgrade addition info error msg to debug in smb2_get_info_sec()
  docs: filesystem: cifs: ksmbd: Fix small layout issues
  ksmbd: Fix an error handling path in 'smb2_sess_setup()'
Guenter Roeck [Sat, 27 Nov 2021 15:44:42 +0000 (07:44 -0800)]
 
vmxnet3: Use generic Kconfig option for page size limit
Use the architecture independent Kconfig option PAGE_SIZE_LESS_THAN_64KB
to indicate that VMXNET3 requires a page size smaller than 64kB.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Guenter Roeck [Sat, 27 Nov 2021 15:44:41 +0000 (07:44 -0800)]
 
fs: ntfs: Limit NTFS_RW to page sizes smaller than 64k
NTFS_RW code allocates page size dependent arrays on the stack. This
results in build failures if the page size is 64k or larger.
  fs/ntfs/aops.c: In function 'ntfs_write_mst_block':
  fs/ntfs/aops.c:1311:1: error:
	the frame size of 2240 bytes is larger than 2048 bytes
Since commit 
f22969a66041 ("powerpc/64s: Default to 64K pages for 64 bit
book3s") this affects ppc:allmodconfig builds, but other architectures
supporting page sizes of 64k or larger are also affected.
Increasing the maximum frame size for affected architectures just to
silence this error does not really help.  The frame size would have to
be set to a really large value for 256k pages.  Also, a large frame size
could potentially result in stack overruns in this code and elsewhere
and is therefore not desirable.  Make NTFS_RW dependent on page sizes
smaller than 64k instead.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Guenter Roeck [Sat, 27 Nov 2021 15:44:40 +0000 (07:44 -0800)]
 
arch: Add generic Kconfig option indicating page size smaller than 64k
NTFS_RW and VMXNET3 require a page size smaller than 64kB.  Add generic
Kconfig option for use outside architecture code to avoid architecture
specific Kconfig options in that code.
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Steven Rostedt (VMware) [Sat, 27 Nov 2021 21:45:26 +0000 (16:45 -0500)]
 
tracing: Test the 'Do not trace this pid' case in create event
When creating a new event (via a module, kprobe, eprobe, etc), the
descriptors that are created must add flags for pid filtering if an
instance has pid filtering enabled, as the flags are used at the time the
event is executed to know if pid filtering should be done or not.
The "Only trace this pid" case was added, but a cut and paste error made
that case checked twice, instead of checking the "Trace all but this pid"
case.
Link: https://lore.kernel.org/all/202111280401.qC0z99JB-lkp@intel.com/
Fixes: 6cb206508b62 ("tracing: Check pid filtering when creating events")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Linus Torvalds [Sat, 27 Nov 2021 20:59:54 +0000 (12:59 -0800)]
 
Merge tag 'xfs-5.16-fixes-1' of git://git./fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
 "Fixes for a resource leak and a build robot complaint about totally
  dead code:
   - Fix buffer resource leak that could lead to livelock on corrupt fs.
   - Remove unused function xfs_inew_wait to shut up the build robots"
* tag 'xfs-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: remove xfs_inew_wait
  xfs: Fix the free logic of state in xfs_attr_node_hasname
Linus Torvalds [Sat, 27 Nov 2021 20:50:03 +0000 (12:50 -0800)]
 
Merge tag 'iomap-5.16-fixes-1' of git://git./fs/xfs/xfs-linux
Pull iomap fixes from Darrick Wong:
 "A single iomap bug fix and a cleanup for 5.16-rc2.
  The bug fix changes how iomap deals with reading from an inline data
  region -- whereas the current code (incorrectly) lets the iomap read
  iter try for more bytes after reading the inline region (which zeroes
  the rest of the page!) and hopes the next iteration terminates, we
  surveyed the inlinedata implementations and realized that all
  inlinedata implementations also require that the inlinedata region end
  at EOF, so we can simply terminate the read.
  The second patch documents these assumptions in the code so that
  they're not subtle implications anymore, and cleans up some of the
  grosser parts of that function.
  Summary:
   - Fix an accounting problem where unaligned inline data reads can run
     off the end of the read iomap iterator. iomap has historically
     required that inline data mappings only exist at the end of a file,
     though this wasn't documented anywhere.
   - Document iomap_read_inline_data and change its return type to be
     appropriate for the information that it's actually returning"
* tag 'iomap-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: iomap_read_inline_data cleanup
  iomap: Fix inline extent handling in iomap_readpage
Linus Torvalds [Sat, 27 Nov 2021 20:03:57 +0000 (12:03 -0800)]
 
Merge tag 'trace-v5.16-rc2-2' of git://git./linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
 "Two fixes to event pid filtering:
   - Make sure newly created events reflect the current state of pid
     filtering
   - Take pid filtering into account when recording trigger events.
     (Also clean up the if statement to be cleaner)"
* tag 'trace-v5.16-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix pid filtering when triggers are attached
  tracing: Check pid filtering when creating events
Linus Torvalds [Sat, 27 Nov 2021 19:28:37 +0000 (11:28 -0800)]
 
Merge tag 'io_uring-5.16-2021-11-27' of git://git.kernel.dk/linux-block
Pull more io_uring fixes from Jens Axboe:
 "The locking fixup that was applied earlier this rc has both a deadlock
  and IRQ safety issue, let's get that ironed out before -rc3. This
  contains:
   - Link traversal locking fix (Pavel)
   - Cancelation fix (Pavel)
   - Relocate cond_resched() for huge buffer chain freeing, avoiding a
     softlockup warning (Ye)
   - Fix timespec validation (Ye)"
* tag 'io_uring-5.16-2021-11-27' of git://git.kernel.dk/linux-block:
  io_uring: Fix undefined-behaviour in io_issue_sqe
  io_uring: fix soft lockup when call __io_remove_buffers
  io_uring: fix link traversal locking
  io_uring: fail cancellation for EXITING tasks
Linus Torvalds [Sat, 27 Nov 2021 19:19:42 +0000 (11:19 -0800)]
 
Merge tag 'block-5.16-2021-11-27' of git://git.kernel.dk/linux-block
Pull more block fixes from Jens Axboe:
 "Turns out that the flushing out of pending fixes before the
  Thanksgiving break didn't quite work out in terms of timing, so here's
  a followup set of fixes:
   - rq_qos_done() should be called regardless of whether or not we're
     the final put of the request, it's not related to the freeing of
     the state. This fixes an IO stall with wbt that a few users have
     reported, a regression in this release.
   - Only define zram_wb_devops if it's used, fixing a compilation
     warning for some compilers"
* tag 'block-5.16-2021-11-27' of git://git.kernel.dk/linux-block:
  zram: only make zram_wb_devops for CONFIG_ZRAM_WRITEBACK
  block: call rq_qos_done() before ref check in batch completions
Linus Torvalds [Sat, 27 Nov 2021 19:15:17 +0000 (11:15 -0800)]
 
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
 "Twelve fixes, eleven in drivers (target, qla2xx, scsi_debug, mpt3sas,
  ufs). The core fix is a minor correction to the previous state update
  fix for the iscsi daemons"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: scsi_debug: Zero clear zones at reset write pointer
  scsi: core: sysfs: Fix setting device state to SDEV_RUNNING
  scsi: scsi_debug: Sanity check block descriptor length in resp_mode_select()
  scsi: target: configfs: Delete unnecessary checks for NULL
  scsi: target: core: Use RCU helpers for INQUIRY t10_alua_tg_pt_gp
  scsi: mpt3sas: Fix incorrect system timestamp
  scsi: mpt3sas: Fix system going into read-only mode
  scsi: mpt3sas: Fix kernel panic during drive powercycle test
  scsi: ufs: ufs-mediatek: Add put_device() after of_find_device_by_node()
  scsi: scsi_debug: Fix type in min_t to avoid stack OOB
  scsi: qla2xxx: edif: Fix off by one bug in qla_edif_app_getfcinfo()
  scsi: ufs: ufshpb: Fix warning in ufshpb_set_hpb_read_to_upiu()
Linus Torvalds [Sat, 27 Nov 2021 18:33:55 +0000 (10:33 -0800)]
 
Merge tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
 "Highlights include:
  Stable fixes:
   - NFSv42: Fix pagecache invalidation after COPY/CLONE
  Bugfixes:
   - NFSv42: Don't fail clone() just because the server failed to return
     post-op attributes
   - SUNRPC: use different lockdep keys for INET6 and LOCAL
   - NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION
   - SUNRPC: fix header include guard in trace header"
* tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: use different lock keys for INET6 and LOCAL
  sunrpc: fix header include guard in trace header
  NFSv4.1: handle NFS4ERR_NOSPC by CREATE_SESSION
  NFSv42: Fix pagecache invalidation after COPY/CLONE
  NFS: Add a tracepoint to show the results of nfs_set_cache_invalid()
  NFSv42: Don't fail clone() unless the OP_CLONE operation failed
Linus Torvalds [Sat, 27 Nov 2021 18:27:35 +0000 (10:27 -0800)]
 
Merge tag 'erofs-for-5.16-rc3-fixes' of git://git./linux/kernel/git/xiang/erofs
Pull erofs fix from Gao Xiang:
 "Fix an ABBA deadlock introduced by XArray conversion"
* tag 'erofs-for-5.16-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix deadlock when shrink erofs slab
Linus Torvalds [Sat, 27 Nov 2021 18:06:15 +0000 (10:06 -0800)]
 
Merge tag 'powerpc-5.16-3' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
 "Fix KVM using a Power9 instruction on earlier CPUs, which could lead
  to the host SLB being incorrectly invalidated and a subsequent host
  crash.
  Fix kernel hardlockup on vmap stack overflow on 32-bit.
  Thanks to Christophe Leroy, Nicholas Piggin, and Fabiano Rosas"
* tag 'powerpc-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/32: Fix hardlockup on vmap stack overflow
  KVM: PPC: Book3S HV: Prevent POWER7/8 TLB flush flushing SLB
Linus Torvalds [Sat, 27 Nov 2021 17:50:31 +0000 (09:50 -0800)]
 
Merge tag 'mips-fixes_5.16_2' of git://git./linux/kernel/git/mips/linux
Pull MIPS fixes from Thomas Bogendoerfer:
 - build fix for ZSTD enabled configs
 - fix for preempt warning
 - fix for loongson FTLB detection
 - fix for page table level selection
* tag 'mips-fixes_5.16_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  MIPS: use 3-level pgtable for 64KB page size on MIPS_VA_BITS_48
  MIPS: loongson64: fix FTLB configuration
  MIPS: Fix using smp_processor_id() in preemptible in show_cpuinfo()
  MIPS: boot/compressed/: add __ashldi3 to target for ZSTD compression
Ye Bin [Thu, 18 Nov 2021 01:59:07 +0000 (09:59 +0800)]
 
io_uring: Fix undefined-behaviour in io_issue_sqe
We got issue as follows:
================================================================================
UBSAN: Undefined behaviour in ./include/linux/ktime.h:42:14
signed integer overflow:
-
4966321760114568020 * 
1000000000 cannot be represented in type 'long long int'
CPU: 1 PID: 2186 Comm: syz-executor.2 Not tainted 4.19.90+ #12
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78
 show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x170/0x1dc lib/dump_stack.c:118
 ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161
 handle_overflow+0x188/0x1dc lib/ubsan.c:192
 __ubsan_handle_mul_overflow+0x34/0x44 lib/ubsan.c:213
 ktime_set include/linux/ktime.h:42 [inline]
 timespec64_to_ktime include/linux/ktime.h:78 [inline]
 io_timeout fs/io_uring.c:5153 [inline]
 io_issue_sqe+0x42c8/0x4550 fs/io_uring.c:5599
 __io_queue_sqe+0x1b0/0xbc0 fs/io_uring.c:5988
 io_queue_sqe+0x1ac/0x248 fs/io_uring.c:6067
 io_submit_sqe fs/io_uring.c:6137 [inline]
 io_submit_sqes+0xed8/0x1c88 fs/io_uring.c:6331
 __do_sys_io_uring_enter fs/io_uring.c:8170 [inline]
 __se_sys_io_uring_enter fs/io_uring.c:8129 [inline]
 __arm64_sys_io_uring_enter+0x490/0x980 fs/io_uring.c:8129
 invoke_syscall arch/arm64/kernel/syscall.c:53 [inline]
 el0_svc_common+0x374/0x570 arch/arm64/kernel/syscall.c:121
 el0_svc_handler+0x190/0x260 arch/arm64/kernel/syscall.c:190
 el0_svc+0x10/0x218 arch/arm64/kernel/entry.S:1017
================================================================================
As ktime_set only judge 'secs' if big than KTIME_SEC_MAX, but if we pass
negative value maybe lead to overflow.
To address this issue, we must check if 'sec' is negative.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20211118015907.844807-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ye Bin [Mon, 22 Nov 2021 02:47:37 +0000 (10:47 +0800)]
 
io_uring: fix soft lockup when call __io_remove_buffers
I got issue as follows:
[ 567.094140] __io_remove_buffers: [1]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680
[  594.360799] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108]
[  594.364987] Modules linked in:
[  594.365405] irq event stamp: 
604180238
[  594.365906] hardirqs last  enabled at (
604180237): [<
ffffffff93fec9bd>] _raw_spin_unlock_irqrestore+0x2d/0x50
[  594.367181] hardirqs last disabled at (
604180238): [<
ffffffff93fbbadb>] sysvec_apic_timer_interrupt+0xb/0xc0
[  594.368420] softirqs last  enabled at (
569080666): [<
ffffffff94200654>] __do_softirq+0x654/0xa9e
[  594.369551] softirqs last disabled at (
569080575): [<
ffffffff913e1d6a>] irq_exit_rcu+0x1ca/0x250
[  594.370692] CPU: 2 PID: 108 Comm: kworker/u32:5 Tainted: G            L    5.15.0-next-
20211112+ #88
[  594.371891] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[  594.373604] Workqueue: events_unbound io_ring_exit_work
[  594.374303] RIP: 0010:_raw_spin_unlock_irqrestore+0x33/0x50
[  594.375037] Code: 48 83 c7 18 53 48 89 f3 48 8b 74 24 10 e8 55 f5 55 fd 48 89 ef e8 ed a7 56 fd 80 e7 02 74 06 e8 43 13 7b fd fb bf 01 00 00 00 <e8> f8 78 474
[  594.377433] RSP: 0018:
ffff888101587a70 EFLAGS: 
00000202
[  594.378120] RAX: 
0000000024030f0d RBX: 
0000000000000246 RCX: 
1ffffffff2f09106
[  594.379053] RDX: 
0000000000000000 RSI: 
ffffffff9449f0e0 RDI: 
0000000000000001
[  594.379991] RBP: 
ffffffff9586cdc0 R08: 
0000000000000001 R09: 
fffffbfff2effcab
[  594.380923] R10: 
ffffffff977fe557 R11: 
fffffbfff2effcaa R12: 
ffff8881b8f3def0
[  594.381858] R13: 
0000000000000246 R14: 
ffff888153a8b070 R15: 
0000000000000000
[  594.382787] FS:  
0000000000000000(0000) GS:
ffff888399c00000(0000) knlGS:
0000000000000000
[  594.383851] CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
[  594.384602] CR2: 
00007fcbe71d2000 CR3: 
00000000b4216000 CR4: 
00000000000006e0
[  594.385540] DR0: 
0000000000000000 DR1: 
0000000000000000 DR2: 
0000000000000000
[  594.386474] DR3: 
0000000000000000 DR6: 
00000000fffe0ff0 DR7: 
0000000000000400
[  594.387403] Call Trace:
[  594.387738]  <TASK>
[  594.388042]  find_and_remove_object+0x118/0x160
[  594.389321]  delete_object_full+0xc/0x20
[  594.389852]  kfree+0x193/0x470
[  594.390275]  __io_remove_buffers.part.0+0xed/0x147
[  594.390931]  io_ring_ctx_free+0x342/0x6a2
[  594.392159]  io_ring_exit_work+0x41e/0x486
[  594.396419]  process_one_work+0x906/0x15a0
[  594.399185]  worker_thread+0x8b/0xd80
[  594.400259]  kthread+0x3bf/0x4a0
[  594.401847]  ret_from_fork+0x22/0x30
[  594.402343]  </TASK>
Message from syslogd@localhost at Nov 13 09:09:54 ...
kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108]
[  596.793660] __io_remove_buffers: [
2099199]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680
We can reproduce this issue by follow syzkaller log:
r0 = syz_io_uring_setup(0x401, &(0x7f0000000300), &(0x7f0000003000/0x2000)=nil, &(0x7f0000ff8000/0x4000)=nil, &(0x7f0000000280)=<r1=>0x0, &(0x7f0000000380)=<r2=>0x0)
sendmsg$ETHTOOL_MSG_FEATURES_SET(0xffffffffffffffff, &(0x7f0000003080)={0x0, 0x0, &(0x7f0000003040)={&(0x7f0000000040)=ANY=[], 0x18}}, 0x0)
syz_io_uring_submit(r1, r2, &(0x7f0000000240)=@IORING_OP_PROVIDE_BUFFERS={0x1f, 0x5, 0x0, 0x401, 0x1, 0x0, 0x100, 0x0, 0x1, {0xfffd}}, 0x0)
io_uring_enter(r0, 0x3a2d, 0x0, 0x0, 0x0, 0x0)
The reason above issue  is 'buf->list' has 2,100,000 nodes, occupied cpu lead
to soft lockup.
To solve this issue, we need add schedule point when do while loop in
'__io_remove_buffers'.
After add  schedule point we do regression, get follow data.
[  240.141864] __io_remove_buffers: [1]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00
[  268.408260] __io_remove_buffers: [1]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180
[  275.899234] __io_remove_buffers: [
2099199]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00
[  296.741404] __io_remove_buffers: [1]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380
[  305.090059] __io_remove_buffers: [
2099199]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180
[  325.415746] __io_remove_buffers: [1]start ctx=0xffff8881b92d1000 bgid=65533 buf=0xffff8881a17d8f00
[  333.160318] __io_remove_buffers: [
2099199]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380
...
Fixes:
8bab4c09f24e("io_uring: allow conditional reschedule for intensive iterators")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20211122024737.2198530-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Steven Rostedt (VMware) [Fri, 26 Nov 2021 22:34:42 +0000 (17:34 -0500)]
 
tracing: Fix pid filtering when triggers are attached
If a event is filtered by pid and a trigger that requires processing of
the event to happen is a attached to the event, the discard portion does
not take the pid filtering into account, and the event will then be
recorded when it should not have been.
Cc: stable@vger.kernel.org
Fixes: 3fdaf80f4a836 ("tracing: Implement event pid filtering")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Alex Williamson [Fri, 26 Nov 2021 13:55:56 +0000 (21:55 +0800)]
 
iommu/vt-d: Fix unmap_pages support
When supporting only the .map and .unmap callbacks of iommu_ops,
the IOMMU driver can make assumptions about the size and alignment
used for mappings based on the driver provided pgsize_bitmap.  VT-d
previously used essentially PAGE_MASK for this bitmap as any power
of two mapping was acceptably filled by native page sizes.
However, with the .map_pages and .unmap_pages interface we're now
getting page-size and count arguments.  If we simply combine these
as (page-size * count) and make use of the previous map/unmap
functions internally, any size and alignment assumptions are very
different.
As an example, a given vfio device assignment VM will often create
a 4MB mapping at IOVA pfn [0x3fe00 - 0x401ff].  On a system that
does not support IOMMU super pages, the unmap_pages interface will
ask to unmap 1024 4KB pages at the base IOVA.  dma_pte_clear_level()
will recurse down to level 2 of the page table where the first half
of the pfn range exactly matches the entire pte level.  We clear the
pte, increment the pfn by the level size, but (oops) the next pte is
on a new page, so we exit the loop an pop back up a level.  When we
then update the pfn based on that higher level, we seem to assume
that the previous pfn value was at the start of the level.  In this
case the level size is 256K pfns, which we add to the base pfn and
get a results of 0x7fe00, which is clearly greater than 0x401ff,
so we're done.  Meanwhile we never cleared the ptes for the remainder
of the range.  When the VM remaps this range, we're overwriting valid
ptes and the VT-d driver complains loudly, as reported by the user
report linked below.
The fix for this seems relatively simple, if each iteration of the
loop in dma_pte_clear_level() is assumed to clear to the end of the
level pte page, then our next pfn should be calculated from level_pfn
rather than our working pfn.
Fixes: 3f34f1259776 ("iommu/vt-d: Implement map/unmap_pages() iommu_ops callback")
Reported-by: Ajay Garg <ajaygargnsit@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Link: https://lore.kernel.org/all/20211002124012.18186-1-ajaygargnsit@gmail.com/
Link: https://lore.kernel.org/r/163659074748.1617923.12716161410774184024.stgit@omen
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211126135556.397932-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>