Yu Kuai [Wed, 21 Jun 2023 16:51:07 +0000 (00:51 +0800)]
 
md/raid10: switch to use md_account_bio() for io accounting
Make sure that 'active_io' will represent inflight io instead of io that
is dispatching, and io accounting from all levels will be consistent.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-6-yukuai1@huaweicloud.com
Yu Kuai [Wed, 21 Jun 2023 16:51:06 +0000 (00:51 +0800)]
 
md/raid1: switch to use md_account_bio() for io accounting
Two problems can be fixed this way:
1) 'active_io' will represent inflight io instead of io that is
dispatching.
2) If io accounting is enabled or disabled while io is still inflight,
bio_start_io_acct() and bio_end_io_acct() is not balanced and io
inflight counter will be leaked.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-5-yukuai1@huaweicloud.com
Yu Kuai [Wed, 21 Jun 2023 16:51:05 +0000 (00:51 +0800)]
 
raid5: fix missing io accounting in raid5_align_endio()
Io will only be accounted as done from raid5_align_endio() if the io
succeeded, and io inflight counter will be leaked if such io failed.
Fix this problem by switching to use md_account_bio() for io accounting.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-4-yukuai1@huaweicloud.com
Yu Kuai [Wed, 21 Jun 2023 16:51:04 +0000 (00:51 +0800)]
 
md: also clone new io if io accounting is disabled
Currently, 'active_io' is grabbed before make_reqeust() is called, and
it's dropped immediately make_reqeust() returns. Hence 'active_io'
actually means io is dispatching, not io is inflight.
For raid0 and raid456 that io accounting is enabled, 'active_io' will
also be grabbed when bio is cloned for io accounting, and this 'active_io'
is dropped until io is done.
Always clone new bio so that 'active_io' will mean that io is inflight,
raid1 and raid10 will switch to use this method in later patches.
Now that bio will be cloned even if io accounting is disabled, also
rename related structure from '*_acct_*' to '*_clone_*'.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-3-yukuai1@huaweicloud.com
Yu Kuai [Wed, 21 Jun 2023 16:51:03 +0000 (00:51 +0800)]
 
md: move initialization and destruction of 'io_acct_set' to md.c
'io_acct_set' is only used for raid0 and raid456, prepare to use it for
raid1 and raid10, so that io accounting from different levels can be
consistent.
By the way, follow up patches will also use this io clone mechanism to
make sure 'active_io' represents in flight io, not io that is dispatching,
so that mddev_suspend will wait for io to be done as designed.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-2-yukuai1@huaweicloud.com
Christoph Hellwig [Thu, 15 Jun 2023 06:48:40 +0000 (08:48 +0200)]
 
md: deprecate bitmap file support
The support for bitmaps on files is a very bad idea abusing various kernel
APIs, and fundamentally requires the file to not be on the actual array
without a way to check that this is actually the case.  Add a deprecation
warning to see if we might be able to eventually drop it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-12-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:39 +0000 (08:48 +0200)]
 
md: make bitmap file support optional
The support for write intent bitmaps in files on an external files in md
is a hot mess that abuses ->bmap to map file offsets into physical device
objects, and also abuses buffer_heads in a creative way.
Make this code optional so that MD can be built into future kernels
without buffer_head support, and so that we can eventually deprecate it.
Note this does not affect the internal bitmap support, which has none of
the problems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-11-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:38 +0000 (08:48 +0200)]
 
md-bitmap: don't use ->index for pages backing the bitmap file
The md driver allocates pages for storing the bitmap file data, which
are not page cache pages, and then stores the page granularity file
offset in page->index, which is a field that isn't really valid except
for page cache pages.
Use a separate index for the superblock, and use the scheme used at
read size to recalculate the index for the bitmap pages instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-10-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:37 +0000 (08:48 +0200)]
 
md-bitmap: account for mddev->bitmap_info.offset in read_sb_page
Diretly apply mddev->bitmap_info.offset to the sector number to read
instead of doing that in both callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-9-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:36 +0000 (08:48 +0200)]
 
md-bitmap: cleanup read_sb_page
Convert read_sb_page to the normal kernel coding style, calculate the
target sector only once, and add a local iosize variable to make the call
to sync_page_io more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-8-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:35 +0000 (08:48 +0200)]
 
md-bitmap: refactor md_bitmap_init_from_disk
Split the confusing loop in md_bitmap_init_from_disk that iterates over
all chunks but also needs to read and map the pages into three separate
loops: one that iterates over the pages to read them, a second optional
one to iterate over the pages to mark them invalid if the bitmaps are
out of date, and a final one that actually iterates over the chunks.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202306160552.smw0qbmb-lkp@intel.com/
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-7-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:34 +0000 (08:48 +0200)]
 
md-bitmap: rename read_page to read_file_page
Make the difference to read_sb_page clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-6-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:33 +0000 (08:48 +0200)]
 
md-bitmap: split file writes into a separate helper
Split the file write code out of write_page into a separate helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-5-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:32 +0000 (08:48 +0200)]
 
md-bitmap: use %pD to print the file name in md_bitmap_file_kick
Don't bother allocating an extra buffer in the I/O failure handler and
instead use the printk built-in format to print the last 4 path name
components.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-4-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:31 +0000 (08:48 +0200)]
 
md-bitmap: initialize variables at declaration time in md_bitmap_file_unmap
Just a small tidyup to prepare for bigger changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-3-hch@lst.de
Christoph Hellwig [Thu, 15 Jun 2023 06:48:30 +0000 (08:48 +0200)]
 
md-bitmap: set BITMAP_WRITE_ERROR in write_sb_page
Set BITMAP_WRITE_ERROR directly in write_sb_page instead of propagating
the error to the caller and setting it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-2-hch@lst.de
Yu Kuai [Mon, 29 May 2023 13:20:37 +0000 (21:20 +0800)]
 
md: enhance checking in md_check_recovery()
For md_check_recovery():
1) if 'MD_RECOVERY_RUNING' is not set, register new sync_thread.
2) if 'MD_RECOVERY_RUNING' is set:
 a) if 'MD_RECOVERY_DONE' is not set, don't do anything, wait for
   md_do_sync() to be done.
 b) if 'MD_RECOVERY_DONE' is set, unregister sync_thread. Current code
   expects that sync_thread is not NULL, otherwise new sync_thread will
   be registered, which will corrupt the array.
Make sure md_check_recovery() won't register new sync_thread if
'MD_RECOVERY_RUNING' is still set, and a new WARN_ON_ONCE() is added for
the above corruption,
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-7-yukuai1@huaweicloud.com
Yu Kuai [Mon, 29 May 2023 13:20:36 +0000 (21:20 +0800)]
 
md: wake up 'resync_wait' at last in md_reap_sync_thread()
md_reap_sync_thread() is just replaced with wait_event(resync_wait, ...)
from action_store(), just make sure action_store() will still wait for
everything to be done in md_reap_sync_thread().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewd-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-6-yukuai1@huaweicloud.com
Yu Kuai [Mon, 29 May 2023 13:20:35 +0000 (21:20 +0800)]
 
md: refactor idle/frozen_sync_thread() to fix deadlock
Our test found a following deadlock in raid10:
1) Issue a normal write, and such write failed:
  raid10_end_write_request
   set_bit(R10BIO_WriteError, &r10_bio->state)
   one_write_done
    reschedule_retry
  // later from md thread
  raid10d
   handle_write_completed
    list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
  // later from md thread
  raid10d
   if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
    list_move(conf->bio_end_io_list.prev, &tmp)
    r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
    raid_end_bio_io(r10_bio)
Dependency chain 1: normal io is waiting for updating superblock
2) Trigger a recovery:
  raid10_sync_request
   raise_barrier
Dependency chain 2: sync thread is waiting for normal io
3) echo idle/frozen to sync_action:
  action_store
   mddev_lock
    md_unregister_thread
     kthread_stop
Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
4) md thread can't update superblock:
  raid10d
   md_check_recovery
    if (mddev_trylock(mddev))
     md_update_sb
Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
Hence cyclic dependency exist, in order to fix the problem, we must
break one of them. Dependency 1 and 2 can't be broken because they are
foundation design. Dependency 4 may be possible if it can be guaranteed
that no io can be inflight, however, this requires a new mechanism which
seems complex. Dependency 3 is a good choice, because idle/frozen only
requires sync thread to finish, which can be done asynchronously that is
already implemented, and 'reconfig_mutex' is not needed anymore.
This patch switch 'idle' and 'frozen' to wait sync thread to be done
asynchronously, and this patch also add a sequence counter to record how
many times sync thread is done, so that 'idle' won't keep waiting on new
started sync thread.
Noted that raid456 has similiar deadlock([1]), and it's verified[2] this
deadlock can be fixed by this patch as well.
[1] https://lore.kernel.org/linux-raid/
5ed54ffc-ce82-bf66-4eff-
390cb23bc1ac@molgen.mpg.de/T/#t
[2] https://lore.kernel.org/linux-raid/
e9067438-d713-f5f3-0d3d-
9e6b0e9efa0e@huaweicloud.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com
Yu Kuai [Mon, 29 May 2023 13:20:34 +0000 (21:20 +0800)]
 
md: add a mutex to synchronize idle and frozen in action_store()
Currently, for idle and frozen, action_store will hold 'reconfig_mutex'
and call md_reap_sync_thread() to stop sync thread, however, this will
cause deadlock (explained in the next patch). In order to fix the
problem, following patch will release 'reconfig_mutex' and wait on
'resync_wait', like md_set_readonly() and do_md_stop() does.
Consider that action_store() will set/clear 'MD_RECOVERY_FROZEN'
unconditionally, which might cause unexpected problems, for example,
frozen just set 'MD_RECOVERY_FROZEN' and is still in progress, while
'idle' clear 'MD_RECOVERY_FROZEN' and new sync thread is started, which
might starve in progress frozen. A mutex is added to synchronize idle
and frozen from action_store().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-4-yukuai1@huaweicloud.com
Yu Kuai [Mon, 29 May 2023 13:20:33 +0000 (21:20 +0800)]
 
md: refactor action_store() for 'idle' and 'frozen'
Prepare to handle 'idle' and 'frozen' differently to fix a deadlock, there
are no functional changes except that MD_RECOVERY_RUNNING is checked
again after 'reconfig_mutex' is held.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-3-yukuai1@huaweicloud.com
Yu Kuai [Mon, 29 May 2023 13:20:32 +0000 (21:20 +0800)]
 
Revert "md: unlock mddev before reap sync_thread in action_store"
This reverts commit 
9dfbdafda3b34e262e43e786077bab8e476a89d1.
Because it will introduce a defect that sync_thread can be running while
MD_RECOVERY_RUNNING is cleared, which will cause some unexpected problems,
for example:
list_add corruption. prev->next should be next (
ffff0001ac1daba0), but was 
ffff0000ce1a02a0. (prev=
ffff0000ce1a02a0).
Call trace:
 __list_add_valid+0xfc/0x140
 insert_work+0x78/0x1a0
 __queue_work+0x500/0xcf4
 queue_work_on+0xe8/0x12c
 md_check_recovery+0xa34/0xf30
 raid10d+0xb8/0x900 [raid10]
 md_thread+0x16c/0x2cc
 kthread+0x1a4/0x1ec
 ret_from_fork+0x10/0x18
This is because work is requeued while it's still inside workqueue:
t1:			t2:
action_store
 mddev_lock
  if (mddev->sync_thread)
   mddev_unlock
   md_unregister_thread
   // first sync_thread is done
			md_check_recovery
			 mddev_try_lock
			 /*
			  * once MD_RECOVERY_DONE is set, new sync_thread
			  * can start.
			  */
			 set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
			 INIT_WORK(&mddev->del_work, md_start_sync)
			 queue_work(md_misc_wq, &mddev->del_work)
			  test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
			  // set pending bit
			  insert_work
			   list_add_tail
			 mddev_unlock
   mddev_lock_nointr
   md_reap_sync_thread
   // MD_RECOVERY_RUNNING is cleared
 mddev_unlock
t3:
// before queued work started from t2
md_check_recovery
 // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
 INIT_WORK(&mddev->del_work, md_start_sync)
  work->data = 0
  // work pending bit is cleared
 queue_work(md_misc_wq, &mddev->del_work)
  insert_work
   list_add_tail
   // list is corrupted
The above commit is reverted to fix the problem, the deadlock this
commit tries to fix will be fixed in following patches.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-2-yukuai1@huaweicloud.com
Jinyoung Choi [Tue, 25 Jul 2023 05:18:39 +0000 (14:18 +0900)]
 
block: cleanup bio_integrity_prep
If a problem occurs in the process of creating an integrity payload, the
status of bio is always BLK_STS_RESOURCE.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230725051839epcms2p8e4d20ad6c51326ad032e8406f59d0aaa@epcms2p8
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 21 Jul 2023 17:27:30 +0000 (10:27 -0700)]
 
block: Improve performance for BLK_MQ_F_BLOCKING drivers
blk_mq_run_queue() runs the queue asynchronously if BLK_MQ_F_BLOCKING
has been set. This is suboptimal since running the queue asynchronously
is slower than running the queue synchronously. This patch modifies
blk_mq_run_queue() as follows if BLK_MQ_F_BLOCKING has been set:
- Run the queue synchronously if it is allowed to sleep.
- Run the queue asynchronously if it is not allowed to sleep.
Additionally, blk_mq_run_hw_queue(hctx, false) calls are modified into
blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING) if the caller
may be invoked from atomic context.
The following caller chains have been reviewed:
blk_mq_run_hw_queue(hctx, false)
  blk_mq_get_tag()      /* may sleep, hence the functions it calls may also sleep */
  blk_execute_rq()             /* may sleep */
  blk_mq_run_hw_queues(q, async=false)
    blk_freeze_queue_start()   /* may sleep */
    blk_mq_requeue_work()      /* may sleep */
    scsi_kick_queue()
      scsi_requeue_run_queue() /* may sleep */
      scsi_run_host_queues()
        scsi_ioctl_reset()     /* may sleep */
  blk_mq_insert_requests(hctx, ctx, list, run_queue_async=false)
    blk_mq_dispatch_plug_list(plug, from_sched=false)
      blk_mq_flush_plug_list(plug, from_schedule=false)
        __blk_flush_plug(plug, from_schedule=false)
	blk_add_rq_to_plug()
	  blk_mq_submit_bio()  /* may sleep if REQ_NOWAIT has not been set */
  blk_mq_plug_issue_direct()
    blk_mq_flush_plug_list()   /* see above */
  blk_mq_dispatch_plug_list(plug, from_sched=false)
    blk_mq_flush_plug_list()   /* see above */
  blk_mq_try_issue_directly()
    blk_mq_submit_bio()        /* may sleep if REQ_NOWAIT has not been set */
  blk_mq_try_issue_list_directly(hctx, list)
    blk_mq_insert_requests() /* see above */
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230721172731.955724-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 21 Jul 2023 17:27:29 +0000 (10:27 -0700)]
 
scsi: Remove a blk_mq_run_hw_queues() call
blk_mq_kick_requeue_list() calls blk_mq_run_hw_queues() asynchronously.
Leave out the direct blk_mq_run_hw_queues() call. This patch causes
scsi_run_queue() to call blk_mq_run_hw_queues() asynchronously instead
of synchronously. Since scsi_run_queue() is not called from the hot I/O
submission path, this patch does not affect the hot path.
This patch prepares for allowing blk_mq_run_hw_queue() to sleep if
BLK_MQ_F_BLOCKING has been set. scsi_run_queue() may be called from
atomic context and must not sleep. Hence the removal of the
blk_mq_run_hw_queues(q, false) call. See also scsi_unblock_requests().
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230721172731.955724-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 21 Jul 2023 17:27:28 +0000 (10:27 -0700)]
 
scsi: Inline scsi_kick_queue()
Inline scsi_kick_queue() to prepare for modifying the second argument
passed to blk_mq_run_hw_queues().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230721172731.955724-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:33 +0000 (09:54 -0700)]
 
block: don't pass a bio to bio_try_merge_hw_seg
There is no good reason to pass the bio to bio_try_merge_hw_seg.  Just
pass the current bvec and rename the function to bvec_try_merge_hw_page.
This will allow reusing this function for supporting multi-page integrity
payload bvecs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:32 +0000 (09:54 -0700)]
 
block: move the bi_size update out of __bio_try_merge_page
The update of bi_size is the only thing in __bio_try_merge_page that
needs a bio.  Move it to the callers, and merge __bio_try_merge_page
and page_is_mergeable into a single bvec_try_merge_page that only takes
the current bvec instead of a full bio.  This will allow reusing this
function for supporting multi-page integrity payload bvecs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:31 +0000 (09:54 -0700)]
 
block: downgrade a bio_full call in bio_add_page
bio_add_page already checks that there is space in bi_size a little
earlier.  So after we failed to add to an existing segment, just check
that there is another one available instead of duplicating the bi_size
check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:30 +0000 (09:54 -0700)]
 
block: move the bi_size overflow check in __bio_try_merge_page
Checking for availability in bi_size in a function that attempts to
merge into an existing segment is a bit odd, as the limit also applies
when adding a new segment.  This code works fine as we always call
__bio_try_merge_page, but contributes to sub-optimal calling conventions
and doesn't lead to clear code.
Move it to two of the callers instead, the third one already has a more
strict check that includes max_hw_segments anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230724165433.117645-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:29 +0000 (09:54 -0700)]
 
block: move the bi_vcnt check out of __bio_try_merge_page
Move the bi_vcnt out of __bio_try_merge_page and into the two callers
that don't already have it in preparation for additional changes to
__bio_try_merge_page.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230724165433.117645-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:28 +0000 (09:54 -0700)]
 
block: move the BIO_CLONED checks out of __bio_try_merge_page
__bio_try_merge_page is a way too low-level helper to assert that the
bio is not cloned.  Move the check into bio_add_page and
bio_iov_iter_get_pages instead, which are the high level entry points
that should enforce this variant.  bio_add_hw_page already this
check, coverig the third (indirect) caller of __bio_try_merge_page.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230724165433.117645-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:27 +0000 (09:54 -0700)]
 
block: use SECTOR_SHIFT bio_add_hw_page
Use the SECTOR_SHIFT magic constant instead of the magic number.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230724165433.117645-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jul 2023 16:54:26 +0000 (09:54 -0700)]
 
block: tidy up the bio full checks in bio_add_hw_page
bio_add_hw_page already checks if the number of bytes trying to be added
even fit into max_hw_sectors limit of the queue.   Remove the call to
bio_full and just do a check for the smaller of the number of segments
in the bio and the queue max segments limit, and do this cheap check
before the more expensive gap to previous check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nitesh Shetty [Wed, 19 Jul 2023 12:16:08 +0000 (17:46 +0530)]
 
block: refactor to use helper
Reduce some code by making use of bio_integrity_bytes().
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230719121608.32105-1-nj.shetty@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chengming Zhou [Mon, 17 Jul 2023 04:00:58 +0000 (12:00 +0800)]
 
blk-flush: reuse rq queuelist in flush state machine
Since we don't need to maintain inflight flush_data requests list
anymore, we can reuse rq->queuelist for flush pending list.
Note in mq_flush_data_end_io(), we need to re-initialize rq->queuelist
before reusing it in the state machine when end, since the rq->rq_next
also reuse it, may have corrupted rq->queuelist by the driver.
This patch decrease the size of struct request by 16 bytes.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230717040058.3993930-5-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chengming Zhou [Mon, 17 Jul 2023 04:00:57 +0000 (12:00 +0800)]
 
blk-flush: count inflight flush_data requests
The flush state machine use a double list to link all inflight
flush_data requests, to avoid issuing separate post-flushes for
these flush_data requests which shared PREFLUSH.
So we can't reuse rq->queuelist, this is why we need rq->flush.list
In preparation of the next patch that reuse rq->queuelist for flush
state machine, we change the double linked list to unsigned long
counter, which count all inflight flush_data requests.
This is ok since we only need to know if there is any inflight
flush_data request, so unsigned long counter is good.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230717040058.3993930-4-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chengming Zhou [Mon, 17 Jul 2023 04:00:56 +0000 (12:00 +0800)]
 
blk-flush: fix rq->flush.seq for post-flush requests
If the policy == (REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH), it means that the
data sequence and post-flush sequence need to be done for this request.
The rq->flush.seq should record what sequences have been done (or don't
need to be done). So in this case, pre-flush doesn't need to be done,
we should init rq->flush.seq to REQ_FSEQ_PREFLUSH not REQ_FSEQ_POSTFLUSH.
Fixes: 615939a2ae73 ("blk-mq: defer to the normal submission path for post-flush requests")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230717040058.3993930-3-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chengming Zhou [Mon, 17 Jul 2023 04:00:55 +0000 (12:00 +0800)]
 
blk-mq: use percpu csd to remote complete instead of per-rq csd
If request need to be completed remotely, we insert it into percpu llist,
and smp_call_function_single_async() if llist is empty previously.
We don't need to use per-rq csd, percpu csd is enough. And the size of
struct request is decreased by 24 bytes.
This way is cleaner, and looks correct, given block softirq is guaranteed
to be scheduled to consume the list if one new request is added to this
percpu list, either smp_call_function_single_async() returns -EBUSY or 0.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230717040058.3993930-2-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 7 Jul 2023 09:42:39 +0000 (11:42 +0200)]
 
block: don't allow enabling a cache on devices that don't support it
Currently the write_cache attribute allows enabling the QUEUE_FLAG_WC
flag on devices that never claimed the capability.
Fix that by adding a QUEUE_FLAG_HW_WC flag that is set by
blk_queue_write_cache and guards re-enabling the cache through sysfs.
Note that any rescan that calls blk_queue_write_cache will still
re-enable the write cache as in the current code.
Fixes: 93e9d8e836cb ("block: add ability to flag write back caching on a device")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230707094239.107968-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 7 Jul 2023 09:42:38 +0000 (11:42 +0200)]
 
block: cleanup queue_wc_store
Get rid of the local queue_wc_store variable and handling setting and
clearing the QUEUE_FLAG_WC flag diretly instead the if / else if.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230707094239.107968-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Thomas Weißschuh [Thu, 13 Jul 2023 19:29:35 +0000 (21:29 +0200)]
 
nbd: automatically load module on genl access
Add a module alias to nbd.ko that allows the generic netlink core to
automatically load the module when netlink messages for nbd are
received.
This frees the user from manually having to load the module before using
nbd functionality via netlink.
If the system policy allows it this can even be used to load the nbd
module from containers which would otherwise not have access to the
necessary module files to do a normal "modprobe nbd".
For example this avoids the following error when using nbd-client:
$ nbd-client localhost 10809 /dev/nbd0
...
Error: Couldn't resolve the nbd netlink family, make sure the nbd module is loaded and your nbd driver supports the netlink interface.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Josef Bacik <josef@toxicpadna.com>
Link: https://lore.kernel.org/r/20230713-b4-nbd-genl-v3-1-226cbddba04b@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Azeem Shaikh [Mon, 3 Jul 2023 17:21:59 +0000 (17:21 +0000)]
 
blk-wbt: Replace strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230703172159.3668349-3-azeemshaikh38@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Azeem Shaikh [Mon, 3 Jul 2023 17:21:58 +0000 (17:21 +0000)]
 
kyber: Replace strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230703172159.3668349-2-azeemshaikh38@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 16 Jul 2023 22:10:37 +0000 (15:10 -0700)]
 
Linux 6.5-rc2
Linus Torvalds [Sun, 16 Jul 2023 21:12:49 +0000 (14:12 -0700)]
 
Merge tag 'xtensa-
20230716' of https://github.com/jcmvbkbc/linux-xtensa
Pull xtensa fixes from Max Filippov:
 - fix interaction between unaligned exception handler and load/store
   exception handler
 - fix parsing ISS network interface specification string
 - add comment about etherdev freeing to ISS network driver
* tag 'xtensa-
20230716' of https://github.com/jcmvbkbc/linux-xtensa:
  xtensa: fix unaligned and load/store configuration interaction
  xtensa: ISS: fix call to split_if_spec
  xtensa: ISS: add comment about etherdev freeing
Linus Torvalds [Sun, 16 Jul 2023 20:46:08 +0000 (13:46 -0700)]
 
Merge tag 'perf_urgent_for_v6.5_rc2' of git://git./linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
 - Fix a lockdep warning when the event given is the first one, no event
   group exists yet but the code still goes and iterates over event
   siblings
* tag 'perf_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix lockdep warning in for_each_sibling_event() on SPR
Linus Torvalds [Sun, 16 Jul 2023 20:34:29 +0000 (13:34 -0700)]
 
Merge tag 'objtool_urgent_for_v6.5_rc2' of git://git./linux/kernel/git/tip/tip
Pull objtool fixes from Borislav Petkov:
 - Mark copy_iovec_from_user() __noclone in order to prevent gcc from
   doing an inter-procedural optimization and confuse objtool
 - Initialize struct elf fully to avoid build failures
* tag 'objtool_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  iov_iter: Mark copy_iovec_from_user() noclone
  objtool: initialize all of struct elf
Linus Torvalds [Sun, 16 Jul 2023 20:22:08 +0000 (13:22 -0700)]
 
Merge tag 'sched_urgent_for_v6.5_rc2' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
 - Remove a cgroup from under a polling process properly
 - Fix the idle sibling selection
* tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/psi: use kernfs polling functions for PSI trigger polling
  sched/fair: Use recent_used_cpu to test p->cpus_ptr
Linus Torvalds [Sun, 16 Jul 2023 19:55:31 +0000 (12:55 -0700)]
 
Merge tag 'pinctrl-v6.5-2' of git://git./linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
 "I'm mostly on vacation but what would vacation be without a few
  critical fixes so people can use their gaming laptops when hiding away
  from the sun (or rain)?
   - Fix a really annoying interrupt storm in the AMD driver affecting
     Asus TUF gaming notebooks
   - Fix device tree parsing in the Renesas driver"
* tag 'pinctrl-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: amd: Unify debounce handling into amd_pinconf_set()
  pinctrl: amd: Drop pull up select configuration
  pinctrl: amd: Use amd_pinconf_set() for all config options
  pinctrl: amd: Only use special debounce behavior for GPIO 0
  pinctrl: renesas: rzg2l: Handle non-unique subnode names
  pinctrl: renesas: rzv2m: Handle non-unique subnode names
Linus Torvalds [Sun, 16 Jul 2023 19:49:05 +0000 (12:49 -0700)]
 
Merge tag '6.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
 - Two reconnect fixes: important fix to address inFlight count to leak
   (which can leak credits), and fix for better handling a deleted share
 - DFS fix
 - SMB1 cleanup fix
 - deferred close fix
* tag '6.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix mid leak during reconnection after timeout threshold
  cifs: is_network_name_deleted should return a bool
  smb: client: fix missed ses refcounting
  smb: client: Fix -Wstringop-overflow issues
  cifs: if deferred close is disabled then close files immediately
Linus Torvalds [Sun, 16 Jul 2023 19:28:04 +0000 (12:28 -0700)]
 
Merge tag 'powerpc-6.5-3' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
 - Fix Speculation_Store_Bypass reporting in /proc/self/status on
   Power10
 - Fix HPT with 4K pages since recent changes by implementing pmd_same()
 - Fix 64-bit native_hpte_remove() to be irq-safe
Thanks to Aneesh Kumar K.V, Nageswara R Sastry, and Russell Currey.
* tag 'powerpc-6.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm/book3s64/hash/4k: Add pmd_same callback for 4K page size
  powerpc/64e: Fix obtool warnings in exceptions-64e.S
  powerpc/security: Fix Speculation_Store_Bypass reporting on Power10
  powerpc/64s: Fix native_hpte_remove() to be irq-safe
Linus Torvalds [Sun, 16 Jul 2023 19:18:18 +0000 (12:18 -0700)]
 
Merge tag 'hardening-v6.5-rc2' of git://git./linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
 - Remove LTO-only suffixes from promoted global function symbols
   (Yonghong Song)
 - Remove unused .text..refcount section from vmlinux.lds.h (Petr Pavlu)
 - Add missing __always_inline to sparc __arch_xchg() (Arnd Bergmann)
 - Claim maintainership of string routines
* tag 'hardening-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  sparc: mark __arch_xchg() as __always_inline
  MAINTAINERS: Foolishly claim maintainership of string routines
  kallsyms: strip LTO-only suffixes from promoted global functions
  vmlinux.lds.h: Remove a reference to no longer used sections .text..refcount
Linus Torvalds [Sun, 16 Jul 2023 19:13:51 +0000 (12:13 -0700)]
 
Merge tag 'probes-fixes-v6.5-rc1-2' of git://git./linux/kernel/git/trace/linux-trace
Pull probe fixes from Masami Hiramatsu:
 - fprobe: Add a comment why fprobe will be skipped if another kprobe is
   running in fprobe_kprobe_handler().
 - probe-events: Fix some issues related to fetch-arguments:
    - Fix double counting of the string length for user-string and
      symstr. This will require longer buffer in the array case.
    - Fix not to count error code (minus value) for the total used
      length in array argument. This makes the total used length
      shorter.
    - Fix to update dynamic used data size counter only if fetcharg uses
      the dynamic size data. This may mis-count the used dynamic data
      size and corrupt data.
    - Revert "tracing: Add "(fault)" name injection to kernel probes"
      because that did not work correctly with a bug, and we agreed the
      current '(fault)' output (instead of '"(fault)"' like a string)
      explains what happened more clearly.
    - Fix to record 0-length (means fault access) data_loc data in fetch
      function itself, instead of store_trace_args(). If we record an
      array of string, this will fix to save fault access data on each
      entry of the array correctly.
* tag 'probes-fixes-v6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if fails
  Revert "tracing: Add "(fault)" name injection to kernel probes"
  tracing/probes: Fix to update dynamic data counter if fetcharg uses it
  tracing/probes: Fix not to count error code to total length
  tracing/probes: Fix to avoid double count of the string length on the array
  fprobes: Add a comment why fprobe_kprobe_handler exits if kprobe is running
Linus Torvalds [Sat, 15 Jul 2023 15:51:02 +0000 (08:51 -0700)]
 
Merge tag 'spi-fix-v6.5-rc1' of git://git./linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
 "A couple of fairly minor driver specific fixes here, plus a bunch of
  maintainership and admin updates. Nothing too remarkable"
* tag 'spi-fix-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  mailmap: add entry for Jonas Gorski
  MAINTAINERS: add myself for spi-bcm63xx
  spi: s3c64xx: clear loopback bit after loopback test
  spi: bcm63xx: fix max prepend length
  MAINTAINERS: Add myself as a maintainer for Microchip SPI
Linus Torvalds [Sat, 15 Jul 2023 15:46:09 +0000 (08:46 -0700)]
 
Merge tag 'regmap-fix-v6.5-rc1' of git://git./linux/kernel/git/broonie/regmap
Pull regmap fix from Mark Brown:
 "One fix for an out of bounds access in the interupt code here"
* tag 'regmap-fix-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
  regmap-irq: Fix out-of-bounds access when allocating config buffers
Linus Torvalds [Sat, 15 Jul 2023 15:40:00 +0000 (08:40 -0700)]
 
Merge tag 'iommu-fixes-v6.5-rc1' of git://git./linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
 - Fix a regression causing a crash on sysfs access of iommu-group
   specific files
 - Fix signedness bug in SVA code
* tag 'iommu-fixes-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/sva: Fix signedness bug in iommu_sva_alloc_pasid()
  iommu: Fix crash during syfs iommu_groups/N/type
Linus Torvalds [Sat, 15 Jul 2023 03:19:25 +0000 (20:19 -0700)]
 
Merge tag 'x86_urgent_for_6.5_rc2' of git://git./linux/kernel/git/tip/tip
Pull x86 CFI fixes from Peter Zijlstra:
 "Fix kCFI/FineIBT weaknesses
  The primary bug Alyssa noticed was that with FineIBT enabled function
  prologues have a spurious ENDBR instruction:
    __cfi_foo:
	endbr64
	subl	$hash, %r10d
	jz	1f
	ud2
	nop
    1:
    foo:
	endbr64 <--- *sadface*
  This means that any indirect call that fails to target the __cfi
  symbol and instead targets (the regular old) foo+0, will succeed due
  to that second ENDBR.
  Fixing this led to the discovery of a single indirect call that was
  still doing this: ret_from_fork(). Since that's an assembly stub the
  compiler would not generate the proper kCFI indirect call magic and it
  would not get patched.
  Brian came up with the most comprehensive fix -- convert the thing to
  C with only a very thin asm wrapper. This ensures the kernel thread
  boostrap is a proper kCFI call.
  While discussing all this, Kees noted that kCFI hashes could/should be
  poisoned to seal all functions whose address is never taken, further
  limiting the valid kCFI targets -- much like we already do for IBT.
  So what was a 'simple' observation and fix cascaded into a bunch of
  inter-related CFI infrastructure fixes"
* tag 'x86_urgent_for_6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cfi: Only define poison_cfi() if CONFIG_X86_KERNEL_IBT=y
  x86/fineibt: Poison ENDBR at +0
  x86: Rewrite ret_from_fork() in C
  x86/32: Remove schedule_tail_wrapper()
  x86/cfi: Extend ENDBR sealing to kCFI
  x86/alternative: Rename apply_ibt_endbr()
  x86/cfi: Extend {JMP,CAKK}_NOSPEC comment
Linus Torvalds [Sat, 15 Jul 2023 02:57:29 +0000 (19:57 -0700)]
 
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
 "This is a bunch of small driver fixes and a larger rework of zone disk
  handling (which reaches into blk and nvme).
  The aacraid array-bounds fix is now critical since the security people
  turned on -Werror for some build tests, which now fail without it"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: storvsc: Handle SRB status value 0x30
  scsi: block: Improve checks in blk_revalidate_disk_zones()
  scsi: block: virtio_blk: Set zone limits before revalidating zones
  scsi: block: nullblk: Set zone limits before revalidating zones
  scsi: nvme: zns: Set zone limits before revalidating zones
  scsi: sd_zbc: Set zone limits before revalidating zones
  scsi: ufs: core: Add support for qTimestamp attribute
  scsi: aacraid: Avoid -Warray-bounds warning
  scsi: ufs: ufs-mediatek: Add dependency for RESET_CONTROLLER
  scsi: ufs: core: Update contact email for monitor sysfs nodes
  scsi: scsi_debug: Remove dead code
  scsi: qla2xxx: Use vmalloc_array() and vcalloc()
  scsi: fnic: Use vmalloc_array() and vcalloc()
  scsi: qla2xxx: Fix error code in qla2x00_start_sp()
  scsi: qla2xxx: Silence a static checker warning
  scsi: lpfc: Fix a possible data race in lpfc_unregister_fcf_rescan()
Linus Torvalds [Sat, 15 Jul 2023 02:52:18 +0000 (19:52 -0700)]
 
Merge tag 'block-6.5-2023-07-14' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
 - NVMe pull request via Keith:
      - Don't require quirk to use duplicate namespace identifiers
        (Christoph, Sagi)
      - One more BOGUS_NID quirk (Pankaj)
      - IO timeout and error hanlding fixes for PCI (Keith)
      - Enhanced metadata format mask fix (Ankit)
      - Association race condition fix for fibre channel (Michael)
      - Correct debugfs error checks (Minjie)
      - Use PAGE_SECTORS_SHIFT where needed (Damien)
      - Reduce kernel logs for legacy nguid attribute (Keith)
      - Use correct dma direction when unmapping metadata (Ming)
 - Fix for a flush handling regression in this release (Christoph)
 - Fix for batched request time stamping (Chengming)
 - Fix for a regression in the mq-deadline position calculation (Bart)
 - Lockdep fix for blk-crypto (Eric)
 - Fix for a regression in the Amiga partition handling changes
   (Michael)
* tag 'block-6.5-2023-07-14' of git://git.kernel.dk/linux:
  block: queue data commands from the flush state machine at the head
  blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rq
  nvme-pci: fix DMA direction of unmapping integrity data
  nvme: don't reject probe due to duplicate IDs for single-ported PCIe devices
  block/mq-deadline: Fix a bug in deadline_from_pos()
  nvme: ensure disabling pairs with unquiesce
  nvme-fc: fix race between error recovery and creating association
  nvme-fc: return non-zero status code when fails to create association
  nvme: fix parameter check in nvme_fault_inject_init()
  nvme: warn only once for legacy uuid attribute
  block: remove dead struc request->completion_data field
  nvme: fix the NVME_ID_NS_NVM_STS_MASK definition
  nvmet: use PAGE_SECTORS_SHIFT
  nvme: add BOGUS_NID quirk for Samsung SM953
  blk-crypto: use dynamic lock class for blk_crypto_profile::lock
  block/partition: fix signedness issue for Amiga partitions
Linus Torvalds [Sat, 15 Jul 2023 02:46:54 +0000 (19:46 -0700)]
 
Merge tag 'io_uring-6.5-2023-07-14' of git://git.kernel.dk/linux
Pull io_uring fix from Jens Axboe:
 "Just a single tweak for the wait logic in io_uring"
* tag 'io_uring-6.5-2023-07-14' of git://git.kernel.dk/linux:
  io_uring: Use io_schedule* in cqring wait
Linus Torvalds [Fri, 14 Jul 2023 18:14:07 +0000 (11:14 -0700)]
 
Merge tag 'riscv-for-linus-6.5-rc2' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
 - fix a formatting error in the hwprobe documentation
 - fix a spurious warning in the RISC-V PMU driver
 - fix memory detection on rv32 (problem does not manifest on any known
   system)
 - avoid parsing legacy parsing of I in ACPI ISA strings
* tag 'riscv-for-linus-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  RISC-V: Don't include Zicsr or Zifencei in I from ACPI
  riscv: mm: fix truncation warning on RV32
  perf: RISC-V: Remove PERF_HES_STOPPED flag checking in riscv_pmu_start()
  Documentation: RISC-V: hwprobe: Fix a formatting error
Linus Torvalds [Fri, 14 Jul 2023 18:07:04 +0000 (11:07 -0700)]
 
Merge tag 'pm-6.5-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
 "These fix hibernation (after recent changes), frequency QoS and the
  sparc cpufreq driver.
  Specifics:
   - Unbreak the /sys/power/resume interface after recent changes (Azat
     Khuzhin).
   - Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai
     Yang).
   - Remove __init from cpufreq callbacks in the sparc driver, because
     they may be called after initialization too (Viresh Kumar)"
* tag 'pm-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: sparc: Don't mark cpufreq callbacks with __init
  PM: QoS: Restore support for default value on frequency QoS
  PM: hibernate: Fix writing maj:min to /sys/power/resume
Rafael J. Wysocki [Fri, 14 Jul 2023 17:13:21 +0000 (19:13 +0200)]
 
Merge branches 'pm-sleep' and 'pm-qos'
Merge a PM QoS fix and a hibernation fix for 6.5-rc2.
 - Unbreak the /sys/power/resume interface after recent changes (Azat
   Khuzhin).
 - Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai
   Yang).
* pm-sleep:
  PM: hibernate: Fix writing maj:min to /sys/power/resume
* pm-qos:
  PM: QoS: Restore support for default value on frequency QoS
Shyam Prasad N [Fri, 14 Jul 2023 08:56:33 +0000 (08:56 +0000)]
 
cifs: fix mid leak during reconnection after timeout threshold
When the number of responses with status of STATUS_IO_TIMEOUT
exceeds a specified threshold (NUM_STATUS_IO_TIMEOUT), we reconnect
the connection. But we do not return the mid, or the credits
returned for the mid, or reduce the number of in-flight requests.
This bug could result in the server->in_flight count to go bad,
and also cause a leak in the mids.
This change moves the check to a few lines below where the
response is decrypted, even of the response is read from the
transform header. This way, the code for returning the mids
can be reused.
Also, the cifs_reconnect was reconnecting just the transport
connection before. In case of multi-channel, this may not be
what we want to do after several timeouts. Changed that to
reconnect the session and the tree too.
Also renamed NUM_STATUS_IO_TIMEOUT to a more appropriate name
MAX_STATUS_IO_TIMEOUT.
Fixes: 8e670f77c4a5 ("Handle STATUS_IO_TIMEOUT gracefully")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Shyam Prasad N [Fri, 14 Jul 2023 08:56:34 +0000 (08:56 +0000)]
 
cifs: is_network_name_deleted should return a bool
Currently, is_network_name_deleted and it's implementations
do not return anything if the network name did get deleted.
So the function doesn't fully achieve what it advertizes.
Changed the function to return a bool instead. It will now
return true if the error returned is STATUS_NETWORK_NAME_DELETED
and the share (tree id) was found to be connected. It returns
false otherwise.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Linus Torvalds [Fri, 14 Jul 2023 16:10:28 +0000 (09:10 -0700)]
 
Merge tag 'drm-fixes-2023-07-14-1' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
 "There were a bunch of fixes lined up for 2 weeks, so we have quite a
  few scattered fixes, mostly amdgpu and i915, but ttm has a bunch and
  nouveau makes an appearance.
  So a bit busier than usual for rc2, but nothing seems out of the
  ordinary.
  fbdev:
   - dma: Fix documented default preferred_bpp value
  ttm:
   - fix warning that we shouldn't mix && and ||
   - never consider pinned BOs for eviction&swap
   - Don't leak a resource on eviction error
   - Don't leak a resource on swapout move error
   - fix bulk_move corruption when adding a entry
  client:
   - Send hotplug event after registering a client
  dma-buf:
   - keep the signaling time of merged fences v3
   - fix an error pointer vs NULL bug
  sched:
   - wait for all deps in kill jobs
   - call set fence parent from scheduled
  i915:
   - Don't preserve dpll_hw_state for slave crtc in Bigjoiner
   - Consider OA buffer boundary when zeroing out reports
   - Remove dead code from gen8_pte_encode
   - Fix one wrong caching mode enum usage
  amdgpu:
   - SMU i2c locking fix
   - Fix a possible deadlock in process restoration for ROCm apps
   - Disable PCIe lane/speed switching on Intel platforms (the platforms
     don't support it)
  nouveau:
   - disp: fix HDMI on gt215+
   - disp/g94: enable HDMI
   - acr: Abort loading ACR if no firmware was found
   - bring back blit subchannel for pre nv50 GPUs
   - Fix drm_dp_remove_payload() invocation
  ivpu:
   - Fix VPU register access in irq disable
   - Clear specific interrupt status bits on C0
  bridge:
   - dw_hdmi: fix connector access for scdc
   - ti-sn65dsi86: Fix auxiliary bus lifetime
  panel:
   - simple: Add connector_type for innolux_at043tn24
   - simple: Add Powertip PH800480T013 drm_display_mode flags"
* tag 'drm-fixes-2023-07-14-1' of git://anongit.freedesktop.org/drm/drm: (32 commits)
  drm/nouveau: bring back blit subchannel for pre nv50 GPUs
  drm/nouveau/acr: Abort loading ACR if no firmware was found
  drm/amd: Align SMU11 SMU_MSG_OverridePcieParameters implementation with SMU13
  drm/amd: Move helper for dynamic speed switch check out of smu13
  drm/amd/pm: conditionally disable pcie lane/speed switching for SMU13
  drm/amd/pm: share the code around SMU13 pcie parameters update
  drm/amdgpu: avoid restore process run into dead loop.
  drm/amd/pm: fix smu i2c data read risk
  drm/nouveau/disp/g94: enable HDMI
  drm/nouveau/disp: fix HDMI on gt215+
  drm/client: Send hotplug event after registering a client
  drm/i915: Fix one wrong caching mode enum usage
  drm/i915: Remove dead code from gen8_pte_encode
  drm/i915/perf: Consider OA buffer boundary when zeroing out reports
  drm/i915: Don't preserve dpll_hw_state for slave crtc in Bigjoiner
  drm/ttm: never consider pinned BOs for eviction&swap
  drm/fbdev-dma: Fix documented default preferred_bpp value
  dma-buf: fix an error pointer vs NULL bug
  accel/ivpu: Clear specific interrupt status bits on C0
  accel/ivpu: Fix VPU register access in irq disable
  ...
Linus Torvalds [Fri, 14 Jul 2023 16:05:15 +0000 (09:05 -0700)]
 
Merge tag 'ceph-for-6.5-rc2' of https://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
 "A fix to prevent a potential buffer overrun in the messenger, marked
  for stable"
* tag 'ceph-for-6.5-rc2' of https://github.com/ceph/ceph-client:
  libceph: harden msgr2.1 frame segment length checks
Christoph Hellwig [Fri, 14 Jul 2023 14:30:14 +0000 (16:30 +0200)]
 
block: queue data commands from the flush state machine at the head
We used to insert the data commands following a pre-flush to the head
of the queue until commit 
1e82fadfc6b ("blk-mq: do not do head insertions
post-pre-flush commands").  Not doing this seems to cause hangs of
such commands on NFS workloads when exported from file systems with
SATA SSDs.  I have no idea why this would starve these workloads,
but doing a semantic revert of this patch (which looks quite different
due to various other changes) fixes the hangs.
Fixes: 1e82fadfc6b ("blk-mq: do not do head insertions post-pre-flush commands")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20230714143014.11879-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dan Carpenter [Thu, 6 Apr 2023 08:55:31 +0000 (11:55 +0300)]
 
iommu/sva: Fix signedness bug in iommu_sva_alloc_pasid()
The ida_alloc_range() function returns negative error codes on error.
On success it returns values in the min to max range (inclusive).  It
never returns more then INT_MAX even if "max" is higher.  It never
returns values in the 0 to (min - 1) range.
The bug is that "min" is an unsigned int so negative error codes will
be promoted to high positive values errors treated as success.
Fixes: 1a14bf0fc7ed ("iommu/sva: Use GFP_KERNEL for pasid allocation")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/6b32095d-7491-4ebb-a850-12e96209eaaf@kili.mountain
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Jason Gunthorpe [Mon, 26 Jun 2023 15:13:11 +0000 (12:13 -0300)]
 
iommu: Fix crash during syfs iommu_groups/N/type
The err_restore_domain flow was accidently inserted into the success path
in commit 
1000dccd5d13 ("iommu: Allow IOMMU_RESV_DIRECT to work on
ARM"). It should only happen if iommu_create_device_direct_mappings()
fails. This caused the domains the be wrongly changed and freed whenever
the sysfs is used, resulting in an oops:
  BUG: kernel NULL pointer dereference, address: 
0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 1 PID: 3417 Comm: avocado Not tainted 6.4.0-rc4-next-
20230602 #3
  Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.3.6 07/06/2021
  RIP: 0010:__iommu_attach_device+0xc/0xa0
  Code: c0 c3 cc cc cc cc 48 89 f0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 55 48 8b 47 08 <48> 8b 00 48 85 c0 74 74 48 89 f5 e8 64 12 49 00 41 89 c4 85 c0 74
  RSP: 0018:
ffffabae0220bd48 EFLAGS: 
00010246
  RAX: 
0000000000000000 RBX: 
ffff9ac04f70e410 RCX: 
0000000000000001
  RDX: 
ffff9ac044db20c0 RSI: 
ffff9ac044fa50d0 RDI: 
ffff9ac04f70e410
  RBP: 
ffff9ac044fa50d0 R08: 
1000000100209001 R09: 
00000000000002dc
  R10: 
0000000000000000 R11: 
0000000000000000 R12: 
ffff9ac043d54700
  R13: 
ffff9ac043d54700 R14: 
0000000000000001 R15: 
0000000000000001
  FS:  
00007f02e30ae000(0000) GS:
ffff9afeb2440000(0000) knlGS:
0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
  CR2: 
0000000000000000 CR3: 
000000012afca006 CR4: 
0000000000770ee0
  PKRU: 
55555554
  Call Trace:
   <TASK>
   ? __die+0x24/0x70
   ? page_fault_oops+0x82/0x150
   ? __iommu_queue_command_sync+0x80/0xc0
   ? exc_page_fault+0x69/0x150
   ? asm_exc_page_fault+0x26/0x30
   ? __iommu_attach_device+0xc/0xa0
   ? __iommu_attach_device+0x1c/0xa0
   __iommu_device_set_domain+0x42/0x80
   __iommu_group_set_domain_internal+0x5d/0x160
   iommu_setup_default_domain+0x318/0x400
   iommu_group_store_type+0xb1/0x200
   kernfs_fop_write_iter+0x12f/0x1c0
   vfs_write+0x2a2/0x3b0
   ksys_write+0x63/0xe0
   do_syscall_64+0x3f/0x90
   entry_SYSCALL_64_after_hwframe+0x6e/0xd8
  RIP: 0033:0x7f02e2f14a6f
Reorganize the error flow so that the success branch and error branches
are clearer.
Fixes: 1000dccd5d13 ("iommu: Allow IOMMU_RESV_DIRECT to work on ARM")
Reported-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Tested-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/0-v1-5bd8cc969d9e+1f1-iommu_set_def_fix_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Masami Hiramatsu (Google) [Tue, 11 Jul 2023 14:16:07 +0000 (23:16 +0900)]
 
tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if fails
Fix to record 0-length data to data_loc in fetch_store_string*() if it fails
to get the string data.
Currently those expect that the data_loc is updated by store_trace_args() if
it returns the error code. However, that does not work correctly if the
argument is an array of strings. In that case, store_trace_args() only clears
the first entry of the array (which may have no error) and leaves other
entries. So it should be cleared by fetch_store_string*() itself.
Also, 'dyndata' and 'maxlen' in store_trace_args() should be updated
only if it is used (ret > 0 and argument is a dynamic data.)
Link: https://lore.kernel.org/all/168908496683.123124.4761206188794205601.stgit@devnote2/
Fixes: 40b53b771806 ("tracing: probeevent: Add array type support")
Cc: stable@vger.kernel.org
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Dave Airlie [Fri, 14 Jul 2023 01:44:54 +0000 (11:44 +1000)]
 
Merge tag 'amd-drm-fixes-6.5-2023-07-12' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.5-2023-07-12:
amdgpu:
- SMU i2c locking fix
- Fix a possible deadlock in process restoration for ROCm apps
- Disable PCIe lane/speed switching on Intel platforms (the platforms don't support it)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230712184009.7740-1-alexander.deucher@amd.com
Dave Airlie [Fri, 14 Jul 2023 01:13:14 +0000 (11:13 +1000)]
 
Merge tag 'drm-intel-fixes-2023-07-13' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
- Don't preserve dpll_hw_state for slave crtc in Bigjoiner (Stanislav Lisovskiy)
- Consider OA buffer boundary when zeroing out reports [perf] (Umesh Nerlige Ramappa)
- Remove dead code from gen8_pte_encode (Tvrtko Ursulin)
- Fix one wrong caching mode enum usage (Tvrtko Ursulin)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/ZK+nHLCltaxoxVw/@tursulin-desk
Dave Airlie [Fri, 14 Jul 2023 00:37:18 +0000 (10:37 +1000)]
 
Merge tag 'drm-misc-fixes-2023-07-13' of ssh://git.freedesktop.org/git/drm/drm-misc into drm-fixes
A couple of nouveau patches addressing improving HDMI support and
firmware handling, a fix for TTM to skip pinned BO when evicting, and a
fix for the fbdev documentation.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/nq3ke75juephbex5acfyi5t6bxv22nhmfcpfhru55haj2nv3us@gehrlmjbqgjk
Linus Torvalds [Thu, 13 Jul 2023 21:35:02 +0000 (14:35 -0700)]
 
Merge tag 'erofs-for-6.5-rc2-fixes' of git://git./linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
 "Three patches address regressions related to post-EOF unexpected
  behaviors and fsdax unavailability of chunk-based regular files.
  The other two patches mainly get rid of kmap_atomic() and simplify
  z_erofs_transform_plain().
   - Fix two unexpected loop cases when reading beyond EOF
   - Fix fsdax unavailability for chunk-based regular files
   - Get rid of the remaining kmap_atomic()
   - Minor cleanups"
* tag 'erofs-for-6.5-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix fsdax unavailability for chunk-based regular files
  erofs: avoid infinite loop in z_erofs_do_read_page() when reading beyond EOF
  erofs: avoid useless loops in z_erofs_pcluster_readmore() when reading beyond EOF
  erofs: simplify z_erofs_transform_plain()
  erofs: get rid of the remaining kmap_atomic()
Linus Torvalds [Thu, 13 Jul 2023 21:21:22 +0000 (14:21 -0700)]
 
Merge tag 'net-6.5-rc2' of git://git./linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from netfilter, wireless and ebpf.
  Current release - regressions:
   - netfilter: conntrack: gre: don't set assured flag for clash entries
   - wifi: iwlwifi: remove 'use_tfh' config to fix crash
  Previous releases - regressions:
   - ipv6: fix a potential refcount underflow for idev
   - icmp6: ifix null-ptr-deref of ip6_null_entry->rt6i_idev in
     icmp6_dev()
   - bpf: fix max stack depth check for async callbacks
   - eth: mlx5e:
      - check for NOT_READY flag state after locking
      - fix page_pool page fragment tracking for XDP
   - eth: igc:
      - fix tx hang issue when QBV gate is closed
      - fix corner cases for TSN offload
   - eth: octeontx2-af: Move validation of ptp pointer before its usage
   - eth: ena: fix shift-out-of-bounds in exponential backoff
  Previous releases - always broken:
   - core: prevent skb corruption on frag list segmentation
   - sched:
      - cls_fw: fix improper refcount update leads to use-after-free
      - sch_qfq: account for stab overhead in qfq_enqueue
   - netfilter:
      - report use refcount overflow
      - prevent OOB access in nft_byteorder_eval
   - wifi: mt7921e: fix init command fail with enabled device
   - eth: ocelot: fix oversize frame dropping for preemptible TCs
   - eth: fec: recycle pages for transmitted XDP frames"
* tag 'net-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
  selftests: tc-testing: add test for qfq with stab overhead
  net/sched: sch_qfq: account for stab overhead in qfq_enqueue
  selftests: tc-testing: add tests for qfq mtu sanity check
  net/sched: sch_qfq: reintroduce lmax bound check for MTU
  wifi: cfg80211: fix receiving mesh packets without RFC1042 header
  wifi: rtw89: debug: fix error code in rtw89_debug_priv_send_h2c_set()
  net: txgbe: fix eeprom calculation error
  net/sched: make psched_mtu() RTNL-less safe
  net: ena: fix shift-out-of-bounds in exponential backoff
  netdevsim: fix uninitialized data in nsim_dev_trap_fa_cookie_write()
  net/sched: flower: Ensure both minimum and maximum ports are specified
  MAINTAINERS: Add another mailing list for QUALCOMM ETHQOS ETHERNET DRIVER
  docs: netdev: update the URL of the status page
  wifi: iwlwifi: remove 'use_tfh' config to fix crash
  xdp: use trusted arguments in XDP hints kfuncs
  bpf: cpumap: Fix memory leak in cpu_map_update_elem
  wifi: airo: avoid uninitialized warning in airo_get_rate()
  octeontx2-pf: Add additional check for MCAM rules
  net: dsa: Removed unneeded of_node_put in felix_parse_ports_node
  net: fec: use netdev_err_once() instead of netdev_err()
  ...
Linus Torvalds [Thu, 13 Jul 2023 20:44:28 +0000 (13:44 -0700)]
 
Merge tag 'trace-v6.5-rc1-3' of git://git./linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
 - Fix some missing-prototype warnings
 - Fix user events struct args (did not include size of struct)
   When creating a user event, the "struct" keyword is to denote that
   the size of the field will be passed in. But the parsing failed to
   handle this case.
 - Add selftest to struct sizes for user events
 - Fix sample code for direct trampolines.
   The sample code for direct trampolines attached to handle_mm_fault().
   But the prototype changed and the direct trampoline sample code was
   not updated. Direct trampolines needs to have the arguments correct
   otherwise it can fail or crash the system.
 - Remove unused ftrace_regs_caller_ret() prototype.
 - Quiet false positive of FORTIFY_SOURCE
   Due to backward compatibility, the structure used to save stack
   traces in the kernel had a fixed size of 8. This structure is
   exported to user space via the tracing format file. A change was made
   to allow more than 8 functions to be recorded, and user space now
   uses the size field to know how many functions are actually in the
   stack.
   But the structure still has size of 8 (even though it points into the
   ring buffer that has the required amount allocated to hold a full
   stack.
   This was fine until the fortifier noticed that the
   memcpy(&entry->caller, stack, size) was greater than the 8 functions
   and would complain at runtime about it.
   Hide this by using a pointer to the stack location on the ring buffer
   instead of using the address of the entry structure caller field.
 - Fix a deadloop in reading trace_pipe that was caused by a mismatch
   between ring_buffer_empty() returning false which then asked to read
   the data, but the read code uses rb_num_of_entries() that returned
   zero, and causing a infinite "retry".
 - Fix a warning caused by not using all pages allocated to store ftrace
   functions, where this can happen if the linker inserts a bunch of
   "NULL" entries, causing the accounting of how many pages needed to be
   off.
 - Fix histogram synthetic event crashing when the start event is
   removed and the end event is still using a variable from it
 - Fix memory leak in freeing iter->temp in tracing_release_pipe()
* tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix memory leak of iter->temp when reading trace_pipe
  tracing/histograms: Add histograms to hist_vars if they have referenced variables
  tracing: Stop FORTIFY_SOURCE complaining about stack trace caller
  ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()
  ring-buffer: Fix deadloop issue on reading trace_pipe
  tracing: arm64: Avoid missing-prototype warnings
  selftests/user_events: Test struct size match cases
  tracing/user_events: Fix struct arg size match check
  x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret()
  arm64: ftrace: Add direct call trampoline samples support
  samples: ftrace: Save required argument registers in sample trampolines
Linus Torvalds [Thu, 13 Jul 2023 20:39:36 +0000 (13:39 -0700)]
 
Merge tag 'for-linus-6.5-rc2-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
 - a cleanup of the Xen related ELF-notes
 - a fix for virtio handling in Xen dom0 when running Xen in a VM
* tag 'for-linus-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/virtio: Fix NULL deref when a bridge of PCI root bus has no parent
  x86/Xen: tidy xen-head.S
Linus Torvalds [Thu, 13 Jul 2023 20:34:00 +0000 (13:34 -0700)]
 
Merge tag 'sh-for-v6.5-tag2' of git://git./linux/kernel/git/glaubitz/sh-linux
Pull sh fixes from John Paul Adrian Glaubitz:
 "The sh updates introduced multiple regressions.
  In particular, the change 
a8ac2961148e ("sh: Avoid using IRQ0 on SH3
  and SH4") causes several boards to hang during boot due to incorrect
  IRQ numbers.
  Geert Uytterhoeven has contributed patches that handle the virq offset
  in the IRQ code for the dreamcast, highlander and r2d boards while
  Artur Rojek has contributed a patch which handles the virq offset for
  the hd64461 companion chip"
* tag 'sh-for-v6.5-tag2' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux:
  sh: hd64461: Handle virq offset for offchip IRQ base and HD64461 IRQ
  sh: mach-dreamcast: Handle virq offset in cascaded IRQ demux
  sh: mach-highlander: Handle virq offset in cascaded IRL demux
  sh: mach-r2d: Handle virq offset in cascaded IRL demux
Jens Axboe [Thu, 13 Jul 2023 20:26:38 +0000 (14:26 -0600)]
 
Merge tag 'nvme-6.5-2023-07-13' of git://git.infradead.org/nvme into block-6.5
Pull NVMe fixes from Keith:
"nvme fixes for Linux 6.5
 - Don't require quirk to use duplicate namespace identifiers
   (Christoph, Sagi)
 - One more BOGUS_NID quirk (Pankaj)
 - IO timeout and error hanlding fixes for PCI (Keith)
 - Enhanced metadata format mask fix (Ankit)
 - Association race condition fix for fibre channel (Michael)
 - Correct debugfs error checks (Minjie)
 - Use PAGE_SECTORS_SHIFT where needed (Damien)
 - Reduce kernel logs for legacy nguid attribute (Keith)
 - Use correct dma direction when unmapping metadata (Ming)"
* tag 'nvme-6.5-2023-07-13' of git://git.infradead.org/nvme:
  nvme-pci: fix DMA direction of unmapping integrity data
  nvme: don't reject probe due to duplicate IDs for single-ported PCIe devices
  nvme: ensure disabling pairs with unquiesce
  nvme-fc: fix race between error recovery and creating association
  nvme-fc: return non-zero status code when fails to create association
  nvme: fix parameter check in nvme_fault_inject_init()
  nvme: warn only once for legacy uuid attribute
  nvme: fix the NVME_ID_NS_NVM_STS_MASK definition
  nvmet: use PAGE_SECTORS_SHIFT
  nvme: add BOGUS_NID quirk for Samsung SM953
Chengming Zhou [Mon, 10 Jul 2023 10:55:16 +0000 (18:55 +0800)]
 
blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rq
The iocost rely on rq start_time_ns and alloc_time_ns to tell saturation
state of the block device. Most of the time request is allocated after
rq_qos_throttle() and its alloc_time_ns or start_time_ns won't be affected.
But for plug batched allocation introduced by the commit 
47c122e35d7e
("block: pre-allocate requests if plug is started and is a batch"), we can
rq_qos_throttle() after the allocation of the request. This is what the
blk_mq_get_cached_request() does.
In this case, the cached request alloc_time_ns or start_time_ns is much
ahead if blocked in any qos ->throttle().
Fix it by setting alloc_time_ns and start_time_ns to now when the allocated
request is actually used.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230710105516.2053478-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Arnd Bergmann [Wed, 28 Jun 2023 09:49:18 +0000 (11:49 +0200)]
 
sparc: mark __arch_xchg() as __always_inline
An otherwise correct change to the atomic operations uncovered an
existing bug in the sparc __arch_xchg() function, which is calls
__xchg_called_with_bad_pointer() when its arguments are unknown at
compile time:
ERROR: modpost: "__xchg_called_with_bad_pointer" [lib/atomic64_test.ko] undefined!
This now happens because gcc determines that it's better to not inline the
function. Avoid this by just marking the function as __always_inline
to force the compiler to do the right thing here.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/all/c525adc9-6623-4660-8718-e0c9311563b8@roeck-us.net/
Fixes: d12157efc8e08 ("locking/atomic: make atomic*_{cmp,}xchg optional")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Andi Shyti <andi.shyti@linux.intel.com>
Link: https://lore.kernel.org/r/20230628094938.2318171-1-arnd@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Kees Cook [Wed, 12 Jul 2023 19:46:29 +0000 (12:46 -0700)]
 
MAINTAINERS: Foolishly claim maintainership of string routines
Since the string API is tightly coupled with FORTIFY_SOURCE, I am
offering myself up as maintainer for it. Thankfully Andy is already a
reviewer and can keep me on the straight and narrow.
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lore.kernel.org/r/20230712194625.never.252-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Masami Hiramatsu (Google) [Tue, 11 Jul 2023 14:15:57 +0000 (23:15 +0900)]
 
Revert "tracing: Add "(fault)" name injection to kernel probes"
This reverts commit 
2e9906f84fc7c99388bb7123ade167250d50f1c0.
It was turned out that commit 
2e9906f84fc7 ("tracing: Add "(fault)"
name injection to kernel probes") did not work correctly and probe
events still show just '(fault)' (instead of '"(fault)"'). Also,
current '(fault)' is more explicit that it faulted.
This also moves FAULT_STRING macro to trace.h so that synthetic
event can keep using it, and uses it in trace_probe.c too.
Link: https://lore.kernel.org/all/168908495772.123124.1250788051922100079.stgit@devnote2/
Link: https://lore.kernel.org/all/20230706230642.3793a593@rorschach.local.home/
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Masami Hiramatsu (Google) [Tue, 11 Jul 2023 14:15:48 +0000 (23:15 +0900)]
 
tracing/probes: Fix to update dynamic data counter if fetcharg uses it
Fix to update dynamic data counter ('dyndata') and max length ('maxlen')
only if the fetcharg uses the dynamic data. Also get out arg->dynamic
from unlikely(). This makes dynamic data address wrong if
process_fetch_insn() returns error on !arg->dynamic case.
Link: https://lore.kernel.org/all/168908494781.123124.8160245359962103684.stgit@devnote2/
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Link: https://lore.kernel.org/all/20230710233400.5aaf024e@gandalf.local.home/
Fixes: 9178412ddf5a ("tracing: probeevent: Return consumed bytes of dynamic area")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Masami Hiramatsu (Google) [Tue, 11 Jul 2023 14:15:38 +0000 (23:15 +0900)]
 
tracing/probes: Fix not to count error code to total length
Fix not to count the error code (which is minus value) to the total
used length of array, because it can mess up the return code of
process_fetch_insn_bottom(). Also clear the 'ret' value because it
will be used for calculating next data_loc entry.
Link: https://lore.kernel.org/all/168908493827.123124.2175257289106364229.stgit@devnote2/
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/8819b154-2ba1-43c3-98a2-cbde20892023@moroto.mountain/
Fixes: 9b960a38835f ("tracing: probeevent: Unify fetch_insn processing common part")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Masami Hiramatsu (Google) [Tue, 11 Jul 2023 14:15:29 +0000 (23:15 +0900)]
 
tracing/probes: Fix to avoid double count of the string length on the array
If an array is specified with the ustring or symstr, the length of the
strings are accumlated on both of 'ret' and 'total', which means the
length is double counted.
Just set the length to the 'ret' value for avoiding double counting.
Link: https://lore.kernel.org/all/168908492917.123124.15076463491122036025.stgit@devnote2/
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/8819b154-2ba1-43c3-98a2-cbde20892023@moroto.mountain/
Fixes: 88903c464321 ("tracing/probe: Add ustring type for user-space string")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Masami Hiramatsu (Google) [Fri, 7 Jul 2023 16:38:03 +0000 (01:38 +0900)]
 
fprobes: Add a comment why fprobe_kprobe_handler exits if kprobe is running
Add a comment the reason why fprobe_kprobe_handler() exits if any other
kprobe is running.
Link: https://lore.kernel.org/all/168874788299.159442.2485957441413653858.stgit@devnote2/
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Link: https://lore.kernel.org/all/20230706120916.3c6abf15@gandalf.local.home/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Ming Lei [Thu, 13 Jul 2023 09:26:20 +0000 (17:26 +0800)]
 
nvme-pci: fix DMA direction of unmapping integrity data
DMA direction should be taken in dma_unmap_page() for unmapping integrity
data.
Fix this DMA direction, and reported in Guangwu's test.
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Fixes: 4aedb705437f ("nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Thu, 13 Jul 2023 13:30:42 +0000 (15:30 +0200)]
 
nvme: don't reject probe due to duplicate IDs for single-ported PCIe devices
While duplicate IDs are still very harmful, including the potential to easily
see changing devices in /dev/disk/by-id, it turn out they are extremely
common for cheap end user NVMe devices.
Relax our check for them for so that it doesn't reject the probe on
single-ported PCIe devices, but prints a big warning instead.  In doubt
we'd still like to see quirk entries to disable the potential for
changing supposed stable device identifier links, but this will at least
allow users how have two (or more) of these devices to use them without
having to manually add a new PCI ID entry with the quirk through sysfs or
by patching the kernel.
Fixes: 2079f41ec6ff ("nvme: check that EUI/GUID/UUID are globally unique")
Cc: stable@vger.kernel.org # 6.0+
Co-developed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Zheng Yejian [Thu, 13 Jul 2023 14:14:35 +0000 (22:14 +0800)]
 
tracing: Fix memory leak of iter->temp when reading trace_pipe
kmemleak reports:
  unreferenced object 0xffff88814d14e200 (size 256):
    comm "cat", pid 336, jiffies 
4294871818 (age 779.490s)
    hex dump (first 32 bytes):
      04 00 01 03 00 00 00 00 08 00 00 00 00 00 00 00  ................
      0c d8 c8 9b ff ff ff ff 04 5a ca 9b ff ff ff ff  .........Z......
    backtrace:
      [<
ffffffff9bdff18f>] __kmalloc+0x4f/0x140
      [<
ffffffff9bc9238b>] trace_find_next_entry+0xbb/0x1d0
      [<
ffffffff9bc9caef>] trace_print_lat_context+0xaf/0x4e0
      [<
ffffffff9bc94490>] print_trace_line+0x3e0/0x950
      [<
ffffffff9bc95499>] tracing_read_pipe+0x2d9/0x5a0
      [<
ffffffff9bf03a43>] vfs_read+0x143/0x520
      [<
ffffffff9bf04c2d>] ksys_read+0xbd/0x160
      [<
ffffffff9d0f0edf>] do_syscall_64+0x3f/0x90
      [<
ffffffff9d2000aa>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
when reading file 'trace_pipe', 'iter->temp' is allocated or relocated
in trace_find_next_entry() but not freed before 'trace_pipe' is closed.
To fix it, free 'iter->temp' in tracing_release_pipe().
Link: https://lore.kernel.org/linux-trace-kernel/20230713141435.1133021-1-zhengyejian1@huawei.com
Cc: stable@vger.kernel.org
Fixes: ff895103a84ab ("tracing: Save off entry when peeking at next entry")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Ilya Dryomov [Mon, 10 Jul 2023 18:39:29 +0000 (20:39 +0200)]
 
libceph: harden msgr2.1 frame segment length checks
ceph_frame_desc::fd_lens is an int array.  decode_preamble() thus
effectively casts u32 -> int but the checks for segment lengths are
written as if on unsigned values.  While reading in HELLO or one of the
AUTH frames (before authentication is completed), arithmetic in
head_onwire_len() can get duped by negative ctrl_len and produce
head_len which is less than CEPH_PREAMBLE_LEN but still positive.
This would lead to a buffer overrun in prepare_read_control() as the
preamble gets copied to the newly allocated buffer of size head_len.
Cc: stable@vger.kernel.org
Fixes: cd1a677cad99 ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
Reported-by: Thelford Williams <thelford@google.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Paolo Abeni [Thu, 13 Jul 2023 09:12:01 +0000 (11:12 +0200)]
 
Merge branch 'net-sched-fixes-for-sch_qfq'
Pedro Tammela says:
====================
net/sched: fixes for sch_qfq
Patch 1 fixes a regression introduced in 6.4 where the MTU size could be
bigger than 'lmax'.
Patch 3 fixes an issue where the code doesn't account for qdisc_pkt_len()
returning a size bigger then 'lmax'.
Patches 2 and 4 are selftests for the issues above.
====================
Link: https://lore.kernel.org/r/20230711210103.597831-1-pctammela@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pedro Tammela [Tue, 11 Jul 2023 21:01:03 +0000 (18:01 -0300)]
 
selftests: tc-testing: add test for qfq with stab overhead
A packet with stab overhead greater than QFQ_MAX_LMAX should be dropped
by the QFQ qdisc as it can't handle such lengths.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pedro Tammela [Tue, 11 Jul 2023 21:01:02 +0000 (18:01 -0300)]
 
net/sched: sch_qfq: account for stab overhead in qfq_enqueue
Lion says:
-------
In the QFQ scheduler a similar issue to CVE-2023-31436
persists.
Consider the following code in net/sched/sch_qfq.c:
static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                struct sk_buff **to_free)
{
     unsigned int len = qdisc_pkt_len(skb), gso_segs;
    // ...
     if (unlikely(cl->agg->lmax < len)) {
         pr_debug("qfq: increasing maxpkt from %u to %u for class %u",
              cl->agg->lmax, len, cl->common.classid);
         err = qfq_change_agg(sch, cl, cl->agg->class_weight, len);
         if (err) {
             cl->qstats.drops++;
             return qdisc_drop(skb, sch, to_free);
         }
    // ...
     }
Similarly to CVE-2023-31436, "lmax" is increased without any bounds
checks according to the packet length "len". Usually this would not
impose a problem because packet sizes are naturally limited.
This is however not the actual packet length, rather the
"qdisc_pkt_len(skb)" which might apply size transformations according to
"struct qdisc_size_table" as created by "qdisc_get_stab()" in
net/sched/sch_api.c if the TCA_STAB option was set when modifying the qdisc.
A user may choose virtually any size using such a table.
As a result the same issue as in CVE-2023-31436 can occur, allowing heap
out-of-bounds read / writes in the kmalloc-8192 cache.
-------
We can create the issue with the following commands:
tc qdisc add dev $DEV root handle 1: stab mtu 2048 tsize 512 mpu 0 \
overhead 
999999999 linklayer ethernet qfq
tc class add dev $DEV parent 1: classid 1:1 htb rate 6mbit burst 15k
tc filter add dev $DEV parent 1: matchall classid 1:1
ping -I $DEV 1.1.1.2
This is caused by incorrectly assuming that qdisc_pkt_len() returns a
length within the QFQ_MIN_LMAX < len < QFQ_MAX_LMAX.
Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
Reported-by: Lion <nnamrec@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pedro Tammela [Tue, 11 Jul 2023 21:01:01 +0000 (18:01 -0300)]
 
selftests: tc-testing: add tests for qfq mtu sanity check
QFQ only supports a certain bound of MTU size so make sure
we check for this requirement in the tests.
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pedro Tammela [Tue, 11 Jul 2023 21:01:00 +0000 (18:01 -0300)]
 
net/sched: sch_qfq: reintroduce lmax bound check for MTU
25369891fcef deletes a check for the case where no 'lmax' is
specified which 
3037933448f6 previously fixed as 'lmax'
could be set to the device's MTU without any bound checking
for QFQ_LMAX_MIN and QFQ_LMAX_MAX. Therefore, reintroduce the check.
Fixes: 25369891fcef ("net/sched: sch_qfq: refactor parsing of netlink parameters")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Artur Rojek [Mon, 10 Jul 2023 23:31:32 +0000 (01:31 +0200)]
 
sh: hd64461: Handle virq offset for offchip IRQ base and HD64461 IRQ
A recent change to start counting SuperH IRQ #s from 16 breaks support
for the Hitachi HD64461 companion chip.
Move the offchip IRQ base and HD64461 IRQ # by 16 in order to
accommodate for the new virq numbering rules.
Fixes: a8ac2961148e ("sh: Avoid using IRQ0 on SH3 and SH4")
Signed-off-by: Artur Rojek <contact@artur-rojek.eu>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Link: https://lore.kernel.org/r/20230710233132.69734-1-contact@artur-rojek.eu
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Geert Uytterhoeven [Sun, 9 Jul 2023 13:10:43 +0000 (15:10 +0200)]
 
sh: mach-dreamcast: Handle virq offset in cascaded IRQ demux
Take into account the virq offset when translating cascaded interrupts.
Fixes: a8ac2961148e8c72 ("sh: Avoid using IRQ0 on SH3 and SH4")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Link: https://lore.kernel.org/r/7d0cb246c9f1cd24bb1f637ec5cb67e799a4c3b8.1688908227.git.geert+renesas@glider.be
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>