linux.git
22 months ago  Revert "nvme-fc: fix race between error recovery and creating association"
Keith Busch [Mon, 18 Dec 2023 16:19:39 +0000 (08:19 -0800)]
Revert "nvme-fc: fix race between error recovery and creating association"

The commit was found to potentially sleep in an invalid context and is
blocking regression testing.

This reverts commit ee6fdc5055e916b1dd497f11260d4901c4c1e55e.

Link: https://lore.kernel.org/linux-nvme/hkhl56n665uvc6t5d6h3wtx7utkcorw4xlwi7d2t2bnonavhe6@xaan6pu43ap6/
Link: https://lists.infradead.org/pipermail/linux-nvme/2023-December/043756.html
Reported-by: Daniel Wagner <dwagner@suse.de>
Reported-by: Maurizio Lombardi <mlombard@redhat.com>
Cc: Michael Liang <mliang@purestorage.com>
Tested-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
22 months ago  Merge tag 'md-fixes-20231207-1' of https://git.kernel.org/pub/scm/linux/kernel/git...
Jens Axboe [Thu, 7 Dec 2023 19:15:18 +0000 (12:15 -0700)]
Merge tag 'md-fixes-20231207-1' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.7

Pull MD fix from Song:

"This change from Yu Kuai fixes a bug reported in
 https://bugzilla.kernel.org/show_bug.cgi?id=218200"

* tag 'md-fixes-20231207-1' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md: split MD_RECOVERY_NEEDED out of mddev_resume

22 months ago  md: split MD_RECOVERY_NEEDED out of mddev_resume
Yu Kuai [Thu, 7 Dec 2023 02:07:24 +0000 (10:07 +0800)]
md: split MD_RECOVERY_NEEDED out of mddev_resume

New mddev_resume() calls were added to synchronize IO with array
reconfiguration. However, adding one in md_start_sync() introduces a
performance regression:

1) someone sets MD_RECOVERY_NEEDED first;
2) daemon thread grabs reconfig_mutex, then clears MD_RECOVERY_NEEDED and
   queues a new sync work;
3) daemon thread releases reconfig_mutex;
4) in md_start_sync
   a) check that there are spares that can be added/removed, then suspend
      the array;
   b) remove_and_add_spares may not be called, or may be called without
      actually adding or removing spares;
   c) resume the array, then set MD_RECOVERY_NEEDED again!

Steps 2 - 4 then repeat in a loop, so mddev_suspend() is called very often
and, as a consequence, normal IO becomes quite slow.

Fix this problem by not setting MD_RECOVERY_NEEDED again in md_start_sync(),
which breaks the loop.
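
The loop described above can be illustrated with a small user-space model. This is only a sketch: md_model, model_start_sync() and model_daemon() are hypothetical names, not kernel APIs, and the model ignores locking entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct md_model {
    bool recovery_needed;   /* stands in for MD_RECOVERY_NEEDED */
    unsigned int suspends;  /* how many times the array was suspended */
};

/* Models step 4: suspend, find nothing to add or remove, resume. */
static void model_start_sync(struct md_model *m, bool set_needed_on_resume)
{
    m->suspends++;                  /* 4a: suspend the array */
    /* 4b: no spare is actually added or removed */
    if (set_needed_on_resume)       /* 4c: old behaviour re-arms the flag */
        m->recovery_needed = true;
}

/* Models the daemon thread (steps 2-3), bounded by max_rounds. */
static unsigned int model_daemon(bool set_needed_on_resume,
                                 unsigned int max_rounds)
{
    struct md_model m = { .recovery_needed = true, .suspends = 0 }; /* step 1 */

    assert(max_rounds > 0);
    for (unsigned int i = 0; i < max_rounds && m.recovery_needed; i++) {
        m.recovery_needed = false;  /* step 2: clear flag, queue sync work */
        model_start_sync(&m, set_needed_on_resume);
    }
    return m.suspends;
}
```

With the flag re-armed on resume, the model suspends once per round until the bound is hit; without it, a single suspend is enough.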

Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
Suggested-by: Song Liu <song@kernel.org>
Reported-by: Janpieter Sollie <janpieter.sollie@edpnet.be>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218200
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231207020724.2797445-1-yukuai1@huaweicloud.com
22 months ago  Merge tag 'nvme-6.7-2023-12-7' of git://git.infradead.org/nvme into block-6.7
Jens Axboe [Thu, 7 Dec 2023 17:30:54 +0000 (10:30 -0700)]
Merge tag 'nvme-6.7-2023-12-7' of git://git.infradead.org/nvme into block-6.7

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.7

 - Proper nvme ctrl state setting (Keith)
 - Passthrough command optimization (Keith)
 - Spectre fix (Nitesh)
 - Kconfig clarifications (Shin'ichiro)
 - Frozen state deadlock fix (Bitao)
 - Power setting quirk (Georg)"

* tag 'nvme-6.7-2023-12-7' of git://git.infradead.org/nvme:
  nvme-pci: Add sleep quirk for Kingston drives
  nvme: fix deadlock between reset and scan
  nvme: prevent potential spectre v1 gadget
  nvme: improve NVME_HOST_AUTH and NVME_TARGET_AUTH config descriptions
  nvme-ioctl: move capable() admin check to the end
  nvme: ensure reset state check ordering
  nvme: introduce helper function to get ctrl state

22 months ago  nvme-pci: Add sleep quirk for Kingston drives
Georg Gottleuber [Wed, 20 Sep 2023 08:52:10 +0000 (10:52 +0200)]
nvme-pci: Add sleep quirk for Kingston drives

Some Kingston NV1 and A2000 drives waste a lot of power on specific TUXEDO
platforms in s2idle sleep if 'Simple Suspend' is used.

This patch applies a new quirk 'Force No Simple Suspend' to achieve a
low power sleep without 'Simple Suspend'.

Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
22 months ago  Merge tag 'md-fixes-20231206' of https://git.kernel.org/pub/scm/linux/kernel/git...
Jens Axboe [Wed, 6 Dec 2023 22:31:58 +0000 (15:31 -0700)]
Merge tag 'md-fixes-20231206' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.7

Pull MD fixes from Song:

"This set from Yu Kuai fixes issues around sync_work, which was introduced
 in 6.7 kernels."

* tag 'md-fixes-20231206' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md: fix stopping sync thread
  md: don't leave 'MD_RECOVERY_FROZEN' in error path of md_set_readonly()
  md: fix missing flush of sync_work

22 months ago  md: fix stopping sync thread
Yu Kuai [Tue, 5 Dec 2023 09:42:15 +0000 (17:42 +0800)]
md: fix stopping sync thread

Currently the sync thread is stopped from multiple contexts:
 - idle_sync_thread
 - frozen_sync_thread
 - __md_stop_writes
 - md_set_readonly
 - do_md_stop

And there are some problems:
1) sync_work is flushed while reconfig_mutex is grabbed, this can
   deadlock because the work function will grab reconfig_mutex as well.
2) md_reap_sync_thread() can't be called directly while md_do_sync() is
   not finished yet, for example, commit 130443d60b1b ("md: refactor
   idle/frozen_sync_thread() to fix deadlock").
3) If MD_RECOVERY_RUNNING is not set, there is no need to stop
   sync_thread at all because sync_thread must not be registered.

Factor out a helper stop_sync_thread(), so that the above contexts will behave
the same. Fix 1) by flushing sync_work after reconfig_mutex is released,
before waiting for sync_thread to be done; fix 2) by letting the daemon thread
unregister sync_thread; fix 3) by always checking MD_RECOVERY_RUNNING
first.

Fixes: db5e653d7c9f ("md: delay choosing sync action to md_start_sync()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231205094215.1824240-4-yukuai1@huaweicloud.com
22 months ago  md: don't leave 'MD_RECOVERY_FROZEN' in error path of md_set_readonly()
Yu Kuai [Tue, 5 Dec 2023 09:42:14 +0000 (17:42 +0800)]
md: don't leave 'MD_RECOVERY_FROZEN' in error path of md_set_readonly()

If md_set_readonly() fails, the array can still be read-write while
'MD_RECOVERY_FROZEN' is still set, which leaves the array in an
abnormal state in which sync or recovery can't continue.
Hence make sure the flag is cleared after md_set_readonly() returns.
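
As a rough illustration of the error-path rule, here is a minimal user-space sketch. The names array_model and model_set_readonly() are hypothetical, and the choice to clear the flag only when this call set it is an assumption of the sketch, not a statement about the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: 'frozen' stands in for MD_RECOVERY_FROZEN. */
struct array_model {
    bool frozen;
    bool read_only;
};

/* Returns 0 on success, -1 on failure (for example, the array is busy). */
static int model_set_readonly(struct array_model *a, bool busy)
{
    bool did_freeze;
    int err = 0;

    assert(a);
    did_freeze = !a->frozen;    /* remember whether this call set the flag */
    a->frozen = true;           /* stop sync/recovery while switching modes */

    if (busy)
        err = -1;               /* failure: the array stays read-write */
    else
        a->read_only = true;

    /* Error path: do not leave the flag behind on a read-write array. */
    if (err && did_freeze)
        a->frozen = false;
    return err;
}
```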

Fixes: 88724bfa68be ("md: wait for pending superblock updates before switching to read-only")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231205094215.1824240-3-yukuai1@huaweicloud.com
22 months ago  md: fix missing flush of sync_work
Yu Kuai [Tue, 5 Dec 2023 09:42:13 +0000 (17:42 +0800)]
md: fix missing flush of sync_work

Commit ac619781967b ("md: use separate work_struct for md_start_sync()")
uses a new sync_work to replace del_work; however, stop_sync_thread() and
__md_stop_writes() were trying to wait for sync_thread to be done, hence
they should switch to using sync_work as well.

Note that md_start_sync() from sync_work will grab 'reconfig_mutex',
hence other contexts can't hold the same lock while flushing the work; this
will be fixed in later patches.

Fixes: ac619781967b ("md: use separate work_struct for md_start_sync()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231205094215.1824240-2-yukuai1@huaweicloud.com
22 months ago  nvme: fix deadlock between reset and scan
Bitao Hu [Thu, 30 Nov 2023 02:13:37 +0000 (10:13 +0800)]
nvme: fix deadlock between reset and scan

If a controller reset occurs while a namespace is being allocated, both
nvme_reset_work and nvme_scan_work will hang, as shown below.

Test Scripts:

    for ((t=1;t<=128;t++))
    do
    nsid=`nvme create-ns /dev/nvme1 -s 14537724 -c 14537724 -f 0 -m 0 \
    -d 0 | awk -F: '{print($NF);}'`
    nvme attach-ns /dev/nvme1 -n $nsid -c 0
    done
    nvme reset /dev/nvme1

We will find that both nvme_reset_work and nvme_scan_work hung:

    INFO: task kworker/u249:4:17848 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
    message.
    task:kworker/u249:4  state:D stack:    0 pid:17848 ppid:     2
    flags:0x00000028
    Workqueue: nvme-reset-wq nvme_reset_work [nvme]
    Call trace:
    __switch_to+0xb4/0xfc
    __schedule+0x22c/0x670
    schedule+0x4c/0xd0
    blk_mq_freeze_queue_wait+0x84/0xc0
    nvme_wait_freeze+0x40/0x64 [nvme_core]
    nvme_reset_work+0x1c0/0x5cc [nvme]
    process_one_work+0x1d8/0x4b0
    worker_thread+0x230/0x440
    kthread+0x114/0x120
    INFO: task kworker/u249:3:22404 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
    message.
    task:kworker/u249:3  state:D stack:    0 pid:22404 ppid:     2
    flags:0x00000028
    Workqueue: nvme-wq nvme_scan_work [nvme_core]
    Call trace:
    __switch_to+0xb4/0xfc
    __schedule+0x22c/0x670
    schedule+0x4c/0xd0
    rwsem_down_write_slowpath+0x32c/0x98c
    down_write+0x70/0x80
    nvme_alloc_ns+0x1ac/0x38c [nvme_core]
    nvme_validate_or_alloc_ns+0xbc/0x150 [nvme_core]
    nvme_scan_ns_list+0xe8/0x2e4 [nvme_core]
    nvme_scan_work+0x60/0x500 [nvme_core]
    process_one_work+0x1d8/0x4b0
    worker_thread+0x260/0x440
    kthread+0x114/0x120
    INFO: task nvme:28428 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
    message.
    task:nvme            state:D stack:    0 pid:28428 ppid: 27119
    flags:0x00000000
    Call trace:
    __switch_to+0xb4/0xfc
    __schedule+0x22c/0x670
    schedule+0x4c/0xd0
    schedule_timeout+0x160/0x194
    do_wait_for_common+0xac/0x1d0
    __wait_for_common+0x78/0x100
    wait_for_completion+0x24/0x30
    __flush_work.isra.0+0x74/0x90
    flush_work+0x14/0x20
    nvme_reset_ctrl_sync+0x50/0x74 [nvme_core]
    nvme_dev_ioctl+0x1b0/0x250 [nvme_core]
    __arm64_sys_ioctl+0xa8/0xf0
    el0_svc_common+0x88/0x234
    do_el0_svc+0x7c/0x90
    el0_svc+0x1c/0x30
    el0_sync_handler+0xa8/0xb0
    el0_sync+0x148/0x180

The reason for the hang is that nvme_reset_work runs while nvme_scan_work
is still running. nvme_scan_work may add a new ns to the ctrl->namespaces
list after nvme_reset_work has frozen all ns->q in the ctrl->namespaces list.
The newly added ns is not frozen, so nvme_wait_freeze will wait forever.
Unfortunately, ctrl->namespaces_rwsem is held by nvme_reset_work, so
nvme_scan_work will also wait forever. Now we are deadlocked!

PROCESS1                         PROCESS2
==============                   ==============
nvme_scan_work
  ...                            nvme_reset_work
  nvme_validate_or_alloc_ns        nvme_dev_disable
    nvme_alloc_ns                    nvme_start_freeze
     down_write                      ...
     nvme_ns_add_to_ctrl_list        ...
     up_write                      nvme_wait_freeze
    ...                              down_read
    nvme_alloc_ns                    blk_mq_freeze_queue_wait
     down_write

Fix by marking the ctrl with a flag, say NVME_CTRL_FROZEN, that is set in
nvme_start_freeze and cleared in nvme_unfreeze. The scan can then check
it before adding the new namespace (under the namespaces_rwsem).
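
The flag-based approach can be sketched as a single-threaded user-space model. All names (ctrl_model, model_start_freeze() and so on) are hypothetical; the rwsem is only mentioned in comments because the sketch does not use threads.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NS 8

/* Hypothetical controller model; not the kernel's struct nvme_ctrl. */
struct ctrl_model {
    bool frozen;                    /* stands in for NVME_CTRL_FROZEN */
    int nr_namespaces;
    bool ns_frozen[MODEL_MAX_NS];
};

/* Reset path: mark the controller, then freeze every known queue. */
static void model_start_freeze(struct ctrl_model *c)
{
    assert(c);
    c->frozen = true;
    for (int i = 0; i < c->nr_namespaces; i++)
        c->ns_frozen[i] = true;
}

static void model_unfreeze(struct ctrl_model *c)
{
    assert(c);
    for (int i = 0; i < c->nr_namespaces; i++)
        c->ns_frozen[i] = false;
    c->frozen = false;
}

/* Scan path: in the kernel this would run under namespaces_rwsem. */
static int model_add_ns(struct ctrl_model *c)
{
    assert(c);
    if (c->frozen || c->nr_namespaces == MODEL_MAX_NS)
        return -1;                  /* refuse while a freeze is in progress */
    c->ns_frozen[c->nr_namespaces++] = false;
    return 0;
}

/* True when no unfrozen queue could keep a freeze wait from finishing. */
static bool model_all_frozen(const struct ctrl_model *c)
{
    for (int i = 0; i < c->nr_namespaces; i++)
        if (!c->ns_frozen[i])
            return false;
    return true;
}
```

Because the scan refuses to add a namespace while the flag is set, every queue on the list stays frozen until the unfreeze, which is the property the freeze wait depends on.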

Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
22 months ago  nvme: prevent potential spectre v1 gadget
Nitesh Shetty [Tue, 28 Nov 2023 12:29:57 +0000 (17:59 +0530)]
nvme: prevent potential spectre v1 gadget

This patch fixes the smatch warning, "nvmet_ns_ana_grpid_store() warn:
potential spectre issue 'nvmet_ana_group_enabled' [w] (local cap)".
Prevent the contents of kernel memory from being leaked to user space
via speculative execution by using array_index_nospec.
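
To show the idea behind index clamping, here is a user-space approximation of a branch-free index mask. The helper names are hypothetical, and the sketch assumes an arithmetic right shift of negative values (as GCC and Clang provide) and a size no larger than LONG_MAX. It is not a hardened implementation and gives no speculation guarantee outside the kernel.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_GROUPS 8

/* All-ones when index < size, zero otherwise, computed without a branch. */
static unsigned long model_index_mask(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
                           (sizeof(long) * CHAR_BIT - 1));
}

/* Clamp an already bounds-checked index so a mispredicted check reads slot 0. */
static unsigned long model_index_nospec(unsigned long index, unsigned long size)
{
    assert(size > 0 && size <= LONG_MAX);
    return index & model_index_mask(index, size);
}

static int model_group_enabled[MODEL_GROUPS];

/* Hypothetical store handler: bounds check first, then clamp, then index. */
static int model_grpid_store(unsigned long grpid)
{
    if (grpid >= MODEL_GROUPS)
        return -1;
    grpid = model_index_nospec(grpid, MODEL_GROUPS);
    model_group_enabled[grpid]++;
    return 0;
}
```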

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
22 months ago  nvme: improve NVME_HOST_AUTH and NVME_TARGET_AUTH config descriptions
Shin'ichiro Kawasaki [Wed, 29 Nov 2023 04:49:51 +0000 (13:49 +0900)]
nvme: improve NVME_HOST_AUTH and NVME_TARGET_AUTH config descriptions

Currently the two similar config options NVME_HOST_AUTH and NVME_TARGET_AUTH
have almost the same descriptions, which makes it confusing to choose
between them in menuconfig. Improve the descriptions to distinguish them.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
22 months ago  nvme-ioctl: move capable() admin check to the end
Keith Busch [Tue, 2 May 2023 18:43:41 +0000 (11:43 -0700)]
nvme-ioctl: move capable() admin check to the end

This can be an expensive call on some kernel configs. Move it to the end
after checking the cheaper ways to determine if the command is allowed.
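
The ordering idea can be shown with a small sketch. The permission rules below are invented for illustration (cmd_model, its fields and model_cmd_allowed() are hypothetical and do not mirror the driver's real policy); only the "cheap checks first, expensive check last" structure is the point.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned int model_expensive_calls;  /* counts privilege lookups */

/* Stand-in for a costly privilege check such as capable(). */
static bool model_capable(bool is_admin)
{
    model_expensive_calls++;
    return is_admin;
}

struct cmd_model {
    bool read_only_cmd;     /* hypothetical: command only reads data */
    bool opened_for_write;  /* hypothetical: file was opened writable */
};

static bool model_cmd_allowed(const struct cmd_model *c, bool is_admin)
{
    assert(c);
    /* Cheap checks first: they settle most requests on their own. */
    if (c->read_only_cmd || c->opened_for_write)
        return true;
    /* Fall back to the expensive check only when nothing cheaper allowed it. */
    return model_capable(is_admin);
}
```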

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
22 months ago  nvme: ensure reset state check ordering
Keith Busch [Fri, 27 Oct 2023 17:58:12 +0000 (10:58 -0700)]
nvme: ensure reset state check ordering

A different CPU may be setting the ctrl->state value, so ensure proper
barriers to prevent optimizing to a stale state. Normally it isn't a
problem to observe the wrong state as it is merely advisory to take a
quicker path during initialization and error recovery, but seeing an old
state can report unexpected ENETRESET errors when a reset request was in
fact successful.

Reported-by: Minh Hoang <mh2022@meta.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
22 months ago  nvme: introduce helper function to get ctrl state
Keith Busch [Mon, 30 Oct 2023 15:13:09 +0000 (08:13 -0700)]
nvme: introduce helper function to get ctrl state

The controller state is typically written by another CPU, so reading it
should ensure no optimizations are taken. This is a repeated pattern in
the driver, so start with adding a convenience function that returns the
controller state with READ_ONCE().
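
A user-space approximation of such an accessor might look like the following. MODEL_READ_ONCE, model_ctrl and model_ctrl_state() are hypothetical stand-ins, and `__typeof__` is a GCC/Clang extension; the kernel's READ_ONCE() is more involved than this volatile cast.

```c
#include <assert.h>

/* Approximation of READ_ONCE(): force one load through a volatile lvalue. */
#define MODEL_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

enum model_ctrl_state { MODEL_NEW, MODEL_LIVE, MODEL_RESETTING };

struct model_ctrl {
    enum model_ctrl_state state;    /* may be written by another CPU */
};

/* Single accessor, so no reader works from a value the compiler cached. */
static inline enum model_ctrl_state model_ctrl_state(struct model_ctrl *ctrl)
{
    assert(ctrl);
    return MODEL_READ_ONCE(ctrl->state);
}
```

Routing every read through one helper keeps the pattern in a single place instead of repeating the macro at each call site.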

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
22 months ago  Merge tag 'md-fixes-20231201-1' of https://git.kernel.org/pub/scm/linux/kernel/git...
Jens Axboe [Sat, 2 Dec 2023 01:37:24 +0000 (18:37 -0700)]
Merge tag 'md-fixes-20231201-1' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.7

Pull MD fix from Song:

"This change fixes issue with raid456 reshape."

* tag 'md-fixes-20231201-1' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md/raid6: use valid sector values to determine if an I/O should wait on the reshape

22 months ago  md/raid6: use valid sector values to determine if an I/O should wait on the reshape
David Jeffery [Tue, 28 Nov 2023 18:11:39 +0000 (13:11 -0500)]
md/raid6: use valid sector values to determine if an I/O should wait on the reshape

During a reshape of a RAID6 array, such as expanding by adding an additional
disk, I/Os to the region of the array which has not yet been reshaped can
stall indefinitely. This is caused by errors in the stripe_ahead_of_reshape
function that make md think the I/O is to a region actively
undergoing the reshape.

stripe_ahead_of_reshape fails to account for the q disk having a sector
value of 0. By not excluding the q disk from the for loop, raid6 will always
generate a min_sector value of 0, causing a return value which stalls.

The function's max_sector calculation also uses min() when it should use
max(), causing the max_sector value to always be 0. During a backwards
rebuild this can cause the opposite problem where it allows I/O to advance
when it should wait.

Fixing these errors will allow safe I/O to advance in a timely manner and
delay only I/O which is unsafe due to stripes in the middle of undergoing
the reshape.
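
The two corrections (skip devices without a valid sector value, and use max() for the upper bound) can be sketched in user space. stripe_model and model_stripe_range() are hypothetical. The commit text only describes the Q disk as having a sector value of 0, so skipping the P disk as well is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_DISKS 8

/* Hypothetical, simplified stripe: one sector value per device. */
struct stripe_model {
    int disks;
    int pd_idx;                     /* P parity device */
    int qd_idx;                     /* Q parity device, -1 when absent */
    uint64_t sector[MODEL_MAX_DISKS];
};

static void model_stripe_range(const struct stripe_model *sh,
                               uint64_t *min_sector, uint64_t *max_sector)
{
    uint64_t lo = UINT64_MAX, hi = 0;

    assert(sh && sh->disks > 0 && sh->disks <= MODEL_MAX_DISKS);
    for (int i = 0; i < sh->disks; i++) {
        /* Q has a sector value of 0; P is skipped here by assumption. */
        if (i == sh->pd_idx || i == sh->qd_idx)
            continue;
        if (sh->sector[i] < lo)
            lo = sh->sector[i];     /* min() for the lower bound */
        if (sh->sector[i] > hi)
            hi = sh->sector[i];     /* max(), not min(), for the upper bound */
    }
    *min_sector = lo;
    *max_sector = hi;
}
```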

Fixes: 486f60558607 ("md/raid5: Check all disks in a stripe_head for reshape progress")
Cc: stable@vger.kernel.org # v6.0+
Signed-off-by: David Jeffery <djeffery@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231128181233.6187-1-djeffery@redhat.com
22 months ago  Merge tag 'nvme-6.7-2023-12-01' of git://git.infradead.org/nvme into block-6.7
Jens Axboe [Fri, 1 Dec 2023 16:09:16 +0000 (09:09 -0700)]
Merge tag 'nvme-6.7-2023-12-01' of git://git.infradead.org/nvme into block-6.7

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.7

 - Invalid namespace identification error handling (Maurizio, Ewan, Keith)
 - Fabrics keep-alive tuning (Mark)"

* tag 'nvme-6.7-2023-12-01' of git://git.infradead.org/nvme:
  nvme-core: check for too small lba shift
  nvme: check for valid nvme_identify_ns() before using it
  nvme-core: fix a memory leak in nvme_ns_info_from_identify()
  nvme: fine-tune sending of first keep-alive

22 months ago  nvme-core: check for too small lba shift
Keith Busch [Tue, 28 Nov 2023 17:36:04 +0000 (09:36 -0800)]
nvme-core: check for too small lba shift

The block layer doesn't support logical block sizes smaller than 512
bytes. The nvme spec doesn't support that small either, but the driver
isn't checking to make sure the device responded with usable data.
Failing to catch this will result in a kernel bug, either from a
division by zero when stacking, or a zero length bio.
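
A minimal sketch of such a validity check, in user space: model_check_lba_shift() and MODEL_SECTOR_SHIFT are hypothetical names, and the upper bound of 32 exists only to keep the shift well defined in this example.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9    /* 2^9 = 512 bytes */

/* Returns 0 and stores the block size in bytes, or -1 if unusable. */
static int model_check_lba_shift(uint8_t lba_shift, uint32_t *block_size)
{
    assert(block_size);
    /* Smaller than 512 bytes is unsupported; a zero shift comes from an
     * all-zero identify buffer and would later divide by zero. */
    if (lba_shift < MODEL_SECTOR_SHIFT || lba_shift >= 32)
        return -1;
    *block_size = UINT32_C(1) << lba_shift;
    return 0;
}
```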

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
22 months ago  blk-mq: don't count completed flush data request as inflight in case of quiesce
Ming Lei [Fri, 1 Dec 2023 08:56:05 +0000 (16:56 +0800)]
blk-mq: don't count completed flush data request as inflight in case of quiesce

Request queue quiesce may interrupt the flush sequence: the original request
may already have been marked as COMPLETE, but it can't be finished because
of the queue quiesce.

This is fine from the driver's viewpoint, because the flush sequence is a
block layer concept and isn't related to the driver.

However, a driver (such as dm-rq) can call blk_mq_queue_inflight() to count &
drain inflight requests; the wait & drain then never finishes because
the completed but not yet finished flush request is counted as inflight.

Fix this issue by not counting completed flush data request as inflight in
case of quiesce.

Cc: Mike Snitzer <snitzer@kernel.org>
Cc: David Jeffery <djeffery@redhat.com>
Cc: John Pittman <jpittman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231201085605.577730-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
22 months ago  block: Document the role of the two attribute groups
Bart Van Assche [Tue, 28 Nov 2023 19:40:19 +0000 (11:40 -0800)]
block: Document the role of the two attribute groups

It is nontrivial to derive the role of the two attribute groups in source
file block/blk-sysfs.c. Hence add a comment that explains their roles. See
also commit 6d85ebf95c44 ("blk-sysfs: add a new attr_group for blk_mq").

Cc: Christoph Hellwig <hch@lst.de>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20231128194019.72762-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
22 months ago  block: warn once for each partition in bio_check_ro()
Yu Kuai [Tue, 28 Nov 2023 12:30:27 +0000 (20:30 +0800)]
block: warn once for each partition in bio_check_ro()

Commit 1b0a151c10a6 ("blk-core: use pr_warn_ratelimited() in
bio_check_ro()") fixed the message storm by limiting the rate; however, there
will still be lots of messages in the long term. Fix it better by warning
once for each partition.
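
The "warn once per partition" pattern can be sketched with a per-object flag. part_model and model_check_ro() are hypothetical names; the kernel's actual field and message are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical partition model with a per-partition "already warned" flag. */
struct part_model {
    int partno;
    bool read_only;
    bool ro_warned;
};

/* Returns true when a warning was emitted for this write attempt. */
static bool model_check_ro(struct part_model *p)
{
    assert(p);
    if (!p->read_only || p->ro_warned)
        return false;
    p->ro_warned = true;    /* later writes to this partition stay silent */
    fprintf(stderr, "write to read-only partition %d\n", p->partno);
    return true;
}
```

Unlike rate limiting, the number of messages is bounded by the number of partitions rather than by elapsed time.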

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231128123027.971610-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
22 months ago  block: move .bd_inode into 1st cacheline of block_device
Ming Lei [Tue, 28 Nov 2023 12:30:26 +0000 (20:30 +0800)]
block: move .bd_inode into 1st cacheline of block_device

The .bd_inode field of block_device is used in the IO fast path of
blkdev_write_iter() and blkdev_llseek(), so it is more efficient to keep
it in the 1st cacheline.

.bd_openers is only touched in open()/close(), and .bd_size_lock is only
for updating bdev capacity, which is in slow path too.

So swap .bd_inode layout with .bd_openers & .bd_size_lock to move
.bd_inode into the 1st cache line.
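
A layout constraint like this can be checked at build time. The struct below is a heavily trimmed, hypothetical stand-in (its fields and their order are invented for the example), and the 64-byte cache line size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_CACHELINE 64  /* assumed cache line size in bytes */

/* Hypothetical stand-in for a device structure; not struct block_device. */
struct bdev_model {
    unsigned long start_sect;
    unsigned long nr_sectors;
    void *disk;
    void *queue;
    void *stats;
    unsigned long stamp;
    int read_only;
    int dev;
    void *inode;            /* hot: used in the I/O fast path */
    int openers;            /* cold: only touched in open()/close() */
    int size_lock;          /* cold: only for capacity updates */
};

/* Fail the build if the hot field drifts out of the first cache line. */
static_assert(offsetof(struct bdev_model, inode) + sizeof(void *)
              <= MODEL_CACHELINE,
              "inode must stay in the first cache line");

static size_t model_inode_offset(void)
{
    return offsetof(struct bdev_model, inode);
}
```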

Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231128123027.971610-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
22 months ago  nvme: check for valid nvme_identify_ns() before using it
Ewan D. Milne [Mon, 27 Nov 2023 20:56:57 +0000 (15:56 -0500)]
nvme: check for valid nvme_identify_ns() before using it

When scanning namespaces, it is possible to get valid data from the first
call to nvme_identify_ns() in nvme_alloc_ns(), but not from the second
call in nvme_update_ns_info_block().  In particular, if the NSID becomes
inactive between the two commands, a storage device may return a buffer
filled with zero as per 4.1.5.1.  In this case, we can get a kernel crash
due to a divide-by-zero in blk_stack_limits() because ns->lba_shift will
be set to zero.

PID: 326      TASK: ffff95fec3cd8000  CPU: 29   COMMAND: "kworker/u98:10"
 #0 [ffffad8f8702f9e0] machine_kexec at ffffffff91c76ec7
 #1 [ffffad8f8702fa38] __crash_kexec at ffffffff91dea4fa
 #2 [ffffad8f8702faf8] crash_kexec at ffffffff91deb788
 #3 [ffffad8f8702fb00] oops_end at ffffffff91c2e4bb
 #4 [ffffad8f8702fb20] do_trap at ffffffff91c2a4ce
 #5 [ffffad8f8702fb70] do_error_trap at ffffffff91c2a595
 #6 [ffffad8f8702fbb0] exc_divide_error at ffffffff928506e6
 #7 [ffffad8f8702fbd0] asm_exc_divide_error at ffffffff92a00926
    [exception RIP: blk_stack_limits+434]
    RIP: ffffffff92191872  RSP: ffffad8f8702fc80  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff95efa0c91800  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: 0000000000000001  RDI: 0000000000000001
    RBP: 00000000ffffffff   R8: ffff95fec7df35a8   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: ffff95fed33c09a8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffffad8f8702fce0] nvme_update_ns_info_block at ffffffffc06d3533 [nvme_core]
 #9 [ffffad8f8702fd18] nvme_scan_ns at ffffffffc06d6fa7 [nvme_core]

This happened when the check for valid data was moved out of nvme_identify_ns()
into one of the callers.  Fix this by checking in both callers.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=218186
Fixes: 0dd6fff2aad4 ("nvme: bring back auto-removal of deleted namespaces during sequential scan")
Cc: stable@vger.kernel.org
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
22 months ago  nvme-core: fix a memory leak in nvme_ns_info_from_identify()
Maurizio Lombardi [Thu, 23 Nov 2023 14:07:41 +0000 (15:07 +0100)]
nvme-core: fix a memory leak in nvme_ns_info_from_identify()

In case of error, free the nvme_id_ns structure that was allocated
by nvme_identify_ns().

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
22 months ago  nvme: fine-tune sending of first keep-alive
Mark O'Donovan [Fri, 24 Nov 2023 20:56:59 +0000 (20:56 +0000)]
nvme: fine-tune sending of first keep-alive

Keep-alive commands are sent half-way through the kato period.
This normally works well but fails when the keep-alive system is
started when we are more than half way through the kato.
This can happen on larger setups or due to host delays.
With this change we now time the initial keep-alive command from
the controller initialisation time, rather than the keep-alive
mechanism activation time.
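
The timing change can be expressed as a small calculation. This is a sketch with hypothetical names; all times share one arbitrary tick unit and counter wrap-around is ignored.

```c
#include <assert.h>

/* Delay before the first keep-alive, measured from controller init. */
static unsigned long model_first_keep_alive_delay(unsigned long kato,
                                                  unsigned long init_time,
                                                  unsigned long now)
{
    unsigned long period = kato / 2;    /* keep-alives go out at kato/2 */
    unsigned long elapsed;

    assert(now >= init_time);
    elapsed = now - init_time;          /* time already used since init */
    if (elapsed >= period)
        return 0;                       /* already late: send right away */
    return period - elapsed;
}
```

For example, with a kato of 10 ticks and 3 ticks already spent since initialisation, the first keep-alive is due in 2 ticks rather than 5.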

Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
23 months ago  bcache: revert replacing IS_ERR_OR_NULL with IS_ERR
Markus Weippert [Fri, 24 Nov 2023 15:14:37 +0000 (16:14 +0100)]
bcache: revert replacing IS_ERR_OR_NULL with IS_ERR

Commit 028ddcac477b ("bcache: Remove unnecessary NULL point check in
node allocations") replaced IS_ERR_OR_NULL by IS_ERR. This leads to a
NULL pointer dereference.

BUG: kernel NULL pointer dereference, address: 0000000000000080
Call Trace:
 ? __die_body.cold+0x1a/0x1f
 ? page_fault_oops+0xd2/0x2b0
 ? exc_page_fault+0x70/0x170
 ? asm_exc_page_fault+0x22/0x30
 ? btree_node_free+0xf/0x160 [bcache]
 ? up_write+0x32/0x60
 btree_gc_coalesce+0x2aa/0x890 [bcache]
 ? bch_extent_bad+0x70/0x170 [bcache]
 btree_gc_recurse+0x130/0x390 [bcache]
 ? btree_gc_mark_node+0x72/0x230 [bcache]
 bch_btree_gc+0x5da/0x600 [bcache]
 ? cpuusage_read+0x10/0x10
 ? bch_btree_gc+0x600/0x600 [bcache]
 bch_gc_thread+0x135/0x180 [bcache]

The relevant code starts with:

    new_nodes[0] = NULL;

    for (i = 0; i < nodes; i++) {
        if (__bch_keylist_realloc(&keylist, bkey_u64s(&r[i].b->key)))
            goto out_nocoalesce;
    // ...
out_nocoalesce:
    // ...
    for (i = 0; i < nodes; i++)
        if (!IS_ERR(new_nodes[i])) {  // IS_ERR_OR_NULL before 028ddcac477b
            btree_node_free(new_nodes[i]);  // new_nodes[0] is NULL
            rw_unlock(true, new_nodes[i]);
        }

This patch replaces IS_ERR() by IS_ERR_OR_NULL() to fix this.
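
The difference between the two checks can be demonstrated in user space. The helpers below are approximations with hypothetical names, not the kernel macros; the model only counts entries that a cleanup loop would go on to dereference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Error pointers occupy the top MODEL_MAX_ERRNO addresses. */
static bool model_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

static bool model_is_err_or_null(const void *ptr)
{
    return ptr == NULL || model_is_err(ptr);
}

/* Counts the entries a cleanup loop would try to free. */
static int model_count_freed(void *nodes[], int n, bool null_safe)
{
    int freed = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        bool skip = null_safe ? model_is_err_or_null(nodes[i])
                              : model_is_err(nodes[i]);
        if (!skip)
            freed++;    /* real code would dereference nodes[i] here */
    }
    return freed;
}
```

A NULL entry is not an error pointer, so the plain error check lets it through to the free path.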

Fixes: 028ddcac477b ("bcache: Remove unnecessary NULL point check in node allocations")
Link: https://lore.kernel.org/all/3DF4A87A-2AC1-4893-AE5F-E921478419A9@suse.de/
Cc: stable@vger.kernel.org
Cc: Zheng Wang <zyytlz.wz@163.com>
Cc: Coly Li <colyli@suse.de>
Signed-off-by: Markus Weippert <markus@gekmihesg.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months ago  nvme: tcp: fix compile-time checks for TLS mode
Arnd Bergmann [Wed, 22 Nov 2023 22:47:19 +0000 (23:47 +0100)]
nvme: tcp: fix compile-time checks for TLS mode

When CONFIG_NVME_KEYRING is enabled as a loadable module, but the TCP
host code is built-in, it fails to link:

arm-linux-gnueabi-ld: drivers/nvme/host/tcp.o: in function `nvme_tcp_setup_ctrl':
tcp.c:(.text+0x1940): undefined reference to `nvme_tls_psk_default'

The problem is that the compile-time conditionals are inconsistent here,
using a mix of #ifdef CONFIG_NVME_TCP_TLS, IS_ENABLED(CONFIG_NVME_TCP_TLS)
and IS_ENABLED(CONFIG_NVME_KEYRING) checks, with CONFIG_NVME_KEYRING
controlling whether the implementation is actually built.

Change it to use IS_ENABLED(CONFIG_NVME_KEYRING) checks consistently,
which should help readability and make it less error-prone. Combining
it with the check for the ctrl->opts->tls flag lets the compiler drop
all the TLS code in configurations without this feature, which also
helps runtime behavior in addition to avoiding the link failure.

To make it possible for the compiler to build the dead code, both
the tls_handshake_timeout variable and the TLS specific members
of nvme_tcp_queue need to be moved out of the #ifdef block as well,
but at least the former of these gets optimized out again.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20231122224719.4042108-4-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months ago  nvme: target: fix Kconfig select statements
Arnd Bergmann [Wed, 22 Nov 2023 22:47:18 +0000 (23:47 +0100)]
nvme: target: fix Kconfig select statements

When the NVME target code is built-in but its TCP frontend is a loadable
module, enabling keyring support causes a link failure:

x86_64-linux-ld: vmlinux.o: in function `nvmet_ports_make':
configfs.c:(.text+0x100a211): undefined reference to `nvme_keyring_id'

The problem is that CONFIG_NVME_TARGET_TCP_TLS is a 'bool' symbol that
depends on the tristate CONFIG_NVME_TARGET_TCP, so any 'select' from
it inherits the state of the tristate symbol rather than the intended
CONFIG_NVME_TARGET one that contains the actual call.

The same thing is true for CONFIG_KEYS, which itself is required for
NVME_KEYRING.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20231122224719.4042108-3-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months ago  nvme: target: fix nvme_keyring_id() references
Arnd Bergmann [Wed, 22 Nov 2023 22:47:17 +0000 (23:47 +0100)]
nvme: target: fix nvme_keyring_id() references

In configurations without CONFIG_NVME_TARGET_TCP_TLS, the keyring
code might not be available, or using it will result in a runtime
failure:

x86_64-linux-ld: vmlinux.o: in function `nvmet_ports_make':
configfs.c:(.text+0x100a211): undefined reference to `nvme_keyring_id'

Add a check to ensure we only check the keyring if there is a chance
of it being used, which avoids both the runtime and link-time
problems.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20231122224719.4042108-2-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months ago  Merge tag 'nvme-6.7-2023-11-22' of git://git.infradead.org/nvme into block-6.7
Jens Axboe [Wed, 22 Nov 2023 17:19:27 +0000 (10:19 -0700)]
Merge tag 'nvme-6.7-2023-11-22' of git://git.infradead.org/nvme into block-6.7

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.7

 - TCP TLS fixes (Hannes)
 - Authentication fixes (Mark, Hannes)
 - Properly terminate target names (Christoph)"

* tag 'nvme-6.7-2023-11-22' of git://git.infradead.org/nvme:
  nvme: move nvme_stop_keep_alive() back to original position
  nvmet-tcp: always initialize tls_handshake_tmo_work
  nvmet: nul-terminate the NQNs passed in the connect command
  nvme: blank out authentication fabrics options if not configured
  nvme: catch errors from nvme_configure_metadata()
  nvme-tcp: only evaluate 'tls' option if TLS is selected
  nvme-auth: set explanation code for failure2 msgs
  nvme-auth: unlock mutex in one place only

23 months ago  nvme: move nvme_stop_keep_alive() back to original position
Hannes Reinecke [Tue, 21 Nov 2023 08:01:03 +0000 (09:01 +0100)]
nvme: move nvme_stop_keep_alive() back to original position

Stopping keep-alive not only stops the keep-alive workqueue,
but also needs to be synchronized with I/O termination, as we
must not send a keep-alive command after all I/O has been
terminated.
So to avoid any regressions, move the call to stop_keep_alive()
back to its original position and ensure that keep-alive is
correctly stopped when failing to set up the admin queue.

Fixes: 4733b65d82bd ("nvme: start keep-alive after admin queue setup")
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
23 months ago  nbd: pass nbd_sock to nbd_read_reply() instead of index
Li Nan [Mon, 11 Sep 2023 02:33:08 +0000 (10:33 +0800)]
nbd: pass nbd_sock to nbd_read_reply() instead of index

If a socket is processing ioctl 'NBD_SET_SOCK', config->socks might be
reallocated by krealloc() in nbd_add_socket(); if a garbage request is
received at that moment, a UAF may occur.

  T1                              T2
  nbd_ioctl
   __nbd_ioctl
    nbd_add_socket
     blk_mq_freeze_queue
                                  recv_work
                                   nbd_read_reply
                                    sock_xmit
     krealloc config->socks
                                     deref config->socks

Pass nbd_sock to nbd_read_reply(), and introduce a new function,
sock_xmit_recv(), which differs from sock_xmit() only in the way it gets
the socket.

==================================================================
BUG: KASAN: use-after-free in sock_xmit+0x525/0x550
Read of size 8 at addr ffff8880188ec428 by task kworker/u12:1/18779

Workqueue: knbd4-recv recv_work
Call Trace:
 __dump_stack
 dump_stack+0xbe/0xfd
 print_address_description.constprop.0+0x19/0x170
 __kasan_report.cold+0x6c/0x84
 kasan_report+0x3a/0x50
 sock_xmit+0x525/0x550
 nbd_read_reply+0xfe/0x2c0
 recv_work+0x1c2/0x750
 process_one_work+0x6b6/0xf10
 worker_thread+0xdd/0xd80
 kthread+0x30a/0x410
 ret_from_fork+0x22/0x30

Allocated by task 18784:
 kasan_save_stack+0x1b/0x40
 kasan_set_track
 set_alloc_info
 __kasan_kmalloc
 __kasan_kmalloc.constprop.0+0xf0/0x130
 slab_post_alloc_hook
 slab_alloc_node
 slab_alloc
 __kmalloc_track_caller+0x157/0x550
 __do_krealloc
 krealloc+0x37/0xb0
 nbd_add_socket+0x2d3/0x880
 __nbd_ioctl
 nbd_ioctl+0x584/0x8e0
 __blkdev_driver_ioctl
 blkdev_ioctl+0x2a0/0x6e0
 block_ioctl+0xee/0x130
 vfs_ioctl
 __do_sys_ioctl
 __se_sys_ioctl+0x138/0x190
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x61/0xc6

Freed by task 18784:
 kasan_save_stack+0x1b/0x40
 kasan_set_track+0x1c/0x30
 kasan_set_free_info+0x20/0x40
 __kasan_slab_free.part.0+0x13f/0x1b0
 slab_free_hook
 slab_free_freelist_hook
 slab_free
 kfree+0xcb/0x6c0
 krealloc+0x56/0xb0
 nbd_add_socket+0x2d3/0x880
 __nbd_ioctl
 nbd_ioctl+0x584/0x8e0
 __blkdev_driver_ioctl
 blkdev_ioctl+0x2a0/0x6e0
 block_ioctl+0xee/0x130
 vfs_ioctl
 __do_sys_ioctl
 __se_sys_ioctl+0x138/0x190
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x61/0xc6

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230911023308.3467802-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agos390/dasd: protect device queue against concurrent access
Jan Höppner [Wed, 25 Oct 2023 13:24:37 +0000 (15:24 +0200)]
s390/dasd: protect device queue against concurrent access

In dasd_profile_start() the number of requests on the device queue is
counted. The access to the device queue is unprotected against
concurrent access. With a lot of parallel I/O, especially with alias
devices enabled, the device queue can change while dasd_profile_start()
is accessing the queue. In the worst case this leads to a kernel panic
due to incorrect pointer accesses.

Fix this by taking the device lock before accessing the queue and
counting the requests. Additionally the check for a valid profile data
pointer can be done earlier to avoid unnecessary locking in a hot path.

Cc: <stable@vger.kernel.org>
Fixes: 4fa52aa7a82f ("[S390] dasd: add enhanced DASD statistics interface")
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20231025132437.1223363-3-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agos390/dasd: resolve spelling mistake
Muhammad Muzammil [Wed, 25 Oct 2023 13:24:36 +0000 (15:24 +0200)]
s390/dasd: resolve spelling mistake

Fix the typo "pimary" -> "primary".

Signed-off-by: Muhammad Muzammil <m.muzzammilashraf@gmail.com>
Link: https://lore.kernel.org/r/20231010043140.28416-1-m.muzzammilashraf@gmail.com
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20231025132437.1223363-2-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agoblock/null_blk: Fix double blk_mq_start_request() warning
Chengming Zhou [Mon, 20 Nov 2023 03:25:21 +0000 (03:25 +0000)]
block/null_blk: Fix double blk_mq_start_request() warning

When CONFIG_BLK_DEV_NULL_BLK_FAULT_INJECTION is enabled, null_queue_rq()
would return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE for the request,
which has been marked as MQ_RQ_IN_FLIGHT by blk_mq_start_request().

Then null_queue_rqs() puts these requests in the rqlist and returns to
the block layer core, which tries to queue them individually again,
so the warning in blk_mq_start_request() is triggered.

Fix it by splitting null_queue_rq() into two parts: the first is the
preparation of the request, the second is the handling of the request.
Call blk_mq_start_request() after the preparation part, which may fail
and return to the block layer core.

The throttling also belongs to the preparation part, so move it before
blk_mq_start_request(). And change the return type of null_handle_cmd()
to void, since it always returns BLK_STS_OK now.

Reported-by: <syzbot+fcc47ba2476570cbbeb0@syzkaller.appspotmail.com>
Closes: https://lore.kernel.org/all/0000000000000e6aac06098aee0c@google.com/
Fixes: d78bfa1346ab ("block/null_blk: add queue_rqs() support")
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20231120032521.1012037-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agonvmet-tcp: always initialize tls_handshake_tmo_work
Hannes Reinecke [Fri, 20 Oct 2023 05:06:06 +0000 (07:06 +0200)]
nvmet-tcp: always initialize tls_handshake_tmo_work

The TLS handshake timeout work item should always be
initialized to avoid a crash when cancelling the workqueue.

Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall")
Suggested-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
23 months agonvmet: nul-terminate the NQNs passed in the connect command
Christoph Hellwig [Fri, 17 Nov 2023 13:13:36 +0000 (08:13 -0500)]
nvmet: nul-terminate the NQNs passed in the connect command

The host and subsystem NQNs are passed in the connect command payload and
interpreted as nul-terminated strings.  Ensure they actually are
nul-terminated before using them.
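
As an illustration only, the idea can be sketched in userspace C. The struct,
the field size (256) and the function names below are hypothetical stand-ins,
not the actual nvmet definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical field size; the real nvmet layout is not shown here. */
#define NQN_FIELD_LEN 256

/* Hypothetical stand-in for the connect command payload. */
struct connect_data {
        char subsysnqn[NQN_FIELD_LEN];
        char hostnqn[NQN_FIELD_LEN];
};

/* Force both NQN fields to be NUL-terminated before using them as strings. */
void terminate_nqns(struct connect_data *d)
{
        assert(d != NULL);
        d->subsysnqn[NQN_FIELD_LEN - 1] = '\0';
        d->hostnqn[NQN_FIELD_LEN - 1] = '\0';
}

/* Only safe to call after terminate_nqns(): strlen() needs a terminator. */
size_t subsysnqn_len(const struct connect_data *d)
{
        return strlen(d->subsysnqn);
}
```

Overwriting the last byte bounds every later string operation to the field,
even if the peer filled the whole buffer with non-NUL bytes.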

Fixes: a07b4970f464 ("nvmet: add a generic NVMe target")
Reported-by: Alon Zahavi <zahavi.alon@gmail.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
23 months agonvme: blank out authentication fabrics options if not configured
Hannes Reinecke [Thu, 16 Nov 2023 12:14:35 +0000 (13:14 +0100)]
nvme: blank out authentication fabrics options if not configured

If the config option NVME_HOST_AUTH is not selected we should not
accept the corresponding fabrics options. This allows userspace
to detect if NVMe authentication has been enabled for the kernel.

Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: f50fff73d620 ("nvme: implement In-Band authentication")
Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
23 months agonvme: catch errors from nvme_configure_metadata()
Hannes Reinecke [Tue, 14 Nov 2023 13:27:01 +0000 (14:27 +0100)]
nvme: catch errors from nvme_configure_metadata()

nvme_configure_metadata() is issuing I/O, so we might incur an I/O
error which will cause the connection to be reset.
But in that case any further probing will race with reset and
cause UAF errors.
So return a status from nvme_configure_metadata() and abort
probing if there was an I/O error.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
23 months agonvme-tcp: only evaluate 'tls' option if TLS is selected
Hannes Reinecke [Tue, 14 Nov 2023 13:18:21 +0000 (14:18 +0100)]
nvme-tcp: only evaluate 'tls' option if TLS is selected

We only need to evaluate the 'tls' connect option if TLS is
enabled; otherwise we might be getting a link error.

Fixes: 706add13676d ("nvme: keyring: fix conditional compilation")
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/r/202311140426.0eHrTXBr-lkp@intel.com/
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
23 months agonvme-auth: set explanation code for failure2 msgs
Mark O'Donovan [Wed, 11 Oct 2023 08:45:12 +0000 (08:45 +0000)]
nvme-auth: set explanation code for failure2 msgs

Some error cases were not setting an auth-failure-reason-code-explanation.
This means an AUTH_Failure2 message will be sent with an explanation value
of 0 which is a reserved value.

Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
23 months agonvme-auth: unlock mutex in one place only
Mark O'Donovan [Wed, 11 Oct 2023 08:45:11 +0000 (08:45 +0000)]
nvme-auth: unlock mutex in one place only

Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
23 months agoblock: Remove blk_set_runtime_active()
Damien Le Moal [Mon, 20 Nov 2023 07:06:11 +0000 (16:06 +0900)]
block: Remove blk_set_runtime_active()

The function blk_set_runtime_active() is called only from
blk_post_runtime_resume(), so there is no need for that function to be
exported. Open-code this function directly in blk_post_runtime_resume()
and remove it.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20231120070611.33951-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agonbd: fix null-ptr-dereference while accessing 'nbd->config'
Li Nan [Thu, 16 Nov 2023 16:23:16 +0000 (00:23 +0800)]
nbd: fix null-ptr-dereference while accessing 'nbd->config'

Memory reordering may occur in nbd_genl_connect(), causing config_refs
to be set to 1 while nbd->config is still empty. Opening nbd at this
time will cause null-ptr-dereference.

   T1                      T2
   nbd_open
    nbd_get_config_unlocked
                           nbd_genl_connect
                            nbd_alloc_and_init_config
                             //memory reordered
                             refcount_set(&nbd->config_refs, 1)  // 2
     nbd->config
      ->null point
                             nbd->config = config  // 1

Fix it by adding smp barrier to guarantee the execution sequence.
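
The required ordering can be sketched in userspace with C11 atomics. This is
an analogy under stated assumptions, not the kernel code: the types and
function names are hypothetical, and release/acquire operations stand in for
the SMP barriers the patch adds.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for nbd_config and nbd_device. */
struct config { int flags; };

struct device {
        struct config *config;  /* written before the refcount is set */
        atomic_int config_refs; /* 0 means "no config published yet" */
};

/* Writer: store the pointer first, then publish it via the refcount. */
void publish_config(struct device *dev, struct config *cfg)
{
        assert(cfg != NULL);
        dev->config = cfg;
        /* Release: the pointer store above cannot be reordered after this. */
        atomic_store_explicit(&dev->config_refs, 1, memory_order_release);
}

/* Reader: only touch the pointer after observing a non-zero refcount. */
struct config *get_config(struct device *dev)
{
        /* Acquire: pairs with the release store in publish_config(). */
        if (atomic_load_explicit(&dev->config_refs, memory_order_acquire) == 0)
                return NULL;
        return dev->config;
}
```

A reader that sees config_refs != 0 is then guaranteed to also see the
pointer that was stored before it.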

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231116162316.1740402-4-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agonbd: factor out a helper to get nbd_config without holding 'config_lock'
Li Nan [Thu, 16 Nov 2023 16:23:15 +0000 (00:23 +0800)]
nbd: factor out a helper to get nbd_config without holding 'config_lock'

No functional changes; this just makes the code cleaner and prepares
to fix the null-ptr-dereference while accessing 'nbd->config'.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231116162316.1740402-3-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agonbd: fold nbd config initialization into nbd_alloc_config()
Li Nan [Thu, 16 Nov 2023 16:23:14 +0000 (00:23 +0800)]
nbd: fold nbd config initialization into nbd_alloc_config()

No functional changes; this makes the code cleaner and prepares to
fix the null-ptr-dereference while accessing 'nbd->config'.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231116162316.1740402-2-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agoMerge tag 'md-fixes-20231120' of https://git.kernel.org/pub/scm/linux/kernel/git...
Jens Axboe [Mon, 20 Nov 2023 16:45:31 +0000 (09:45 -0700)]
Merge tag 'md-fixes-20231120' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.7

Pull MD fix from Song.

* tag 'md-fixes-20231120' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md: fix bi_status reporting in md_end_clone_io

23 months agobcache: avoid NULL checking to c->root in run_cache_set()
Coly Li [Mon, 20 Nov 2023 05:25:03 +0000 (13:25 +0800)]
bcache: avoid NULL checking to c->root in run_cache_set()

In run_cache_set(), after c->root is returned from bch_btree_node_get(), it
is checked by IS_ERR_OR_NULL(). The NULL check is unnecessary
because bch_btree_node_get() never returns a NULL pointer to its caller.

This patch replaces IS_ERR_OR_NULL() by IS_ERR() for the above reason.

Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20231120052503.6122-11-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agobcache: add code comments for bch_btree_node_get() and __bch_btree_node_alloc()
Coly Li [Mon, 20 Nov 2023 05:25:02 +0000 (13:25 +0800)]
bcache: add code comments for bch_btree_node_get() and __bch_btree_node_alloc()

This patch adds code comments to bch_btree_node_get() and
__bch_btree_node_alloc() stating that a NULL pointer is never returned,
so callers of these routines need not check for NULL.

Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20231120052503.6122-10-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agobcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in btree_gc_coalesce()
Coly Li [Mon, 20 Nov 2023 05:25:01 +0000 (13:25 +0800)]
bcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in btree_gc_coalesce()

Commit 028ddcac477b ("bcache: Remove unnecessary NULL point check in
node allocations") made the following change inside btree_gc_coalesce():

31 @@ -1340,7 +1340,7 @@ static int btree_gc_coalesce(
32         memset(new_nodes, 0, sizeof(new_nodes));
33         closure_init_stack(&cl);
34
35 -       while (nodes < GC_MERGE_NODES && !IS_ERR_OR_NULL(r[nodes].b))
36 +       while (nodes < GC_MERGE_NODES && !IS_ERR(r[nodes].b))
37                 keys += r[nodes++].keys;
38
39         blocks = btree_default_blocks(b->c) * 2 / 3;

At line 35 the original r[nodes].b is not always allocated from
__bch_btree_node_alloc(), and is possibly initialized as a NULL pointer by
the caller of btree_gc_coalesce(). Therefore the change at line 36 is not
correct.

This patch replaces the mistaken IS_ERR() by IS_ERR_OR_NULL() to avoid
potential issues.
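
The difference between the two checks can be shown with a small userspace
model. The helpers below are simplified re-implementations written for this
sketch (lower-case names, hypothetical count_usable()), not the kernel macros
themselves:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in the top page of the address space. */
void *err_ptr(long error)
{
        assert(error < 0 && error >= -MAX_ERRNO);
        return (void *)(intptr_t)error;
}

/* True only for encoded error pointers; NULL is NOT an error pointer. */
bool is_err(const void *ptr)
{
        return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

bool is_err_or_null(const void *ptr)
{
        return ptr == NULL || is_err(ptr);
}

/* Count leading usable entries, stopping at the first NULL or error slot. */
size_t count_usable(void *const nodes[], size_t n)
{
        size_t i = 0;

        while (i < n && !is_err_or_null(nodes[i]))
                i++;
        return i;
}
```

With a plain is_err() test, the loop in count_usable() would walk past NULL
slots and treat them as valid nodes.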

Fixes: 028ddcac477b ("bcache: Remove unnecessary NULL point check in node allocations")
Cc: <stable@vger.kernel.org> # 6.5+
Cc: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20231120052503.6122-9-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agobcache: fixup multi-threaded bch_sectors_dirty_init() wake-up race
Mingzhe Zou [Mon, 20 Nov 2023 05:25:00 +0000 (13:25 +0800)]
bcache: fixup multi-threaded bch_sectors_dirty_init() wake-up race

We hit a kernel crash with "unable to handle kernel paging request":

```dmesg
[368033.032005] BUG: unable to handle kernel paging request at ffffffffad9ae4b5
[368033.032007] PGD fc3a0d067 P4D fc3a0d067 PUD fc3a0e063 PMD 8000000fc38000e1
[368033.032012] Oops: 0003 [#1] SMP PTI
[368033.032015] CPU: 23 PID: 55090 Comm: bch_dirtcnt[0] Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-147.5.1.es8_24.x86_64 #1
[368033.032017] Hardware name: Tsinghua Tongfang THTF Chaoqiang Server/072T6D, BIOS 2.4.3 01/17/2017
[368033.032027] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1d0
[368033.032029] Code: 8b 02 48 85 c0 74 f6 48 89 c1 eb d0 c1 e9 12 83 e0
03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 c0 3d 02 00 48 03 04 cd 60 68 93
ad <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 02
[368033.032031] RSP: 0018:ffffbb48852abe00 EFLAGS: 00010082
[368033.032032] RAX: ffffffffad9ae4b5 RBX: 0000000000000246 RCX: 0000000000003bf3
[368033.032033] RDX: ffff97b0ff8e3dc0 RSI: 0000000000600000 RDI: ffffbb4884743c68
[368033.032034] RBP: 0000000000000001 R08: 0000000000000000 R09: 000007ffffffffff
[368033.032035] R10: ffffbb486bb01000 R11: 0000000000000001 R12: ffffffffc068da70
[368033.032036] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[368033.032038] FS:  0000000000000000(0000) GS:ffff97b0ff8c0000(0000) knlGS:0000000000000000
[368033.032039] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[368033.032040] CR2: ffffffffad9ae4b5 CR3: 0000000fc3a0a002 CR4: 00000000003626e0
[368033.032042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[368033.032043] bcache: bch_cached_dev_attach() Caching rbd479 as bcache462 on set 8cff3c36-4a76-4242-afaa-7630206bc70b
[368033.032045] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[368033.032046] Call Trace:
[368033.032054]  _raw_spin_lock_irqsave+0x32/0x40
[368033.032061]  __wake_up_common_lock+0x63/0xc0
[368033.032073]  ? bch_ptr_invalid+0x10/0x10 [bcache]
[368033.033502]  bch_dirty_init_thread+0x14c/0x160 [bcache]
[368033.033511]  ? read_dirty_submit+0x60/0x60 [bcache]
[368033.033516]  kthread+0x112/0x130
[368033.033520]  ? kthread_flush_work_fn+0x10/0x10
[368033.034505]  ret_from_fork+0x35/0x40
```

The crash occurred when calling wake_up(&state->wait), so we wanted
to look at the value in the state. However, bch_sectors_dirty_init()
was not found in the stack of any task. Since state is allocated on
the stack, we guessed that bch_sectors_dirty_init() had exited, causing
bch_dirty_init_thread() to hit the "unable to handle kernel paging
request" error.

In order to verify this idea, we added some printing information around
wake_up(&state->wait). We found that "wake up" is printed twice, whereas
we expect only the last thread to wake up once.

```dmesg
[  994.641004] alcache: bch_dirty_init_thread() wake up
[  994.641018] alcache: bch_dirty_init_thread() wake up
[  994.641523] alcache: bch_sectors_dirty_init() init exit
```

There is a race. If bch_sectors_dirty_init() exits after the first wake
up, the second wake up will trigger this bug("unable to handle kernel
paging request").

Proceed as follows:

bch_sectors_dirty_init
    kthread_run ==============> bch_dirty_init_thread(bch_dirtcnt[0])
            ...                         ...
    atomic_inc(&state.started)          ...
            ...                         ...
    atomic_read(&state.enough)          ...
            ...                 atomic_set(&state->enough, 1)
    kthread_run ======================================================> bch_dirty_init_thread(bch_dirtcnt[1])
            ...                 atomic_dec_and_test(&state->started)            ...
    atomic_inc(&state.started)          ...                                     ...
            ...                 wake_up(&state->wait)                           ...
    atomic_read(&state.enough)                                          atomic_dec_and_test(&state->started)
            ...                                                                 ...
    wait_event(state.wait, atomic_read(&state.started) == 0)                    ...
    return                                                                      ...
                                                                        wake_up(&state->wait)

We believe waking up twice is very common when there is no dirty data,
but the crash is an extremely low-probability event. It is hard for us to
reproduce this issue. We attached and detached continuously for a week,
with a total of more than one million attaches and only one crash.

Putting atomic_inc(&state.started) before kthread_run() avoids waking
up twice.

Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20231120052503.6122-8-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agobcache: fixup lock c->root error
Mingzhe Zou [Mon, 20 Nov 2023 05:24:59 +0000 (13:24 +0800)]
bcache: fixup lock c->root error

We had a problem with I/O hanging because it was waiting for the lock
on c->root to be released.

crash> cache_set.root -l cache_set.list ffffa03fde4c0050
  root = 0xffff802ef454c800
crash> btree -o 0xffff802ef454c800 | grep rw_semaphore
  [ffff802ef454c858] struct rw_semaphore lock;
crash> struct rw_semaphore ffff802ef454c858
struct rw_semaphore {
  count = {
    counter = -4294967297
  },
  wait_list = {
    next = 0xffff00006786fc28,
    prev = 0xffff00005d0efac8
  },
  wait_lock = {
    raw_lock = {
      {
        val = {
          counter = 0
        },
        {
          locked = 0 '\000',
          pending = 0 '\000'
        },
        {
          locked_pending = 0,
          tail = 0
        }
      }
    }
  },
  osq = {
    tail = {
      counter = 0
    }
  },
  owner = 0xffffa03fdc586603
}

The "counter = -4294967297" means that lock count is -1 and a write lock
is being attempted. Then, we found that there is a btree with a counter
of 1 in btree_cache_freeable.

crash> cache_set -l cache_set.list ffffa03fde4c0050 -o|grep btree_cache
  [ffffa03fde4c1140] struct list_head btree_cache;
  [ffffa03fde4c1150] struct list_head btree_cache_freeable;
  [ffffa03fde4c1160] struct list_head btree_cache_freed;
  [ffffa03fde4c1170] unsigned int btree_cache_used;
  [ffffa03fde4c1178] wait_queue_head_t btree_cache_wait;
  [ffffa03fde4c1190] struct task_struct *btree_cache_alloc_lock;
crash> list -H ffffa03fde4c1140|wc -l
973
crash> list -H ffffa03fde4c1150|wc -l
1123
crash> cache_set.btree_cache_used -l cache_set.list ffffa03fde4c0050
  btree_cache_used = 2097
crash> list -s btree -l btree.list -H ffffa03fde4c1140|grep -E -A2 "^  lock = {" > btree_cache.txt
crash> list -s btree -l btree.list -H ffffa03fde4c1150|grep -E -A2 "^  lock = {" > btree_cache_freeable.txt
[root@node-3 127.0.0.1-2023-08-04-16:40:28]# pwd
/var/crash/127.0.0.1-2023-08-04-16:40:28
[root@node-3 127.0.0.1-2023-08-04-16:40:28]# cat btree_cache.txt|grep counter|grep -v "counter = 0"
[root@node-3 127.0.0.1-2023-08-04-16:40:28]# cat btree_cache_freeable.txt|grep counter|grep -v "counter = 0"
      counter = 1

We found that this is a bug in bch_sectors_dirty_init() when locking c->root:
    (1). Thread X has locked c->root(A) write.
    (2). Thread Y failed to lock c->root(A), waiting for the lock(c->root A).
    (3). Thread X bch_btree_set_root() changes c->root from A to B.
    (4). Thread X releases the lock(c->root A).
    (5). Thread Y successfully locks c->root(A).
    (6). Thread Y releases the lock(c->root B).

        down_write locked ---(1)----------------------┐
                |                                     |
                |   down_read waiting ---(2)----┐     |
                |           |               ┌-------------┐ ┌-------------┐
        bch_btree_set_root ===(3)========>> | c->root   A | | c->root   B |
                |           |               └-------------┘ └-------------┘
            up_write ---(4)---------------------┘     |            |
                            |                         |            |
                    down_read locked ---(5)-----------┘            |
                            |                                      |
                        up_read ---(6)-----------------------------┘

Since c->root may change, the correct way to lock c->root is the same
as in bch_root_usage(): compare again after locking.

static unsigned int bch_root_usage(struct cache_set *c)
{
        unsigned int bytes = 0;
        struct bkey *k;
        struct btree *b;
        struct btree_iter iter;

        goto lock_root;

        do {
                rw_unlock(false, b);
lock_root:
                b = c->root;
                rw_lock(false, b, b->level);
        } while (b != c->root);

        for_each_key_filter(&b->keys, k, &iter, bch_ptr_bad)
                bytes += bkey_bytes(k);

        rw_unlock(false, b);

        return (bytes * 100) / btree_bytes(c);
}

Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20231120052503.6122-7-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agobcache: fixup init dirty data errors
Mingzhe Zou [Mon, 20 Nov 2023 05:24:58 +0000 (13:24 +0800)]
bcache: fixup init dirty data errors

We found that after a long run, the dirty_data of the bcache device
contains errors. The error cannot be eliminated without re-registering.

We also found that the error accumulates when reattaching after a detach.

In bch_sectors_dirty_init(), all keys with inode <= d->id are counted
again. This is wrong; we only need to count the keys of the current
device.
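
A simplified model of the intended counting rule, with a hypothetical key
type and function name (the real bcache keys and btree walk are more
involved):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified key: owning device id plus dirty sector count. */
struct dirty_key {
        uint32_t inode;
        uint64_t sectors;
};

/* Sum dirty sectors for one device only; keys of other devices are skipped. */
uint64_t count_dirty(const struct dirty_key *keys, size_t n, uint32_t id)
{
        uint64_t total = 0;

        assert(keys != NULL || n == 0);
        for (size_t i = 0; i < n; i++) {
                if (keys[i].inode == id) /* "== id", not "<= id" */
                        total += keys[i].sectors;
        }
        return total;
}
```

With a "<= id" filter, the keys of every lower-numbered device would be added
to this device's total, and the excess would grow on each re-initialization.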

Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20231120052503.6122-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agobcache: prevent potential division by zero error
Rand Deeb [Mon, 20 Nov 2023 05:24:57 +0000 (13:24 +0800)]
bcache: prevent potential division by zero error

In SHOW(), the variable 'n' is of type 'size_t.' While there is a
conditional check to verify that 'n' is not equal to zero before
executing the 'do_div' macro, concerns arise regarding potential
division by zero error in 64-bit environments.

The concern arises when 'n' is 64 bits in size, greater than zero, and
the lower 32 bits of it are zeros. In such cases, the conditional check
passes because 'n' is non-zero, but the 'do_div' macro casts 'n' to
'uint32_t,' effectively truncating it to its lower 32 bits.
Consequently, the 'n' value becomes zero.

To fix this potential division by zero error and ensure precise
division handling, this commit replaces the 'do_div' macro with
div64_u64(). div64_u64() is designed to work with 64-bit operands,
guaranteeing that division is performed correctly.

This change enhances the robustness of the code, ensuring that division
operations yield accurate results in all scenarios, eliminating the
possibility of division by zero, and improving compatibility across
different 64-bit environments.
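
The truncation described above can be reproduced in plain C. The helpers are
hypothetical and only model the 32-bit divisor conversion and a full 64-bit
division; they are not the kernel's do_div() or div64_u64():

```c
#include <assert.h>
#include <stdint.h>

/* Models what happens to a 64-bit divisor passed where 32 bits are expected. */
uint32_t divisor_as_u32(uint64_t n)
{
        return (uint32_t)n; /* keeps only the lower 32 bits */
}

/* Full 64-bit by 64-bit division; the divisor is never narrowed. */
uint64_t div_u64_by_u64(uint64_t dividend, uint64_t divisor)
{
        assert(divisor != 0);
        return dividend / divisor;
}
```

For n = 1ULL << 32 the check "n != 0" passes, yet divisor_as_u32(n) is 0,
which is exactly the case the commit guards against.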

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Rand Deeb <rand.sec96@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20231120052503.6122-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agobcache: remove redundant assignment to variable cur_idx
Colin Ian King [Mon, 20 Nov 2023 05:24:56 +0000 (13:24 +0800)]
bcache: remove redundant assignment to variable cur_idx

Variable cur_idx is being initialized with a value that is never read,
it is being re-assigned later in a while-loop. Remove the redundant
assignment. Cleans up clang scan build warning:

drivers/md/bcache/writeback.c:916:2: warning: Value stored to 'cur_idx'
is never read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20231120052503.6122-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agobcache: check return value from btree_node_alloc_replacement()
Coly Li [Mon, 20 Nov 2023 05:24:55 +0000 (13:24 +0800)]
bcache: check return value from btree_node_alloc_replacement()

In btree_gc_rewrite_node(), pointer 'n' is not checked after it returns
from btree_node_alloc_replacement(). 'n' may be a non-NULL ERR_PTR(),
and dereferencing such an error pointer is not permitted in the
following code. Therefore the return value must be checked after 'n'
is back from btree_node_alloc_replacement().

Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20231120052503.6122-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agobcache: avoid oversize memory allocation by small stripe_size
Coly Li [Mon, 20 Nov 2023 05:24:54 +0000 (13:24 +0800)]
bcache: avoid oversize memory allocation by small stripe_size

Arrays bcache->stripe_sectors_dirty and bcache->full_dirty_stripes are
used for dirty data writeback; their sizes are decided by backing device
capacity and stripe size. Larger backing device capacity or smaller
stripe size makes these two arrays occupy more dynamic memory space.

Currently bcache->stripe_size is directly inherited from
queue->limits.io_opt of the underlying storage device. For normal hard
drives, limits.io_opt is 0, and bcache sets the corresponding
stripe_size to 1TB (1<<31 sectors), which has worked fine for 10+ years.
But for devices that do declare a value for queue->limits.io_opt, a small
stripe_size (compared to 1TB) becomes an issue: it causes oversize memory
allocations for bcache->stripe_sectors_dirty and
bcache->full_dirty_stripes, while the capacity of hard drives has grown
much larger in the recent decade.

For example, for a raid5 array assembled from three 20TB hard drives, the
raid device capacity is 40TB with a typical 512KB limits.io_opt. After the
calculation in the bcache code, these two arrays will occupy 400MB of
dynamic memory. Even worse, Andrea Tomassetti reports that a 4KB
limits.io_opt is declared on a new 2TB hard drive, so these two arrays
request 2GB and 512MB of dynamic memory from kzalloc(). The result is that
the bcache device always fails to initialize on his system.

To avoid the oversize memory allocation, bcache->stripe_size should not
be directly inherited from queue->limits.io_opt of the underlying device.
This patch defines BCH_MIN_STRIPE_SZ (4MB) as the minimal bcache stripe
size and sets the bcache device's stripe size against the declared
limits.io_opt value from the underlying storage device:
- If the declared limits.io_opt > BCH_MIN_STRIPE_SZ, the bcache device will
  set its stripe size directly to this limits.io_opt value.
- If the declared limits.io_opt < BCH_MIN_STRIPE_SZ, the bcache device will
  set its stripe size to a multiple of limits.io_opt that is equal to or
  larger than BCH_MIN_STRIPE_SZ.

Then the minimal stripe size of a bcache device will always be >= 4MB.
For a 40TB raid5 device with 512KB limits.io_opt, memory occupied by
bcache->stripe_sectors_dirty and bcache->full_dirty_stripes will be 50MB
in total. For a 2TB hard drive with 4KB limits.io_opt, memory occupied
by these two arrays will be 2.5MB in total.

Such an amount of memory allocated for bcache->stripe_sectors_dirty and
bcache->full_dirty_stripes is reasonable for most storage devices.
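
The stripe size selection described above can be sketched as follows. This
is a minimal model working in bytes, with hypothetical names; the actual
bcache code may use different units and helpers:

```c
#include <assert.h>
#include <stdint.h>

#define MIN_STRIPE_BYTES     (4ULL << 20) /* 4MB, per the commit message */
#define DEFAULT_STRIPE_BYTES (1ULL << 40) /* 1TB = 1<<31 sectors of 512 bytes */

/* Pick a stripe size from the io_opt value declared by the device. */
uint64_t pick_stripe_bytes(uint64_t io_opt)
{
        uint64_t stripe;

        if (io_opt == 0)                  /* nothing declared: keep 1TB */
                return DEFAULT_STRIPE_BYTES;
        if (io_opt >= MIN_STRIPE_BYTES)   /* large enough: use it directly */
                return io_opt;

        /* Smallest multiple of io_opt that is >= the 4MB minimum. */
        stripe = (MIN_STRIPE_BYTES + io_opt - 1) / io_opt * io_opt;
        assert(stripe >= MIN_STRIPE_BYTES && stripe % io_opt == 0);
        return stripe;
}
```

Keeping the result a multiple of io_opt preserves alignment with the
device's preferred I/O size while still enforcing the 4MB floor.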

Reported-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Eric Wheeler <bcache@lists.ewheeler.net>
Link: https://lore.kernel.org/r/20231120052503.6122-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agomd: fix bi_status reporting in md_end_clone_io
Song Liu [Fri, 17 Nov 2023 23:56:30 +0000 (15:56 -0800)]
md: fix bi_status reporting in md_end_clone_io

md_end_clone_io() may overwrite the error status in orig_bio->bi_status
with BLK_STS_OK. This can happen when orig_bio has BIO_CHAIN set (split
by md_submit_bio => bio_split_to_limits, for example). As a result, the
upper layer may miss an error reported by md (or the device) and consider
the failed IO successful.

Fix this by only updating orig_bio->bi_status when the current bio
reports an error and orig_bio is BLK_STS_OK. This is the same behavior as
__bio_chain_endio().
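
A minimal sketch of the propagation rule stated above, written as a pure
function so it can be compiled outside the kernel. The type and names here
(sketch_status_t, SKETCH_STS_OK, SKETCH_STS_IOERR, sketch_merge_status) are
hypothetical stand-ins for illustration and are not the md code.

```c
/* Stand-in status type; 0 plays the role of BLK_STS_OK in this sketch. */
typedef unsigned char sketch_status_t;

#define SKETCH_STS_OK    ((sketch_status_t)0)
#define SKETCH_STS_IOERR ((sketch_status_t)10)

/*
 * Return the status the original bio should carry after one clone
 * completes: take the clone's status only if the clone failed and the
 * original has not recorded an error yet. A successful clone never
 * clears an error that is already recorded.
 */
static sketch_status_t sketch_merge_status(sketch_status_t orig,
					   sketch_status_t clone)
{
	if (clone != SKETCH_STS_OK && orig == SKETCH_STS_OK)
		return clone;
	return orig;
}
```

The case that matters for this fix is the one where the original already
holds an error and a later clone completes successfully: the error is kept
rather than being replaced by the OK status.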

Fixes: 10764815ff47 ("md: add io accounting for raid0 and raid5")
Cc: stable@vger.kernel.org # v5.14+
Reported-by: Bhanu Victor DiCara <00bvd0+linux@gmail.com>
Closes: https://lore.kernel.org/regressions/5727380.DvuYhMxLoT@bvd0/
Signed-off-by: Song Liu <song@kernel.org>
Tested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
23 months agoblk-cgroup: bypass blkcg_deactivate_policy after destroying
Ming Lei [Fri, 17 Nov 2023 02:35:24 +0000 (10:35 +0800)]
blk-cgroup: bypass blkcg_deactivate_policy after destroying

blkcg_deactivate_policy() can be called after blkg_destroy_all()
returns, and it isn't necessary since blkg_destroy_all has covered
policy deactivation.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231117023527.3188627-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agoblk-cgroup: avoid to warn !rcu_read_lock_held() in blkg_lookup()
Ming Lei [Fri, 17 Nov 2023 02:35:23 +0000 (10:35 +0800)]
blk-cgroup: avoid to warn !rcu_read_lock_held() in blkg_lookup()

So far, all callers either hold the spin lock or the RCU read lock
explicitly, and most of the callers have added
WARN_ON_ONCE(!rcu_read_lock_held()) or
lockdep_assert_held(&disk->queue->queue_lock).

Remove WARN_ON_ONCE(!rcu_read_lock_held()) from blkg_lookup() to kill
the false positive warning from blkg_conf_prep().

Reported-by: Changhui Zhong <czhong@redhat.com>
Fixes: 83462a6c971c ("blkcg: Drop unnecessary RCU read [un]locks from blkg_conf_prep/finish()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231117023527.3188627-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agoblk-throttle: fix lockdep warning of "cgroup_mutex or RCU read lock required!"
Ming Lei [Fri, 17 Nov 2023 02:35:22 +0000 (10:35 +0800)]
blk-throttle: fix lockdep warning of "cgroup_mutex or RCU read lock required!"

Inside blkg_for_each_descendant_pre(), both
css_for_each_descendant_pre() and blkg_lookup() require the RCU read
lock, and either cgroup_assert_mutex_or_rcu_locked() or
rcu_read_lock_held() is called.

Fix the warning by taking the RCU read lock.

Reported-by: Changhui Zhong <czhong@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231117023527.3188627-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agoblk-mq: make sure active queue usage is held for bio_integrity_prep()
Christoph Hellwig [Mon, 13 Nov 2023 03:52:31 +0000 (11:52 +0800)]
blk-mq: make sure active queue usage is held for bio_integrity_prep()

blk_integrity_unregister() can run if the queue usage counter isn't held
for a bio with integrity prepared, so the request may then be completed
by calling profile->complete_fn, which leads to a kernel panic.

Another constraint is that bio_integrity_prep() needs to be called
before bio merge.

Fix the issue by:

- call bio_integrity_prep() with one queue usage counter grabbed reliably

- call bio_integrity_prep() before bio merge

Fixes: 900e080752025f00 ("block: move queue enter logic into blk_mq_submit_bio()")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/r/20231113035231.2708053-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
23 months agoLinux 6.7-rc1
Linus Torvalds [Mon, 13 Nov 2023 00:19:07 +0000 (16:19 -0800)]
Linux 6.7-rc1

23 months agowifi: iwlwifi: fix system commands group ordering
Miri Korenblit [Sun, 12 Nov 2023 14:36:20 +0000 (16:36 +0200)]
wifi: iwlwifi: fix system commands group ordering

The commands should be sorted inside the group definition.
Fix the ordering so we won't get the following warning:
WARN_ON(iwl_cmd_groups_verify_sorted(trans_cfg))
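
The kind of check behind such a warning can be sketched as a linear scan
over the command IDs of one group. This is an illustration under stated
assumptions, not the iwlwifi implementation: sketch_ids_sorted() is a
hypothetical helper, and treating equal neighbouring IDs as sorted is an
assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true if the command IDs are in
 * non-decreasing order, false as soon as one entry is smaller than
 * its predecessor.
 */
static bool sketch_ids_sorted(const uint8_t *ids, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		if (ids[i - 1] > ids[i])
			return false;
	}
	return true;
}
```

A group definition with one entry placed out of order fails this scan,
which is the situation the reordering in this commit addresses.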

Link: https://lore.kernel.org/regressions/2fa930bb-54dd-4942-a88d-05a47c8e9731@gmail.com/
Link: https://lore.kernel.org/linux-wireless/CAHk-=wix6kqQ5vHZXjOPpZBfM7mMm9bBZxi2Jh7XnaKCqVf94w@mail.gmail.com/
Fixes: b6e3d1ba4fcf ("wifi: iwlwifi: mvm: implement new firmware API for statistics")
Tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
Tested-by: Damian Tometzki <damian@riscv-rocks.de>
Acked-by: Kalle Valo <kvalo@kernel.org>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
23 months agoMerge tag 'parisc-for-6.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 12 Nov 2023 19:05:31 +0000 (11:05 -0800)]
Merge tag 'parisc-for-6.7-rc1-2' of git://git./linux/kernel/git/deller/parisc-linux

Pull parisc architecture fixes from Helge Deller:

 - Include the upper 5 address bits when inserting TLB entries on a
   64-bit kernel.

   On physical machines those are ignored, but in qemu it's nice to have
   them included and to be correct.

 - Stop the 64-bit kernel and show a warning if someone tries to boot on
   a machine with a 32-bit CPU

 - Fix a "no previous prototype" warning in parport-gsc

* tag 'parisc-for-6.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Prevent booting 64-bit kernels on PA1.x machines
  parport: gsc: mark init function static
  parisc/pgtable: Do not drop upper 5 address bits of physical address

23 months agoMerge tag 'loongarch-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai...
Linus Torvalds [Sun, 12 Nov 2023 18:58:08 +0000 (10:58 -0800)]
Merge tag 'loongarch-6.7' of git://git./linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch updates from Huacai Chen:

 - support PREEMPT_DYNAMIC with static keys

 - relax memory ordering for atomic operations

 - support BPF CPU v4 instructions for LoongArch

 - some build and runtime warning fixes

* tag 'loongarch-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  selftests/bpf: Enable cpu v4 tests for LoongArch
  LoongArch: BPF: Support signed mod instructions
  LoongArch: BPF: Support signed div instructions
  LoongArch: BPF: Support 32-bit offset jmp instructions
  LoongArch: BPF: Support unconditional bswap instructions
  LoongArch: BPF: Support sign-extension mov instructions
  LoongArch: BPF: Support sign-extension load instructions
  LoongArch: Add more instruction opcodes and emit_* helpers
  LoongArch/smp: Call rcutree_report_cpu_starting() earlier
  LoongArch: Relax memory ordering for atomic operations
  LoongArch: Mark __percpu functions as always inline
  LoongArch: Disable module from accessing external data directly
  LoongArch: Support PREEMPT_DYNAMIC with static keys

23 months agoMerge tag 'powerpc-6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sun, 12 Nov 2023 18:50:38 +0000 (10:50 -0800)]
Merge tag 'powerpc-6.7-2' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Finish a refactor of pgprot_framebuffer() which depended
   on some changes that were merged via the drm tree

 - Fix some kernel-doc warnings to quieten the bots

Thanks to Nathan Lynch and Thomas Zimmermann.

* tag 'powerpc-6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/rtas: Fix ppc_rtas_rmo_buf_show() kernel-doc
  powerpc/pseries/rtas-work-area: Fix rtas_work_area_reserve_arena() kernel-doc
  powerpc/fb: Call internal __phys_mem_access_prot() in fbdev code
  powerpc: Remove file parameter from phys_mem_access_prot()
  powerpc/machdep: Remove trailing whitespaces

23 months agoMerge tag '6.7-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sun, 12 Nov 2023 01:17:22 +0000 (17:17 -0800)]
Merge tag '6.7-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - ctime caching fix (for setxattr)

 - encryption fix

 - DNS resolver mount fix

 - debugging improvements

 - multichannel fixes including cases where server stops or starts
   supporting multichannel after mount

 - reconnect fix

 - minor cleanups

* tag '6.7-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update internal module version number for cifs.ko
  cifs: handle when server stops supporting multichannel
  cifs: handle when server starts supporting multichannel
  Missing field not being returned in ioctl CIFS_IOC_GET_MNT_INFO
  smb3: allow dumping session and tcon id to improve stats analysis and debugging
  smb: client: fix mount when dns_resolver key is not available
  smb3: fix caching of ctime on setxattr
  smb3: minor cleanup of session handling code
  cifs: reconnect work should have reference on server struct
  cifs: do not pass cifs_sb when trying to add channels
  cifs: account for primary channel in the interface list
  cifs: distribute channels across interfaces based on speed
  cifs: handle cases where a channel is closed
  smb3: more minor cleanups for session handling routines
  smb3: minor RDMA cleanup
  cifs: Fix encryption of cleared, but unset rq_iter data buffers

23 months agoMerge tag 'probes-fixes-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 11 Nov 2023 00:35:04 +0000 (16:35 -0800)]
Merge tag 'probes-fixes-v6.7-rc1' of git://git./linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:

 - Documentation update: Add a note that argument and return value
   fetching is best effort because it depends on the type.

 - objpool: Fix to make internal global variables static in
   test_objpool.c.

 - kprobes: Unify kprobes_exceptions_nofify() prototypes. There are the
   same prototypes in asm/kprobes.h for some architectures, but some of
   them are missing the prototype and it causes a warning. So move the
   prototype into linux/kprobes.h.

 - tracing: Fix to check the tracepoint event and return event at
   parsing stage. The tracepoint event doesn't support %return but if
   $retval exists, it will be converted to %return silently. This finds
   that case and rejects it.

 - tracing: Fix the order of the descriptions about the parameters of
   __kprobe_event_gen_cmd_start() to be consistent with the argument
   list of the function.

* tag 'probes-fixes-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/kprobes: Fix the order of argument descriptions
  tracing: fprobe-event: Fix to check tracepoint event and return
  kprobes: unify kprobes_exceptions_nofify() prototypes
  lib: test_objpool: make global variables static
  Documentation: tracing: Add a note about argument and retval access

23 months agoMerge tag 'fbdev-for-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller...
Linus Torvalds [Fri, 10 Nov 2023 23:07:01 +0000 (15:07 -0800)]
Merge tag 'fbdev-for-6.7-rc1' of git://git./linux/kernel/git/deller/linux-fbdev

Pull fbdev fixes and cleanups from Helge Deller:

 - fix double free and resource leaks in imsttfb

 - lots of remove callback cleanups and section mismatch fixes in
   omapfb, amifb and atmel_lcdfb

 - error code fix and memparse simplification in omapfb

* tag 'fbdev-for-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev: (31 commits)
  fbdev: fsl-diu-fb: mark wr_reg_wa() static
  fbdev: amifb: Convert to platform remove callback returning void
  fbdev: amifb: Mark driver struct with __refdata to prevent section mismatch warning
  fbdev: hyperv_fb: fix uninitialized local variable use
  fbdev: omapfb/tpd12s015: Convert to platform remove callback returning void
  fbdev: omapfb/tfp410: Convert to platform remove callback returning void
  fbdev: omapfb/sharp-ls037v7dw01: Convert to platform remove callback returning void
  fbdev: omapfb/opa362: Convert to platform remove callback returning void
  fbdev: omapfb/hdmi: Convert to platform remove callback returning void
  fbdev: omapfb/dvi: Convert to platform remove callback returning void
  fbdev: omapfb/dsi-cm: Convert to platform remove callback returning void
  fbdev: omapfb/dpi: Convert to platform remove callback returning void
  fbdev: omapfb/analog-tv: Convert to platform remove callback returning void
  fbdev: atmel_lcdfb: Convert to platform remove callback returning void
  fbdev: omapfb/tpd12s015: Don't put .remove() in .exit.text and drop suppress_bind_attrs
  fbdev: omapfb/tfp410: Don't put .remove() in .exit.text and drop suppress_bind_attrs
  fbdev: omapfb/sharp-ls037v7dw01: Don't put .remove() in .exit.text and drop suppress_bind_attrs
  fbdev: omapfb/opa362: Don't put .remove() in .exit.text and drop suppress_bind_attrs
  fbdev: omapfb/hdmi: Don't put .remove() in .exit.text and drop suppress_bind_attrs
  fbdev: omapfb/dvi: Don't put .remove() in .exit.text and drop suppress_bind_attrs
  ...

23 months agotracing/kprobes: Fix the order of argument descriptions
Yujie Liu [Tue, 31 Oct 2023 04:13:05 +0000 (12:13 +0800)]
tracing/kprobes: Fix the order of argument descriptions

The order of descriptions should be consistent with the argument list of
the function, so "kretprobe" should be the second one.

int __kprobe_event_gen_cmd_start(struct dynevent_cmd *cmd, bool kretprobe,
                                 const char *name, const char *loc, ...)

Link: https://lore.kernel.org/all/20231031041305.3363712-1-yujie.liu@intel.com/
Fixes: 2a588dd1d5d6 ("tracing: Add kprobe event command generation functions")
Suggested-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
23 months agoMerge tag 'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm
Linus Torvalds [Fri, 10 Nov 2023 22:59:30 +0000 (14:59 -0800)]
Merge tag 'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Daniel Vetter:
 "Dave's VPN to the big machine died, so it's on me to do fixes pr this
  and next week while everyone else is at plumbers.

   - big pile of amd fixes, but mostly for hw support newly added in 6.7

   - i915 fixes, mostly minor things

   - qxl memory leak fix

   - vc4 uaf fix in mock helpers

   - syncobj fix for DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE"

* tag 'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm: (78 commits)
  drm/amdgpu: fix error handling in amdgpu_vm_init
  drm/amdgpu: Fix possible null pointer dereference
  drm/amdgpu: move UVD and VCE sched entity init after sched init
  drm/amdgpu: move kfd_resume before the ip late init
  drm/amd: Explicitly check for GFXOFF to be enabled for s0ix
  drm/amdgpu: Change WREG32_RLC to WREG32_SOC15_RLC where inst != 0 (v2)
  drm/amdgpu: Use correct KIQ MEC engine for gfx9.4.3 (v5)
  drm/amdgpu: add smu v13.0.6 pcs xgmi ras error query support
  drm/amdgpu: fix software pci_unplug on some chips
  drm/amd/display: remove duplicated argument
  drm/amdgpu: correct mca debugfs dump reg list
  drm/amdgpu: correct acclerator check architecutre dump
  drm/amdgpu: add pcs xgmi v6.4.0 ras support
  drm/amdgpu: Change extended-scope MTYPE on GC 9.4.3
  drm/amdgpu: disable smu v13.0.6 mca debug mode by default
  drm/amdgpu: Support multiple error query modes
  drm/amdgpu: refine smu v13.0.6 mca dump driver
  drm/amdgpu: Do not program PF-only regs in hdp_v4_0.c under SRIOV (v2)
  drm/amdgpu: Skip PCTL0_MMHUB_DEEPSLEEP_IB write in jpegv4.0.3 under SRIOV
  drm: amd: Resolve Sphinx unexpected indentation warning
  ...

23 months agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Fri, 10 Nov 2023 20:22:14 +0000 (12:22 -0800)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:
 "Mostly PMU fixes and a reworking of the pseudo-NMI disabling on broken
  MediaTek firmware:

   - Move the MediaTek GIC quirk handling from irqchip to core. Before
     the merging window commit 44bd78dd2b88 ("irqchip/gic-v3: Disable
     pseudo NMIs on MediaTek devices w/ firmware issues") temporarily
     addressed this issue. Fixed now at a deeper level in the arch code

   - Reject events meant for other PMUs in the CoreSight PMU driver,
     otherwise some of the core PMU events would disappear

   - Fix the Armv8 PMUv3 driver to not truncate 64-bit registers,
     causing some events to be invisible

   - Remove duplicate declaration of __arm64_sys##name following the
     patch to avoid prototype warning for syscalls

   - Typos in the elf_hwcap documentation"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64/syscall: Remove duplicate declaration
  Revert "arm64: smp: avoid NMI IPIs with broken MediaTek FW"
  arm64: Move MediaTek GIC quirk handling from irqchip to core
  arm64/arm: arm_pmuv3: perf: Don't truncate 64-bit registers
  perf: arm_cspmu: Reject events meant for other PMUs
  Documentation/arm64: Fix typos in elf_hwcaps

23 months agoMerge tag 'sound-fix-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai...
Linus Torvalds [Fri, 10 Nov 2023 19:57:51 +0000 (11:57 -0800)]
Merge tag 'sound-fix-6.7-rc1' of git://git./linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "A collection of fixes for rc1.

  The majority of changes are various ASoC driver-specific small fixes
  and usual HD-audio quirks, while there are a couple of core changes: a
  fix in ALSA core procfs code to avoid deadlocks at disconnection and
  an ASoC core fix for DAPM clock widgets"

* tag 'sound-fix-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  OSS: dmasound/paula: Convert to platform remove callback returning void
  ALSA: hda: ASUS UM5302LA: Added quirks for cs35L41/10431A83 on i2c bus
  ALSA: info: Fix potential deadlock at disconnection
  ASoC: nau8540: Add self recovery to improve capture quility
  ALSA: hda/realtek: Add support dual speaker for Dell
  ALSA: hda: Add ASRock X670E Taichi to denylist
  ALSA: hda/realtek: Add quirk for ASUS UX7602ZM
  ASoC: SOF: sof-client: trivial: fix comment typo
  ASoC: dapm: fix clock get name
  ASoC: hdmi-codec: register hpd callback on component probe
  ASoC: mediatek: mt8186_mt6366_rt1019_rt5682s: trivial: fix error messages
  ASoC: da7219: Improve system suspend and resume handling
  ASoC: codecs: Modify macro value error
  ASoC: codecs: Modify the wrong judgment of re value
  ASoC: codecs: Modify the maximum value of calib
  ASoC: amd: acp: fix for i2s mode register field update
  ASoC: codecs: aw88399: Fix -Wuninitialized in aw_dev_set_vcalb()
  ASoC: rt712-sdca: fix speaker route missing issue
  ASoC: rockchip: Fix unused rockchip_i2s_tdm_match warning for !CONFIG_OF
  ASoC: ti: omap-mcbsp: Fix runtime PM underflow warnings

23 months agoMerge tag 'amd-drm-next-6.7-2023-11-10' of https://gitlab.freedesktop.org/agd5f/linux...
Daniel Vetter [Fri, 10 Nov 2023 19:51:37 +0000 (20:51 +0100)]
Merge tag 'amd-drm-next-6.7-2023-11-10' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.7-2023-11-10:

amdgpu:
- SR-IOV fixes
- DMCUB fixes
- DCN3.5 fixes
- DP2 fixes
- SubVP fixes
- SMU14 fixes
- SDMA4.x fixes
- Suspend/resume fixes
- AGP regression fix
- UAF fixes for some error cases
- SMU 13.0.6 fixes
- Documentation fixes
- RAS fixes
- Hotplug fixes
- Scheduling entity ordering fix
- GPUVM fixes

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20231110190703.4741-1-alexander.deucher@amd.com
23 months agoMerge tag 'spi-fix-v6.7-merge-window' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 10 Nov 2023 19:44:38 +0000 (11:44 -0800)]
Merge tag 'spi-fix-v6.7-merge-window' of git://git./linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
 "A couple of fixes that came in during the merge window: one Kconfig
  dependency fix and another fix for a long standing issue where a sync
  transfer races with system suspend"

* tag 'spi-fix-v6.7-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: Fix null dereference on suspend
  spi: spi-zynq-qspi: add spi-mem to driver kconfig dependencies

23 months agoMerge tag 'mmc-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Linus Torvalds [Fri, 10 Nov 2023 19:40:38 +0000 (11:40 -0800)]
Merge tag 'mmc-v6.7-2' of git://git./linux/kernel/git/ulfh/mmc

Pull MMC fixes from Ulf Hansson:
 "MMC core:
   - Fix broken cache-flush support for Micron eMMCs
   - Revert 'mmc: core: Capture correct oemid-bits for eMMC cards'

  MMC host:
   - sdhci_am654: Fix TAP value parsing for legacy speed mode
   - sdhci-pci-gli: Fix support for ASPM mode for GL9755/GL9750
   - vub300: Fix an error path in probe"

* tag 'mmc-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: sdhci-pci-gli: GL9750: Mask the replay timer timeout of AER
  mmc: sdhci-pci-gli: GL9755: Mask the replay timer timeout of AER
  Revert "mmc: core: Capture correct oemid-bits for eMMC cards"
  mmc: vub300: fix an error code
  mmc: Add quirk MMC_QUIRK_BROKEN_CACHE_FLUSH for Micron eMMC Q2J54A
  mmc: sdhci_am654: fix start loop index for TAP value parsing

23 months agoMerge tag 'pwm/for-6.7-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 10 Nov 2023 19:34:16 +0000 (11:34 -0800)]
Merge tag 'pwm/for-6.7-rc1-fixes' of git://git./linux/kernel/git/thierry.reding/linux-pwm

Pull pwm fixes from Thierry Reding:
 "This contains two very small fixes that I failed to include in the
  main pull request"

* tag 'pwm/for-6.7-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm:
  pwm: Fix double shift bug
  pwm: samsung: Fix a bit test in pwm_samsung_resume()

23 months agoMerge tag 'io_uring-6.7-2023-11-10' of git://git.kernel.dk/linux
Linus Torvalds [Fri, 10 Nov 2023 19:25:58 +0000 (11:25 -0800)]
Merge tag 'io_uring-6.7-2023-11-10' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:
 "Mostly just a few fixes and cleanups caused by the read multishot
  support.

  Outside of that, a stable fix for how a connect retry is done"

* tag 'io_uring-6.7-2023-11-10' of git://git.kernel.dk/linux:
  io_uring: do not clamp read length for multishot read
  io_uring: do not allow multishot read to set addr or len
  io_uring: indicate if io_kbuf_recycle did recycle anything
  io_uring/rw: add separate prep handler for fixed read/write
  io_uring/rw: add separate prep handler for readv/writev
  io_uring/net: ensure socket is marked connected on connect retry
  io_uring/rw: don't attempt to allocate async data if opcode doesn't need it

23 months agoMerge tag 'block-6.7-2023-11-10' of git://git.kernel.dk/linux
Linus Torvalds [Fri, 10 Nov 2023 19:20:33 +0000 (11:20 -0800)]
Merge tag 'block-6.7-2023-11-10' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
      - nvme keyring config compile fixes (Hannes and Arnd)
      - fabrics keep alive fixes (Hannes)
      - tcp authentication fixes (Mark)
      - io_uring_cmd error handling fix (Anuj)
      - stale firmware attribute fix (Daniel)
      - tcp memory leak (Christophe)
      - crypto library usage simplification (Eric)

 - nbd use-after-free fix. May need a followup, but at least it's better
   than what it was before (Li)

 - Rate limit write on read-only device warnings (Yu)

* tag 'block-6.7-2023-11-10' of git://git.kernel.dk/linux:
  nvme: keyring: fix conditional compilation
  nvme: common: make keyring and auth separate modules
  blk-core: use pr_warn_ratelimited() in bio_check_ro()
  nbd: fix uaf in nbd_open
  nvme: start keep-alive after admin queue setup
  nvme-loop: always quiesce and cancel commands before destroying admin q
  nvme-tcp: avoid open-coding nvme_tcp_teardown_admin_queue()
  nvme-auth: always set valid seq_num in dhchap reply
  nvme-auth: add flag for bi-directional auth
  nvme-auth: auth success1 msg always includes resp
  nvme: fix error-handling for io_uring nvme-passthrough
  nvme: update firmware version after commit
  nvme-tcp: Fix a memory leak
  nvme-auth: use crypto_shash_tfm_digest()

23 months agoMerge tag 'ata-6.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal...
Linus Torvalds [Fri, 10 Nov 2023 19:15:34 +0000 (11:15 -0800)]
Merge tag 'ata-6.7-rc1-2' of git://git./linux/kernel/git/dlemoal/libata

Pull ata fixes from Damien Le Moal:

 - Revert a change in ata_pci_shutdown_one() to suspend disks on
   shutdown as this is now done using the manage_shutdown scsi device
   flag (me)

 - Change the pata_falcon and pata_gayle drivers to stop using
   module_platform_driver_probe(). This makes these drivers more in line
   with all other drivers (allowing bind/unbind) and suppresses a
   compilation warning (Uwe)

 - Convert the pata_falcon and pata_gayle drivers to the new
   .remove_new() void-return callback. These 2 drivers are the last ones
   needing this change (Uwe)

* tag 'ata-6.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
  ata: pata_gayle: Convert to platform remove callback returning void
  ata: pata_falcon: Convert to platform remove callback returning void
  ata: pata_gayle: Stop using module_platform_driver_probe()
  ata: pata_falcon: Stop using module_platform_driver_probe()
  ata: libata-core: Fix ata_pci_shutdown_one()

23 months agoMerge tag 'dma-mapping-6.7-2023-11-10' of git://git.infradead.org/users/hch/dma-mapping
Linus Torvalds [Fri, 10 Nov 2023 19:09:07 +0000 (11:09 -0800)]
Merge tag 'dma-mapping-6.7-2023-11-10' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - don't let pages decrypted for DMA in encrypted memory setups linger
   around on failure (Petr Tesarik)

 - fix an out of bounds access in the new dynamic swiotlb code (Petr
   Tesarik)

 - fix dma_addressing_limited for systems with weird physical memory
   layouts (Jia He)

* tag 'dma-mapping-6.7-2023-11-10' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: fix out-of-bounds TLB allocations with CONFIG_SWIOTLB_DYNAMIC
  dma-mapping: fix dma_addressing_limited() if dma_range_map can't cover all system RAM
  dma-mapping: move dma_addressing_limited() out of line
  swiotlb: do not free decrypted pages if dynamic

23 months agoMerge tag 'lsm-pr-20231109' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Linus Torvalds [Fri, 10 Nov 2023 18:58:49 +0000 (10:58 -0800)]
Merge tag 'lsm-pr-20231109' of git://git./linux/kernel/git/pcmoore/lsm

Pull lsm updates from Paul Moore:
 "We've got two small patches to correct the default return
  value of two LSM hooks: security_vm_enough_memory_mm() and
  security_inode_getsecctx()"

* tag 'lsm-pr-20231109' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  lsm: fix default return value for inode_getsecctx
  lsm: fix default return value for vm_enough_memory

23 months agoMerge tag '6.7-rc-smb3-server-part2' of git://git.samba.org/ksmbd
Linus Torvalds [Fri, 10 Nov 2023 18:23:53 +0000 (10:23 -0800)]
Merge tag '6.7-rc-smb3-server-part2' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:

 - slab out of bounds fix in ACL handling

 - fix malformed request oops

 - minor doc fix

* tag '6.7-rc-smb3-server-part2' of git://git.samba.org/ksmbd:
  ksmbd: handle malformed smb1 message
  ksmbd: fix kernel-doc comment of ksmbd_vfs_kern_path_locked()
  ksmbd: fix slab out of bounds write in smb_inherit_dacl()

23 months agoMerge tag 'ceph-for-6.7-rc1' of https://github.com/ceph/ceph-client
Linus Torvalds [Fri, 10 Nov 2023 17:52:56 +0000 (09:52 -0800)]
Merge tag 'ceph-for-6.7-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:

 - support for idmapped mounts in CephFS (Christian Brauner, Alexander
   Mikhalitsyn).

   The series was originally developed by Christian and later picked up
   and brought over the finish line by Alexander, who also contributed
   an enabler on the MDS side (separate owner_{u,g}id fields on the
   wire).

   The required exports for mnt_idmap_{get,put}() in VFS have been acked
   by Christian and received no objection from Christoph.

 - a churny change in CephFS logging to include cluster and client
   identifiers in log and debug messages (Xiubo Li).

   This would help in scenarios with dozens of CephFS mounts on the same
   node which are getting increasingly common, especially in the
   Kubernetes world.

* tag 'ceph-for-6.7-rc1' of https://github.com/ceph/ceph-client:
  ceph: allow idmapped mounts
  ceph: allow idmapped atomic_open inode op
  ceph: allow idmapped set_acl inode op
  ceph: allow idmapped setattr inode op
  ceph: pass idmap to __ceph_setattr
  ceph: allow idmapped permission inode op
  ceph: allow idmapped getattr inode op
  ceph: pass an idmapping to mknod/symlink/mkdir
  ceph: add enable_unsafe_idmap module parameter
  ceph: handle idmapped mounts in create_request_message()
  ceph: stash idmapping in mdsc request
  fs: export mnt_idmap_get/mnt_idmap_put
  libceph, ceph: move mdsmap.h to fs/ceph
  ceph: print cluster fsid and client global_id in all debug logs
  ceph: rename _to_client() to _to_fs_client()
  ceph: pass the mdsc to several helpers
  libceph: add doutc and *_client debug macros support

23 months agoMerge tag 'riscv-for-linus-6.7-mw2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 10 Nov 2023 17:23:17 +0000 (09:23 -0800)]
Merge tag 'riscv-for-linus-6.7-mw2' of git://git./linux/kernel/git/riscv/linux

Pull more RISC-V updates from Palmer Dabbelt:

 - Support for handling misaligned accesses in S-mode

 - Probing for misaligned access support is now properly cached and
   handled in parallel

 - PTDUMP now reflects the SW reserved bits, as well as the PBMT and
   NAPOT extensions

 - Performance improvements for TLB flushing

 - Support for many new relocations in the module loader

 - Various bug fixes and cleanups

* tag 'riscv-for-linus-6.7-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (51 commits)
  riscv: Optimize bitops with Zbb extension
  riscv: Rearrange hwcap.h and cpufeature.h
  drivers: perf: Do not broadcast to other cpus when starting a counter
  drivers: perf: Check find_first_bit() return value
  of: property: Add fw_devlink support for msi-parent
  RISC-V: Don't fail in riscv_of_parent_hartid() for disabled HARTs
  riscv: Fix set_memory_XX() and set_direct_map_XX() by splitting huge linear mappings
  riscv: Don't use PGD entries for the linear mapping
  RISC-V: Probe misaligned access speed in parallel
  RISC-V: Remove __init on unaligned_emulation_finish()
  RISC-V: Show accurate per-hart isa in /proc/cpuinfo
  RISC-V: Don't rely on positional structure initialization
  riscv: Add tests for riscv module loading
  riscv: Add remaining module relocations
  riscv: Avoid unaligned access when relocating modules
  riscv: split cache ops out of dma-noncoherent.c
  riscv: Improve flush_tlb_kernel_range()
  riscv: Make __flush_tlb_range() loop over pte instead of flushing the whole tlb
  riscv: Improve flush_tlb_range() for hugetlb pages
  riscv: Improve tlb_flush()
  ...

23 months agoMerge tag 'mips_6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Linus Torvalds [Fri, 10 Nov 2023 17:19:46 +0000 (09:19 -0800)]
Merge tag 'mips_6.7' of git://git./linux/kernel/git/mips/linux

Pull MIPS updates from Thomas Bogendoerfer:

 - removed AR7 platform support

 - cleanups and fixes

* tag 'mips_6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  MIPS: AR7: remove platform
  watchdog: ar7_wdt: remove driver to prepare for platform removal
  vlynq: remove bus driver
  mtd: parsers: ar7: remove support
  serial: 8250: remove AR7 support
  arch: mips: remove ReiserFS from defconfig
  MIPS: lantiq: Remove unnecessary include of <linux/of_irq.h>
  MIPS: lantiq: Fix pcibios_plat_dev_init() "no previous prototype" warning
  MIPS: KVM: Fix a build warning about variable set but not used
  MIPS: Remove dead code in relocate_new_kernel
  mips: dts: ralink: mt7621: rename to GnuBee GB-PC1 and GnuBee GB-PC2
  mips: dts: ralink: mt7621: define each reset as an item
  mips: dts: ingenic: Remove unneeded probe-type properties
  MIPS: loongson32: Remove dma.h and nand.h

23 months ago drm/amdgpu: fix error handling in amdgpu_vm_init
Christian König [Tue, 31 Oct 2023 14:35:27 +0000 (15:35 +0100)]
drm/amdgpu: fix error handling in amdgpu_vm_init

When clearing the root PD fails we need to properly release it again.
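
The release-on-failure pattern described above can be sketched in
plain C. This is only an illustration under stated assumptions: every
name here (toy_vm, toy_root, toy_root_clear and so on) is hypothetical
and is not the amdgpu code or its API. A counter of live root objects
makes a leak observable.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a root page directory; not the amdgpu type. */
struct toy_root {
	int *entries;
};

/* Hypothetical VM object used only for this illustration. */
struct toy_vm {
	struct toy_root *root;
};

/* Number of root objects currently allocated, so leaks can be detected. */
static int toy_live_roots;

static struct toy_root *toy_root_create(void)
{
	struct toy_root *root = calloc(1, sizeof(*root));

	if (!root)
		return NULL;
	toy_live_roots++;
	return root;
}

static void toy_root_release(struct toy_root *root)
{
	if (!root)
		return;
	free(root->entries);
	free(root);
	toy_live_roots--;
}

/* Simulated clear step; the caller can force it to fail. */
static int toy_root_clear(struct toy_root *root, bool fail)
{
	if (fail)
		return -1;
	root->entries = calloc(4, sizeof(*root->entries));
	return root->entries ? 0 : -1;
}

/* Create the root, clear it, and release it again if clearing fails. */
int toy_vm_init(struct toy_vm *vm, bool fail_clear)
{
	int r;

	vm->root = toy_root_create();
	if (!vm->root)
		return -1;

	r = toy_root_clear(vm->root, fail_clear);
	if (r)
		goto error_free_root;

	return 0;

error_free_root:
	/* Without this release the root would leak on the error path. */
	toy_root_release(vm->root);
	vm->root = NULL;
	return r;
}
```

Calling toy_vm_init() with fail_clear set should return an error, leave
vm->root NULL, and leave toy_live_roots at zero.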

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

23 months ago drm/amdgpu: Fix possible null pointer dereference
Felix Kuehling [Tue, 31 Oct 2023 17:30:00 +0000 (13:30 -0400)]
drm/amdgpu: Fix possible null pointer dereference

mem = bo->tbo.resource may be NULL in amdgpu_vm_bo_update.

Fixes: 180253782038 ("drm/ttm: stop allocating dummy resources during BO creation")
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

23 months ago drm/amdgpu: move UVD and VCE sched entity init after sched init
Alex Deucher [Wed, 8 Nov 2023 14:40:44 +0000 (09:40 -0500)]
drm/amdgpu: move UVD and VCE sched entity init after sched init

We need kernel scheduling entities to deal with handle clean up
if apps are not cleaned up properly.  With commit 56e449603f0ac5
("drm/sched: Convert the GPU scheduler to variable number of run-queues")
the scheduler entities have to be created after scheduler init, so
change the ordering to fix this.

v2: Leave logic in UVD and VCE code

Fixes: 56e449603f0a ("drm/sched: Convert the GPU scheduler to variable number of run-queues")
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Luben Tuikov <ltuikov89@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: ltuikov89@gmail.com

23 months ago drm/amdgpu: move kfd_resume before the ip late init
Tim Huang [Thu, 19 Oct 2023 07:50:43 +0000 (15:50 +0800)]
drm/amdgpu: move kfd_resume before the ip late init

kfd_resume needs to touch GC registers to enable the interrupts, so it
must run before GFXOFF is enabled, while GFX is guaranteed to be on and
the GC registers can be touched. Move kfd_resume before
amdgpu_device_ip_late_init, which enables CGPG/GFXOFF.

Signed-off-by: Tim Huang <Tim.Huang@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

23 months ago drm/amd: Explicitly check for GFXOFF to be enabled for s0ix
Mario Limonciello [Thu, 9 Nov 2023 17:23:46 +0000 (11:23 -0600)]
drm/amd: Explicitly check for GFXOFF to be enabled for s0ix

If a user has disabled GFXOFF this may cause problems for the suspend
sequence.  Ensure that it is enabled in amdgpu_acpi_is_s0ix_active().

The system won't reach the deepest state but it also won't hang.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

23 months ago Merge tag 'drm-misc-fixes-2023-11-08' of git://anongit.freedesktop.org/drm/drm-misc...
Daniel Vetter [Fri, 10 Nov 2023 15:54:41 +0000 (16:54 +0100)]
Merge tag 'drm-misc-fixes-2023-11-08' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

drm-misc-fixes for v6.7-rc1:

qxl:
- qxl memory leak fix.
syncobj:
- Fix waiting for DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE
vc4:
- Fix UAF in mock helpers

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
[sima: Stitch together both changelogs from Maarten. Also because of
branch history this contains a few more bugfixes which are already in
v6.6, but I didn't feel like this justifies some backmerge since there
wasn't any real conflict.]
Link: https://patchwork.freedesktop.org/patch/msgid/bc8598ee-d427-4616-8ebd-64107ab9a2d8@linux.intel.com

23 months ago Merge tag 'drm-intel-next-fixes-2023-11-08' of git://anongit.freedesktop.org/drm...
Daniel Vetter [Fri, 10 Nov 2023 15:43:44 +0000 (16:43 +0100)]
Merge tag 'drm-intel-next-fixes-2023-11-08' of git://anongit.freedesktop.org/drm/drm-intel into drm-next

drm/i915 fixes for v6.7-rc1:
- Fix null dereference when perf interface is not available
- Fix a -Wstringop-overflow warning
- Fix a -Wformat-truncation warning in intel_tc_port_init
- Flush WC GGTT only on required platforms
- Fix MTL HBR3 rate support on C10 phy and eDP
- Fix MTL notify_guc for multi-GT
- Bump GLK CDCLK frequency when driving multiple pipes
- Fix potential spectre vulnerability

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/878r78xrxd.fsf@intel.com

23 months ago cifs: update internal module version number for cifs.ko
Steve French [Thu, 20 Jul 2023 13:30:32 +0000 (08:30 -0500)]
cifs: update internal module version number for cifs.ko

From 2.45 to 2.46

Signed-off-by: Steve French <stfrench@microsoft.com>

23 months ago cifs: handle when server stops supporting multichannel
Shyam Prasad N [Fri, 13 Oct 2023 11:40:09 +0000 (11:40 +0000)]
cifs: handle when server stops supporting multichannel

When a server stops supporting multichannel, we currently keep
attempting to reconnect the secondary channels.
Avoid this by freeing the extra channels when negotiate
returns no multichannel support.
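
As a rough sketch of the idea (not the cifs.ko implementation), the
snippet below drops every channel except the primary one when a
negotiate result reports no multichannel support. The toy_session
structure, TOY_MAX_CHANNELS, and toy_adjust_channels() are hypothetical
names introduced only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CHANNELS 4	/* hypothetical limit for this sketch */

/* Hypothetical session; channel 0 is the primary channel. */
struct toy_session {
	size_t chan_count;
	bool chan_open[TOY_MAX_CHANNELS];
};

/*
 * Called after a (simulated) negotiate. If the server no longer
 * supports multichannel, free all secondary channels so that nothing
 * keeps trying to reconnect them. Returns the number of channels freed.
 */
size_t toy_adjust_channels(struct toy_session *ses, bool server_multichannel)
{
	size_t freed = 0;

	if (server_multichannel)
		return 0;

	while (ses->chan_count > 1) {
		ses->chan_count--;
		ses->chan_open[ses->chan_count] = false;
		freed++;
	}
	return freed;
}
```

The primary channel is never touched, so the session stays usable as a
single-channel connection.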

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

23 months ago cifs: handle when server starts supporting multichannel
Shyam Prasad N [Fri, 13 Oct 2023 11:33:21 +0000 (11:33 +0000)]
cifs: handle when server starts supporting multichannel

When the user mounts with the multichannel option but the
server does not support it, the server may start supporting
it at some point in the future.

With this change, such a case is handled.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>

23 months ago Missing field not being returned in ioctl CIFS_IOC_GET_MNT_INFO
Steve French [Fri, 10 Nov 2023 07:24:16 +0000 (01:24 -0600)]
Missing field not being returned in ioctl CIFS_IOC_GET_MNT_INFO

The tcon_flags field was always being set to zero in the information
about the mount returned by the ioctl CIFS_IOC_GET_MNT_INFO instead
of being set to the value of the Flags field in the tree connection
structure as intended.

Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

23 months ago parisc: Prevent booting 64-bit kernels on PA1.x machines
Helge Deller [Fri, 10 Nov 2023 15:13:15 +0000 (16:13 +0100)]
parisc: Prevent booting 64-bit kernels on PA1.x machines

Bail out early with an error message when trying to boot a 64-bit kernel
on a 32-bit machine. This fixes the previous commit to include the check
for true 64-bit kernels as well.

Signed-off-by: Helge Deller <deller@gmx.de>
Fixes: 591d2108f3abc ("parisc: Add runtime check to prevent PA2.0 kernels on PA1.x machines")
Cc: <stable@vger.kernel.org> # v6.0+