linux.git
11 months agoKVM: x86: Fully re-initialize supported_mce_cap on vendor module load
Sean Christopherson [Tue, 23 Apr 2024 16:53:27 +0000 (09:53 -0700)]
KVM: x86: Fully re-initialize supported_mce_cap on vendor module load

Effectively reset supported_mce_cap on vendor module load to ensure that
capabilities aren't unintentionally preserved across module reload, e.g.
if kvm-intel.ko added a module param to control LMCE support, or if
someone somehow managed to load a vendor module that doesn't support LMCE
after loading and unloading kvm-intel.ko.

Practically speaking, this bug is a non-issue as kvm-intel.ko doesn't have
a module param for LMCE, and there is no system in the world that supports
both kvm-intel.ko and kvm-amd.ko.

Fixes: c45dcc71b794 ("KVM: VMX: enable guest access to LMCE related MSRs")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240423165328.2853870-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
11 months agoKVM: x86: Fully re-initialize supported_vm_types on vendor module load
Sean Christopherson [Tue, 23 Apr 2024 16:53:26 +0000 (09:53 -0700)]
KVM: x86: Fully re-initialize supported_vm_types on vendor module load

Recompute the entire set of supported VM types when a vendor module is
loaded, as preserving supported_vm_types across vendor module unload and
reload can result in VM types being incorrectly treated as supported.

E.g. if a vendor module is loaded with TDP enabled, unloaded, and then
reloaded with TDP disabled, KVM_X86_SW_PROTECTED_VM will be incorrectly
retained.  Ditto for SEV_VM and SEV_ES_VM and their respective module
params in kvm-amd.ko.

Fixes: 2a955c4db1dd ("KVM: x86: Add supported_vm_types to kvm_caps")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240423165328.2853870-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
11 months agoMerge tag 'kvm-riscv-6.10-1' of https://github.com/kvm-riscv/linux into HEAD
Paolo Bonzini [Tue, 7 May 2024 17:03:03 +0000 (13:03 -0400)]
Merge tag 'kvm-riscv-6.10-1' of https://github.com/kvm-riscv/linux into HEAD

 KVM/riscv changes for 6.10

- Support guest breakpoints using ebreak
- Introduce per-VCPU mp_state_lock and reset_cntx_lock
- Virtualize SBI PMU snapshot and counter overflow interrupts
- New selftests for SBI PMU and Guest ebreak

11 months agoKVM: riscv: selftests: Add commandline option for SBI PMU test
Atish Patra [Sat, 20 Apr 2024 15:17:40 +0000 (08:17 -0700)]
KVM: riscv: selftests: Add commandline option for SBI PMU test

SBI PMU test comprises of multiple tests and user may want to run
only a subset depending on the platform. The most common case would
be to run all to validate all the tests. However, some platform may
not support all events or all ISA extensions.

The commandline option allows user to disable any set of tests if
they want to.

Suggested-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20240420151741.962500-25-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoKVM: riscv: selftests: Add a test for counter overflow
Atish Patra [Sat, 20 Apr 2024 15:17:39 +0000 (08:17 -0700)]
KVM: riscv: selftests: Add a test for counter overflow

Add a test for verifying overflow interrupt. Currently, it relies on
overflow support on cycle/instret events. This test works for cycle/
instret events which support sampling via hpmcounters on the platform.
There are no ISA extensions to detect if a platform supports that. Thus,
this test will fail on platform with virtualization but doesn't
support overflow on these two events.

Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-24-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoKVM: riscv: selftests: Add a test for PMU snapshot functionality
Atish Patra [Sat, 20 Apr 2024 15:17:38 +0000 (08:17 -0700)]
KVM: riscv: selftests: Add a test for PMU snapshot functionality

Verify PMU snapshot functionality by setting up the shared memory
correctly and reading the counter values from the shared memory
instead of the CSR.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-23-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoKVM: riscv: selftests: Add SBI PMU selftest
Atish Patra [Sat, 20 Apr 2024 15:17:37 +0000 (08:17 -0700)]
KVM: riscv: selftests: Add SBI PMU selftest

This test implements basic sanity test and cycle/instret event
counting tests.

Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-22-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoKVM: riscv: selftests: Add SBI PMU extension definitions
Atish Patra [Sat, 20 Apr 2024 15:17:36 +0000 (08:17 -0700)]
KVM: riscv: selftests: Add SBI PMU extension definitions

The SBI PMU extension definition is required for upcoming SBI PMU
selftests.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-21-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoKVM: riscv: selftests: Add Sscofpmf to get-reg-list test
Atish Patra [Sat, 20 Apr 2024 15:17:35 +0000 (08:17 -0700)]
KVM: riscv: selftests: Add Sscofpmf to get-reg-list test

The KVM RISC-V allows Sscofpmf extension for Guest/VM so let us
add this extension to get-reg-list test.

Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-20-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoKVM: riscv: selftests: Add helper functions for extension checks
Atish Patra [Sat, 20 Apr 2024 15:17:34 +0000 (08:17 -0700)]
KVM: riscv: selftests: Add helper functions for extension checks

__vcpu_has_ext can check both SBI and ISA extensions when the first
argument is properly converted to SBI/ISA extension IDs. Introduce
two helper functions to make life easier for developers so they
don't have to worry about the conversions.

Replace the current usages as well with new helpers.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20240420151741.962500-19-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoKVM: riscv: selftests: Move sbi definitions to its own header file
Atish Patra [Sat, 20 Apr 2024 15:17:33 +0000 (08:17 -0700)]
KVM: riscv: selftests: Move sbi definitions to its own header file

The SBI definitions will continue to grow. Move the sbi related
definitions to its own header file from processor.h

Suggested-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20240420151741.962500-18-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoRISC-V: KVM: Improve firmware counter read function
Atish Patra [Sat, 20 Apr 2024 15:17:32 +0000 (08:17 -0700)]
RISC-V: KVM: Improve firmware counter read function

Rename the function to indicate that it is meant for firmware
counter read. While at it, add a range sanity check for it as
well.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20240420151741.962500-17-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoRISC-V: KVM: Support 64 bit firmware counters on RV32
Atish Patra [Sat, 20 Apr 2024 15:17:31 +0000 (08:17 -0700)]
RISC-V: KVM: Support 64 bit firmware counters on RV32

The SBI v2.0 introduced a fw_read_hi function to read 64 bit firmware
counters for RV32 based systems.

Add infrastructure to support that.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-16-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoRISC-V: KVM: Add perf sampling support for guests
Atish Patra [Sat, 20 Apr 2024 15:17:30 +0000 (08:17 -0700)]
RISC-V: KVM: Add perf sampling support for guests

KVM enables perf for guest via counter virtualization. However, the
sampling can not be supported as there is no mechanism to enabled
trap/emulate scountovf in ISA yet. Rely on the SBI PMU snapshot
to provide the counter overflow data via the shared memory.

In case of sampling event, the host first sets the guest's LCOFI
interrupt and injects to the guest via irq filtering mechanism defined
in AIA specification. Thus, ssaia must be enabled in the host in order
to use perf sampling in the guest. No other AIA dependency w.r.t kernel
is required.

Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-15-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoRISC-V: KVM: Implement SBI PMU Snapshot feature
Atish Patra [Sat, 20 Apr 2024 15:17:29 +0000 (08:17 -0700)]
RISC-V: KVM: Implement SBI PMU Snapshot feature

PMU Snapshot function allows to minimize the number of traps when the
guest access configures/access the hpmcounters. If the snapshot feature
is enabled, the hypervisor updates the shared memory with counter
data and state of overflown counters. The guest can just read the
shared memory instead of trap & emulate done by the hypervisor.

This patch doesn't implement the counter overflow yet.

Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-14-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoRISC-V: KVM: No need to exit to the user space if perf event failed
Atish Patra [Sat, 20 Apr 2024 15:17:28 +0000 (08:17 -0700)]
RISC-V: KVM: No need to exit to the user space if perf event failed

Currently, we return a linux error code if creating a perf event failed
in kvm. That shouldn't be necessary as guest can continue to operate
without perf profiling or profiling with firmware counters.

Return appropriate SBI error code to indicate that PMU configuration
failed. An error message in kvm already describes the reason for failure.

Fixes: 0cb74b65d2e5 ("RISC-V: KVM: Implement perf support without sampling")
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-13-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoRISC-V: KVM: No need to update the counter value during reset
Atish Patra [Sat, 20 Apr 2024 15:17:27 +0000 (08:17 -0700)]
RISC-V: KVM: No need to update the counter value during reset

The virtual counter value is updated during pmu_ctr_read. There is no need
to update it in reset case. Otherwise, it will be counted twice which is
incorrect.

Fixes: 0cb74b65d2e5 ("RISC-V: KVM: Implement perf support without sampling")
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-12-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoRISC-V: KVM: Fix the initial sample period value
Atish Patra [Sat, 20 Apr 2024 15:17:26 +0000 (08:17 -0700)]
RISC-V: KVM: Fix the initial sample period value

The initial sample period value when counter value is not assigned
should be set to maximum value supported by the counter width.
Otherwise, it may result in spurious interrupts.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20240420151741.962500-11-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agodrivers/perf: riscv: Implement SBI PMU snapshot function
Atish Patra [Sat, 20 Apr 2024 15:17:25 +0000 (08:17 -0700)]
drivers/perf: riscv: Implement SBI PMU snapshot function

SBI v2.0 SBI introduced PMU snapshot feature which adds the following
features.

1. Read counter values directly from the shared memory instead of
csr read.
2. Start multiple counters with initial values with one SBI call.

These functionalities optimizes the number of traps to the higher
privilege mode. If the kernel is in VS mode while the hypervisor
deploy trap & emulate method, this would minimize all the hpmcounter
CSR read traps. If the kernel is running in S-mode, the benefits
reduced to CSR latency vs DRAM/cache latency as there is no trap
involved while accessing the hpmcounter CSRs.

In both modes, it does saves the number of ecalls while starting
multiple counter together with an initial values. This is a likely
scenario if multiple counters overflow at the same time.

Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-10-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agodrivers/perf: riscv: Fix counter mask iteration for RV32
Atish Patra [Sat, 20 Apr 2024 15:17:24 +0000 (08:17 -0700)]
drivers/perf: riscv: Fix counter mask iteration for RV32

For RV32, used_hw_ctrs can have more than 1 word if the firmware chooses
to interleave firmware/hardware counters indicies. Even though it's a
unlikely scenario, handle that case by iterating over all the words
instead of just using the first word.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-9-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoRISC-V: Use the minor version mask while computing sbi version
Atish Patra [Sat, 20 Apr 2024 15:17:23 +0000 (08:17 -0700)]
RISC-V: Use the minor version mask while computing sbi version

As per the SBI specification, minor version is encoded in the
lower 24 bits only. Make sure that the SBI version is computed
with the appropriate mask.

Currently, there is no minor version in use. Thus, it doesn't
change anything functionality but it is good to be compliant with
the specification.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-8-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoRISC-V: KVM: Rename the SBI_STA_SHMEM_DISABLE to a generic name
Atish Patra [Sat, 20 Apr 2024 15:17:22 +0000 (08:17 -0700)]
RISC-V: KVM: Rename the SBI_STA_SHMEM_DISABLE to a generic name

SBI_STA_SHMEM_DISABLE is a macro to invoke disable shared memory
commands. As this can be invoked from other SBI extension context
as well, rename it to more generic name as SBI_SHMEM_DISABLE.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-7-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoRISC-V: Add SBI PMU snapshot definitions
Atish Patra [Sat, 20 Apr 2024 15:17:21 +0000 (08:17 -0700)]
RISC-V: Add SBI PMU snapshot definitions

SBI PMU Snapshot function optimizes the number of traps to
higher privilege mode by leveraging a shared memory between the S/VS-mode
and the M/HS mode. Add the definitions for that extension and new error
codes.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-6-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agodrivers/perf: riscv: Use BIT macro for shifting operations
Atish Patra [Sat, 20 Apr 2024 15:17:20 +0000 (08:17 -0700)]
drivers/perf: riscv: Use BIT macro for shifting operations

It is a good practice to use BIT() instead of (1 << x).
Replace the current usages with BIT().

Take this opportunity to replace few (1UL << x) with BIT() as well
for consistency.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-5-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agodrivers/perf: riscv: Read upper bits of a firmware counter
Atish Patra [Sat, 20 Apr 2024 15:17:19 +0000 (08:17 -0700)]
drivers/perf: riscv: Read upper bits of a firmware counter

SBI v2.0 introduced a explicit function to read the upper 32 bits
for any firmware counter width that is longer than 32bits.
This is only applicable for RV32 where firmware counter can be
64 bit.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-4-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoRISC-V: Add FIRMWARE_READ_HI definition
Atish Patra [Sat, 20 Apr 2024 15:17:18 +0000 (08:17 -0700)]
RISC-V: Add FIRMWARE_READ_HI definition

SBI v2.0 added another function to SBI PMU extension to read
the upper bits of a counter with width larger than XLEN.

Add the definition for that function.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Clément Léger <cleger@rivosinc.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-3-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoRISC-V: Fix the typo in Scountovf CSR name
Atish Patra [Sat, 20 Apr 2024 15:17:17 +0000 (08:17 -0700)]
RISC-V: Fix the typo in Scountovf CSR name

The counter overflow CSR name is "scountovf" not "sscountovf".

Fix the csr name.

Fixes: 4905ec2fb7e6 ("RISC-V: Add sscofpmf extension support")
Reviewed-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-2-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoRISCV: KVM: Introduce vcpu->reset_cntx_lock
Yong-Xuan Wang [Wed, 17 Apr 2024 07:45:26 +0000 (15:45 +0800)]
RISCV: KVM: Introduce vcpu->reset_cntx_lock

Originally, the use of kvm->lock in SBI_EXT_HSM_HART_START also avoids
the simultaneous updates to the reset context of target VCPU. Since this
lock has been replace with vcpu->mp_state_lock, and this new lock also
protects the vcpu->mp_state. We have to add a separate lock for
vcpu->reset_cntx.

Signed-off-by: Yong-Xuan Wang <yongxuan.wang@sifive.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20240417074528.16506-3-yongxuan.wang@sifive.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoRISCV: KVM: Introduce mp_state_lock to avoid lock inversion
Yong-Xuan Wang [Wed, 17 Apr 2024 07:45:25 +0000 (15:45 +0800)]
RISCV: KVM: Introduce mp_state_lock to avoid lock inversion

Documentation/virt/kvm/locking.rst advises that kvm->lock should be
acquired outside vcpu->mutex and kvm->srcu. However, when KVM/RISC-V
handling SBI_EXT_HSM_HART_START, the lock ordering is vcpu->mutex,
kvm->srcu then kvm->lock.

Although the lockdep checking no longer complains about this after commit
f0f44752f5f6 ("rcu: Annotate SRCU's update-side lockdep dependencies"),
it's necessary to replace kvm->lock with a new dedicated lock to ensure
only one hart can execute the SBI_EXT_HSM_HART_START call for the target
hart simultaneously.

Additionally, this patch also rename "power_off" to "mp_state" with two
possible values. The vcpu->mp_state_lock also protects the access of
vcpu->mp_state.

Signed-off-by: Yong-Xuan Wang <yongxuan.wang@sifive.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20240417074528.16506-2-yongxuan.wang@sifive.com
Signed-off-by: Anup Patel <anup@brainfault.org>
11 months agoMerge x86 bugfixes from Linux 6.9-rc3
Paolo Bonzini [Fri, 19 Apr 2024 12:57:14 +0000 (08:57 -0400)]
Merge x86 bugfixes from Linux 6.9-rc3

Pull fix for SEV-SNP late disable bugs.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: SEV: use u64_to_user_ptr throughout
Paolo Bonzini [Mon, 26 Feb 2024 18:42:55 +0000 (13:42 -0500)]
KVM: SEV: use u64_to_user_ptr throughout

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument
Sean Christopherson [Mon, 22 Jan 2024 23:54:00 +0000 (15:54 -0800)]
KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument

TDX uses different ABI to get information about VM exit.  Pass intr_info to
the NMI and INTR handlers instead of pulling it from vcpu_vmx in
preparation for sharing the bulk of the handlers with TDX.

When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds
exit qualification etc rather than the VMCS fields because VMM doesn't have
access to the VMCS.  The eventual code will be

VMX:
  - get exit reason, intr_info, exit_qualification, and etc from VMCS
  - call NMI/INTR handlers (common code)

TDX:
  - get exit reason, intr_info, exit_qualification, and etc from guest
    registers
  - call NMI/INTR handlers (common code)

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDX
Paolo Bonzini [Mon, 18 Mar 2024 15:39:16 +0000 (11:39 -0400)]
KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDX

KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions
to operate on VM.  TDX doesn't allow VMM to operate VMCS directly.
Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to
indirectly operate those data structures.  This means we must have a TDX
version of kvm_x86_ops.

The existing global struct kvm_x86_ops already defines an interface which
can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM
structure.  To allow VMX to coexist with TDs, the kvm_x86_ops callbacks
will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or
TDX at run time.

To split the runtime switch, the VMX implementation, and the TDX
implementation, add main.c, and move out the vmx_x86_ops hooks in
preparation for adding TDX.  Use 'vt' for the naming scheme as a nod to
VT-x and as a concatenation of VmxTdx.

The eventually converted code will look like this:

vmx.c:
  vmx_op() { ... }
  VMX initialization
tdx.c:
  tdx_op() { ... }
  TDX initialization
x86_ops.h:
  vmx_op();
  tdx_op();
main.c:
  static vt_op() { if (tdx) tdx_op() else vmx_op() }
  static struct kvm_x86_ops vt_x86_ops = {
        .op = vt_op,
  initialization functions call both VMX and TDX initialization

Opportunistically, fix the name inconsistency from vmx_create_vcpu() and
vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Message-Id: <e603c317587f933a9d1bee8728c84e4935849c16.1705965634.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: x86: Split core of hypercall emulation to helper function
Sean Christopherson [Mon, 22 Jan 2024 23:54:02 +0000 (15:54 -0800)]
KVM: x86: Split core of hypercall emulation to helper function

By necessity, TDX will use a different register ABI for hypercalls.
Break out the core functionality so that it may be reused for TDX.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <5134caa55ac3dec33fb2addb5545b52b3b52db02.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoMerge branch 'kvm-sev-init2' into HEAD
Paolo Bonzini [Thu, 4 Apr 2024 13:17:09 +0000 (09:17 -0400)]
Merge branch 'kvm-sev-init2' into HEAD

The idea that no parameter would ever be necessary when enabling SEV or
SEV-ES for a VM was decidedly optimistic.  The first source of variability
that was encountered is the desired set of VMSA features, as that affects
the measurement of the VM's initial state and cannot be changed
arbitrarily by the hypervisor.

This series adds all the APIs that are needed to customize the features,
with room for future enhancements:

- a new /dev/kvm device attribute to retrieve the set of supported
  features (right now, only debug swap)

- a new sub-operation for KVM_MEM_ENCRYPT_OP that can take a struct,
  replacing the existing KVM_SEV_INIT and KVM_SEV_ES_INIT

It then puts the new op to work by including the VMSA features as a field
of the The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of
supported VMSA features for backwards compatibility; but I am considering
also making them use zero as the feature mask, and will gladly adjust the
patches if so requested.

In order to avoid creating *two* new KVM_MEM_ENCRYPT_OPs, I decided that
I could as well make SEV and SEV-ES use VM types.  This allows SEV-SNP
to reuse the KVM_SEV_INIT2 ioctl.

And while at it, KVM_SEV_INIT2 also includes two bugfixes.  First of all,
SEV-ES VM, when created with the new VM type instead of KVM_SEV_ES_INIT,
reject KVM_GET_REGS/KVM_SET_REGS and friends on the vCPU file descriptor
once the VMSA has been encrypted...  which is how the API should have
always behaved.  Second, they also synchronize the FPU and AVX state.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoMerge branch 'mm-delete-change-gpte' into HEAD
Paolo Bonzini [Thu, 11 Apr 2024 17:06:24 +0000 (13:06 -0400)]
Merge branch 'mm-delete-change-gpte' into HEAD

The .change_pte() MMU notifier callback was intended as an optimization
and for this reason it was initially called without a surrounding
mmu_notifier_invalidate_range_{start,end}() pair.  It was only ever
implemented by KVM (which was also the original user of MMU notifiers)
and the rules on when to call set_pte_at_notify() rather than set_pte_at()
have always been pretty obscure.

It may seem a miracle that it has never caused any hard to trigger
bugs, but there's a good reason for that: KVM's implementation has
been nonfunctional for a good part of its existence.  Already in
2012, commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end", 2012-10-09) changed the
.change_pte() callback to occur within an invalidate_range_start/end()
pair; and because KVM unmaps the sPTEs during .invalidate_range_start(),
.change_pte() has no hope of finding a sPTE to change.

Therefore, all the code for .change_pte() can be removed from both KVM
and mm/, and set_pte_at_notify() can be replaced with just set_pte_at().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agomm: replace set_pte_at_notify() with just set_pte_at()
Paolo Bonzini [Fri, 5 Apr 2024 11:58:15 +0000 (07:58 -0400)]
mm: replace set_pte_at_notify() with just set_pte_at()

With the demise of the .change_pte() MMU notifier callback, there is no
notification happening in set_pte_at_notify().  It is a synonym of
set_pte_at() and can be replaced with it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20240405115815.3226315-5-pbonzini@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agommu_notifier: remove the .change_pte() callback
Paolo Bonzini [Fri, 5 Apr 2024 11:58:14 +0000 (07:58 -0400)]
mmu_notifier: remove the .change_pte() callback

The scope of set_pte_at_notify() has reduced more and more through the
years.  Initially, it was meant for when the change to the PTE was
not bracketed by mmu_notifier_invalidate_range_{start,end}().  However,
that has not been so for over ten years.  During all this period
the only implementation of .change_pte() was KVM and it
had no actual functionality, because it was called after
mmu_notifier_invalidate_range_start() zapped the secondary PTE.

Now that this (nonfunctional) user of the .change_pte() callback is
gone, the whole callback can be removed.  For now, leave in place
set_pte_at_notify() even though it is just a synonym for set_pte_at().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-ID: <20240405115815.3226315-4-pbonzini@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: remove unused argument of kvm_handle_hva_range()
Paolo Bonzini [Fri, 5 Apr 2024 11:58:13 +0000 (07:58 -0400)]
KVM: remove unused argument of kvm_handle_hva_range()

The only user was kvm_mmu_notifier_change_pte(), which is now gone.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20240405115815.3226315-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: delete .change_pte MMU notifier callback
Paolo Bonzini [Fri, 5 Apr 2024 11:58:12 +0000 (07:58 -0400)]
KVM: delete .change_pte MMU notifier callback

The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().

Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.

In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.

This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Anup Patel <anup@brainfault.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoselftests: kvm: add test for transferring FPU state into VMSA
Paolo Bonzini [Thu, 4 Apr 2024 12:13:27 +0000 (08:13 -0400)]
selftests: kvm: add test for transferring FPU state into VMSA

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-18-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoselftests: kvm: split "launch" phase of SEV VM creation
Paolo Bonzini [Thu, 4 Apr 2024 12:13:26 +0000 (08:13 -0400)]
selftests: kvm: split "launch" phase of SEV VM creation

Allow the caller to set the initial state of the VM.  Doing this
before sev_vm_launch() matters for SEV-ES, since that is the
place where the VMSA is updated and after which the guest state
becomes sealed.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-17-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoselftests: kvm: switch to using KVM_X86_*_VM
Paolo Bonzini [Thu, 4 Apr 2024 12:13:25 +0000 (08:13 -0400)]
selftests: kvm: switch to using KVM_X86_*_VM

This removes the concept of "subtypes", instead letting the tests use proper
VM types that were recently added.  While the sev_init_vm() and sev_es_init_vm()
are still able to operate with the legacy KVM_SEV_INIT and KVM_SEV_ES_INIT
ioctls, this is limited to VMs that are created manually with
vm_create_barebones().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-16-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoselftests: kvm: add tests for KVM_SEV_INIT2
Paolo Bonzini [Thu, 4 Apr 2024 12:13:24 +0000 (08:13 -0400)]
selftests: kvm: add tests for KVM_SEV_INIT2

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-15-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: SEV: allow SEV-ES DebugSwap again
Paolo Bonzini [Thu, 4 Apr 2024 12:13:23 +0000 (08:13 -0400)]
KVM: SEV: allow SEV-ES DebugSwap again

The DebugSwap feature of SEV-ES provides a way for confidential guests
to use data breakpoints.  Its status is record in VMSA, and therefore
attestation signatures depend on whether it is enabled or not.  In order
to avoid invalidating the signatures depending on the host machine, it
was disabled by default (see commit 5abf6dceb066, "SEV: disable SEV-ES
DebugSwap by default", 2024-03-09).

However, we now have a new API to create SEV VMs that allows enabling
DebugSwap based on what the user tells KVM to do, and we also changed the
legacy KVM_SEV_ES_INIT API to never enable DebugSwap.  It is therefore
possible to re-enable the feature without breaking compatibility with
kernels that pre-date the introduction of DebugSwap, so go ahead.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-14-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: SEV: introduce KVM_SEV_INIT2 operation
Paolo Bonzini [Thu, 4 Apr 2024 12:13:22 +0000 (08:13 -0400)]
KVM: SEV: introduce KVM_SEV_INIT2 operation

The idea that no parameter would ever be necessary when enabling SEV or
SEV-ES for a VM was decidedly optimistic.  In fact, in some sense it's
already a parameter whether SEV or SEV-ES is desired.  Another possible
source of variability is the desired set of VMSA features, as that affects
the measurement of the VM's initial state and cannot be changed
arbitrarily by the hypervisor.

Create a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct,
and put the new op to work by including the VMSA features as a field of the
struct.  The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of
supported VMSA features for backwards compatibility.

The struct also includes the usual bells and whistles for future
extensibility: a flags field that must be zero for now, and some padding
at the end.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-13-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: SEV: sync FPU and AVX state at LAUNCH_UPDATE_VMSA time
Paolo Bonzini [Thu, 4 Apr 2024 12:13:21 +0000 (08:13 -0400)]
KVM: SEV: sync FPU and AVX state at LAUNCH_UPDATE_VMSA time

SEV-ES allows passing custom contents for x87, SSE and AVX state into the VMSA.
Allow userspace to do that with the usual KVM_SET_XSAVE API and only mark
FPU contents as confidential after it has been copied and encrypted into
the VMSA.

Since the XSAVE state for AVX is the first, it does not need the
compacted-state handling of get_xsave_addr().  However, there are other
parts of XSAVE state in the VMSA that currently are not handled, and
the validation logic of get_xsave_addr() is pointless to duplicate
in KVM, so move get_xsave_addr() to public FPU API; it is really just
a facility to operate on XSAVE state and does not expose any internal
details of arch/x86/kernel/fpu.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-12-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: SEV: define VM types for SEV and SEV-ES
Paolo Bonzini [Thu, 4 Apr 2024 12:13:20 +0000 (08:13 -0400)]
KVM: SEV: define VM types for SEV and SEV-ES

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-11-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: SEV: introduce to_kvm_sev_info
Paolo Bonzini [Thu, 4 Apr 2024 12:13:19 +0000 (08:13 -0400)]
KVM: SEV: introduce to_kvm_sev_info

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-10-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: x86: Add supported_vm_types to kvm_caps
Paolo Bonzini [Thu, 4 Apr 2024 12:13:18 +0000 (08:13 -0400)]
KVM: x86: Add supported_vm_types to kvm_caps

This simplifies the implementation of KVM_CHECK_EXTENSION(KVM_CAP_VM_TYPES),
and also allows the vendor module to specify which VM types are supported.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-9-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: x86: add fields to struct kvm_arch for CoCo features
Paolo Bonzini [Thu, 4 Apr 2024 12:13:17 +0000 (08:13 -0400)]
KVM: x86: add fields to struct kvm_arch for CoCo features

Some VM types have characteristics in common; in fact, the only use
of VM types right now is kvm_arch_has_private_mem and it assumes that
_all_ nonzero VM types have private memory.

We will soon introduce a VM type for SEV and SEV-ES VMs, and at that
point we will have two special characteristics of confidential VMs
that depend on the VM type: not just if memory is private, but
also whether guest state is protected.  For the latter we have
kvm->arch.guest_state_protected, which is only set on a fully initialized
VM.

For VM types with protected guest state, we can actually fix a problem in
the SEV-ES implementation, where ioctls to set registers do not cause an
error even if the VM has been initialized and the guest state encrypted.
Make sure that when using VM types that will become an error.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20240209183743.22030-7-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-8-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: SEV: store VMSA features in kvm_sev_info
Paolo Bonzini [Thu, 4 Apr 2024 12:13:16 +0000 (08:13 -0400)]
KVM: SEV: store VMSA features in kvm_sev_info

Right now, the set of features that are stored in the VMSA upon
initialization is fixed and depends on the module parameters for
kvm-amd.ko.  However, the hypervisor cannot really change it at will
because the feature word has to match between the hypervisor and whatever
computes a measurement of the VMSA for attestation purposes.

Add a field to kvm_sev_info that holds the set of features to be stored
in the VMSA; and query it instead of referring to the module parameters.

Because KVM_SEV_INIT and KVM_SEV_ES_INIT accept no parameters, this
does not yet introduce any functional change, but it paves the way for
an API that allows customization of the features per-VM.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20240209183743.22030-6-pbonzini@redhat.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-7-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: SEV: publish supported VMSA features
Paolo Bonzini [Thu, 4 Apr 2024 12:13:15 +0000 (08:13 -0400)]
KVM: SEV: publish supported VMSA features

Compute the set of features to be stored in the VMSA when KVM is
initialized; move it from there into kvm_sev_info when SEV is initialized,
and then into the initial VMSA.

The new variable can then be used to return the set of supported features
to userspace, via the KVM_GET_DEVICE_ATTR ioctl.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-6-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: introduce new vendor op for KVM_GET_DEVICE_ATTR
Paolo Bonzini [Thu, 4 Apr 2024 12:13:14 +0000 (08:13 -0400)]
KVM: introduce new vendor op for KVM_GET_DEVICE_ATTR

Allow vendor modules to provide their own attributes on /dev/kvm.
To avoid proliferation of vendor ops, implement KVM_HAS_DEVICE_ATTR
and KVM_GET_DEVICE_ATTR in terms of the same function.  You're not
supposed to use KVM_GET_DEVICE_ATTR to do complicated computations,
especially on /dev/kvm.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: x86: use u64_to_user_ptr()
Paolo Bonzini [Thu, 4 Apr 2024 12:13:13 +0000 (08:13 -0400)]
KVM: x86: use u64_to_user_ptr()

There is no danger to the kernel if 32-bit userspace provides a 64-bit
value that has the high bits set, but for whatever reason happens to
resolve to an address that has something mapped there.  KVM uses the
checked version of get_user() and put_user(), so any faults are caught
properly.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: SVM: Compile sev.c if and only if CONFIG_KVM_AMD_SEV=y
Paolo Bonzini [Thu, 4 Apr 2024 12:13:12 +0000 (08:13 -0400)]
KVM: SVM: Compile sev.c if and only if CONFIG_KVM_AMD_SEV=y

Stop compiling sev.c when CONFIG_KVM_AMD_SEV=n, as the number of #ifdefs
in sev.c is getting ridiculous, and having #ifdefs inside of SEV helpers
is quite confusing.

To minimize #ifdefs in code flows, #ifdef away only the kvm_x86_ops hooks
and the #VMGEXIT handler. Stubs are also restricted to functions that
check sev_enabled and to the destruction functions sev_free_cpu() and
sev_vm_destroy(), where the style of their callers is to leave checks
to the callers.  Most call sites instead rely on dead code elimination
to take care of functions that are guarded with sev_guest() or
sev_es_guest().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoKVM: SVM: Invert handling of SEV and SEV_ES feature flags
Sean Christopherson [Thu, 4 Apr 2024 12:13:11 +0000 (08:13 -0400)]
KVM: SVM: Invert handling of SEV and SEV_ES feature flags

Leave SEV and SEV_ES '0' in kvm_cpu_caps by default, and instead set them
in sev_set_cpu_caps() if SEV and SEV-ES support are fully enabled.  Aside
from the fact that sev_set_cpu_caps() is wildly misleading when it *clears*
capabilities, this will allow compiling out sev.c without falsely
advertising SEV/SEV-ES support in KVM_GET_SUPPORTED_CPUID.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 months agoRISC-V: KVM: selftests: Add ebreak test support
Chao Du [Tue, 2 Apr 2024 06:26:28 +0000 (06:26 +0000)]
RISC-V: KVM: selftests: Add ebreak test support

Initial support for RISC-V KVM ebreak test. Check the exit reason and
the PC when guest debug is enabled. Also to make sure the guest could
handle the ebreak exception without exiting to the VMM when guest debug
is not enabled.

Signed-off-by: Chao Du <duchao@eswincomputing.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20240402062628.5425-4-duchao@eswincomputing.com
Signed-off-by: Anup Patel <anup@brainfault.org>
12 months agoRISC-V: KVM: Handle breakpoint exits for VCPU
Chao Du [Tue, 2 Apr 2024 06:26:27 +0000 (06:26 +0000)]
RISC-V: KVM: Handle breakpoint exits for VCPU

Exit to userspace for breakpoint traps. Set the exit_reason as
KVM_EXIT_DEBUG before exit.

Signed-off-by: Chao Du <duchao@eswincomputing.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20240402062628.5425-3-duchao@eswincomputing.com
Signed-off-by: Anup Patel <anup@brainfault.org>
12 months agoRISC-V: KVM: Implement kvm_arch_vcpu_ioctl_set_guest_debug()
Chao Du [Tue, 2 Apr 2024 06:26:26 +0000 (06:26 +0000)]
RISC-V: KVM: Implement kvm_arch_vcpu_ioctl_set_guest_debug()

kvm_vm_ioctl_check_extension(): Return 1 if KVM_CAP_SET_GUEST_DEBUG is
been checked.

kvm_arch_vcpu_ioctl_set_guest_debug(): Update the guest_debug flags
from userspace accordingly. Route the breakpoint exceptions to HS mode
if the VCPU is being debugged by userspace, by clearing the
corresponding bit in hedeleg.

Initialize the hedeleg configuration in kvm_riscv_vcpu_setup_config().
Write the actual CSR in kvm_arch_vcpu_load().

Signed-off-by: Chao Du <duchao@eswincomputing.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20240402062628.5425-2-duchao@eswincomputing.com
Signed-off-by: Anup Patel <anup@brainfault.org>
12 months agoLinux 6.9-rc3
Linus Torvalds [Sun, 7 Apr 2024 20:22:46 +0000 (13:22 -0700)]
Linux 6.9-rc3

12 months agoMerge tag 'x86-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 7 Apr 2024 16:33:21 +0000 (09:33 -0700)]
Merge tag 'x86-urgent-2024-04-07' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:

 - Fix MCE timer reinit locking

 - Fix/improve CoCo guest random entropy pool init

 - Fix SEV-SNP late disable bugs

 - Fix false positive objtool build warning

 - Fix header dependency bug

 - Fix resctrl CPU offlining bug

* tag 'x86-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/retpoline: Add NOENDBR annotation to the SRSO dummy return thunk
  x86/mce: Make sure to grab mce_sysfs_mutex in set_bank()
  x86/CPU/AMD: Track SNP host status with cc_platform_*()
  x86/cc: Add cc_platform_set/_clear() helpers
  x86/kvm/Kconfig: Have KVM_AMD_SEV select ARCH_HAS_CC_PLATFORM
  x86/coco: Require seeding RNG with RDRAND on CoCo systems
  x86/numa/32: Include missing <asm/pgtable_areas.h>
  x86/resctrl: Fix uninitialized memory read when last CPU of domain goes offline

12 months agoMerge tag 'timers-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 7 Apr 2024 16:20:50 +0000 (09:20 -0700)]
Merge tag 'timers-urgent-2024-04-07' of git://git./linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:
 "Fix various timer bugs:

   - Fix a timer migration bug that may result in missed events

   - Fix timer migration group hierarchy event updates

   - Fix a PowerPC64 build warning

   - Fix a handful of DocBook annotation bugs"

* tag 'timers-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Return early on deactivation
  timers/migration: Fix ignored event due to missing CPU update
  vdso: Use CONFIG_PAGE_SHIFT in vdso/datapage.h
  timers: Fix text inconsistencies and spelling
  tick/sched: Fix struct tick_sched doc warnings
  tick/sched: Fix various kernel-doc warnings
  timers: Fix kernel-doc format and add Return values
  time/timekeeping: Fix kernel-doc warnings and typos
  time/timecounter: Fix inline documentation

12 months agoMerge tag 'perf-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 7 Apr 2024 16:14:46 +0000 (09:14 -0700)]
Merge tag 'perf-urgent-2024-04-07' of git://git./linux/kernel/git/tip/tip

Pull x86 perf fix from Ingo Molnar:
 "Fix a combined PEBS events bug on x86 Intel CPUs"

* tag 'perf-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/ds: Don't clear ->pebs_data_cfg for the last PEBS event

12 months agoMerge tag 'nfsd-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Linus Torvalds [Sat, 6 Apr 2024 16:37:50 +0000 (09:37 -0700)]
Merge tag 'nfsd-6.9-2' of git://git./linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Address a slow memory leak with RPC-over-TCP

 - Prevent another NFS4ERR_DELAY loop during CREATE_SESSION

* tag 'nfsd-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: hold a lighter-weight client reference over CB_RECALL_ANY
  SUNRPC: Fix a slow server-side memory leak with RPC-over-TCP

12 months agoMerge tag 'i2c-for-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Sat, 6 Apr 2024 16:27:36 +0000 (09:27 -0700)]
Merge tag 'i2c-for-6.9-rc3' of git://git./linux/kernel/git/wsa/linux

Pull i2c fix from Wolfram Sang:
 "A host driver build fix"

* tag 'i2c-for-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: pxa: hide unused icr_bits[] variable

12 months agoMerge tag 'xfs-6.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Linus Torvalds [Sat, 6 Apr 2024 16:14:18 +0000 (09:14 -0700)]
Merge tag 'xfs-6.9-fixes-2' of git://git./fs/xfs/xfs-linux

Pull xfs fix from Chandan Babu:

 - Allow creating new links to special files which were not associated
   with a project quota

* tag 'xfs-6.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: allow cross-linking special files without project quota

12 months agoMerge tag '6.9-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sat, 6 Apr 2024 16:06:17 +0000 (09:06 -0700)]
Merge tag '6.9-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - fix to retry close to avoid potential handle leaks when server
   returns EBUSY

 - DFS fixes including a fix for potential use after free

 - fscache fix

 - minor strncpy cleanup

 - reconnect race fix

 - deal with various possible UAF race conditions tearing sessions down

* tag '6.9-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: fix potential UAF in cifs_signal_cifsd_for_reconnect()
  smb: client: fix potential UAF in smb2_is_network_name_deleted()
  smb: client: fix potential UAF in is_valid_oplock_break()
  smb: client: fix potential UAF in smb2_is_valid_oplock_break()
  smb: client: fix potential UAF in smb2_is_valid_lease_break()
  smb: client: fix potential UAF in cifs_stats_proc_show()
  smb: client: fix potential UAF in cifs_stats_proc_write()
  smb: client: fix potential UAF in cifs_dump_full_key()
  smb: client: fix potential UAF in cifs_debug_files_proc_show()
  smb3: retrying on failed server close
  smb: client: serialise cifs_construct_tcon() with cifs_mount_mutex
  smb: client: handle DFS tcons in cifs_construct_tcon()
  smb: client: refresh referral without acquiring refpath_lock
  smb: client: guarantee refcounted children from parent session
  cifs: Fix caching to try to do open O_WRONLY as rdwr on server
  smb: client: fix UAF in smb2_reconnect_server()
  smb: client: replace deprecated strncpy with strscpy

12 months agox86/retpoline: Add NOENDBR annotation to the SRSO dummy return thunk
Borislav Petkov (AMD) [Fri, 5 Apr 2024 14:46:37 +0000 (16:46 +0200)]
x86/retpoline: Add NOENDBR annotation to the SRSO dummy return thunk

srso_alias_untrain_ret() is special code, even if it is a dummy
which is called in the !SRSO case, so annotate it like its real
counterpart, to address the following objtool splat:

  vmlinux.o: warning: objtool: .export_symbol+0x2b290: data relocation to !ENDBR: srso_alias_untrain_ret+0x0

Fixes: 4535e1a4174c ("x86/bugs: Fix the SRSO mitigation on Zen3/4")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240405144637.17908-1-bp@kernel.org
12 months agoMerge branch 'linus' into x86/urgent, to pick up dependent commit
Ingo Molnar [Sat, 6 Apr 2024 11:00:32 +0000 (13:00 +0200)]
Merge branch 'linus' into x86/urgent, to pick up dependent commit

We want to fix:

  0e110732473e ("x86/retpoline: Do the necessary fixup to the Zen3/4 srso return thunk for !SRSO")

So merge in Linus's latest into x86/urgent to have it available.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
12 months agoMerge tag 'i2c-host-fixes-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Wolfram Sang [Sat, 6 Apr 2024 09:29:15 +0000 (11:29 +0200)]
Merge tag 'i2c-host-fixes-6.9-rc3' of git://git./linux/kernel/git/andi.shyti/linux into i2c/for-current

An unused const variable kind of error has been fixed by placing
the definition of icr_bits[] inside the ifdef block where it is
used.

12 months agoMerge tag 'firewire-fixes-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 6 Apr 2024 04:25:31 +0000 (21:25 -0700)]
Merge tag 'firewire-fixes-6.9-rc2' of git://git./linux/kernel/git/ieee1394/linux1394

Pull firewire fixes from Takashi Sakamoto:
 "The firewire-ohci kernel module has a parameter for verbose kernel
  logging. It is well-known that it logs the spurious IRQ for bus-reset
  event due to the unmasked register for IRQ event. This update fixes
  the issue"

* tag 'firewire-fixes-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: ohci: mask bus reset interrupts between ISR and bottom half

12 months agofirewire: ohci: mask bus reset interrupts between ISR and bottom half
Adam Goldman [Sun, 24 Mar 2024 22:38:41 +0000 (07:38 +0900)]
firewire: ohci: mask bus reset interrupts between ISR and bottom half

In the FireWire OHCI interrupt handler, if a bus reset interrupt has
occurred, mask bus reset interrupts until bus_reset_work has serviced and
cleared the interrupt.

Normally, we always leave bus reset interrupts masked. We infer the bus
reset from the self-ID interrupt that happens shortly thereafter. A
scenario where we unmask bus reset interrupts was introduced in 2008 in
a007bb857e0b26f5d8b73c2ff90782d9c0972620: If
OHCI_PARAM_DEBUG_BUSRESETS (8) is set in the debug parameter bitmask, we
will unmask bus reset interrupts so we can log them.

irq_handler logs the bus reset interrupt. However, we can't clear the bus
reset event flag in irq_handler, because we won't service the event until
later. irq_handler exits with the event flag still set. If the
corresponding interrupt is still unmasked, the first bus reset will
usually freeze the system due to irq_handler being called again each
time it exits. This freeze can be reproduced by loading firewire_ohci
with "modprobe firewire_ohci debug=-1" (to enable all debugging output).
Apparently there are also some cases where bus_reset_work will get called
soon enough to clear the event, and operation will continue normally.

This freeze was first reported a few months after a007bb85 was committed,
but until now it was never fixed. The debug level could safely be set
to -1 through sysfs after the module was loaded, but this would be
ineffectual in logging bus reset interrupts since they were only
unmasked during initialization.

irq_handler will now leave the event flag set but mask bus reset
interrupts, so irq_handler won't be called again and there will be no
freeze. If OHCI_PARAM_DEBUG_BUSRESETS is enabled, bus_reset_work will
unmask the interrupt after servicing the event, so future interrupts
will be caught as desired.

As a side effect to this change, OHCI_PARAM_DEBUG_BUSRESETS can now be
enabled through sysfs in addition to during initial module loading.
However, when enabled through sysfs, logging of bus reset interrupts will
be effective only starting with the second bus reset, after
bus_reset_work has executed.

Signed-off-by: Adam Goldman <adamg@pobox.com>
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
12 months agoMerge tag 'spi-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Linus Torvalds [Sat, 6 Apr 2024 00:26:43 +0000 (17:26 -0700)]
Merge tag 'spi-fix-v6.9-rc2' of git://git./linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
 "A few small driver specific fixes, the most important being the
  s3c64xx change which is likely to be hit during normal operation"

* tag 'spi-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: mchp-pci1xxx: Fix a possible null pointer dereference in pci1xxx_spi_probe
  spi: spi-fsl-lpspi: remove redundant spi_controller_put call
  spi: s3c64xx: Use DMA mode from fifo size

12 months agoMerge tag 'regulator-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 6 Apr 2024 00:24:04 +0000 (17:24 -0700)]
Merge tag 'regulator-fix-v6.9-rc2' of git://git./linux/kernel/git/broonie/regulator

Pull regulator fix from Mark Brown:
 "One simple regualtor fix, fixing module autoloading on tps65132"

* tag 'regulator-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: tps65132: Add of_match table

12 months agoMerge tag 'regmap-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 6 Apr 2024 00:21:16 +0000 (17:21 -0700)]
Merge tag 'regmap-fix-v6.9-rc2' of git://git./linux/kernel/git/broonie/regmap

Pull regmap fixes from Mark Brown:
 "Richard found a nasty corner case in the maple tree code which he
  fixed, and also fixed a compiler warning which was showing up with the
  toolchain he uses and helpfully identified a possible incorrect error
  code which could have runtime impacts"

* tag 'regmap-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
  regmap: maple: Fix uninitialized symbol 'ret' warnings
  regmap: maple: Fix cache corruption in regcache_maple_drop()

12 months agoMerge tag 'block-6.9-20240405' of git://git.kernel.dk/linux
Linus Torvalds [Sat, 6 Apr 2024 00:04:11 +0000 (17:04 -0700)]
Merge tag 'block-6.9-20240405' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
      - Atomic queue limits fixes (Christoph)
      - Fabrics fixes (Hannes, Daniel)

 - Discard overflow fix (Li)

 - Cleanup fix for null_blk (Damien)

* tag 'block-6.9-20240405' of git://git.kernel.dk/linux:
  nvme-fc: rename free_ctrl callback to match name pattern
  nvmet-fc: move RCU read lock to nvmet_fc_assoc_exists
  nvmet: implement unique discovery NQN
  nvme: don't create a multipath node for zero capacity devices
  nvme: split nvme_update_zone_info
  nvme-multipath: don't inherit LBA-related fields for the multipath node
  block: fix overflow in blk_ioctl_discard()
  nullblk: Fix cleanup order in null_add_dev() error path

12 months agoMerge tag 'io_uring-6.9-20240405' of git://git.kernel.dk/linux
Linus Torvalds [Fri, 5 Apr 2024 23:58:52 +0000 (16:58 -0700)]
Merge tag 'io_uring-6.9-20240405' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Backport of some fixes that came up during development of the 6.10
   io_uring patches. This includes some kbuf cleanups and reference
   fixes.

 - Disable multishot read if we don't have NOWAIT support on the target

 - Fix for a dependency issue with workqueue flushing

* tag 'io_uring-6.9-20240405' of git://git.kernel.dk/linux:
  io_uring/kbuf: hold io_buffer_list reference over mmap
  io_uring/kbuf: protect io_buffer_list teardown with a reference
  io_uring/kbuf: get rid of bl->is_ready
  io_uring/kbuf: get rid of lower BGID lists
  io_uring: use private workqueue for exit work
  io_uring: disable io-wq execution of multishot NOWAIT requests
  io_uring/rw: don't allow multishot reads without NOWAIT support

12 months agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Fri, 5 Apr 2024 23:54:54 +0000 (16:54 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "The most important is the libsas fix, which is a problem for DMA to a
  kmalloc'd structure too small causing cache line interference. The
  other fixes (all in drivers) are mostly for allocation length fixes,
  error leg unwinding, suspend races and a missing retry"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: ufs: core: Fix MCQ mode dev command timeout
  scsi: libsas: Align SMP request allocation to ARCH_DMA_MINALIGN
  scsi: sd: Unregister device if device_add_disk() failed in sd_probe()
  scsi: ufs: core: WLUN suspend dev/link state error recovery
  scsi: mylex: Fix sysfs buffer lengths

12 months agoMerge tag 'devicetree-fixes-for-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 5 Apr 2024 21:07:22 +0000 (14:07 -0700)]
Merge tag 'devicetree-fixes-for-6.9-1' of git://git./linux/kernel/git/robh/linux

Pull devicetree fixes from Rob Herring:

 - Fix NIOS2 boot with external DTB

 - Add missing synchronization needed between fw_devlink and DT overlay
   removals

 - Fix some unit-address regex's to be hex only

 - Drop some 10+ year old "unstable binding" statements

 - Add new SoCs to QCom UFS binding

 - Add TPM bindings to TPM maintainers

* tag 'devicetree-fixes-for-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  nios2: Only use built-in devicetree blob if configured to do so
  dt-bindings: timer: narrow regex for unit address to hex numbers
  dt-bindings: soc: fsl: narrow regex for unit address to hex numbers
  dt-bindings: remoteproc: ti,davinci: remove unstable remark
  dt-bindings: clock: ti: remove unstable remark
  dt-bindings: clock: keystone: remove unstable remark
  of: module: prevent NULL pointer dereference in vsnprintf()
  dt-bindings: ufs: qcom: document SM6125 UFS
  dt-bindings: ufs: qcom: document SC7180 UFS
  dt-bindings: ufs: qcom: document SC8180X UFS
  of: dynamic: Synchronize of_changeset_destroy() with the devlink removals
  driver core: Introduce device_link_wait_removal()
  docs: dt-bindings: add missing address/size-cells to example
  MAINTAINERS: Add TPM DT bindings to TPM maintainers

12 months agoMerge tag 'mm-hotfixes-stable-2024-04-05-11-30' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Fri, 5 Apr 2024 20:30:01 +0000 (13:30 -0700)]
Merge tag 'mm-hotfixes-stable-2024-04-05-11-30' of git://git./linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "8 hotfixes, 3 are cc:stable

  There are a couple of fixups for this cycle's vmalloc changes and one
  for the stackdepot changes. And a fix for a very old x86 PAT issue
  which can cause a warning splat"

* tag 'mm-hotfixes-stable-2024-04-05-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  stackdepot: rename pool_index to pool_index_plus_1
  x86/mm/pat: fix VM_PAT handling in COW mappings
  MAINTAINERS: change vmware.com addresses to broadcom.com
  selftests/mm: include strings.h for ffsl
  mm: vmalloc: fix lockdep warning
  mm: vmalloc: bail out early in find_vmap_area() if vmap is not init
  init: open output files from cpio unpacking with O_LARGEFILE
  mm/secretmem: fix GUP-fast succeeding on secretmem folios

12 months agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Fri, 5 Apr 2024 20:12:35 +0000 (13:12 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux

Pull arm64 fix from Catalin Marinas:
 "arm64/ptrace fix to use the correct SVE layout based on the saved
  floating point state rather than the TIF_SVE flag. The latter may be
  left on during syscalls even if the SVE state is discarded"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64/ptrace: Use saved floating point state type to determine SVE layout

12 months agoMerge tag 'riscv-for-linus-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 5 Apr 2024 20:09:48 +0000 (13:09 -0700)]
Merge tag 'riscv-for-linus-6.9-rc3' of git://git./linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:

 - A fix for an __{get,put}_kernel_nofault to avoid an uninitialized
   value causing spurious failures

 - compat_vdso.so.dbg is now installed to the standard install location

 - A fix to avoid initializing PERF_SAMPLE_BRANCH_*-related events, as
   they aren't supported and will just later fail

 - A fix to make AT_VECTOR_SIZE_ARCH correct now that we're providing
   AT_MINSIGSTKSZ

 - pgprot_nx() is now implemented, which fixes vmap W^X protection

 - A fix for the vector save/restore code, which at least manifests as
   corrupted vector state when a signal is taken

 - A fix for a race condition in instruction patching

 - A fix to avoid leaking the kernel-mode GP to userspace, which is a
   kernel pointer leak that can be used to defeat KASLR in various ways

 - A handful of smaller fixes to build warnings, an overzealous printk,
   and some missing tracing annotations

* tag 'riscv-for-linus-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: process: Fix kernel gp leakage
  riscv: Disable preemption when using patch_map()
  riscv: Fix warning by declaring arch_cpu_idle() as noinstr
  riscv: use KERN_INFO in do_trap
  riscv: Fix vector state restore in rt_sigreturn()
  riscv: mm: implement pgprot_nx
  riscv: compat_vdso: align VDSOAS build log
  RISC-V: Update AT_VECTOR_SIZE_ARCH for new AT_MINSIGSTKSZ
  riscv: Mark __se_sys_* functions __used
  drivers/perf: riscv: Disable PERF_SAMPLE_BRANCH_* while not supported
  riscv: compat_vdso: install compat_vdso.so.dbg to /lib/modules/*/vdso/
  riscv: hwprobe: do not produce frtace relocation
  riscv: Fix spurious errors from __get/put_kernel_nofault
  riscv: mm: Fix prototype to avoid discarding const

12 months agoMerge tag 's390-6.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Linus Torvalds [Fri, 5 Apr 2024 20:07:25 +0000 (13:07 -0700)]
Merge tag 's390-6.9-3' of git://git./linux/kernel/git/s390/linux

Pull s390 fixes from Alexander Gordeev:

 - Fix missing NULL pointer check when determining guest/host fault

 - Mark all functions in asm/atomic_ops.h, asm/atomic.h and
   asm/preempt.h as __always_inline to avoid unwanted instrumentation

 - Fix removal of a Processor Activity Instrumentation (PAI) sampling
   event in PMU device driver

 - Align system call table on 8 bytes

* tag 's390-6.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/entry: align system call table on 8 bytes
  s390/pai: fix sampling event removal for PMU device driver
  s390/preempt: mark all functions __always_inline
  s390/atomic: mark all functions __always_inline
  s390/mm: fix NULL pointer dereference

12 months agoMerge tag 'pm-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Linus Torvalds [Fri, 5 Apr 2024 19:55:40 +0000 (12:55 -0700)]
Merge tag 'pm-6.9-rc3' of git://git./linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "Fix a recent Energy Model change that went against a recent scheduler
  change made independently (Vincent Guittot)"

* tag 'pm-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: EM: fix wrong utilization estimation in em_cpu_energy()

12 months agoMerge tag 'thermal-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Fri, 5 Apr 2024 19:51:32 +0000 (12:51 -0700)]
Merge tag 'thermal-6.9-rc3' of git://git./linux/kernel/git/rafael/linux-pm

Pull thermal control fixes from Rafael Wysocki:
 "These fix two power allocator thermal governor issues and an ACPI
  thermal driver regression that all were introduced during the 6.8
  development cycle.

  Specifics:

   - Allow the power allocator thermal governor to bind to a thermal
     zone without cooling devices and/or without trip points (Nikita
     Travkin)

   - Make the ACPI thermal driver register a tripless thermal zone when
     it cannot find any usable trip points instead of returning an error
     from acpi_thermal_add() (Stephen Horvath)"

* tag 'thermal-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  thermal: gov_power_allocator: Allow binding without trip points
  thermal: gov_power_allocator: Allow binding without cooling devices
  ACPI: thermal: Register thermal zones without valid trip points

12 months agoMerge tag 'gpio-fixes-for-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 5 Apr 2024 19:12:19 +0000 (12:12 -0700)]
Merge tag 'gpio-fixes-for-v6.9-rc3' of git://git./linux/kernel/git/brgl/linux

Pull gpio fixes from Bartosz Golaszewski:

 - make sure GPIO devices are registered with the subsystem before
   trying to return them to a caller of gpio_device_find()

 - fix two issues with incorrect sanitization of the interrupt labels

* tag 'gpio-fixes-for-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpio: cdev: fix missed label sanitizing in debounce_setup()
  gpio: cdev: check for NULL labels when sanitizing them for irqs
  gpiolib: Fix triggering "kobject: 'gpiochipX' is not initialized, yet" kobject_get() errors

12 months agoMerge tag 'ata-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Linus Torvalds [Fri, 5 Apr 2024 19:09:16 +0000 (12:09 -0700)]
Merge tag 'ata-6.9-rc3' of git://git./linux/kernel/git/libata/linux

Pull ata fixes from Damien Le Moal:

 - Compilation warning fixes from Arnd: one in the sata_sx4 driver due
   to an incorrect calculation of the parameters passed to memcpy() and
   another one in the sata_mv driver when CONFIG_PCI is not set

 - Drop the owner driver field assignment in the pata_macio driver. That
   is not needed as the PCI core code does that already (Krzysztof)

 - Remove an unusued field in struct st_ahci_drv_data of the ahci_st
   driver (Christophe)

 - Add a missing clock probe error check in the sata_gemini driver
   (Chen)

* tag 'ata-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
  ata: sata_gemini: Check clk_enable() result
  ata: sata_mv: Fix PCI device ID table declaration compilation warning
  ata: ahci_st: Remove an unused field in struct st_ahci_drv_data
  ata: pata_macio: drop driver owner assignment
  ata: sata_sx4: fix pdc20621_get_from_dimm() on 64-bit

12 months agoMerge tag 'sound-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai...
Linus Torvalds [Fri, 5 Apr 2024 18:58:55 +0000 (11:58 -0700)]
Merge tag 'sound-6.9-rc3' of git://git./linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "This became a bit bigger collection of patches, but almost all are
  about device-specific fixes, and should be safe for 6.9:

   - Lots of ASoC Intel SOF-related fixes/updates

   - Locking fixes in SoundWire drivers

   - ASoC AMD ACP/SOF updates

   - ASoC ES8326 codec fixes

   - HD-audio codec fixes and quirks

   - A regression fix in emu10k1 synth code"

* tag 'sound-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (49 commits)
  ASoC: SOF: Core: Add remove_late() to sof_init_environment failure path
  ASoC: SOF: amd: fix for false dsp interrupts
  ASoC: SOF: Intel: lnl: Disable DMIC/SSP offload on remove
  ASoC: Intel: avs: boards: Add modules description
  ASoC: codecs: ES8326: Removing the control of ADC_SCALE
  ASoC: codecs: ES8326: Solve a headphone detection issue after suspend and resume
  ASoC: codecs: ES8326: modify clock table
  ASoC: codecs: ES8326: Solve error interruption issue
  ALSA: line6: Zero-initialize message buffers
  ALSA: hda/realtek: cs35l41: Support ASUS ROG G634JYR
  ALSA: hda/realtek: Update Panasonic CF-SZ6 quirk to support headset with microphone
  ALSA: hda/realtek: Add sound quirks for Lenovo Legion slim 7 16ARHA7 models
  Revert "ALSA: emu10k1: fix synthesizer sample playback position and caching"
  OSS: dmasound/paula: Mark driver struct with __refdata to prevent section mismatch
  ALSA: hda/realtek: Add quirks for ASUS Laptops using CS35L56
  ASoC: amd: acp: fix for acp_init function error handling
  ASoC: tas2781: mark dvc_tlv with __maybe_unused
  ASoC: ops: Fix wraparound for mask in snd_soc_get_volsw
  ASoC: rt-sdw*: add __func__ to all error logs
  ASoC: rt722-sdca-sdw: fix locking sequence
  ...

12 months agoMerge tag 'drm-fixes-2024-04-05' of https://gitlab.freedesktop.org/drm/kernel
Linus Torvalds [Fri, 5 Apr 2024 18:53:46 +0000 (11:53 -0700)]
Merge tag 'drm-fixes-2024-04-05' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
 "Weekly fixes, mostly xe and i915, amdgpu on a week off, otherwise a
  nouveau fix for a crash with new vulkan cts tests, and a couple of
  cleanups and misc fixes.

  display:
   - fix typos in kerneldoc

  prime:
   - unbreak dma-buf export for virt-gpu

  nouveau:
   - uvmm: fix remap address calculation
   - minor cleanups

  panfrost:
   - fix power-transition timeouts

  xe:
   - Stop using system_unbound_wq for preempt fences
   - Fix saving unordered rebinding fences by attaching them as kernel
     feces to the vm's resv
   - Fix TLB invalidation fences completing out of order
   - Move rebind TLB invalidation to the ring ops to reduce the latency

  i915:
   - A few DisplayPort related fixes
   - eDP PSR fixes
   - Remove some VM space restrictions on older platforms
   - Disable automatic load CCS load balancing"

* tag 'drm-fixes-2024-04-05' of https://gitlab.freedesktop.org/drm/kernel: (22 commits)
  drm/xe: Use ordered wq for preempt fence waiting
  drm/xe: Move vma rebinding to the drm_exec locking loop
  drm/xe: Make TLB invalidation fences unordered
  drm/xe: Rework rebinding
  drm/xe: Use ring ops TLB invalidation for rebinds
  drm/i915/mst: Reject FEC+MST on ICL
  drm/i915/mst: Limit MST+DSC to TGL+
  drm/i915/dp: Fix the computation for compressed_bpp for DISPLAY < 13
  drm/i915/gt: Enable only one CCS for compute workload
  drm/i915/gt: Do not generate the command streamer for all the CCS
  drm/i915/gt: Disable HW load balancing for CCS
  drm/i915/gt: Limit the reserved VM space to only the platforms that need it
  drm/i915/psr: Fix intel_psr2_sel_fetch_et_alignment usage
  drm/i915/psr: Move writing early transport pipe src
  drm/i915/psr: Calculate PIPE_SRCSZ_ERLY_TPT value
  drm/i915/dp: Remove support for UHBR13.5
  drm/i915/dp: Fix DSC state HW readout for SST connectors
  drm/display: fix typo
  drm/prime: Unbreak virtgpu dma-buf export
  nouveau/uvmm: fix addr/range calcs for remap operations
  ...

12 months agostackdepot: rename pool_index to pool_index_plus_1
Peter Collingbourne [Tue, 2 Apr 2024 00:14:58 +0000 (17:14 -0700)]
stackdepot: rename pool_index to pool_index_plus_1

Commit 3ee34eabac2a ("lib/stackdepot: fix first entry having a 0-handle")
changed the meaning of the pool_index field to mean "the pool index plus
1".  This made the code accessing this field less self-documenting, as
well as causing debuggers such as drgn to not be able to easily remain
compatible with both old and new kernels, because they typically do that
by testing for presence of the new field.  Because stackdepot is a
debugging tool, we should make sure that it is debugger friendly.
Therefore, give the field a different name to improve readability as well
as enabling debugger backwards compatibility.

This is needed in 6.9, which would otherwise become an odd release with
the new semantics and old name so debuggers wouldn't recognize the new
semantics there.

Fixes: 3ee34eabac2a ("lib/stackdepot: fix first entry having a 0-handle")
Link: https://lkml.kernel.org/r/20240402001500.53533-1-pcc@google.com
Link: https://linux-review.googlesource.com/id/Ib3e70c36c1d230dd0a118dc22649b33e768b9f88
Signed-off-by: Peter Collingbourne <pcc@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Marco Elver <elver@google.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 months agox86/mm/pat: fix VM_PAT handling in COW mappings
David Hildenbrand [Wed, 3 Apr 2024 21:21:30 +0000 (23:21 +0200)]
x86/mm/pat: fix VM_PAT handling in COW mappings

PAT handling won't do the right thing in COW mappings: the first PTE (or,
in fact, all PTEs) can be replaced during write faults to point at anon
folios.  Reliably recovering the correct PFN and cachemode using
follow_phys() from PTEs will not work in COW mappings.

Using follow_phys(), we might just get the address+protection of the anon
folio (which is very wrong), or fail on swap/nonswap entries, failing
follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and
track_pfn_copy(), not properly calling free_pfn_range().

In free_pfn_range(), we either wouldn't call memtype_free() or would call
it with the wrong range, possibly leaking memory.

To fix that, let's update follow_phys() to refuse returning anon folios,
and fallback to using the stored PFN inside vma->vm_pgoff for COW mappings
if we run into that.

We will now properly handle untrack_pfn() with COW mappings, where we
don't need the cachemode.  We'll have to fail fork()->track_pfn_copy() if
the first page was replaced by an anon folio, though: we'd have to store
the cachemode in the VMA to make this work, likely growing the VMA size.

For now, lets keep it simple and let track_pfn_copy() just fail in that
case: it would have failed in the past with swap/nonswap entries already,
and it would have done the wrong thing with anon folios.

Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn():

<--- C reproducer --->
 #include <stdio.h>
 #include <sys/mman.h>
 #include <unistd.h>
 #include <liburing.h>

 int main(void)
 {
         struct io_uring_params p = {};
         int ring_fd;
         size_t size;
         char *map;

         ring_fd = io_uring_setup(1, &p);
         if (ring_fd < 0) {
                 perror("io_uring_setup");
                 return 1;
         }
         size = p.sq_off.array + p.sq_entries * sizeof(unsigned);

         /* Map the submission queue ring MAP_PRIVATE */
         map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
                    ring_fd, IORING_OFF_SQ_RING);
         if (map == MAP_FAILED) {
                 perror("mmap");
                 return 1;
         }

         /* We have at least one page. Let's COW it. */
         *map = 0;
         pause();
         return 0;
 }
<--- C reproducer --->

On a system with 16 GiB RAM and swap configured:
 # ./iouring &
 # memhog 16G
 # killall iouring
[  301.552930] ------------[ cut here ]------------
[  301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100
[  301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g
[  301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1
[  301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4
[  301.559569] RIP: 0010:untrack_pfn+0xf4/0x100
[  301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000
[  301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282
[  301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047
[  301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200
[  301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000
[  301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000
[  301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000
[  301.564186] FS:  0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000
[  301.564773] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0
[  301.565725] PKRU: 55555554
[  301.565944] Call Trace:
[  301.566148]  <TASK>
[  301.566325]  ? untrack_pfn+0xf4/0x100
[  301.566618]  ? __warn+0x81/0x130
[  301.566876]  ? untrack_pfn+0xf4/0x100
[  301.567163]  ? report_bug+0x171/0x1a0
[  301.567466]  ? handle_bug+0x3c/0x80
[  301.567743]  ? exc_invalid_op+0x17/0x70
[  301.568038]  ? asm_exc_invalid_op+0x1a/0x20
[  301.568363]  ? untrack_pfn+0xf4/0x100
[  301.568660]  ? untrack_pfn+0x65/0x100
[  301.568947]  unmap_single_vma+0xa6/0xe0
[  301.569247]  unmap_vmas+0xb5/0x190
[  301.569532]  exit_mmap+0xec/0x340
[  301.569801]  __mmput+0x3e/0x130
[  301.570051]  do_exit+0x305/0xaf0
...

Link: https://lkml.kernel.org/r/20240403212131.929421-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Wupeng Ma <mawupeng1@huawei.com>
Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com
Fixes: b1a86e15dc03 ("x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines")
Fixes: 5899329b1910 ("x86: PAT: implement track/untrack of pfnmap regions for x86 - v3")
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 months agoMAINTAINERS: change vmware.com addresses to broadcom.com
Alexey Makhalov [Tue, 2 Apr 2024 23:23:34 +0000 (16:23 -0700)]
MAINTAINERS: change vmware.com addresses to broadcom.com

Update all remaining vmware.com email addresses to actual broadcom.com.

Add corresponding .mailmap entries for maintainers who contributed in the
past as the vmware.com address will start bouncing soon.

Maintainership update. Jeff Sipek has left VMware, Nick Shi will be
maintaining VMware PTP.

Link: https://lkml.kernel.org/r/20240402232334.33167-1-alexey.makhalov@broadcom.com
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Acked-by: Ajay Kaher <ajay.kaher@broadcom.com>
Acked-by: Ronak Doshi <ronak.doshi@broadcom.com>
Acked-by: Nick Shi <nick.shi@broadcom.com>
Acked-by: Bryan Tan <bryan-bt.tan@broadcom.com>
Acked-by: Vishnu Dasa <vishnu.dasa@broadcom.com>
Acked-by: Vishal Bhakta <vishal.bhakta@broadcom.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 months agoselftests/mm: include strings.h for ffsl
Edward Liaw [Fri, 29 Mar 2024 18:58:10 +0000 (18:58 +0000)]
selftests/mm: include strings.h for ffsl

Got a compilation error on Android for ffsl after 91b80cc5b39f
("selftests: mm: fix map_hugetlb failure on 64K page size systems")
included vm_util.h.

Link: https://lkml.kernel.org/r/20240329185814.16304-1-edliaw@google.com
Fixes: af605d26a8f2 ("selftests/mm: merge util.h into vm_util.h")
Signed-off-by: Edward Liaw <edliaw@google.com>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 months agomm: vmalloc: fix lockdep warning
Uladzislau Rezki (Sony) [Thu, 28 Mar 2024 14:03:30 +0000 (15:03 +0100)]
mm: vmalloc: fix lockdep warning

A lockdep reports a possible deadlock in the find_vmap_area_exceed_addr_lock()
function:

============================================
WARNING: possible recursive locking detected
6.9.0-rc1-00060-ged3ccc57b108-dirty #6140 Not tainted
--------------------------------------------
drgn/455 is trying to acquire lock:
ffff0000c00131d0 (&vn->busy.lock/1){+.+.}-{2:2}, at: find_vmap_area_exceed_addr_lock+0x64/0x124

but task is already holding lock:
ffff0000c0011878 (&vn->busy.lock/1){+.+.}-{2:2}, at: find_vmap_area_exceed_addr_lock+0x64/0x124

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&vn->busy.lock/1);
  lock(&vn->busy.lock/1);

 *** DEADLOCK ***

indeed it can happen if the find_vmap_area_exceed_addr_lock() gets called
concurrently because it tries to acquire two nodes locks.  It was done to
prevent removing a lowest VA found on a previous step.

To address this a lowest VA is found first without holding a node lock
where it resides.  As a last step we check if a VA still there because it
can go away, if removed, proceed with next lowest.

[akpm@linux-foundation.org: fix comment typos, per Baoquan]
Link: https://lkml.kernel.org/r/20240328140330.4747-1-urezki@gmail.com
Fixes: 53becf32aec1 ("mm: vmalloc: support multiple nodes in vread_iter")
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Omar Sandoval <osandov@fb.com>
Reported-by: Jens Axboe <axboe@kernel.dk>
Cc: Baoquan He <bhe@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 months agomm: vmalloc: bail out early in find_vmap_area() if vmap is not init
Uladzislau Rezki (Sony) [Sat, 23 Mar 2024 14:15:44 +0000 (15:15 +0100)]
mm: vmalloc: bail out early in find_vmap_area() if vmap is not init

During the boot the s390 system triggers "spinlock bad magic" messages
if the spinlock debugging is enabled:

[    0.465445] BUG: spinlock bad magic on CPU#0, swapper/0
[    0.465490]  lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[    0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1
[    0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux)
[    0.466270] Call Trace:
[    0.466470]  [<00000000011f26c8>] dump_stack_lvl+0x98/0xd8
[    0.466516]  [<00000000001dcc6a>] do_raw_spin_lock+0x8a/0x108
[    0.466545]  [<000000000042146c>] find_vmap_area+0x6c/0x108
[    0.466572]  [<000000000042175a>] find_vm_area+0x22/0x40
[    0.466597]  [<000000000012f152>] __set_memory+0x132/0x150
[    0.466624]  [<0000000001cc0398>] vmem_map_init+0x40/0x118
[    0.466651]  [<0000000001cc0092>] paging_init+0x22/0x68
[    0.466677]  [<0000000001cbbed2>] setup_arch+0x52a/0x708
[    0.466702]  [<0000000001cb6140>] start_kernel+0x80/0x5c8
[    0.466727]  [<0000000000100036>] startup_continue+0x36/0x40

it happens because such system tries to access some vmap areas
whereas the vmalloc initialization is not even yet done:

[    0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[    0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1
[    0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux)
[    0.466270] Call Trace:
[    0.466470] dump_stack_lvl (lib/dump_stack.c:117)
[    0.466516] do_raw_spin_lock (kernel/locking/spinlock_debug.c:87 kernel/locking/spinlock_debug.c:115)
[    0.466545] find_vmap_area (mm/vmalloc.c:1059 mm/vmalloc.c:2364)
[    0.466572] find_vm_area (mm/vmalloc.c:3150)
[    0.466597] __set_memory (arch/s390/mm/pageattr.c:360 arch/s390/mm/pageattr.c:393)
[    0.466624] vmem_map_init (./arch/s390/include/asm/set_memory.h:55 arch/s390/mm/vmem.c:660)
[    0.466651] paging_init (arch/s390/mm/init.c:97)
[    0.466677] setup_arch (arch/s390/kernel/setup.c:972)
[    0.466702] start_kernel (init/main.c:899)
[    0.466727] startup_continue (arch/s390/kernel/head64.S:35)
[    0.466811] INFO: lockdep is turned off.
...
[    0.718250] vmalloc init - busy lock init 0000000002871860
[    0.718328] vmalloc init - busy lock init 00000000028731b8

Some background. It worked before because the lock that is in question
was statically defined and initialized. As of now, the locks and data
structures are initialized in the vmalloc_init() function.

To address that issue add the check whether the "vmap_initialized"
variable is set, if not find_vmap_area() bails out on entry returning NULL.

Link: https://lkml.kernel.org/r/20240323141544.4150-1-urezki@gmail.com
Fixes: 72210662c5a2 ("mm: vmalloc: offload free_vmap_area_lock lock")
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 months agoinit: open output files from cpio unpacking with O_LARGEFILE
John Sperbeck [Sat, 23 Mar 2024 15:29:34 +0000 (08:29 -0700)]
init: open output files from cpio unpacking with O_LARGEFILE

If a member of a cpio archive for an initrd or initrams is larger than
2Gb, we'll eventually fail to write to that file when we get to that
limit, unless O_LARGEFILE is set.

The problem can be seen with this recipe, assuming that BLK_DEV_RAM
is not configured:

cd /tmp
dd if=/dev/zero of=BIGFILE bs=1048576 count=2200
echo BIGFILE | cpio -o -H newc -R root:root > initrd.img
kexec -l /boot/vmlinuz-$(uname -r) --initrd=initrd.img --reuse-cmdline
kexec -e

The console will show 'Initramfs unpacking failed: write error'.  With
the patch, the error is gone.

Link: https://lkml.kernel.org/r/20240323152934.3307391-1-jsperbeck@google.com
Signed-off-by: John Sperbeck <jsperbeck@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 months agomm/secretmem: fix GUP-fast succeeding on secretmem folios
David Hildenbrand [Tue, 26 Mar 2024 14:32:08 +0000 (15:32 +0100)]
mm/secretmem: fix GUP-fast succeeding on secretmem folios

folio_is_secretmem() currently relies on secretmem folios being LRU
folios, to save some cycles.

However, folios might reside in a folio batch without the LRU flag set, or
temporarily have their LRU flag cleared.  Consequently, the LRU flag is
unreliable for this purpose.

In particular, this is the case when secretmem_fault() allocates a fresh
page and calls filemap_add_folio()->folio_add_lru().  The folio might be
added to the per-cpu folio batch and won't get the LRU flag set until the
batch was drained using e.g., lru_add_drain().

Consequently, folio_is_secretmem() might not detect secretmem folios and
GUP-fast can succeed in grabbing a secretmem folio, crashing the kernel
when we would later try reading/writing to the folio, because the folio
has been unmapped from the directmap.

Fix it by removing that unreliable check.

Link: https://lkml.kernel.org/r/20240326143210.291116-2-david@redhat.com
Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yue sun <samsun1006219@gmail.com>
Closes: https://lore.kernel.org/lkml/CABOYnLyevJeravW=QrH0JUPYEcDN160aZFb7kwndm-J2rmz0HQ@mail.gmail.com/
Debugged-by: Miklos Szeredi <miklos@szeredi.hu>
Tested-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 months agoMerge branch 'acpi-thermal'
Rafael J. Wysocki [Fri, 5 Apr 2024 18:17:48 +0000 (20:17 +0200)]
Merge branch 'acpi-thermal'

* acpi-thermal:
  ACPI: thermal: Register thermal zones without valid trip points

12 months agonfsd: hold a lighter-weight client reference over CB_RECALL_ANY
Jeff Layton [Fri, 5 Apr 2024 17:56:18 +0000 (13:56 -0400)]
nfsd: hold a lighter-weight client reference over CB_RECALL_ANY

Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to the
client. While a callback job is technically an RPC that counter is
really more for client-driven RPCs, and this has the effect of
preventing the client from being unhashed until the callback completes.

If nfsd decides to send a CB_RECALL_ANY just as the client reboots, we
can end up in a situation where the callback can't complete on the (now
dead) callback channel, but the new client can't connect because the old
client can't be unhashed. This usually manifests as a NFS4ERR_DELAY
return on the CREATE_SESSION operation.

The job is only holding a reference to the client so it can clear a flag
after the RPC completes. Fix this by having CB_RECALL_ANY instead hold a
reference to the cl_nfsdfs.cl_ref. Typically we only take that sort of
reference when dealing with the nfsdfs info files, but it should work
appropriately here to ensure that the nfs4_client doesn't disappear.

Fixes: 44df6f439a17 ("NFSD: add delegation reaper to react to low memory condition")
Reported-by: Vladimir Benes <vbenes@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>