linux.git
13 months agox86/fred: Fix init_task thread stack pointer initialization
Xin Li (Intel) [Mon, 4 Mar 2024 08:33:33 +0000 (00:33 -0800)]
x86/fred: Fix init_task thread stack pointer initialization

As TOP_OF_KERNEL_STACK_PADDING was defined as 0 on x86_64, it went
unnoticed that the initialization of the .sp field in INIT_THREAD and some
calculations in the low level startup code do not take the padding into
account.

FRED enabled kernels require a 16 byte padding, which means that the init
task initialization and the low level startup code use the wrong stack
offset.

Subtract TOP_OF_KERNEL_STACK_PADDING in all affected places to adjust for
this.

Fixes: 65c9cc9e2c14 ("x86/fred: Reserve space for the FRED stack frame")
Fixes: 3adee777ad0d ("x86/smpboot: Remove initial_stack on 64-bit")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Closes: https://lore.kernel.org/oe-lkp/202402262159.183c2a37-lkp@intel.com
Link: https://lore.kernel.org/r/20240304083333.449322-1-xin@zytor.com
14 months agoMAINTAINERS: Add a maintainer entry for FRED
Xin Li (Intel) [Wed, 31 Jan 2024 08:01:43 +0000 (00:01 -0800)]
MAINTAINERS: Add a maintainer entry for FRED

Add H. Peter Anvin and myself as FRED maintainers.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Link: https://lore.kernel.org/r/20240131080143.3259642-1-xin@zytor.com
14 months agox86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline...
Xin Li (Intel) [Fri, 2 Feb 2024 09:02:24 +0000 (01:02 -0800)]
x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly

Change array_index_mask_nospec() to __always_inline because "inline" is
broken as https://www.kernel.org/doc/local/inline.html.

Fixes: 6786137bf8fd ("x86/fred: FRED entry/exit and dispatch code")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240202090225.322544-1-xin@zytor.com
14 months agox86/fred: Invoke FRED initialization code to enable FRED
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:50:24 +0000 (02:50 -0800)]
x86/fred: Invoke FRED initialization code to enable FRED

Let cpu_init_exception_handling() call cpu_init_fred_exceptions() to
initialize FRED. However if FRED is unavailable or disabled, it falls
back to set up TSS IST and initialize IDT.

Co-developed-by: Xin Li <xin3.li@intel.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-36-xin3.li@intel.com
14 months agox86/fred: Add FRED initialization functions
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:50:23 +0000 (02:50 -0800)]
x86/fred: Add FRED initialization functions

Add cpu_init_fred_exceptions() to:
  - Set FRED entrypoints for events happening in ring 0 and 3.
  - Specify the stack level for IRQs occurred ring 0.
  - Specify dedicated event stacks for #DB/NMI/#MCE/#DF.
  - Enable FRED and invalidtes IDT.
  - Force 32-bit system calls to use "int $0x80" only.

Add fred_complete_exception_setup() to:
  - Initialize system_vectors as done for IDT systems.
  - Set unused sysvec_table entries to fred_handle_spurious_interrupt().

Co-developed-by: Xin Li <xin3.li@intel.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-35-xin3.li@intel.com
14 months agox86/syscall: Split IDT syscall setup code into idt_syscall_init()
Xin Li [Tue, 5 Dec 2023 10:50:22 +0000 (02:50 -0800)]
x86/syscall: Split IDT syscall setup code into idt_syscall_init()

Because FRED uses the ring 3 FRED entrypoint for SYSCALL and SYSENTER and
ERETU is the only legit instruction to return to ring 3, there is NO need
to setup SYSCALL and SYSENTER MSRs for FRED, except the IA32_STAR MSR.

Split IDT syscall setup code into idt_syscall_init() to make it easy to
skip syscall setup code when FRED is enabled.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-34-xin3.li@intel.com
14 months agoKVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
Xin Li [Tue, 5 Dec 2023 10:50:21 +0000 (02:50 -0800)]
KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling

When FRED is enabled, call fred_entry_from_kvm() to handle IRQ/NMI in
IRQ/NMI induced VM exits.

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231205105030.8698-33-xin3.li@intel.com
14 months agox86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
Xin Li [Tue, 5 Dec 2023 10:50:20 +0000 (02:50 -0800)]
x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI

In IRQ/NMI induced VM exits, KVM VMX needs to execute the respective
handlers, which requires the software to create a FRED stack frame,
and use it to invoke the handlers. Add fred_irq_entry_from_kvm() for
this job.

Export fred_entry_from_kvm() because VMX can be compiled as a module.

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-32-xin3.li@intel.com
14 months agox86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code
Peter Zijlstra (Intel) [Tue, 5 Dec 2023 10:50:19 +0000 (02:50 -0800)]
x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code

PUSH_AND_CLEAR_REGS could be used besides actual entry code; in that case
%rbp shouldn't be cleared (otherwise the frame pointer is destroyed) and
UNWIND_HINT shouldn't be added.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-31-xin3.li@intel.com
14 months agox86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
Xin Li [Tue, 5 Dec 2023 10:50:18 +0000 (02:50 -0800)]
x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user

If the stack frame contains an invalid user context (e.g. due to invalid SS,
a non-canonical RIP, etc.) the ERETU instruction will trap (#SS or #GP).

From a Linux point of view, this really should be considered a user space
failure, so use the standard fault fixup mechanism to intercept the fault,
fix up the exception frame, and redirect execution to fred_entrypoint_user.
The end result is that it appears just as if the hardware had taken the
exception immediately after completing the transition to user space.

Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-30-xin3.li@intel.com
14 months agox86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:50:17 +0000 (02:50 -0800)]
x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled

Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled,
otherwise the existing IDT code is chosen.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-29-xin3.li@intel.com
14 months agox86/traps: Add sysvec_install() to install a system interrupt handler
Xin Li [Tue, 5 Dec 2023 10:50:16 +0000 (02:50 -0800)]
x86/traps: Add sysvec_install() to install a system interrupt handler

Add sysvec_install() to install a system interrupt handler into the IDT
or the FRED system interrupt handler table.

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-28-xin3.li@intel.com
14 months agox86/fred: FRED entry/exit and dispatch code
H. Peter Anvin (Intel) [Sat, 9 Dec 2023 21:42:14 +0000 (13:42 -0800)]
x86/fred: FRED entry/exit and dispatch code

The code to actually handle kernel and event entry/exit using
FRED. It is split up into two files thus:

 - entry_64_fred.S contains the actual entrypoints and exit code, and
   saves and restores registers.

 - entry_fred.c contains the two-level event dispatch code for FRED.
   The first-level dispatch is on the event type, and the second-level
   is on the event vector.

  [ bp: Fold in an allmodconfig clang build fix:
    https://lore.kernel.org/r/20240129064521.5168-1-xin3.li@intel.com
    and a CONFIG_IA32_EMULATION=n build fix:
    https://lore.kernel.org/r/20240127093728.1323-3-xin3.li@intel.com]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Originally-by: Megha Dey <megha.dey@intel.com>
Co-developed-by: Xin Li <xin3.li@intel.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231209214214.2932-1-xin3.li@intel.com
14 months agox86/fred: Add a machine check entry stub for FRED
Xin Li [Tue, 5 Dec 2023 10:50:14 +0000 (02:50 -0800)]
x86/fred: Add a machine check entry stub for FRED

Like #DB, when occurred on different ring level, i.e., from user or kernel
context, #MCE needs to be handled on different stack: User #MCE on current
task stack, while kernel #MCE on a dedicated stack.

This is exactly how FRED event delivery invokes an exception handler: ring
3 event on level 0 stack, i.e., current task stack; ring 0 event on the
the FRED machine check entry stub doesn't do stack switch.

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-26-xin3.li@intel.com
14 months agox86/fred: Add a NMI entry stub for FRED
H. Peter Anvin (Intel) [Sat, 16 Dec 2023 06:31:39 +0000 (22:31 -0800)]
x86/fred: Add a NMI entry stub for FRED

On a FRED system, NMIs nest both with themselves and faults, transient
information is saved into the stack frame, and NMI unblocking only
happens when the stack frame indicates that so should happen.

Thus, the NMI entry stub for FRED is really quite small...

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231216063139.25567-1-xin3.li@intel.com
14 months agox86/fred: Add a debug fault entry stub for FRED
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:50:12 +0000 (02:50 -0800)]
x86/fred: Add a debug fault entry stub for FRED

When occurred on different ring level, i.e., from user or kernel context,
stack, while kernel #DB on a dedicated stack. This is exactly how FRED
event delivery invokes an exception handler: ring 3 event on level 0
stack, i.e., current task stack; ring 0 event on the #DB dedicated stack
specified in the IA32_FRED_STKLVLS MSR. So unlike IDT, the FRED debug
exception entry stub doesn't do stack switch.

On a FRED system, the debug trap status information (DR6) is passed on
the stack, to avoid the problem of transient state. Furthermore, FRED
transitions avoid a lot of ugly corner cases the handling of which can,
and should be, skipped.

The FRED debug trap status information saved on the stack differs from
DR6 in both stickiness and polarity; it is exactly in the format which
debug_read_clear_dr6() returns for the IDT entry points.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-24-xin3.li@intel.com
14 months agox86/idtentry: Incorporate definitions/declarations of the FRED entries
Xin Li [Tue, 5 Dec 2023 10:50:11 +0000 (02:50 -0800)]
x86/idtentry: Incorporate definitions/declarations of the FRED entries

FRED and IDT can share most of the definitions and declarations so
that in the majority of cases the actual handler implementation is the
same.

The differences are the exceptions where FRED stores exception related
information on the stack and the sysvec implementations as FRED can
handle irqentry/exit() in the dispatcher instead of having it in each
handler.

Also add stub defines for vectors which are not used due to Kconfig
decisions to spare the ifdeffery in the actual FRED dispatch code.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-23-xin3.li@intel.com
14 months agox86/fred: Make exc_page_fault() work for FRED
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:50:10 +0000 (02:50 -0800)]
x86/fred: Make exc_page_fault() work for FRED

On a FRED system, the faulting address (CR2) is passed on the stack,
to avoid the problem of transient state.  Thus the page fault address
is read from the FRED stack frame instead of CR2 when FRED is enabled.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-22-xin3.li@intel.com
14 months agox86/fred: Allow single-step trap and NMI when starting a new task
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:50:09 +0000 (02:50 -0800)]
x86/fred: Allow single-step trap and NMI when starting a new task

Entering a new task is logically speaking a return from a system call
(exec, fork, clone, etc.). As such, if ptrace enables single stepping
a single step exception should be allowed to trigger immediately upon
entering user space. This is not optional.

NMI should *never* be disabled in user space. As such, this is an
optional, opportunistic way to catch errors.

Allow single-step trap and NMI when starting a new task, thus once
the new task enters user space, single-step trap and NMI are both
enabled immediately.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-21-xin3.li@intel.com
14 months agox86/fred: No ESPFIX needed when FRED is enabled
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:50:08 +0000 (02:50 -0800)]
x86/fred: No ESPFIX needed when FRED is enabled

Because FRED always restores the full value of %rsp, ESPFIX is
no longer needed when it's enabled.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-20-xin3.li@intel.com
14 months agox86/fred: Disallow the swapgs instruction when FRED is enabled
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:50:07 +0000 (02:50 -0800)]
x86/fred: Disallow the swapgs instruction when FRED is enabled

SWAPGS is no longer needed thus NOT allowed with FRED because FRED
transitions ensure that an operating system can _always_ operate
with its own GS base address:

  - For events that occur in ring 3, FRED event delivery swaps the GS
    base address with the IA32_KERNEL_GS_BASE MSR.

  - ERETU (the FRED transition that returns to ring 3) also swaps the
    GS base address with the IA32_KERNEL_GS_BASE MSR.

And the operating system can still setup the GS segment for a user
thread without the need of loading a user thread GS with:

  - Using LKGS, available with FRED, to modify other attributes of the
    GS segment without compromising its ability always to operate with
    its own GS base address.

  - Accessing the GS segment base address for a user thread as before
    using RDMSR or WRMSR on the IA32_KERNEL_GS_BASE MSR.

Note, LKGS loads the GS base address into the IA32_KERNEL_GS_BASE MSR
instead of the GS segment's descriptor cache. As such, the operating
system never changes its runtime GS base address.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-19-xin3.li@intel.com
14 months agox86/fred: Update MSR_IA32_FRED_RSP0 during task switch
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:50:06 +0000 (02:50 -0800)]
x86/fred: Update MSR_IA32_FRED_RSP0 during task switch

MSR_IA32_FRED_RSP0 is used during ring 3 event delivery, and needs to
be updated to point to the top of next task stack during task switch.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-18-xin3.li@intel.com
14 months agox86/fred: Reserve space for the FRED stack frame
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:50:05 +0000 (02:50 -0800)]
x86/fred: Reserve space for the FRED stack frame

When using FRED, reserve space at the top of the stack frame, just
like i386 does.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-17-xin3.li@intel.com
14 months agox86/fred: Add a new header file for FRED definitions
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:50:04 +0000 (02:50 -0800)]
x86/fred: Add a new header file for FRED definitions

Add a header file for FRED prototypes and definitions.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-16-xin3.li@intel.com
14 months agox86/ptrace: Add FRED additional information to the pt_regs structure
Xin Li [Tue, 5 Dec 2023 10:50:03 +0000 (02:50 -0800)]
x86/ptrace: Add FRED additional information to the pt_regs structure

FRED defines additional information in the upper 48 bits of cs/ss
fields. Therefore add the information definitions into the pt_regs
structure.

Specifically introduce a new structure fred_ss to denote the FRED flags
above SS selector, which avoids FRED_SSX_ macros and makes the code
simpler and easier to read.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Originally-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-15-xin3.li@intel.com
14 months agox86/ptrace: Cleanup the definition of the pt_regs structure
Xin Li [Tue, 5 Dec 2023 10:50:02 +0000 (02:50 -0800)]
x86/ptrace: Cleanup the definition of the pt_regs structure

struct pt_regs is hard to read because the member or section related
comments are not aligned with the members.

The 'cs' and 'ss' members of pt_regs are type of 'unsigned long' while
in reality they are only 16-bit wide. This works so far as the
remaining space is unused, but FRED will use the remaining bits for
other purposes.

To prepare for FRED:

  - Cleanup the formatting
  - Convert 'cs' and 'ss' to u16 and embed them into an union
    with a u64
  - Fixup the related printk() format strings

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Originally-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-14-xin3.li@intel.com
14 months agox86/cpu: Add MSR numbers for FRED configuration
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:50:01 +0000 (02:50 -0800)]
x86/cpu: Add MSR numbers for FRED configuration

Add MSR numbers for the FRED configuration registers per FRED spec 5.0.

Originally-by: Megha Dey <megha.dey@intel.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-13-xin3.li@intel.com
14 months agox86/cpu: Add X86_CR4_FRED macro
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:50:00 +0000 (02:50 -0800)]
x86/cpu: Add X86_CR4_FRED macro

Add X86_CR4_FRED macro for the FRED bit in %cr4. This bit must not be
changed after initialization, so add it to the pinned CR4 bits.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-12-xin3.li@intel.com
14 months agox86/objtool: Teach objtool about ERET[US]
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:49:59 +0000 (02:49 -0800)]
x86/objtool: Teach objtool about ERET[US]

Update the objtool decoder to know about the ERET[US] instructions
(type INSN_CONTEXT_SWITCH).

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-11-xin3.li@intel.com
14 months agox86/opcode: Add ERET[US] instructions to the x86 opcode map
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:49:58 +0000 (02:49 -0800)]
x86/opcode: Add ERET[US] instructions to the x86 opcode map

ERETU returns from an event handler while making a transition to ring 3,
and ERETS returns from an event handler while staying in ring 0.

Add instruction opcodes used by ERET[US] to the x86 opcode map; opcode
numbers are per FRED spec v5.0.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20231205105030.8698-10-xin3.li@intel.com
14 months agox86/fred: Add a fred= cmdline param
Xin Li [Tue, 5 Dec 2023 10:49:57 +0000 (02:49 -0800)]
x86/fred: Add a fred= cmdline param

Let command line option "fred" accept multiple options to make it
easier to tweak its behavior.

Currently, two options 'on' and 'off' are allowed, and the default
behavior is to disable FRED. To enable FRED, append "fred=on" to the
kernel command line.

  [ bp: Use cpu_feature_enabled(), touch ups. ]

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-9-xin3.li@intel.com
14 months agox86/fred: Disable FRED support if CONFIG_X86_FRED is disabled
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:49:56 +0000 (02:49 -0800)]
x86/fred: Disable FRED support if CONFIG_X86_FRED is disabled

Add CONFIG_X86_FRED to <asm/disabled-features.h> to make
cpu_feature_enabled() work correctly with FRED.

Originally-by: Megha Dey <megha.dey@intel.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-8-xin3.li@intel.com
14 months agox86/cpufeatures: Add the CPU feature bit for FRED
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:49:55 +0000 (02:49 -0800)]
x86/cpufeatures: Add the CPU feature bit for FRED

Any FRED enabled CPU will always have the following features as its
baseline:

  1) LKGS, load attributes of the GS segment but the base address into
     the IA32_KERNEL_GS_BASE MSR instead of the GS segment’s descriptor
     cache.

  2) WRMSRNS, non-serializing WRMSR for faster MSR writes.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-7-xin3.li@intel.com
14 months agox86/fred: Add Kconfig option for FRED (CONFIG_X86_FRED)
H. Peter Anvin (Intel) [Tue, 5 Dec 2023 10:49:54 +0000 (02:49 -0800)]
x86/fred: Add Kconfig option for FRED (CONFIG_X86_FRED)

Add the configuration option CONFIG_X86_FRED to enable FRED.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-6-xin3.li@intel.com
14 months agoDocumentation/x86/64: Add documentation for FRED
Xin Li [Tue, 5 Dec 2023 10:49:53 +0000 (02:49 -0800)]
Documentation/x86/64: Add documentation for FRED

Briefly introduce FRED, and its advantages compared to IDT.

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20231205105030.8698-5-xin3.li@intel.com
14 months agox86/trapnr: Add event type macros to <asm/trapnr.h>
Xin Li [Tue, 5 Dec 2023 10:49:52 +0000 (02:49 -0800)]
x86/trapnr: Add event type macros to <asm/trapnr.h>

Intel VT-x classifies events into eight different types, which is inherited
by FRED for event identification. As such, event types becomes a common x86
concept, and should be defined in a common x86 header.

Add event type macros to <asm/trapnr.h>, and use them in <asm/vmx.h>.

Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-4-xin3.li@intel.com
14 months agox86/entry: Remove idtentry_sysvec from entry_{32,64}.S
Xin Li [Tue, 5 Dec 2023 10:49:51 +0000 (02:49 -0800)]
x86/entry: Remove idtentry_sysvec from entry_{32,64}.S

idtentry_sysvec is really just DECLARE_IDTENTRY defined in
<asm/idtentry.h>, no need to define it separately.

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-3-xin3.li@intel.com
14 months agox86/cpufeatures,opcode,msr: Add the WRMSRNS instruction support
Xin Li [Tue, 5 Dec 2023 10:49:50 +0000 (02:49 -0800)]
x86/cpufeatures,opcode,msr: Add the WRMSRNS instruction support

WRMSRNS is an instruction that behaves exactly like WRMSR, with
the only difference being that it is not a serializing instruction
by default. Under certain conditions, WRMSRNS may replace WRMSR to
improve performance.

Add its CPU feature bit, opcode to the x86 opcode map, and an
always inline API __wrmsrns() to embed WRMSRNS into the code.

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231205105030.8698-2-xin3.li@intel.com
14 months agoLinux 6.8-rc1
Linus Torvalds [Sun, 21 Jan 2024 22:11:32 +0000 (14:11 -0800)]
Linux 6.8-rc1

14 months agoMerge tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs
Linus Torvalds [Sun, 21 Jan 2024 22:01:12 +0000 (14:01 -0800)]
Merge tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs

Pull more bcachefs updates from Kent Overstreet:
 "Some fixes, Some refactoring, some minor features:

   - Assorted prep work for disk space accounting rewrite

   - BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this
     makes our trigger context more explicit

   - A few fixes to avoid excessive transaction restarts on
     multithreaded workloads: fstests (in addition to ktest tests) are
     now checking slowpath counters, and that's shaking out a few bugs

   - Assorted tracepoint improvements

   - Starting to break up bcachefs_format.h and move on disk types so
     they're with the code they belong to; this will make room to start
     documenting the on disk format better.

   - A few minor fixes"

* tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits)
  bcachefs: Improve inode_to_text()
  bcachefs: logged_ops_format.h
  bcachefs: reflink_format.h
  bcachefs; extents_format.h
  bcachefs: ec_format.h
  bcachefs: subvolume_format.h
  bcachefs: snapshot_format.h
  bcachefs: alloc_background_format.h
  bcachefs: xattr_format.h
  bcachefs: dirent_format.h
  bcachefs: inode_format.h
  bcachefs; quota_format.h
  bcachefs: sb-counters_format.h
  bcachefs: counters.c -> sb-counters.c
  bcachefs: comment bch_subvolume
  bcachefs: bch_snapshot::btime
  bcachefs: add missing __GFP_NOWARN
  bcachefs: opts->compression can now also be applied in the background
  bcachefs: Prep work for variable size btree node buffers
  bcachefs: grab s_umount only if snapshotting
  ...

14 months agoMerge tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 21 Jan 2024 19:14:40 +0000 (11:14 -0800)]
Merge tag 'timers-core-2024-01-21' of git://git./linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "Updates for time and clocksources:

   - A fix for the idle and iowait time accounting vs CPU hotplug.

     The time is reset on CPU hotplug which makes the accumulated
     systemwide time jump backwards.

   - Assorted fixes and improvements for clocksource/event drivers"

* tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
  clocksource/drivers/ep93xx: Fix error handling during probe
  clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
  clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
  clocksource/timer-riscv: Add riscv_clock_shutdown callback
  dt-bindings: timer: Add StarFive JH8100 clint
  dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs

14 months agoMerge tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sun, 21 Jan 2024 19:04:29 +0000 (11:04 -0800)]
Merge tag 'powerpc-6.8-2' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Aneesh Kumar:

 - Increase default stack size to 32KB for Book3S

Thanks to Michael Ellerman.

* tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s: Increase default stack size to 32KB

14 months agobcachefs: Improve inode_to_text()
Kent Overstreet [Sun, 21 Jan 2024 17:19:01 +0000 (12:19 -0500)]
bcachefs: Improve inode_to_text()

Add line breaks - inode_to_text() is now much easier to read.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: logged_ops_format.h
Kent Overstreet [Sun, 21 Jan 2024 07:57:45 +0000 (02:57 -0500)]
bcachefs: logged_ops_format.h

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: reflink_format.h
Kent Overstreet [Sun, 21 Jan 2024 07:54:47 +0000 (02:54 -0500)]
bcachefs: reflink_format.h

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs; extents_format.h
Kent Overstreet [Sun, 21 Jan 2024 07:51:56 +0000 (02:51 -0500)]
bcachefs; extents_format.h

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: ec_format.h
Kent Overstreet [Sun, 21 Jan 2024 07:47:14 +0000 (02:47 -0500)]
bcachefs: ec_format.h

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: subvolume_format.h
Kent Overstreet [Sun, 21 Jan 2024 07:42:53 +0000 (02:42 -0500)]
bcachefs: subvolume_format.h

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: snapshot_format.h
Kent Overstreet [Sun, 21 Jan 2024 07:41:06 +0000 (02:41 -0500)]
bcachefs: snapshot_format.h

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: alloc_background_format.h
Kent Overstreet [Sun, 21 Jan 2024 05:01:52 +0000 (00:01 -0500)]
bcachefs: alloc_background_format.h

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: xattr_format.h
Kent Overstreet [Sun, 21 Jan 2024 04:59:15 +0000 (23:59 -0500)]
bcachefs: xattr_format.h

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: dirent_format.h
Kent Overstreet [Sun, 21 Jan 2024 04:57:10 +0000 (23:57 -0500)]
bcachefs: dirent_format.h

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: inode_format.h
Kent Overstreet [Sun, 21 Jan 2024 04:55:39 +0000 (23:55 -0500)]
bcachefs: inode_format.h

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs; quota_format.h
Kent Overstreet [Sun, 21 Jan 2024 04:53:52 +0000 (23:53 -0500)]
bcachefs; quota_format.h

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: sb-counters_format.h
Kent Overstreet [Sun, 21 Jan 2024 04:50:56 +0000 (23:50 -0500)]
bcachefs: sb-counters_format.h

bcachefs_format.h has gotten too big; let's do some organizing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: counters.c -> sb-counters.c
Kent Overstreet [Sun, 21 Jan 2024 04:46:35 +0000 (23:46 -0500)]
bcachefs: counters.c -> sb-counters.c

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: comment bch_subvolume
Kent Overstreet [Sun, 21 Jan 2024 04:44:17 +0000 (23:44 -0500)]
bcachefs: comment bch_subvolume

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: bch_snapshot::btime
Kent Overstreet [Sun, 21 Jan 2024 04:35:41 +0000 (23:35 -0500)]
bcachefs: bch_snapshot::btime

Add a field to bch_snapshot for creation time; this will be important
when we start exposing the snapshot tree to userspace.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: add missing __GFP_NOWARN
Kent Overstreet [Wed, 17 Jan 2024 22:16:07 +0000 (17:16 -0500)]
bcachefs: add missing __GFP_NOWARN

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: opts->compression can now also be applied in the background
Kent Overstreet [Tue, 16 Jan 2024 21:20:21 +0000 (16:20 -0500)]
bcachefs: opts->compression can now also be applied in the background

The "apply this compression method in the background" paths now use the
compression option if background_compression is not set; this means that
setting or changing the compression option will cause existing data to
be compressed accordingly in the background.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: Prep work for variable size btree node buffers
Kent Overstreet [Tue, 16 Jan 2024 18:29:59 +0000 (13:29 -0500)]
bcachefs: Prep work for variable size btree node buffers

bcachefs btree nodes are big - typically 256k - and btree roots are
pinned in memory. As we're now up to 18 btrees, we now have significant
memory overhead in mostly empty btree roots.

And in the future we're going to start enforcing that certain btree node
boundaries exist, to solve lock contention issues - analagous to XFS's
AGIs.

Thus, we need to start allocating smaller btree node buffers when we
can. This patch changes code that refers to the filesystem constant
c->opts.btree_node_size to refer to the btree node buffer size -
btree_buf_bytes() - where appropriate.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: grab s_umount only if snapshotting
Su Yue [Mon, 15 Jan 2024 02:21:25 +0000 (10:21 +0800)]
bcachefs: grab s_umount only if snapshotting

When I was testing mongodb over bcachefs with compression,
there is a lockdep warning when snapshotting mongodb data volume.

$ cat test.sh
prog=bcachefs

$prog subvolume create /mnt/data
$prog subvolume create /mnt/data/snapshots

while true;do
    $prog subvolume snapshot /mnt/data /mnt/data/snapshots/$(date +%s)
    sleep 1s
done

$ cat /etc/mongodb.conf
systemLog:
  destination: file
  logAppend: true
  path: /mnt/data/mongod.log

storage:
  dbPath: /mnt/data/

lockdep reports:
[ 3437.452330] ======================================================
[ 3437.452750] WARNING: possible circular locking dependency detected
[ 3437.453168] 6.7.0-rc7-custom+ #85 Tainted: G            E
[ 3437.453562] ------------------------------------------------------
[ 3437.453981] bcachefs/35533 is trying to acquire lock:
[ 3437.454325] ffffa0a02b2b1418 (sb_writers#10){.+.+}-{0:0}, at: filename_create+0x62/0x190
[ 3437.454875]
               but task is already holding lock:
[ 3437.455268] ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
[ 3437.456009]
               which lock already depends on the new lock.

[ 3437.456553]
               the existing dependency chain (in reverse order) is:
[ 3437.457054]
               -> #3 (&type->s_umount_key#48){.+.+}-{3:3}:
[ 3437.457507]        down_read+0x3e/0x170
[ 3437.457772]        bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
[ 3437.458206]        __x64_sys_ioctl+0x93/0xd0
[ 3437.458498]        do_syscall_64+0x42/0xf0
[ 3437.458779]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.459155]
               -> #2 (&c->snapshot_create_lock){++++}-{3:3}:
[ 3437.459615]        down_read+0x3e/0x170
[ 3437.459878]        bch2_truncate+0x82/0x110 [bcachefs]
[ 3437.460276]        bchfs_truncate+0x254/0x3c0 [bcachefs]
[ 3437.460686]        notify_change+0x1f1/0x4a0
[ 3437.461283]        do_truncate+0x7f/0xd0
[ 3437.461555]        path_openat+0xa57/0xce0
[ 3437.461836]        do_filp_open+0xb4/0x160
[ 3437.462116]        do_sys_openat2+0x91/0xc0
[ 3437.462402]        __x64_sys_openat+0x53/0xa0
[ 3437.462701]        do_syscall_64+0x42/0xf0
[ 3437.462982]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.463359]
               -> #1 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}:
[ 3437.463843]        down_write+0x3b/0xc0
[ 3437.464223]        bch2_write_iter+0x5b/0xcc0 [bcachefs]
[ 3437.464493]        vfs_write+0x21b/0x4c0
[ 3437.464653]        ksys_write+0x69/0xf0
[ 3437.464839]        do_syscall_64+0x42/0xf0
[ 3437.465009]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.465231]
               -> #0 (sb_writers#10){.+.+}-{0:0}:
[ 3437.465471]        __lock_acquire+0x1455/0x21b0
[ 3437.465656]        lock_acquire+0xc6/0x2b0
[ 3437.465822]        mnt_want_write+0x46/0x1a0
[ 3437.465996]        filename_create+0x62/0x190
[ 3437.466175]        user_path_create+0x2d/0x50
[ 3437.466352]        bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs]
[ 3437.466617]        __x64_sys_ioctl+0x93/0xd0
[ 3437.466791]        do_syscall_64+0x42/0xf0
[ 3437.466957]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.467180]
               other info that might help us debug this:

[ 3437.469670] 2 locks held by bcachefs/35533:
               other info that might help us debug this:

[ 3437.467507] Chain exists of:
                 sb_writers#10 --> &c->snapshot_create_lock --> &type->s_umount_key#48

[ 3437.467979]  Possible unsafe locking scenario:

[ 3437.468223]        CPU0                    CPU1
[ 3437.468405]        ----                    ----
[ 3437.468585]   rlock(&type->s_umount_key#48);
[ 3437.468758]                                lock(&c->snapshot_create_lock);
[ 3437.469030]                                lock(&type->s_umount_key#48);
[ 3437.469291]   rlock(sb_writers#10);
[ 3437.469434]
                *** DEADLOCK ***

[ 3437.469670] 2 locks held by bcachefs/35533:
[ 3437.469838]  #0: ffffa0a02ce00a88 (&c->snapshot_create_lock){++++}-{3:3}, at: bch2_fs_file_ioctl+0x1e3/0xc90 [bcachefs]
[ 3437.470294]  #1: ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
[ 3437.470744]
               stack backtrace:
[ 3437.470922] CPU: 7 PID: 35533 Comm: bcachefs Kdump: loaded Tainted: G            E      6.7.0-rc7-custom+ #85
[ 3437.471313] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 3437.471694] Call Trace:
[ 3437.471795]  <TASK>
[ 3437.471884]  dump_stack_lvl+0x57/0x90
[ 3437.472035]  check_noncircular+0x132/0x150
[ 3437.472202]  __lock_acquire+0x1455/0x21b0
[ 3437.472369]  lock_acquire+0xc6/0x2b0
[ 3437.472518]  ? filename_create+0x62/0x190
[ 3437.472683]  ? lock_is_held_type+0x97/0x110
[ 3437.472856]  mnt_want_write+0x46/0x1a0
[ 3437.473025]  ? filename_create+0x62/0x190
[ 3437.473204]  filename_create+0x62/0x190
[ 3437.473380]  user_path_create+0x2d/0x50
[ 3437.473555]  bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs]
[ 3437.473819]  ? lock_acquire+0xc6/0x2b0
[ 3437.474002]  ? __fget_files+0x2a/0x190
[ 3437.474195]  ? __fget_files+0xbc/0x190
[ 3437.474380]  ? lock_release+0xc5/0x270
[ 3437.474567]  ? __x64_sys_ioctl+0x93/0xd0
[ 3437.474764]  ? __pfx_bch2_fs_file_ioctl+0x10/0x10 [bcachefs]
[ 3437.475090]  __x64_sys_ioctl+0x93/0xd0
[ 3437.475277]  do_syscall_64+0x42/0xf0
[ 3437.475454]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.475691] RIP: 0033:0x7f2743c313af
======================================================

In __bch2_ioctl_subvolume_create(), we grab s_umount unconditionally
and unlock it at the end of the function. There is a comment
"why do we need this lock?" about the lock coming from
commit 42d237320e98 ("bcachefs: Snapshot creation, deletion")
The reason is that __bch2_ioctl_subvolume_create() calls
sync_inodes_sb() which enforce locked s_umount to writeback all dirty
nodes before doing snapshot works.

Fix it by read locking s_umount for snapshotting only and unlocking
s_umount after sync_inodes_sb().

Signed-off-by: Su Yue <glass.su@suse.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: kvfree bch_fs::snapshots in bch2_fs_snapshots_exit
Su Yue [Tue, 16 Jan 2024 11:05:37 +0000 (19:05 +0800)]
bcachefs: kvfree bch_fs::snapshots in bch2_fs_snapshots_exit

bch_fs::snapshots is allocated by kvzalloc in __snapshot_t_mut.
It should be freed by kvfree not kfree.
Or umount will triger:

[  406.829178 ] BUG: unable to handle page fault for address: ffffe7b487148008
[  406.830676 ] #PF: supervisor read access in kernel mode
[  406.831643 ] #PF: error_code(0x0000) - not-present page
[  406.832487 ] PGD 0 P4D 0
[  406.832898 ] Oops: 0000 [#1] PREEMPT SMP PTI
[  406.833512 ] CPU: 2 PID: 1754 Comm: umount Kdump: loaded Tainted: G           OE      6.7.0-rc7-custom+ #90
[  406.834746 ] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[  406.835796 ] RIP: 0010:kfree+0x62/0x140
[  406.836197 ] Code: 80 48 01 d8 0f 82 e9 00 00 00 48 c7 c2 00 00 00 80 48 2b 15 78 9f 1f 01 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 56 9f 1f 01 <48> 8b 50 08 48 89 c7 f6 c2 01 0f 85 b0 00 00 00 66 90 48 8b 07 f6
[  406.837810 ] RSP: 0018:ffffb9d641607e48 EFLAGS: 00010286
[  406.838213 ] RAX: ffffe7b487148000 RBX: ffffb9d645200000 RCX: ffffb9d641607dc4
[  406.838738 ] RDX: 000065bb00000000 RSI: ffffffffc0d88b84 RDI: ffffb9d645200000
[  406.839217 ] RBP: ffff9a4625d00068 R08: 0000000000000001 R09: 0000000000000001
[  406.839650 ] R10: 0000000000000001 R11: 000000000000001f R12: ffff9a4625d4da80
[  406.840055 ] R13: ffff9a4625d00000 R14: ffffffffc0e2eb20 R15: 0000000000000000
[  406.840451 ] FS:  00007f0a264ffb80(0000) GS:ffff9a4e2d500000(0000) knlGS:0000000000000000
[  406.840851 ] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  406.841125 ] CR2: ffffe7b487148008 CR3: 000000018c4d2000 CR4: 00000000000006f0
[  406.841464 ] Call Trace:
[  406.841583 ]  <TASK>
[  406.841682 ]  ? __die+0x1f/0x70
[  406.841828 ]  ? page_fault_oops+0x159/0x470
[  406.842014 ]  ? fixup_exception+0x22/0x310
[  406.842198 ]  ? exc_page_fault+0x1ed/0x200
[  406.842382 ]  ? asm_exc_page_fault+0x22/0x30
[  406.842574 ]  ? bch2_fs_release+0x54/0x280 [bcachefs]
[  406.842842 ]  ? kfree+0x62/0x140
[  406.842988 ]  ? kfree+0x104/0x140
[  406.843138 ]  bch2_fs_release+0x54/0x280 [bcachefs]
[  406.843390 ]  kobject_put+0xb7/0x170
[  406.843552 ]  deactivate_locked_super+0x2f/0xa0
[  406.843756 ]  cleanup_mnt+0xba/0x150
[  406.843917 ]  task_work_run+0x59/0xa0
[  406.844083 ]  exit_to_user_mode_prepare+0x197/0x1a0
[  406.844302 ]  syscall_exit_to_user_mode+0x16/0x40
[  406.844510 ]  do_syscall_64+0x4e/0xf0
[  406.844675 ]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  406.844907 ] RIP: 0033:0x7f0a2664e4fb

Signed-off-by: Su Yue <glass.su@suse.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: bios must be 512 byte algined
Kent Overstreet [Tue, 16 Jan 2024 16:38:04 +0000 (11:38 -0500)]
bcachefs: bios must be 512 byte algined

Fixes: 023f9ac9f70f bcachefs: Delete dio read alignment check
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: remove redundant variable tmp
Colin Ian King [Tue, 16 Jan 2024 11:07:23 +0000 (11:07 +0000)]
bcachefs: remove redundant variable tmp

The variable tmp is being assigned a value but it isn't being
read afterwards. The assignment is redundant and so tmp can be
removed.

Cleans up clang scan build warning:
warning: Although the value stored to 'ret' is used in the enclosing
expression, the value is never actually read from 'ret'
[deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: Improve trace_trans_restart_relock
Kent Overstreet [Tue, 16 Jan 2024 01:40:06 +0000 (20:40 -0500)]
bcachefs: Improve trace_trans_restart_relock

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: Fix excess transaction restarts in __bchfs_fallocate()
Kent Overstreet [Tue, 16 Jan 2024 01:37:23 +0000 (20:37 -0500)]
bcachefs: Fix excess transaction restarts in __bchfs_fallocate()

drop_locks_do() should not be used in a fastpath without first trying
the do in nonblocking mode - the unlock and relock will cause excessive
transaction restarts and potentially livelocking with other threads that
are contending for the same locks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: extents_to_bp_state
Kent Overstreet [Mon, 15 Jan 2024 23:19:52 +0000 (18:19 -0500)]
bcachefs: extents_to_bp_state

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: bkey_and_val_eq()
Kent Overstreet [Mon, 15 Jan 2024 23:08:32 +0000 (18:08 -0500)]
bcachefs: bkey_and_val_eq()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: Better journal tracepoints
Kent Overstreet [Mon, 15 Jan 2024 22:59:51 +0000 (17:59 -0500)]
bcachefs: Better journal tracepoints

Factor out bch2_journal_bufs_to_text(), and use it in the
journal_entry_full() tracepoint; when we can't get a journal reservation
we need to know the outstanding journal entry sizes to know if the
problem is due to excessive flushing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: Print size of superblock with space allocated
Kent Overstreet [Mon, 15 Jan 2024 22:57:44 +0000 (17:57 -0500)]
bcachefs: Print size of superblock with space allocated

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: Avoid flushing the journal in the discard path
Kent Overstreet [Mon, 15 Jan 2024 22:56:22 +0000 (17:56 -0500)]
bcachefs: Avoid flushing the journal in the discard path

When issuing discards, we may need to flush the journal if there's too
many buckets that can't be discarded until a journal flush.

But the heuristic was bad; we should be comparing the number of buckets
that need to flushes against the number of free buckets, not the number
of buckets we saw.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: Improve move_extent tracepoint
Kent Overstreet [Mon, 15 Jan 2024 20:33:39 +0000 (15:33 -0500)]
bcachefs: Improve move_extent tracepoint

Also print out the data_opts, so that we can see what specifically is
being done to an extent.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: Add missing bch2_moving_ctxt_flush_all()
Kent Overstreet [Mon, 15 Jan 2024 20:06:43 +0000 (15:06 -0500)]
bcachefs: Add missing bch2_moving_ctxt_flush_all()

This fixes a bug with rebalance IOs getting stuck with reads completed,
but writes never being issued.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: Re-add move_extent_write tracepoint
Kent Overstreet [Mon, 15 Jan 2024 20:04:40 +0000 (15:04 -0500)]
bcachefs: Re-add move_extent_write tracepoint

It appears this was accidentally deleted at some point - also, do a bit
of cleanup.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: bch2_kthread_io_clock_wait() no longer sleeps until full amount
Kent Overstreet [Mon, 15 Jan 2024 19:15:26 +0000 (14:15 -0500)]
bcachefs: bch2_kthread_io_clock_wait() no longer sleeps until full amount

Drop t he loop in bch2_kthread_io_clock_wait(): this allows the code
that uses it to be woken up for other reasons, and fixes a bug where
rebalance wouldn't wake up when a scan was requested.

This raises the possibility of spurious wakeups, but callers should
always be able to handle that reasonably well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: Add .val_to_text() for KEY_TYPE_cookie
Kent Overstreet [Mon, 15 Jan 2024 19:15:03 +0000 (14:15 -0500)]
bcachefs: Add .val_to_text() for KEY_TYPE_cookie

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: Don't pass memcmp() as a pointer
Kent Overstreet [Mon, 15 Jan 2024 19:12:43 +0000 (14:12 -0500)]
bcachefs: Don't pass memcmp() as a pointer

Some (buggy!) compilers have issues with this.

Fixes: https://github.com/koverstreet/bcachefs/issues/625
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agoMerge tag 'header_cleanup-2024-01-20' of https://evilpiepirate.org/git/bcachefs
Linus Torvalds [Sun, 21 Jan 2024 18:21:43 +0000 (10:21 -0800)]
Merge tag 'header_cleanup-2024-01-20' of https://evilpiepirate.org/git/bcachefs

Pull header fix from Kent Overstreet:
 "Just one small fixup for the RT build"

* tag 'header_cleanup-2024-01-20' of https://evilpiepirate.org/git/bcachefs:
  spinlock: Fix failing build for PREEMPT_RT

14 months agobcachefs: Reduce would_deadlock restarts
Kent Overstreet [Thu, 11 Jan 2024 04:47:04 +0000 (23:47 -0500)]
bcachefs: Reduce would_deadlock restarts

We don't have to take locks in any particular ordering - we'll make
forward progress just fine - but if we try to stick to an ordering, it
can help to avoid excessive would_deadlock transaction restarts.

This tweaks the reflink path to take extents btree locks in the right
order.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: bch2_trans_account_disk_usage_change()
Kent Overstreet [Sat, 11 Nov 2023 20:08:36 +0000 (15:08 -0500)]
bcachefs: bch2_trans_account_disk_usage_change()

The disk space accounting rewrite is splitting out accounting for each
replicas set - those are moving to btree keys, instead of percpu
counters.

This breaks bch2_trans_fs_usage_apply() up, splitting out the part we
will still need.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: bch_fs_usage_base
Kent Overstreet [Fri, 17 Nov 2023 05:03:45 +0000 (00:03 -0500)]
bcachefs: bch_fs_usage_base

Split out base filesystem usage into its own type; prep work for
breaking up bch2_trans_fs_usage_apply().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: bch2_prt_compression_type()
Kent Overstreet [Sun, 7 Jan 2024 02:01:47 +0000 (21:01 -0500)]
bcachefs: bch2_prt_compression_type()

bounds checking helper, since compression types are extensible

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: helpers for printing data types
Kent Overstreet [Sun, 7 Jan 2024 01:57:43 +0000 (20:57 -0500)]
bcachefs: helpers for printing data types

We need bounds checking since new versions may introduce new data types.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: BTREE_TRIGGER_ATOMIC
Kent Overstreet [Sun, 7 Jan 2024 22:14:46 +0000 (17:14 -0500)]
bcachefs: BTREE_TRIGGER_ATOMIC

Add a new flag to be explicit about when we're running atomic triggers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: drop to_text code for obsolete bps in alloc keys
Kent Overstreet [Sun, 7 Jan 2024 00:47:09 +0000 (19:47 -0500)]
bcachefs: drop to_text code for obsolete bps in alloc keys

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: eytzinger_for_each() declares loop iter
Kent Overstreet [Sun, 7 Jan 2024 00:29:14 +0000 (19:29 -0500)]
bcachefs: eytzinger_for_each() declares loop iter

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: Don't log errors if BCH_WRITE_ALLOC_NOWAIT
Kent Overstreet [Thu, 11 Jan 2024 04:08:30 +0000 (23:08 -0500)]
bcachefs: Don't log errors if BCH_WRITE_ALLOC_NOWAIT

Previously, we added logging in the write path to ensure that any
unexpected errors getting reported to userspace have a log message; but
BCH_WRITE_ALLOC_NOWAIT is a special case, it's used for promotes where
errors are expected and not reported out to userspace - so we need to
silence those.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agobcachefs: fix memleak in bch2_split_devs
Su Yue [Mon, 8 Jan 2024 15:11:08 +0000 (23:11 +0800)]
bcachefs: fix memleak in bch2_split_devs

The pointer dev_name can be modified by strseq(),
then causes the memleak:

unreferenced object 0xffff9d08a2916c80 (size 32):
  comm "mount.bcachefs", pid 9090, jiffies 4295856224 (age 17.564s)
  hex dump (first 32 bytes):
    2f 64 65 76 2f 6d 61 70 70 65 72 2f 74 65 73 74  /dev/mapper/test
    2d 30 00 00 00 00 00 00 00 00 00 00 00 00 00 00  -0..............
  backtrace:
    [<00000000c5d3be7d>] __kmem_cache_alloc_node+0x1f3/0x2c0
    [<0000000052215d26>] __kmalloc_node_track_caller+0x51/0x150
    [<0000000069fea956>] kstrdup+0x32/0x60
    [<000000000877fcf1>] bch2_split_devs+0x3f/0x150 [bcachefs]
    [<000000007ee93204>] bch2_mount+0xcb/0x640 [bcachefs]
    [<000000002dd1e04b>] legacy_get_tree+0x30/0x60
    [<000000006afc31d3>] vfs_get_tree+0x28/0xf0
    [<000000007b0c538e>] path_mount+0x475/0xb60
    [<0000000092de5882>] __x64_sys_mount+0x105/0x140
    [<0000000054fc05d8>] do_syscall_64+0x42/0xf0
    [<00000000df584910>] entry_SYSCALL_64_after_hwframe+0x6e/0x76

Fix it by copy pointer dev_name at beginning and free the copied
pointer at end.

Signed-off-by: Su Yue <glass.su@suse.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
14 months agoMerge tag 'v6.8-rc-part2-smb-client' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sun, 21 Jan 2024 00:48:07 +0000 (16:48 -0800)]
Merge tag 'v6.8-rc-part2-smb-client' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client updates from Steve French:
 "Various smb client fixes, including multichannel and for SMB3.1.1
  POSIX extensions:

   - debugging improvement (display start time for stats)

   - two reparse point handling fixes

   - various multichannel improvements and fixes

   - SMB3.1.1 POSIX extensions open/create parsing fix

   - retry (reconnect) improvement including new retrans mount parm, and
     handling of two additional return codes that need to be retried on

   - two minor cleanup patches and another to remove duplicate query
     info code

   - two documentation cleanup, and one reviewer email correction"

* tag 'v6.8-rc-part2-smb-client' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update iface_last_update on each query-and-update
  cifs: handle servers that still advertise multichannel after disabling
  cifs: new mount option called retrans
  cifs: reschedule periodic query for server interfaces
  smb: client: don't clobber ->i_rdev from cached reparse points
  smb: client: get rid of smb311_posix_query_path_info()
  smb: client: parse owner/group when creating reparse points
  smb: client: fix parsing of SMB3.1.1 POSIX create context
  cifs: update known bugs mentioned in kernel docs for cifs
  cifs: new nt status codes from MS-SMB2
  cifs: pick channel for tcon and tdis
  cifs: open_cached_dir should not rely on primary channel
  smb3: minor documentation updates
  Update MAINTAINERS email address
  cifs: minor comment cleanup
  smb3: show beginning time for per share stats
  cifs: remove redundant variable tcon_exist

14 months agoMerge tag 'dmaengine-fix-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 20 Jan 2024 23:03:25 +0000 (15:03 -0800)]
Merge tag 'dmaengine-fix-6.8-rc1' of git://git./linux/kernel/git/vkoul/dmaengine

Pull dmaengine updates from Vinod Koul:
 "New support:
   - Loongson LS2X APB DMA controller
   - sf-pdma: mpfs-pdma support
   - Qualcomm X1E80100 GPI dma controller support

  Updates:
   - Xilinx XDMA updates to support interleaved DMA transfers
   - TI PSIL threads for AM62P and J722S and cfg register regions
     description
   - axi-dmac Improving the cyclic DMA transfers
   - Tegra Support dma-channel-mask property
   - Remaining platform remove callback returning void conversions

 Driver fixes for:
   - Xilinx xdma driver operator precedence and initialization fix
   - Excess kernel-doc warning fix in imx-sdma xilinx xdma drivers
   - format-overflow warning fix for rz-dmac, sh usb dmac drivers
   - 'output may be truncated' fix for shdma, fsl-qdma and dw-edma
     drivers"

* tag 'dmaengine-fix-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (58 commits)
  dmaengine: dw-edma: increase size of 'name' in debugfs code
  dmaengine: fsl-qdma: increase size of 'irq_name'
  dmaengine: shdma: increase size of 'dev_id'
  dmaengine: xilinx: xdma: Fix kernel-doc warnings
  dmaengine: usb-dmac: Avoid format-overflow warning
  dmaengine: sh: rz-dmac: Avoid format-overflow warning
  dmaengine: imx-sdma: fix Excess kernel-doc warnings
  dmaengine: xilinx: xdma: Fix initialization location of desc in xdma_channel_isr()
  dmaengine: xilinx: xdma: Fix operator precedence in xdma_prep_interleaved_dma()
  dmaengine: xilinx: xdma: statify xdma_prep_interleaved_dma
  dmaengine: xilinx: xdma: Workaround truncation compilation error
  dmaengine: pl330: issue_pending waits until WFP state
  dmaengine: xilinx: xdma: Implement interleaved DMA transfers
  dmaengine: xilinx: xdma: Prepare the introduction of interleaved DMA transfers
  dmaengine: xilinx: xdma: Add transfer error reporting
  dmaengine: xilinx: xdma: Add error checking in xdma_channel_isr()
  dmaengine: xilinx: xdma: Rework xdma_terminate_all()
  dmaengine: xilinx: xdma: Ease dma_pool alignment requirements
  dmaengine: xilinx: xdma: Add necessary macro definitions
  dmaengine: xilinx: xdma: Get rid of unused code
  ...

14 months agoMerge tag 'coccinelle-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawa...
Linus Torvalds [Sat, 20 Jan 2024 22:20:34 +0000 (14:20 -0800)]
Merge tag 'coccinelle-for-6.8' of git://git./linux/kernel/git/jlawall/linux

Pull coccinelle updates from Julia Lawall:
 "Updates to the device_attr_show semantic patch to reflect the new
  guidelines of the Linux kernel documentation.

  The problem was identified by Li Zhijian <lizhijian@fujitsu.com>, who
  proposed an initial fix"

* tag 'coccinelle-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawall/linux:
  coccinelle: device_attr_show: simplify patch case
  coccinelle: device_attr_show: Adapt to the latest Documentation/filesystems/sysfs.rst

14 months agomedia: solo6x10: replace max(a, min(b, c)) by clamp(b, a, c)
Aurelien Jarno [Sat, 13 Jan 2024 18:33:31 +0000 (19:33 +0100)]
media: solo6x10: replace max(a, min(b, c)) by clamp(b, a, c)

This patch replaces max(a, min(b, c)) by clamp(b, a, c) in the solo6x10
driver.  This improves the readability and more importantly, for the
solo6x10-p2m.c file, this reduces on my system (x86-64, gcc 13):

 - the preprocessed size from 121 MiB to 4.5 MiB;

 - the build CPU time from 46.8 s to 1.6 s;

 - the build memory from 2786 MiB to 98MiB.

In fine, this allows this relatively simple C file to be built on a
32-bit system.

Reported-by: Jiri Slaby <jirislaby@gmail.com>
Closes: https://lore.kernel.org/lkml/18c6df0d-45ed-450c-9eda-95160a2bbb8e@gmail.com/
Cc: <stable@vger.kernel.org> # v6.7+
Suggested-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Reviewed-by: David Laight <David.Laight@ACULAB.COM>
Reviewed-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 months agococcinelle: device_attr_show: simplify patch case
Julia Lawall [Sat, 20 Jan 2024 20:56:11 +0000 (21:56 +0100)]
coccinelle: device_attr_show: simplify patch case

Replacing the final expression argument by ... allows the format
string to have multiple arguments.

It also has the advantage of allowing the change to be recognized as
a change in a single statement, thus avoiding adding unneeded braces.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
14 months agoexecve: open the executable file before doing anything else
Linus Torvalds [Tue, 9 Jan 2024 00:43:04 +0000 (16:43 -0800)]
execve: open the executable file before doing anything else

No point in allocating a new mm, counting arguments and environment
variables etc if we're just going to return ENOENT.

This patch does expose the fact that 'do_filp_open()' that execve() uses
is still unnecessarily expensive in the failure case, because it
allocates the 'struct file *' early, even if the path lookup (which is
heavily optimized) fails.

So that remains an unnecessary cost in the "no such executable" case,
but it's a separate issue.  Regardless, I do not want to do _both_ a
filename_lookup() and a later do_filp_open() like the origin patch by
Josh Triplett did in [1].

Reported-by: Josh Triplett <josh@joshtriplett.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/lkml/5c7333ea4bec2fad1b47a8fa2db7c31e4ffc4f14.1663334978.git.josh@joshtriplett.org/
Link: https://lore.kernel.org/lkml/202209161637.9EDAF6B18@keescook/
Link: https://lore.kernel.org/lkml/CAHk-=wgznerM-xs+x+krDfE7eVBiy_HOam35rbsFMMOwvYuEKQ@mail.gmail.com/
Link: https://lore.kernel.org/lkml/CAHk-=whf9qLO8ipps4QhmS0BkM8mtWJhvnuDSdtw5gFjhzvKNA@mail.gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 months agoMerge tag 'riscv-for-linus-6.8-mw4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 20 Jan 2024 19:06:04 +0000 (11:06 -0800)]
Merge tag 'riscv-for-linus-6.8-mw4' of git://git./linux/kernel/git/riscv/linux

Pull more RISC-V updates from Palmer Dabbelt:

 - Support for tuning for systems with fast misaligned accesses.

 - Support for SBI-based suspend.

 - Support for the new SBI debug console extension.

 - The T-Head CMOs now use PA-based flushes.

 - Support for enabling the V extension in kernel code.

 - Optimized IP checksum routines.

 - Various ftrace improvements.

 - Support for archrandom, which depends on the Zkr extension.

 - The build is no longer broken under NET=n, KUNIT=y for ports that
   don't define their own ipv6 checksum.

* tag 'riscv-for-linus-6.8-mw4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (56 commits)
  lib: checksum: Fix build with CONFIG_NET=n
  riscv: lib: Check if output in asm goto supported
  riscv: Fix build error on rv32 + XIP
  riscv: optimize ELF relocation function in riscv
  RISC-V: Implement archrandom when Zkr is available
  riscv: Optimize hweight API with Zbb extension
  riscv: add dependency among Image(.gz), loader(.bin), and vmlinuz.efi
  samples: ftrace: Add RISC-V support for SAMPLE_FTRACE_DIRECT[_MULTI]
  riscv: ftrace: Add DYNAMIC_FTRACE_WITH_DIRECT_CALLS support
  riscv: ftrace: Make function graph use ftrace directly
  riscv: select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY
  lib/Kconfig.debug: Update AS_HAS_NON_CONST_LEB128 comment and name
  riscv: Restrict DWARF5 when building with LLVM to known working versions
  riscv: Hoist linker relaxation disabling logic into Kconfig
  kunit: Add tests for csum_ipv6_magic and ip_fast_csum
  riscv: Add checksum library
  riscv: Add checksum header
  riscv: Add static key for misaligned accesses
  asm-generic: Improve csum_fold
  RISC-V: selftests: cbo: Ensure asm operands match constraints
  ...

14 months agoMerge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Sat, 20 Jan 2024 17:42:32 +0000 (09:42 -0800)]
Merge tag 'scsi-misc' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "Final round of fixes that came in too late to send in the first
  request.

  It's nine bug fixes and one version update (because of a bug fix) and
  one set of PCI ID additions. There's one bug fix in the core which is
  really a one liner (except that an additional sdev pointer was added
  for convenience) and the rest are in drivers"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: target: core: Add TMF to tmr_list handling
  scsi: core: Kick the requeue list after inserting when flushing
  scsi: fnic: unlock on error path in fnic_queuecommand()
  scsi: fcoe: Fix unsigned comparison with zero in store_ctlr_mode()
  scsi: mpi3mr: Fix mpi3mr_fw.c kernel-doc warnings
  scsi: smartpqi: Bump driver version to 2.1.26-030
  scsi: smartpqi: Fix logical volume rescan race condition
  scsi: smartpqi: Add new controller PCI IDs
  scsi: ufs: qcom: Remove unnecessary goto statement from ufs_qcom_config_esi()
  scsi: ufs: core: Remove the ufshcd_hba_exit() call from ufshcd_async_scan()
  scsi: ufs: core: Simplify power management during async scan

14 months agoMerge tag 'sh-for-v6.8-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubit...
Linus Torvalds [Sat, 20 Jan 2024 17:24:06 +0000 (09:24 -0800)]
Merge tag 'sh-for-v6.8-tag1' of git://git./linux/kernel/git/glaubitz/sh-linux

Pull sh updates from John Paul Adrian Glaubitz:
 "Since the large patch series to convert arch/sh to device tree support
  has not been finalized yet due to various maintainers still asking for
  changes to the series, this ended up being rather small consisting of
  just two fixes.

  The first patch by Geert Uytterhoeven addresses a build failure in the
  EcoVec platform code. And the second patch by Masahiro Yamada removes
  an unnecessary $(foreach ...) found in a Makefile of the vsyscall
  code.

   - Rename missed backlight field from fbdev to dev

   - Remove unnecessary $(foreach ...)"

* tag 'sh-for-v6.8-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux:
  sh: vsyscall: Remove unnecessary $(foreach ...)
  sh: ecovec24: Rename missed backlight field from fbdev to dev

14 months agoMerge tag 'fbdev-for-6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 20 Jan 2024 17:14:04 +0000 (09:14 -0800)]
Merge tag 'fbdev-for-6.8-rc1-2' of git://git./linux/kernel/git/deller/linux-fbdev

Pull fbdev fix from Helge Deller:
 "There were various reports from people without any graphics output on
  the screen and it turns out one commit triggers the problem.

   - Revert 'firmware/sysfb: Clear screen_info state after consuming it'"

* tag 'fbdev-for-6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
  Revert "firmware/sysfb: Clear screen_info state after consuming it"

14 months agoMerge tag 'perf-tools-for-v6.8-1-2024-01-09' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Fri, 19 Jan 2024 22:25:23 +0000 (14:25 -0800)]
Merge tag 'perf-tools-for-v6.8-1-2024-01-09' of git://git./linux/kernel/git/perf/perf-tools

Pull perf tools updates from Arnaldo Carvalho de Melo:
 "Add Namhyung Kim as tools/perf/ co-maintainer, we're taking turns
  processing patches, switching roles from perf-tools to perf-tools-next
  at each Linux release.

  Data profiling:

   - Associate samples that identify loads and stores with data
     structures. This uses events available on Intel, AMD and others and
     DWARF info:

       # To get memory access samples in kernel for 1 second (on Intel)
       $ perf mem record -a -K --ldlat=4 -- sleep 1

       # Similar for the AMD (but it requires 6.3+ kernel for BPF filters)
       $ perf mem record -a --filter 'mem_op == load || mem_op == store, ip > 0x8000000000000000' -- sleep 1

     Then, amongst several modes of post processing, one can do things like:

       $ perf report -s type,typeoff --hierarchy --group --stdio
       ...
       #
       # Samples: 10K of events 'cpu/mem-loads,ldlat=4/P, cpu/mem-stores/P, dummy:u'
       # Event count (approx.): 602758064
       #
       #                    Overhead  Data Type / Data Type Offset
       # ...........................  ............................
       #
           26.09%   3.28%   0.00%     long unsigned int
              26.09%   3.28%   0.00%     long unsigned int +0 (no field)
           18.48%   0.73%   0.00%     struct page
              10.83%   0.02%   0.00%     struct page +8 (lru.next)
               3.90%   0.28%   0.00%     struct page +0 (flags)
               3.45%   0.06%   0.00%     struct page +24 (mapping)
               0.25%   0.28%   0.00%     struct page +48 (_mapcount.counter)
               0.02%   0.06%   0.00%     struct page +32 (index)
               0.02%   0.00%   0.00%     struct page +52 (_refcount.counter)
               0.02%   0.01%   0.00%     struct page +56 (memcg_data)
               0.00%   0.01%   0.00%     struct page +16 (lru.prev)
           15.37%  17.54%   0.00%     (stack operation)
              15.37%  17.54%   0.00%     (stack operation) +0 (no field)
           11.71%  50.27%   0.00%     (unknown)
              11.71%  50.27%   0.00%     (unknown) +0 (no field)

       $ perf annotate --data-type
       ...
       Annotate type: 'struct cfs_rq' in [kernel.kallsyms] (13 samples):
       ============================================================================
           samples     offset       size  field
                13          0        640  struct cfs_rq         {
                 2          0         16      struct load_weight       load {
                 2          0          8          unsigned long        weight;
                 0          8          4          u32  inv_weight;
                                              };
                 0         16          8      unsigned long    runnable_weight;
                 0         24          4      unsigned int     nr_running;
                 1         28          4      unsigned int     h_nr_running;
       ...

       $ perf annotate --data-type=page --group
       Annotate type: 'struct page' in [kernel.kallsyms] (480 samples):
        event[0] = cpu/mem-loads,ldlat=4/P
        event[1] = cpu/mem-stores/P
        event[2] = dummy:u
       ===================================================================================
                samples  offset  size  field
       447  33        0       0    64  struct page     {
       108   8        0       0     8  long unsigned int  flags;
       319  13        0       8    40  union       {
       319  13        0       8    40          struct          {
       236   2        0       8    16              union       {
       236   2        0       8    16                  struct list_head       lru {
       236   1        0       8     8                      struct list_head*  next;
         0   1        0      16     8                      struct list_head*  prev;
                                                       };
       236   2        0       8    16                  struct          {
       236   1        0       8     8                      void*      __filler;
         0   1        0      16     4                      unsigned int       mlock_count;
                                                       };
       236   2        0       8    16                  struct list_head       buddy_list {
       236   1        0       8     8                      struct list_head*  next;
         0   1        0      16     8                      struct list_head*  prev;
                                                       };
       236   2        0       8    16                  struct list_head       pcp_list {
       236   1        0       8     8                      struct list_head*  next;
         0   1        0      16     8                      struct list_head*  prev;
                                                       };
                                                   };
        82   4        0      24     8              struct address_space*      mapping;
         1   7        0      32     8              union       {
         1   7        0      32     8                  long unsigned int      index;
         1   7        0      32     8                  long unsigned int      share;
                                                   };
         0   0        0      40     8              long unsigned int  private;
                                                                 };

     This uses the existing annotate code, calling objdump to do the
     disassembly, with improvements to avoid having this take too long,
     but longer term a switch to a disassembler library, possibly
     reusing code in the kernel will be pursued.

     This is the initial implementation, please use it and report
     impressions and bugs. Make sure the kernel-debuginfo packages match
     the running kernel. The 'perf report' phase for non short perf.data
     files may take a while.

     There is a great article about it on LWN:

       https://lwn.net/Articles/955709/ - "Data-type profiling for perf"

     One last test I did while writing this text, on a AMD Ryzen 5950X,
     using a distro kernel, while doing a simple 'find /' on an
     otherwise idle system resulted in:

     # uname -r
     6.6.9-100.fc38.x86_64
     # perf -vv | grep BPF_
                      bpf: [ on  ]  # HAVE_LIBBPF_SUPPORT
            bpf_skeletons: [ on  ]  # HAVE_BPF_SKEL
     # rpm -qa | grep kernel-debuginfo
     kernel-debuginfo-common-x86_64-6.6.9-100.fc38.x86_64
     kernel-debuginfo-6.6.9-100.fc38.x86_64
     #
     # perf mem record -a --filter 'mem_op == load || mem_op == store, ip > 0x8000000000000000'
     ^C[ perf record: Woken up 1 times to write data ]
     [ perf record: Captured and wrote 2.199 MB perf.data (2913 samples) ]
     #
     # ls -la perf.data
     -rw-------. 1 root root 2346486 Jan  9 18:36 perf.data
     # perf evlist
     ibs_op//
     dummy:u
     # perf evlist -v
     ibs_op//: type: 11, size: 136, config: 0, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|ADDR|CPU|PERIOD|IDENTIFIER|DATA_SRC|WEIGHT, read_format: ID, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1
     dummy:u: type: 1 (PERF_TYPE_SOFTWARE), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|ADDR|CPU|IDENTIFIER|DATA_SRC|WEIGHT, read_format: ID, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, mmap_data: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
     #
     # perf report -s type,typeoff --hierarchy --group --stdio
     # Total Lost Samples: 0
     #
     # Samples: 2K of events 'ibs_op//, dummy:u'
     # Event count (approx.): 1904553038
     #
     #            Overhead  Data Type / Data Type Offset
     # ...................  ............................
     #
         73.70%   0.00%     (unknown)
            73.70%   0.00%     (unknown) +0 (no field)
          3.01%   0.00%     long unsigned int
             3.00%   0.00%     long unsigned int +0 (no field)
             0.01%   0.00%     long unsigned int +2 (no field)
          2.73%   0.00%     struct task_struct
             1.71%   0.00%     struct task_struct +52 (on_cpu)
             0.38%   0.00%     struct task_struct +2104 (rcu_read_unlock_special.b.blocked)
             0.23%   0.00%     struct task_struct +2100 (rcu_read_lock_nesting)
             0.14%   0.00%     struct task_struct +2384 ()
             0.06%   0.00%     struct task_struct +3096 (signal)
             0.05%   0.00%     struct task_struct +3616 (cgroups)
             0.05%   0.00%     struct task_struct +2344 (active_mm)
             0.02%   0.00%     struct task_struct +46 (flags)
             0.02%   0.00%     struct task_struct +2096 (migration_disabled)
             0.01%   0.00%     struct task_struct +24 (__state)
             0.01%   0.00%     struct task_struct +3956 (mm_cid_active)
             0.01%   0.00%     struct task_struct +1048 (cpus_ptr)
             0.01%   0.00%     struct task_struct +184 (se.group_node.next)
             0.01%   0.00%     struct task_struct +20 (thread_info.cpu)
             0.00%   0.00%     struct task_struct +104 (on_rq)
             0.00%   0.00%     struct task_struct +2456 (pid)
          1.36%   0.00%     struct module
             0.59%   0.00%     struct module +952 (kallsyms)
             0.42%   0.00%     struct module +0 (state)
             0.23%   0.00%     struct module +8 (list.next)
             0.12%   0.00%     struct module +216 (syms)
          0.95%   0.00%     struct inode
             0.41%   0.00%     struct inode +40 (i_sb)
             0.22%   0.00%     struct inode +0 (i_mode)
             0.06%   0.00%     struct inode +76 (i_rdev)
             0.06%   0.00%     struct inode +56 (i_security)
     <SNIP>

  perf top/report:

   - Don't ignore job control, allowing control+Z + bg to work.

   - Add s390 raw data interpretation for PAI (Processor Activity
     Instrumentation) counters.

  perf archive:

   - Add new option '--all' to pack perf.data with DSOs.

   - Add new option '--unpack' to expand tarballs.

  Initialization speedups:

   - Lazily initialize zstd streams to save memory when not using it.

   - Lazily allocate/size mmap event copy.

   - Lazy load kernel symbols in 'perf record'.

   - Be lazier in allocating lost samples buffer in 'perf record'.

   - Don't synthesize BPF events when disabled via the command line
     (perf record --no-bpf-event).

  Assorted improvements:

   - Show note on AMD systems that the :p, :pp, :ppp and :P are all the
     same, as IBS (Instruction Based Sampling) is used and it is
     inherentely precise, not having levels of precision like in Intel
     systems.

   - When 'cycles' isn't available, fall back to the "task-clock" event
     when not system wide, not to 'cpu-clock'.

   - Add --debug-file option to redirect debug output, e.g.:

       $ perf --debug-file /tmp/perf.log record -v true

   - Shrink 'struct map' to under one cacheline by avoiding function
     pointers for selecting if addresses are identity or DSO relative,
     and using just a byte for some boolean struct members.

   - Resolve the arch specific strerrno just once to use in
     perf_env__arch_strerrno().

   - Reduce memory for recording PERF_RECORD_LOST_SAMPLES event.

  Assorted fixes:

   - Fix the default 'perf top' usage on Intel hybrid systems, now it
     starts with a browser showing the number of samples for Efficiency
     (cpu_atom/cycles/P) and Performance (cpu_core/cycles/P). This
     behaviour is similar on ARM64, with its respective set of
     big.LITTLE processors.

   - Fix segfault on build_mem_topology() error path.

   - Fix 'perf mem' error on hybrid related to availability of mem event
     in a PMU.

   - Fix missing reference count gets (map, maps) in the db-export code.

   - Avoid recursively taking env->bpf_progs.lock in the 'perf_env'
     code.

   - Use the newly introduced maps__for_each_map() to add missing
     locking around iteration of 'struct map' entries.

   - Parse NOTE segments until the build id is found, don't stop on the
     first one, ELF files may have several such NOTE segments.

   - Remove 'egrep' usage, its deprecated, use 'grep -E' instead.

   - Warn first about missing libelf, not libbpf, that depends on
     libelf.

   - Use alternative to 'find ... -printf' as this isn't supported in
     busybox.

   - Address python 3.6 DeprecationWarning for string scapes.

   - Fix memory leak in uniq() in libsubcmd.

   - Fix man page formatting for 'perf lock'

   - Fix some spelling mistakes.

  perf tests:

   - Fail shell tests that needs some symbol in perf itself if it is
     stripped. These tests check if a symbol is resolved, if some hot
     function is indeed detected by profiling, etc.

   - The 'perf test sigtrap' test is currently failing on PREEMPT_RT,
     skip it if sleeping spinlocks are detected (using BTF) and point to
     the mailing list discussion about it. This test is also being
     skipped on several architectures (powerpc, s390x, arm and aarch64)
     due to other pending issues with intruction breakpoints.

   - Adjust test case perf record offcpu profiling tests for s390.

   - Fix 'Setup struct perf_event_attr' fails on s390 on z/VM guest,
     addressing issues caused by the fallback from cycles to task-clock
     done in this release.

   - Fix mask for VG register in the user-regs test.

   - Use shellcheck on 'perf test' shell scripts automatically to make
     sure changes don't introduce things it flags as problematic.

   - Add option to change objdump binary and allow it to be set via
     'perf config'.

   - Add basic 'perf script', 'perf list --json" and 'perf diff' tests.

   - Basic branch counter support.

   - Make DSO tests a suite rather than individual.

   - Remove atomics from test_loop to avoid test failures.

   - Fix call chain match on powerpc for the record+probe_libc_inet_pton
     test.

   - Improve Intel hybrid tests.

  Vendor event files (JSON):

  powerpc:

   - Update datasource event name to fix duplicate events on IBM's
     Power10.

   - Add PVN for HX-C2000 CPU with Power8 Architecture.

  Intel:

   - Alderlake/rocketlake metric fixes.

   - Update emeraldrapids events to v1.02.

   - Update icelakex events to v1.23.

   - Update sapphirerapids events to v1.17.

   - Add skx, clx, icx and spr upi bandwidth metric.

  AMD:

   - Add Zen 4 memory controller events.

  RISC-V:

   - Add StarFive Dubhe-80 and Dubhe-90 JSON files.
       https://www.starfivetech.com/en/site/cpu-u

   - Add T-HEAD C9xx JSON file.
       https://github.com/riscv-software-src/opensbi/blob/master/docs/platform/thead-c9xx.md

  ARM64:

   - Remove UTF-8 characters from cmn.json, that were causing build
     failure in some distros.

   - Add core PMU events and metrics for Ampere One X.

   - Rename Ampere One's BPU_FLUSH_MEM_FAULT to GPC_FLUSH_MEM_FAULT

  libperf:

   - Rename several perf_cpu_map constructor names to clarify what they
     really do.

   - Ditto for some other methods, coping with some issues in their
     semantics, like perf_cpu_map__empty() ->
     perf_cpu_map__has_any_cpu_or_is_empty().

   - Document perf_cpu_map__nr()'s behavior

  perf stat:

   - Exit if parse groups fails.

   - Combine the -A/--no-aggr and --no-merge options.

   - Fix help message for --metric-no-threshold option.

  Hardware tracing:

  ARM64 CoreSight:

   - Bump minimum OpenCSD version to ensure a bugfix is present.

   - Add 'T' itrace option for timestamp trace

   - Set start vm addr of exectable file to 0 and don't ignore first
     sample on the arm-cs-trace-disasm.py 'perf script'"

* tag 'perf-tools-for-v6.8-1-2024-01-09' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (179 commits)
  MAINTAINERS: Add Namhyung as tools/perf/ co-maintainer
  perf test: test case 'Setup struct perf_event_attr' fails on s390 on z/vm
  perf db-export: Fix missing reference count get in call_path_from_sample()
  perf tests: Add perf script test
  libsubcmd: Fix memory leak in uniq()
  perf TUI: Don't ignore job control
  perf vendor events intel: Update sapphirerapids events to v1.17
  perf vendor events intel: Update icelakex events to v1.23
  perf vendor events intel: Update emeraldrapids events to v1.02
  perf vendor events intel: Alderlake/rocketlake metric fixes
  perf x86 test: Add hybrid test for conflicting legacy/sysfs event
  perf x86 test: Update hybrid expectations
  perf vendor events amd: Add Zen 4 memory controller events
  perf stat: Fix hard coded LL miss units
  perf record: Reduce memory for recording PERF_RECORD_LOST_SAMPLES event
  perf env: Avoid recursively taking env->bpf_progs.lock
  perf annotate: Add --insn-stat option for debugging
  perf annotate: Add --type-stat option for debugging
  perf annotate: Support event group display
  perf annotate: Add --data-type option
  ...