git.maquefel.me Git - linux.git/log

Merge branch 'bpf-allow-bpf_for_each_map_elem-helper-with-different-input-maps'

Philo Lu says:

====================
bpf: allow bpf_for_each_map_elem() helper with different input maps

Currently, taking different maps within a single bpf_for_each_map_elem
call is not allowed. For example the following codes cannot pass the
verifier (with error "tail_call abusing map_ptr"):
```
static void test_by_pid(int pid)
{
if (pid <= 100)
bpf_for_each_map_elem(&map1, map_elem_cb, NULL, 0);
else
bpf_for_each_map_elem(&map2, map_elem_cb, NULL, 0);
}
```

This is because during bpf_for_each_map_elem verifying,
bpf_insn_aux_data->map_ptr_state is expected as map_ptr (instead of poison
state), which is then needed by set_map_elem_callback_state. However, as
there are two different map ptr input, map_ptr_state is marked as
BPF_MAP_PTR_POISON, and thus the second map_ptr would be lost.
BPF_MAP_PTR_POISON is also needed by bpf_for_each_map_elem to skip
retpoline optimization in do_misc_fixups(). Therefore, map_ptr_state and
map_ptr are both needed for bpf_for_each_map_elem.

This patchset solves it by transform bpf_insn_aux_data->map_ptr_state as a
new struct, storing poison/unpriv state and map pointer together without
additional memory overhead. Then bpf_for_each_map_elem works well with
different input maps. It also makes map_ptr_state logic clearer.

A test case is added to selftest, which would fail to load without this
patchset.

Changelogs
-> v1:
- PATCH 1/3:
  - make the commit log clearer
  - change poison and unpriv to bool in struct bpf_map_ptr_state, also the
    return value in bpf_map_ptr_poisoned() and bpf_map_ptr_unpriv()
- PATCH 2/3:
  - change the comments in set_map_elem_callback_state()
- PATCH 3/3:
  - remove the "skipping the last element" logic during map updating
  - change if() to ASSERT_OK()

Please review, thanks.
====================

Link: https://lore.kernel.org/r/20240405025536.18113-1-lulie@linux.alibaba.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: add test for bpf_for_each_map_elem() with different maps

A test is added for bpf_for_each_map_elem() with either an arraymap or a
hashmap.
$ tools/testing/selftests/bpf/test_progs -t for_each
#93/1    for_each/hash_map:OK
#93/2    for_each/array_map:OK
#93/3    for_each/write_map_key:OK
#93/4    for_each/multi_maps:OK
#93      for_each:OK
Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240405025536.18113-4-lulie@linux.alibaba.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: allow invoking bpf_for_each_map_elem with different maps

Taking different maps within a single bpf_for_each_map_elem call is not
allowed before, because from the second map,
bpf_insn_aux_data->map_ptr_state will be marked as *poison*. In fact
both map_ptr and state are needed to support this use case: map_ptr is
used by set_map_elem_callback_state() while poison state is needed to
determine whether to use direct call.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240405025536.18113-3-lulie@linux.alibaba.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: store both map ptr and state in bpf_insn_aux_data

Currently, bpf_insn_aux_data->map_ptr_state is used to store either
map_ptr or its poison state (i.e., BPF_MAP_PTR_POISON). Thus
BPF_MAP_PTR_POISON must be checked before reading map_ptr. In certain
cases, we may need valid map_ptr even in case of poison state.
This will be explained in next patch with bpf_for_each_map_elem()
helper.

This patch changes map_ptr_state into a new struct including both map
pointer and its state (poison/unpriv). It's in the same union with
struct bpf_loop_inline_state, so there is no extra memory overhead.
Besides, macros BPF_MAP_PTR_UNPRIV/BPF_MAP_PTR_POISON/BPF_MAP_PTR are no
longer needed.

This patch does not change any existing functionality.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240405025536.18113-2-lulie@linux.alibaba.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: fix perf_snapshot_branch_stack link failure

The newly added code to handle bpf_get_branch_snapshot fails to link when
CONFIG_PERF_EVENTS is disabled:

aarch64-linux-ld: kernel/bpf/verifier.o: in function `do_misc_fixups':
verifier.c:(.text+0x1090c): undefined reference to `__SCK__perf_snapshot_branch_stack'

Add a build-time check for that Kconfig symbol around the code to
remove the link time dependency.

Fixes: 314a53623cd4 ("bpf: inline bpf_get_branch_snapshot() helper")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240405142637.577046-1-arnd@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: add fp-leaking precise subprog result tests

Add selftests validating that BPF verifier handles precision marking
for SCALAR registers derived from r10 (fp) register correctly.

Given `r0 = (s8)r10;` syntax is not supported by older Clang compilers,
use the raw BPF instruction syntax to maximize compatibility.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240404214536.3551295-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: prevent r10 register from being marked as precise

r10 is a special register that is not under BPF program's control and is
always effectively precise. The rest of precision logic assumes that
only r0-r9 SCALAR registers are marked as precise, so prevent r10 from
being marked precise.

This can happen due to signed cast instruction allowing to do something
like `r0 = (s8)r10;`, which later, if r0 needs to be precise, would lead
to an attempt to mark r10 as precise.

Prevent this with an extra check during instruction backtracking.

Fixes: 8100928c8814 ("bpf: Support new sign-extension mov insns")
Reported-by: syzbot+148110ee7cf72f39f33e@syzkaller.appspotmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240404214536.3551295-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Pack struct bpf_fib_lookup

The struct bpf_fib_lookup is supposed to be of size 64. A recent commit
59b418c7063d ("bpf: Add a check for struct bpf_fib_lookup size") added
a static assertion to check this property so that future changes to the
structure will not accidentally break this assumption.

As it immediately turned out, on some 32-bit arm systems, when AEABI=n,
the total size of the structure was equal to 68, see [1]. This happened
because the bpf_fib_lookup structure contains a union of two 16-bit
fields:

    union {
            __u16 tot_len;
            __u16 mtu_result;
    };

which was supposed to compile to a 16-bit-aligned 16-bit field. On the
aforementioned setups it was instead both aligned and padded to 32-bits.

Declare this inner union as __attribute__((packed, aligned(2))) such
that it always is of size 2 and is aligned to 16 bits.

  [1] https://lore.kernel.org/all/CA+G9fYtsoP51f-oP_Sp5MOq-Ffv8La2RztNpwvE6+R1VtFiLrw@mail.gmail.com/#t

Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Fixes: e1850ea9bd9e ("bpf: bpf_fib_lookup return MTU value as output when looked up")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240403123303.1452184-1-aspsk@isovalent.com

bpftool: Mount bpffs on provided dir instead of parent dir

When pinning programs/objects under PATH (eg: during "bpftool prog
loadall") the bpffs is mounted on the parent dir of PATH in the
following situations:
- the given dir exists but it is not bpffs.
- the given dir doesn't exist and the parent dir is not bpffs.

Mounting on the parent dir can also have the unintentional side-
effect of hiding other files located under the parent dir.

If the given dir exists but is not bpffs, then the bpffs should
be mounted on the given dir and not its parent dir.

Similarly, if the given dir doesn't exist and its parent dir is not
bpffs, then the given dir should be created and the bpffs should be
mounted on this new dir.

Fixes: 2a36c26fe3b8 ("bpftool: Support bpffs mountpoint as pin path for prog loadall")
Signed-off-by: Sahil Siddiq <icegambit91@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/2da44d24-74ae-a564-1764-afccf395eeec@isovalent.com/T/#t
Link: https://lore.kernel.org/bpf/20240404192219.52373-1-icegambit91@gmail.com
Closes: https://github.com/libbpf/bpftool/issues/100
Changes since v1:
- Split "mount_bpffs_for_pin" into two functions.
   This is done to improve maintainability and readability.

Changes since v2:
- mount_bpffs_for_pin: rename to "create_and_mount_bpffs_dir".
- mount_bpffs_given_file: rename to "mount_bpffs_given_file".
- create_and_mount_bpffs_dir:
  - introduce "dir_exists" boolean.
  - remove new dir if "mnt_fs" fails.
- improve error handling and error messages.

Changes since v3:
- Rectify function name.
- Improve error messages and formatting.
- mount_bpffs_for_file:
  - Check if dir exists before block_mount check.

Changes since v4:
- Use strdup instead of strcpy.
- create_and_mount_bpffs_dir:
  - Use S_IRWXU instead of 0700.
- Improve error handling and formatting.

Merge branch 'inline-bpf_get_branch_snapshot-bpf-helper'

Andrii Nakryiko says:

====================
Inline bpf_get_branch_snapshot() BPF helper

Implement inlining of bpf_get_branch_snapshot() BPF helper using generic BPF
assembly approach. This allows to reduce LBR record usage right before LBR
records are captured from inside BPF program.

See v1 cover letter ([0]) for some visual examples. I dropped them from v2
because there are multiple independent changes landing and being reviewed, all
of which remove different parts of LBR record waste, so presenting final state
of LBR "waste" gets more complicated until all of the pieces land.

  [0] https://lore.kernel.org/bpf/20240321180501.734779-1-andrii@kernel.org/

v2->v3:
  - fix BPF_MUL instruction definition;
v1->v2:
  - inlining of bpf_get_smp_processor_id() split out into a separate patch set
    implementing internal per-CPU BPF instruction;
  - add efficient divide-by-24 through multiplication logic, and leave
    comments to explain the idea behind it; this way inlined version of
    bpf_get_branch_snapshot() has no compromises compared to non-inlined
    version of the helper  (Alexei).
====================

Link: https://lore.kernel.org/r/20240404002640.1774210-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: inline bpf_get_branch_snapshot() helper

Inline bpf_get_branch_snapshot() helper using architecture-agnostic
inline BPF code which calls directly into underlying callback of
perf_snapshot_branch_stack static call. This callback is set early
during kernel initialization and is never updated or reset, so it's ok
to fetch actual implementation using static_call_query() and call
directly into it.

This change eliminates a full function call and saves one LBR entry
in PERF_SAMPLE_BRANCH_ANY LBR mode.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240404002640.1774210-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: make bpf_get_branch_snapshot() architecture-agnostic

perf_snapshot_branch_stack is set up in an architecture-agnostic way, so
there is no reason for BPF subsystem to keep track of which
architectures do support LBR or not. E.g., it looks like ARM64 might soon
get support for BRBE ([0]), which (with proper integration) should be
possible to utilize using this BPF helper.

perf_snapshot_branch_stack static call will point to
__static_call_return0() by default, which just returns zero, which will
lead to -ENOENT, as expected. So no need to guard anything here.

[0] https://lore.kernel.org/linux-arm-kernel/20240125094119.2542332-1-anshuman.khandual@arm.com/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240404002640.1774210-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, riscv: Implement bpf_addr_space_cast instruction

LLVM generates bpf_addr_space_cast instruction while translating
pointers between native (zero) address space and
__attribute__((address_space(N))). The addr_space=0 is reserved as
bpf_arena address space.

rY = addr_space_cast(rX, 0, 1) is processed by the verifier and
converted to normal 32-bit move: wX = wY

rY = addr_space_cast(rX, 1, 0) has to be converted by JIT.

Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Pu Lehui <pulehui@huawei.com>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20240404114203.105970-3-puranjay12@gmail.com

bpf, riscv: Implement PROBE_MEM32 pseudo instructions

Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW]
instructions. They are similar to PROBE_MEM instructions with the
following differences:

- PROBE_MEM32 supports store.
- PROBE_MEM32 relies on the verifier to clear upper 32-bit of the
  src/dst register
- PROBE_MEM32 adds 64-bit kern_vm_start address (which is stored in S7
  in the prologue). Due to bpf_arena constructions such S7 + reg +
  off16 access is guaranteed to be within arena virtual range, so no
  address check at run-time.
- S11 is a free callee-saved register, so it is used to store kern_vm_start
- PROBE_MEM32 allows STX and ST. If they fault the store is a nop. When
  LDX faults the destination register is zeroed.

To support these on riscv, we do tmp = S7 + src/dst reg and then use
tmp2 as the new src/dst register. This allows us to reuse most of the
code for normal [LDX | STX | ST].

Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Pu Lehui <pulehui@huawei.com>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20240404114203.105970-2-puranjay12@gmail.com

bpf: Optimize emit_mov_imm64().

Turned out that bpf prog callback addresses, bpf prog addresses
used in bpf_trampoline, and in other cases the 64-bit address
can be represented as sign extended 32-bit value.

According to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339
"Skylake has 0.64c throughput for mov r64, imm64, vs. 0.25 for mov r32, imm32."
So use shorter encoding and faster instruction when possible.

Special care is needed in jit_subprogs(), since bpf_pseudo_func()
instruction cannot change its size during the last step of JIT.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/CAADnVQKFfpY-QZBrOU2CG8v2du8Lgyb7MNVmOZVK_yTyOdNbBA@mail.gmail.com
Link: https://lore.kernel.org/bpf/20240401233800.42737-1-alexei.starovoitov@gmail.com

bpf: handle CONFIG_SMP=n configuration in x86 BPF JIT

On non-SMP systems, there is no "per-CPU" data, it's just global data.
So in such case just don't do this_cpu_off-based per-CPU address adjustment.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404040951.d4CUx5S6-lkp@intel.com/
Fixes: 7bdbf7446305 ("bpf: add special internal-only MOV instruction to resolve per-CPU addrs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240404034726.2766740-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'add-internal-only-bpf-per-cpu-instruction'

Andrii Nakryiko says:

====================
Add internal-only BPF per-CPU instruction

Add a new BPF instruction for resolving per-CPU memory addresses.

New instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_X, with
insns->off set to BPF_ADDR_PERCPU (== -1). It resolves provided per-CPU offset
to an absolute address where per-CPU data resides for "this" CPU.

This patch set implements support for it in x86-64 BPF JIT only.

Using the new instruction, we also implement inlining for three cases:
  - bpf_get_smp_processor_id(), which allows to avoid unnecessary trivial
    function call, saving a bit of performance and also not polluting LBR
    records with unnecessary function call/return records;
  - PERCPU_ARRAY's bpf_map_lookup_elem() is completely inlined, bringing its
    performance to implementing per-CPU data structures using global variables
    in BPF (which is an awesome improvement, see benchmarks below);
  - PERCPU_HASH's bpf_map_lookup_elem() is partially inlined, just like the
    same for non-PERCPU HASH map; this still saves a bit of overhead.

To validate performance benefits, I hacked together a tiny benchmark doing
only bpf_map_lookup_elem() and incrementing the value by 1 for PERCPU_ARRAY
(arr-inc benchmark below) and PERCPU_HASH (hash-inc benchmark below) maps. To
establish a baseline, I also implemented logic similar to PERCPU_ARRAY based
on global variable array using bpf_get_smp_processor_id() to index array for
current CPU (glob-arr-inc benchmark below).

BEFORE
======
glob-arr-inc   :  163.685 ± 0.092M/s
arr-inc        :  138.096 ± 0.160M/s
hash-inc       :   66.855 ± 0.123M/s

AFTER
=====
glob-arr-inc   :  173.921 ± 0.039M/s (+6%)
arr-inc        :  170.729 ± 0.210M/s (+23.7%)
hash-inc       :   68.673 ± 0.070M/s (+2.7%)

As can be seen, PERCPU_HASH gets a modest +2.7% improvement, while global
array-based gets a nice +6% due to inlining of bpf_get_smp_processor_id().

But what's really important is that arr-inc benchmark basically catches up
with glob-arr-inc, resulting in +23.7% improvement. This means that in
practice it won't be necessary to avoid PERCPU_ARRAY anymore if performance is
critical (e.g., high-frequent stats collection, which is often a practical use
for PERCPU_ARRAY today).

v1->v2:
  - use BPF_ALU64 | BPF_MOV instruction instead of LDX (Alexei);
  - dropped the direct per-CPU memory read instruction, it can always be added
    back, if necessary;
  - guarded bpf_get_smp_processor_id() behind x86-64 check (Alexei);
  - switched all per-cpu addr casts to (unsigned long) to avoid sparse
    warnings.
====================

Link: https://lore.kernel.org/r/20240402021307.1012571-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map

Using new per-CPU BPF instruction, partially inline
bpf_map_lookup_elem() helper for per-CPU hashmap BPF map. Just like for
normal HASH map, we still generate a call into __htab_map_lookup_elem(),
but after that we resolve per-CPU element address using a new
instruction, saving on extra functions calls.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps

Using new per-CPU BPF instruction implement inlining for per-CPU ARRAY
map lookup helper, if BPF JIT support is present.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: inline bpf_get_smp_processor_id() helper

If BPF JIT supports per-CPU MOV instruction, inline bpf_get_smp_processor_id()
to eliminate unnecessary function calls.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: add special internal-only MOV instruction to resolve per-CPU addrs

Add a new BPF instruction for resolving absolute addresses of per-CPU
data from their per-CPU offsets. This instruction is internal-only and
users are not allowed to use them directly. They will only be used for
internal inlining optimizations for now between BPF verifier and BPF JITs.

We use a special BPF_MOV | BPF_ALU64 | BPF_X form with insn->off field
set to BPF_ADDR_PERCPU = -1. I used negative offset value to distinguish
them from positive ones used by user-exposed instructions.

Such instruction performs a resolution of a per-CPU offset stored in
a register to a valid kernel address which can be dereferenced. It is
useful in any use case where absolute address of a per-CPU data has to
be resolved (e.g., in inlining bpf_map_lookup_elem()).

BPF disassembler is also taught to recognize them to support dumping
final BPF assembly code (non-JIT'ed version).

Add arch-specific way for BPF JITs to mark support for this instructions.

This patch also adds support for these instructions in x86-64 BPF JIT.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Replace deprecated strncpy with strscpy

strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

bpf sym names get looked up and compared/cleaned with various string
apis. This suggests they need to be NUL-terminated (strncpy() suggests
this but does not guarantee it).

| static int compare_symbol_name(const char *name, char *namebuf)
| {
| cleanup_symbol_name(namebuf);
| return strcmp(name, namebuf);
| }

| static void cleanup_symbol_name(char *s)
| {
| ...
| res = strstr(s, ".llvm.");
| ...
| }

Use strscpy() as this method guarantees NUL-termination on the
destination buffer.

This patch also replaces two uses of strncpy() used in log.c. These are
simple replacements as postfix has been zero-initialized on the stack
and has source arguments with a size less than the destination's size.

Note that this patch uses the new 2-argument version of strscpy
introduced in commit e6584c3964f2f ("string: Allow 2-argument strscpy()").

Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90
Link: https://lore.kernel.org/bpf/20240402-strncpy-kernel-bpf-core-c-v1-1-7cb07a426e78@google.com

selftests/xsk: Add new test case for AF_XDP under max ring sizes

Introduce a test case to evaluate AF_XDP's robustness by pushing hardware
and software ring sizes to their limits. This test ensures AF_XDP's
reliability amidst potential producer/consumer throttling due to maximum
ring utilization. The testing strategy includes:

1. Configuring rings to their maximum allowable sizes.
2. Executing a series of tests across diverse batch sizes to assess
system's behavior under different configurations.

Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-8-tushar.vyavahare@intel.com

selftests/xsk: Test AF_XDP functionality under minimal ring configurations

Add a new test case that stresses AF_XDP and the driver by configuring
small hardware and software ring sizes. This verifies that AF_XDP continues
to function properly even with insufficient ring space that could lead
to frequent producer/consumer throttling. The test procedure involves:

1. Set the minimum possible ring configuration(tx 64 and rx 128).
2. Run tests with various batch sizes(1 and 63) to validate the system's
behavior under different configurations.

Update Makefile to include network_helpers.o in the build process for
xskxceiver.

Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-7-tushar.vyavahare@intel.com

selftests/xsk: Introduce set_ring_size function with a retry mechanism for handling AF_XDP socket closures

Introduce a new function, set_ring_size(), to manage asynchronous AF_XDP
socket closure. Retry set_hw_ring_size up to SOCK_RECONF_CTR times if it
fails due to an active AF_XDP socket. Return an error immediately for
non-EBUSY errors. This enhances robustness against asynchronous AF_XDP
socket closures during ring size changes.

Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-6-tushar.vyavahare@intel.com

selftests/bpf: Implement set_hw_ring_size function to configure interface ring size

Introduce a new function called set_hw_ring_size that allows for the
dynamic configuration of the ring size within the interface.

Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-5-tushar.vyavahare@intel.com

selftests/bpf: Implement get_hw_ring_size function to retrieve current and max interface size

Introduce a new function called get_hw_size that retrieves both the
current and maximum size of the interface and stores this information
in the 'ethtool_ringparam' structure.

Remove ethtool_channels struct from xdp_hw_metadata.c due to redefinition
error. Remove unused linux/if.h include from flow_dissector BPF test to
address CI pipeline failure.

Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-4-tushar.vyavahare@intel.com

selftests/xsk: Make batch size variable

Convert the constant BATCH_SIZE into a variable named batch_size to allow
dynamic modification at runtime. This is required for the forthcoming
changes to support testing different hardware ring sizes.

While running these tests, a bug was identified when the batch size is
roughly the same as the NIC ring size. This has now been addressed by
Maciej's fix in commit 913eda2b08cc ("i40e: xsk: remove count_mask").

Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-3-tushar.vyavahare@intel.com

tools: Add ethtool.h header to tooling infra

This commit duplicates the ethtool.h file from the include/uapi/linux
directory in the kernel source to the tools/include/uapi/linux directory.

This action ensures that the ethtool.h file used in the tools directory
is in sync with the kernel's version, maintaining consistency across the
codebase.

There are some checkpatch warnings in this file that could be cleaned up,
but I preferred to move it over as-is for now to avoid disrupting the code.

Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-2-tushar.vyavahare@intel.com

Merge branch 'bpf-arm64-add-support-for-bpf-arena'

Puranjay Mohan says:

====================
bpf,arm64: Add support for BPF Arena

Changes in V4
V3: https://lore.kernel.org/bpf/20240323103057.26499-1-puranjay12@gmail.com/
- Use more descriptive variable names.
- Use insn_is_cast_user() helper.

Changes in V3
V2: https://lore.kernel.org/bpf/20240321153102.103832-1-puranjay12@gmail.com/
- Optimize bpf_addr_space_cast as suggested by Xu Kuohai

Changes in V2
V1: https://lore.kernel.org/bpf/20240314150003.123020-1-puranjay12@gmail.com/
- Fix build warnings by using 5 in place of 32 as DONT_CLEAR marker.
  R5 is not mapped to any BPF register so it can safely be used here.

This series adds the support for PROBE_MEM32 and bpf_addr_space_cast
instructions to the ARM64 BPF JIT. These two instructions allow the
enablement of BPF Arena.

All arena related selftests are passing.

  [root@ip-172-31-6-62 bpf]# ./test_progs -a "*arena*"
  #3/1     arena_htab/arena_htab_llvm:OK
  #3/2     arena_htab/arena_htab_asm:OK
  #3       arena_htab:OK
  #4/1     arena_list/arena_list_1:OK
  #4/2     arena_list/arena_list_1000:OK
  #4       arena_list:OK
  #434/1   verifier_arena/basic_alloc1:OK
  #434/2   verifier_arena/basic_alloc2:OK
  #434/3   verifier_arena/basic_alloc3:OK
  #434/4   verifier_arena/iter_maps1:OK
  #434/5   verifier_arena/iter_maps2:OK
  #434/6   verifier_arena/iter_maps3:OK
  #434     verifier_arena:OK
  Summary: 3/10 PASSED, 0 SKIPPED, 0 FAILED

This will need the patch [1] that introduced insn_is_cast_user() helper to
build.

The verifier_arena selftest could fail in the CI because the following
commit[2] is missing from bpf-next:

[1] https://lore.kernel.org/bpf/20240324183226.29674-1-puranjay12@gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=fa3550dca8f02ec312727653a94115ef3ab68445

Here is a CI run with all dependencies added: https://github.com/kernel-patches/bpf/pull/6641
====================

Link: https://lore.kernel.org/r/20240325150716.4387-1-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add arm64 JIT support for bpf_addr_space_cast instruction.

LLVM generates bpf_addr_space_cast instruction while translating
pointers between native (zero) address space and
__attribute__((address_space(N))). The addr_space=0 is reserved as
bpf_arena address space.

rY = addr_space_cast(rX, 0, 1) is processed by the verifier and
converted to normal 32-bit move: wX = wY.

rY = addr_space_cast(rX, 1, 0) : used to convert a bpf arena pointer to
a pointer in the userspace vma. This has to be converted by the JIT.

Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240325150716.4387-3-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add arm64 JIT support for PROBE_MEM32 pseudo instructions.

Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW]
instructions.  They are similar to PROBE_MEM instructions with the
following differences:
- PROBE_MEM32 supports store.
- PROBE_MEM32 relies on the verifier to clear upper 32-bit of the
  src/dst register
- PROBE_MEM32 adds 64-bit kern_vm_start address (which is stored in R28
  in the prologue). Due to bpf_arena constructions such R28 + reg +
  off16 access is guaranteed to be within arena virtual range, so no
  address check at run-time.
- PROBE_MEM32 allows STX and ST. If they fault the store is a nop. When
  LDX faults the destination register is zeroed.

To support these on arm64, we do tmp2 = R28 + src/dst reg and then use
tmp2 as the new src/dst register. This allows us to reuse most of the
code for normal [LDX | STX | ST].

Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240325150716.4387-2-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add pid limit for mptcpify prog

In order to prevent mptcpify prog from affecting the running results
of other BPF tests, a pid limit was added to restrict it from only
modifying its own program.

Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/8987e2938e15e8ec390b85b5dcbee704751359dc.1712054986.git.tanggeliang@kylinos.cn

libbpf: Use local bpf_helpers.h include

Commit 20d59ee55172fdf6 ("libbpf: add bpf_core_cast() macro") added a
bpf_helpers include in bpf_core_read.h as a system include. Usually, the
includes are local, though, like in bpf_tracing.h. This commit adjusts
the include to be local as well.

Signed-off-by: Tobias Böhm <tobias@aibor.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/q5d5bgc6vty2fmaazd5e73efd6f5bhiru2le6fxn43vkw45bls@fhlw2s5ootdb

bpf: Improve program stats run-time calculation

This patch improves the run-time calculation for program stats by
capturing the duration as soon as possible after the program returns.

Previously, the duration included u64_stats_t operations. While the
instrumentation overhead is part of the total time spent when stats are
enabled, distinguishing between the program's native execution time and
the time spent due to instrumentation is crucial for accurate
performance analysis.

By making this change, the patch facilitates more precise optimization
of BPF programs, enabling users to understand their performance in
environments without stats enabled.

I used a virtualized environment to measure the run-time over one minute
for a basic raw_tracepoint/sys_enter program, which just increments a
local counter. Although the virtualization introduced some performance
degradation that could affect the results, I observed approximately a
16% decrease in average run-time reported by stats with this change
(310 -> 260 nsec).

Signed-off-by: Jose Fernandez <josef@netflix.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240402034010.25060-1-josef@netflix.com

selftests/bpf: Skip test when perf_event_open returns EOPNOTSUPP

When testing send_signal and stacktrace_build_id_nmi using the riscv sbi
pmu driver without the sscofpmf extension or the riscv legacy pmu driver,
then failures as follows are encountered:

    test_send_signal_common:FAIL:perf_event_open unexpected perf_event_open: actual -1 < expected 0
    #272/3   send_signal/send_signal_nmi:FAIL

    test_stacktrace_build_id_nmi:FAIL:perf_event_open err -1 errno 95
    #304     stacktrace_build_id_nmi:FAIL

The reason is that the above pmu driver or hardware does not support
sampling events, that is, PERF_PMU_CAP_NO_INTERRUPT is set to pmu
capabilities, and then perf_event_open returns EOPNOTSUPP. Since
PERF_PMU_CAP_NO_INTERRUPT is not only set in the riscv-related pmu driver,
it is better to skip testing when this capability is set.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240402073029.1299085-1-pulehui@huaweicloud.com

bpftool: Use __typeof__() instead of typeof() in BPF skeleton

When generated BPF skeleton header is included in C++ code base, some
compiler setups will emit warning about using language extensions due to
typeof() usage, resulting in something like:

  error: extension used [-Werror,-Wlanguage-extension-token]
  obj->struct_ops.empty_tcp_ca = (typeof(obj->struct_ops.empty_tcp_ca))
                                  ^

It looks like __typeof__() is a preferred way to do typeof() with better
C++ compatibility behavior, so switch to that. With __typeof__() we get
no such warning.

Fixes: c2a0257c1edf ("bpftool: Cast pointers for shadow types explicitly.")
Fixes: 00389c58ffe9 ("bpftool: Add support for subskeletons")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kui-Feng Lee <thinker.li@gmail.com>
Acked-by: Quentin Monnet <qmo@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240401170713.2081368-1-andrii@kernel.org

selftests/bpf: Using llvm may_goto inline asm for cond_break macro

Currently, cond_break macro uses bytes to encode the may_goto insn.
Patch [1] in llvm implemented may_goto insn in BPF backend.
Replace byte-level encoding with llvm inline asm for better usability.
Using llvm may_goto insn is controlled by macro __BPF_FEATURE_MAY_GOTO.

[1] https://github.com/llvm/llvm-project/commit/0e0bfacff71859d1f9212205f8f873d47029d3fb

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240402025446.3215182-1-yonghong.song@linux.dev

bpf: Add a verbose message if map limit is reached

When more than 64 maps are used by a program and its subprograms the
verifier returns -E2BIG. Add a verbose message which highlights the
source of the error and also print the actual limit.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240402073347.195920-1-aspsk@isovalent.com

bpf: Fix typo in uapi doc comments

In a few places in the bpf uapi headers, EOPNOTSUPP is missing a "P" in
the doc comments. This adds the missing "P".

Signed-off-by: David Lechner <dlechner@baylibre.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240329152900.398260-2-dlechner@baylibre.com

bpftool: Clean-up typos, punctuation, list formatting in docs

Improve the formatting of the attach flags for cgroup programs in the
relevant man page, and fix typos ("can be on of", "an userspace inet
socket") when introducing that list. Also fix a couple of other trivial
issues in docs.

[ Quentin: Fixed trival issues in bpftool-gen.rst and bpftool-iter.rst ]

Signed-off-by: Rameez Rehman <rameezrehman408@hotmail.com>
Signed-off-by: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240331200346.29118-4-qmo@kernel.org

bpftool: Remove useless emphasis on command description in man pages

As it turns out, the terms in definition lists in the rST file are
already rendered with bold-ish formatting when generating the man pages;
all double-star sequences we have in the commands for the command
description are unnecessary, and can be removed to make the
documentation easier to read.

The rST files were automatically processed with:

sed -i '/DESCRIPTION/,/OPTIONS/ { /^\*/ s/\*\*//g }' b*.rst

Signed-off-by: Rameez Rehman <rameezrehman408@hotmail.com>
Signed-off-by: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240331200346.29118-3-qmo@kernel.org

bpftool: Use simpler indentation in source rST for documentation

The rST manual pages for bpftool would use a mix of tabs and spaces for
indentation. While this is the norm in C code, this is rather unusual
for rST documents, and over time we've seen many contributors use a
wrong level of indentation for documentation update.

Let's fix bpftool's indentation in docs once and for all:

- Let's use spaces, that are more common in rST files.
- Remove one level of indentation for the synopsis, the command
  description, and the "see also" section. As a result, all sections
  start with the same indentation level in the generated man page.
- Rewrap the paragraphs after the changes.

There is no content change in this patch, only indentation and
rewrapping changes. The wrapping in the generated source files for the
manual pages is changed, but the pages displayed with "man" remain the
same, apart from the adjusted indentation level on relevant sections.

[ Quentin: rebased on bpf-next, removed indent level for command
  description and options, updated synopsis, command summary, and "see
  also" sections. ]

Signed-off-by: Rameez Rehman <rameezrehman408@hotmail.com>
Signed-off-by: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240331200346.29118-2-qmo@kernel.org

selftests/bpf: make multi-uprobe tests work in RELEASE=1 mode

When BPF selftests are built in RELEASE=1 mode with -O2 optimization
level, uprobe_multi binary, called from multi-uprobe tests is optimized
to the point that all the thousands of target uprobe_multi_func_XXX
functions are eliminated, breaking tests.

So ensure they are preserved by using weak attribute.

But, actually, compiling uprobe_multi binary with -O2 takes a really
long time, and is quite useless (it's not a benchmark). So in addition
to ensuring that uprobe_multi_func_XXX functions are preserved, opt-out
of -O2 explicitly in Makefile and stick to -O0. This saves a lot of
compilation time.

With -O2, just recompiling uprobe_multi:

  $ touch uprobe_multi.c
  $ time make RELEASE=1 -j90
  make RELEASE=1 -j90  291.66s user 2.54s system 99% cpu 4:55.52 total

With -O0:
  $ touch uprobe_multi.c
  $ time make RELEASE=1 -j90
  make RELEASE=1 -j90  22.40s user 1.91s system 99% cpu 24.355 total

5 minutes vs (still slow, but...) 24 seconds.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240329190410.4191353-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Avoid kfree_rcu() under lock in bpf_lpm_trie.

syzbot reported the following lock sequence:
cpu 2:
  grabs timer_base lock
    spins on bpf_lpm lock

cpu 1:
  grab rcu krcp lock
    spins on timer_base lock

cpu 0:
  grab bpf_lpm lock
    spins on rcu krcp lock

bpf_lpm lock can be the same.
timer_base lock can also be the same due to timer migration.
but rcu krcp lock is always per-cpu, so it cannot be the same lock.
Hence it's a false positive.
To avoid lockdep complaining move kfree_rcu() after spin_unlock.

Reported-by: syzbot+1fa663a2100308ab6eab@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240329171439.37813-1-alexei.starovoitov@gmail.com

Merge branch 'Use start_server and connect_fd_to_fd'

Geliang Tang says:

====================
Simplify bpf_tcp_ca test by using connect_fd_to_fd and start_server
helpers.

v4:
- Matt reminded me that I shouldn't send a square-to patch to BPF (thanks),
   so I update them into two patches in v4.

v3:
- split v2 as two patches as Daniel suggested.
- The patch "selftests/bpf: Use start_server in bpf_tcp_ca" is merged
   by Daniel (thanks), but I forgot to drop 'settimeo(lfd, 0)' in it, so
   I send a squash-to patch to fix this.
====================

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

selftests/bpf: Drop settimeo in do_test

settimeo is invoked in start_server() and in connect_fd_to_fd() already,
no need to invoke settimeo(lfd, 0) and settimeo(fd, 0) in do_test()
anymore. This patch drops them.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Link: https://lore.kernel.org/r/dbc3613bee3b1c78f95ac9ff468bf47c92f106ea.1711447102.git.tanggeliang@kylinos.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

selftests/bpf: Use connect_fd_to_fd in bpf_tcp_ca

To simplify the code, use BPF selftests helper connect_fd_to_fd() in
bpf_tcp_ca.c instead of open-coding it. This helper is defined in
network_helpers.c, and exported in network_helpers.h, which is already
included in bpf_tcp_ca.c.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Link: https://lore.kernel.org/r/e105d1f225c643bee838409378dd90fd9aabb6dc.1711447102.git.tanggeliang@kylinos.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

bpf: Mark bpf prog stack with kmsan_unposion_memory in interpreter mode

syzbot reported uninit memory usages during map_{lookup,delete}_elem.

==========
BUG: KMSAN: uninit-value in __dev_map_lookup_elem kernel/bpf/devmap.c:441 [inline]
BUG: KMSAN: uninit-value in dev_map_lookup_elem+0xf3/0x170 kernel/bpf/devmap.c:796
__dev_map_lookup_elem kernel/bpf/devmap.c:441 [inline]
dev_map_lookup_elem+0xf3/0x170 kernel/bpf/devmap.c:796
____bpf_map_lookup_elem kernel/bpf/helpers.c:42 [inline]
bpf_map_lookup_elem+0x5c/0x80 kernel/bpf/helpers.c:38
___bpf_prog_run+0x13fe/0xe0f0 kernel/bpf/core.c:1997
__bpf_prog_run256+0xb5/0xe0 kernel/bpf/core.c:2237
==========

The reproducer should be in the interpreter mode.

The C reproducer is trying to run the following bpf prog:

    0: (18) r0 = 0x0
    2: (18) r1 = map[id:49]
    4: (b7) r8 = 16777216
    5: (7b) *(u64 *)(r10 -8) = r8
    6: (bf) r2 = r10
    7: (07) r2 += -229
            ^^^^^^^^^^

    8: (b7) r3 = 8
    9: (b7) r4 = 0
   10: (85) call dev_map_lookup_elem#1543472
   11: (95) exit

It is due to the "void *key" (r2) passed to the helper. bpf allows uninit
stack memory access for bpf prog with the right privileges. This patch
uses kmsan_unpoison_memory() to mark the stack as initialized.

This should address different syzbot reports on the uninit "void *key"
argument during map_{lookup,delete}_elem.

Reported-by: syzbot+603bcd9b0bf1d94dbb9b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000f9ce6d061494e694@google.com/
Reported-by: syzbot+eb02dc7f03dce0ef39f3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000a5c69c06147c2238@google.com/
Reported-by: syzbot+b4e65ca24fd4d0c734c3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000ac56fb06143b6cfa@google.com/
Reported-by: syzbot+d2b113dc9fea5e1d2848@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/0000000000000d69b206142d1ff7@google.com/
Reported-by: syzbot+1a3cf6f08d68868f9db3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/0000000000006f876b061478e878@google.com/
Tested-by: syzbot+1a3cf6f08d68868f9db3@syzkaller.appspotmail.com
Suggested-by: Yonghong Song <yonghong.song@linux.dev>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240328185801.1843078-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'bpf-fix-a-couple-of-test-failures-with-lto-kernel'

Yonghong Song says:

====================
bpf: Fix a couple of test failures with LTO kernel

With a LTO kernel built with clang, with one of earlier version of kernel,
I encountered two test failures, ksyms and kprobe_multi_bench_attach/kernel.
Now with latest bpf-next, only kprobe_multi_bench_attach/kernel failed.
But it is possible in the future ksyms selftest may fail again.

Both test failures are due to static variable/function renaming
due to cross-file inlining. For Ksyms failure, the solution is
to strip .llvm.<hash> suffixes for symbols in /proc/kallsyms before
comparing against the ksym in bpf program.
For kprobe_multi_bench_attach/kernel failure, the solution is
to either provide names in /proc/kallsyms to the kernel or
ignore those names who have .llvm.<hash> suffix since the kernel
sym name comparison is against /proc/kallsyms.

Please see each individual patches for details.

Changelogs:
  v2 -> v3:
    - no need to check config file, directly so strstr with '.llvm.'.
    - for kprobe_multi_bench with syms, instead of skipping the syms,
      consult /proc/kallyms to find corresponding names.
    - add a test with populating addrs to the kernel for kprobe
      multi attach.
  v1 -> v2:
    - Let libbpf handle .llvm.<hash suffixes since it may impact
      bpf program ksym.
====================

Link: https://lore.kernel.org/r/20240326041443.1197498-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add a kprobe_multi subtest to use addrs instead of syms

Get addrs directly from available_filter_functions_addrs and
send to the kernel during kprobe_multi_attach. This avoids
consultation of /proc/kallsyms. But available_filter_functions_addrs
is introduced in 6.5, i.e., it is introduced recently,
so I skip the test if the kernel does not support it.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041523.1200301-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Fix kprobe_multi_bench_attach test failure with LTO kernel

In my locally build clang LTO kernel (enabling CONFIG_LTO and
CONFIG_LTO_CLANG_THIN), kprobe_multi_bench_attach/kernel subtest
failed like:
  test_kprobe_multi_bench_attach:PASS:get_syms 0 nsec
  test_kprobe_multi_bench_attach:PASS:kprobe_multi_empty__open_and_load 0 nsec
  libbpf: prog 'test_kprobe_empty': failed to attach: No such process
  test_kprobe_multi_bench_attach:FAIL:bpf_program__attach_kprobe_multi_opts unexpected error: -3
  #117/1   kprobe_multi_bench_attach/kernel:FAIL

There are multiple symbols in /sys/kernel/debug/tracing/available_filter_functions
are renamed in /proc/kallsyms due to cross file inlining. One example is for
  static function __access_remote_vm in mm/memory.c.
In a non-LTO kernel, we have the following call stack:
  ptrace_access_vm (global, kernel/ptrace.c)
    access_remote_vm (global, mm/memory.c)
      __access_remote_vm (static, mm/memory.c)

With LTO kernel, it is possible that access_remote_vm() is inlined by
ptrace_access_vm(). So we end up with the following call stack:
  ptrace_access_vm (global, kernel/ptrace.c)
    __access_remote_vm (static, mm/memory.c)
The compiler renames __access_remote_vm to __access_remote_vm.llvm.<hash>
to prevent potential name collision.

The kernel bpf_kprobe_multi_link_attach() and ftrace_lookup_symbols() try
to find addresses based on /proc/kallsyms, hence the current test failed
with LTO kenrel.

This patch consulted /proc/kallsyms to find the corresponding entries
for the ksym and this solved the issue.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041518.1199758-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add {load,search}_kallsyms_custom_local()

These two functions allow selftests to do loading/searching
kallsyms based on their specific compare functions.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041513.1199440-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Refactor trace helper func load_kallsyms_local()

Refactor trace helper function load_kallsyms_local() such that
it invokes a common function with a compare function as input.
The common function will be used later for other local functions.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041508.1199239-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Refactor some functions for kprobe_multi_test

Refactor some functions in kprobe_multi_test.c to extract
some helper functions who will be used in later patches
to avoid code duplication.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041503.1198982-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: Handle <orig_name>.llvm.<hash> symbol properly

With CONFIG_LTO_CLANG_THIN enabled, with some of previous
version of kernel code base ([1]), I hit the following
error:
   test_ksyms:PASS:kallsyms_fopen 0 nsec
   test_ksyms:FAIL:ksym_find symbol 'bpf_link_fops' not found
   #118     ksyms:FAIL

The reason is that 'bpf_link_fops' is renamed to
   bpf_link_fops.llvm.8325593422554671469
Due to cross-file inlining, the static variable 'bpf_link_fops'
in syscall.c is used by a function in another file. To avoid
potential duplicated names, the llvm added suffix
'.llvm.<hash>' ([2]) to 'bpf_link_fops' variable.
Such renaming caused a problem in libbpf if 'bpf_link_fops'
is used in bpf prog as a ksym but 'bpf_link_fops' does not
match any symbol in /proc/kallsyms.

To fix this issue, libbpf needs to understand that suffix '.llvm.<hash>'
is caused by clang lto kernel and to process such symbols properly.

With latest bpf-next code base built with CONFIG_LTO_CLANG_THIN,
I cannot reproduce the above failure any more. But such an issue
could happen with other symbols or in the future for bpf_link_fops symbol.

For example, with my current kernel, I got the following from
/proc/kallsyms:
  ffffffff84782154 d __func__.net_ratelimit.llvm.6135436931166841955
  ffffffff85f0a500 d tk_core.llvm.726630847145216431
  ffffffff85fdb960 d __fs_reclaim_map.llvm.10487989720912350772
  ffffffff864c7300 d fake_dst_ops.llvm.54750082607048300

I could not easily create a selftest to test newly-added
libbpf functionality with a static C test since I do not know
which symbol is cross-file inlined. But based on my particular kernel,
the following test change can run successfully.

>  diff --git a/tools/testing/selftests/bpf/prog_tests/ksyms.c b/tools/testing/selftests/bpf/prog_tests/ksyms.c
>  index 6a86d1f07800..904a103f7b1d 100644
>  --- a/tools/testing/selftests/bpf/prog_tests/ksyms.c
>  +++ b/tools/testing/selftests/bpf/prog_tests/ksyms.c
>  @@ -42,6 +42,7 @@ void test_ksyms(void)
>          ASSERT_EQ(data->out__bpf_link_fops, link_fops_addr, "bpf_link_fops");
>          ASSERT_EQ(data->out__bpf_link_fops1, 0, "bpf_link_fops1");
>          ASSERT_EQ(data->out__btf_size, btf_size, "btf_size");
>  +       ASSERT_NEQ(data->out__fake_dst_ops, 0, "fake_dst_ops");
>          ASSERT_EQ(data->out__per_cpu_start, per_cpu_start_addr, "__per_cpu_start");
>
>   cleanup:
>  diff --git a/tools/testing/selftests/bpf/progs/test_ksyms.c b/tools/testing/selftests/bpf/progs/test_ksyms.c
>  index 6c9cbb5a3bdf..fe91eef54b66 100644
>  --- a/tools/testing/selftests/bpf/progs/test_ksyms.c
>  +++ b/tools/testing/selftests/bpf/progs/test_ksyms.c
>  @@ -9,11 +9,13 @@ __u64 out__bpf_link_fops = -1;
>   __u64 out__bpf_link_fops1 = -1;
>   __u64 out__btf_size = -1;
>   __u64 out__per_cpu_start = -1;
>  +__u64 out__fake_dst_ops = -1;
>
>   extern const void bpf_link_fops __ksym;
>   extern const void __start_BTF __ksym;
>   extern const void __stop_BTF __ksym;
>   extern const void __per_cpu_start __ksym;
>  +extern const void fake_dst_ops __ksym;
>   /* non-existing symbol, weak, default to zero */
>   extern const void bpf_link_fops1 __ksym __weak;
>
>  @@ -23,6 +25,7 @@ int handler(const void *ctx)
>          out__bpf_link_fops = (__u64)&bpf_link_fops;
>          out__btf_size = (__u64)(&__stop_BTF - &__start_BTF);
>          out__per_cpu_start = (__u64)&__per_cpu_start;
>  +       out__fake_dst_ops = (__u64)&fake_dst_ops;
>
>          out__bpf_link_fops1 = (__u64)&bpf_link_fops1;

This patch fixed the issue in libbpf such that
the suffix '.llvm.<hash>' will be ignored during comparison of
bpf prog ksym vs. symbols in /proc/kallsyms, this resolved the issue.
Currently, only static variables in /proc/kallsyms are checked
with '.llvm.<hash>' suffix since in bpf programs function ksyms
with '.llvm.<hash>' suffix are most likely kfunc's and unlikely
to be cross-file inlined.

Note that currently kernel does not support gcc build with lto.

  [1] https://lore.kernel.org/bpf/20240302165017.1627295-1-yonghong.song@linux.dev/
  [2] https://github.com/llvm/llvm-project/blob/release/18.x/llvm/include/llvm/IR/ModuleSummaryIndex.h#L1714-L1719

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041458.1198161-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: Mark libbpf_kallsyms_parse static function

Currently libbpf_kallsyms_parse() function is declared as a global
function but actually it is not a API and there is no external
users in bpftool/bpf-selftests. So let us mark the function as
static.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041453.1197949-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Replace CHECK with ASSERT macros for ksyms test

Replace CHECK with ASSERT macros for ksyms tests.
This test failed earlier with clang lto kernel, but the
issue is gone with latest code base. But replacing
CHECK with ASSERT still improves code as ASSERT is
preferred in selftests.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041448.1197812-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Test loading bpf-tcp-cc prog calling the kernel tcp-cc kfuncs

This patch adds a test to ensure all static tcp-cc kfuncs is visible to
the struct_ops bpf programs. It is checked by successfully loading
the struct_ops programs calling these tcp-cc kfuncs.

This patch needs to enable the CONFIG_TCP_CONG_DCTCP and
the CONFIG_TCP_CONG_BBR.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240322191433.4133280-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Remove CONFIG_X86 and CONFIG_DYNAMIC_FTRACE guard from the tcp-cc kfuncs

The commit 7aae231ac93b ("bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE")
added CONFIG_DYNAMIC_FTRACE guard because pahole was only generating
btf for ftrace-able functions. The ftrace filter had already been
removed from pahole, so the CONFIG_DYNAMIC_FTRACE guard can be
removed.

The commit 569c484f9995 ("bpf: Limit static tcp-cc functions in the .BTF_ids list to x86")
has added CONFIG_X86 guard because it failed the powerpc arch which
prepended a "." to the local static function, so "cubictcp_init" becomes
".cubictcp_init". "__bpf_kfunc" has been added to kfunc
since then and it uses the __unused compiler attribute.
There is an existing
"__bpf_kfunc static u32 bpf_kfunc_call_test_static_unused_arg(u32 arg, u32 unused)"
test in bpf_testmod.c to cover the static kfunc case.

cross compile on ppc64 with CONFIG_DYNAMIC_FTRACE disabled:
> readelf -s vmlinux | grep cubictcp_
56938: c00000000144fd00   184 FUNC    LOCAL  DEFAULT    2 cubictcp_cwnd_event     [<localentry>: 8]
56939: c00000000144fdb8   200 FUNC    LOCAL  DEFAULT    2 cubictcp_recalc_[...]   [<localentry>: 8]
56940: c00000000144fe80   296 FUNC    LOCAL  DEFAULT    2 cubictcp_init     [<localentry>: 8]
56941: c00000000144ffa8   228 FUNC    LOCAL  DEFAULT    2 cubictcp_state     [<localentry>: 8]
56942: c00000000145008c  1908 FUNC    LOCAL  DEFAULT    2 cubictcp_cong_avoid  [<localentry>: 8]
56943: c000000001450800  1644 FUNC    LOCAL  DEFAULT    2 cubictcp_acked     [<localentry>: 8]

> bpftool btf dump file vmlinux | grep cubictcp_
[51540] FUNC 'cubictcp_acked' type_id=38137 linkage=static
[51541] FUNC 'cubictcp_cong_avoid' type_id=38122 linkage=static
[51543] FUNC 'cubictcp_cwnd_event' type_id=51542 linkage=static
[51544] FUNC 'cubictcp_init' type_id=9186 linkage=static
[51545] FUNC 'cubictcp_recalc_ssthresh' type_id=35021 linkage=static
[51547] FUNC 'cubictcp_state' type_id=38141 linkage=static

The patch removed both config guards.

Cc: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240322191433.4133280-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Mitigate latency spikes associated with freeing non-preallocated htab

Following the recent upgrade of one of our BPF programs, we encountered
significant latency spikes affecting other applications running on the same
host. After thorough investigation, we identified that these spikes were
primarily caused by the prolonged duration required to free a
non-preallocated htab with approximately 2 million keys.

Notably, our kernel configuration lacks the presence of CONFIG_PREEMPT. In
scenarios where kernel execution extends excessively, other threads might
be starved of CPU time, resulting in latency issues across the system. To
mitigate this, we've adopted a proactive approach by incorporating
cond_resched() calls within the kernel code. This ensures that during
lengthy kernel operations, the scheduler is invoked periodically to provide
opportunities for other threads to execute.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240327032022.78391-1-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'bench-fast-in-kernel-triggering-benchmarks'

Andrii Nakryiko says:

====================
bench: fast in-kernel triggering benchmarks

Remove "legacy" triggering benchmarks which rely on syscalls (and thus syscall
overhead is a noticeable part of benchmark, unfortunately). Replace them with
faster versions that rely on triggering BPF programs in-kernel through another
simple "driver" BPF program. See patch #2 with comparison results.

raw_tp/tp/fmodret benchmarks required adding a simple kfunc in kernel to be
able to trigger a simple tracepoint from BPF program (plus it is also allowed
to be replaced by fmod_ret programs). This limits raw_tp/tp/fmodret benchmarks
to new kernels only, but it keeps bench tool itself very portable and most of
other benchmarks will still work on wide variety of kernels without the need
to worry about building and deploying custom kernel module. See patches #5
and #6 for details.

v1->v2:
  - move new TP closer to BPF test run code;
  - rename/move kfunc and register it for fmod_rets (Alexei);
  - limit --trig-batch-iters param to [1, 1000] (Alexei).
====================

Link: https://lore.kernel.org/r/20240326162151.3981687-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: add batched tp/raw_tp/fmodret tests

Utilize bpf_modify_return_test_tp() kfunc to have a fast way to trigger
tp/raw_tp/fmodret programs from another BPF program, which gives us
comparable batched benchmarks to (batched) kprobe/fentry benchmarks.

We don't switch kprobe/fentry batched benchmarks to this kfunc to make
bench tool usable on older kernels as well.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: add bpf_modify_return_test_tp() kfunc triggering tracepoint

Add a simple bpf_modify_return_test_tp() kfunc, available to all program
types, that is useful for various testing and benchmarking scenarios, as
it allows to trigger most tracing BPF program types from BPF side,
allowing to do complex testing and benchmarking scenarios.

It is also attachable to for fmod_ret programs, making it a good and
simple way to trigger fmod_ret program under test/benchmark.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: lazy-load trigger bench BPF programs

Instead of front-loading all possible benchmarking BPF programs for
trigger benchmarks, explicitly specify which BPF programs are used by
specific benchmark and load only it.

This allows to be more flexible in supporting older kernels, where some
program types might not be possible to load (e.g., those that rely on
newly added kfunc).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: remove syscall-driven benchs, keep syscall-count only

Remove "legacy" benchmarks triggered by syscalls in favor of newly added
in-kernel/batched benchmarks. Drop -batched suffix now as well.
Next patch will restore "feature parity" by adding back
tp/raw_tp/fmodret benchmarks based on in-kernel kfunc approach.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: add batched, mostly in-kernel BPF triggering benchmarks

Existing kprobe/fentry triggering benchmarks have 1-to-1 mapping between
one syscall execution and BPF program run. While we use a fast
get_pgid() syscall, syscall overhead can still be non-trivial.

This patch adds kprobe/fentry set of benchmarks significantly amortizing
the cost of syscall vs actual BPF triggering overhead. We do this by
employing BPF_PROG_TEST_RUN command to trigger "driver" raw_tp program
which does a tight parameterized loop calling cheap BPF helper
(bpf_get_numa_node_id()), to which kprobe/fentry programs are
attached for benchmarking.

This way 1 bpf() syscall causes N executions of BPF program being
benchmarked. N defaults to 100, but can be adjusted with
--trig-batch-iters CLI argument.

For comparison we also implement a new baseline program that instead of
triggering another BPF program just does N atomic per-CPU counter
increments, establishing the limit for all other types of program within
this batched benchmarking setup.

Taking the final set of benchmarks added in this patch set (including
tp/raw_tp/fmodret, added in later patch), and keeping for now "legacy"
syscall-driven benchmarks, we can capture all triggering benchmarks in
one place for comparison, before we remove the legacy ones (and rename
xxx-batched into just xxx).

$ benchs/run_bench_trigger.sh
usermode-count       :   79.500 ± 0.024M/s
kernel-count         :   49.949 ± 0.081M/s
syscall-count        :    9.009 ± 0.007M/s

fentry-batch         :   31.002 ± 0.015M/s
fexit-batch          :   20.372 ± 0.028M/s
fmodret-batch        :   21.651 ± 0.659M/s
rawtp-batch          :   36.775 ± 0.264M/s
tp-batch             :   19.411 ± 0.248M/s
kprobe-batch         :   12.949 ± 0.220M/s
kprobe-multi-batch   :   15.400 ± 0.007M/s
kretprobe-batch      :    5.559 ± 0.011M/s
kretprobe-multi-batch:    5.861 ± 0.003M/s

fentry-legacy        :    8.329 ± 0.004M/s
fexit-legacy         :    6.239 ± 0.003M/s
fmodret-legacy       :    6.595 ± 0.001M/s
rawtp-legacy         :    8.305 ± 0.004M/s
tp-legacy            :    6.382 ± 0.001M/s
kprobe-legacy        :    5.528 ± 0.003M/s
kprobe-multi-legacy  :    5.864 ± 0.022M/s
kretprobe-legacy     :    3.081 ± 0.001M/s
kretprobe-multi-legacy:   3.193 ± 0.001M/s

Note how xxx-batch variants are measured with significantly higher
throughput, even though it's exactly the same in-kernel overhead. As
such, results can be compared only between benchmarks of the same kind
(syscall vs batched):

fentry-legacy        :    8.329 ± 0.004M/s
fentry-batch         :   31.002 ± 0.015M/s

kprobe-multi-legacy  :    5.864 ± 0.022M/s
kprobe-multi-batch   :   15.400 ± 0.007M/s

Note also that syscall-count is setting a theoretical limit for
syscall-triggered benchmarks, while kernel-count is setting similar
limits for batch variants. usermode-count is a happy and unachievable
case of user space counting without doing any syscalls, and is mostly
the measure of CPU speed for such a trivial benchmark.

As was mentioned, tp/raw_tp/fmodret require kernel-side kfunc to produce
similar benchmark, which we address in a separate patch.

Note that run_bench_trigger.sh allows to override a list of benchmarks
to run, which is very useful for performance work.

Cc: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: rename and clean up userspace-triggered benchmarks

Rename uprobe-base to more precise usermode-count (it will match other
baseline-like benchmarks, kernel-count and syscall-count). Also use
BENCH_TRIG_USERMODE() macro to define all usermode-based triggering
benchmarks, which include usermode-count and uprobe/uretprobe benchmarks.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf,arena: Use helper sizeof_field in struct accessors

Use the well defined helper sizeof_field() to calculate the size of a
struct member, instead of doing custom calculations.

Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Link: https://lore.kernel.org/r/20240327065334.8140-1-haiyue.wang@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: improve error message for unsupported helper

BPF verifier emits "unknown func" message when given BPF program type
does not support BPF helper. This message may be confusing for users, as
important context that helper is unknown only to current program type is
not provided.

This patch changes message to "program of this type cannot use helper "
and aligns dependent code in libbpf and tests. Any suggestions on
improving/changing this message are welcome.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/r/20240325152210.377548-1-yatsenko@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add a check for struct bpf_fib_lookup size

The struct bpf_fib_lookup should not grow outside of its 64 bytes.
Add a static assert to validate this.

Suggested-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240326101742.17421-4-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add BPF_FIB_LOOKUP_MARK tests

This patch extends the fib_lookup test suite by adding a few test
cases for each IP family to test the new BPF_FIB_LOOKUP_MARK flag
to the bpf_fib_lookup:

* Test destination IP address selection with and without a mark
and/or the BPF_FIB_LOOKUP_MARK flag set

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240326101742.17421-3-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add support for passing mark with bpf_fib_lookup

Extend the bpf_fib_lookup() helper by making it to utilize mark if
the BPF_FIB_LOOKUP_MARK flag is set. In order to pass the mark the
four bytes of struct bpf_fib_lookup are used, shared with the
output-only smac/dmac fields.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240326101742.17421-2-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'ravb-support-describing-the-mdio-bus'

Niklas Söderlund says:

====================
ravb: Support describing the MDIO bus

This series adds support to the binding and driver of the Renesas
Ethernet AVB to described the MDIO bus. Currently the driver uses
the OF node of the device itself when registering the MDIO bus.
This forces any MDIO bus properties the MDIO core should react on
to be set on the device OF node. This is confusing and none of
the MDIO bus properties are described in the Ethernet AVB bindings.

Patch 1/2 extends the bindings with an optional mdio child-node
to the device that can be used to contain the MDIO bus settings.
While patch 2/2 changes the driver to use this node (if present)
when registering the MDIO bus.

If the new optional mdio child-node is not present the driver
fallback to the old behavior and uses the device OF node like before.
This change is fully backward compatible with existing usage
of the bindings.
====================

Link: https://lore.kernel.org/r/20240325153451.2366083-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ravb: Add support for an optional MDIO mode

The driver used the DT node of the device itself when registering the
MDIO bus. While this works, it creates a problem: it forces any MDIO bus
properties to also be set on the devices DT node. This mixes the
properties of two distinctly different things and is confusing.

This change adds support for an optional mdio node to be defined as a
child to the device DT node. The child node can then be used to describe
MDIO bus properties that the MDIO core can act on when registering the
bus.

If no mdio child node is found the driver fallback to the old behavior
and register the MDIO bus using the device DT node. This change is
backward compatible with old bindings in use.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240325153451.2366083-3-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: renesas,etheravb: Add optional MDIO bus node

The Renesas Ethernet AVB bindings do not allow the MDIO bus to be
described. This has not been needed as only a single PHY is
supported and no MDIO bus properties have been needed.

Add an optional mdio node to the binding which allows the MDIO bus to be
described and allow bus properties to be set.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20240325153451.2366083-2-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'doc-netlink-specs-add-vlan-support'

Hangbin Liu says:

====================
doc/netlink/specs: Add vlan support

Add vlan support in rt_link spec.
====================

Link: https://lore.kernel.org/r/20240327123130.1322921-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink/specs: Add vlan attr in rt_link spec

With command:
# ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/rt_link.yaml \
--do getlink --json '{"ifname": "eno1.2"}' --output-json | \
jq -C '.linkinfo'

Before:
Exception: No message format for 'vlan' in sub-message spec 'linkinfo-data-msg'

After:
{
   "kind": "vlan",
   "data": {
     "protocol": "8021q",
     "id": 2,
     "flag": {
       "flags": [
         "reorder-hdr"
       ],
       "mask": "0xffffffff"
     },
     "egress-qos": {
       "mapping": [
         {
           "from": 1,
           "to": 2
         },
         {
           "from": 4,
           "to": 4
         }
       ]
     }
   }
}

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240327123130.1322921-3-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ynl: support hex display_hint for integer

Some times it would be convenient to read the integer as hex, like
mask values.

Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240327123130.1322921-2-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-fixes-for-kernel-ci'

Petr Machata says:

====================
selftests: Fixes for kernel CI

As discussed on the bi-weekly call on Jan 30, and in mailing around
kernel CI effort, some changes are desirable in the suite of forwarding
selftests the better to work with the CI tooling. Namely:

- The forwarding selftests use a configuration file where names of
  interfaces are defined and various variables can be overridden. There
  is also forwarding.config.sample that users can use as a template to
  refer to when creating the config file. What happens a fair bit is
  that users either do not know about this at all, or simply forget, and
  are confused by cryptic failures about interfaces that cannot be
  created.

  In patches #1 - #3 have lib.sh just be the single source of truth with
  regards to which variables exist. That includes the topology variables
  which were previously only in the sample file, and any "tweak
  variables", such as what tools to use, sleep times, etc.

  forwarding.config.sample then becomes just a placeholder with a couple
  examples. Unless specific HW should be exercised, or specific tools
  used, the defaults are usually just fine.

- Several net/forwarding/ selftests (and one net/ one) cannot be run on
  veth pairs, they need an actual HW interface to run on. They are
  generic in the sense that any capable HW should pass them, which is
  why they have been put to net/forwarding/ as opposed to drivers/net/,
  but they do not generalize to veth. The fact that these tests are in
  net/forwarding/, but still complaining when run, is confusing.

  In patches #4 - #6 move these tests to a new directory
  drivers/net/hw.

- The following patches extend the codebase to handle well test results
  other than pass and fail.

  Patch #7 is preparatory. It converts several log_test_skip to XFAIL,
  so that tests do not spuriously end up returning non-0 when they
  are not supposed to.

  In patches #8 - #10, introduce some missing ksft constants, then support
  having those constants in RET, and then finally in EXIT_STATUS.

- The traffic scheduler tests generate a large amount of network traffic
  to test the behavior of the scheduler. This demands a relatively
  high-performance computer. On slow machines, such as with a debugging
  kernel, the test would spuriously fail.

  It can still be useful to "go through the motions" though, to possibly
  catch bugs in setup of the scheduler graph and passing packets around.
  Thus we still want to run the tests, just with lowered demands.

  To that end, in patches #11 - #12, introduce an environment variable
  KSFT_MACHINE_SLOW, with obvious meaning. Tests can then make checks
  more lenient, such as mark failures as XFAIL. A helper, xfail_on_slow,
  is provided to mark performance-sensitive parts of the selftest.

- In patch #13, use a similar mechanism to mark a NH group stats
  selftest to XFAIL HW stats tests when run on VETH pairs.

- All these changes complicate the hitherto straightforward logging and
  checking logic, so in patch #14, add a selftest that checks this
  functionality in lib.sh.

v1 (vs. an RFC circulated through linux-kselftest):
- Patch #9:
    - Clarify intended usage by s/set_ret/ret_set_ksft_status/,
      s/nret/ksft_status/
====================

Link: https://lore.kernel.org/r/cover.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Add a test for testing lib.sh functionality

Rerunning various scenarios to make sure lib.sh changes do not impact the
observable behavior is no fun. Add a selftest at least for the bare basics
-- the mechanics of setting RET, retmsg, and EXIT_STATUS.

Since the selftest itself uses lib.sh, it would be possible to break lib.sh
in such a way that invalidates result of the selftest. Since the metatest
only uses the bare basics (just pass/fail), hopefully such fundamental
breakages would be noticed.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/6d25cedbf2d4b83614944809a34fe023fbe8db38.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: router_mpath_nh_lib: Don't skip, xfail on veth

When the NH group stats tests are currently run on a veth topology, the
HW-stats leg of each test is SKIP'ped. But kernel networking CI interprets
skips as a sign that tooling is missing, and prompts maintainer
investigation. Lack of capability to pass a test should be expressed as
XFAIL.

Selftests that require HW should normally be put in drivers/net/hw, but
doing so for the NH counter selftests would just lead to a lot of
duplicity.

So instead, introduce a helper, xfail_on_veth(), which can be used to mark
selftests that should XFAIL instead of FAILing when run on a veth topology.
On non-veth topology, they don't do anything.

Use the helper in the HW-stats part of router_mpath_nh_lib selftest.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/15f0ab9637aa0497f164ec30e83c1c8f53d53719.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Mark performance-sensitive tests

When run on a slow machine, the scheduler traffic tests can be expected to
fail, and should be reported as XFAIL in that case. Therefore run these
tests through the perf_sensitive wrapper.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/9a357f8cf34f5ececac08d43a3eb023008996035.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Support for performance sensitive tests

Several tests in the suite use large amounts of traffic to e.g. cause
congestion and evaluate RED or shaper performance. These tests will not run
well on a slow machine, be it one with heavy debug kernel, or a VM, or e.g.
a single-board computer. Allow users to specify an environment variable,
KSFT_MACHINE_SLOW=yes, to indicate that the tests are being run on one such
machine.

Performance sensitive tests can then use a new helper, xfail_on_slow(), to
mark parts of the test that are sensitive to low-performance machines.
The helper can be used to just mark the whole suite, like so:

xfail_on_slow tests_run

... or, on the other side of the granularity spectrum, to override
individual checks:

xfail_on_slow check_err $? "Expected much, got little."

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/99a376a2d2ffdaeee7752b1910cb0c3ea5d80fbe.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Convert log_test() to recognize RET values

In a previous patch, the interpretation of RET value was changed to mean
the kselftest framework constant with the test outcome: $ksft_pass,
$ksft_xfail, etc.

Update log_test() to recognize the various possible RET values.

Then have EXIT_STATUS track the RET value of the current test. This differs
subtly from the way RET tracks the value: while for RET we want to
recognize XFAIL as a separate status, for purposes of exit code, we want to
to conflate XFAIL and PASS, because they both communicate non-failure. Thus
add a new helper, ksft_exit_status_merge().

With this log_test_skip() and log_test_xfail() can be reexpressed as thin
wrappers around log_test.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/e5f807cb5476ab795fd14ac74da53a731a9fc432.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Have RET track kselftest framework constants

The variable RET keeps track of whether the test under execution has so far
failed or not. Currently it works in binary fashion: zero means everything
is fine, non-zero means something failed. log_test() then uses the value to
given a human-readable message.

In order to allow log_test() to report skips and xfails, the semantics of
RET need to be more fine-grained. Therefore have RET value be one of
kselftest framework constants: $ksft_fail, $ksft_xfail, etc.

The current logic in check_err() is such that first non-zero value of RET
trumps all those that follow. But that is not right when RET has more
fine-grained value semantics. Different outcomes have different weights.

The results of PASS and XFAIL are mostly the same: they both communicate a
test that did not go wrong. SKIP communicates lack of tooling, which the
user should go and try to fix, and as such should not be overridden by the
passes. So far, the higher-numbered statuses can be considered weightier.
But FAIL should be the weightiest.

Add a helper, ksft_status_merge(), which merges two statuses in a way that
respects the above conditions. Express it in a generic manner, because exit
status merge is subtly different, and we want to reuse the same logic.

Use the new helper when setting RET in check_err().

Re-express check_fail() in terms of check_err() to avoid duplication.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/7dfff51cc925c7a3ac879b9050a0d6a327c8d21f.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: lib: Define more kselftest exit codes

The following patches will operate with more exit codes besides
ksft_skip. Add them here.

Additionally, move a duplicated skip exit code definition from
forwarding/tc_tunnel_key.sh. Keep a similar duplicate in
forwarding/devlink_lib.sh, because even though lib.sh will have
been sourced in all cases where devlink_lib is, the inclusion is not
visible in the file itself, and relying on it would be confusing.

Cc: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/545a03046c7aca0628a51a389a9b81949ab288ce.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Change inappropriate log_test_skip() calls

The SKIP return should be used for cases where tooling of the machine under
test is lacking. For cases where HW is lacking, the appropriate outcome is
XFAIL.

This is the case with ethtool_rmon and mlxsw_lib. For these, introduce a
new helper, log_test_xfail().

Do the same for router_mpath_nh_lib. Note that it will be fixed using a
more reusable way in a following patch.

For the two resource_scale selftests, the log should simply not be written,
because there is no problem.

Cc: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/3d668d8fb6fa0d9eeb47ce6d9e54114348c7c179.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Ditch skip_on_veth()

Since the selftests that are not supposed to run on veth pairs are now in
their own dedicated directory, the skip_on_veth logic can go away. Drop it
from the selftests, and from lib.sh.

Cc: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/63b470e10d65270571ee7de709b31672ce314872.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Move several selftests

The tests in net/forwarding are generally expected to be HW-independent.
There are however several tests that, while not depending on any HW in
particular, nevertheless depend on being used on HW interfaces. Placing
these selftests to net/forwarding is confusing, because the selftest will
just report it can't be run on veth pairs. At the same time, placing them
to a particular driver's selftests subdirectory would be wrong.

Instead, add a new directory, drivers/net/hw, where these generic but HW
independent selftests should be placed. Move over several such tests
including one helper library.

Since typically these tests will not be expected to run, omit the directory
drivers/net/hw from the TARGETS list in selftests/Makefile. Retain a
Makefile in the new directory itself, so that a user can make -C into that
directory and act on those tests explicitly.

Cc: Roger Quadros <rogerq@kernel.org>
Cc: Tobias Waldekranz <tobias@waldekranz.com>
Cc: Danielle Ratson <danieller@nvidia.com>
Cc: Davide Caratti <dcaratti@redhat.com>
Cc: Johannes Nixdorf <jnixdorf-oss@avm.de>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/e11dae1f62703059e9fc2240004288ac7cc15756.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: ipip_lib: Do not import lib.sh

This library is always sourced in the context where lib.sh has already been
sourced as well. Therefore drop the explicit sourcing and expect the client
to already have done it. This will simplify moving some of the clients to a
different directory.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/a4da5e9cd42a34cbace917a048ca71081719d6ac.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: README: Document customization

That any sort of customization is possible at all, let alone how it should
be done, is currently not at all clear. Document the whats and hows in
README.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Link: https://lore.kernel.org/r/e819623af6aaeea49e9dc36cecd95694fad73bb8.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding.config.sample: Move overrides to lib.sh

forwarding.config.sample, net/lib.sh and net/forwarding/lib.sh contain
definitions and redefinitions of some of the same variables. The overlap
between net/forwarding/lib.sh and forwarding.config.sample is especially
large. This duplication is a potential source of confusion and problems.

It would be overall less error prone if each variable were defined in one
place only. In this patch set, that place is the library itself. Therefore
move all comments from forwarding.config.sample to net/forwarding/lib.sh.

Move over also a definition of TC_FLAG, which was missing from lib.sh
entirely.

Additionally, add to lib.sh a default definition of the topology variables.
The logic behind this is that forgetting to specify forwarding.config was a
frequent source of frustration for the selftest users. But really, most of
the time the default veth based topology is just fine. We considered just
sourcing forwarding.config.sample instead if forwarding.config is not
available, but this is a cleaner solution.

That means the syntax of the forwarding.config.sample override has to
change to an array assignment, so that the whole variable is overwritten,
not just individual keys, which could leave the value of some keys
unchanged. Do the same in lib.sh for any cut'n'pasters out there.

The config file is then given a sort of carte blanche to redefine whatever
variables it sees fit from the libraries. This is described in a comment in
the file. Only a handful of variables are left behind, to illustrate the
customization.

The fact that the variables are now missing from forwarding.config.sample,
and therefore would miss from forwarding.config derived from that file as
well, should not change anything. This is just the sample file. Users that
keep their own forwarding.config would retain it as before.

The only observable change is introduction of TC_FLAG to lib.sh, because
now the filters would not be attempted to install to HW datapath. For veth
pairs this does not change anything. For HW deployments, users presumably
have forwarding.config with this value overridden.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Link: https://lore.kernel.org/r/b9b8a11a22821a7aa532211ff461a34f596e26bf.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: libs: Change variable fallback syntax

The current syntax of X=${X:=X} first evaluates the ${X:=Y} expression,
which either uses the existing value of $X if there is one, or uses the
value of "Y" as a fallback, and assigns it to X. The expression is then
replaced with the now-current value of $X. Assigning that value to X once
more is meaningless.

So avoid the outer X=... bit, and instead express the same idea though the
do-nothing ":" built-in as : "${X:=Y}". This also cleans up the block
nicely and makes it more readable.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Link: https://lore.kernel.org/r/1890ddc58420c2c0d5ba3154c87ecc6d9faf6947.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git./linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

No conflicts, or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.9-rc2' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from bpf, WiFi and netfilter.

  Current release - regressions:

   - ipv6: fix address dump when IPv6 is disabled on an interface

  Current release - new code bugs:

   - bpf: temporarily disable atomic operations in BPF arena

   - nexthop: fix uninitialized variable in nla_put_nh_group_stats()

  Previous releases - regressions:

   - bpf: protect against int overflow for stack access size

   - hsr: fix the promiscuous mode in offload mode

   - wifi: don't always use FW dump trig

   - tls: adjust recv return with async crypto and failed copy to
     userspace

   - tcp: properly terminate timers for kernel sockets

   - ice: fix memory corruption bug with suspend and rebuild

   - at803x: fix kernel panic with at8031_probe

   - qeth: handle deferred cc1

  Previous releases - always broken:

   - bpf: fix bug in BPF_LDX_MEMSX

   - netfilter: reject table flag and netdev basechain updates

   - inet_defrag: prevent sk release while still in use

   - wifi: pick the version of SESSION_PROTECTION_NOTIF

   - wwan: t7xx: split 64bit accesses to fix alignment issues

   - mlxbf_gige: call request_irq() after NAPI initialized

   - hns3: fix kernel crash when devlink reload during pf
     initialization"

* tag 'net-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  inet: inet_defrag: prevent sk release while still in use
  Octeontx2-af: fix pause frame configuration in GMP mode
  net: lan743x: Add set RFE read fifo threshold for PCI1x1x chips
  net: bcmasp: Remove phy_{suspend/resume}
  net: bcmasp: Bring up unimac after PHY link up
  net: phy: qcom: at803x: fix kernel panic with at8031_probe
  netfilter: arptables: Select NETFILTER_FAMILY_ARP when building arp_tables.c
  netfilter: nf_tables: skip netdev hook unregistration if table is dormant
  netfilter: nf_tables: reject table flag and netdev basechain updates
  netfilter: nf_tables: reject destroy command to remove basechain hooks
  bpf: update BPF LSM designated reviewer list
  bpf: Protect against int overflow for stack access size
  bpf: Check bloom filter map value size
  bpf: fix warning for crash_kexec
  selftests: netdevsim: set test timeout to 10 minutes
  net: wan: framer: Add missing static inline qualifiers
  mlxbf_gige: call request_irq() after NAPI initialized
  tls: get psock ref after taking rxlock to avoid leak
  selftests: tls: add test with a partially invalid iov
  tls: adjust recv return with async crypto and failed copy to userspace
  ...

inet: inet_defrag: prevent sk release while still in use

ip_local_out() and other functions can pass skb->sk as function argument.

If the skb is a fragment and reassembly happens before such function call
returns, the sk must not be released.

This affects skb fragments reassembled via netfilter or similar
modules, e.g. openvswitch or ct_act.c, when run as part of tx pipeline.

Eric Dumazet made an initial analysis of this bug.  Quoting Eric:
  Calling ip_defrag() in output path is also implying skb_orphan(),
  which is buggy because output path relies on sk not disappearing.

  A relevant old patch about the issue was :
  8282f27449bf ("inet: frag: Always orphan skbs inside ip_defrag()")

  [..]

  net/ipv4/ip_output.c depends on skb->sk being set, and probably to an
  inet socket, not an arbitrary one.

  If we orphan the packet in ipvlan, then downstream things like FQ
  packet scheduler will not work properly.

  We need to change ip_defrag() to only use skb_orphan() when really
  needed, ie whenever frag_list is going to be used.

Eric suggested to stash sk in fragment queue and made an initial patch.
However there is a problem with this:

If skb is refragmented again right after, ip_do_fragment() will copy
head->sk to the new fragments, and sets up destructor to sock_wfree.
IOW, we have no choice but to fix up sk_wmem accouting to reflect the
fully reassembled skb, else wmem will underflow.

This change moves the orphan down into the core, to last possible moment.
As ip_defrag_offset is aliased with sk_buff->sk member, we must move the
offset into the FRAG_CB, else skb->sk gets clobbered.

This allows to delay the orphaning long enough to learn if the skb has
to be queued or if the skb is completing the reasm queue.

In the former case, things work as before, skb is orphaned.  This is
safe because skb gets queued/stolen and won't continue past reasm engine.

In the latter case, we will steal the skb->sk reference, reattach it to
the head skb, and fix up wmem accouting when inet_frag inflates truesize.

Fixes: 7026b1ddb6b8 ("netfilter: Pass socket pointer down through okfn().")
Diagnosed-by: Eric Dumazet <edumazet@google.com>
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yue sun <samsun1006219@gmail.com>
Reported-by: syzbot+e5167d7144a62715044c@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240326101845.30836-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Octeontx2-af: fix pause frame configuration in GMP mode

The Octeontx2 MAC block (CGX) has separate data paths (SMU and GMP) for
different speeds, allowing for efficient data transfer.

The previous patch which added pause frame configuration has a bug due
to which pause frame feature is not working in GMP mode.

This patch fixes the issue by configurating appropriate registers.

Fixes: f7e086e754fe ("octeontx2-af: Pause frame configuration at cgx")
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240326052720.4441-1-hkelam@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: lan743x: Add set RFE read fifo threshold for PCI1x1x chips

PCI11x1x Rev B0 devices might drop packets when receiving back to back frames
at 2.5G link speed. Change the B0 Rev device's Receive filtering Engine FIFO
threshold parameter from its hardware default of 4 to 3 dwords to prevent the
problem. Rev C0 and later hardware already defaults to 3 dwords.

Fixes: bb4f6bffe33c ("net: lan743x: Add PCI11010 / PCI11414 device IDs")
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microchip.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240326065805.686128-1-Raju.Lakkaraju@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-bcmasp-phy-managements-fixes'

Justin Chen says:

====================
net: bcmasp: phy managements fixes

Fix two issues.

- The unimac may be put in a bad state if PHY RX clk doesn't exist
during reset. Work around this by bringing the unimac out of reset
during phy up.

- Remove redundant phy_{suspend/resume}
====================

Link: https://lore.kernel.org/r/20240325193025.1540737-1-justin.chen@broadcom.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>