Dave Marchevsky [Tue, 14 Feb 2023 00:40:17 +0000 (16:40 -0800)]
bpf, documentation: Add graph documentation for non-owning refs
It is difficult to intuit the semantics of owning and non-owning
references from verifier code. In order to keep the high-level details
from being lost in the mailing list, this patch adds documentation
explaining semantics and details.
The target audience of doc added in this patch is folks working on BPF
internals, as there's focus on "what should the verifier do here". Via
reorganization or copy-and-paste, much of the content can probably be
repurposed for BPF program writer audience as well.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-9-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:16 +0000 (16:40 -0800)]
selftests/bpf: Add rbtree selftests
This patch adds selftests exercising the logic changed/added in the
previous patches in the series. A variety of successful and unsuccessful
rbtree usages are validated:
Success:
* Add some nodes, let map_value bpf_rbtree_root destructor clean them
up
* Add some nodes, remove one using the non-owning ref leftover by
successful rbtree_add() call
* Add some nodes, remove one using the non-owning ref returned by
rbtree_first() call
Failure:
* BTF where bpf_rb_root owns bpf_list_node should fail to load
* BTF where node of type X is added to tree containing nodes of type Y
should fail to load
* No calling rbtree api functions in 'less' callback for rbtree_add
* No releasing lock in 'less' callback for rbtree_add
* No removing a node which hasn't been added to any tree
* No adding a node which has already been added to a tree
* No escaping of non-owning references past their lock's
critical section
* No escaping of non-owning references past other invalidation points
(rbtree_remove)
These tests mostly focus on rbtree-specific additions, but some of the
failure cases revalidate scenarios common to both linked_list and rbtree
which are covered in the former's tests. Better to be a bit redundant in
case linked_list and rbtree semantics deviate over time.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-8-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:15 +0000 (16:40 -0800)]
bpf: Add bpf_rbtree_{add,remove,first} decls to bpf_experimental.h
These kfuncs will be used by selftests in following patches
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-7-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:14 +0000 (16:40 -0800)]
bpf: Special verifier handling for bpf_rbtree_{remove, first}
Newly-added bpf_rbtree_{remove,first} kfuncs have some special properties
that require handling in the verifier:
* both bpf_rbtree_remove and bpf_rbtree_first return the type containing
the bpf_rb_node field, with the offset set to that field's offset,
instead of a struct bpf_rb_node *
* mark_reg_graph_node helper added in previous patch generalizes
this logic, use it
* bpf_rbtree_remove's node input is a node that's been inserted
in the tree - a non-owning reference.
* bpf_rbtree_remove must invalidate non-owning references in order to
avoid aliasing issue. Use previously-added
invalidate_non_owning_refs helper to mark this function as a
non-owning ref invalidation point.
* Unlike other functions, which convert one of their input arg regs to
non-owning reference, bpf_rbtree_first takes no arguments and just
returns a non-owning reference (possibly null)
* For now verifier logic for this is special-cased instead of
adding new kfunc flag.
This patch, along with the previous one, complete special verifier
handling for all rbtree API functions added in this series.
With functional verifier handling of rbtree_remove, under current
non-owning reference scheme, a node type with both bpf_{list,rb}_node
fields could cause the verifier to accept programs which remove such
nodes from collections they haven't been added to.
In order to prevent this, this patch adds a check to btf_parse_fields
which rejects structs with both bpf_{list,rb}_node fields. This is a
temporary measure that can be removed after "collection identity"
followup. See comment added in btf_parse_fields. A linked_list BTF test
exercising the new check is added in this patch as well.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-6-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:13 +0000 (16:40 -0800)]
bpf: Add callback validation to kfunc verifier logic
Some BPF helpers take a callback function which the helper calls. For
each helper that takes such a callback, there's a special call to
__check_func_call with a callback-state-setting callback that sets up
verifier bpf_func_state for the callback's frame.
kfuncs don't have any of this infrastructure yet, so let's add it in
this patch, following existing helper pattern as much as possible. To
validate functionality of this added plumbing, this patch adds
callback handling for the bpf_rbtree_add kfunc and hopes to lay
groundwork for future graph datastructure callbacks.
In the "general plumbing" category we have:
* check_kfunc_call doing callback verification right before clearing
CALLER_SAVED_REGS, exactly like check_helper_call
* recognition of func_ptr BTF types in kfunc args as
KF_ARG_PTR_TO_CALLBACK + propagation of subprogno for this arg type
In the "rbtree_add / graph datastructure-specific plumbing" category:
* Since bpf_rbtree_add must be called while the spin_lock associated
with the tree is held, don't complain when callback's func_state
doesn't unlock it by frame exit
* Mark rbtree_add callback's args with ref_set_non_owning
to prevent rbtree api functions from being called in the callback.
Semantically this makes sense, as less() takes no ownership of its
args when determining which comes first.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-5-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:12 +0000 (16:40 -0800)]
bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args
Now that we find bpf_rb_root and bpf_rb_node in structs, let's give args
that contain those types special classification and properly handle
these types when checking kfunc args.
"Properly handling" these types largely requires generalizing similar
handling for bpf_list_{head,node}, with little new logic added in this
patch.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:11 +0000 (16:40 -0800)]
bpf: Add bpf_rbtree_{add,remove,first} kfuncs
This patch adds implementations of bpf_rbtree_{add,remove,first}
and teaches verifier about their BTF_IDs as well as those of
bpf_rb_{root,node}.
All three kfuncs have some nonstandard component to their verification
that needs to be addressed in future patches before programs can
properly use them:
* bpf_rbtree_add: Takes 'less' callback, need to verify it
* bpf_rbtree_first: Returns ptr_to_node_type(off=rb_node_off) instead
of ptr_to_rb_node(off=0). Return value ref is
non-owning.
* bpf_rbtree_remove: Returns ptr_to_node_type(off=rb_node_off) instead
of ptr_to_rb_node(off=0). 2nd arg (node) is a
non-owning reference.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:10 +0000 (16:40 -0800)]
bpf: Add basic bpf_rb_{root,node} support
This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to
BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new
types, and adds bpf_rb_root_free function for freeing bpf_rb_root in
map_values.
structs bpf_rb_root and bpf_rb_node are opaque types meant to
obscure structs rb_root_cached rb_node, respectively.
btf_struct_access will prevent BPF programs from touching these special
fields automatically now that they're recognized.
btf_check_and_fixup_fields now groups list_head and rb_root together as
"graph root" fields and {list,rb}_node as "graph node", and does same
ownership cycle checking as before. Note that this function does _not_
prevent ownership type mixups (e.g. rb_root owning list_node) - that's
handled by btf_parse_graph_root.
After this patch, a bpf program can have a struct bpf_rb_root in a
map_value, but not add anything to nor do anything useful with it.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Sun, 12 Feb 2023 09:27:07 +0000 (01:27 -0800)]
bpf: Migrate release_on_unlock logic to non-owning ref semantics
This patch introduces non-owning reference semantics to the verifier,
specifically linked_list API kfunc handling. release_on_unlock logic for
refs is refactored - with small functional changes - to implement these
semantics, and bpf_list_push_{front,back} are migrated to use them.
When a list node is pushed to a list, the program still has a pointer to
the node:
n = bpf_obj_new(typeof(*n));
bpf_spin_lock(&l);
bpf_list_push_back(&l, n);
/* n still points to the just-added node */
bpf_spin_unlock(&l);
What the verifier considers n to be after the push, and thus what can be
done with n, are changed by this patch.
Common properties both before/after this patch:
* After push, n is only a valid reference to the node until end of
critical section
* After push, n cannot be pushed to any list
* After push, the program can read the node's fields using n
Before:
* After push, n retains the ref_obj_id which it received on
bpf_obj_new, but the associated bpf_reference_state's
release_on_unlock field is set to true
* release_on_unlock field and associated logic is used to implement
"n is only a valid ref until end of critical section"
* After push, n cannot be written to, the node must be removed from
the list before writing to its fields
* After push, n is marked PTR_UNTRUSTED
After:
* After push, n's ref is released and ref_obj_id set to 0. NON_OWN_REF
type flag is added to reg's type, indicating that it's a non-owning
reference.
* NON_OWN_REF flag and logic is used to implement "n is only a
valid ref until end of critical section"
* n can be written to (except for special fields e.g. bpf_list_node,
timer, ...)
Summary of specific implementation changes to achieve the above:
* release_on_unlock field, ref_set_release_on_unlock helper, and logic
to "release on unlock" based on that field are removed
* The anonymous active_lock struct used by bpf_verifier_state is
pulled out into a named struct bpf_active_lock.
* NON_OWN_REF type flag is introduced along with verifier logic
changes to handle non-owning refs
* Helpers are added to use NON_OWN_REF flag to implement non-owning
ref semantics as described above
* invalidate_non_owning_refs - helper to clobber all non-owning refs
matching a particular bpf_active_lock identity. Replaces
release_on_unlock logic in process_spin_lock.
* ref_set_non_owning - set NON_OWN_REF type flag after doing some
sanity checking
* ref_convert_owning_non_owning - convert owning reference w/
specified ref_obj_id to non-owning references. Set NON_OWN_REF
flag for each reg with that ref_obj_id and 0-out its ref_obj_id
* Update linked_list selftests to account for minor semantic
differences introduced by this patch
* Writes to a release_on_unlock node ref are not allowed, while
writes to non-owning reference pointees are. As a result the
linked_list "write after push" failure tests are no longer scenarios
that should fail.
* The test##missing_lock##op and test##incorrect_lock##op
macro-generated failure tests need to have a valid node argument in
order to have the same error output as before. Otherwise
verification will fail early and the expected error output won't be seen.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230212092715.1422619-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Mon, 13 Feb 2023 18:13:13 +0000 (19:13 +0100)]
Merge branch 'xdp-ice-mbuf'
Alexander Lobakin says:
====================
The set grew from the poor performance of %BPF_F_TEST_XDP_LIVE_FRAMES
when the ice-backed device is a sender. Initially there were around
3.3 Mpps / thread, while I have 5.5 on skb-based pktgen ...
After fixing 0005 (0004 is a prereq for it) first (strange thing nobody
noticed that earlier), I started catching random OOMs. This is how 0002
(and partially 0001) appeared.
0003 is a suggestion from Maciej to not waste time on refactoring dead
lines. 0006 is a "cherry on top" to get away with the final 6.7 Mpps.
4.5 of 6 are fixes, but only the first three are tagged, since it then
starts being tricky. I may backport them manually later on.
TL;DR for the series is that shortcuts are good, but only as long as
they don't make the driver miss important things. %XDP_TX is purely
driver-local, however .ndo_xdp_xmit() is not, and sometimes assumptions
can be unsafe there.
With that series and also one core code patch[0], "live frames" and
xdp-trafficgen are now safe'n'fast on ice (probably more to come).
[0] https://lore.kernel.org/all/
20230209172827.874728-1-alexandr.lobakin@intel.com
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexander Lobakin [Fri, 10 Feb 2023 17:06:18 +0000 (18:06 +0100)]
ice: Micro-optimize .ndo_xdp_xmit() path
After the recent mbuf changes, ice_xmit_xdp_ring() became a 3-liner.
It makes no sense to keep it global in a different file than its caller.
Move it just next to the sole call site and mark static. Also, it
doesn't need a full xdp_convert_frame_to_buff(). Save several cycles
and fill only the fields used by __ice_xmit_xdp_ring() later on.
Finally, since it doesn't modify @xdpf anyhow, mark the argument const
to save some more (whole -11 bytes of .text! :D).
Thanks to 1 jump less and less calcs as well, this yields as many as
6.7 Mpps per queue. `xdp.data_hard_start = xdpf` is fully intentional
again (see xdp_convert_buff_to_frame()) and just works when there are
no source device's driver issues.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-7-alexandr.lobakin@intel.com
Alexander Lobakin [Fri, 10 Feb 2023 17:06:17 +0000 (18:06 +0100)]
ice: Fix freeing XDP frames backed by Page Pool
As already mentioned, freeing any &xdp_frame via page_frag_free() is
wrong, as it assumes the frame is backed by either an order-0 page or
a page with no "patrons" behind them, while in fact frames backed by
Page Pool can be redirected to a device, which's driver doesn't use it.
Keep storing a pointer to the raw buffer and then freeing it
unconditionally via page_frag_free() for %XDP_TX frames, but introduce
a separate type in the enum for frames coming through .ndo_xdp_xmit(),
and free them via xdp_return_frame_bulk(). Note that saving xdpf as
xdp_buff->data_hard_start is intentional and is always true when
everything is configured properly.
After this change, %XDP_REDIRECT from a Page Pool based driver to ice
becomes zero-alloc as it should be and horrendous 3.3 Mpps / queue
turn into 6.6, hehe.
Let it go with no "Fixes:" tag as it spans across good 5+ commits and
can't be trivially backported.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-6-alexandr.lobakin@intel.com
Alexander Lobakin [Fri, 10 Feb 2023 17:06:16 +0000 (18:06 +0100)]
ice: Robustify cleaning/completing XDP Tx buffers
When queueing frames from a Page Pool for redirecting to a device backed
by the ice driver, `perf top` shows heavy load on page_alloc() and
page_frag_free(), despite that on a properly working system it must be
fully or at least almost zero-alloc. The problem is in fact a bit deeper
and raises from how ice cleans up completed Tx buffers.
The story so far: when cleaning/freeing the resources related to
a particular completed Tx frame (skbs, DMA mappings etc.), ice uses some
heuristics only without setting any type explicitly (except for dummy
Flow Director packets, which are marked via ice_tx_buf::tx_flags).
This kinda works, but only up to some point. For example, currently ice
assumes that each frame coming to __ice_xmit_xdp_ring(), is backed by
either plain order-0 page or plain page frag, while it may also be
backed by Page Pool or any other possible memory models introduced in
future. This means any &xdp_frame must be freed properly via
xdp_return_frame() family with no assumptions.
In order to do that, the whole heuristics must be replaced with setting
the Tx buffer/frame type explicitly, just how it's always been done via
an enum. Let us reuse 16 bits from ::tx_flags -- 1 bit-and instr won't
hurt much -- especially given that sometimes there was a check for
%ICE_TX_FLAGS_DUMMY_PKT, which is now turned from a flag to an enum
member. The rest of the changes is straightforward and most of it is
just a conversion to rely now on the type set in &ice_tx_buf rather than
to some secondary properties.
For now, no functional changes intended, the change only prepares the
ground for starting freeing XDP frames properly next step. And it must
be done atomically/synchronously to not break stuff.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-5-alexandr.lobakin@intel.com
Alexander Lobakin [Fri, 10 Feb 2023 17:06:15 +0000 (18:06 +0100)]
ice: Remove two impossible branches on XDP Tx cleaning
The tagged commit started sending %XDP_TX frames from XSk Rx ring
directly without converting it to an &xdp_frame. However, when XSk is
enabled on a queue pair, it has its separate Tx cleaning functions, so
neither ice_clean_xdp_irq() nor ice_unmap_and_free_tx_buf() ever happens
there.
Remove impossible branches in order to reduce the diffstat of the
upcoming change.
Fixes: a24b4c6e9aab ("ice: xsk: Do not convert to buff to frame for XDP_TX")
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-4-alexandr.lobakin@intel.com
Alexander Lobakin [Fri, 10 Feb 2023 17:06:14 +0000 (18:06 +0100)]
ice: Fix XDP Tx ring overrun
Sometimes, under heavy XDP Tx traffic, e.g. when using XDP traffic
generator (%BPF_F_TEST_XDP_LIVE_FRAMES), the machine can catch OOM due
to the driver not freeing all of the pages passed to it by
.ndo_xdp_xmit().
Turned out that during the development of the tagged commit, the check,
which ensures that we have a free descriptor to queue a frame, moved
into the branch happening only when a buffer has frags. Otherwise, we
only run a cleaning cycle, but don't check anything.
ATST, there can be situations when the driver gets new frames to send,
but there are no buffers that can be cleaned/completed and the ring has
no free slots. It's very rare, but still possible (> 6.5 Mpps per ring).
The driver then fills the next buffer/descriptor, effectively
overwriting the data, which still needs to be freed.
Restore the check after the cleaning routine to make sure there is a
slot to queue a new frame. When there are frags, there still will be a
separate check that we can place all of them, but if the ring is full,
there's no point in wasting any more time.
(minor: make `!ready_frames` unlikely since it happens ~1-2 times per
billion of frames)
Fixes: 3246a10752a7 ("ice: Add support for XDP multi-buffer on Tx side")
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-3-alexandr.lobakin@intel.com
Alexander Lobakin [Fri, 10 Feb 2023 17:06:13 +0000 (18:06 +0100)]
ice: fix ice_tx_ring:: Xdp_tx_active underflow
xdp_tx_active is used to indicate whether an XDP ring has any %XDP_TX
frames queued to shortcut processing Tx cleaning for XSk-enabled queues.
When !XSk, it simply indicates whether the ring has any queued frames in
general.
It gets increased on each frame placed onto the ring and counts the
whole frame, not each frag. However, currently it gets decremented in
ice_clean_xdp_tx_buf(), which is called per each buffer, i.e. per each
frag. Thus, on completing multi-frag frames, an underflow happens.
Move the decrement to the outer function and do it once per frame, not
buf. Also, do that on the stack and update the ring counter after the
loop is done to save several cycles.
XSk rings are fine since there are no frags at the moment.
Fixes: 3246a10752a7 ("ice: Add support for XDP multi-buffer on Tx side")
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-2-alexandr.lobakin@intel.com
Ilya Leoshkevich [Wed, 8 Feb 2023 23:12:11 +0000 (00:12 +0100)]
selftests/bpf: Fix out-of-srctree build
Building BPF selftests out of srctree fails with:
make: *** No rule to make target '/linux-build//ima_setup.sh', needed by 'ima_setup.sh'. Stop.
The culprit is the rule that defines convenient shorthands like
"make test_progs", which builds $(OUTPUT)/test_progs. These shorthands
make sense only for binaries that are built though; scripts that live
in the source tree do not end up in $(OUTPUT).
Therefore drop $(TEST_PROGS) and $(TEST_PROGS_EXTENDED) from the rule.
The issue exists for a while, but it became a problem only after commit
d68ae4982cb7 ("selftests/bpf: Install all required files to run selftests"),
which added dependencies on these scripts.
Fixes: 03dcb78460c2 ("selftests/bpf: Add simple per-test targets to Makefile")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230208231211.283606-1-iii@linux.ibm.com
Alan Maguire [Thu, 9 Feb 2023 13:28:51 +0000 (13:28 +0000)]
bpf: Add --skip_encoding_btf_inconsistent_proto, --btf_gen_optimized to pahole flags for v1.25
v1.25 of pahole supports filtering out functions with multiple inconsistent
function prototypes or optimized-out parameters from the BTF representation.
These present problems because there is no additional info in BTF saying which
inconsistent prototype matches which function instance to help guide attachment,
and functions with optimized-out parameters can lead to incorrect assumptions
about register contents.
So for now, filter out such functions while adding BTF representations for
functions that have "."-suffixes (foo.isra.0) but not optimized-out parameters.
This patch assumes that below linked changes land in pahole for v1.25.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/1675790102-23037-1-git-send-email-alan.maguire@oracle.com
Link: https://lore.kernel.org/bpf/1675949331-27935-1-git-send-email-alan.maguire@oracle.com
Alexei Starovoitov [Sat, 11 Feb 2023 02:59:57 +0000 (18:59 -0800)]
Merge branch 'bpf, mm: introduce cgroup.memory=nobpf'
Yafang Shao says:
====================
The bpf memory accouting has some known problems in contianer
environment,
- The container memory usage is not consistent if there's pinned bpf
program
After the container restart, the leftover bpf programs won't account
to the new generation, so the memory usage of the container is not
consistent. This issue can be resolved by introducing selectable
memcg, but we don't have an agreement on the solution yet. See also
the discussions at https://lwn.net/Articles/905150/ .
- The leftover non-preallocated bpf map can't be limited
The leftover bpf map will be reparented, and thus it will be limited by
the parent, rather than the container itself. Furthermore, if the
parent is destroyed, it be will limited by its parent's parent, and so
on. It can also be resolved by introducing selectable memcg.
- The memory dynamically allocated in bpf prog is charged into root memcg
only
Nowdays the bpf prog can dynamically allocate memory, for example via
bpf_obj_new(), but it only allocate from the global bpf_mem_alloc
pool, so it will charge into root memcg only. That needs to be
addressed by a new proposal.
So let's give the container user an option to disable bpf memory accouting.
The idea of "cgroup.memory=nobpf" is originally by Tejun[1].
[1]. https://lwn.net/ml/linux-mm/YxjOawzlgE458ezL@slm.duckdns.org/
Changes,
v1->v2:
- squash patches (Roman)
- commit log improvement in patch #2. (Johannes)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yafang Shao [Fri, 10 Feb 2023 15:47:34 +0000 (15:47 +0000)]
bpf: allow to disable bpf prog memory accounting
We can simply disable the bpf prog memory accouting by not setting the
GFP_ACCOUNT.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20230210154734.4416-5-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yafang Shao [Fri, 10 Feb 2023 15:47:33 +0000 (15:47 +0000)]
bpf: allow to disable bpf map memory accounting
We can simply set root memcg as the map's memcg to disable bpf memory
accounting. bpf_map_area_alloc is a little special as it gets the memcg
from current rather than from the map, so we need to disable GFP_ACCOUNT
specifically for it.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20230210154734.4416-4-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yafang Shao [Fri, 10 Feb 2023 15:47:32 +0000 (15:47 +0000)]
bpf: use bpf_map_kvcalloc in bpf_local_storage
Introduce new helper bpf_map_kvcalloc() for the memory allocation in
bpf_local_storage(). Then the allocation will charge the memory from the
map instead of from current, though currently they are the same thing as
it is only used in map creation path now. By charging map's memory into
the memcg from the map, it will be more clear.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20230210154734.4416-3-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yafang Shao [Fri, 10 Feb 2023 15:47:31 +0000 (15:47 +0000)]
mm: memcontrol: add new kernel parameter cgroup.memory=nobpf
Add new kernel parameter cgroup.memory=nobpf to allow user disable bpf
memory accounting. This is a preparation for the followup patch.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20230210154734.4416-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Fri, 10 Feb 2023 22:40:31 +0000 (23:40 +0100)]
docs, bpf: Ensure IETF's BPF mailing list gets copied for ISA doc changes
Given BPF is increasingly being used beyond just the Linux kernel, with
implementations in NICs and other hardware, Windows, etc, there is an
ongoing effort to document and standardize parts of the existing BPF
infrastructure such as its ISA. As "source of truth" we decided some
time ago to rely on the in-tree documentation, in particular, starting
out with the Documentation/bpf/instruction-set.rst as a base for later
RFC drafts on the ISA. Therefore, we want to ensure that changes to that
document have bpf@ietf.org in Cc, so add a MAINTAINERS file entry with
a section on documents related to standardization efforts. For now, this
only relates to instruction-set.rst, and later additional files will be
added.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dave Thaler <dthaler@microsoft.com>
Cc: bpf@ietf.org
Link: https://datatracker.ietf.org/doc/bofreq-thaler-bpf-ebpf/
Link: https://lore.kernel.org/r/57619c0dd8e354d82bf38745f99405e3babdc970.1676068387.git.daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jakub Kicinski [Sat, 11 Feb 2023 01:51:27 +0000 (17:51 -0800)]
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-02-11
We've added 96 non-merge commits during the last 14 day(s) which contain
a total of 152 files changed, 4884 insertions(+), 962 deletions(-).
There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c
between commit
5b246e533d01 ("ice: split probe into smaller functions")
from the net-next tree and commit
66c0e13ad236 ("drivers: net: turn on
XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev()
is otherwise there a 2nd time, and add XDP features to the existing
ice_cfg_netdev() one:
[...]
ice_set_netdev_features(netdev);
netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
NETDEV_XDP_ACT_XSK_ZEROCOPY;
ice_set_ops(netdev);
[...]
Stephen's merge conflict mail:
https://lore.kernel.org/bpf/
20230207101951.
21a114fa@canb.auug.org.au/
The main changes are:
1) Add support for BPF trampoline on s390x which finally allows to remove many
test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich.
2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski.
3) Add capability to export the XDP features supported by the NIC.
Along with that, add a XDP compliance test tool,
from Lorenzo Bianconi & Marek Majtyka.
4) Add __bpf_kfunc tag for marking kernel functions as kfuncs,
from David Vernet.
5) Add a deep dive documentation about the verifier's register
liveness tracking algorithm, from Eduard Zingerman.
6) Fix and follow-up cleanups for resolve_btfids to be compiled
as a host program to avoid cross compile issues,
from Jiri Olsa & Ian Rogers.
7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted
when testing on different NICs, from Jesper Dangaard Brouer.
8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang.
9) Extend libbpf to add an option for when the perf buffer should
wake up, from Jon Doron.
10) Follow-up fix on xdp_metadata selftest to just consume on TX
completion, from Stanislav Fomichev.
11) Extend the kfuncs.rst document with description on kfunc
lifecycle & stability expectations, from David Vernet.
12) Fix bpftool prog profile to skip attaching to offline CPUs,
from Tonghao Zhang.
====================
Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Randy Dunlap [Thu, 9 Feb 2023 07:13:44 +0000 (23:13 -0800)]
Documentation: isdn: correct spelling
Correct spelling problems for Documentation/isdn/ as reported
by codespell.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20230209071400.31476-9-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 11 Feb 2023 00:23:05 +0000 (16:23 -0800)]
Merge branch 'net-move-more-duplicate-code-of-ovs-and-tc-conntrack-into-nf_conntrack_ovs'
Xin Long says:
====================
net: move more duplicate code of ovs and tc conntrack into nf_conntrack_ovs
We've moved some duplicate code into nf_nat_ovs in:
"net: eliminate the duplicate code in the ct nat functions of ovs and tc"
This patchset addresses more code duplication in the conntrack of ovs
and tc then creates nf_conntrack_ovs for them, and four functions will
be extracted and moved into it:
nf_ct_handle_fragments()
nf_ct_skb_network_trim()
nf_ct_helper()
nf_ct_add_helper()
====================
Link: https://lore.kernel.org/r/cover.1675810210.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Tue, 7 Feb 2023 22:52:10 +0000 (17:52 -0500)]
net: extract nf_ct_handle_fragments to nf_conntrack_ovs
Now handle_fragments() in OVS and TC have the similar code, and
this patch removes the duplicate code by moving the function
to nf_conntrack_ovs.
Note that skb_clear_hash(skb) or skb->ignore_df = 1 should be
done only when defrag returns 0, as it does in other places
in kernel.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Tue, 7 Feb 2023 22:52:09 +0000 (17:52 -0500)]
net: sched: move frag check and tc_skb_cb update out of handle_fragments
This patch has no functional changes and just moves frag check and
tc_skb_cb update out of handle_fragments, to make it easier to move
the duplicate code from handle_fragments() into nf_conntrack_ovs later.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Tue, 7 Feb 2023 22:52:08 +0000 (17:52 -0500)]
openvswitch: move key and ovs_cb update out of handle_fragments
This patch has no functional changes and just moves key and ovs_cb update
out of handle_fragments, and skb_clear_hash() and skb->ignore_df change
into handle_fragments(), to make it easier to move the duplicate code
from handle_fragments() into nf_conntrack_ovs later.
Note that it changes to pass info->family to handle_fragments() instead
of key for the packet type check, as info->family is set according to
key->eth.type in ovs_ct_copy_action() when creating the action.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Tue, 7 Feb 2023 22:52:07 +0000 (17:52 -0500)]
net: extract nf_ct_skb_network_trim function to nf_conntrack_ovs
There are almost the same code in ovs_skb_network_trim() and
tcf_ct_skb_network_trim(), this patch extracts them into a function
nf_ct_skb_network_trim() and moves the function to nf_conntrack_ovs.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Tue, 7 Feb 2023 22:52:06 +0000 (17:52 -0500)]
net: create nf_conntrack_ovs for ovs and tc use
Similar to nf_nat_ovs created by Commit
ebddb1404900 ("net: move the
nat function to nf_nat_ovs for ovs and tc"), this patch is to create
nf_conntrack_ovs to get these functions shared by OVS and TC only.
There are nf_ct_helper() and nf_ct_add_helper() from nf_conntrak_helper
in this patch, and will be more in the following patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ilya Leoshkevich [Fri, 10 Feb 2023 00:12:01 +0000 (01:12 +0100)]
libbpf: Fix alen calculation in libbpf_nla_dump_errormsg()
The code assumes that everything that comes after nlmsgerr are nlattrs.
When calculating their size, it does not account for the initial
nlmsghdr. This may lead to accessing uninitialized memory.
Fixes: bbf48c18ee0c ("libbpf: add error reporting in XDP")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-8-iii@linux.ibm.com
Ilya Leoshkevich [Fri, 10 Feb 2023 00:12:00 +0000 (01:12 +0100)]
selftests/bpf: Attach to fopen()/fclose() in attach_probe
malloc() and free() may be completely replaced by sanitizers, use
fopen() and fclose() instead.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-7-iii@linux.ibm.com
Ilya Leoshkevich [Fri, 10 Feb 2023 00:11:59 +0000 (01:11 +0100)]
selftests/bpf: Attach to fopen()/fclose() in uprobe_autoattach
malloc() and free() may be completely replaced by sanitizers, use
fopen() and fclose() instead.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-6-iii@linux.ibm.com
Ilya Leoshkevich [Fri, 10 Feb 2023 00:11:58 +0000 (01:11 +0100)]
selftests/bpf: Forward SAN_CFLAGS and SAN_LDFLAGS to runqslower and libbpf
To get useful results from the Memory Sanitizer, all code running in a
process needs to be instrumented. When building tests with other
sanitizers, it's not strictly necessary, but is also helpful.
So make sure runqslower and libbpf are compiled with SAN_CFLAGS and
linked with SAN_LDFLAGS.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-5-iii@linux.ibm.com
Ilya Leoshkevich [Fri, 10 Feb 2023 00:11:57 +0000 (01:11 +0100)]
selftests/bpf: Split SAN_CFLAGS and SAN_LDFLAGS
Memory Sanitizer requires passing different options to CFLAGS and
LDFLAGS: besides the mandatory -fsanitize=memory, one needs to pass
header and library paths, and passing -L to a compilation step
triggers -Wunused-command-line-argument. So introduce a separate
variable for linker flags. Use $(SAN_CFLAGS) as a default in order to
avoid complicating the ASan usage.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-4-iii@linux.ibm.com
Ilya Leoshkevich [Fri, 10 Feb 2023 00:11:56 +0000 (01:11 +0100)]
tools: runqslower: Add EXTRA_CFLAGS and EXTRA_LDFLAGS support
This makes it possible to add sanitizer flags.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-3-iii@linux.ibm.com
Ilya Leoshkevich [Fri, 10 Feb 2023 00:11:55 +0000 (01:11 +0100)]
selftests/bpf: Quote host tools
Using HOSTCC="ccache clang" breaks building the tests, since, when it's
forwarded to e.g. bpftool, the child make sees HOSTCC=ccache and
"clang" is considered a target. Fix by quoting it, and also HOSTLD and
HOSTAR for consistency.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-2-iii@linux.ibm.com
Jakub Kicinski [Thu, 9 Feb 2023 06:06:42 +0000 (22:06 -0800)]
net: skbuff: drop the word head from skb cache
skbuff_head_cache is misnamed (perhaps for historical reasons?)
because it does not hold heads. Head is the buffer which skb->data
points to, and also where shinfo lives. struct sk_buff is a metadata
structure, not the head.
Eric recently added skb_small_head_cache (which allocates actual
head buffers), let that serve as an excuse to finally clean this up :)
Leave the user-space visible name intact, it could possibly be uAPI.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 10 Feb 2023 08:06:33 +0000 (08:06 +0000)]
Merge branch 'net-ipa-GSI'
Alex Elder says:
====================
net: ipa: prepare for GSI register updtaes
An upcoming series (or two) will convert the definitions of GSI
registers used by IPA so they use the "IPA reg" mechanism to specify
register offsets and their fields. This will simplify implementing
the fairly large number of changes required in GSI registers to
support more than 32 GSI channels (introduced in IPA v5.0).
A few minor problems and inconsistencies were found, and they're
fixed here. The last three patches in this series change the
"ipa_reg" code to separate the IPA-specific part (the base virtual
address, basically) from the generic register part, and the now-
generic code is renamed to use just "reg_" or "REG_" as a prefix
rather than "ipa_reg" or "IPA_REG_".
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:53 +0000 (14:56 -0600)]
net: ipa: generalize register field functions
Rename functions related to register fields so they don't appear to
be IPA-specific, and move their definitions into "reg.h":
ipa_reg_fmask() -> reg_fmask()
ipa_reg_bit() -> reg_bit()
ipa_reg_field_max() -> reg_field_max()
ipa_reg_encode() -> reg_encode()
ipa_reg_decode() -> reg_decode()
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:52 +0000 (14:56 -0600)]
net: ipa: generalize register offset functions
Rename ipa_reg_offset() to be reg_offset() and move its definition
to "reg.h". Rename ipa_reg_n_offset() to be reg_n_offset() also.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:51 +0000 (14:56 -0600)]
net: ipa: start generalizing "ipa_reg"
IPA register definitions have evolved with each new version. The
changes required to support more than 32 endpoints in IPA v5.0 made
it best to define a unified mechanism for defining registers and
their fields.
GSI register definitions, meanwhile, have remained fairly stable.
And even as the total number of IPA endpoints goes beyond 32, the
number of GSI channels on a given EE that underly endpoints still
remains 32 or less.
Despite that, GSI v3.0 (which is used with IPA v5.0) extends the
number of channels (and events) it supports to be about 256, and as
a result, many GSI register definitions must change significantly.
To address this, we'll use the same "ipa_reg" mechanism to define
the GSI registers.
As a first step in generalizing the "ipa_reg" to also support GSI
registers, isolate the definitions of the "ipa_reg" and "ipa_regs"
structure types (and some supporting macros) into a new header file,
and remove the "ipa_" and "IPA_" from symbol names.
Separate the IPA register ID validity checking from the generic
check that a register ID is in range. Aside from that, this is
intended to have no functional effect on the code.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:50 +0000 (14:56 -0600)]
net: ipa: GSI register cleanup
Move some static inline function definitions out of "gsi_reg.h" and
into "gsi.c", which is the only place they're used. Rename them so
their names identify the register they're associated with.
Move the gsi_channel_type enumerated type definition below the
offset and field definitions for the CH_C_CNTXT_0 register where
it's used.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:49 +0000 (14:56 -0600)]
net: ipa: use bitmasks for GSI IRQ values
There are seven GSI interrupt types that can be signaled by a single
GSI IRQ. These are represented in a bitmask, and the gsi_irq_type_id
enumerated type defines what each bit position represents.
Similarly, the global and general GSI interrupt types each has a set
of conditions it signals, and both types have an enumerated type
that defines which bit that represents each condition.
When used, these enumerated values are passed as an argument to BIT()
in *all* cases. So clean up the code a little bit by defining the
enumerated type values as one-bit masks rather than bit positions.
Rename gsi_general_id to be gsi_general_irq_id for consistency.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:48 +0000 (14:56 -0600)]
net: ipa: tighten up IPA register validity checking
When checking the validity of an IPA register ID, compare it against
all possible ipa_reg_id values.
Rename the function ipa_reg_id_valid() to be specific about what's
being checked.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:47 +0000 (14:56 -0600)]
net: ipa: add some new IPA versions
Soon IPA v5.0+ will be supported, and when that happens we will be
able to enable support for the SDX65 (IPA v5.0), SM8450 (IPA v5.1),
and SM8550 (IPA v5.5).
Fix the comment about the GSI version used for IPA v3.1.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:46 +0000 (14:56 -0600)]
net: ipa: get rid of ipa->reg_addr
The reg_addr field in the IPA structure is set but never used.
Get rid of it.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:45 +0000 (14:56 -0600)]
net: ipa: generic command param fix
Starting at IPA v4.11, the GSI_GENERIC_COMMAND GSI register got a
new PARAMS field. The code that encodes a value into that field
sets it unconditionally, which is wrong.
We currently only provide 0 as the field's value, so this error has
no real effect. Still, it's a bug, so let's fix it.
Fix an (unrelated) incorrect comment as well. Fields in the
ERROR_LOG GSI register actually *are* defined for IPA versions
prior to v3.5.1.
Fixes: fe68c43ce388 ("net: ipa: support enhanced channel flow control")
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Horatiu Vultur [Wed, 8 Feb 2023 13:08:39 +0000 (14:08 +0100)]
net: microchip: vcap: Add tc flower keys for lan966x
Add the following TC flower filter keys to lan966x for IS2:
- ipv4_addr (sip and dip)
- ipv6_addr (sip and dip)
- control (IPv4 fragments)
- portnum (tcp and udp port numbers)
- basic (L3 and L4 protocol)
- vlan (outer vlan tag info)
- tcp (tcp flags)
- ip (tos field)
As the parsing of these keys is similar between lan966x and sparx5, move
the code in a separate file to be shared by these 2 chips. And put the
specific parsing outside of the common functions.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 10 Feb 2023 08:00:05 +0000 (08:00 +0000)]
Merge tag 'rxrpc-next-
20230208' of git://git./linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc development
Here are some miscellaneous changes for rxrpc:
(1) Use consume_skb() rather than kfree_skb_reason().
(2) Fix unnecessary waking when poking and already-poked call.
(3) Add ack.rwind to the rxrpc_tx_ack tracepoint as this indicates how
many incoming DATA packets we're telling the peer that we are
currently willing to accept on this call.
(4) Reduce duplicate ACK transmission. We send ACKs to let the peer know
that we're increasing the receive window (ack.rwind) as we consume
packets locally. Normal ACK transmission is triggered in three places
and that leads to duplicates being sent.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Clément Léger [Wed, 8 Feb 2023 16:12:49 +0000 (17:12 +0100)]
net: pcs: rzn1-miic: remove unused struct members and use miic variable
Remove unused bulk clocks struct from the miic state and use an already
existing miic variable in miic_config().
Signed-off-by: Clément Léger <clement.leger@bootlin.com>
Link: https://lore.kernel.org/r/20230208161249.329631-1-clement.leger@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andy Shevchenko [Wed, 8 Feb 2023 13:31:53 +0000 (15:31 +0200)]
openvswitch: Use string_is_terminated() helper
Use string_is_terminated() helper instead of cpecific memchr() call.
This shows better the intention of the call.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230208133153.22528-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andy Shevchenko [Wed, 8 Feb 2023 13:31:52 +0000 (15:31 +0200)]
genetlink: Use string_is_terminated() helper
Use string_is_terminated() helper instead of cpecific memchr() call.
This shows better the intention of the call.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230208133153.22528-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andy Shevchenko [Wed, 8 Feb 2023 13:31:51 +0000 (15:31 +0200)]
string_helpers: Move string_is_valid() to the header
Move string_is_valid() to the header for wider use.
While at it, rename to string_is_terminated() to be
precise about its semantics.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230208133153.22528-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Horatiu Vultur [Wed, 8 Feb 2023 11:44:06 +0000 (12:44 +0100)]
net: micrel: Cable Diagnostics feature for lan8841 PHY
Add support for cable diagnostics in lan8841 PHY. It has the same
registers layout as lan8814 PHY, therefore reuse the functionality.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230208114406.1666671-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Huanhuan Wang [Wed, 8 Feb 2023 09:10:00 +0000 (10:10 +0100)]
nfp: support IPsec offloading for NFP3800
Add IPsec offloading support for NFP3800. Include data
plane and control plane.
Data plane: add IPsec packet process flow in NFP3800
datapath (NFDk).
Control plane: add an algorithm support distinction flow
in xfrm hook function xdo_dev_state_add(), as NFP3800 has
a different set of IPsec algorithm support.
This matches existing support for the NFP6000/NFP4000 and
their NFD3 datapath.
In addition, fixup the md_bytes calculation for NFD3 datapath
to make sure the two datapahts are keept in sync.
Signed-off-by: Huanhuan Wang <huanhuan.wang@corigine.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230208091000.4139974-1-simon.horman@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 10 Feb 2023 01:45:57 +0000 (17:45 -0800)]
Merge branch 'net-introduce-rps_default_mask'
Paolo Abeni says:
====================
net: introduce rps_default_mask
Real-time setups try hard to ensure proper isolation between time
critical applications and e.g. network processing performed by the
network stack in softirq and RPS is used to move the softirq
activity away from the isolated core.
If the network configuration is dynamic, with netns and devices
routinely created at run-time, enforcing the correct RPS setting
on each newly created device allowing to transient bad configuration
became complex.
Additionally, when multi-queue devices are involved, configuring rps
in user-space on each queue easily becomes very expensive, e.g.
some setups use veths with per cpu queues.
These series try to address the above, introducing a new
sysctl knob: rps_default_mask. The new sysctl entry allows
configuring a netns-wide RPS mask, to be enforced since receive
queue creation time without any fourther per device configuration
required.
Additionally, a simple self-test is introduced to check the
rps_default_mask behavior.
====================
Link: https://lore.kernel.org/r/cover.1675789134.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 7 Feb 2023 18:44:58 +0000 (19:44 +0100)]
self-tests: introduce self-tests for RPS default mask
Ensure that RPS default mask changes take place on
all newly created netns/devices and don't affect
existing ones.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 7 Feb 2023 18:44:57 +0000 (19:44 +0100)]
net: introduce default_rps_mask netns attribute
If RPS is enabled, this allows configuring a default rps
mask, which is effective since receive queue creation time.
A default RPS mask allows the system admin to ensure proper
isolation, avoiding races at network namespace or device
creation time.
The default RPS mask is initially empty, and can be
modified via a newly added sysctl entry.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 7 Feb 2023 18:44:56 +0000 (19:44 +0100)]
net-sysctl: factor-out rpm mask manipulation helpers
Will simplify the following patch. No functional change
intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 7 Feb 2023 18:44:55 +0000 (19:44 +0100)]
net-sysctl: factor out cpumask parsing helper
Will be used by the following patch to avoid code
duplication. No functional changes intended.
The only difference is that now flow_limit_cpu_sysctl() will
always compute the flow limit mask on each read operation,
even when read() will not return any byte to user-space.
Note that the new helper is placed under a new #ifdef at
the file start to better fit the usage in the later patch
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 9 Feb 2023 20:05:25 +0000 (12:05 -0800)]
Merge git://git./linux/kernel/git/netdev/net
net/devlink/leftover.c / net/core/devlink.c:
565b4824c39f ("devlink: change port event netdev notifier from per-net to global")
f05bd8ebeb69 ("devlink: move code to a dedicated directory")
687125b5799c ("devlink: split out core code")
https://lore.kernel.org/all/
20230208094657.
379f2b1a@canb.auug.org.au/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiri Olsa [Thu, 9 Feb 2023 14:37:35 +0000 (15:37 +0100)]
tools/resolve_btfids: Pass HOSTCFLAGS as EXTRA_CFLAGS to prepare targets
Thorsten reported build issue with command line that defined extra
HOSTCFLAGS that were not passed into 'prepare' targets, but were
used to build resolve_btfids objects.
This results in build fail when these objects are linked together:
/usr/bin/ld: /build.../tools/bpf/resolve_btfids//libbpf/libbpf.a(libbpf-in.o):
relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE \
object; recompile with -fPIE
Fixing this by passing HOSTCFLAGS in EXTRA_CFLAGS as part of
HOST_OVERRIDES variable for prepare targets.
[1] https://lore.kernel.org/bpf/
f7922132-6645-6316-5675-
0ece4197bfff@leemhuis.info/
Fixes: 56a2df7615fa ("tools/resolve_btfids: Compile resolve_btfids as host program")
Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/bpf/20230209143735.4112845-1-jolsa@kernel.org
Eric Dumazet [Wed, 8 Feb 2023 14:25:08 +0000 (14:25 +0000)]
net: enable usercopy for skb_small_head_cache
syzbot and other bots reported that we have to enable
user copy to/from skb->head. [1]
We can prevent access to skb_shared_info, which is a nice
improvement over standard kmem_cache.
Layout of these kmem_cache objects is:
< SKB_SMALL_HEAD_HEADROOM >< struct skb_shared_info >
usercopy: Kernel memory overwrite attempt detected to SLUB object 'skbuff_small_head' (offset 32, size 20)!
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:102 !
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 1 Comm: swapper/0 Not tainted
6.2.0-rc6-syzkaller-01425-gcb6b2e11a42d #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
RIP: 0010:usercopy_abort+0xbd/0xbf mm/usercopy.c:102
Code: e8 ee ad ba f7 49 89 d9 4d 89 e8 4c 89 e1 41 56 48 89 ee 48 c7 c7 20 2b 5b 8a ff 74 24 08 41 57 48 8b 54 24 20 e8 7a 17 fe ff <0f> 0b e8 c2 ad ba f7 e8 7d fb 08 f8 48 8b 0c 24 49 89 d8 44 89 ea
RSP: 0000:
ffffc90000067a48 EFLAGS:
00010286
RAX:
000000000000006b RBX:
ffffffff8b5b6ea0 RCX:
0000000000000000
RDX:
ffff8881401c0000 RSI:
ffffffff8166195c RDI:
fffff5200000cf3b
RBP:
ffffffff8a5b2a60 R08:
000000000000006b R09:
0000000000000000
R10:
0000000080000000 R11:
0000000000000000 R12:
ffffffff8bf2a925
R13:
ffffffff8a5b29a0 R14:
0000000000000014 R15:
ffffffff8a5b2960
FS:
0000000000000000(0000) GS:
ffff8880b9900000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000000000000 CR3:
000000000c48e000 CR4:
00000000003506e0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
<TASK>
__check_heap_object+0xdd/0x110 mm/slub.c:4761
check_heap_object mm/usercopy.c:196 [inline]
__check_object_size mm/usercopy.c:251 [inline]
__check_object_size+0x1da/0x5a0 mm/usercopy.c:213
check_object_size include/linux/thread_info.h:199 [inline]
check_copy_size include/linux/thread_info.h:235 [inline]
copy_from_iter include/linux/uio.h:186 [inline]
copy_from_iter_full include/linux/uio.h:194 [inline]
memcpy_from_msg include/linux/skbuff.h:3977 [inline]
qrtr_sendmsg+0x65f/0x970 net/qrtr/af_qrtr.c:965
sock_sendmsg_nosec net/socket.c:722 [inline]
sock_sendmsg+0xde/0x190 net/socket.c:745
say_hello+0xf6/0x170 net/qrtr/ns.c:325
qrtr_ns_init+0x220/0x2b0 net/qrtr/ns.c:804
qrtr_proto_init+0x59/0x95 net/qrtr/af_qrtr.c:1296
do_one_initcall+0x141/0x790 init/main.c:1306
do_initcall_level init/main.c:1379 [inline]
do_initcalls init/main.c:1395 [inline]
do_basic_setup init/main.c:1414 [inline]
kernel_init_freeable+0x6f9/0x782 init/main.c:1634
kernel_init+0x1e/0x1d0 init/main.c:1522
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
</TASK>
Fixes: bf9f1baa279f ("net: add dedicated kmem_cache for typical/small skb->head")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Link: https://lore.kernel.org/linux-next/CA+G9fYs-i-c2KTSA7Ai4ES_ZESY1ZnM=Zuo8P1jN00oed6KHMA@mail.gmail.com
Link: https://lore.kernel.org/r/20230208142508.3278406-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Thu, 9 Feb 2023 17:17:38 +0000 (09:17 -0800)]
Merge tag 'net-6.2-rc8' of git://git./linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from can and ipsec subtrees.
Current release - regressions:
- sched: fix off by one in htb_activate_prios()
- eth: mana: fix accessing freed irq affinity_hint
- eth: ice: fix out-of-bounds KASAN warning in virtchnl
Current release - new code bugs:
- eth: mtk_eth_soc: enable special tag when any MAC uses DSA
Previous releases - always broken:
- core: fix sk->sk_txrehash default
- neigh: make sure used and confirmed times are valid
- mptcp: be careful on subflow status propagation on errors
- xfrm: prevent potential spectre v1 gadget in xfrm_xlate32_attr()
- phylink: move phy_device_free() to correctly release phy device
- eth: mlx5:
- fix crash unsetting rx-vlan-filter in switchdev mode
- fix hang on firmware reset
- serialize module cleanup with reload and remove"
* tag 'net-6.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits)
selftests: forwarding: lib: quote the sysctl values
net: mscc: ocelot: fix all IPv6 getting trapped to CPU when PTP timestamping is used
rds: rds_rm_zerocopy_callback() use list_first_entry()
net: txgbe: Update support email address
selftests: Fix failing VXLAN VNI filtering test
selftests: mptcp: stop tests earlier
selftests: mptcp: allow more slack for slow test-case
mptcp: be careful on subflow status propagation on errors
mptcp: fix locking for in-kernel listener creation
mptcp: fix locking for setsockopt corner-case
mptcp: do not wait for bare sockets' timeout
net: ethernet: mtk_eth_soc: fix DSA TX tag hwaccel for switch port 0
nfp: ethtool: fix the bug of setting unsupported port speed
txhash: fix sk->sk_txrehash default
net: ethernet: mtk_eth_soc: fix wrong parameters order in __xdp_rxq_info_reg()
net: ethernet: mtk_eth_soc: enable special tag when any MAC uses DSA
net: sched: sch: Fix off by one in htb_activate_prios()
igc: Add ndo_tx_timeout support
net: mana: Fix accessing freed irq affinity_hint
hv_netvsc: Allocate memory in netvsc_dma_map() with GFP_ATOMIC
...
Linus Torvalds [Thu, 9 Feb 2023 17:09:13 +0000 (09:09 -0800)]
Merge tag 'for-linus-
2023020901' of git://git./linux/kernel/git/hid/hid
Pull HID fixes from Benjamin Tissoires:
- fix potential infinite loop with a badly crafted HID device (Xin
Zhao)
- fix regression from 6.1 in USB logitech devices potentially making
their mouse wheel not working (Bastien Nocera)
- clean up in AMD sensors, which fixes a long time resume bug (Mario
Limonciello)
- few device small fixes and quirks
* tag 'for-linus-
2023020901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: Ignore battery for ELAN touchscreen 29DF on HP
HID: amd_sfh: if no sensors are enabled, clean up
HID: logitech: Disable hi-res scrolling on USB
HID: core: Fix deadloop in hid_apply_multiplier.
HID: Ignore battery for Elan touchscreen on Asus TP420IA
HID: elecom: add support for TrackBall 056E:011C
Linus Torvalds [Thu, 9 Feb 2023 17:00:26 +0000 (09:00 -0800)]
Merge tag '6.2-rc8-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull cifx fix from Steve French:
"Small fix for use after free"
* tag '6.2-rc8-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix use-after-free in rdata->read_into_pages()
Hangbin Liu [Wed, 8 Feb 2023 03:21:10 +0000 (11:21 +0800)]
selftests: forwarding: lib: quote the sysctl values
When set/restore sysctl value, we should quote the value as some keys
may have multi values, e.g. net.ipv4.ping_group_range
Fixes: f5ae57784ba8 ("selftests: forwarding: lib: Add sysctl_set(), sysctl_restore()")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/20230208032110.879205-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Vladimir Oltean [Tue, 7 Feb 2023 18:31:17 +0000 (20:31 +0200)]
net: mscc: ocelot: fix all IPv6 getting trapped to CPU when PTP timestamping is used
While running this selftest which usually passes:
~/selftests/drivers/net/dsa# ./local_termination.sh eno0 swp0
TEST: swp0: Unicast IPv4 to primary MAC address [ OK ]
TEST: swp0: Unicast IPv4 to macvlan MAC address [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, promisc [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, allmulti [ OK ]
TEST: swp0: Multicast IPv4 to joined group [ OK ]
TEST: swp0: Multicast IPv4 to unknown group [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, promisc [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, allmulti [ OK ]
TEST: swp0: Multicast IPv6 to joined group [ OK ]
TEST: swp0: Multicast IPv6 to unknown group [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, promisc [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, allmulti [ OK ]
if I start PTP timestamping then run it again (debug prints added by me),
the unknown IPv6 MC traffic is seen by the CPU port even when it should
have been dropped:
~/selftests/drivers/net/dsa# ptp4l -i swp0 -2 -P -m
ptp4l[225.410]: selected /dev/ptp1 as PTP clock
[ 225.445746] mscc_felix 0000:00:00.5: ocelot_l2_ptp_trap_add: port 0 adding L2 PTP trap
[ 225.453815] mscc_felix 0000:00:00.5: ocelot_ipv4_ptp_trap_add: port 0 adding IPv4 PTP event trap
[ 225.462703] mscc_felix 0000:00:00.5: ocelot_ipv4_ptp_trap_add: port 0 adding IPv4 PTP general trap
[ 225.471768] mscc_felix 0000:00:00.5: ocelot_ipv6_ptp_trap_add: port 0 adding IPv6 PTP event trap
[ 225.480651] mscc_felix 0000:00:00.5: ocelot_ipv6_ptp_trap_add: port 0 adding IPv6 PTP general trap
ptp4l[225.488]: port 1: INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[225.488]: port 0: INITIALIZING to LISTENING on INIT_COMPLETE
^C
~/selftests/drivers/net/dsa# ./local_termination.sh eno0 swp0
TEST: swp0: Unicast IPv4 to primary MAC address [ OK ]
TEST: swp0: Unicast IPv4 to macvlan MAC address [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, promisc [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, allmulti [ OK ]
TEST: swp0: Multicast IPv4 to joined group [ OK ]
TEST: swp0: Multicast IPv4 to unknown group [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, promisc [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, allmulti [ OK ]
TEST: swp0: Multicast IPv6 to joined group [ OK ]
TEST: swp0: Multicast IPv6 to unknown group [FAIL]
reception succeeded, but should have failed
TEST: swp0: Multicast IPv6 to unknown group, promisc [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, allmulti [ OK ]
The PGID_MCIPV6 is configured correctly to not flood to the CPU,
I checked that.
Furthermore, when I disable back PTP RX timestamping (ptp4l doesn't do
that when it exists), packets are RX filtered again as they should be:
~/selftests/drivers/net/dsa# hwstamp_ctl -i swp0 -r 0
[ 218.202854] mscc_felix 0000:00:00.5: ocelot_l2_ptp_trap_del: port 0 removing L2 PTP trap
[ 218.212656] mscc_felix 0000:00:00.5: ocelot_ipv4_ptp_trap_del: port 0 removing IPv4 PTP event trap
[ 218.222975] mscc_felix 0000:00:00.5: ocelot_ipv4_ptp_trap_del: port 0 removing IPv4 PTP general trap
[ 218.233133] mscc_felix 0000:00:00.5: ocelot_ipv6_ptp_trap_del: port 0 removing IPv6 PTP event trap
[ 218.242251] mscc_felix 0000:00:00.5: ocelot_ipv6_ptp_trap_del: port 0 removing IPv6 PTP general trap
current settings:
tx_type 1
rx_filter 12
new settings:
tx_type 1
rx_filter 0
~/selftests/drivers/net/dsa# ./local_termination.sh eno0 swp0
TEST: swp0: Unicast IPv4 to primary MAC address [ OK ]
TEST: swp0: Unicast IPv4 to macvlan MAC address [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, promisc [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, allmulti [ OK ]
TEST: swp0: Multicast IPv4 to joined group [ OK ]
TEST: swp0: Multicast IPv4 to unknown group [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, promisc [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, allmulti [ OK ]
TEST: swp0: Multicast IPv6 to joined group [ OK ]
TEST: swp0: Multicast IPv6 to unknown group [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, promisc [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, allmulti [ OK ]
So it's clear that something in the PTP RX trapping logic went wrong.
Looking a bit at the code, I can see that there are 4 typos, which
populate "ipv4" VCAP IS2 key filter fields for IPv6 keys.
VCAP IS2 keys of type OCELOT_VCAP_KEY_IPV4 and OCELOT_VCAP_KEY_IPV6 are
handled by is2_entry_set(). OCELOT_VCAP_KEY_IPV4 looks at
&filter->key.ipv4, and OCELOT_VCAP_KEY_IPV6 at &filter->key.ipv6.
Simply put, when we populate the wrong key field, &filter->key.ipv6
fields "proto.mask" and "proto.value" remain all zeroes (or "don't care").
So is2_entry_set() will enter the "else" of this "if" condition:
if (msk == 0xff && (val == IPPROTO_TCP || val == IPPROTO_UDP))
and proceed to ignore the "proto" field. The resulting rule will match
on all IPv6 traffic, trapping it to the CPU.
This is the reason why the local_termination.sh selftest sees it,
because control traps are stronger than the PGID_MCIPV6 used for
flooding (from the forwarding data path).
But the problem is in fact much deeper. We trap all IPv6 traffic to the
CPU, but if we're bridged, we set skb->offload_fwd_mark = 1, so software
forwarding will not take place and IPv6 traffic will never reach its
destination.
The fix is simple - correct the typos.
I was intentionally inaccurate in the commit message about the breakage
occurring when any PTP timestamping is enabled. In fact it only happens
when L4 timestamping is requested (HWTSTAMP_FILTER_PTP_V2_EVENT or
HWTSTAMP_FILTER_PTP_V2_L4_EVENT). But ptp4l requests a larger RX
timestamping filter than it needs for "-2": HWTSTAMP_FILTER_PTP_V2_EVENT.
I wanted people skimming through git logs to not think that the bug
doesn't affect them because they only use ptp4l in L2 mode.
Fixes: 96ca08c05838 ("net: mscc: ocelot: set up traps for PTP packets")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230207183117.1745754-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pietro Borrello [Tue, 7 Feb 2023 18:26:34 +0000 (18:26 +0000)]
rds: rds_rm_zerocopy_callback() use list_first_entry()
rds_rm_zerocopy_callback() uses list_entry() on the head of a list
causing a type confusion.
Use list_first_entry() to actually access the first element of the
rs_zcookie_queue list.
Fixes: 9426bbc6de99 ("rds: use list structure to track information for zerocopy completion notification")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Link: https://lore.kernel.org/r/20230202-rds-zerocopy-v3-1-83b0df974f9a@diag.uniroma1.it
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Thu, 9 Feb 2023 05:35:38 +0000 (21:35 -0800)]
Merge tag 'ipsec-2023-02-08' of git://git./linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
ipsec 2023-02-08
1) Fix policy checks for nested IPsec tunnels when using
xfrm interfaces. From Benedict Wong.
2) Fix netlink message expression on 32=>64-bit
messages translators. From Anastasia Belova.
3) Prevent potential spectre v1 gadget in xfrm_xlate32_attr.
From Eric Dumazet.
4) Always consistently use time64_t in xfrm_timer_handler.
From Eric Dumazet.
5) Fix KCSAN reported bug: Multiple cpus can update use_time
at the same time. From Eric Dumazet.
6) Fix SCP copy from IPv4 to IPv6 on interfamily tunnel.
From Christian Hopps.
* tag 'ipsec-2023-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: fix bug with DSCP copy to v6 from v4 tunnel
xfrm: annotate data-race around use_time
xfrm: consistently use time64_t in xfrm_timer_handler()
xfrm/compat: prevent potential spectre v1 gadget in xfrm_xlate32_attr()
xfrm: compat: change expression for switch in xfrm_xlate64
Fix XFRM-I support for nested ESP tunnels
====================
Link: https://lore.kernel.org/r/20230208114322.266510-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 9 Feb 2023 05:32:19 +0000 (21:32 -0800)]
Merge tag 'linux-can-next-for-6.3-
20230208' of git://git./linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
can-next 2023-02-08
The 1st patch is by Oliver Hartkopp and cleans up the CAN_RAW's
raw_setsockopt() for CAN_RAW_FD_FRAMES.
The 2nd patch is by me and fixes the compilation if
CONFIG_CAN_CALC_BITTIMING is disabled. (Problem introduced in last
pull request to next-next.)
* tag 'linux-can-next-for-6.3-
20230208' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
can: bittiming: can_calc_bittiming(): add missing parameter to no-op function
can: raw: use temp variable instead of rolling back config
====================
Link: https://lore.kernel.org/r/20230208210014.3169347-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 9 Feb 2023 05:00:54 +0000 (21:00 -0800)]
Merge tag 'mlx5-next-netdev-deadlock' of git://git./linux/kernel/git/mellanox/linux
Saeed Mahameed says:
====================
mlx5-next-netdev-deadlock
This series from Jiri solves a deadlock when removing a network namespace
with mlx5 devlink instance being in it.
The deadlock is between:
1) mlx5_ib->unregister_netdevice_notifier()
AND
2) mlx5_core->devlink_reload->cleanup_net()
To slove this introduced mlx5 netdev added/removed events to track uplink
netdev to be used for register_netdevice_notifier_dev_net() purposes.
* tag 'mlx5-next-netdev-deadlock' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister
net/mlx5e: Propagate an internal event in case uplink netdev changes
net/mlx5e: Fix trap event handling
net/mlx5: Introduce CQE error syndrome
====================
Link: https://lore.kernel.org/r/20230208005626.72930-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yang Li [Wed, 8 Feb 2023 00:49:59 +0000 (08:49 +0800)]
net: libwx: Remove unneeded semicolon
./drivers/net/ethernet/wangxun/libwx/wx_lib.c:683:2-3: Unneeded semicolon
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3976
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230208004959.47553-1-yang.lee@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yang Li [Wed, 8 Feb 2023 01:32:27 +0000 (09:32 +0800)]
net: libwx: clean up one inconsistent indenting
drivers/net/ethernet/wangxun/libwx/wx_lib.c:1835 wx_setup_all_rx_resources() warn: inconsistent indenting
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3981
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230208013227.111605-1-yang.lee@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiawen Wu [Wed, 8 Feb 2023 02:30:35 +0000 (10:30 +0800)]
net: txgbe: Update support email address
Update new email address for Wangxun 10Gb NIC support team.
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://lore.kernel.org/r/20230208023035.3371250-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiri Pirko [Tue, 1 Nov 2022 14:36:01 +0000 (15:36 +0100)]
RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister
When removing a network namespace with mlx5 devlink instance being in
it, following callchain is performed:
cleanup_net (takes down_read(&pernet_ops_rwsem)
devlink_pernet_pre_exit()
devlink_reload()
mlx5_devlink_reload_down()
mlx5_unload_one_devl_locked()
mlx5_detach_device()
del_adev()
mlx5r_remove()
__mlx5_ib_remove()
mlx5_ib_roce_cleanup()
mlx5_remove_netdev_notifier()
unregister_netdevice_notifier (takes down_write(&pernet_ops_rwsem)
This deadlocks.
Resolve this by converting to register_netdevice_notifier_dev_net()
which does not take pernet_ops_rwsem and moves the notifier block around
according to netdev it takes as arg.
Use previously introduced netdev added/removed events to track uplink
netdev to be used for register_netdevice_notifier_dev_net() purposes.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Jiri Pirko [Tue, 1 Nov 2022 14:27:43 +0000 (15:27 +0100)]
net/mlx5e: Propagate an internal event in case uplink netdev changes
Whenever uplink netdev is set/cleared, propagate newly introduced event
to inform notifier blocks netdev was added/removed.
Move the set() helper to core.c from header, introduce clear() and
netdev_added_event_replay() helpers. The last one is going to be called
from rdma driver, so export it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Jiri Pirko [Thu, 24 Nov 2022 12:05:53 +0000 (13:05 +0100)]
net/mlx5e: Fix trap event handling
Current code does not return correct return value from event handler.
Fix it by returning NOTIFY_* and propagate err over newly introduce ctx
structure.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Jakub Kicinski [Thu, 9 Feb 2023 03:23:44 +0000 (19:23 -0800)]
Merge tag 'mlx5-fixes-2023-02-07' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2023-02-07
This series provides bug fixes to mlx5 driver.
* tag 'mlx5-fixes-2023-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Serialize module cleanup with reload and remove
net/mlx5: fw_tracer, Zero consumer index when reloading the tracer
net/mlx5: fw_tracer, Clear load bit when freeing string DBs buffers
net/mlx5: Expose SF firmware pages counter
net/mlx5: Store page counters in a single array
net/mlx5e: IPoIB, Show unknown speed instead of error
net/mlx5e: Fix crash unsetting rx-vlan-filter in switchdev mode
net/mlx5: Bridge, fix ageing of peer FDB entries
net/mlx5: DR, Fix potential race in dr_rule_create_rule_nic
net/mlx5e: Update rx ring hw mtu upon each rx-fcs flag change
====================
Link: https://lore.kernel.org/r/20230208030302.95378-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 9 Feb 2023 03:05:58 +0000 (19:05 -0800)]
Merge tag 'mlx5-updates-2023-02-07' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-02-07
1) Minor and trivial code Cleanups
2) Minor fixes for net-next
3) From Shay: dynamic FW trace strings update.
* tag 'mlx5-updates-2023-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: fw_tracer, Add support for unrecognized string
net/mlx5: fw_tracer, Add support for strings DB update event
net/mlx5: fw_tracer, allow 0 size string DBs
net/mlx5: fw_tracer: Fix debug print
net/mlx5: fs, Remove redundant assignment of size
net/mlx5: fs_core, Remove redundant variable err
net/mlx5: Fix memory leak in error flow of port set buffer
net/mlx5e: Remove incorrect debugfs_create_dir NULL check in TLS
net/mlx5e: Remove incorrect debugfs_create_dir NULL check in hairpin
net/mlx5: fs, Remove redundant vport_number assignment
net/mlx5e: Remove redundant code for handling vlan actions
net/mlx5e: Don't listen to remove flows event
net/mlx5: fw reset: Skip device ID check if PCI link up failed
net/mlx5: Remove redundant health work lock
mlx5: reduce stack usage in mlx5_setup_tc
====================
Link: https://lore.kernel.org/r/20230208003712.68386-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 7 Feb 2023 14:18:19 +0000 (16:18 +0200)]
selftests: Fix failing VXLAN VNI filtering test
iproute2 does not recognize the "group6" and "remote6" keywords. Fix by
using "group" and "remote" instead.
Before:
# ./test_vxlan_vnifiltering.sh
[...]
Tests passed: 25
Tests failed: 2
After:
# ./test_vxlan_vnifiltering.sh
[...]
Tests passed: 27
Tests failed: 0
Fixes: 3edf5f66c12a ("selftests: add new tests for vxlan vnifiltering")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230207141819.256689-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rong Tao [Wed, 8 Feb 2023 01:04:41 +0000 (09:04 +0800)]
samples/bpf: Add openat2() enter/exit tracepoint to syscall_tp sample
Commit
fe3300897cbf("samples: bpf: fix syscall_tp due to unused syscall")
added openat() syscall tracepoints. This patch adds support for
openat2() as well.
Signed-off-by: Rong Tao <rongtao@cestc.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/tencent_9381CB1A158ED7ADD12C4406034E21A3AC07@qq.com
Jon Doron [Tue, 7 Feb 2023 08:19:16 +0000 (10:19 +0200)]
libbpf: Add sample_period to creation options
Add option to set when the perf buffer should wake up, by default the
perf buffer becomes signaled for every event that is being pushed to it.
In case of a high throughput of events it will be more efficient to wake
up only once you have X events ready to be read.
So your application can wakeup once and drain the entire perf buffer.
Signed-off-by: Jon Doron <jond@wiz.io>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230207081916.3398417-1-arilou@gmail.com
Marc Kleine-Budde [Tue, 7 Feb 2023 20:15:01 +0000 (21:15 +0100)]
can: bittiming: can_calc_bittiming(): add missing parameter to no-op function
In commit
286c0e09e8e0 ("can: bittiming: can_changelink() pass extack
down callstack") a new parameter was added to can_calc_bittiming(),
however the static inline no-op (which is used if
CONFIG_CAN_CALC_BITTIMING is disabled) wasn't converted.
Add the new parameter to the static inline no-op of
can_calc_bittiming().
Fixes: 286c0e09e8e0 ("can: bittiming: can_changelink() pass extack down callstack")
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/20230207201734.2905618-1-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Oliver Hartkopp [Fri, 3 Feb 2023 09:08:07 +0000 (10:08 +0100)]
can: raw: use temp variable instead of rolling back config
Introduce a temporary variable to check for an invalid configuration
attempt from user space. Before this patch the value was copied to
the real config variable and rolled back in the case of an error.
Suggested-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20230203090807.97100-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Lorenzo Bianconi [Wed, 8 Feb 2023 16:58:40 +0000 (17:58 +0100)]
sfc: move xdp_features configuration in efx_pci_probe_post_io()
Move xdp_features configuration from efx_pci_probe() to
efx_pci_probe_post_io() since it is where all the other basic netdev
features are initialised.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/9bd31c9a29bcf406ab90a249a28fc328e5578fd1.1675875404.git.lorenzo@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Dave Thaler [Fri, 27 Jan 2023 01:47:06 +0000 (01:47 +0000)]
bpf, docs: Add note about type convention
Add explanation about use of "u64", "u32", etc. as
the type convention used in BPF documentation.
Signed-off-by: Dave Thaler <dthaler@microsoft.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230127014706.1005-1-dthaler1968@googlemail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Toke Høiland-Jørgensen [Wed, 8 Feb 2023 16:41:43 +0000 (17:41 +0100)]
bpf/docs: Update design QA to be consistent with kfunc lifecycle docs
Cong pointed out that there are some inconsistencies between the BPF design
QA and the lifecycle expectations documentation we added for kfuncs. Let's
update the QA file to be consistent with the kfunc docs, and add references
where it makes sense. Also document that modules may export kfuncs now.
v3:
- Grammar nit + ack from David
v2:
- Fix repeated word (s/defined defined/defined/)
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230208164143.286392-1-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
David S. Miller [Wed, 8 Feb 2023 09:48:53 +0000 (09:48 +0000)]
Merge branch 'taprio-auto-qmaxsdu-new-tx'
Vladimir Oltean says:
====================
taprio automatic queueMaxSDU and new TXQ selection procedure
This patch set addresses 2 design limitations in the taprio software scheduler:
1. Software scheduling fundamentally prioritizes traffic incorrectly,
in a way which was inspired from Intel igb/igc drivers and does not
follow the inputs user space gives (traffic classes and TC to TXQ
mapping). Patch 05/15 handles this, 01/15 - 04/15 are preparations
for this work.
2. Software scheduling assumes that the gate for a traffic class closes
as soon as the next interval begins. But this isn't true.
If consecutive schedule entries have that traffic class gate open,
there is no "gate close" event and taprio should keep dequeuing from
that TC without interruptions. Patches 06/15 - 15/15 handle this.
Patch 10/15 is a generic Qdisc change required for this to work.
Future development directions which depend on this patch set are:
- Propagating the automatic queueMaxSDU calculation down to offloading
device drivers, instead of letting them calculate this, as
vsc9959_tas_guard_bands_update() does today.
- A software data path for tc-taprio with preemptible traffic and
Hold/Release events.
v1 at:
https://patchwork.kernel.org/project/netdevbpf/cover/
20230128010719.
2182346-1-vladimir.oltean@nxp.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:40 +0000 (15:54 +0200)]
net/sched: taprio: don't segment unnecessarily
Improve commit
497cc00224cf ("taprio: Handle short intervals and large
packets") to only perform segmentation when skb->len exceeds what
taprio_dequeue() expects.
In practice, this will make the biggest difference when a traffic class
gate is always open in the schedule. This is because the max_frm_len
will be U32_MAX, and such large skb->len values as Kurt reported will be
sent just fine unsegmented.
What I don't seem to know how to handle is how to make sure that the
segmented skbs themselves are smaller than the maximum frame size given
by the current queueMaxSDU[tc]. Nonetheless, we still need to drop
those, otherwise the Qdisc will hang.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:39 +0000 (15:54 +0200)]
net/sched: taprio: split segmentation logic from qdisc_enqueue()
The majority of the taprio_enqueue()'s function is spent doing TCP
segmentation, which doesn't look right to me. Compilers shouldn't have a
problem in inlining code no matter how we write it, so move the
segmentation logic to a separate function.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:38 +0000 (15:54 +0200)]
net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations
taprio today has a huge problem with small TC gate durations, because it
might accept packets in taprio_enqueue() which will never be sent by
taprio_dequeue().
Since not much infrastructure was available, a kludge was added in
commit
497cc00224cf ("taprio: Handle short intervals and large
packets"), which segmented large TCP segments, but the fact of the
matter is that the issue isn't specific to large TCP segments (and even
worse, the performance penalty in segmenting those is absolutely huge).
In commit
a54fc09e4cba ("net/sched: taprio: allow user input of per-tc
max SDU"), taprio gained support for queueMaxSDU, which is precisely the
mechanism through which packets should be dropped at qdisc_enqueue() if
they cannot be sent.
After that patch, it was necessary for the user to manually limit the
maximum MTU per TC. This change adds the necessary logic for taprio to
further limit the values specified (or not specified) by the user to
some minimum values which never allow oversized packets to be sent.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:37 +0000 (15:54 +0200)]
net/sched: keep the max_frm_len information inside struct sched_gate_list
I have one practical reason for doing this and one concerning correctness.
The practical reason has to do with a follow-up patch, which aims to mix
2 sources of max_sdu (one coming from the user and the other automatically
calculated based on TC gate durations @current link speed). Among those
2 sources of input, we must always select the smaller max_sdu value, but
this can change at various link speeds. So the max_sdu coming from the
user must be kept separated from the value that is operationally used
(the minimum of the 2), because otherwise we overwrite it and forget
what the user asked us to do.
To solve that, this patch proposes that struct sched_gate_list contains
the operationally active max_frm_len, and q->max_sdu contains just what
was requested by the user.
The reason having to do with correctness is based on the following
observation: the admin sched_gate_list becomes operational at a given
base_time in the future. Until then, it is inactive and applies no
shaping, all gates are open, etc. So the queueMaxSDU dropping shouldn't
apply either (this is a mechanism to ensure that packets smaller than
the largest gate duration for that TC don't hang the port; clearly it
makes little sense if the gates are always open).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:36 +0000 (15:54 +0200)]
net/sched: taprio: warn about missing size table
Vinicius intended taprio to take the L1 overhead into account when
estimating packet transmission time through user input, specifically
through the qdisc size table (man tc-stab).
Something like this:
tc qdisc replace dev $eth root stab overhead 24 taprio \
num_tc 8 \
map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
base-time 0 \
sched-entry S 0x7e
9000000 \
sched-entry S 0x82
1000000 \
max-sdu 0 0 0 0 0 0 0 200 \
flags 0x0 clockid CLOCK_TAI
Without the overhead being specified, transmission times will be
underestimated and will cause late transmissions. For an offloading
driver, it might even cause TX hangs if there is no open gate large
enough to send the maximum sized packets for that TC (including L1
overhead). Properly knowing the L1 overhead will ensure that we are able
to auto-calculate the queueMaxSDU per traffic class just right, and
avoid these hangs due to head-of-line blocking.
We can't make the stab mandatory due to existing setups, but we can warn
the user that it's important with a warning netlink extack.
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220505160357.298794-1-vladimir.oltean@nxp.com/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:35 +0000 (15:54 +0200)]
net/sched: make stab available before ops->init() call
Some qdiscs like taprio turn out to be actually pretty reliant on a well
configured stab, to not underestimate the skb transmission time (by
properly accounting for L1 overhead).
In a future change, taprio will need the stab, if configured by the
user, to be available at ops->init() time. It will become even more
important in upcoming work, when the overhead will be used for the
queueMaxSDU calculation that is passed to an offloading driver.
However, rcu_assign_pointer(sch->stab, stab) is called right after
ops->init(), making it unavailable, and I don't really see a good reason
for that.
Move it earlier, which nicely seems to simplify the error handling path
as well.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:34 +0000 (15:54 +0200)]
net/sched: taprio: calculate guard band against actual TC gate close time
taprio_dequeue_from_txq() looks at the entry->end_time to determine
whether the skb will overrun its traffic class gate, as if at the end of
the schedule entry there surely is a "gate close" event for it. Hint:
maybe there isn't.
For each schedule entry, introduce an array of kernel times which
actually tracks when in the future will there be an *actual* gate close
event for that traffic class, and use that in the guard band overrun
calculation.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:33 +0000 (15:54 +0200)]
net/sched: taprio: calculate budgets per traffic class
Currently taprio assumes that the budget for a traffic class expires at
the end of the current interval as if the next interval contains a "gate
close" event for this traffic class.
This is, however, an unfounded assumption. Allow schedule entry
intervals to be fused together for a particular traffic class by
calculating the budget until the gate *actually* closes.
This means we need to keep budgets per traffic class, and we also need
to update the budget consumption procedure.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>