linux.git
13 months agoinet: use xa_array iterator to implement inet_dump_ifaddr()
Eric Dumazet [Thu, 29 Feb 2024 11:40:16 +0000 (11:40 +0000)]
inet: use xa_array iterator to implement inet_dump_ifaddr()

1) inet_dump_ifaddr() can can run under RCU protection
   instead of RTNL.

2) properly return 0 at the end of a dump, avoiding an
   an extra recvmsg() system call.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoinet: prepare inet_base_seq() to run without RTNL
Eric Dumazet [Thu, 29 Feb 2024 11:40:15 +0000 (11:40 +0000)]
inet: prepare inet_base_seq() to run without RTNL

In the following patch, inet_base_seq() will no longer be called
with RTNL held.

Add READ_ONCE()/WRITE_ONCE() annotations in dev_base_seq_inc()
and inet_base_seq().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoinet: annotate data-races around ifa->ifa_flags
Eric Dumazet [Thu, 29 Feb 2024 11:40:14 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_flags

ifa->ifa_flags can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoinet: annotate data-races around ifa->ifa_preferred_lft
Eric Dumazet [Thu, 29 Feb 2024 11:40:13 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_preferred_lft

ifa->ifa_preferred_lft can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoinet: annotate data-races around ifa->ifa_valid_lft
Eric Dumazet [Thu, 29 Feb 2024 11:40:12 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_valid_lft

ifa->ifa_valid_lft can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoinet: annotate data-races around ifa->ifa_tstamp and ifa->ifa_cstamp
Eric Dumazet [Thu, 29 Feb 2024 11:40:11 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_tstamp and ifa->ifa_cstamp

ifa->ifa_tstamp can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Do the same for ifa->ifa_cstamp to prepare upcoming changes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoMerge branch 'netdevsim-link'
David S. Miller [Fri, 1 Mar 2024 10:43:11 +0000 (10:43 +0000)]
Merge branch 'netdevsim-link'

David Wei says:

====================
netdevsim: link and forward skbs between ports

This patchset adds the ability to link two netdevsim ports together and
forward skbs between them, similar to veth. The goal is to use netdevsim
for testing features e.g. zero copy Rx using io_uring.

This feature was tested locally on QEMU, and a selftest is included.

I ran netdev selftests CI style and all tests but the following passed:
- gro.sh
- l2tp.sh
- ip_local_port_range.sh

gro.sh fails because virtme-ng mounts as read-only and it tries to write
to log.txt. This issue was reported to virtme-ng upstream.

l2tp.sh and ip_local_port_range.sh both fail for me on net-next/main as
well.

---
v13->v14:
- implement ndo_get_iflink()
- fix returning 0 if peer is already linked during linking or not linked
  during unlinking
- bump dropped counter if nsim_ipsec_tx() fails and generally reorder
  nsim_start_xmit()
- fix overflowing lines and indentations

v12->v13:
- wait for socat listening port to be ready before sending data in
  selftest

v11->v12:
- fix leaked netns refs
- fix rtnetlink.sh kci_test_ipsec_offload() selftest

v10->v11:
- add udevadm settle after creating netdevsims in selftest

v9->v10:
- fix not freeing skb when not there is no peer
- prevent possible id clashes in selftest
- cleanup selftest on error paths

v8->v9:
- switch to getting netns using fd rather than id
- prevent linking a netdevsim to itself
- update tests

v7->v8:
- fix not dereferencing RCU ptr using rcu_dereference()
- remove unused variables in selftest

v6->v7:
- change link syntax to netnsid:ifidx
- replace dev_get_by_index() with __dev_get_by_index()
- check for NULL peer when linking
- add a sysfs attribute for unlinking
- only update Tx stats if not dropped
- update selftest

v5->v6:
- reworked to link two netdevsims using sysfs attribute on the bus
  device instead of debugfs due to deadlock possibility if a netdevsim
  is removed during linking
- removed unnecessary patch maintaining a list of probed nsim_devs
- updated selftest

v4->v5:
- reduce nsim_dev_list_lock critical section
- fixed missing mutex unlock during unwind ladder
- rework nsim_dev_peer_write synchronization to take devlink lock as
  well as rtnl_lock
- return err msgs to user during linking if port doesn't exist or
  linking to self
- update tx stats outside of RCU lock

v3->v4:
- maintain a mutex protected list of probed nsim_devs instead of using
  nsim_bus_dev
- fixed synchronization issues by taking rtnl_lock
- track tx_dropped skbs

v2->v3:
- take lock when traversing nsim_bus_dev_list
- take device ref when getting a nsim_bus_dev
- return 0 if nsim_dev_peer_read cannot find the port
- address code formatting
- do not hard code values in selftests
- add Makefile for selftests

v1->v2:
- renamed debugfs file from "link" to "peer"
- replaced strstep() with sscanf() for consistency
- increased char[] buf sz to 22 for copying id + port from user
- added err msg w/ expected fmt when linking as a hint to user
- prevent linking port to itself
- protect peer ptr using RCU

====================

Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonetdevsim: fix rtnetlink.sh selftest
David Wei [Wed, 28 Feb 2024 23:22:53 +0000 (15:22 -0800)]
netdevsim: fix rtnetlink.sh selftest

I cleared IFF_NOARP flag from netdevsim dev->flags in order to support
skb forwarding. This breaks the rtnetlink.sh selftest
kci_test_ipsec_offload() test because ipsec does not connect to peers it
cannot transmit to.

Fix the issue by adding a neigh entry manually. ipsec_offload test now
successfully pass.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonetdevsim: add selftest for forwarding skb between connected ports
David Wei [Wed, 28 Feb 2024 23:22:52 +0000 (15:22 -0800)]
netdevsim: add selftest for forwarding skb between connected ports

Connect two netdevsim ports in different namespaces together, then send
packets between them using socat.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonetdevsim: add ndo_get_iflink() implementation
David Wei [Wed, 28 Feb 2024 23:22:51 +0000 (15:22 -0800)]
netdevsim: add ndo_get_iflink() implementation

Add an implementation for ndo_get_iflink() in netdevsim that shows the
ifindex of the linked peer, if any.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonetdevsim: forward skbs from one connected port to another
David Wei [Wed, 28 Feb 2024 23:22:50 +0000 (15:22 -0800)]
netdevsim: forward skbs from one connected port to another

Forward skbs sent from one netdevsim port to its connected netdevsim
port using dev_forward_skb, in a spirit similar to veth.

Add a tx_dropped variable to struct netdevsim, tracking the number of
skbs that could not be forwarded using dev_forward_skb().

The xmit() function accessing the peer ptr is protected by an RCU read
critical section. The rcu_read_lock() is functionally redundant as since
v5.0 all softirqs are implicitly RCU read critical sections; but it is
useful for human readers.

If another CPU is concurrently in nsim_destroy(), then it will first set
the peer ptr to NULL. This does not affect any existing readers that
dereferenced a non-NULL peer. Then, in unregister_netdevice(), there is
a synchronize_rcu() before the netdev is actually unregistered and
freed. This ensures that any readers i.e. xmit() that got a non-NULL
peer will complete before the netdev is freed.

Any readers after the RCU_INIT_POINTER() but before synchronize_rcu()
will dereference NULL, making it safe.

The codepath to nsim_destroy() and nsim_create() takes both the newly
added nsim_dev_list_lock and rtnl_lock. This makes it safe with
concurrent calls to linking two netdevsims together.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonetdevsim: allow two netdevsim ports to be connected
David Wei [Wed, 28 Feb 2024 23:22:49 +0000 (15:22 -0800)]
netdevsim: allow two netdevsim ports to be connected

Add two netdevsim bus attribute to sysfs:
/sys/bus/netdevsim/link_device
/sys/bus/netdevsim/unlink_device

Writing "A M B N" to link_device will link netdevsim M in netnsid A with
netdevsim N in netnsid B.

Writing "A M" to unlink_device will unlink netdevsim M in netnsid A from
its peer, if any.

rtnl_lock is taken to ensure nothing changes during the linking.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoMerge branch 'selftests-xfail'
David S. Miller [Fri, 1 Mar 2024 10:30:30 +0000 (10:30 +0000)]
Merge branch 'selftests-xfail'

Jakub Kicinski says:

====================
selftests: kselftest_harness: support using xfail

When running selftests for our subsystem in our CI we'd like all
tests to pass. Currently some tests use SKIP for cases they
expect to fail, because the kselftest_harness limits the return
codes to pass/fail/skip. XFAIL which would be a great match
here cannot be used.

Remove the no_print handling and use vfork() to run the test in
a different process than the setup. This way we don't need to
pass "failing step" via the exit code. Further clean up the exit
codes so that we can use all KSFT_* values. Rewrite the result
printing to make handling XFAIL/XPASS easier. Support tests
declaring combinations of fixture + variant they expect to fail.

Merge plan is to put it on top of -rc6 and merge into net-next.
That way others should be able to pull the patches without
any networking changes.

v4:
 - rebase on top of Mickael's vfork() changes
v3: https://lore.kernel.org/all/20240220192235.2953484-1-kuba@kernel.org/
 - combine multiple series
 - change to "list of expected failures" rather than SKIP()-like handling
v2: https://lore.kernel.org/all/20240216002619.1999225-1-kuba@kernel.org/
 - fix alignment
follow up RFC: https://lore.kernel.org/all/20240216004122.2004689-1-kuba@kernel.org/
v1: https://lore.kernel.org/all/20240213154416.422739-1-kuba@kernel.org/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: ip_local_port_range: use XFAIL instead of SKIP
Jakub Kicinski [Thu, 29 Feb 2024 00:59:19 +0000 (16:59 -0800)]
selftests: ip_local_port_range: use XFAIL instead of SKIP

SCTP does not support IP_LOCAL_PORT_RANGE and we know it,
so use XFAIL instead of SKIP.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: kselftest_harness: support using xfail
Jakub Kicinski [Thu, 29 Feb 2024 00:59:18 +0000 (16:59 -0800)]
selftests: kselftest_harness: support using xfail

Currently some tests report skip for things they expect to fail
e.g. when given combination of parameters is known to be unsupported.
This is confusing because in an ideal test environment and fully
featured kernel no tests should be skipped.

Selftest summary line already includes xfail and xpass counters,
e.g.:

  Totals: pass:725 fail:0 xfail:0 xpass:0 skip:0 error:0

but there's no way to use it from within the harness.

Add a new per-fixture+variant combination list of test cases
we expect to fail.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: kselftest_harness: let PASS / FAIL provide diagnostic
Jakub Kicinski [Thu, 29 Feb 2024 00:59:17 +0000 (16:59 -0800)]
selftests: kselftest_harness: let PASS / FAIL provide diagnostic

Switch to printing KTAP line for PASS / FAIL with ksft_test_result_code(),
this gives us the ability to report diagnostic messages.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: kselftest_harness: separate diagnostic message with # in ksft_test_result_...
Jakub Kicinski [Thu, 29 Feb 2024 00:59:16 +0000 (16:59 -0800)]
selftests: kselftest_harness: separate diagnostic message with # in ksft_test_result_code()

According to the spec we should always print a # if we add
a diagnostic message. Having the caller pass in the new line
as part of diagnostic message makes handling this a bit
counter-intuitive, so append the new line in the helper.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: kselftest_harness: print test name for SKIP
Jakub Kicinski [Thu, 29 Feb 2024 00:59:15 +0000 (16:59 -0800)]
selftests: kselftest_harness: print test name for SKIP

Jakub points out that for parsers it's rather useful to always
have the test name on the result line. Currently if we SKIP
(or soon XFAIL or XPASS), we will print:

ok 17 # SKIP SCTP doesn't support IP_BIND_ADDRESS_NO_PORT

     ^
     no test name

Always print the test name.
KTAP format seems to allow or even call for it, per:
https://docs.kernel.org/dev-tools/ktap.html

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/all/87jzn6lnou.fsf@cloudflare.com/
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: kselftest: add ksft_test_result_code(), handling all exit codes
Jakub Kicinski [Thu, 29 Feb 2024 00:59:14 +0000 (16:59 -0800)]
selftests: kselftest: add ksft_test_result_code(), handling all exit codes

For generic test harness code it's more useful to deal with exit
codes directly, rather than having to switch on them and call
the right ksft_test_result_*() helper. Add such function to kselftest.h.

Note that "directive" and "diagnostic" are what ktap docs call
those parts of the message.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: kselftest_harness: use exit code to store skip
Jakub Kicinski [Thu, 29 Feb 2024 00:59:13 +0000 (16:59 -0800)]
selftests: kselftest_harness: use exit code to store skip

We always use skip in combination with exit_code being 0
(KSFT_PASS). This are basic KSFT / KTAP semantics.
Store the right KSFT_* code in exit_code directly.

This makes it easier to support tests reporting other
extended KSFT_* codes like XFAIL / XPASS.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: kselftest_harness: save full exit code in metadata
Jakub Kicinski [Thu, 29 Feb 2024 00:59:12 +0000 (16:59 -0800)]
selftests: kselftest_harness: save full exit code in metadata

Instead of tracking passed = 0/1 rename the field to exit_code
and invert the values so that they match the KSFT_* exit codes.
This will allow us to fold SKIP / XFAIL into the same value.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: kselftest_harness: generate test name once
Jakub Kicinski [Thu, 29 Feb 2024 00:59:11 +0000 (16:59 -0800)]
selftests: kselftest_harness: generate test name once

Since we added variant support generating full test case
name takes 4 string arguments. We're about to need it
in another two places. Stop the duplication and print
once into a temporary buffer.

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests: kselftest_harness: use KSFT_* exit codes
Jakub Kicinski [Thu, 29 Feb 2024 00:59:10 +0000 (16:59 -0800)]
selftests: kselftest_harness: use KSFT_* exit codes

Now that we no longer need low exit codes to communicate
assertion steps - use normal KSFT exit codes.

Acked-by: Kees Cook <keescook@chromium.org>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests/harness: Merge TEST_F_FORK() into TEST_F()
Mickaël Salaün [Thu, 29 Feb 2024 00:59:09 +0000 (16:59 -0800)]
selftests/harness: Merge TEST_F_FORK() into TEST_F()

Replace Landlock-specific TEST_F_FORK() with an improved TEST_F() which
brings four related changes:

Run TEST_F()'s tests in a grandchild process to make it possible to
drop privileges and delegate teardown to the parent.

Compared to TEST_F_FORK(), simplify handling of the test grandchild
process thanks to vfork(2), and makes it generic (e.g. no explicit
conversion between exit code and _metadata).

Compared to TEST_F_FORK(), run teardown even when tests failed with an
assert thanks to commit 63e6b2a42342 ("selftests/harness: Run TEARDOWN
for ASSERT failures").

Simplify the test harness code by removing the no_print and step fields
which are not used.  I added this feature just after I made
kselftest_harness.h more broadly available but this step counter
remained even though it wasn't needed after all. See commit 369130b63178
("selftests: Enhance kselftest_harness.h to print which assert failed").

Replace spaces with tabs in one line of __TEST_F_IMPL().

Cc: Günther Noack <gnoack@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoselftests/landlock: Redefine TEST_F() as TEST_F_FORK()
Mickaël Salaün [Thu, 29 Feb 2024 00:59:08 +0000 (16:59 -0800)]
selftests/landlock: Redefine TEST_F() as TEST_F_FORK()

This has the effect of creating a new test process for either TEST_F()
or TEST_F_FORK(), which doesn't change tests but will ease potential
backports.  See next commit for the TEST_F_FORK() merge into TEST_F().

Cc: Günther Noack <gnoack@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoMerge branch 'net-asp22-optimizations'
David S. Miller [Fri, 1 Mar 2024 09:22:50 +0000 (09:22 +0000)]
Merge branch 'net-asp22-optimizations'

Justin Chen says:

====================
Support for ASP 2.2 and optimizations

ASP 2.2 adds some power savings during low power modes.

Also make various improvements when entering low power modes and
reduce MDIO traffic by hooking up interrupts.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: bcmasp: Add support for PHY interrupts
Justin Chen [Wed, 28 Feb 2024 22:54:00 +0000 (14:54 -0800)]
net: bcmasp: Add support for PHY interrupts

Hook up the phy interrupts for internal phys to reduce mdio traffic
and improve responsiveness of link changes.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: bcmasp: Keep buffers through power management
Justin Chen [Wed, 28 Feb 2024 22:53:59 +0000 (14:53 -0800)]
net: bcmasp: Keep buffers through power management

There is no advantage of freeing and re-allocating buffers through
suspend and resume. This waste cycles and makes suspend/resume time
longer. We also open ourselves to failed allocations in systems with
heavy memory fragmentation.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: phy: mdio-bcm-unimac: Add asp v2.2 support
Justin Chen [Wed, 28 Feb 2024 22:53:58 +0000 (14:53 -0800)]
net: phy: mdio-bcm-unimac: Add asp v2.2 support

Add mdio compat string for ASP 2.0 ethernet driver.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: bcmasp: Add support for ASP 2.2
Justin Chen [Wed, 28 Feb 2024 22:53:57 +0000 (14:53 -0800)]
net: bcmasp: Add support for ASP 2.2

ASP 2.2 improves power savings during low power modes.

A new register was added to toggle to a slower clock during low
power modes.

EEE was broken for ASP 2.0/2.1. A HW workaround was added for
ASP 2.2 that requires toggling a chicken bit.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agodt-bindings: net: brcm,asp-v2.0: Add asp-v2.2
Justin Chen [Wed, 28 Feb 2024 22:53:56 +0000 (14:53 -0800)]
dt-bindings: net: brcm,asp-v2.0: Add asp-v2.2

Add support for ASP 2.2.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agodt-bindings: net: brcm,unimac-mdio: Add asp-v2.2
Justin Chen [Wed, 28 Feb 2024 22:53:55 +0000 (14:53 -0800)]
dt-bindings: net: brcm,unimac-mdio: Add asp-v2.2

The ASP 2.2 Ethernet controller uses a brcm unimac.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoMerge branch 'qcom-phy-possible'
David S. Miller [Fri, 1 Mar 2024 08:56:39 +0000 (08:56 +0000)]
Merge branch 'qcom-phy-possible'

Robert Marko says:

====================
net: phy: qcom: qca808x: fill in possible_interfaces

QCA808x does not currently fill in the possible_interfaces.

This leads to Phylink not being aware that it supports 2500Base-X as well
so in cases where it is connected to a DSA switch like MV88E6393 it will
limit that port to phy-mode set in the DTS.

That means that if SGMII is used you are limited to 1G only while if
2500Base-X was set you are limited to 2.5G only.

Populating the possible_interfaces fixes this.

Changes in v2:
* Get rid of the if/else by Russels suggestion in the helper
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: phy: qcom: qca808x: fill in possible_interfaces
Robert Marko [Wed, 28 Feb 2024 17:24:10 +0000 (18:24 +0100)]
net: phy: qcom: qca808x: fill in possible_interfaces

Currently QCA808x driver does not fill the possible_interfaces.
2.5G QCA808x support SGMII and 2500Base-X while 1G model only supports
SGMII, so fill the possible_interfaces accordingly.

Signed-off-by: Robert Marko <robimarko@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agonet: phy: qcom: qca808x: add helper for checking for 1G only model
Robert Marko [Wed, 28 Feb 2024 17:24:09 +0000 (18:24 +0100)]
net: phy: qcom: qca808x: add helper for checking for 1G only model

There are 2 versions of QCA808x, one 2.5G capable and one 1G capable.
Currently, this matter only in the .get_features call however, it will
be required for filling supported interface modes so lets add a helper
that can be reused.

Signed-off-by: Robert Marko <robimarko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoSimplify net_dbg_ratelimited() dummy
Geert Uytterhoeven [Wed, 28 Feb 2024 14:05:29 +0000 (15:05 +0100)]
Simplify net_dbg_ratelimited() dummy

There is no need to wrap calls to the no_printk() helper inside an
always-false check, as no_printk() already does that internally.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoMerge branch 'ipv6-devconf-lockless'
David S. Miller [Fri, 1 Mar 2024 08:42:33 +0000 (08:42 +0000)]
Merge branch 'ipv6-devconf-lockless'

Eric Dumazet says:

====================
ipv6: lockless accesses to devconf

- First patch puts in a cacheline_group the fields used in fast paths.

- Annotate all data races around idev->cnf fields.

- Last patch in this series removes RTNL use for RTM_GETNETCONF dumps.

v3: addressed Jakub Kicinski feedback in addrconf_disable_ipv6()
    Added tags from Jiri and Florian.

v2: addressed Jiri Pirko feedback
 - Added "ipv6: addrconf_disable_ipv6() optimizations"
   and "ipv6: addrconf_disable_policy() optimization"
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6: use xa_array iterator to implement inet6_netconf_dump_devconf()
Eric Dumazet [Wed, 28 Feb 2024 13:54:39 +0000 (13:54 +0000)]
ipv6: use xa_array iterator to implement inet6_netconf_dump_devconf()

1) inet6_netconf_dump_devconf() can run under RCU protection
   instead of RTNL.

2) properly return 0 at the end of a dump, avoiding an
   an extra recvmsg() system call.

3) Do not use inet6_base_seq() anymore, for_each_netdev_dump()
   has nice properties. Restarting a GETDEVCONF dump if a device has
   been added/removed or if net->ipv6.dev_addr_genid has changed is moot.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6/addrconf: annotate data-races around devconf fields (II)
Eric Dumazet [Wed, 28 Feb 2024 13:54:38 +0000 (13:54 +0000)]
ipv6/addrconf: annotate data-races around devconf fields (II)

Final (?) round of this series.

Annotate lockless reads on following devconf fields,
because they be changed concurrently from /proc/net/ipv6/conf.

- accept_dad
- optimistic_dad
- use_optimistic
- use_oif_addrs_only
- ra_honor_pio_life
- keep_addr_on_down
- ndisc_notify
- ndisc_evict_nocarrier
- suppress_frag_ndisc
- addr_gen_mode
- seg6_enabled
- ioam6_enabled
- ioam6_id
- ioam6_id_wide
- drop_unicast_in_l2_multicast
- mldv[12]_unsolicited_report_interval
- force_mld_version
- force_tllao
- accept_untracked_na
- drop_unsolicited_na
- accept_source_route

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6/addrconf: annotate data-races around devconf fields (I)
Eric Dumazet [Wed, 28 Feb 2024 13:54:37 +0000 (13:54 +0000)]
ipv6/addrconf: annotate data-races around devconf fields (I)

Annotate lockless reads and writes on following devconf fields:

- regen_min_advance
- regen_max_retry
- dad_transmits
- use_tempaddr
- max_addresses
- max_desync_factor
- temp_valid_lft
- rtr_solicits
- rtr_solicit_max_interval
- rtr_solicit_interval
- rtr_solicit_delay
- enhanced_dad
- accept_redirects

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6: addrconf_disable_policy() optimization
Eric Dumazet [Wed, 28 Feb 2024 13:54:36 +0000 (13:54 +0000)]
ipv6: addrconf_disable_policy() optimization

Writing over /proc/sys/net/ipv6/conf/default/disable_policy
does not need to hold RTNL.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6: annotate data-races around devconf->disable_policy
Eric Dumazet [Wed, 28 Feb 2024 13:54:35 +0000 (13:54 +0000)]
ipv6: annotate data-races around devconf->disable_policy

idev->cnf.disable_policy and net->ipv6.devconf_all->disable_policy
can be read locklessly. Add appropriate annotations on reads
and writes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6: annotate data-races around devconf->proxy_ndp
Eric Dumazet [Wed, 28 Feb 2024 13:54:34 +0000 (13:54 +0000)]
ipv6: annotate data-races around devconf->proxy_ndp

devconf->proxy_ndp can be read and written locklessly,
add appropriate annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6: annotate data-races in rt6_probe()
Eric Dumazet [Wed, 28 Feb 2024 13:54:33 +0000 (13:54 +0000)]
ipv6: annotate data-races in rt6_probe()

Use READ_ONCE() while reading idev->cnf.rtr_probe_interval
while its value could be changed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6: annotate data-races around idev->cnf.ignore_routes_with_linkdown
Eric Dumazet [Wed, 28 Feb 2024 13:54:32 +0000 (13:54 +0000)]
ipv6: annotate data-races around idev->cnf.ignore_routes_with_linkdown

idev->cnf.ignore_routes_with_linkdown can be used without any locks,
add appropriate annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6: annotate data-races in ndisc_router_discovery()
Eric Dumazet [Wed, 28 Feb 2024 13:54:31 +0000 (13:54 +0000)]
ipv6: annotate data-races in ndisc_router_discovery()

Annotate reads from in6_dev->cnf.XXX fields, as they could
change concurrently.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6: annotate data-races around cnf.forwarding
Eric Dumazet [Wed, 28 Feb 2024 13:54:30 +0000 (13:54 +0000)]
ipv6: annotate data-races around cnf.forwarding

idev->cnf.forwarding and net->ipv6.devconf_all->forwarding
might be read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6: annotate data-races around cnf.hop_limit
Eric Dumazet [Wed, 28 Feb 2024 13:54:29 +0000 (13:54 +0000)]
ipv6: annotate data-races around cnf.hop_limit

idev->cnf.hop_limit and net->ipv6.devconf_all->hop_limit
might be read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Florian Westphal <fw@strlen.de> # for netfilter parts
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6: annotate data-races around cnf.mtu6
Eric Dumazet [Wed, 28 Feb 2024 13:54:28 +0000 (13:54 +0000)]
ipv6: annotate data-races around cnf.mtu6

idev->cnf.mtu6 might be read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6: addrconf_disable_ipv6() optimization
Eric Dumazet [Wed, 28 Feb 2024 13:54:27 +0000 (13:54 +0000)]
ipv6: addrconf_disable_ipv6() optimization

Writing over /proc/sys/net/ipv6/conf/default/disable_ipv6
does not need to hold RTNL.

v3: remove a wrong change (Jakub Kicinski feedback)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6: annotate data-races around cnf.disable_ipv6
Eric Dumazet [Wed, 28 Feb 2024 13:54:26 +0000 (13:54 +0000)]
ipv6: annotate data-races around cnf.disable_ipv6

disable_ipv6 is read locklessly, add appropriate READ_ONCE()
and WRITE_ONCE() annotations.

v2: do not preload net before rtnl_trylock() in
    addrconf_disable_ipv6() (Jiri)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoipv6: add ipv6_devconf_read_txrx cacheline_group
Eric Dumazet [Wed, 28 Feb 2024 13:54:25 +0000 (13:54 +0000)]
ipv6: add ipv6_devconf_read_txrx cacheline_group

IPv6 TX and RX fast path use the following fields:

- disable_ipv6
- hop_limit
- mtu6
- forwarding
- disable_policy
- proxy_ndp

Place them in a group to increase data locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 29 Feb 2024 22:17:54 +0000 (14:17 -0800)]
Merge git://git./linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

Conflicts:

net/mptcp/protocol.c
  adf1bb78dab5 ("mptcp: fix snd_wnd initialization for passive socket")
  9426ce476a70 ("mptcp: annotate lockless access for RX path fields")
https://lore.kernel.org/all/20240228103048.19255709@canb.auug.org.au/

Adjacent changes:

drivers/dpll/dpll_core.c
  0d60d8df6f49 ("dpll: rely on rcu for netdev_dpll_pin()")
  e7f8df0e81bf ("dpll: move xa_erase() call in to match dpll_pin_alloc() error path order")

drivers/net/veth.c
  1ce7d306ea63 ("veth: try harder when allocating queue memory")
  0bef512012b1 ("net: add netdev_lockdep_set_classes() to virtual drivers")

drivers/net/wireless/intel/iwlwifi/mvm/d3.c
  8c9bef26e98b ("wifi: iwlwifi: mvm: d3: implement suspend with MLO")
  78f65fbf421a ("wifi: iwlwifi: mvm: ensure offloading TID queue exists")

net/wireless/nl80211.c
  f78c1375339a ("wifi: nl80211: reject iftype change with mesh ID change")
  414532d8aa89 ("wifi: cfg80211: use IEEE80211_MAX_MESH_ID_LEN appropriately")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoMerge tag 'net-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 29 Feb 2024 20:40:20 +0000 (12:40 -0800)]
Merge tag 'net-6.8-rc7' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth, WiFi and netfilter.

  We have one outstanding issue with the stmmac driver, which may be a
  LOCKDEP false positive, not a blocker.

  Current release - regressions:

   - netfilter: nf_tables: re-allow NFPROTO_INET in
     nft_(match/target)_validate()

   - eth: ionic: fix error handling in PCI reset code

  Current release - new code bugs:

   - eth: stmmac: complete meta data only when enabled, fix null-deref

   - kunit: fix again checksum tests on big endian CPUs

  Previous releases - regressions:

   - veth: try harder when allocating queue memory

   - Bluetooth:
      - hci_bcm4377: do not mark valid bd_addr as invalid
      - hci_event: fix handling of HCI_EV_IO_CAPA_REQUEST

  Previous releases - always broken:

   - info leak in __skb_datagram_iter() on netlink socket

   - mptcp:
      - map v4 address to v6 when destroying subflow
      - fix potential wake-up event loss due to sndbuf auto-tuning
      - fix double-free on socket dismantle

   - wifi: nl80211: reject iftype change with mesh ID change

   - fix small out-of-bound read when validating netlink be16/32 types

   - rtnetlink: fix error logic of IFLA_BRIDGE_FLAGS writing back

   - ipv6: fix potential "struct net" ref-leak in inet6_rtm_getaddr()

   - ip_tunnel: prevent perpetual headroom growth with huge number of
     tunnels on top of each other

   - mctp: fix skb leaks on error paths of mctp_local_output()

   - eth: ice: fixes for DPLL state reporting

   - dpll: rely on rcu for netdev_dpll_pin() to prevent UaF

   - eth: dpaa: accept phy-interface-type = '10gbase-r' in the device
     tree"

* tag 'net-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (73 commits)
  dpll: fix build failure due to rcu_dereference_check() on unknown type
  kunit: Fix again checksum tests on big endian CPUs
  tls: fix use-after-free on failed backlog decryption
  tls: separate no-async decryption request handling from async
  tls: fix peeking with sync+async decryption
  tls: decrement decrypt_pending if no async completion will be called
  gtp: fix use-after-free and null-ptr-deref in gtp_newlink()
  net: hsr: Use correct offset for HSR TLV values in supervisory HSR frames
  igb: extend PTP timestamp adjustments to i211
  rtnetlink: fix error logic of IFLA_BRIDGE_FLAGS writing back
  tools: ynl: fix handling of multiple mcast groups
  selftests: netfilter: add bridge conntrack + multicast test case
  netfilter: bridge: confirm multicast packets before passing them up the stack
  netfilter: nf_tables: allow NFPROTO_INET in nft_(match/target)_validate()
  Bluetooth: qca: Fix triggering coredump implementation
  Bluetooth: hci_qca: Set BDA quirk bit if fwnode exists in DT
  Bluetooth: qca: Fix wrong event type for patch config command
  Bluetooth: Enforce validation on max value of connection interval
  Bluetooth: hci_event: Fix handling of HCI_EV_IO_CAPA_REQUEST
  Bluetooth: mgmt: Fix limited discoverable off timeout
  ...

13 months agoMerge tag 'landlock-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic...
Linus Torvalds [Thu, 29 Feb 2024 20:29:23 +0000 (12:29 -0800)]
Merge tag 'landlock-6.8-rc7' of git://git./linux/kernel/git/mic/linux

Pull Landlock fix from Mickaël Salaün:
 "Fix a potential issue when handling inodes with inconsistent
  properties"

* tag 'landlock-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
  landlock: Fix asymmetric private inodes referring

13 months agodpll: fix build failure due to rcu_dereference_check() on unknown type
Eric Dumazet [Thu, 29 Feb 2024 19:05:15 +0000 (11:05 -0800)]
dpll: fix build failure due to rcu_dereference_check() on unknown type

Tasmiya reports that their compiler complains that we deref
a pointer to unknown type with rcu_dereference_rtnl():

include/linux/rcupdate.h:439:9: error: dereferencing pointer to incomplete type ‘struct dpll_pin’

Unclear what compiler it is, at the moment, and we can't report
but since DPLL can't be a module - move the code from the header
into the source file.

Fixes: 0d60d8df6f49 ("dpll: rely on rcu for netdev_dpll_pin()")
Reported-by: Tasmiya Nalatwad <tasmiya@linux.vnet.ibm.com>
Link: https://lore.kernel.org/all/3fcf3a2c-1c1b-42c1-bacb-78fdcd700389@linux.vnet.ibm.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240229190515.2740221-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agokunit: Fix again checksum tests on big endian CPUs
Christophe Leroy [Fri, 23 Feb 2024 10:41:52 +0000 (11:41 +0100)]
kunit: Fix again checksum tests on big endian CPUs

Commit b38460bc463c ("kunit: Fix checksum tests on big endian CPUs")
fixed endianness issues with kunit checksum tests, but then
commit 6f4c45cbcb00 ("kunit: Add tests for csum_ipv6_magic and
ip_fast_csum") introduced new issues on big endian CPUs. Those issues
are once again reflected by the warnings reported by sparse.

So, fix them with the same approach, perform proper conversion in
order to support both little and big endian CPUs. Once the conversions
are properly done and the right types used, the sparse warnings are
cleared as well.

Reported-by: Erhard Furtner <erhard_f@mailbox.org>
Fixes: 6f4c45cbcb00 ("kunit: Add tests for csum_ipv6_magic and ip_fast_csum")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/73df3a9e95c2179119398ad1b4c84cdacbd8dfb6.1708684443.git.christophe.leroy@csgroup.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoMerge tag 'for-net-2024-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/bluet...
Jakub Kicinski [Thu, 29 Feb 2024 17:10:24 +0000 (09:10 -0800)]
Merge tag 'for-net-2024-02-28' of git://git./linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - mgmt: Fix limited discoverable off timeout
 - hci_qca: Set BDA quirk bit if fwnode exists in DT
 - hci_bcm4377: do not mark valid bd_addr as invalid
 - hci_sync: Check the correct flag before starting a scan
 - Enforce validation on max value of connection interval
 - hci_sync: Fix accept_list when attempting to suspend
 - hci_event: Fix handling of HCI_EV_IO_CAPA_REQUEST
 - Avoid potential use-after-free in hci_error_reset
 - rfcomm: Fix null-ptr-deref in rfcomm_check_security
 - hci_event: Fix wrongly recorded wakeup BD_ADDR
 - qca: Fix wrong event type for patch config command
 - qca: Fix triggering coredump implementation

* tag 'for-net-2024-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: qca: Fix triggering coredump implementation
  Bluetooth: hci_qca: Set BDA quirk bit if fwnode exists in DT
  Bluetooth: qca: Fix wrong event type for patch config command
  Bluetooth: Enforce validation on max value of connection interval
  Bluetooth: hci_event: Fix handling of HCI_EV_IO_CAPA_REQUEST
  Bluetooth: mgmt: Fix limited discoverable off timeout
  Bluetooth: hci_event: Fix wrongly recorded wakeup BD_ADDR
  Bluetooth: rfcomm: Fix null-ptr-deref in rfcomm_check_security
  Bluetooth: hci_sync: Fix accept_list when attempting to suspend
  Bluetooth: Avoid potential use-after-free in hci_error_reset
  Bluetooth: hci_sync: Check the correct flag before starting a scan
  Bluetooth: hci_bcm4377: do not mark valid bd_addr as invalid
====================

Link: https://lore.kernel.org/r/20240228145644.2269088-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoMerge branch 'tls-a-few-more-fixes-for-async-decrypt'
Jakub Kicinski [Thu, 29 Feb 2024 17:07:18 +0000 (09:07 -0800)]
Merge branch 'tls-a-few-more-fixes-for-async-decrypt'

Sabrina Dubroca says:

====================
tls: a few more fixes for async decrypt

The previous patchset [1] took care of "full async". This adds a few
fixes for cases where only part of the crypto operations go the async
route, found by extending my previous debug patch [2] to do N
synchronous operations followed by M asynchronous ops (with N and M
configurable).

[1] https://patchwork.kernel.org/project/netdevbpf/list/?series=823784&state=*
[2] https://lore.kernel.org/all/9d664093b1bf7f47497b2c40b3a085b45f3274a2.1694021240.git.sd@queasysnail.net/
====================

Link: https://lore.kernel.org/r/cover.1709132643.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotls: fix use-after-free on failed backlog decryption
Sabrina Dubroca [Wed, 28 Feb 2024 22:44:00 +0000 (23:44 +0100)]
tls: fix use-after-free on failed backlog decryption

When the decrypt request goes to the backlog and crypto_aead_decrypt
returns -EBUSY, tls_do_decryption will wait until all async
decryptions have completed. If one of them fails, tls_do_decryption
will return -EBADMSG and tls_decrypt_sg jumps to the error path,
releasing all the pages. But the pages have been passed to the async
callback, and have already been released by tls_decrypt_done.

The only true async case is when crypto_aead_decrypt returns
 -EINPROGRESS. With -EBUSY, we already waited so we can tell
tls_sw_recvmsg that the data is available for immediate copy, but we
need to notify tls_decrypt_sg (via the new ->async_done flag) that the
memory has already been released.

Fixes: 859054147318 ("net: tls: handle backlogging of crypto requests")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/4755dd8d9bebdefaa19ce1439b833d6199d4364c.1709132643.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotls: separate no-async decryption request handling from async
Sabrina Dubroca [Wed, 28 Feb 2024 22:43:59 +0000 (23:43 +0100)]
tls: separate no-async decryption request handling from async

If we're not doing async, the handling is much simpler. There's no
reference counting, we just need to wait for the completion to wake us
up and return its result.

We should preferably also use a separate crypto_wait. I'm not seeing a
UAF as I did in the past, I think aec7961916f3 ("tls: fix race between
async notify and socket close") took care of it.

This will make the next fix easier.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/47bde5f649707610eaef9f0d679519966fc31061.1709132643.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotls: fix peeking with sync+async decryption
Sabrina Dubroca [Wed, 28 Feb 2024 22:43:58 +0000 (23:43 +0100)]
tls: fix peeking with sync+async decryption

If we peek from 2 records with a currently empty rx_list, and the
first record is decrypted synchronously but the second record is
decrypted async, the following happens:
  1. decrypt record 1 (sync)
  2. copy from record 1 to the userspace's msg
  3. queue the decrypted record to rx_list for future read(!PEEK)
  4. decrypt record 2 (async)
  5. queue record 2 to rx_list
  6. call process_rx_list to copy data from the 2nd record

We currently pass copied=0 as skip offset to process_rx_list, so we
end up copying once again from the first record. We should skip over
the data we've already copied.

Seen with selftest tls.12_aes_gcm.recv_peek_large_buf_mult_recs

Fixes: 692d7b5d1f91 ("tls: Fix recvmsg() to be able to peek across multiple records")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/1b132d2b2b99296bfde54e8a67672d90d6d16e71.1709132643.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotls: decrement decrypt_pending if no async completion will be called
Sabrina Dubroca [Wed, 28 Feb 2024 22:43:57 +0000 (23:43 +0100)]
tls: decrement decrypt_pending if no async completion will be called

With mixed sync/async decryption, or failures of crypto_aead_decrypt,
we increment decrypt_pending but we never do the corresponding
decrement since tls_decrypt_done will not be called. In this case, we
should decrement decrypt_pending immediately to avoid getting stuck.

For example, the prequeue prequeue test gets stuck with mixed
modes (one async decrypt + one sync decrypt).

Fixes: 94524d8fc965 ("net/tls: Add support for async decryption of tls records")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/c56d5fc35543891d5319f834f25622360e1bfbec.1709132643.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agogtp: fix use-after-free and null-ptr-deref in gtp_newlink()
Alexander Ofitserov [Wed, 28 Feb 2024 11:47:03 +0000 (14:47 +0300)]
gtp: fix use-after-free and null-ptr-deref in gtp_newlink()

The gtp_link_ops operations structure for the subsystem must be
registered after registering the gtp_net_ops pernet operations structure.

Syzkaller hit 'general protection fault in gtp_genl_dump_pdp' bug:

[ 1010.702740] gtp: GTP module unloaded
[ 1010.715877] general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI
[ 1010.715888] KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
[ 1010.715895] CPU: 1 PID: 128616 Comm: a.out Not tainted 6.8.0-rc6-std-def-alt1 #1
[ 1010.715899] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-alt1 04/01/2014
[ 1010.715908] RIP: 0010:gtp_newlink+0x4d7/0x9c0 [gtp]
[ 1010.715915] Code: 80 3c 02 00 0f 85 41 04 00 00 48 8b bb d8 05 00 00 e8 ed f6 ff ff 48 89 c2 48 89 c5 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 4f 04 00 00 4c 89 e2 4c 8b 6d 00 48 b8 00 00 00
[ 1010.715920] RSP: 0018:ffff888020fbf180 EFLAGS: 00010203
[ 1010.715929] RAX: dffffc0000000000 RBX: ffff88800399c000 RCX: 0000000000000000
[ 1010.715933] RDX: 0000000000000001 RSI: ffffffff84805280 RDI: 0000000000000282
[ 1010.715938] RBP: 000000000000000d R08: 0000000000000001 R09: 0000000000000000
[ 1010.715942] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88800399cc80
[ 1010.715947] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000400
[ 1010.715953] FS:  00007fd1509ab5c0(0000) GS:ffff88805b300000(0000) knlGS:0000000000000000
[ 1010.715958] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1010.715962] CR2: 0000000000000000 CR3: 000000001c07a000 CR4: 0000000000750ee0
[ 1010.715968] PKRU: 55555554
[ 1010.715972] Call Trace:
[ 1010.715985]  ? __die_body.cold+0x1a/0x1f
[ 1010.715995]  ? die_addr+0x43/0x70
[ 1010.716002]  ? exc_general_protection+0x199/0x2f0
[ 1010.716016]  ? asm_exc_general_protection+0x1e/0x30
[ 1010.716026]  ? gtp_newlink+0x4d7/0x9c0 [gtp]
[ 1010.716034]  ? gtp_net_exit+0x150/0x150 [gtp]
[ 1010.716042]  __rtnl_newlink+0x1063/0x1700
[ 1010.716051]  ? rtnl_setlink+0x3c0/0x3c0
[ 1010.716063]  ? is_bpf_text_address+0xc0/0x1f0
[ 1010.716070]  ? kernel_text_address.part.0+0xbb/0xd0
[ 1010.716076]  ? __kernel_text_address+0x56/0xa0
[ 1010.716084]  ? unwind_get_return_address+0x5a/0xa0
[ 1010.716091]  ? create_prof_cpu_mask+0x30/0x30
[ 1010.716098]  ? arch_stack_walk+0x9e/0xf0
[ 1010.716106]  ? stack_trace_save+0x91/0xd0
[ 1010.716113]  ? stack_trace_consume_entry+0x170/0x170
[ 1010.716121]  ? __lock_acquire+0x15c5/0x5380
[ 1010.716139]  ? mark_held_locks+0x9e/0xe0
[ 1010.716148]  ? kmem_cache_alloc_trace+0x35f/0x3c0
[ 1010.716155]  ? __rtnl_newlink+0x1700/0x1700
[ 1010.716160]  rtnl_newlink+0x69/0xa0
[ 1010.716166]  rtnetlink_rcv_msg+0x43b/0xc50
[ 1010.716172]  ? rtnl_fdb_dump+0x9f0/0x9f0
[ 1010.716179]  ? lock_acquire+0x1fe/0x560
[ 1010.716188]  ? netlink_deliver_tap+0x12f/0xd50
[ 1010.716196]  netlink_rcv_skb+0x14d/0x440
[ 1010.716202]  ? rtnl_fdb_dump+0x9f0/0x9f0
[ 1010.716208]  ? netlink_ack+0xab0/0xab0
[ 1010.716213]  ? netlink_deliver_tap+0x202/0xd50
[ 1010.716220]  ? netlink_deliver_tap+0x218/0xd50
[ 1010.716226]  ? __virt_addr_valid+0x30b/0x590
[ 1010.716233]  netlink_unicast+0x54b/0x800
[ 1010.716240]  ? netlink_attachskb+0x870/0x870
[ 1010.716248]  ? __check_object_size+0x2de/0x3b0
[ 1010.716254]  netlink_sendmsg+0x938/0xe40
[ 1010.716261]  ? netlink_unicast+0x800/0x800
[ 1010.716269]  ? __import_iovec+0x292/0x510
[ 1010.716276]  ? netlink_unicast+0x800/0x800
[ 1010.716284]  __sock_sendmsg+0x159/0x190
[ 1010.716290]  ____sys_sendmsg+0x712/0x880
[ 1010.716297]  ? sock_write_iter+0x3d0/0x3d0
[ 1010.716304]  ? __ia32_sys_recvmmsg+0x270/0x270
[ 1010.716309]  ? lock_acquire+0x1fe/0x560
[ 1010.716315]  ? drain_array_locked+0x90/0x90
[ 1010.716324]  ___sys_sendmsg+0xf8/0x170
[ 1010.716331]  ? sendmsg_copy_msghdr+0x170/0x170
[ 1010.716337]  ? lockdep_init_map_type+0x2c7/0x860
[ 1010.716343]  ? lockdep_hardirqs_on_prepare+0x430/0x430
[ 1010.716350]  ? debug_mutex_init+0x33/0x70
[ 1010.716360]  ? percpu_counter_add_batch+0x8b/0x140
[ 1010.716367]  ? lock_acquire+0x1fe/0x560
[ 1010.716373]  ? find_held_lock+0x2c/0x110
[ 1010.716384]  ? __fd_install+0x1b6/0x6f0
[ 1010.716389]  ? lock_downgrade+0x810/0x810
[ 1010.716396]  ? __fget_light+0x222/0x290
[ 1010.716403]  __sys_sendmsg+0xea/0x1b0
[ 1010.716409]  ? __sys_sendmsg_sock+0x40/0x40
[ 1010.716419]  ? lockdep_hardirqs_on_prepare+0x2b3/0x430
[ 1010.716425]  ? syscall_enter_from_user_mode+0x1d/0x60
[ 1010.716432]  do_syscall_64+0x30/0x40
[ 1010.716438]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
[ 1010.716444] RIP: 0033:0x7fd1508cbd49
[ 1010.716452] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ef 70 0d 00 f7 d8 64 89 01 48
[ 1010.716456] RSP: 002b:00007fff18872348 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[ 1010.716463] RAX: ffffffffffffffda RBX: 000055f72bf0eac0 RCX: 00007fd1508cbd49
[ 1010.716468] RDX: 0000000000000000 RSI: 0000000020000280 RDI: 0000000000000006
[ 1010.716473] RBP: 00007fff18872360 R08: 00007fff18872360 R09: 00007fff18872360
[ 1010.716478] R10: 00007fff18872360 R11: 0000000000000202 R12: 000055f72bf0e1b0
[ 1010.716482] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1010.716491] Modules linked in: gtp(+) udp_tunnel ib_core uinput af_packet rfkill qrtr joydev hid_generic usbhid hid kvm_intel iTCO_wdt intel_pmc_bxt iTCO_vendor_support kvm snd_hda_codec_generic ledtrig_audio irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel snd_hda_intel nls_utf8 snd_intel_dspcfg nls_cp866 psmouse aesni_intel vfat crypto_simd fat cryptd glue_helper snd_hda_codec pcspkr snd_hda_core i2c_i801 snd_hwdep i2c_smbus xhci_pci snd_pcm lpc_ich xhci_pci_renesas xhci_hcd qemu_fw_cfg tiny_power_button button sch_fq_codel vboxvideo drm_vram_helper drm_ttm_helper ttm vboxsf vboxguest snd_seq_midi snd_seq_midi_event snd_seq snd_rawmidi snd_seq_device snd_timer snd soundcore msr fuse efi_pstore dm_mod ip_tables x_tables autofs4 virtio_gpu virtio_dma_buf drm_kms_helper cec rc_core drm virtio_rng virtio_scsi rng_core virtio_balloon virtio_blk virtio_net virtio_console net_failover failover ahci libahci libata evdev scsi_mod input_leds serio_raw virtio_pci intel_agp
[ 1010.716674]  virtio_ring intel_gtt virtio [last unloaded: gtp]
[ 1010.716693] ---[ end trace 04990a4ce61e174b ]---

Cc: stable@vger.kernel.org
Signed-off-by: Alexander Ofitserov <oficerovas@altlinux.org>
Fixes: 459aa660eb1d ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240228114703.465107-1-oficerovas@altlinux.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agoMerge branch 'net-collect-tstats-automatically'
Paolo Abeni [Thu, 29 Feb 2024 11:50:15 +0000 (12:50 +0100)]
Merge branch 'net-collect-tstats-automatically'

Breno Leitao says:

====================
net: collect tstats automatically

The commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf") added a field in struct_netdevice, which tells what
type of statistics the driver supports.

That field is used primarily to allocate stats structures automatically,
but, it also could leveraged to simplify the drivers even further, such
as, if the driver relies in the default stats collection, then it
doesn't need to assign to .ndo_get_stats64. That means that drivers only
assign functions to .ndo_get_stats64 if they are using something
special.

I started to move some of these drivers[1][2][3] to use the core
allocation, and with this change in, I just need to touch the driver
once, and be able to simplify the whole stats allocation and collection
for generic case.

There are 44 devices today that could benefit from this simplification.

# grep -r .ndo_get_stats64 | grep dev_get_tstats64 | wc -l
44

As of today, netnext only has the `sit` driver fully ported to core
stats allocation, hence the second patch.

Links:
[1] https://lore.kernel.org/all/20240227182338.2739884-1-leitao@debian.org/
[2] https://lore.kernel.org/all/20240222144117.1370101-1-leitao@debian.org/
[3] https://lore.kernel.org/all/20240223115839.3572852-1-leitao@debian.org/
====================

Link: https://lore.kernel.org/r/20240228113125.3473685-1-leitao@debian.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agonet: sit: Do not set .ndo_get_stats64
Breno Leitao [Wed, 28 Feb 2024 11:31:22 +0000 (03:31 -0800)]
net: sit: Do not set .ndo_get_stats64

If the driver is using the network core allocation mechanism, by setting
NETDEV_PCPU_STAT_TSTATS, as this driver is, then, it doesn't need to set
the dev_get_tstats64() generic .ndo_get_stats64 function pointer. Since
the network core calls it automatically, and .ndo_get_stats64 should
only be set if the driver needs special treatment.

This simplifies the driver, since all the generic statistics is now
handled by core.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agonet: get stats64 if device if driver is configured
Breno Leitao [Wed, 28 Feb 2024 11:31:21 +0000 (03:31 -0800)]
net: get stats64 if device if driver is configured

If the network driver is relying in the net core to do stats allocation,
then we want to dev_get_tstats64() instead of netdev_stats_to_stats64(),
since there are per-cpu stats that needs to be taken in consideration.

This will also simplify the drivers in regard to statistics. Once the
driver sets NETDEV_PCPU_STAT_TSTATS, it doesn't not need to allocate the
stacks, neither it needs to set `.ndo_get_stats64 = dev_get_tstats64`
for the generic stats collection function anymore.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agonet: stmmac: fix typo in comment
Yanteng Si [Wed, 28 Feb 2024 11:24:47 +0000 (19:24 +0800)]
net: stmmac: fix typo in comment

This is just a trivial fix for a typo in a comment, no functional
changes.

Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Link: https://lore.kernel.org/r/20240228112447.1490926-1-siyanteng@loongson.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agoMerge tag 'nf-24-02-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Paolo Abeni [Thu, 29 Feb 2024 11:16:07 +0000 (12:16 +0100)]
Merge tag 'nf-24-02-29' of git://git./linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

Patch #1 restores NFPROTO_INET with nft_compat, from Ignat Korchagin.

Patch #2 fixes an issue with bridge netfilter and broadcast/multicast
packets.

There is a day 0 bug in br_netfilter when used with connection tracking.

Conntrack assumes that an nf_conn structure that is not yet added to
hash table ("unconfirmed"), is only visible by the current cpu that is
processing the sk_buff.

For bridge this isn't true, sk_buff can get cloned in between, and
clones can be processed in parallel on different cpu.

This patch disables NAT and conntrack helpers for multicast packets.

Patch #3 adds a selftest to cover for the br_netfilter bug.

netfilter pull request 24-02-29

* tag 'nf-24-02-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  selftests: netfilter: add bridge conntrack + multicast test case
  netfilter: bridge: confirm multicast packets before passing them up the stack
  netfilter: nf_tables: allow NFPROTO_INET in nft_(match/target)_validate()
====================

Link: https://lore.kernel.org/r/20240229000135.8780-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agonet: hsr: Use correct offset for HSR TLV values in supervisory HSR frames
Lukasz Majewski [Wed, 28 Feb 2024 08:56:44 +0000 (09:56 +0100)]
net: hsr: Use correct offset for HSR TLV values in supervisory HSR frames

Current HSR implementation uses following supervisory frame (even for
HSRv1 the HSR tag is not is not present):

00000000: 01 15 4e 00 01 2d XX YY ZZ 94 77 10 88 fb 00 01
00000010: 7e 1c 17 06 XX YY ZZ 94 77 10 1e 06 XX YY ZZ 94
00000020: 77 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 00 00 00 00 00 00

The current code adds extra two bytes (i.e. sizeof(struct hsr_sup_tlv))
when offset for skb_pull() is calculated.
This is wrong, as both 'struct hsrv1_ethhdr_sp' and 'hsrv0_ethhdr_sp'
already have 'struct hsr_sup_tag' defined in them, so there is no need
for adding extra two bytes.

This code was working correctly as with no RedBox support, the check for
HSR_TLV_EOT (0x00) was off by two bytes, which were corresponding to
zeroed padded bytes for minimal packet size.

Fixes: eafaa88b3eb7 ("net: hsr: Add support for redbox supervision frames")
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240228085644.3618044-1-lukma@denx.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agoipv4: raw: remove useless input parameter in do_raw_set/getsockopt
Zhengchao Shao [Wed, 28 Feb 2024 07:25:05 +0000 (15:25 +0800)]
ipv4: raw: remove useless input parameter in do_raw_set/getsockopt

The input parameter 'level' in do_raw_set/getsockopt() is not used.
Therefore, remove it.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240228072505.640550-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agoMerge branch 'net-dsa-mv88e6xxx-add-amethyst-specific-smi-gpio-function'
Paolo Abeni [Thu, 29 Feb 2024 09:16:42 +0000 (10:16 +0100)]
Merge branch 'net-dsa-mv88e6xxx-add-amethyst-specific-smi-gpio-function'

Robert Marko says:

====================
net: dsa: mv88e6xxx: add Amethyst specific SMI GPIO function

Amethyst family (MV88E6191X/6193X/6393X) has a simplified SMI GPIO setting
via the Scratch and Misc register so it requires family specific function.

In the v1 review, Andrew pointed out that it would make sense to rename the
existing mv88e6xxx_g2_scratch_gpio_set_smi as it only works on the MV6390
family.

Changes in v2:
* Add rename of mv88e6xxx_g2_scratch_gpio_set_smi to
mv88e6390_g2_scratch_gpio_set_smi
====================

Link: https://lore.kernel.org/r/20240227175457.2766628-1-robimarko@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agonet: dsa: mv88e6xxx: add Amethyst specific SMI GPIO function
Robert Marko [Tue, 27 Feb 2024 17:54:22 +0000 (18:54 +0100)]
net: dsa: mv88e6xxx: add Amethyst specific SMI GPIO function

The existing mv88e6390_g2_scratch_gpio_set_smi() cannot be used on the
88E6393X as it requires certain P0_MODE, it also checks the CPU mode
as it impacts the bit setting value.

This is all irrelevant for Amethyst (MV88E6191X/6193X/6393X) as only
the default value of the SMI_PHY Config bit is set to CPU_MGD bootstrap
pin value but it can be changed without restrictions so that GPIO pins
9 and 10 are used as SMI pins.

So, introduce Amethyst specific function and call that if the Amethyst
family wants to setup the external PHY.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Robert Marko <robimarko@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agonet: dsa: mv88e6xxx: rename mv88e6xxx_g2_scratch_gpio_set_smi
Robert Marko [Tue, 27 Feb 2024 17:54:21 +0000 (18:54 +0100)]
net: dsa: mv88e6xxx: rename mv88e6xxx_g2_scratch_gpio_set_smi

The name mv88e6xxx_g2_scratch_gpio_set_smi is a bit ambiguous as it appears
to only be applicable to the 6390 family, so lets rename it to
mv88e6390_g2_scratch_gpio_set_smi to make it more obvious.

Signed-off-by: Robert Marko <robimarko@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
13 months agoinet6: expand rcu_read_lock() scope in inet6_dump_addr()
Eric Dumazet [Tue, 27 Feb 2024 22:22:59 +0000 (22:22 +0000)]
inet6: expand rcu_read_lock() scope in inet6_dump_addr()

I missed that inet6_dump_addr() is calling in6_dump_addrs()
from two points.

First one under RTNL protection, and second one under rcu_read_lock().

Since we want to remove RTNL use from inet6_dump_addr() very soon,
no longer assume in6_dump_addrs() is protected by RTNL (even
if this is still the case).

Use rcu_read_lock() earlier to fix this lockdep splat:

WARNING: suspicious RCU usage
6.8.0-rc5-syzkaller-01618-gf8cbf6bde4c8 #0 Not tainted

net/ipv6/addrconf.c:5317 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
3 locks held by syz-executor.2/8834:
  #0: ffff88802f554678 (nlk_cb_mutex-ROUTE){+.+.}-{3:3}, at: __netlink_dump_start+0x119/0x780 net/netlink/af_netlink.c:2338
  #1: ffffffff8f377a88 (rtnl_mutex){+.+.}-{3:3}, at: netlink_dump+0x676/0xda0 net/netlink/af_netlink.c:2265
  #2: ffff88807e5f0580 (&ndev->lock){++--}-{2:2}, at: in6_dump_addrs+0xb8/0x1de0 net/ipv6/addrconf.c:5279

stack backtrace:
CPU: 1 PID: 8834 Comm: syz-executor.2 Not tainted 6.8.0-rc5-syzkaller-01618-gf8cbf6bde4c8 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
Call Trace:
 <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
  lockdep_rcu_suspicious+0x220/0x340 kernel/locking/lockdep.c:6712
  in6_dump_addrs+0x1b47/0x1de0 net/ipv6/addrconf.c:5317
  inet6_dump_addr+0x1597/0x1690 net/ipv6/addrconf.c:5428
  netlink_dump+0x6a6/0xda0 net/netlink/af_netlink.c:2266
  __netlink_dump_start+0x59d/0x780 net/netlink/af_netlink.c:2374
  netlink_dump_start include/linux/netlink.h:340 [inline]
  rtnetlink_rcv_msg+0xcf7/0x10d0 net/core/rtnetlink.c:6555
  netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2547
  netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
  netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
  netlink_sendmsg+0x8e0/0xcb0 net/netlink/af_netlink.c:1902
  sock_sendmsg_nosec net/socket.c:730 [inline]
  __sock_sendmsg+0x221/0x270 net/socket.c:745
  ____sys_sendmsg+0x525/0x7d0 net/socket.c:2584
  ___sys_sendmsg net/socket.c:2638 [inline]
  __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667

Fixes: c3718936ec47 ("ipv6: anycast: complete RCU handling of struct ifacaddr6")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227222259.4081489-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: call skb_defer_free_flush() from __napi_busy_loop()
Eric Dumazet [Tue, 27 Feb 2024 21:01:04 +0000 (21:01 +0000)]
net: call skb_defer_free_flush() from __napi_busy_loop()

skb_defer_free_flush() is currently called from net_rx_action()
and napi_threaded_poll().

We should also call it from __napi_busy_loop() otherwise
there is the risk the percpu queue can grow until an IPI
is forced from skb_attempt_defer_free() adding a latency spike.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227210105.3815474-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotcp: remove some holes in struct tcp_sock
Eric Dumazet [Tue, 27 Feb 2024 19:27:21 +0000 (19:27 +0000)]
tcp: remove some holes in struct tcp_sock

By moving some fields around, this patch shrinks
holes size from 56 to 32, saving 24 bytes on 64bit arches.

After the patch pahole gives the following for 'struct tcp_sock':

/* size: 2304, cachelines: 36, members: 162 */
/* sum members: 2234, holes: 6, sum holes: 32 */
/* sum bitfield members: 34 bits, bit holes: 5, sum bit holes: 14 bits */
/* padding: 32 */
/* paddings: 3, sum paddings: 10 */
/* forced alignments: 1, forced holes: 1, sum forced holes: 12 */

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227192721.3558982-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoigb: extend PTP timestamp adjustments to i211
Oleksij Rempel [Tue, 27 Feb 2024 18:49:41 +0000 (10:49 -0800)]
igb: extend PTP timestamp adjustments to i211

The i211 requires the same PTP timestamp adjustments as the i210,
according to its datasheet. To ensure consistent timestamping across
different platforms, this change extends the existing adjustments to
include the i211.

The adjustment result are tested and comparable for i210 and i211 based
systems.

Fixes: 3f544d2a4d5c ("igb: adjust PTP timestamps for Tx/Rx latency")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20240227184942.362710-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: bridge: Exit if multicast_init_stats fails
Breno Leitao [Tue, 27 Feb 2024 18:23:37 +0000 (10:23 -0800)]
net: bridge: Exit if multicast_init_stats fails

If br_multicast_init_stats() fails, there is no need to set lockdep
classes. Just return from the error path.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20240227182338.2739884-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: bridge: Do not allocate stats in the driver
Breno Leitao [Tue, 27 Feb 2024 18:23:36 +0000 (10:23 -0800)]
net: bridge: Do not allocate stats in the driver

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the bridge driver and leverage the network
core allocation.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20240227182338.2739884-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoselftests: vxlan_mdb: Avoid duplicate test names
Ido Schimmel [Tue, 27 Feb 2024 17:04:18 +0000 (19:04 +0200)]
selftests: vxlan_mdb: Avoid duplicate test names

Rename some test cases to avoid overlapping test names which is
problematic for the kernel test robot. No changes in the test's logic.

Suggested-by: Yujie Liu <yujie.liu@intel.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20240227170418.491442-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agortnetlink: fix error logic of IFLA_BRIDGE_FLAGS writing back
Lin Ma [Tue, 27 Feb 2024 12:11:28 +0000 (20:11 +0800)]
rtnetlink: fix error logic of IFLA_BRIDGE_FLAGS writing back

In the commit d73ef2d69c0d ("rtnetlink: let rtnl_bridge_setlink checks
IFLA_BRIDGE_MODE length"), an adjustment was made to the old loop logic
in the function `rtnl_bridge_setlink` to enable the loop to also check
the length of the IFLA_BRIDGE_MODE attribute. However, this adjustment
removed the `break` statement and led to an error logic of the flags
writing back at the end of this function.

if (have_flags)
    memcpy(nla_data(attr), &flags, sizeof(flags));
    // attr should point to IFLA_BRIDGE_FLAGS NLA !!!

Before the mentioned commit, the `attr` is granted to be IFLA_BRIDGE_FLAGS.
However, this is not necessarily true fow now as the updated loop will let
the attr point to the last NLA, even an invalid NLA which could cause
overflow writes.

This patch introduces a new variable `br_flag` to save the NLA pointer
that points to IFLA_BRIDGE_FLAGS and uses it to resolve the mentioned
error logic.

Fixes: d73ef2d69c0d ("rtnetlink: let rtnl_bridge_setlink checks IFLA_BRIDGE_MODE length")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20240227121128.608110-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonetlabel: remove impossible return value in netlbl_bitmap_walk
Zhengchao Shao [Tue, 27 Feb 2024 09:36:04 +0000 (17:36 +0800)]
netlabel: remove impossible return value in netlbl_bitmap_walk

Since commit 446fda4f2682 ("[NetLabel]: CIPSOv4 engine"), *bitmap_walk
function only returns -1. Nearly 18 years have passed, -2 scenes never
come up, so there's no need to consider it.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/20240227093604.3574241-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoMerge branch 'inet-implement-lockless-rtm_getnetconf-ops'
Jakub Kicinski [Thu, 29 Feb 2024 03:36:41 +0000 (19:36 -0800)]
Merge branch 'inet-implement-lockless-rtm_getnetconf-ops'

Eric Dumazet says:

====================
inet: implement lockless RTM_GETNETCONF ops

This series removes RTNL use for RTM_GETNETCONF operations on AF_INET.

- Annotate data-races to avoid possible KCSAN splats.

- "ip -4 netconf show dev XXX" can be implemented without RTNL [1]

- "ip -4 netconf" dumps can be implemented using RCU instead of RTNL [1]

[1] This only refers to RTM_GETNETCONF operation, "ip" command
    also uses RTM_GETLINK dumps which are using RTNL at this moment.
====================

Link: https://lore.kernel.org/r/20240227092411.2315725-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoinet: use xa_array iterator to implement inet_netconf_dump_devconf()
Eric Dumazet [Tue, 27 Feb 2024 09:24:11 +0000 (09:24 +0000)]
inet: use xa_array iterator to implement inet_netconf_dump_devconf()

1) inet_netconf_dump_devconf() can run under RCU protection
   instead of RTNL.

2) properly return 0 at the end of a dump, avoiding an
   an extra recvmsg() system call.

3) Do not use inet_base_seq() anymore, for_each_netdev_dump()
   has nice properties. Restarting a GETDEVCONF dump if a device has
   been added/removed or if net->ipv4.dev_addr_genid has changed is moot.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227092411.2315725-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoinet: do not use RTNL in inet_netconf_get_devconf()
Eric Dumazet [Tue, 27 Feb 2024 09:24:10 +0000 (09:24 +0000)]
inet: do not use RTNL in inet_netconf_get_devconf()

"ip -4 netconf show dev XXXX" no longer acquires RTNL.

Return -ENODEV instead of -EINVAL if no netdev or idev can be found.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227092411.2315725-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoinet: annotate devconf data-races
Eric Dumazet [Tue, 27 Feb 2024 09:24:09 +0000 (09:24 +0000)]
inet: annotate devconf data-races

Add READ_ONCE() in ipv4_devconf_get() and corresponding
WRITE_ONCE() in ipv4_devconf_set()

Add IPV4_DEVCONF_RO() and IPV4_DEVCONF_ALL_RO() macros,
and use them when reading devconf fields.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227092411.2315725-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: phy: dp83826: disable WOL at init
Catalin Popescu [Mon, 26 Feb 2024 16:23:39 +0000 (17:23 +0100)]
net: phy: dp83826: disable WOL at init

Commit d1d77120bc28 ("net: phy: dp83826: support TX data voltage tuning")
introduced a regression in that WOL is not disabled by default for DP83826.
WOL should normally be enabled through ethtool.

Fixes: d1d77120bc28 ("net: phy: dp83826: support TX data voltage tuning")
Signed-off-by: Catalin Popescu <catalin.popescu@leica-geosystems.com>
Link: https://lore.kernel.org/r/20240226162339.696461-1-catalin.popescu@leica-geosystems.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agonet: remove SLAB_MEM_SPREAD flag usage
Chengming Zhou [Wed, 28 Feb 2024 03:06:58 +0000 (03:06 +0000)]
net: remove SLAB_MEM_SPREAD flag usage

The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.

[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240228030658.3512782-1-chengming.zhou@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agoMerge branch 'tools-ynl-stop-using-libmnl'
Jakub Kicinski [Wed, 28 Feb 2024 23:25:47 +0000 (15:25 -0800)]
Merge branch 'tools-ynl-stop-using-libmnl'

Jakub Kicinski says:

====================
tools: ynl: stop using libmnl

There is no strong reason to stop using libmnl in ynl but there
are a few small ones which add up.

First (as I remembered immediately after hitting send on v1),
C++ compilers do not like the libmnl for_each macros.
I haven't tried it myself, but having all the code directly
in YNL makes it easier for folks porting to C++ to modify them
and/or make YNL more C++ friendly.

Second, we do much more advanced netlink level parsing in ynl
than libmnl so it's hard to say that libmnl abstracts much from us.
The fact that this series, removing the libmnl dependency, only
adds <300 LoC shows that code savings aren't huge.
OTOH when new types are added (e.g. auto-int) we need to add
compatibility to deal with older version of libmnl (in fact,
even tho patches have been sent months ago, auto-ints are still
not supported in libmnl.git).

Thrid, the dependency makes ynl less self contained, and harder
to vendor in. Whether vendoring libraries into projects is a good
idea is a separate discussion, nonetheless, people want to do it.

Fourth, there are small annoyances with the libmnl APIs which
are hard to fix in backward-compatible ways. See the last patch
for example.

All in all, libmnl is a great library, but with all the code
generation and structured parsing, ynl is better served by going
its own way.

v1: https://lore.kernel.org/all/20240222235614.180876-1-kuba@kernel.org/
====================

Link: https://lore.kernel.org/r/20240227223032.1835527-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotools: ynl: use MSG_DONTWAIT for getting notifications
Jakub Kicinski [Tue, 27 Feb 2024 22:30:32 +0000 (14:30 -0800)]
tools: ynl: use MSG_DONTWAIT for getting notifications

To stick to libmnl wrappers in the past we had to use poll()
to check if there are any outstanding notifications on the socket.
This is no longer necessary, we can use MSG_DONTWAIT.

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-16-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotools: ynl: remove the libmnl dependency
Jakub Kicinski [Tue, 27 Feb 2024 22:30:31 +0000 (14:30 -0800)]
tools: ynl: remove the libmnl dependency

We don't use libmnl any more.

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-15-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotools: ynl: stop using mnl socket helpers
Jakub Kicinski [Tue, 27 Feb 2024 22:30:30 +0000 (14:30 -0800)]
tools: ynl: stop using mnl socket helpers

Most libmnl socket helpers can be replaced by direct calls to
the underlying libc API. We need portid, the netlink manpage
suggests we bind() address of zero.

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-14-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotools: ynl: switch away from MNL_CB_*
Jakub Kicinski [Tue, 27 Feb 2024 22:30:29 +0000 (14:30 -0800)]
tools: ynl: switch away from MNL_CB_*

Create a local version of the MNL_CB_* parser control values.

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-13-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotools: ynl: switch away from mnl_cb_t
Jakub Kicinski [Tue, 27 Feb 2024 22:30:28 +0000 (14:30 -0800)]
tools: ynl: switch away from mnl_cb_t

All YNL parsing callbacks take struct ynl_parse_arg as the argument.
Make that official by using a local callback type instead of mnl_cb_t.

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-12-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotools: ynl: stop using mnl_cb_run2()
Jakub Kicinski [Tue, 27 Feb 2024 22:30:27 +0000 (14:30 -0800)]
tools: ynl: stop using mnl_cb_run2()

There's only one set of callbacks in YNL, for netlink control
messages, and most of them are trivial. So implement the message
walking directly without depending on mnl_cb_run2().

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotools: ynl: use ynl_sock_read_msgs() for ACK handling
Jakub Kicinski [Tue, 27 Feb 2024 22:30:26 +0000 (14:30 -0800)]
tools: ynl: use ynl_sock_read_msgs() for ACK handling

ynl_recv_ack() is simple and it's the only user of mnl_cb_run().
Now that ynl_sock_read_msgs() exists it's actually less code
to use ynl_sock_read_msgs() instead of being special.

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotools: ynl: wrap recv() + mnl_cb_run2() into a single helper
Jakub Kicinski [Tue, 27 Feb 2024 22:30:25 +0000 (14:30 -0800)]
tools: ynl: wrap recv() + mnl_cb_run2() into a single helper

All callers to mnl_cb_run2() call mnl_socket_recvfrom() right before.
Wrap the two in a helper, take typed arguments (struct ynl_parse_arg),
instead of hoping that all callers remember that parser error handling
requires yarg.

In case of ynl_sock_read_family() we will no longer check for kernel
returning no data, but that would be a kernel bug, not worth complicating
the code to catch this. Calling mnl_cb_run2() on an empty buffer
is legal and results in STOP (1).

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotools: ynl-gen: remove unused parse code
Jakub Kicinski [Tue, 27 Feb 2024 22:30:24 +0000 (14:30 -0800)]
tools: ynl-gen: remove unused parse code

Commit f2ba1e5e2208 ("tools: ynl-gen: stop generating common notification handlers")
removed the last caller of the parse_cb_run() helper.
We no longer need to export ynl_cb_array.

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 months agotools: ynl: make yarg the first member of struct ynl_dump_state
Jakub Kicinski [Tue, 27 Feb 2024 22:30:23 +0000 (14:30 -0800)]
tools: ynl: make yarg the first member of struct ynl_dump_state

All YNL parsing code expects a pointer to struct ynl_parse_arg AKA yarg.
For dump was pass in struct ynl_dump_state, which works fine, because
struct ynl_dump_state and struct ynl_parse_arg have identical layout
for the members that matter.. but it's a bit hacky.

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>