Maksym Glubokiy [Tue, 23 Aug 2022 11:39:58 +0000 (14:39 +0300)]
net: prestera: manage matchall and flower priorities
matchall rules can be added only to chain 0 and their priorities have
limitations:
- new matchall ingress rule's priority must be higher (lower value)
than any existing flower rule;
- new matchall egress rule's priority must be lower (higher value)
than any existing flower rule.
The opposite works for flower rule adding:
- new flower ingress rule's priority must be lower (higher value)
than any existing matchall rule;
- new flower egress rule's priority must be higher (lower value)
than any existing matchall rule.
This is a hardware limitation and thus must be properly handled in
driver by reporting errors to the user when newly added rule has such a
priority that cannot be installed into the hardware.
To achieve this, the driver must maintain both min/max matchall
priorities for every flower block when user adds/deletes a matchall
rule, as well as both min/max flower priorities for chain 0 for every
adding/deletion of flower rules for chain 0.
Cc: Serhiy Boiko <serhiy.boiko@plvision.eu>
Signed-off-by: Maksym Glubokiy <maksym.glubokiy@plvision.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Serhiy Boiko [Tue, 23 Aug 2022 11:39:57 +0000 (14:39 +0300)]
net: prestera: add support for egress traffic mirroring
This enables adding matchall rules for egress:
tc filter add .. egress .. matchall skip_sw \
action mirred egress mirror dev ..
Signed-off-by: Serhiy Boiko <serhiy.boiko@plvision.eu>
Signed-off-by: Maksym Glubokiy <maksym.glubokiy@plvision.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Serhiy Boiko [Tue, 23 Aug 2022 11:39:56 +0000 (14:39 +0300)]
net: prestera: acl: extract matchall logic into a separate file
This commit adds more clarity to handling of TC_CLSMATCHALL_REPLACE and
TC_CLSMATCHALL_DESTROY events by calling newly added *_mall_*() handlers
instead of directly calling SPAN API.
This also extracts matchall rules management out of SPAN API since SPAN
is a hardware module which is used to implement 'matchall egress mirred'
action only.
Signed-off-by: Taras Chornyi <tchornyi@marvell.com>
Signed-off-by: Serhiy Boiko <serhiy.boiko@plvision.eu>
Signed-off-by: Maksym Glubokiy <maksym.glubokiy@plvision.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oleksij Rempel [Mon, 22 Aug 2022 12:39:43 +0000 (14:39 +0200)]
net: asix: ax88772: add ethtool pause configuration
Add phylink based ethtool pause configuration
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oleksij Rempel [Mon, 22 Aug 2022 12:39:42 +0000 (14:39 +0200)]
net: asix: ax88772: migrate to phylink
There are some exotic ax88772 based devices which may require
functionality provide by the phylink framework. For example:
- US100A20SFP, USB 2.0 auf LWL Converter with SFP Cage
- AX88772B USB to 100Base-TX Ethernet (with RMII) demo board, where it
is possible to switch between internal PHY and external RMII based
connection.
So, convert this driver to phylink as soon as possible.
Tested with:
- AX88772A + internal PHY
- AX88772B + external DP83TD510E T1L PHY
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rob Herring [Thu, 25 Aug 2022 19:26:07 +0000 (14:26 -0500)]
dt-bindings: net: Add missing (unevaluated|additional)Properties on child nodes
In order to ensure only documented properties are present, node schemas
must have unevaluatedProperties or additionalProperties set to false
(typically). Add missing properties/$refs as exposed by this addition.
Signed-off-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20220825192609.1538463-1-robh@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 25 Aug 2022 23:07:42 +0000 (16:07 -0700)]
Merge git://git./linux/kernel/git/netdev/net
drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
21234e3a84c7 ("net/mlx5e: Fix use after free in mlx5e_fs_init()")
c7eafc5ed068 ("net/mlx5e: Convert ethtool_steering member of flow_steering struct to pointer")
https://lore.kernel.org/all/
20220825104410.
67d4709c@canb.auug.org.au/
https://lore.kernel.org/all/
20220823055533.334471-1-saeed@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Uros Bizjak [Mon, 22 Aug 2022 14:32:43 +0000 (16:32 +0200)]
netdev: Use try_cmpxchg in napi_if_scheduled_mark_missed
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
napi_if_scheduled_mark_missed. x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg
(and related move instruction in front of cmpxchg).
Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
fails, enabling further code simplifications.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20220822143243.2798-1-ubizjak@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Thu, 25 Aug 2022 21:03:58 +0000 (14:03 -0700)]
Merge tag 'net-6.0-rc3' of git://git./linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from ipsec and netfilter (with one broken Fixes tag).
Current release - new code bugs:
- dsa: don't dereference NULL extack in dsa_slave_changeupper()
- dpaa: fix <1G ethernet on LS1046ARDB
- neigh: don't call kfree_skb() under spin_lock_irqsave()
Previous releases - regressions:
- r8152: fix the RX FIFO settings when suspending
- dsa: microchip: keep compatibility with device tree blobs with no
phy-mode
- Revert "net: macsec: update SCI upon MAC address change."
- Revert "xfrm: update SA curlft.use_time", comply with RFC 2367
Previous releases - always broken:
- netfilter: conntrack: work around exceeded TCP receive window
- ipsec: fix a null pointer dereference of dst->dev on a metadata dst
in xfrm_lookup_with_ifid
- moxa: get rid of asymmetry in DMA mapping/unmapping
- dsa: microchip: make learning configurable and keep it off while
standalone
- ice: xsk: prohibit usage of non-balanced queue id
- rxrpc: fix locking in rxrpc's sendmsg
Misc:
- another chunk of sysctl data race silencing"
* tag 'net-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
net: lantiq_xrx200: restore buffer if memory allocation failed
net: lantiq_xrx200: fix lock under memory pressure
net: lantiq_xrx200: confirm skb is allocated before using
net: stmmac: work around sporadic tx issue on link-up
ionic: VF initial random MAC address if no assigned mac
ionic: fix up issues with handling EAGAIN on FW cmds
ionic: clear broken state on generation change
rxrpc: Fix locking in rxrpc's sendmsg
net: ethernet: mtk_eth_soc: fix hw hash reporting for MTK_NETSYS_V2
MAINTAINERS: rectify file entry in BONDING DRIVER
i40e: Fix incorrect address type for IPv6 flow rules
ixgbe: stop resetting SYSTIME in ixgbe_ptp_start_cyclecounter
net: Fix a data-race around sysctl_somaxconn.
net: Fix a data-race around netdev_unregister_timeout_secs.
net: Fix a data-race around gro_normal_batch.
net: Fix data-races around sysctl_devconf_inherit_init_net.
net: Fix data-races around sysctl_fb_tunnels_only_for_init_net.
net: Fix a data-race around netdev_budget_usecs.
net: Fix data-races around sysctl_max_skb_frags.
net: Fix a data-race around netdev_budget.
...
Jakub Kicinski [Thu, 25 Aug 2022 20:24:47 +0000 (13:24 -0700)]
Merge branch 'mlxsw-remove-some-unused-code'
Petr Machata says:
====================
mlxsw: Remove some unused code
This patchset removes code that is not used anymore after the following two
commits removed all users:
- commit
b0d80c013b04 ("mlxsw: Remove Mellanox SwitchX-2 ASIC support")
- commit
9b43fbb8ce24 ("mlxsw: Remove Mellanox SwitchIB ASIC support")
====================
Link: https://lore.kernel.org/r/cover.1661350629.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiri Pirko [Wed, 24 Aug 2022 15:18:51 +0000 (17:18 +0200)]
mlxsw: Remove unused mlxsw_core_port_type_get()
Function mlxsw_core_port_type_get() is no longer used. So remove it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiri Pirko [Wed, 24 Aug 2022 15:18:50 +0000 (17:18 +0200)]
mlxsw: Remove unused port_type_set devlink op
port_type_set devlink op is no longer used by any mlxsw driver,
so remove it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiri Pirko [Wed, 24 Aug 2022 15:18:49 +0000 (17:18 +0200)]
mlxsw: Remove unused IB stuff
There are some IB leftovers that are no longer used in the code.
So remove them.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 25 Aug 2022 20:22:58 +0000 (13:22 -0700)]
Merge branch 'net-devlink-sync-flash-and-dev-info-commands'
Jiri Pirko says:
====================
net: devlink: sync flash and dev info commands
Purpose of this patchset is to introduce consistency between two devlink
commands:
devlink dev info
Shows versions of running default flash target and components.
devlink dev flash
Flashes default flash target or component name (if specified
on cmdline).
Currently it is up to the driver what versions to expose and what flash
update component names to accept. This is inconsistent. Thankfully, only
netdevsim currently using components so it is still time
to sanitize this.
This patchset makes sure, that devlink.c calls into driver for
component flash update only in case the driver exposes the same version
name.
Example:
$ devlink dev info
netdevsim/netdevsim10:
driver netdevsim
versions:
running:
fw.mgmt 10.20.30
stored:
fw.mgmt 10.20.30
$ devlink dev flash netdevsim/netdevsim10 file somefile.bin
[fw.mgmt] Preparing to flash
[fw.mgmt] Flashing 100%
[fw.mgmt] Flash select
[fw.mgmt] Flashing done
$ devlink dev flash netdevsim/netdevsim10 file somefile.bin component fw.mgmt
[fw.mgmt] Preparing to flash
[fw.mgmt] Flashing 100%
[fw.mgmt] Flash select
[fw.mgmt] Flashing done
$ devlink dev flash netdevsim/netdevsim10 file somefile.bin component dummy
Error: selected component is not supported by this device.
====================
Link: https://lore.kernel.org/r/20220824122011.1204330-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiri Pirko [Wed, 24 Aug 2022 12:20:11 +0000 (14:20 +0200)]
net: devlink: limit flash component name to match version returned by info_get()
Limit the acceptance of component name passed to cmd_flash_update() to
match one of the versions returned by info_get(), marked by version type.
This makes things clearer and enforces 1:1 mapping between exposed
version and accepted flash component.
Check VERSION_TYPE_COMPONENT version type during cmd_flash_update()
execution by calling info_get() with different "req" context.
That causes info_get() to lookup the component name instead of
filling-up the netlink message.
Remove "UPDATE_COMPONENT" flag which becomes used.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiri Pirko [Wed, 24 Aug 2022 12:20:10 +0000 (14:20 +0200)]
netdevsim: add version fw.mgmt info info_get() and mark as a component
Fix the only component user which is netdevsim. It uses component named
"fw.mgmt" in selftests. So add this version to info_get() output with
version type component.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiri Pirko [Wed, 24 Aug 2022 12:20:09 +0000 (14:20 +0200)]
net: devlink: extend info_get() version put to indicate a flash component
Whenever the driver is called by his info_get() op, it may put multiple
version names and values to the netlink message. Extend by additional
helper devlink_info_version_running/stored_put_ext() that allows to
specify a version type that indicates when particular version name
represents a flash component.
This is going to be used in follow-up patch calling info_get() during
flash update command checking if version with this the version type
exists.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Adel Abouchaev [Wed, 24 Aug 2022 18:43:51 +0000 (11:43 -0700)]
selftests/net: fix reinitialization of TEST_PROGS in net self tests.
Assinging will drop all previous tests.
Fixes: b690842d12fd ("selftests/net: test l2 tunnel TOS/TTL inheriting")
Signed-off-by: Adel Abouchaev <adel.abushaev@gmail.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Link: https://lore.kernel.org/r/20220824184351.3759862-1-adel.abushaev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 25 Aug 2022 19:41:41 +0000 (12:41 -0700)]
Merge branch 'net-lantiq_xrx200-fix-errors-under-memory-pressure'
Aleksander Jan Bajkowski says:
====================
net: lantiq_xrx200: fix errors under memory pressure
This series fixes issues that can occur in the driver under memory pressure.
Situations when the system cannot allocate memory are rare, so the mentioned
bugs have been fixed recently. The patches have been tested on a BT Home
router with the Lantiq xRX200 chipset.
Changelog:
v3: - removed netdev_err() log from the first patch
v2:
- the second patch has been changed, so that under memory pressure situation
the driver will not receive packets indefinitely regardless of the NAPI budget,
- the third patch has been added.
====================
Link: https://lore.kernel.org/r/20220824215408.4695-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Aleksander Jan Bajkowski [Wed, 24 Aug 2022 21:54:08 +0000 (23:54 +0200)]
net: lantiq_xrx200: restore buffer if memory allocation failed
In a situation where memory allocation fails, an invalid buffer address
is stored. When this descriptor is used again, the system panics in the
build_skb() function when accessing memory.
Fixes: 7ea6cd16f159 ("lantiq: net: fix duplicated skb in rx descriptor ring")
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Aleksander Jan Bajkowski [Wed, 24 Aug 2022 21:54:07 +0000 (23:54 +0200)]
net: lantiq_xrx200: fix lock under memory pressure
When the xrx200_hw_receive() function returns -ENOMEM, the NAPI poll
function immediately returns an error.
This is incorrect for two reasons:
* the function terminates without enabling interrupts or scheduling NAPI,
* the error code (-ENOMEM) is returned instead of the number of received
packets.
After the first memory allocation failure occurs, packet reception is
locked due to disabled interrupts from DMA..
Fixes: fe1a56420cf2 ("net: lantiq: Add Lantiq / Intel VRX200 Ethernet driver")
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Aleksander Jan Bajkowski [Wed, 24 Aug 2022 21:54:06 +0000 (23:54 +0200)]
net: lantiq_xrx200: confirm skb is allocated before using
xrx200_hw_receive() assumes build_skb() always works and goes straight
to skb_reserve(). However, build_skb() can fail under memory pressure.
Add a check in case build_skb() failed to allocate and return NULL.
Fixes: e015593573b3 ("net: lantiq_xrx200: convert to build_skb")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Wed, 24 Aug 2022 20:34:49 +0000 (22:34 +0200)]
net: stmmac: work around sporadic tx issue on link-up
This is a follow-up to the discussion in [0]. It seems to me that
at least the IP version used on Amlogic SoC's sometimes has a problem
if register MAC_CTRL_REG is written whilst the chip is still processing
a previous write. But that's just a guess.
Adding a delay between two writes to this register helps, but we can
also simply omit the offending second write. This patch uses the second
approach and is based on a suggestion from Qi Duan.
Benefit of this approach is that we can save few register writes, also
on not affected chip versions.
[0] https://www.spinics.net/lists/netdev/msg831526.html
Fixes: bfab27a146ed ("stmmac: add the experimental PCI support")
Suggested-by: Qi Duan <qi.duan@amlogic.com>
Suggested-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/e99857ce-bd90-5093-ca8c-8cd480b5a0a2@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 25 Aug 2022 19:40:29 +0000 (12:40 -0700)]
Merge branch '10GbE' of git://git./linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2022-08-24 (ixgbe, i40e)
This series contains updates to ixgbe and i40e drivers.
Jake stops incorrect resetting of SYSTIME registers when starting
cyclecounter for ixgbe.
Sylwester corrects a check on source IP address when validating destination
for i40e.
* '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
i40e: Fix incorrect address type for IPv6 flow rules
ixgbe: stop resetting SYSTIME in ixgbe_ptp_start_cyclecounter
====================
Link: https://lore.kernel.org/r/20220824193748.874343-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 25 Aug 2022 19:40:17 +0000 (12:40 -0700)]
Merge branch 'ionic-bug-fixes'
Shannon Nelson says:
====================
ionic: bug fixes
These are a couple of maintenance bug fixes for the Pensando ionic
networking driver.
Mohamed takes care of a "plays well with others" issue where the
VF spec is a bit vague on VF mac addresses, but certain customers
have come to expect behavior based on other vendor drivers.
Shannon addresses a couple of corner cases seen in internal
stress testing.
====================
Link: https://lore.kernel.org/r/20220824165051.6185-1-snelson@pensando.io
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
R Mohamed Shah [Wed, 24 Aug 2022 16:50:51 +0000 (09:50 -0700)]
ionic: VF initial random MAC address if no assigned mac
Assign a random mac address to the VF interface station
address if it boots with a zero mac address in order to match
similar behavior seen in other VF drivers. Handle the errors
where the older firmware does not allow the VF to set its own
station address.
Newer firmware will allow the VF to set the station mac address
if it hasn't already been set administratively through the PF.
Setting it will also be allowed if the VF has trust.
Fixes: fbb39807e9ae ("ionic: support sr-iov operations")
Signed-off-by: R Mohamed Shah <mohamed@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shannon Nelson [Wed, 24 Aug 2022 16:50:50 +0000 (09:50 -0700)]
ionic: fix up issues with handling EAGAIN on FW cmds
In looping on FW update tests we occasionally see the
FW_ACTIVATE_STATUS command fail while it is in its EAGAIN loop
waiting for the FW activate step to finsh inside the FW. The
firmware is complaining that the done bit is set when a new
dev_cmd is going to be processed.
Doing a clean on the cmd registers and doorbell before exiting
the wait-for-done and cleaning the done bit before the sleep
prevents this from occurring.
Fixes: fbfb8031533c ("ionic: Add hardware init and device commands")
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shannon Nelson [Wed, 24 Aug 2022 16:50:49 +0000 (09:50 -0700)]
ionic: clear broken state on generation change
There is a case found in heavy testing where a link flap happens just
before a firmware Recovery event and the driver gets stuck in the
BROKEN state. This comes from the driver getting interrupted by a FW
generation change when coming back up from the link flap, and the call
to ionic_start_queues() in ionic_link_status_check() fails. This can be
addressed by having the fw_up code clear the BROKEN bit if seen, rather
than waiting for a user to manually force the interface down and then
back up.
Fixes: 9e8eaf8427b6 ("ionic: stop watchdog when in broken state")
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 24 Aug 2022 16:35:45 +0000 (17:35 +0100)]
rxrpc: Fix locking in rxrpc's sendmsg
Fix three bugs in the rxrpc's sendmsg implementation:
(1) rxrpc_new_client_call() should release the socket lock when returning
an error from rxrpc_get_call_slot().
(2) rxrpc_wait_for_tx_window_intr() will return without the call mutex
held in the event that we're interrupted by a signal whilst waiting
for tx space on the socket or relocking the call mutex afterwards.
Fix this by: (a) moving the unlock/lock of the call mutex up to
rxrpc_send_data() such that the lock is not held around all of
rxrpc_wait_for_tx_window*() and (b) indicating to higher callers
whether we're return with the lock dropped. Note that this means
recvmsg() will not block on this call whilst we're waiting.
(3) After dropping and regaining the call mutex, rxrpc_send_data() needs
to go and recheck the state of the tx_pending buffer and the
tx_total_len check in case we raced with another sendmsg() on the same
call.
Thinking on this some more, it might make sense to have different locks for
sendmsg() and recvmsg(). There's probably no need to make recvmsg() wait
for sendmsg(). It does mean that recvmsg() can return MSG_EOR indicating
that a call is dead before a sendmsg() to that call returns - but that can
currently happen anyway.
Without fix (2), something like the following can be induced:
WARNING: bad unlock balance detected!
5.16.0-rc6-syzkaller #0 Not tainted
-------------------------------------
syz-executor011/3597 is trying to release lock (&call->user_mutex) at:
[<
ffffffff885163a3>] rxrpc_do_sendmsg+0xc13/0x1350 net/rxrpc/sendmsg.c:748
but there are no more locks to release!
other info that might help us debug this:
no locks held by syz-executor011/3597.
...
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_unlock_imbalance_bug include/trace/events/lock.h:58 [inline]
__lock_release kernel/locking/lockdep.c:5306 [inline]
lock_release.cold+0x49/0x4e kernel/locking/lockdep.c:5657
__mutex_unlock_slowpath+0x99/0x5e0 kernel/locking/mutex.c:900
rxrpc_do_sendmsg+0xc13/0x1350 net/rxrpc/sendmsg.c:748
rxrpc_sendmsg+0x420/0x630 net/rxrpc/af_rxrpc.c:561
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
____sys_sendmsg+0x6e8/0x810 net/socket.c:2409
___sys_sendmsg+0xf3/0x170 net/socket.c:2463
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2492
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
[Thanks to Hawkins Jiawei and Khalid Masum for their attempts to fix this]
Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Reported-by: syzbot+7f0483225d0c94cb3441@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: syzbot+7f0483225d0c94cb3441@syzkaller.appspotmail.com
cc: Hawkins Jiawei <yin31149@gmail.com>
cc: Khalid Masum <khalid.masum.92@gmail.com>
cc: Dan Carpenter <dan.carpenter@oracle.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/166135894583.600315.7170979436768124075.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Thu, 25 Aug 2022 17:52:16 +0000 (10:52 -0700)]
Merge tag 'cgroup-for-6.0-rc2-fixes-2' of git://git./linux/kernel/git/tj/cgroup
Pull another cgroup fix from Tejun Heo:
"Commit
4f7e7236435c ("cgroup: Fix threadgroup_rwsem <->
cpus_read_lock() deadlock") required the cgroup
core to grab cpus_read_lock() before invoking ->attach().
Unfortunately, it missed adding cpus_read_lock() in
cgroup_attach_task_all(). Fix it"
* tag 'cgroup-for-6.0-rc2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()
Tetsuo Handa [Thu, 25 Aug 2022 08:38:38 +0000 (17:38 +0900)]
cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()
syzbot is hitting percpu_rwsem_assert_held(&cpu_hotplug_lock) warning at
cpuset_attach() [1], for commit
4f7e7236435ca0ab ("cgroup: Fix
threadgroup_rwsem <-> cpus_read_lock() deadlock") missed that
cpuset_attach() is also called from cgroup_attach_task_all().
Add cpus_read_lock() like what cgroup_procs_write_start() does.
Link: https://syzkaller.appspot.com/bug?extid=29d3a3b4d86c8136ad9e
Reported-by: syzbot <syzbot+29d3a3b4d86c8136ad9e@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 4f7e7236435ca0ab ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock")
Signed-off-by: Tejun Heo <tj@kernel.org>
Zhengchao Shao [Wed, 24 Aug 2022 00:52:31 +0000 (08:52 +0800)]
net: sched: delete duplicate cleanup of backlog and qlen
qdisc_reset() is clearing qdisc->q.qlen and qdisc->qstats.backlog
_after_ calling qdisc->ops->reset. There is no need to clear them
again in the specific reset function.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20220824005231.345727-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Lorenzo Bianconi [Tue, 23 Aug 2022 12:24:07 +0000 (14:24 +0200)]
net: ethernet: mtk_eth_soc: fix hw hash reporting for MTK_NETSYS_V2
Properly report hw rx hash for mt7986 chipset accroding to the new dma
descriptor layout.
Fixes: 197c9e9b17b11 ("net: ethernet: mtk_eth_soc: introduce support for mt7986 chipset")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/091394ea4e705fbb35f828011d98d0ba33808f69.1661257293.git.lorenzo@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Wenjuan Geng [Tue, 23 Aug 2022 09:01:22 +0000 (11:01 +0200)]
nfp: flower: support case of match on ct_state(0/0x3f)
is_post_ct_flow() function will process only ct_state ESTABLISHED,
then offload_pre_check() function will check FLOW_DISSECTOR_KEY_CT flag.
When config tc filter match ct_state(0/0x3f), dissector->used_keys
with FLOW_DISSECTOR_KEY_CT bit, function offload_pre_check() will
return false, so not offload. This is a special case that can be handled
safely.
Therefore, modify to let initial packet which won't go through conntrack
can be offloaded, as long as the cared ct fields are all zero.
Signed-off-by: Wenjuan Geng <wenjuan.geng@corigine.com>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20220823090122.403631-1-simon.horman@corigine.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Richard Gobert [Tue, 23 Aug 2022 07:10:49 +0000 (09:10 +0200)]
net: gro: skb_gro_header helper function
Introduce a simple helper function to replace a common pattern.
When accessing the GRO header, we fetch the pointer from frag0,
then test its validity and fetch it from the skb when necessary.
This leads to the pattern
skb_gro_header_fast -> skb_gro_header_hard -> skb_gro_header_slow
recurring many times throughout GRO code.
This patch replaces these patterns with a single inlined function
call, improving code readability.
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220823071034.GA56142@debian
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jiri Pirko [Tue, 23 Aug 2022 07:02:13 +0000 (09:02 +0200)]
Documentation: devlink: fix the locking section
As all callbacks are converted now, fix the text reflecting that change.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20220823070213.1008956-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 25 Aug 2022 02:30:19 +0000 (19:30 -0700)]
Merge branch 'add-a-second-bind-table-hashed-by-port-and-address'
Joanne Koong says:
====================
Add a second bind table hashed by port and address
Currently, there is one bind hashtable (bhash) that hashes by port only.
This patchset adds a second bind table (bhash2) that hashes by port and
address.
The motivation for adding bhash2 is to expedite bind requests in situations
where the port has many sockets in its bhash table entry (eg a large number
of sockets bound to different addresses on the same port), which makes checking
bind conflicts costly especially given that we acquire the table entry spinlock
while doing so, which can cause softirq cpu lockups and can prevent new tcp
connections.
We ran into this problem at Meta where the traffic team binds a large number
of IPs to port 443 and the bind() call took a significant amount of time
which led to cpu softirq lockups, which caused packet drops and other failures
on the machine.
When experimentally testing this on a local server for ~24k sockets bound to
the port, the results seen were:
ipv4:
before - 0.002317 seconds
with bhash2 - 0.000020 seconds
ipv6:
before - 0.002431 seconds
with bhash2 - 0.000021 seconds
The additions to the initial bhash2 submission [0] are:
* Updating bhash2 in the cases where a socket's rcv saddr changes after it has
* been bound
* Adding locks for bhash2 hashbuckets
[0] https://lore.kernel.org/netdev/
20220520001834.
2247810-1-kuba@kernel.org/
v3: https://lore.kernel.org/netdev/
20220722195406.
1304948-2-joannelkoong@gmail.com/
v2: https://lore.kernel.org/netdev/
20220712235310.
1935121-1-joannelkoong@gmail.com/
v1: https://lore.kernel.org/netdev/
20220623234242.
2083895-2-joannelkoong@gmail.com/
====================
Link: https://lore.kernel.org/r/20220822181023.3979645-1-joannelkoong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Joanne Koong [Mon, 22 Aug 2022 18:10:23 +0000 (11:10 -0700)]
selftests/net: Add sk_bind_sendto_listen and sk_connect_zero_addr
This patch adds 2 new tests: sk_bind_sendto_listen and
sk_connect_zero_addr.
The sk_bind_sendto_listen test exercises the path where a socket's
rcv saddr changes after it has been added to the binding tables,
and then a listen() on the socket is invoked. The listen() should
succeed.
The sk_bind_sendto_listen test is copied over from one of syzbot's
tests: https://syzkaller.appspot.com/x/repro.c?x=
1673a38df00000
The sk_connect_zero_addr test exercises the path where the socket was
never previously added to the binding tables and it gets assigned a
saddr upon a connect() to address 0.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Joanne Koong [Mon, 22 Aug 2022 18:10:22 +0000 (11:10 -0700)]
selftests/net: Add test for timing a bind request to a port with a populated bhash entry
This test populates the bhash table for a given port with
MAX_THREADS * MAX_CONNECTIONS sockets, and then times how long
a bind request on the port takes.
When populating the bhash table, we create the sockets and then bind
the sockets to the same address and port (SO_REUSEADDR and SO_REUSEPORT
are set). When timing how long a bind on the port takes, we bind on a
different address without SO_REUSEPORT set. We do not set SO_REUSEPORT
because we are interested in the case where the bind request does not
go through the tb->fastreuseport path, which is fragile (eg
tb->fastreuseport path does not work if binding with a different uid).
To run the script:
Usage: ./bind_bhash.sh [-6 | -4] [-p port] [-a address]
6: use ipv6
4: use ipv4
port: Port number
address: ip address
Without any arguments, ./bind_bhash.sh defaults to ipv6 using ip address
"2001:0db8:0:f101::1" on port 443.
On my local machine, I see:
ipv4:
before - 0.002317 seconds
with bhash2 - 0.000020 seconds
ipv6:
before - 0.002431 seconds
with bhash2 - 0.000021 seconds
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Joanne Koong [Mon, 22 Aug 2022 18:10:21 +0000 (11:10 -0700)]
net: Add a bhash2 table hashed by port and address
The current bind hashtable (bhash) is hashed by port only.
In the socket bind path, we have to check for bind conflicts by
traversing the specified port's inet_bind_bucket while holding the
hashbucket's spinlock (see inet_csk_get_port() and
inet_csk_bind_conflict()). In instances where there are tons of
sockets hashed to the same port at different addresses, the bind
conflict check is time-intensive and can cause softirq cpu lockups,
as well as stops new tcp connections since __inet_inherit_port()
also contests for the spinlock.
This patch adds a second bind table, bhash2, that hashes by
port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
Searching the bhash2 table leads to significantly faster conflict
resolution and less time holding the hashbucket spinlock.
Please note a few things:
* There can be the case where the a socket's address changes after it
has been bound. There are two cases where this happens:
1) The case where there is a bind() call on INADDR_ANY (ipv4) or
IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
assign the socket an address when it handles the connect()
2) In inet_sk_reselect_saddr(), which is called when rebuilding the
sk header and a few pre-conditions are met (eg rerouting fails).
In these two cases, we need to update the bhash2 table by removing the
entry for the old address, and add a new entry reflecting the updated
address.
* The bhash2 table must have its own lock, even though concurrent
accesses on the same port are protected by the bhash lock. Bhash2 must
have its own lock to protect against cases where sockets on different
ports hash to different bhash hashbuckets but to the same bhash2
hashbucket.
This brings up a few stipulations:
1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
will always be acquired after the bhash lock and released before the
bhash lock is released.
2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
acquired+released before another bhash2 lock is acquired+released.
* The bhash table cannot be superseded by the bhash2 table because for
bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
bound to that port must be checked for a potential conflict. The bhash
table is the only source of port->socket associations.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 25 Aug 2022 02:18:09 +0000 (19:18 -0700)]
Merge git://git./linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Fix crash with malformed ebtables blob which do not provide all
entry points, from Florian Westphal.
2) Fix possible TCP connection clogging up with default 5-days
timeout in conntrack, from Florian.
3) Fix crash in nf_tables tproxy with unsupported chains, also from Florian.
4) Do not allow to update implicit chains.
5) Make table handle allocation per-netns to fix data race.
6) Do not truncated payload length and offset, and checksum offset.
Instead report EINVAl.
7) Enable chain stats update via static key iff no error occurs.
8) Restrict osf expression to ip, ip6 and inet families.
9) Restrict tunnel expression to netdev family.
10) Fix crash when trying to bind again an already bound chain.
11) Flowtable garbage collector might leave behind pending work to
delete entries. This patch comes with a previous preparation patch
as dependency.
12) Allow net.netfilter.nf_conntrack_frag6_high_thresh to be lowered,
from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_defrag_ipv6: allow nf_conntrack_frag6_high_thresh increases
netfilter: flowtable: fix stuck flows on cleanup due to pending work
netfilter: flowtable: add function to invoke garbage collection immediately
netfilter: nf_tables: disallow binding to already bound chain
netfilter: nft_tunnel: restrict it to netdev family
netfilter: nft_osf: restrict osf to ipv4, ipv6 and inet families
netfilter: nf_tables: do not leave chain stats enabled on error
netfilter: nft_payload: do not truncate csum_offset and csum_type
netfilter: nft_payload: report ERANGE for too long offset and length
netfilter: nf_tables: make table handle allocation per-netns friendly
netfilter: nf_tables: disallow updates of implicit chain
netfilter: nft_tproxy: restrict to prerouting hook
netfilter: conntrack: work around exceeded receive window
netfilter: ebtables: reject blobs that don't provide all entry points
====================
Link: https://lore.kernel.org/r/20220824220330.64283-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lukas Bulwahn [Wed, 24 Aug 2022 07:29:45 +0000 (09:29 +0200)]
MAINTAINERS: rectify file entry in BONDING DRIVER
Commit
c078290a2b76 ("selftests: include bonding tests into the kselftest
infra") adds the bonding tests in the directory:
tools/testing/selftests/drivers/net/bonding/
The file entry in MAINTAINERS for the BONDING DRIVER however refers to:
tools/testing/selftests/net/bonding/
Hence, ./scripts/get_maintainer.pl --self-test=patterns complains about a
broken file pattern.
Repair this file entry in BONDING DRIVER.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Jonathan Toppins <jtoppins@redhat.com>
Link: https://lore.kernel.org/r/20220824072945.28606-1-lukas.bulwahn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zhengchao Shao [Wed, 24 Aug 2022 01:36:21 +0000 (09:36 +0800)]
netlink: fix some kernel-doc comments
Modify the comment of input parameter of nlmsg_ and nla_ function.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20220824013621.365103-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Randy Dunlap [Wed, 24 Aug 2022 02:42:16 +0000 (19:42 -0700)]
net: ethernet: ti: davinci_mdio: fix build for mdio bitbang uses
davinci_mdio.c uses mdio bitbang APIs, so it should select
MDIO_BITBANG to prevent build errors.
arm-linux-gnueabi-ld: drivers/net/ethernet/ti/davinci_mdio.o: in function `davinci_mdio_remove':
drivers/net/ethernet/ti/davinci_mdio.c:649: undefined reference to `free_mdio_bitbang'
arm-linux-gnueabi-ld: drivers/net/ethernet/ti/davinci_mdio.o: in function `davinci_mdio_probe':
drivers/net/ethernet/ti/davinci_mdio.c:545: undefined reference to `alloc_mdio_bitbang'
arm-linux-gnueabi-ld: drivers/net/ethernet/ti/davinci_mdio.o: in function `davinci_mdiobb_read':
drivers/net/ethernet/ti/davinci_mdio.c:236: undefined reference to `mdiobb_read'
arm-linux-gnueabi-ld: drivers/net/ethernet/ti/davinci_mdio.o: in function `davinci_mdiobb_write':
drivers/net/ethernet/ti/davinci_mdio.c:253: undefined reference to `mdiobb_write'
Fixes: d04807b80691 ("net: ethernet: ti: davinci_mdio: Add workaround for errata i2329")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Ravi Gunasekaran <r-gunasekaran@ti.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Cc: Sudip Mukherjee (Codethink) <sudipm.mukherjee@gmail.com>
Link: https://lore.kernel.org/r/20220824024216.4939-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bagas Sanjaya [Wed, 24 Aug 2022 03:58:04 +0000 (10:58 +0700)]
Documentation: sysctl: align cells in second content column
Stephen Rothwell reported htmldocs warning when merging net-next tree:
Documentation/admin-guide/sysctl/net.rst:37: WARNING: Malformed table.
Text in column margin in table line 4.
========= =================== = ========== ==================
Directory Content Directory Content
========= =================== = ========== ==================
802 E802 protocol mptcp Multipath TCP
appletalk Appletalk protocol netfilter Network Filter
ax25 AX25 netrom NET/ROM
bridge Bridging rose X.25 PLP layer
core General parameter tipc TIPC
ethernet Ethernet protocol unix Unix domain sockets
ipv4 IP version 4 x25 X.25 protocol
ipv6 IP version 6
========= =================== = ========== ==================
The warning above is caused by cells in second "Content" column of
/proc/sys/net subdirectory table which are in column margin.
Align these cells against the column header to fix the warning.
Link: https://lore.kernel.org/linux-next/20220823134905.57ed08d5@canb.auug.org.au/
Fixes: 1202cdd665315c ("Remove DECnet support from kernel")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20220824035804.204322-1-bagasdotme@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sylwester Dziedziuch [Fri, 19 Aug 2022 10:45:52 +0000 (12:45 +0200)]
i40e: Fix incorrect address type for IPv6 flow rules
It was not possible to create 1-tuple flow director
rule for IPv6 flow type. It was caused by incorrectly
checking for source IP address when validating user provided
destination IP address.
Fix this by changing ip6src to correct ip6dst address
in destination IP address validation for IPv6 flow type.
Fixes: efca91e89b67 ("i40e: Add flow director support for IPv6")
Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Tue, 2 Aug 2022 00:24:19 +0000 (17:24 -0700)]
ixgbe: stop resetting SYSTIME in ixgbe_ptp_start_cyclecounter
The ixgbe_ptp_start_cyclecounter is intended to be called whenever the
cyclecounter parameters need to be changed.
Since commit
a9763f3cb54c ("ixgbe: Update PTP to support X550EM_x
devices"), this function has cleared the SYSTIME registers and reset the
TSAUXC DISABLE_SYSTIME bit.
While these need to be cleared during ixgbe_ptp_reset, it is wrong to clear
them during ixgbe_ptp_start_cyclecounter. This function may be called
during both reset and link status change. When link changes, the SYSTIME
counter is still operating normally, but the cyclecounter should be updated
to account for the possibly changed parameters.
Clearing SYSTIME when link changes causes the timecounter to jump because
the cycle counter now reads zero.
Extract the SYSTIME initialization out to a new function and call this
during ixgbe_ptp_reset. This prevents the timecounter adjustment and avoids
an unnecessary reset of the current time.
This also restores the original SYSTIME clearing that occurred during
ixgbe_ptp_reset before the commit above.
Reported-by: Steve Payne <spayne@aurora.tech>
Reported-by: Ilya Evenbach <ievenbach@aurora.tech>
Fixes: a9763f3cb54c ("ixgbe: Update PTP to support X550EM_x devices")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Linus Torvalds [Wed, 24 Aug 2022 17:43:34 +0000 (10:43 -0700)]
Merge tag 'trace-v6.0-rc2' of git://git./linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
- Fix build warning for when MODULES and FTRACE_WITH_DIRECT_CALLS are
not set. A warning happens with ops_references_rec() defined but not
used.
* tag 'trace-v6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix build warning for ops_references_rec() not used
Linus Torvalds [Wed, 24 Aug 2022 17:19:20 +0000 (10:19 -0700)]
Merge branch 'dmi-for-linus' of git://git./linux/kernel/git/jdelvare/staging
Pull dmi update from Jean Delvare.
Tiny cleanup.
* 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
firmware: dmi: Use the proper accessor for the version field
David S. Miller [Wed, 24 Aug 2022 12:46:59 +0000 (13:46 +0100)]
Merge branch 'sysctl-data-races'
Kuniyuki Iwashima says:
====================
net: sysctl: Fix data-races around net.core.XXX
This series fixes data-races around all knobs in net_core_table and
netns_core_table except for bpf stuff.
These knobs are skipped:
- 4 bpf knobs
- netdev_rss_key: Written only once by net_get_random_once() and
read-only knob
- rps_sock_flow_entries: Protected with sock_flow_mutex
- flow_limit_cpu_bitmap: Protected with flow_limit_update_mutex
- flow_limit_table_len: Protected with flow_limit_update_mutex
- default_qdisc: Protected with qdisc_mod_lock
- warnings: Unused
- high_order_alloc_disable: Protected with static_key_mutex
- skb_defer_max: Already using READ_ONCE()
- sysctl_txrehash: Already using READ_ONCE()
Note 5th patch fixes net.core.message_cost and net.core.message_burst,
and lib/ratelimit.c does not have an explicit maintainer.
Changes:
v3:
* Fix build failures of CONFIG_SYSCTL=n case in 13th & 14th patches
v2: https://lore.kernel.org/netdev/
20220818035227.81567-1-kuniyu@amazon.com/
* Remove 4 bpf knobs and added 6 knobs
v1: https://lore.kernel.org/netdev/
20220816052347.70042-1-kuniyu@amazon.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:47:00 +0000 (10:47 -0700)]
net: Fix a data-race around sysctl_somaxconn.
While reading sysctl_somaxconn, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:59 +0000 (10:46 -0700)]
net: Fix a data-race around netdev_unregister_timeout_secs.
While reading netdev_unregister_timeout_secs, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 5aa3afe107d9 ("net: make unregister netdev warning timeout configurable")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:58 +0000 (10:46 -0700)]
net: Fix a data-race around gro_normal_batch.
While reading gro_normal_batch, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 323ebb61e32b ("net: use listified RX for handling GRO_NORMAL skbs")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:57 +0000 (10:46 -0700)]
net: Fix data-races around sysctl_devconf_inherit_init_net.
While reading sysctl_devconf_inherit_init_net, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 856c395cfa63 ("net: introduce a knob to control whether to inherit devconf config")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:56 +0000 (10:46 -0700)]
net: Fix data-races around sysctl_fb_tunnels_only_for_init_net.
While reading sysctl_fb_tunnels_only_for_init_net, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 79134e6ce2c9 ("net: do not create fallback tunnels for non-default namespaces")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:55 +0000 (10:46 -0700)]
net: Fix a data-race around netdev_budget_usecs.
While reading netdev_budget_usecs, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 7acf8a1e8a28 ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:54 +0000 (10:46 -0700)]
net: Fix data-races around sysctl_max_skb_frags.
While reading sysctl_max_skb_frags, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 5f74f82ea34c ("net:Add sysctl_max_skb_frags")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:53 +0000 (10:46 -0700)]
net: Fix a data-race around netdev_budget.
While reading netdev_budget, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 51b0bdedb8e7 ("[NET]: Separate two usages of netdev_max_backlog.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:52 +0000 (10:46 -0700)]
net: Fix a data-race around sysctl_net_busy_read.
While reading sysctl_net_busy_read, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 2d48d67fa8cd ("net: poll/select low latency socket support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:51 +0000 (10:46 -0700)]
net: Fix a data-race around sysctl_net_busy_poll.
While reading sysctl_net_busy_poll, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 060212928670 ("net: add low latency socket poll")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:50 +0000 (10:46 -0700)]
net: Fix a data-race around sysctl_tstamp_allow_data.
While reading sysctl_tstamp_allow_data, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: b245be1f4db1 ("net-timestamp: no-payload only sysctl")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:49 +0000 (10:46 -0700)]
net: Fix data-races around sysctl_optmem_max.
While reading sysctl_optmem_max, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:48 +0000 (10:46 -0700)]
ratelimit: Fix data-races in ___ratelimit().
While reading rs->interval and rs->burst, they can be changed
concurrently via sysctl (e.g. net_ratelimit_state). Thus, we
need to add READ_ONCE() to their readers.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:47 +0000 (10:46 -0700)]
net: Fix data-races around netdev_tstamp_prequeue.
While reading netdev_tstamp_prequeue, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 3b098e2d7c69 ("net: Consistent skb timestamping")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:46 +0000 (10:46 -0700)]
net: Fix data-races around netdev_max_backlog.
While reading netdev_max_backlog, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
While at it, we remove the unnecessary spaces in the doc.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:45 +0000 (10:46 -0700)]
net: Fix data-races around weight_p and dev_weight_[rt]x_bias.
While reading weight_p, it can be changed concurrently. Thus, we need
to add READ_ONCE() to its reader.
Also, dev_[rt]x_weight can be read/written at the same time. So, we
need to use READ_ONCE() and WRITE_ONCE() for its access. Moreover, to
use the same weight_p while changing dev_[rt]x_weight, we add a mutex
in proc_do_dev_weight().
Fixes: 3d48b53fb2ae ("net: dev_weight: TX/RX orthogonality")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 23 Aug 2022 17:46:44 +0000 (10:46 -0700)]
net: Fix data-races around sysctl_[rw]mem_(max|default).
While reading sysctl_[rw]mem_(max|default), they can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 24 Aug 2022 12:24:10 +0000 (13:24 +0100)]
Merge branch 'r8169-next'
Heiner Kallweit says:
====================
r8169: remove support for few unused chip versions
There's a number of chip versions that apparently never made it to the
mass market. Detection of these chip versions has been disabled for
few kernel versions now and nobody complained. Therefore remove
support for these chip versions.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Tue, 23 Aug 2022 18:38:06 +0000 (20:38 +0200)]
r8169: remove support for chip version 60
Detection of this chip version has been disabled for few kernel versions now.
Nobody complained, so remove support for this chip version.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Tue, 23 Aug 2022 18:37:30 +0000 (20:37 +0200)]
r8169: remove support for chip version 50
Detection of this chip version has been disabled for few kernel versions now.
Nobody complained, so remove support for this chip version.
v3:
- rebase patch
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Tue, 23 Aug 2022 18:36:07 +0000 (20:36 +0200)]
r8169: remove support for chip version 49
Detection of this chip version has been disabled for few kernel versions now.
Nobody complained, so remove support for this chip version.
v2:
- fix a typo: RTL_GIGA_MAC_VER_40 -> RTL_GIGA_MAC_VER_50
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Tue, 23 Aug 2022 18:35:25 +0000 (20:35 +0200)]
r8169: remove support for chip versions 45 and 47
Detection of these chip versions has been disabled for few kernel versions now.
Nobody complained, so remove support for this chip version.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Tue, 23 Aug 2022 18:34:52 +0000 (20:34 +0200)]
r8169: remove support for chip version 41
Detection of this chip version has been disabled for few kernel versions now.
Nobody complained, so remove support for this chip version.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 24 Aug 2022 12:19:39 +0000 (13:19 +0100)]
Merge tag 'mlx5-updates-2022-08-22' of git://git./linux/kernel/git/saeed/linux
mlx5-updates-2022-08-22
Roi Dayan Says:
===============
Add support for SF tunnel offload
Mlx5 driver only supports VF tunnel offload.
To add support for SF tunnel offload the driver needs to:
1. Add send-to-vport metadata matching rules like done for VFs.
2. Set an indirect table for SF vport, same as VF vport.
info smaller sub functions for better maintainability.
rules from esw init phase to representor load phase.
SFs could be created after esw initialized and thus the send-to-vport
meta rules would not be created for those SFs.
By moving the creation of the rules to representor load phase
we ensure creating the rules also for SFs created later.
===============
Lama Kayal Says:
================
Make flow steering API loosely coupled from mlx5e_priv, in a manner to
introduce more readable and maintainable modules.
Make TC's private, let mlx5e_flow_steering struct be dynamically allocated,
and introduce its API to maintain the code via setters and getters
instead of publicly exposing it.
Introduce flow steering debug macros to provide an elegant finish to the
decoupled flow steering API, where errors related to flow steering shall
be reported via them.
All flow steering related files will drop any coupling to mlx5e_priv,
instead they will get the relevant members as input. Among these,
fs_tt_redirect, fs_tc, and arfs.
================
lily [Tue, 23 Aug 2022 05:44:11 +0000 (22:44 -0700)]
net/core/skbuff: Check the return value of skb_copy_bits()
skb_copy_bits() could fail, which requires a check on the return
value.
Signed-off-by: Li Zhong <floridsleeves@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jerry Ray [Mon, 22 Aug 2022 21:39:32 +0000 (16:39 -0500)]
micrel: ksz8851: fixes struct pointer issue
Issue found during code review. This bug has no impact as long as the
ks8851_net structure is the first element of the ks8851_net_spi structure.
As long as the offset to the ks8851_net struct is zero, the container_of()
macro is subtracting 0 and therefore no damage done. But if the
ks8851_net_spi struct is ever modified such that the ks8851_net struct
within it is no longer the first element of the struct, then the bug would
manifest itself and cause problems.
struct ks8851_net is contained within ks8851_net_spi.
ks is contained within kss.
kss is the priv_data of the netdev structure.
Signed-off-by: Jerry Ray <jerry.ray@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 22 Aug 2022 21:15:28 +0000 (21:15 +0000)]
tcp: annotate data-race around tcp_md5sig_pool_populated
tcp_md5sig_pool_populated can be read while another thread
changes its value.
The race has no consequence because allocations
are protected with tcp_md5sig_mutex.
This patch adds READ_ONCE() and WRITE_ONCE() to document
the race and silence KCSAN.
Reported-by: Abhishek Shah <abhishek.shah@columbia.edu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oleksandr Mazur [Mon, 22 Aug 2022 18:03:15 +0000 (21:03 +0300)]
net: marvell: prestera: implement br_port_locked flag offloading
Both <port> br_port_locked and <lag> interfaces's flag
offloading is supported. No new ABI is being added,
rather existing (port_param_set) API call gets extended.
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
V2:
add missing receipents (linux-kernel, netdev)
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 24 Aug 2022 11:51:50 +0000 (12:51 +0100)]
Merge branch 'master' of git://git./linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2022-08-24
1) Fix a refcount leak in __xfrm_policy_check.
From Xin Xiong.
2) Revert "xfrm: update SA curlft.use_time". This
violates RFC 2367. From Antony Antony.
3) Fix a comment on XFRMA_LASTUSED.
From Antony Antony.
4) x->lastused is not cloned in xfrm_do_migrate.
Fix from Antony Antony.
5) Serialize the calls to xfrm_probe_algs.
From Herbert Xu.
6) Fix a null pointer dereference of dst->dev on a metadata
dst in xfrm_lookup_with_ifid. From Nikolay Aleksandrov.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Csókás Bence [Mon, 22 Aug 2022 08:10:52 +0000 (10:10 +0200)]
fec: Restart PPS after link state change
On link state change, the controller gets reset,
causing PPS to drop out and the PHC to lose its
time and calibration. So we restart it if needed,
restoring calibration and time registers.
Changes since v2:
* Add `fec_ptp_save_state()`/`fec_ptp_restore_state()`
* Use `ktime_get_real_ns()`
* Use `BIT()` macro
Changes since v1:
* More ECR #define's
* Stop PPS in `fec_ptp_stop()`
Signed-off-by: Csókás Bence <csokas.bence@prolan.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 24 Aug 2022 08:52:04 +0000 (09:52 +0100)]
Merge branch 'j7200-support'
Siddharth Vadapalli says:
====================
J7200: CPSW5G: Add support for QSGMII mode to am65-cpsw driver
Add support for QSGMII mode to am65-cpsw driver.
Change log:
v4-> v5:
1. Move ti,j7200-cpswxg-nuss compatible to the line above the
ti,j721e-cpsw-nuss compatible.
2. Add allOf and move if-then statements within it to allow future if-then
statements to be added easily.
v3 -> v4:
1. Update bindings to disallow ports based on compatible, instead of
adding a new if/then statement for the new compatible.
2. Add Else-If condition for RMII mode in the set of supported interfaces.
Support for RMII mode is already present in the driver and I had
missed out adding a condition for RMII mode in the previous patches.
v2 -> v3:
1. In ti,k3-am654-cpsw-nuss.yaml, restrict if/then statement to port
nodes.
v1 -> v2:
1. Add new compatible for CPSW5G in ti,k3-am654-cpsw-nuss.yaml and extend
properties for new compatible.
2. Add extra_modes member to struct am65_cpsw_pdata to be used for QSGMII
mode by new compatible.
3. Add check for phylink supported modes to ensure that only one phy mode
is advertised as supported.
4. Check if extra_modes supports QSGMII mode in am65_cpsw_nuss_mac_config()
for register write.
5. Add check for assigning port->sgmii_base only when extra_modes is valid.
v4: https://lore.kernel.org/r/
20220816060139.111934-1-s-vadapalli@ti.com/
v3: https://lore.kernel.org/r/
20220606110443.30362-1-s-vadapalli@ti.com/
v2: https://lore.kernel.org/r/
20220602114558.6204-1-s-vadapalli@ti.com/
v1: https://lore.kernel.org/r/
20220531113058.23708-1-s-vadapalli@ti.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Siddharth Vadapalli [Mon, 22 Aug 2022 07:01:25 +0000 (12:31 +0530)]
net: ethernet: ti: am65-cpsw: Move phy_set_mode_ext() to correct location
In TI's J7200 SoC CPSW5G ports, each of the 4 ports can be configured
as a QSGMII main or QSGMII-SUB port. This configuration is performed
by phy-gmii-sel driver on invoking the phy_set_mode_ext() function.
It is necessary for the QSGMII main port to be configured before any of
the QSGMII-SUB interfaces are brought up. Currently, the QSGMII-SUB
interfaces come up before the QSGMII main port is configured.
Fix this by moving the call to phy_set_mode_ext() from
am65_cpsw_nuss_ndo_slave_open() to am65_cpsw_nuss_init_slave_ports(),
thereby ensuring that the QSGMII main port is configured before any of
the QSGMII-SUB ports are brought up.
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Siddharth Vadapalli [Mon, 22 Aug 2022 07:01:24 +0000 (12:31 +0530)]
net: ethernet: ti: am65-cpsw: Add support for J7200 CPSW5G
CPSW5G in J7200 supports additional modes like QSGMII and SGMII.
Add new compatible for J7200 and enable QSGMII mode in am65-cpsw driver.
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Siddharth Vadapalli [Mon, 22 Aug 2022 07:01:23 +0000 (12:31 +0530)]
dt-bindings: net: ti: k3-am654-cpsw-nuss: Update bindings for J7200 CPSW5G
Update bindings for TI K3 J7200 SoC which contains 5 ports (4 external
ports) CPSW5G module and add compatible for it.
Changes made:
- Add new compatible ti,j7200-cpswxg-nuss for CPSW5G.
- Extend pattern properties for new compatible.
- Change maximum number of CPSW ports to 4 for new compatible.
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yang Yingliang [Mon, 22 Aug 2022 02:53:46 +0000 (10:53 +0800)]
net: neigh: don't call kfree_skb() under spin_lock_irqsave()
It is not allowed to call kfree_skb() from hardware interrupt
context or with interrupts being disabled. So add all skb to
a tmp list, then free them after spin_unlock_irqrestore() at
once.
Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Suggested-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Menglong Dong [Sun, 21 Aug 2022 05:18:58 +0000 (13:18 +0800)]
net: skb: prevent the split of kfree_skb_reason() by gcc
Sometimes, gcc will optimize the function by spliting it to two or
more functions. In this case, kfree_skb_reason() is splited to
kfree_skb_reason and kfree_skb_reason.part.0. However, the
function/tracepoint trace_kfree_skb() in it needs the return address
of kfree_skb_reason().
This split makes the call chains becomes:
kfree_skb_reason() -> kfree_skb_reason.part.0 -> trace_kfree_skb()
which makes the return address that passed to trace_kfree_skb() be
kfree_skb().
Therefore, introduce '__fix_address', which is the combination of
'__noclone' and 'noinline', and apply it to kfree_skb_reason() to
prevent to from being splited or made inline.
(Is it better to simply apply '__noclone oninline' to kfree_skb_reason?
I'm thinking maybe other functions have the same problems)
Meanwhile, wrap 'skb_unref()' with 'unlikely()', as the compiler thinks
it is likely return true and splits kfree_skb_reason().
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 23 Aug 2022 23:38:48 +0000 (16:38 -0700)]
netfilter: nf_defrag_ipv6: allow nf_conntrack_frag6_high_thresh increases
Currently, net.netfilter.nf_conntrack_frag6_high_thresh can only be lowered.
I found this issue while investigating a probable kernel issue
causing flakes in tools/testing/selftests/net/ip_defrag.sh
In particular, these sysctl changes were ignored:
ip netns exec "${NETNS}" sysctl -w net.netfilter.nf_conntrack_frag6_high_thresh=
9000000 >/dev/null 2>&1
ip netns exec "${NETNS}" sysctl -w net.netfilter.nf_conntrack_frag6_low_thresh=
7000000 >/dev/null 2>&1
This change is inline with commit
836196239298 ("net/ipfrag: let ip[6]frag_high_thresh
in ns be higher than in init_net")
Fixes: 8db3d41569bb ("netfilter: nf_defrag_ipv6: use net_generic infra")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Thu, 18 Nov 2021 21:24:15 +0000 (22:24 +0100)]
netfilter: flowtable: fix stuck flows on cleanup due to pending work
To clear the flow table on flow table free, the following sequence
normally happens in order:
1) gc_step work is stopped to disable any further stats/del requests.
2) All flow table entries are set to teardown state.
3) Run gc_step which will queue HW del work for each flow table entry.
4) Waiting for the above del work to finish (flush).
5) Run gc_step again, deleting all entries from the flow table.
6) Flow table is freed.
But if a flow table entry already has pending HW stats or HW add work
step 3 will not queue HW del work (it will be skipped), step 4 will wait
for the pending add/stats to finish, and step 5 will queue HW del work
which might execute after freeing of the flow table.
To fix the above, this patch flushes the pending work, then it sets the
teardown flag to all flows in the flowtable and it forces a garbage
collector run to queue work to remove the flows from hardware, then it
flushes this new pending work and (finally) it forces another garbage
collector run to remove the entry from the software flowtable.
Stack trace:
[47773.882335] BUG: KASAN: use-after-free in down_read+0x99/0x460
[47773.883634] Write of size 8 at addr
ffff888103b45aa8 by task kworker/u20:6/543704
[47773.885634] CPU: 3 PID: 543704 Comm: kworker/u20:6 Not tainted 5.12.0-rc7+ #2
[47773.886745] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)
[47773.888438] Workqueue: nf_ft_offload_del flow_offload_work_handler [nf_flow_table]
[47773.889727] Call Trace:
[47773.890214] dump_stack+0xbb/0x107
[47773.890818] print_address_description.constprop.0+0x18/0x140
[47773.892990] kasan_report.cold+0x7c/0xd8
[47773.894459] kasan_check_range+0x145/0x1a0
[47773.895174] down_read+0x99/0x460
[47773.899706] nf_flow_offload_tuple+0x24f/0x3c0 [nf_flow_table]
[47773.907137] flow_offload_work_handler+0x72d/0xbe0 [nf_flow_table]
[47773.913372] process_one_work+0x8ac/0x14e0
[47773.921325]
[47773.921325] Allocated by task 592159:
[47773.922031] kasan_save_stack+0x1b/0x40
[47773.922730] __kasan_kmalloc+0x7a/0x90
[47773.923411] tcf_ct_flow_table_get+0x3cb/0x1230 [act_ct]
[47773.924363] tcf_ct_init+0x71c/0x1156 [act_ct]
[47773.925207] tcf_action_init_1+0x45b/0x700
[47773.925987] tcf_action_init+0x453/0x6b0
[47773.926692] tcf_exts_validate+0x3d0/0x600
[47773.927419] fl_change+0x757/0x4a51 [cls_flower]
[47773.928227] tc_new_tfilter+0x89a/0x2070
[47773.936652]
[47773.936652] Freed by task 543704:
[47773.937303] kasan_save_stack+0x1b/0x40
[47773.938039] kasan_set_track+0x1c/0x30
[47773.938731] kasan_set_free_info+0x20/0x30
[47773.939467] __kasan_slab_free+0xe7/0x120
[47773.940194] slab_free_freelist_hook+0x86/0x190
[47773.941038] kfree+0xce/0x3a0
[47773.941644] tcf_ct_flow_table_cleanup_work
Original patch description and stack trace by Paul Blakey.
Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support")
Reported-by: Paul Blakey <paulb@nvidia.com>
Tested-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Mon, 22 Aug 2022 21:13:00 +0000 (23:13 +0200)]
netfilter: flowtable: add function to invoke garbage collection immediately
Expose nf_flow_table_gc_run() to force a garbage collector run from the
offload infrastructure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Mon, 22 Aug 2022 09:06:39 +0000 (11:06 +0200)]
netfilter: nf_tables: disallow binding to already bound chain
Update nft_data_init() to report EINVAL if chain is already bound.
Fixes: d0e2c7de92c7 ("netfilter: nf_tables: add NFT_CHAIN_BINDING")
Reported-by: Gwangun Jung <exsociety@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 21 Aug 2022 14:32:44 +0000 (16:32 +0200)]
netfilter: nft_tunnel: restrict it to netdev family
Only allow to use this expression from NFPROTO_NETDEV family.
Fixes: af308b94a2a4 ("netfilter: nf_tables: add tunnel support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 21 Aug 2022 14:25:07 +0000 (16:25 +0200)]
netfilter: nft_osf: restrict osf to ipv4, ipv6 and inet families
As it was originally intended, restrict extension to supported families.
Fixes: b96af92d6eaf ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 21 Aug 2022 10:41:33 +0000 (12:41 +0200)]
netfilter: nf_tables: do not leave chain stats enabled on error
Error might occur later in the nf_tables_addchain() codepath, enable
static key only after transaction has been created.
Fixes: 9f08ea848117 ("netfilter: nf_tables: keep chain counters away from hot path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 21 Aug 2022 09:55:19 +0000 (11:55 +0200)]
netfilter: nft_payload: do not truncate csum_offset and csum_type
Instead report ERANGE if csum_offset is too long, and EOPNOTSUPP if type
is not support.
Fixes: 7ec3f7b47b8d ("netfilter: nft_payload: add packet mangling support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 21 Aug 2022 09:47:04 +0000 (11:47 +0200)]
netfilter: nft_payload: report ERANGE for too long offset and length
Instead of offset and length are truncation to u8, report ERANGE.
Fixes: 96518518cc41 ("netfilter: add nftables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 21 Aug 2022 08:52:48 +0000 (10:52 +0200)]
netfilter: nf_tables: make table handle allocation per-netns friendly
mutex is per-netns, move table_netns to the pernet area.
*read-write* to 0xffffffff883a01e8 of 8 bytes by task 6542 on cpu 0:
nf_tables_newtable+0x6dc/0xc00 net/netfilter/nf_tables_api.c:1221
nfnetlink_rcv_batch net/netfilter/nfnetlink.c:513 [inline]
nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:634 [inline]
nfnetlink_rcv+0xa6a/0x13a0 net/netfilter/nfnetlink.c:652
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x652/0x730 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x643/0x740 net/netlink/af_netlink.c:1921
Fixes: f102d66b335a ("netfilter: nf_tables: use dedicated mutex to guard transactions")
Reported-by: Abhishek Shah <abhishek.shah@columbia.edu>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 21 Aug 2022 08:28:25 +0000 (10:28 +0200)]
netfilter: nf_tables: disallow updates of implicit chain
Updates on existing implicit chain make no sense, disallow this.
Fixes: d0e2c7de92c7 ("netfilter: nf_tables: add NFT_CHAIN_BINDING")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Linus Torvalds [Wed, 24 Aug 2022 02:33:28 +0000 (19:33 -0700)]
Merge tag 'cgroup-for-6.0-rc2-fixes' of git://git./linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- The psi data structure was changed to be allocated dynamically but
it wasn't being cleared leading to it reporting garbage values and
triggering spurious oom kills.
- A deadlock involving cpuset and cpu hotplug.
- When a controller is moved across cgroup hierarchies,
css->rstat_css_node didn't get RCU drained properly from the previous
list.
* tag 'cgroup-for-6.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Fix race condition at rebind_subsystems()
cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock
sched/psi: Remove redundant cgroup_psi() when !CONFIG_CGROUPS
sched/psi: Remove unused parameter nbytes of psi_trigger_create()
sched/psi: Zero the memory of struct psi_group
Linus Torvalds [Wed, 24 Aug 2022 02:26:48 +0000 (19:26 -0700)]
Merge tag 'audit-pr-
20220823' of git://git./linux/kernel/git/pcmoore/audit
Pull audit fix from Paul Moore:
"A single fix for a potential double-free on a fsnotify error path"
* tag 'audit-pr-
20220823' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: fix potential double free on error path from fsnotify_add_inode_mark
Linus Torvalds [Wed, 24 Aug 2022 02:17:26 +0000 (19:17 -0700)]
Merge tag 'fs.fixes.v6.0-rc3' of git://git./linux/kernel/git/vfs/idmapping
Pull file_remove_privs() fix from Christian Brauner:
"As part of Stefan's and Jens' work to add async buffered write
support to xfs we refactored file_remove_privs() and added
__file_remove_privs() to avoid calling __remove_privs() when
IOCB_NOWAIT is passed.
While debugging a recent performance regression report I found that
during review we missed that commit
faf99b563558 ("fs: add
__remove_file_privs() with flags parameter") accidently changed
behavior when dentry_needs_remove_privs() returns zero.
Before the commit it would still call inode_has_no_xattr() setting
the S_NOSEC bit and thereby avoiding even calling into
dentry_needs_remove_privs() the next time this function is called.
After that commit inode_has_no_xattr() would only be called if
__remove_privs() had to be called.
Restore the old behavior. This is likely the cause of the performance
regression"
* tag 'fs.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
fs: __file_remove_privs(): restore call to inode_has_no_xattr()