Petr Machata [Fri, 8 Mar 2024 12:59:53 +0000 (13:59 +0100)]
mlxsw: spectrum_router: Support nexthop group hardware statistics
When hw_stats is set on a group, install nexthop counters on members of a
group.
Counter allocation request is moved from nexthop object initialization to
the update code. The previous placement made sense: when the counters are
enabled by dpipe, the counters are installed to all existing nexthops and
all nexthops created from then on get them. For the finer-grained nexthop
group statistics, this is unsuitable. The existing placement was kept for
the IPv4 and IPv6 nexthops.
Resilient group replacement emits a pre_replace notification, and then any
bucket_replace notifications if there were any replacements at all. If the
group is balanced and the nexthop composition of the replaced group didn't
change, there will be no such notifiers. Therefore hook to the pre_replace
notifier and mark all buckets for update, to un/install the counters.
When reporting deltas for resilient groups, use the nexthop ID that we
stored in a previous patch to look up to which nexthop a bucket
contributes.
Co-developed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/87495a72f187df2e5d491d02729c550d235fcc85.1709901020.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata [Fri, 8 Mar 2024 12:59:52 +0000 (13:59 +0100)]
mlxsw: spectrum_router: Track NH ID's of group members
The core interfaces for collecting per-NH statistics are built around
nexthops even for resilient groups. Because mlxsw models each bucket as a
nexthop, the core next hop that a given bucket contributes to needs to be
looked up. In order to be able to match the two up, we need to track
nexthop ID for members of group nexthop objects. For simplicity, do it for
all nexthop objects, not just group members.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/184ceb6b154e08f5bcf116a705b0fcb01c31895c.1709901020.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata [Fri, 8 Mar 2024 12:59:51 +0000 (13:59 +0100)]
mlxsw: spectrum_router: Add helpers for nexthop counters
The next patch will add the ability to share nexthop counters among
mlxsw nexthops backed by the same core nexthop. To have a place to store
reference count, the counter should be kept in a dedicated structure. In
this patch, introduce the structure together with the related helpers, sans
the refcount, which comes in the next patch.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/61f23fa4f8c5d7879f68dacd793d8ab7425f33c0.1709901020.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata [Fri, 8 Mar 2024 12:59:50 +0000 (13:59 +0100)]
mlxsw: spectrum_router: Avoid allocating NH counters twice
mlxsw_sp_nexthop_counter_disable() decays to a nop when called on a
disabled counter, but mlxsw_sp_nexthop_counter_enable() can't similarly
be called on an enabled counter. This would be useful in the following
patches. Add the missing condition.
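The symmetry described above can be sketched in a few lines of C. This is only an illustration: the toy_* names, the structure layout and the tiny allocator are assumptions made for the example and are not the driver's code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a nexthop with an optional HW counter. */
struct toy_nexthop {
	bool counter_valid;		/* true once a counter is allocated */
	unsigned int counter_index;
};

/* Toy allocator: hands out increasing indices, fails past a small limit. */
static unsigned int toy_next_index;

static int toy_counter_alloc(unsigned int *index)
{
	if (toy_next_index >= 4)
		return -ENOSPC;
	*index = toy_next_index++;
	return 0;
}

/* Enable is a no-op when the counter is already enabled. */
static int toy_nexthop_counter_enable(struct toy_nexthop *nh)
{
	int err;

	if (nh->counter_valid)
		return 0;

	err = toy_counter_alloc(&nh->counter_index);
	if (err)
		return err;

	nh->counter_valid = true;
	return 0;
}

/* Disable is a no-op when the counter is already disabled. */
static void toy_nexthop_counter_disable(struct toy_nexthop *nh)
{
	if (!nh->counter_valid)
		return;
	nh->counter_valid = false;
}
```

With the early return in place, calling the enable helper twice on the same object allocates only one counter, which is what lets callers invoke it unconditionally.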
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/0cc9050e196366c1387ab5ee47f1cee8ecde9c86.1709901020.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata [Fri, 8 Mar 2024 12:59:49 +0000 (13:59 +0100)]
mlxsw: spectrum: Allow fetch-and-clear of flow counters
For a report_delta-like interface, such as the one a previous patch added
for collecting NH group statistics, it is easiest to read the counter and
have the HW clear it right away. Thus, change mlxsw_sp_flow_counter_get()
to take a bool indicating whether this should be done.
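A minimal software model of the fetch-and-clear idea, assuming a counter that is just two integers. The toy_* names and the argument order are hypothetical and do not reproduce the signature of mlxsw_sp_flow_counter_get().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of one hardware flow counter. */
struct toy_flow_counter {
	uint64_t packets;
	uint64_t bytes;
};

/*
 * Read the counter. When 'clear' is true, also reset it, so that the next
 * read returns only the traffic seen since this call (a delta).
 */
static void toy_flow_counter_get(struct toy_flow_counter *c, bool clear,
				 uint64_t *packets, uint64_t *bytes)
{
	*packets = c->packets;
	*bytes = c->bytes;
	if (clear) {
		c->packets = 0;
		c->bytes = 0;
	}
}
```

A caller that reports deltas passes clear = true on every read and never has to remember the previous absolute value.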
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/6a096ede8ee92d5041e3832242c3bbc137198aba.1709901020.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata [Fri, 8 Mar 2024 12:59:48 +0000 (13:59 +0100)]
mlxsw: spectrum_router: Have mlxsw_sp_nexthop_counter_enable() return int
In order to be able to diagnose failures in counter allocation, have the
function mlxsw_sp_nexthop_counter_enable() return an error code.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/e0bb5c0cc6234ade2ade1e92abac991359c3f446.1709901020.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata [Fri, 8 Mar 2024 12:59:47 +0000 (13:59 +0100)]
mlxsw: spectrum_router: Rename two functions
The function mlxsw_sp_nexthop_counter_alloc() doesn't directly allocate
anything, and mlxsw_sp_nexthop_counter_free() doesn't directly free. For
the following patches, we will need names for functions that actually do
those things. Therefore rename to mlxsw_sp_nexthop_counter_enable() and
mlxsw_sp_nexthop_counter_disable() to free up the namespace.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/f59272958697a718f090f59f892d32beabcd8972.1709901020.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata [Fri, 8 Mar 2024 12:59:46 +0000 (13:59 +0100)]
net: nexthop: Have all NH notifiers carry NH ID
When sending the notifications to collect NH statistics for resilient
groups, the driver will need to know the nexthop IDs in individual buckets
to look up the right counter. To that end, move the nexthop ID from struct
nh_notifier_grp_entry_info to nh_notifier_single_info.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/8f964cd50b1a56d3606ce7ab4c50354ae019c43b.1709901020.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata [Fri, 8 Mar 2024 12:59:45 +0000 (13:59 +0100)]
net: nexthop: Initialize NH group ID in resilient NH group notifiers
The NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE notifier currently keeps the group
ID unset. That makes it impossible to look up the group for which the
notifier is intended. This is not an issue at the moment, because the only
client is netdevsim, and that only so that it can veto replacements, which is a
static property not tied to a particular group. But for any practical use,
the ID is necessary. Set it.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/025fef095dcfb408042568bb5439da014d47239e.1709901020.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Fri, 8 Mar 2024 10:22:30 +0000 (10:22 +0000)]
net: gro: move two declarations to include/net/gro.h
Move gro_find_receive_by_type() and gro_find_complete_by_type()
to include/net/gro.h where they belong.
Also use _NET_GRO_H instead of _NET_IPV6_GRO_H to protect
include/net/gro.h from multiple inclusions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240308102230.296224-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthew Wood [Fri, 8 Mar 2024 00:25:24 +0000 (16:25 -0800)]
net: netconsole: Add continuation line prefix to userdata messages
Add a space (' ') prefix to every userdata line to match docs for
dev-kmsg. To account for this extra character in each userdata entry,
reduce userdata entry names (directory name) from 54 characters to 53.
According to the dev-kmsg docs, a space is used for subsequent lines to
mark them as continuation lines.
> A line starting with ' ', is a continuation line, adding
> key/value pairs to the log message, which provide the machine
> readable context of the message, for reliable processing in
> userspace.
Testing for this patch::
cd /sys/kernel/config/netconsole && mkdir cmdline0
cd cmdline0
mkdir userdata/test && echo "hello" > userdata/test/value
mkdir userdata/test2 && echo "hello2" > userdata/test2/value
echo "message" > /dev/kmsg
Outputs::
6.8.0-rc5-virtme,12,493,231373579,-;message
 test=hello
 test2=hello2
And I confirmed all testing works as expected from the original patchset
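The continuation-line format can be illustrated with a small userspace formatter. This is a sketch only: toy_format_userdata() is a hypothetical helper, not the netconsole function, and it assumes each entry is emitted as a leading space, "key=value" and a newline.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Append one userdata entry as a dev-kmsg style continuation line:
 * a leading space, then "key=value" and a newline.
 * Returns the number of bytes written, or -1 if the entry does not fit.
 */
static int toy_format_userdata(char *buf, size_t size,
			       const char *key, const char *value)
{
	int n = snprintf(buf, size, " %s=%s\n", key, value);

	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}
```

The extra leading character is why a fixed-size entry buffer leaves one character less for the key name, as the message above notes.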
Fixes: df03f830d099 ("net: netconsole: cache userdata formatted string in netconsole_target")
Signed-off-by: Matthew Wood <thepacketgeek@gmail.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240308002525.248672-1-thepacketgeek@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Thu, 7 Mar 2024 21:23:20 +0000 (22:23 +0100)]
r8169: switch to new function phy_support_eee
Switch to the new function phy_support_eee(). This simplifies
the code because data->tx_lpi_enabled is now populated by
phy_ethtool_get_eee().
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/92462328-5c9b-4d82-9ce4-ea974cda4900@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Thu, 7 Mar 2024 21:16:12 +0000 (22:16 +0100)]
net: phy: simplify a check in phy_check_link_status
Handling the err == 0 case in the other branch allows simplifying the
code. In addition, I assume that "err & phydev->eee_cfg.tx_lpi_enabled"
was meant to use a logical AND operator. The bitwise AND also works as
expected, but using a bitwise AND with a bool value looks
ugly to me.
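The difference between the two operators only shows up when the integer operand is something other than 0 or 1. The sketch below uses hypothetical helper names; that the value in the PHY code is limited to 0 or 1 at that point is an assumption drawn from the remark that the bitwise form works as expected.

```c
#include <stdbool.h>

/* Bitwise AND: only bit 0 of 'err' can survive against a bool (0 or 1). */
static bool toy_and_bitwise(int err, bool enabled)
{
	return err & enabled;
}

/* Logical AND: any non-zero 'err' counts as true. */
static bool toy_and_logical(int err, bool enabled)
{
	return err && enabled;
}
```

For err values of 0 or 1 both helpers agree. For err == 2 the bitwise form yields false while the logical form yields true, which is why the logical operator states the intent more safely.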
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/de37bf30-61dd-49f9-b645-2d8ea11ddb5d@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andy Shevchenko [Thu, 7 Mar 2024 12:23:45 +0000 (14:23 +0200)]
net: phy: marvell-88x2222: Remove unused of_gpio.h
of_gpio.h is deprecated and subject to removal.
The driver doesn't use it, so simply remove the unused header.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20240307122346.3677534-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Justin Swartz [Tue, 5 Mar 2024 04:39:51 +0000 (06:39 +0200)]
net: dsa: mt7530: disable LEDs before reset
Disable LEDs just before resetting the MT7530 to avoid
situations where the ESW_P4_LED_0 and ESW_P3_LED_0 pin
states may cause an unintended external crystal frequency
to be selected.
The HT_XTAL_FSEL (External Crystal Frequency Selection)
field of HWTRAP (the Hardware Trap register) stores a
2-bit value that represents the state of the ESW_P4_LED_0
and ESW_P3_LED_0 pins (seemingly) sampled just after the
MT7530 has been reset, as:
ESW_P4_LED_0    ESW_P3_LED_0    Frequency
-----------------------------------------
0               1               20MHz
1               0               40MHz
1               1               25MHz
The value of HT_XTAL_FSEL is bootstrapped by pulling
ESW_P4_LED_0 and ESW_P3_LED_0 up or down accordingly,
but:
if a 40MHz crystal has been selected and
the ESW_P3_LED_0 pin is high during reset,
or a 20MHz crystal has been selected and
the ESW_P4_LED_0 pin is high during reset,
then the value of HT_XTAL_FSEL will indicate
that a 25MHz crystal is present.
By default, the state of the LED pins is PHY controlled
to reflect the link state.
To illustrate, if a board has:
5 ports with active low LED control,
and HT_XTAL_FSEL bootstrapped for 40MHz.
When the MT7530 is powered up without any external
connection, only the LED associated with Port 3 is
illuminated as ESW_P3_LED_0 is low.
In this state, directly after mt7530_setup()'s reset
is performed, the HWTRAP register (0x7800) reflects
the intended HT_XTAL_FSEL (HWTRAP bits 10:9) of 40MHz:
mt7530-mdio mdio-bus:1f: mt7530_read: 00007800 == 00007dcf
>>> bin(0x7dcf >> 9 & 0b11)
'0b10'
But if a cable is connected to Port 3 and the link
is active before mt7530_setup()'s reset takes place,
then HT_XTAL_FSEL seems to be set for 25MHz:
mt7530-mdio mdio-bus:1f: mt7530_read: 00007800 == 00007fcf
>>> bin(0x7fcf >> 9 & 0b11)
'0b11'
Once HT_XTAL_FSEL reflects 25MHz, none of the ports
are functional until the MT7621 (or MT7530 itself)
is reset.
By disabling the LED pins just before reset, the chance
of an unintended HT_XTAL_FSEL value is reduced.
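The field decoding used in the two register dumps above can be written as a small C helper. The TOY_* macros and function names are hypothetical. The bit position (HWTRAP bits 10:9) and the frequency mapping come from the text, with the assumption that ESW_P4_LED_0 is the more significant of the two bits, which matches the '0b10' reading for 40MHz.

```c
#include <stdint.h>

/* Field position taken from the text above: HWTRAP bits 10:9. */
#define TOY_HT_XTAL_FSEL_SHIFT	9
#define TOY_HT_XTAL_FSEL_MASK	0x3u

/* Extract the 2-bit crystal selection field from a HWTRAP value. */
static unsigned int toy_xtal_fsel(uint32_t hwtrap)
{
	return (hwtrap >> TOY_HT_XTAL_FSEL_SHIFT) & TOY_HT_XTAL_FSEL_MASK;
}

/* Map the field to a frequency in MHz per the table; 0 if not listed. */
static unsigned int toy_xtal_mhz(uint32_t hwtrap)
{
	switch (toy_xtal_fsel(hwtrap)) {
	case 1:
		return 20;
	case 2:
		return 40;
	case 3:
		return 25;
	default:
		return 0;
	}
}
```

Feeding in the two dumped values, 0x7dcf decodes to 40MHz and 0x7fcf decodes to 25MHz, the misdetected case.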
Signed-off-by: Justin Swartz <justin.swartz@risingedge.co.za>
Link: https://lore.kernel.org/r/20240305043952.21590-1-justin.swartz@risingedge.co.za
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andy Shevchenko [Thu, 7 Mar 2024 12:22:31 +0000 (14:22 +0200)]
net: mdio_bus: Remove unused of_gpio.h
of_gpio.h is deprecated and subject to removal.
The driver doesn't use it, so simply remove the unused header.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20240307122231.3677241-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ricardo B. Marliere [Tue, 5 Mar 2024 20:11:27 +0000 (17:11 -0300)]
ptp: make ptp_class constant
Since commit
43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the ptp_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240305-ptp-v1-1-ed253eb33c20@marliere.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hangbin Liu [Fri, 8 Mar 2024 08:12:39 +0000 (16:12 +0800)]
netlink: specs: support unterminated-ok
ynl-gen-c.py supports the unterminated-ok check, but the YAML schemas don't
have this key. Add it to the YAML files.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240308081239.3281710-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hangbin Liu [Mon, 11 Mar 2024 14:07:27 +0000 (22:07 +0800)]
tools: ynl-gen: support using pre-defined values in attr checks
Support using pre-defined values in checks so we don't need to hard-code
numbers for string and binary lengths. E.g. we have a definition like
#define TEAM_STRING_MAX_LEN 32
which is defined in YAML like:
definitions:
  -
    name: string-max-len
    type: const
    value: 32
It can be used in the attribute-sets like
attribute-sets:
  -
    name: attr-option
    name-prefix: team-attr-option-
    attributes:
      -
        name: name
        type: string
        checks:
          len: string-max-len
With this patch it will be converted to
[TEAM_ATTR_OPTION_NAME] = { .type = NLA_STRING, .len = TEAM_STRING_MAX_LEN, }
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240311140727.109562-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mina Almasry [Fri, 8 Mar 2024 20:44:58 +0000 (12:44 -0800)]
net: page_pool: factor out page_pool recycle check
The check is duplicated in 2 places, factor it out into a common helper.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20240308204500.1112858-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David S. Miller [Mon, 11 Mar 2024 10:37:41 +0000 (10:37 +0000)]
Merge branch 'tcp-wmem-data-races'
Jason Xing says:
====================
annotate data-races around sysctl_tcp_wmem[0]
Adding simple READ_ONCE() annotations protects reads of the sysctl knob
while someone is trying to change it.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Xing [Fri, 8 Mar 2024 11:25:04 +0000 (19:25 +0800)]
tcp: annotate a data-race around sysctl_tcp_wmem[0]
When reading wmem[0], it could be changed concurrently without
READ_ONCE() protection. So add one annotation here.
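A userspace approximation of the annotation, for illustration only. TOY_READ_ONCE() mimics the kernel macro with a volatile cast (it relies on the GCC/Clang __typeof__ extension), the array stands in for the sysctl, and toy_min_sndbuf() is a hypothetical consumer, not the function touched by this patch.

```c
/* Userspace approximation of the kernel macro (GCC/Clang __typeof__). */
#define TOY_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Stand-in for a sysctl array that another thread may update. */
static int toy_sysctl_wmem[3] = { 4096, 16384, 4194304 };

/*
 * Take one snapshot of wmem[0] and use only that snapshot afterwards,
 * so the comparison and the returned value cannot come from two
 * different reads of the knob.
 */
static int toy_min_sndbuf(int requested)
{
	int floor = TOY_READ_ONCE(toy_sysctl_wmem[0]);

	return requested < floor ? floor : requested;
}
```

Without the annotation the compiler is free to reload the value between the comparison and the use, so a concurrent write could make the two disagree.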
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Xing [Fri, 8 Mar 2024 11:25:03 +0000 (19:25 +0800)]
mptcp: annotate a data-race around sysctl_tcp_wmem[0]
It's possible that a writer and a reader access the same
sysctl knob concurrently. Use READ_ONCE() to prevent reading
an old value.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Mar 2024 22:11:22 +0000 (14:11 -0800)]
ynl: samples: fix recycling rate calculation
Running the page-pool sample on production machines under moderate
networking load shows recycling rate higher than 100%:
$ page-pool
eth0[2] page pools: 14 (zombies: 0)
refs: 89088 bytes: 364904448 (refs: 0 bytes: 0)
recycling: 100.3% (alloc: 1392:2290247724 recycle: 469289484:1828235386)
Note that outstanding refs (89088) == slow alloc * cache size (1392 * 64)
which means this machine is recycling page pool pages perfectly, not
a single page has been released.
The extra 0.3% is because the sample ignores allocations from the ptr_ring.
Treat those the same as alloc_fast; the ring vs cache alloc is
already captured accurately enough by recycling stats.
With the fix:
$ page-pool
eth0[2] page pools: 14 (zombies: 0)
refs: 89088 bytes: 364904448 (refs: 0 bytes: 0)
recycling: 100.0% (alloc: 1392:2331141604 recycle: 473625579:1857460661)
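The arithmetic behind the two outputs can be reproduced with a short C sketch. The structure and field names are hypothetical, and it is an assumption that the printed pairs are "alloc: slow:fast" and "recycle: ring:cache", and that after the fix the ring allocations are folded into the fast count.

```c
#include <stdint.h>

/* Hypothetical aggregate of page pool statistics for one interface. */
struct toy_pp_stats {
	uint64_t alloc_slow;	/* slow path allocations */
	uint64_t alloc_fast;	/* allocations served from the cache */
	uint64_t alloc_refill;	/* allocations served from the ptr_ring */
	uint64_t recycle_ring;	/* pages recycled into the ptr_ring */
	uint64_t recycle_cache;	/* pages recycled into the cache */
};

/* Recycling rate as a fraction: recycled pages over all allocations. */
static double toy_recycling_rate(const struct toy_pp_stats *s)
{
	uint64_t alloc = s->alloc_slow + s->alloc_fast + s->alloc_refill;
	uint64_t recycle = s->recycle_ring + s->recycle_cache;

	if (!alloc)
		return 0.0;
	return (double)recycle / (double)alloc;
}
```

Leaving the ring allocations out of the denominator, as the sample did before the fix, makes the denominator too small and pushes the ratio above 1.0. With the first set of numbers above that gives roughly 1.003, and with the second set it stays just below 1.0.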
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 7 Mar 2024 22:00:16 +0000 (22:00 +0000)]
udp: no longer touch sk->sk_refcnt in early demux
After commits
ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
and
7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()")
UDP early demux no longer needs to grab a refcount on the UDP socket.
This saves two atomic operations per incoming packet for connected
sockets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Joe Stringer <joe@wand.net.nz>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 11 Mar 2024 09:53:22 +0000 (09:53 +0000)]
Merge branch 'getsockopt-parameter-validation'
Gavrilov Ilia says:
====================
fix incorrect parameter validation in the *_get_sockopt() functions
This v2 series fixes incorrect parameter validation in *_get_sockopt()
functions in several places.
version 2 changes:
- reword the patch description
- add two patches for net/kcm and net/x25
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavrilov Ilia [Thu, 7 Mar 2024 14:23:50 +0000 (14:23 +0000)]
net/x25: fix incorrect parameter validation in the x25_getsockopt() function
The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.
To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.
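The effect of the cast can be demonstrated in userspace. TOY_MIN_T() approximates the kernel's min_t() by casting both operands before comparing, and the two toy_* functions are hypothetical stand-ins for the old and the fixed ordering of the checks, with -1 standing in for an error code.

```c
#include <stddef.h>

/* Userspace approximation of min_t(): cast both operands, then compare. */
#define TOY_MIN_T(type, a, b) \
	((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Old order: clamp first, then test the sign. The test can never fire. */
static int toy_check_after_clamp(int optlen)
{
	int len = TOY_MIN_T(unsigned int, optlen, sizeof(int));

	if (len < 0)
		return -1;	/* unreachable: len is at most sizeof(int) */
	return len;
}

/* Fixed order: test the signed value from the caller, then clamp. */
static int toy_check_before_clamp(int optlen)
{
	if (optlen < 0)
		return -1;	/* stand-in for an error code */
	return TOY_MIN_T(unsigned int, optlen, sizeof(int));
}
```

A negative optlen becomes a huge unsigned value under the cast, so the clamp silently turns it into sizeof(int) and the later sign check has nothing left to reject.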
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavrilov Ilia [Thu, 7 Mar 2024 14:23:50 +0000 (14:23 +0000)]
net: kcm: fix incorrect parameter validation in the kcm_getsockopt() function
The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.
To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.
Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavrilov Ilia [Thu, 7 Mar 2024 14:23:50 +0000 (14:23 +0000)]
udp: fix incorrect parameter validation in the udp_lib_getsockopt() function
The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.
To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavrilov Ilia [Thu, 7 Mar 2024 14:23:50 +0000 (14:23 +0000)]
l2tp: fix incorrect parameter validation in the pppol2tp_getsockopt() function
The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.
To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.
Fixes: 3557baabf280 ("[L2TP]: PPP over L2TP driver core")
Reviewed-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavrilov Ilia [Thu, 7 Mar 2024 14:23:50 +0000 (14:23 +0000)]
ipmr: fix incorrect parameter validation in the ip_mroute_getsockopt() function
The 'olr' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.
To fix the logic, check 'olr' as read from 'optlen',
where the types of relevant variables are (signed) int.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavrilov Ilia [Thu, 7 Mar 2024 14:23:49 +0000 (14:23 +0000)]
tcp: fix incorrect parameter validation in the do_tcp_getsockopt() function
The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.
To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 11 Mar 2024 09:36:11 +0000 (09:36 +0000)]
Merge branch 'qmc-hdlc'
Herve Codina says:
====================
Add support for QMC HDLC
This series introduces the QMC HDLC support.
Patches were previously sent as part of a full feature series and were
previously reviewed in that context:
"Add support for QMC HDLC, framer infrastructure and PEF2256 framer" [1]
In order to ease the merge, the full feature series has been split and
needed parts were merged in v6.8-rc1:
- "Prepare the PowerQUICC QMC and TSA for the HDLC QMC driver" [2]
- "Add support for framer infrastructure and PEF2256 framer" [3]
This series contains patches related to the QMC HDLC part (QMC HDLC
driver):
- Introduce the QMC HDLC driver (patches 1 and 2)
- Add timeslots change support in QMC HDLC (patch 3)
- Add framer support as a framer consumer in QMC HDLC (patch 4)
Compared to the original full feature series, patch 3 was modified to
use a coherent prefix in the commit title.
I kept the patches unsquashed as they were previously sent and reviewed.
Of course, I can squash them if needed.
Compared to the previous iteration:
https://lore.kernel.org/linux-kernel/20240306080726.167338-1-herve.codina@bootlin.com/
this v7 series mainly:
- Rename a variable.
- Fix reverse xmas tree declarations.
- Add 'Acked-by' tag.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Herve Codina [Thu, 7 Mar 2024 11:39:08 +0000 (12:39 +0100)]
net: wan: fsl_qmc_hdlc: Add framer support
Add framer support in the fsl_qmc_hdlc driver in order to be able to
signal carrier changes to the network stack based on the framer status.
Also use this framer to provide information related to the E1/T1 line
interface on IF_GET_IFACE and configure the line interface according to
IF_IFACE_{E1,T1} information.
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herve Codina [Thu, 7 Mar 2024 11:39:07 +0000 (12:39 +0100)]
net: wan: fsl_qmc_hdlc: Add runtime timeslots changes support
QMC channels support runtime timeslots changes but nothing is done at
the QMC HDLC driver to handle these changes.
Use existing IFACE ioctl in order to configure the timeslots to use.
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Thu, 7 Mar 2024 11:39:06 +0000 (12:39 +0100)]
lib/bitmap: Introduce bitmap_scatter() and bitmap_gather() helpers
These helpers scatter or gather a bitmap with the help of the mask
position bits parameter.
bitmap_scatter() does the following:
src:  0000000001011010
                ||||||
         +------+|||||
         |  +----+||||
         |  |+----+|||
         |  ||   +-+||
         |  ||   |  ||
mask: ...v..vv...v..vv
      ...0..11...0..10
dst:  0000001100000010
and bitmap_gather() performs this one:
mask: ...v..vv...v..vv
src:  0000001100000010
         ^  ^^   ^   0
         |  ||   |  10
         |  ||   > 010
         |  |+--> 1010
         |  +--> 11010
         +----> 011010
dst:  0000000000011010
bitmap_gather() can be seen as the reverse of the bitmap_scatter() operation.
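The two operations can be sketched for a single 32-bit word. This is a simplified illustration with hypothetical names (toy_scatter, toy_gather). The kernel helpers operate on multi-word bitmaps and take a bit count, which this sketch leaves out.

```c
#include <stdint.h>

/*
 * Place the low bits of 'src', in order, at the positions of the set
 * bits of 'mask' (lowest set bit first).
 */
static uint32_t toy_scatter(uint32_t src, uint32_t mask)
{
	uint32_t dst = 0;
	unsigned int n = 0;	/* index of the next source bit to consume */

	for (unsigned int bit = 0; bit < 32; bit++) {
		if (!(mask & (UINT32_C(1) << bit)))
			continue;
		if (src & (UINT32_C(1) << n))
			dst |= UINT32_C(1) << bit;
		n++;
	}
	return dst;
}

/*
 * Reverse operation: collect the bits of 'src' found at the set bits of
 * 'mask' and pack them into the low bits of the result.
 */
static uint32_t toy_gather(uint32_t src, uint32_t mask)
{
	uint32_t dst = 0;
	unsigned int n = 0;	/* index of the next destination bit to fill */

	for (unsigned int bit = 0; bit < 32; bit++) {
		if (!(mask & (UINT32_C(1) << bit)))
			continue;
		if (src & (UINT32_C(1) << bit))
			dst |= UINT32_C(1) << n;
		n++;
	}
	return dst;
}
```

With the values from the diagrams (mask 0x1313, the six 'v' positions), scattering 0x005a gives 0x0302 and gathering 0x0302 gives 0x001a, which is the source with everything above its low six bits dropped.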
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/lkml/20230926052007.3917389-3-andriy.shevchenko@linux.intel.com/
Co-developed-by: Herve Codina <herve.codina@bootlin.com>
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Acked-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herve Codina [Thu, 7 Mar 2024 11:39:05 +0000 (12:39 +0100)]
MAINTAINERS: Add the Freescale QMC HDLC driver entry
After contributing the driver, add myself as the maintainer for the
Freescale QMC HDLC driver.
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herve Codina [Thu, 7 Mar 2024 11:39:04 +0000 (12:39 +0100)]
net: wan: Add support for QMC HDLC
The QMC HDLC driver provides support for HDLC using the QMC (QUICC
Multichannel Controller) to transfer the HDLC data.
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 11 Mar 2024 09:33:01 +0000 (09:33 +0000)]
Merge branch '100GbE' of git://git./linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
ethtool: ice: Support for RSS settings to GTP
Takeru Hayasaka enables RSS functionality for GTP packets on ice driver
with ethtool.
A user can include TEID and make RSS work for GTP-U over IPv4 by doing the
following: `ethtool -N ens3 rx-flow-hash gtpu4 sde`
In addition to gtpu(4|6), we now support gtpc(4|6), gtpc(4|6)t, gtpu(4|6)e,
gtpu(4|6)u, and gtpu(4|6)d.
gtpc(4|6): Used for GTP-C in IPv4 and IPv6, where the GTP header format does
not include a TEID.
gtpc(4|6)t: Used for GTP-C in IPv4 and IPv6, with a GTP header format that
includes a TEID.
gtpu(4|6): Used for GTP-U in both IPv4 and IPv6 scenarios.
gtpu(4|6)e: Used for GTP-U with extended headers in both IPv4 and IPv6.
gtpu(4|6)u: Used when the PSC (PDU session container) in the GTP-U extended
header includes Uplink, applicable to both IPv4 and IPv6.
gtpu(4|6)d: Used when the PSC in the GTP-U extended header includes Downlink,
for both IPv4 and IPv6.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Sat, 9 Mar 2024 04:45:17 +0000 (20:45 -0800)]
Merge tag 'mlx5-socket-direct-v3' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Support Multi-PF netdev (Socket Direct)
This series adds support for combining multiple devices (PFs) of the
same port under one netdev instance. Passing traffic through different
devices belonging to different NUMA sockets saves cross-numa traffic and
allows apps running on the same netdev from different numas to still
feel a sense of proximity to the device and achieve improved
performance.
We achieve this by grouping PFs together, and creating the netdev only
once all group members are probed. Symmetrically, we destroy the netdev
once any of the PFs is removed.
The channels are distributed between all devices; a proper configuration
would use the closest NUMA node when working on a certain app/CPU.
We pick one device to be a primary (leader), and it fills a special
role. The other devices (secondaries) are disconnected from the network
at the chip level (set to silent mode). All RX/TX traffic is steered
through the primary to/from the secondaries.
Currently, we limit the support to PFs only, and up to two devices
(sockets).
* tag 'mlx5-socket-direct-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
Documentation: networking: Add description for multi-pf netdev
net/mlx5: Enable SD feature
net/mlx5e: Block TLS device offload on combined SD netdev
net/mlx5e: Support per-mdev queue counter
net/mlx5e: Support cross-vhca RSS
net/mlx5e: Let channels be SD-aware
net/mlx5e: Create EN core HW resources for all secondary devices
net/mlx5e: Create single netdev per SD group
net/mlx5: SD, Add debugfs
net/mlx5: SD, Add informative prints in kernel log
net/mlx5: SD, Implement steering for primary and secondaries
net/mlx5: SD, Implement devcom communication and primary election
net/mlx5: SD, Implement basic query and instantiation
net/mlx5: SD, Introduce SD lib
net/mlx5: Add MPIR bit in mcam_access_reg
====================
Link: https://lore.kernel.org/r/20240307084229.500776-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 9 Mar 2024 04:37:32 +0000 (20:37 -0800)]
Merge tag 'for-net-next-2024-03-08' of git://git./linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
- hci_conn: Only do ACL connections sequentially
- hci_core: Cancel request on command timeout
- Remove CONFIG_BT_HS
- btrtl: Add the support for RTL8852BT/RTL8852BE-VT
- btusb: Add support Mediatek MT7920
- btusb: Add new VID/PID 13d3/3602 for MT7925
- Add new quirk for broken read key length on ATS2851
* tag 'for-net-next-2024-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (52 commits)
Bluetooth: hci_sync: Fix UAF in hci_acl_create_conn_sync
Bluetooth: Fix eir name length
Bluetooth: ISO: Align broadcast sync_timeout with connection timeout
Bluetooth: Add new quirk for broken read key length on ATS2851
Bluetooth: mgmt: remove NULL check in add_ext_adv_params_complete()
Bluetooth: mgmt: remove NULL check in mgmt_set_connectable_complete()
Bluetooth: btusb: Add support Mediatek MT7920
Bluetooth: btmtk: Add MODULE_FIRMWARE() for MT7922
Bluetooth: btnxpuart: Fix btnxpuart_close
Bluetooth: ISO: Clean up returns values in iso_connect_ind()
Bluetooth: fix use-after-free in accessing skb after sending it
Bluetooth: af_bluetooth: Fix deadlock
Bluetooth: bnep: Fix out-of-bound access
Bluetooth: btusb: Fix memory leak
Bluetooth: msft: Fix memory leak
Bluetooth: hci_core: Fix possible buffer overflow
Bluetooth: btrtl: fix out of bounds memory access
Bluetooth: hci_h5: Add ability to allocate memory for private data
Bluetooth: hci_sync: Fix overwriting request callback
Bluetooth: hci_sync: Use QoS to determine which PHY to scan
...
====================
Link: https://lore.kernel.org/r/20240308181056.120547-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 9 Mar 2024 04:35:32 +0000 (20:35 -0800)]
Merge tag 'ieee802154-for-net-next-2024-03-07' of git://git./linux/kernel/git/wpan/wpan-next
Stefan Schmidt says:
====================
pull-request: ieee802154-next 2024-03-07
Various cross tree patches for ieee802154 drivers and a resource leak
fix for ieee802154 llsec.
Andy Shevchenko changed GPIO header usage for at86rf230 and mcr20a to
only include needed headers.
Bo Liu converted the at86rf230, mcr20a and mrf24j40 driver regmap
support to use the maple tree register cache.
Fedor Pchelkin fixed a resource leak in the llsec key deletion path.
Ricardo B. Marliere made wpan_phy_class const.
Tejun Heo removed WQ_UNBOUND from a workqueue call in ca8210.
* tag 'ieee802154-for-net-next-2024-03-07' of git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan-next:
ieee802154: cfg802154: make wpan_phy_class constant
ieee802154: mcr20a: Remove unused of_gpio.h
ieee802154: at86rf230: Replace of_gpio.h by proper one
mac802154: fix llsec key resources release in mac802154_llsec_key_del
ieee802154: ca8210: Drop spurious WQ_UNBOUND from alloc_ordered_workqueue() call
net: ieee802154: mrf24j40: convert to use maple tree register cache
net: ieee802154: mcr20a: convert to use maple tree register cache
net: ieee802154: at86rf230: convert to use maple tree register cache
====================
Link: https://lore.kernel.org/r/20240307195105.292085-1-stefan@datenfreihafen.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Colin Ian King [Fri, 8 Mar 2024 08:44:58 +0000 (08:44 +0000)]
tools: ynl: Fix spelling mistake "Constructred" -> "Constructed"
There is a spelling mistake in an error message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20240308084458.2045266-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Thu, 7 Mar 2024 16:30:20 +0000 (16:30 +0000)]
ipv4: raw: check sk->sk_rcvbuf earlier
There is no point cloning an skb and having to free the clone
if the receive queue of the raw socket is full.
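The ordering this patch argues for can be sketched in user space as follows. This is only an illustrative model, not the kernel code: RawSock, deliver() and the exact fullness test are assumptions made for the example, with rcvbuf and rmem_alloc standing in for sk->sk_rcvbuf and sk->sk_rmem_alloc.

```python
from dataclasses import dataclass, field


@dataclass
class RawSock:
    """Hypothetical stand-in for a raw socket's receive side."""
    rcvbuf: int                 # receive buffer limit in bytes
    rmem_alloc: int = 0         # bytes currently sitting in the queue
    queue: list = field(default_factory=list)
    clones_made: int = 0        # how many times we paid for a clone


def deliver(sk: RawSock, payload: bytes) -> bool:
    """Queue a copy of payload on sk; return False if it was dropped."""
    # Check the receive buffer first, so a full queue costs no clone at all.
    if sk.rmem_alloc >= sk.rcvbuf:
        return False
    clone = bytes(payload)      # stands in for skb_clone()
    sk.clones_made += 1
    sk.queue.append(clone)
    sk.rmem_alloc += len(clone)
    return True
```

With the check placed before the copy, a packet that arrives at a full socket is dropped without clones_made ever being incremented.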
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240307163020.2524409-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Thu, 7 Mar 2024 16:29:43 +0000 (16:29 +0000)]
ipv6: raw: check sk->sk_rcvbuf earlier
There is no point cloning an skb and having to free the clone
if the receive queue of the raw socket is full.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240307162943.2523817-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Thu, 7 Mar 2024 15:47:27 +0000 (17:47 +0200)]
nexthop: Simplify dump error handling
The only error that can happen during a nexthop dump is insufficient
space in the skb carrying the netlink messages (EMSGSIZE). If this happens
and some messages were already filled in, the nexthop code returns the
skb length to signal the netlink core that more objects need to be
dumped.
After commit b5a899154aa9 ("netlink: handle EMSGSIZE errors in the
core") there is no need to handle this error in the nexthop code as it
is now handled in the core.
Simplify the code and simply return the error to the core.
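The division of labour described above can be modelled roughly as below. This is a sketch under assumptions, not the kernel implementation: dump_nexthops() and core_dump_pass() are hypothetical names, the skb is modelled as a plain list, and "capacity" counts entries rather than bytes.

```python
import errno

EMSGSIZE = errno.EMSGSIZE


def dump_nexthops(skb, capacity, nexthops, start):
    """Hypothetical dump callback: append entries until the buffer is full.

    Returns (result, next_index). The result is 0 when everything was dumped
    and -EMSGSIZE when the buffer ran out of space; the callback no longer
    converts that error into a length itself.
    """
    idx = start
    while idx < len(nexthops):
        if len(skb) >= capacity:
            return -EMSGSIZE, idx
        skb.append(nexthops[idx])
        idx += 1
    return 0, idx


def core_dump_pass(capacity, nexthops, start):
    """Hypothetical core: run one dump pass and interpret the result."""
    skb = []
    err, next_index = dump_nexthops(skb, capacity, nexthops, start)
    # -EMSGSIZE with data already in the buffer means "more to come",
    # so the core reports the amount dumped instead of an error.
    if err == -EMSGSIZE and skb:
        err = len(skb)
    return err, skb, next_index
```

A pass that fills the buffer yields a positive result and a resume index; a pass that could not fit even one entry still surfaces -EMSGSIZE to the caller.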
No regressions in nexthop tests:
# ./fib_nexthops.sh
Tests passed: 234
Tests failed: 0
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240307154727.3555462-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Thu, 7 Mar 2024 12:34:46 +0000 (12:34 +0000)]
net: add skb_data_unref() helper
Similar to skb_unref(), add skb_data_unref() to save an expensive
atomic operation (and cache line dirtying) when the last reference
on shinfo->dataref is released.
I saw this opportunity on hosts with RAW sockets accidentally
bound to UDP protocol, forcing an skb_clone() on all received packets.
These RAW sockets had their receive queue full, so all clone
packets were immediately dropped.
When UDP recvmsg() later consumes the original skb, skb_release_data()
is hitting atomic_sub_return() quite badly, because skb->cloned
has been set permanently.
Note that this patch helps TCP TX performance, because
the TCP stack also uses (fast) clones.
This means that at least one of the two packets (the main skb or
its clone) will no longer have to perform this atomic operation
in skb_release_data().
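The saving can be illustrated with a small user-space model. It is a sketch only: SharedInfo and data_unref() are hypothetical names, the counter merely tallies simulated atomic read-modify-write operations, and details of the real helper (such as the header/payload reference split) are deliberately left out.

```python
class SharedInfo:
    """Hypothetical stand-in for the shared data area of an skb."""

    def __init__(self, dataref=1):
        self.dataref = dataref
        self.atomic_rmw_ops = 0     # simulated atomic read-modify-write count

    def atomic_sub_return(self, n):
        # Models the expensive operation the patch tries to avoid.
        self.atomic_rmw_ops += 1
        self.dataref -= n
        return self.dataref


def data_unref(shinfo, cloned):
    """Return True when the caller held the last reference to the data."""
    if not cloned:
        return True                 # data was never shared
    if shinfo.dataref == 1:
        return True                 # plain read: last reference, no RMW needed
    return shinfo.atomic_sub_return(1) == 0
```

For an skb and its clone (dataref of 2), the first release performs one atomic decrement and the second release takes the read-only path, which matches the "at least one of the two packets" wording above.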
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240307123446.2302230-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 8 Mar 2024 17:05:48 +0000 (09:05 -0800)]
Merge tag 'wireless-next-2024-03-08' of git://git./linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.9
The fourth "new features" pull request for v6.9 with changes both in
stack and in drivers. The theme in this pull request is to fix sparse
warnings but we still have some left in the wireless subsystem. Otherwise
quite normal.
Major changes:
rtw89
* NL80211_EXT_FEATURE_SCAN_RANDOM_SN support
* NL80211_EXT_FEATURE_SET_SCAN_DWELL support
rtw88
* support for more rtw8811cu and rtw8821cu devices
mt76
* mt76x2u: add Netgear WNDA3100v3 USB
* mt7915: newer ADIE version support
* mt7925: radio temperature sensor support
* mt7996: remove GCMP IGTK offload
* tag 'wireless-next-2024-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (125 commits)
wifi: rtw89: wow: move release offload packet earlier for WoWLAN mode
wifi: rtw89: wow: set security engine options for 802.11ax chips only
wifi: rtw89: update suspend/resume for different generation
wifi: rtw89: wow: update config mac function with different generation
wifi: rtw89: update DMA function with different generation
wifi: rtw89: wow: update WoWLAN status register for different generation
wifi: rtw89: wow: update WoWLAN reason register for different chips
wifi: brcm80211: handle pmk_op allocation failure
wifi: rtw89: coex: Add coexistence policy to decrease WiFi packet CRC-ERR
wifi: rtw89: coex: When Bluetooth not available don't set power/gain
wifi: rtw89: coex: add return value to ensure H2C command is success or not
wifi: rtw89: coex: Reorder H2C command index to align with firmware
wifi: rtw89: coex: add BTC ctrl_info version 7 and related logic
wifi: rtw89: coex: add init_info H2C command format version 7
wifi: rtw89: 8922a: add coexistence helpers of SW grant
wifi: rtw89: mac: add coexistence helpers {cfg/get}_plt
wifi: cw1200: restore endian swapping
wifi: wlcore: sdio: Rate limit wl12xx_sdio_raw_{read,write}() failures warns
wifi: rtlwifi: Remove rtl_intf_ops.read_efuse_byte
wifi: rtw88: 8821c: Fix false alarm count
...
====================
Link: https://lore.kernel.org/r/20240308100429.B8EA2C433F1@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Luiz Augusto von Dentz [Fri, 8 Mar 2024 16:02:48 +0000 (11:02 -0500)]
Bluetooth: hci_sync: Fix UAF in hci_acl_create_conn_sync
This fixes the following error caused by hci_conn being freed while
hci_acl_create_conn_sync is pending:
==================================================================
BUG: KASAN: slab-use-after-free in hci_acl_create_conn_sync+0xa7/0x2e0
Write of size 2 at addr ffff888002ae0036 by task kworker/u3:0/848
CPU: 0 PID: 848 Comm: kworker/u3:0 Not tainted 6.8.0-rc6-g2ab3e8d67fc1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
Workqueue: hci0 hci_cmd_sync_work
Call Trace:
<TASK>
dump_stack_lvl+0x21/0x70
print_report+0xce/0x620
? preempt_count_sub+0x13/0xc0
? __virt_addr_valid+0x15f/0x310
? hci_acl_create_conn_sync+0xa7/0x2e0
kasan_report+0xdf/0x110
? hci_acl_create_conn_sync+0xa7/0x2e0
hci_acl_create_conn_sync+0xa7/0x2e0
? __pfx_hci_acl_create_conn_sync+0x10/0x10
? __pfx_lock_release+0x10/0x10
? __pfx_hci_acl_create_conn_sync+0x10/0x10
hci_cmd_sync_work+0x138/0x1c0
process_one_work+0x405/0x800
? __pfx_lock_acquire+0x10/0x10
? __pfx_process_one_work+0x10/0x10
worker_thread+0x37b/0x670
? __pfx_worker_thread+0x10/0x10
kthread+0x19b/0x1e0
? kthread+0xfe/0x1e0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2f/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Allocated by task 847:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
__kasan_kmalloc+0x8f/0xa0
hci_conn_add+0xc6/0x970
hci_connect_acl+0x309/0x410
pair_device+0x4fb/0x710
hci_sock_sendmsg+0x933/0xef0
sock_write_iter+0x2c3/0x2d0
do_iter_readv_writev+0x21a/0x2e0
vfs_writev+0x21c/0x7b0
do_writev+0x14a/0x180
do_syscall_64+0x77/0x150
entry_SYSCALL_64_after_hwframe+0x6c/0x74
Freed by task 847:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x60
__kasan_slab_free+0xfa/0x150
kfree+0xcb/0x250
device_release+0x58/0xf0
kobject_put+0xbb/0x160
hci_conn_del+0x281/0x570
hci_conn_hash_flush+0xfc/0x130
hci_dev_close_sync+0x336/0x960
hci_dev_close+0x10e/0x140
hci_sock_ioctl+0x14a/0x5c0
sock_ioctl+0x58a/0x5d0
__x64_sys_ioctl+0x480/0xf60
do_syscall_64+0x77/0x150
entry_SYSCALL_64_after_hwframe+0x6c/0x74
Fixes: 45340097ce6e ("Bluetooth: hci_conn: Only do ACL connections sequentially")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Frédéric Danis [Thu, 7 Mar 2024 16:42:05 +0000 (17:42 +0100)]
Bluetooth: Fix eir name length
According to Section 1.2 of Core Specification Supplement Part A the
complete or short name strings are defined as utf8s, which should not
include the trailing NULL for variable length array as defined in Core
Specification Vol1 Part E Section 2.9.3.
Removing the trailing NULL allows PTS to retrieve the random address based
on device name, e.g. for SM/PER/KDU/BV-02-C, SM/PER/KDU/BV-08-C or
GAP/BROB/BCST/BV-03-C.
Fixes: f61851f64b17 ("Bluetooth: Fix append max 11 bytes of name to scan rsp data")
Signed-off-by: Frédéric Danis <frederic.danis@collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
David S. Miller [Fri, 8 Mar 2024 12:01:33 +0000 (12:01 +0000)]
Merge branch 'hns3-fixes'
Jijie Shao says:
====================
There are some bugfixes for the HNS3 ethernet driver
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 7 Mar 2024 01:01:15 +0000 (09:01 +0800)]
net: hns3: add checking for vf id of mailbox
Add checking for vf id of mailbox, in order to avoid array
out-of-bounds risk.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jie Wang [Thu, 7 Mar 2024 01:01:14 +0000 (09:01 +0800)]
net: hns3: fix port duplex configure error in IMP reset
Currently, the MAC port is always configured in full duplex mode in
hclge_mac_init() during driver initialization or reset restore. Users may
change the mode to half duplex with ethtool, so the user configuration
may be lost after a reset.
To fix it, don't change the duplex mode when resetting.
Fixes: 2d03eacc0b7e ("net: hns3: Only update mac configuation when necessary")
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peiyang Wang [Thu, 7 Mar 2024 01:01:13 +0000 (09:01 +0800)]
net: hns3: fix reset timeout under full functions and queues
The cmdq reset command times out when all VFs are enabled and the queue is
full. The hardware processing time exceeds the timeout set by the driver.
In order to avoid the above extreme situations, the driver extends the
reset timeout to 1 second.
Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jijie Shao [Thu, 7 Mar 2024 01:01:12 +0000 (09:01 +0800)]
net: hns3: fix delete tc fail issue
When a tc is removed during reset, the hns3 driver returns an error code,
but the kernel ignores it. As a result,
the driver status is inconsistent with the kernel status.
This patch retains the deletion status when the deletion fails
and continues to delete after the reset to ensure that
the status of the driver is consistent with that of kernel.
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yonglong Liu [Thu, 7 Mar 2024 01:01:11 +0000 (09:01 +0800)]
net: hns3: fix kernel crash when 1588 is received on HIP08 devices
HIP08 devices do not register a PTP device, so
hdev->ptp is NULL. The hardware can still receive 1588 messages
and set the HNS3_RXD_TS_VLD_B bit; in that case, the
access to hdev->ptp->flags causes a kernel crash:
[ 5888.946472] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
[ 5888.946475] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
...
[ 5889.266118] pc : hclge_ptp_get_rx_hwts+0x40/0x170 [hclge]
[ 5889.272612] lr : hclge_ptp_get_rx_hwts+0x34/0x170 [hclge]
[ 5889.279101] sp : ffff800012c3bc50
[ 5889.283516] x29: ffff800012c3bc50 x28: ffff2040002be040
[ 5889.289927] x27: ffff800009116484 x26: 0000000080007500
[ 5889.296333] x25: 0000000000000000 x24: ffff204001c6f000
[ 5889.302738] x23: ffff204144f53c00 x22: 0000000000000000
[ 5889.309134] x21: 0000000000000000 x20: ffff204004220080
[ 5889.315520] x19: ffff204144f53c00 x18: 0000000000000000
[ 5889.321897] x17: 0000000000000000 x16: 0000000000000000
[ 5889.328263] x15: 0000004000140ec8 x14: 0000000000000000
[ 5889.334617] x13: 0000000000000000 x12: 00000000010011df
[ 5889.340965] x11: bbfeff4d22000000 x10: 0000000000000000
[ 5889.347303] x9 : ffff800009402124 x8 : 0200f78811dfbb4d
[ 5889.353637] x7 : 2200000000191b01 x6 : ffff208002a7d480
[ 5889.359959] x5 : 0000000000000000 x4 : 0000000000000000
[ 5889.366271] x3 : 0000000000000000 x2 : 0000000000000000
[ 5889.372567] x1 : 0000000000000000 x0 : ffff20400095c080
[ 5889.378857] Call trace:
[ 5889.382285] hclge_ptp_get_rx_hwts+0x40/0x170 [hclge]
[ 5889.388304] hns3_handle_bdinfo+0x324/0x410 [hns3]
[ 5889.394055] hns3_handle_rx_bd+0x60/0x150 [hns3]
[ 5889.399624] hns3_clean_rx_ring+0x84/0x170 [hns3]
[ 5889.405270] hns3_nic_common_poll+0xa8/0x220 [hns3]
[ 5889.411084] napi_poll+0xcc/0x264
[ 5889.415329] net_rx_action+0xd4/0x21c
[ 5889.419911] __do_softirq+0x130/0x358
[ 5889.424484] irq_exit+0x134/0x154
[ 5889.428700] __handle_domain_irq+0x88/0xf0
[ 5889.433684] gic_handle_irq+0x78/0x2c0
[ 5889.438319] el1_irq+0xb8/0x140
[ 5889.442354] arch_cpu_idle+0x18/0x40
[ 5889.446816] default_idle_call+0x5c/0x1c0
[ 5889.451714] cpuidle_idle_call+0x174/0x1b0
[ 5889.456692] do_idle+0xc8/0x160
[ 5889.460717] cpu_startup_entry+0x30/0xfc
[ 5889.465523] secondary_start_kernel+0x158/0x1ec
[ 5889.470936] Code: 97ffab78 f9411c14 91408294 f9457284 (f9400c80)
[ 5889.477950] SMP: stopping secondary CPUs
[ 5890.514626] SMP: failed to stop secondary CPUs 0-69,71-95
[ 5890.522951] Starting crashdump kernel...
Fixes: 0bf5eb788512 ("net: hns3: add support for PTP")
Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hao Lan [Thu, 7 Mar 2024 01:01:10 +0000 (09:01 +0800)]
net: hns3: Disable SerDes serial loopback for HiLink H60
When the HiLink version is H60, the SerDes serial loopback test is not
supported. This patch adds HiLink version detection. When the version
is H60, the SerDes serial loopback test is disabled.
Signed-off-by: Hao Lan <lanhao@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hao Lan [Thu, 7 Mar 2024 01:01:09 +0000 (09:01 +0800)]
net: hns3: add new 200G link modes for hisilicon device
The hisilicon device now supports a new 200G link interface,
which is queried from firmware via a new bit. Therefore,
the HCLGE_SUPPORT_200G_R4_BIT capability bit has been added.
The HCLGE_SUPPORT_200G_BIT has been renamed as
HCLGE_SUPPORT_200G_R4_EXT_BIT, and the firmware has
extended support for this mode.
Fixes: ae6f010cb1a7 ("net: hns3: add support for 200G device")
Signed-off-by: Hao Lan <lanhao@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jijie Shao [Thu, 7 Mar 2024 01:01:08 +0000 (09:01 +0800)]
net: hns3: fix wrong judgment condition issue
hns3_dcbnl_ieee_delapp() should check ieee_delapp, not ieee_setapp.
This patch fixes the wrong judgment.
Fixes: 0ba22bcb222d ("net: hns3: add support config dscp map to tc")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 8 Mar 2024 11:54:35 +0000 (11:54 +0000)]
Merge branch 'ionic-diet'
Shannon Nelson says:
====================
ionic: putting ionic on a diet
Building on the performance work done in the previous patchset
[Link] https://lore.kernel.org/netdev/20240229193935.14197-1-shannon.nelson@amd.com/
this patchset puts the ionic driver on a diet, decreasing the memory
requirements per queue, and simplifies a few more bits of logic.
We trimmed the queue management structs and gained some ground, but
the most savings came from trimming the individual buffer descriptors.
The original design used a single generic buffer descriptor for Tx, Rx and
Adminq needs, but the Rx and Adminq descriptors really don't need all the
info that the Tx descriptors track. By splitting up the descriptor types
we can significantly reduce the descriptor sizes for Rx and Adminq use.
There is a small reduction in the queue management structs, saving about
3 cachelines per queuepair:
ionic_qcq:
Before: /* size: 2176, cachelines: 34, members: 23 */
After: /* size: 2048, cachelines: 32, members: 23 */
We also remove an array of completion descriptor pointers, or about
8 Kbytes per queue.
But the biggest savings came from splitting the desc_info struct into
queue specific structs and trimming out what was unnecessary.
Before:
ionic_desc_info:
/* size: 496, cachelines: 8, members: 10 */
After:
ionic_tx_desc_info:
/* size: 496, cachelines: 8, members: 6 */
ionic_rx_desc_info:
/* size: 224, cachelines: 4, members: 2 */
ionic_admin_desc_info:
/* size: 8, cachelines: 1, members: 1 */
In a 64 core host the ionic driver will default to 64 queuepairs of
1024 descriptors for Rx, 1024 for Tx, and 80 for Adminq and Notifyq.
The total memory usage for 64 queues:
Before:
65 * sizeof(ionic_qcq) 141,440
+ 64 * 1024 * sizeof(ionic_desc_info) 32,505,856
+ 64 * 1024 * sizeof(ionic_desc_info) 32,505,856
+ 64 * 1024 * 2 * sizeof(ionic_cq_info) 16,384
+ 1 * 80 * sizeof(ionic_desc_info) 39,690
----------
65,201,038
After:
65 * sizeof(ionic_qcq) 133,120
+ 64 * 1024 * sizeof(ionic_tx_desc_info) 32,505,856
+ 64 * 1024 * sizeof(ionic_rx_desc_info) 14,680,064
+ (removed) 0
+ 1 * 80 * sizeof(ionic_admin_desc_info) 640
----------
47,319,680
This saves us approximately 18 Mbytes per port in a 64 core machine,
a 28% savings in our memory needs.
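The "After" column can be re-derived from the struct sizes quoted in this cover letter. The snippet below is only a worked check, not part of the series: the constant and function names are made up for the example, and the "Before" total is taken as stated above rather than recomputed.

```python
# Struct sizes in bytes, as quoted in the "After" pahole output above.
SIZEOF_QCQ = 2048
SIZEOF_TX_DESC_INFO = 496
SIZEOF_RX_DESC_INFO = 224
SIZEOF_ADMIN_DESC_INFO = 8

QUEUEPAIRS = 64          # default on a 64 core host
DESCS_PER_QUEUE = 1024   # Rx and Tx descriptors per queue
ADMIN_DESCS = 80

BEFORE_TOTAL_AS_STATED = 65_201_038


def total_after():
    """Sum the per-port memory usage rows of the 'After' table."""
    qcq = (QUEUEPAIRS + 1) * SIZEOF_QCQ
    tx = QUEUEPAIRS * DESCS_PER_QUEUE * SIZEOF_TX_DESC_INFO
    rx = QUEUEPAIRS * DESCS_PER_QUEUE * SIZEOF_RX_DESC_INFO
    admin = ADMIN_DESCS * SIZEOF_ADMIN_DESC_INFO
    return qcq + tx + rx + admin


def bytes_saved():
    """Difference between the stated 'Before' total and the derived 'After' total."""
    return BEFORE_TOTAL_AS_STATED - total_after()
```

Most of the difference comes from the Rx row, where each descriptor info entry shrinks from 496 to 224 bytes.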
In addition, this improves our simple single thread / single queue
iperf case on a 9100 MTU connection from 86.7 to 95 Gbits/sec.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 6 Mar 2024 23:29:59 +0000 (15:29 -0800)]
ionic: keep stats struct local to error handling
When possible, keep the stats struct references strictly
in the error handling blocks and out of the fastpath.
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 6 Mar 2024 23:29:58 +0000 (15:29 -0800)]
ionic: better dma-map error handling
Fix up a couple of small dma_addr handling issues
- don't double-count dma-map-err stat in ionic_tx_map_skb()
or ionic_xdp_post_frame()
- return 0 on error from both ionic_tx_map_single() and
ionic_tx_map_frag() and check for !dma_addr in ionic_tx_map_skb()
and ionic_xdp_post_frame()
- be sure to unmap buf_info[0] in ionic_tx_map_skb() error path
- don't assign rx buf->dma_addr until error checked in ionic_rx_page_alloc()
- remove unnecessary dma_addr_t casts
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 6 Mar 2024 23:29:57 +0000 (15:29 -0800)]
ionic: remove unnecessary NULL test
We call ionic_rx_page_alloc() only on existing buf_info structs from
ionic_rx_fill(). There's no need for the additional NULL test.
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 6 Mar 2024 23:29:56 +0000 (15:29 -0800)]
ionic: rearrange ionic_queue for better layout
A simple change to the struct ionic_queue layout removes some
unnecessary padding and saves us a cacheline in the struct
ionic_qcq layout.
struct ionic_queue {
Before: /* size: 256, cachelines: 4, members: 29 */
After: /* size: 192, cachelines: 3, members: 29 */
struct ionic_qcq {
Before: /* size: 2112, cachelines: 33, members: 23 */
After: /* size: 2048, cachelines: 32, members: 23 */
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 6 Mar 2024 23:29:55 +0000 (15:29 -0800)]
ionic: rearrange ionic_qcq
Rearrange a few fields for better cache use and to put the
flags field up into the first cacheline rather than the last.
struct ionic_qcq
Before: /* size: 2176, cachelines: 34, members: 23 */
After: /* size: 2112, cachelines: 33, members: 23 */
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 6 Mar 2024 23:29:54 +0000 (15:29 -0800)]
ionic: carry idev in ionic_cq struct
Remove the idev field from ionic_queue, which saves us a
bit of space, and add it into ionic_cq where there's room
within some cacheline padding. Use this pointer rather
than doing a multi level reference from lif->ionic.
Suggested-by: Neel Patel <npatel2@amd.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 6 Mar 2024 23:29:53 +0000 (15:29 -0800)]
ionic: refactor skb building
The existing ionic_rx_frags() code is a bit of a mess and can
be cleaned up by unrolling the first frag/header setup from
the loop, then reworking the do-while-loop into a for-loop. We
rename the function to a more descriptive ionic_rx_build_skb().
We also change a couple of related variable names for readability.
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 6 Mar 2024 23:29:52 +0000 (15:29 -0800)]
ionic: fold adminq clean into service routine
Since the AdminQ clean is a simple action called from only
one place, fold it back into the service routine.
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 6 Mar 2024 23:29:51 +0000 (15:29 -0800)]
ionic: use specialized desc info structs
Make desc_info structure specific to the queue type, which
allows us to cut down the Rx and AdminQ descriptor sizes by
not including all the fields needed for the Tx descriptors.
Before:
struct ionic_desc_info {
/* size: 464, cachelines: 8, members: 6 */
After:
struct ionic_tx_desc_info {
/* size: 464, cachelines: 8, members: 6 */
struct ionic_rx_desc_info {
/* size: 224, cachelines: 4, members: 2 */
struct ionic_admin_desc_info {
/* size: 8, cachelines: 1, members: 1 */
Suggested-by: Neel Patel <npatel2@amd.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 6 Mar 2024 23:29:50 +0000 (15:29 -0800)]
ionic: remove the cq_info to save more memory
With a little simple math we don't need another struct array to
find the completion structs, so we can remove the ionic_cq_info
altogether. This doesn't really save anything in the ionic_cq
since it gets padded out to the cacheline, but it does remove
the parallel array allocation of 8 * num_descriptors, or about
8 Kbytes per queue in a default configuration.
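The "simple math" can be pictured as plain index arithmetic. This is a sketch under assumptions rather than the driver code: the function names are invented, and the completion descriptor size passed to comp_offset() is a caller-supplied placeholder, not a value taken from the driver.

```python
CQ_INFO_ENTRY_SIZE = 8   # bytes per entry of the removed array ("8 * num_descriptors")


def cq_info_array_bytes(num_descriptors):
    """Size of the parallel array that no longer needs to be allocated."""
    return CQ_INFO_ENTRY_SIZE * num_descriptors


def comp_offset(index, comp_desc_size):
    """Byte offset of completion descriptor `index` from the ring base.

    With a contiguous ring of fixed-size entries, the location follows from
    the index alone, so no per-entry pointer has to be stored.
    """
    return index * comp_desc_size
```

For a queue of 1024 descriptors, cq_info_array_bytes(1024) gives the 8 Kbytes mentioned above.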
Suggested-by: Neel Patel <npatel2@amd.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 6 Mar 2024 23:29:49 +0000 (15:29 -0800)]
ionic: remove callback pointer from desc_info
By reworking the queue service routines to have their own
servicing loops we can remove the cb pointer from desc_info
to save another 8 bytes per descriptor.
This simplifies some of the queue handling indirection and makes
the code a little easier to follow, and keeps service code in
one place rather than jumping between code files.
struct ionic_desc_info
Before: /* size: 472, cachelines: 8, members: 7 */
After: /* size: 464, cachelines: 8, members: 6 */
Suggested-by: Neel Patel <npatel2@amd.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 6 Mar 2024 23:29:48 +0000 (15:29 -0800)]
ionic: move adminq-notifyq handling to main file
Move the AdminQ and NotifyQ queue handling to ionic_main.c with
the rest of the adminq code.
Suggested-by: Neel Patel <npatel2@amd.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 6 Mar 2024 23:29:47 +0000 (15:29 -0800)]
ionic: drop q mapping
Now that we're not using desc_info pointers mapped in every q
we can simplify and drop the unnecessary utility functions.
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 6 Mar 2024 23:29:46 +0000 (15:29 -0800)]
ionic: remove desc, sg_desc and cmb_desc from desc_info
Remove the struct pointers from desc_info to use less space.
Instead of pointers in every desc_info to its descriptor,
we can use the queue descriptor index to find the individual
desc, desc_info, and sgl structs in their parallel arrays.
struct ionic_desc_info
Before: /* size: 496, cachelines: 8, members: 10 */
After: /* size: 472, cachelines: 8, members: 7 */
Suggested-by: Neel Patel <npatel2@amd.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 8 Mar 2024 11:43:21 +0000 (11:43 +0000)]
Merge branch '40GbE' of git://git./linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2024-03-06 (iavf, i40e, ixgbe)
This series contains updates to iavf, i40e, and ixgbe drivers.
Alexey Kodanev removes duplicate calls related to cloud filters on iavf
and unnecessary null checks on i40e.
Maciej adds helper functions for common code relating to updating
statistics for ixgbe.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 6 Mar 2024 15:47:03 +0000 (07:47 -0800)]
Add Jeff Kirsher to .get_maintainer.ignore
Jeff was retired as the Intel driver maintainer in
commit 6667df916fce ("MAINTAINERS: Update MAINTAINERS for
Intel ethernet drivers"), and his address bounces.
But he has signed-off a lot of patches over the years
so get_maintainer insists on CCing him.
We haven't heard from him since he left Intel, so remapping
the address via mailmap is also pointless. Add to ignored
addresses.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 8 Mar 2024 11:15:36 +0000 (11:15 +0000)]
Merge branch 'ipv6-lockless-dump-addrs'
Eric Dumazet says:
====================
ipv6: lockless inet6_dump_addr()
This series removes RTNL locking to dump ipv6 addresses.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 6 Mar 2024 15:51:44 +0000 (15:51 +0000)]
ipv6: remove RTNL protection from inet6_dump_addr()
We can now remove RTNL acquisition while running
inet6_dump_addr(), inet6_dump_ifmcaddr()
and inet6_dump_ifacaddr().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 6 Mar 2024 15:51:43 +0000 (15:51 +0000)]
ipv6: use xa_array iterator to implement inet6_dump_addr()
inet6_dump_addr() can use the new xa_array iterator
for better scalability.
Make it ready for RCU-only protection.
RTNL use is removed in the following patch.
Also properly return 0 at the end of a dump to avoid
an extra recvmsg() to get NLMSG_DONE.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 6 Mar 2024 15:51:42 +0000 (15:51 +0000)]
ipv6: make in6_dump_addrs() lockless
in6_dump_addrs() is called with RCU protection.
There is no need to hold idev->lock to iterate through unicast addresses.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 6 Mar 2024 15:51:41 +0000 (15:51 +0000)]
ipv6: make inet6_fill_ifaddr() lockless
Make inet6_fill_ifaddr() lockless, and add appropriate annotations
on ifa->tstamp, ifa->valid_lft, ifa->preferred_lft, ifa->ifa_proto
and ifa->rt_priority.
Also constify 2nd argument of inet6_fill_ifaddr(), inet6_fill_ifmcaddr()
and inet6_fill_ifacaddr().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 8 Mar 2024 10:56:05 +0000 (10:56 +0000)]
Merge tag 'ipsec-next-2024-03-06' of git://git./linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
1) Introduce forwarding of ICMP Error messages. That is specified
in RFC 4301 but was never implemented. From Antony Antony.
2) Use KMEM_CACHE instead of kmem_cache_create in xfrm6_tunnel_init()
and xfrm_policy_init(). From Kunwu Chan.
3) Do not allocate stats in the xfrm interface driver, this can be done
on net core now. From Breno Leitao.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 8 Mar 2024 10:35:48 +0000 (10:35 +0000)]
Merge branch 'nexthop-group-stats'
Petr Machata says:
====================
Support for nexthop group statistics
ECMP is a fundamental component in L3 designs. However, it's fragile. Many
factors influence whether an ECMP group will operate as intended: hash
policy (i.e. the set of fields that contribute to ECMP hash calculation),
neighbor validity, hash seed (which might lead to polarization) or the type
of ECMP group used (hash-threshold or resilient).
At the same time, collection of statistics that would help an operator
determine that the group performs as desired, is difficult.
A solution that we present in this patchset is to add counters to next hop
group entries. For SW-datapath deployments, this will on its own allow
collection and evaluation of relevant statistics. For HW-datapath
deployments, we further add a way to request that HW counters be installed
for a given group, in-kernel interfaces to collect the HW statistics, and
netlink interfaces to query them.
For example:
# ip nexthop replace id 4000 group 4001/4002 hw_stats on
# ip -s -d nexthop show id 4000
id 4000 group 4001/4002 scope global proto unspec offload hw_stats on used on
stats:
id 4001 packets 5002 packets_hw 5000
id 4002 packets 4999 packets_hw 4999
The point of the patchset is visibility of ECMP balance, and that is
influenced by packet headers, not their payload. Correspondingly, we only
include packet counters in the statistics, not byte counters.
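As an illustration of how the per-nexthop packet counters can be turned into a balance figure, here is a minimal Python sketch. It assumes the stats lines have exactly the textual shape shown in the example above; the helper names parse_group_stats() and shares() are hypothetical and not part of iproute2 or the kernel.

```python
import re

# Sample text copied from the example output above.
SAMPLE = """\
id 4000 group 4001/4002 scope global proto unspec offload hw_stats on used on
stats:
id 4001 packets 5002 packets_hw 5000
id 4002 packets 4999 packets_hw 4999
"""

# Matches only the per-member stats lines; the group line has "group" after the ID.
STATS_RE = re.compile(r"\s*id (\d+) packets (\d+)(?: packets_hw (\d+))?\s*$")


def parse_group_stats(text):
    """Return {nexthop_id: {"packets": n, "packets_hw": m or None}}."""
    stats = {}
    for line in text.splitlines():
        m = STATS_RE.match(line)
        if not m:
            continue
        nh_id, packets, packets_hw = m.groups()
        stats[int(nh_id)] = {
            "packets": int(packets),
            "packets_hw": int(packets_hw) if packets_hw is not None else None,
        }
    return stats


def shares(stats, key="packets"):
    """Fraction of the group's packets attributed to each member nexthop."""
    total = sum(s[key] for s in stats.values())
    return {nh: (s[key] / total if total else 0.0) for nh, s in stats.items()}


if __name__ == "__main__":
    for nh, share in shares(parse_group_stats(SAMPLE)).items():
        print(f"nexthop {nh}: {share:.2%}")
```

A share far from the configured weight ratio would be the kind of imbalance these counters are meant to expose.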
We also decided to model HW statistics as a nexthop group attribute, not an
arbitrary nexthop one. The latter would count any traffic going through a
given nexthop, regardless of which ECMP group it is in, or any at all. The
reason is again that the point of the patchset is ECMP balance visibility,
not arbitrary inspection of how busy a particular nexthop is.
Implementation of individual-nexthop statistics is certainly possible, and
could well follow the general approach we are taking in this patchset.
For resilient groups, per-bucket statistics could be done in a similar
manner as well.
This patchset contains the core code. mlxsw support will be sent in a
follow-up patch set.
This patchset progresses as follows:
- Patches #1 and #2 add support for a new next-hop object attribute,
NHA_OP_FLAGS. That is meant to carry various op-specific signaling, in
particular whether SW- and HW-collected nexthop stats should be part of
the get or dump response. The idea is to avoid wasting message space, and
time for collection of HW statistics, when the values are not needed.
- Patches #3 and #4 add SW-datapath stats and corresponding UAPI.
- Patches #5, #6 and #7 add support for HW-datapath stats and UAPI.
Individual drivers still need to contribute the appropriate HW-specific
support code.
v4:
- Patch #2:
- s/nla_get_bitfield32/nla_get_u32/ in __nh_valid_dump_req().
v3:
- Patch #3:
- Convert to u64_stats_t
- Patch #4:
- Give a symbolic name to the set of all valid dump flags
for the NHA_OP_FLAGS attribute.
- Convert to u64_stats_t
- Patch #6:
- Use a named constant for the NHA_HW_STATS_ENABLE policy.
v2:
- Patch #2:
- Change OP_FLAGS to u32, enforce through NLA_POLICY_MASK
- Patch #3:
- Set err on nexthop_create_group() error path
- Patch #4:
- Use uint to encode NHA_GROUP_STATS_ENTRY_PACKETS
- Rename jump target in nla_put_nh_group_stats() to avoid
having to rename further in the patchset.
- Patch #7:
- Use uint to encode NHA_GROUP_STATS_ENTRY_PACKETS_HW
- Do not cancel outside of nesting in nla_put_nh_group_stats()
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 6 Mar 2024 12:49:21 +0000 (13:49 +0100)]
net: nexthop: Expose nexthop group HW stats to user space
Add netlink support for reading NH group hardware stats.
Stats collection is done through a new notifier,
NEXTHOP_EVENT_HW_STATS_REPORT_DELTA. Drivers that implement HW counters for
a given NH group are thereby asked to collect the stats and report back to
core by calling nh_grp_hw_stats_report_delta(). This is similar to what
netdevice L3 stats do.
Besides exposing number of packets that passed in the HW datapath, also
include information on whether any driver actually realizes the counters.
The core can tell based on whether it got any _report_delta() reports from
the drivers. This allows enabling the statistics at the group at any time,
with drivers opting into supporting them. This is also in line with what
netdevice L3 stats are doing.
So as not to waste time and space, tie the collection and reporting of HW
stats with a new op flag, NHA_OP_FLAG_DUMP_HW_STATS.
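The delta-reporting flow described above can be modelled roughly as follows. This is a simplified Python sketch, not the kernel code: the class NexthopGroup, the function collect_hw_stats() and the driver callables are hypothetical stand-ins for the notifier and for nh_grp_hw_stats_report_delta().

```python
class NexthopGroup:
    """Toy model of a nexthop group with per-member HW packet counters."""

    def __init__(self, member_ids, hw_stats=False):
        self.hw_stats = hw_stats        # was HW stats collection requested?
        self.hw_stats_used = False      # did any driver actually report?
        self.packets_hw = {nh_id: 0 for nh_id in member_ids}

    def report_delta(self, nh_id, delta):
        """Stand-in for a driver reporting packets seen since its last report."""
        self.hw_stats_used = True
        self.packets_hw[nh_id] += delta


def collect_hw_stats(group, drivers):
    """Ask every driver to report deltas; return whether any of them did."""
    group.hw_stats_used = False
    if group.hw_stats:
        for driver in drivers:
            driver(group)
    return group.hw_stats_used


def offloading_driver(group):
    # A hypothetical driver that has counters installed for both members.
    group.report_delta(4001, 5000)
    group.report_delta(4002, 4999)


def unaware_driver(group):
    # A hypothetical driver that does not implement the counters at all.
    pass
```

In this model the "used" indication falls out of whether report_delta() was called during a collection pass, which mirrors the statement that the core can tell based on whether it got any reports.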
Co-developed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Kees Cook <keescook@chromium.org> # For the __counted_by bits
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 6 Mar 2024 12:49:20 +0000 (13:49 +0100)]
net: nexthop: Add ability to enable / disable hardware statistics
Add netlink support for enabling collection of HW statistics on nexthop
groups.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 6 Mar 2024 12:49:19 +0000 (13:49 +0100)]
net: nexthop: Add hardware statistics notifications
Add hw_stats field to several notifier structures to communicate to the
drivers that HW statistics should be configured for nexthops within a given
group.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 6 Mar 2024 12:49:18 +0000 (13:49 +0100)]
net: nexthop: Expose nexthop group stats to user space
Add netlink support for reading NH group stats.
This data is only for statistics of the traffic in the SW datapath. HW
nexthop group statistics will be added in the following patches.
Emission of the stats is keyed to a new op_stats flag to avoid cluttering
the netlink message with stats if the user doesn't need them:
NHA_OP_FLAG_DUMP_STATS.
Co-developed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 6 Mar 2024 12:49:17 +0000 (13:49 +0100)]
net: nexthop: Add nexthop group entry stats
Add nexthop group entry stats to count the number of packets forwarded
via each nexthop in the group. The stats will be exposed to user space
for better data path observability in the next patch.
The per-CPU stats pointer is placed at the beginning of 'struct
nh_grp_entry', so that all the fields accessed for the data path reside
on the same cache line:
struct nh_grp_entry {
        struct nexthop *           nh;                   /*     0     8 */
        struct nh_grp_entry_stats * stats;               /*     8     8 */
        u8                         weight;               /*    16     1 */
        /* XXX 7 bytes hole, try to pack */
        union {
                struct {
                        atomic_t   upper_bound;          /*    24     4 */
                } hthr;                                  /*    24     4 */
                struct {
                        struct list_head uw_nh_entry;    /*    24    16 */
                        u16        count_buckets;        /*    40     2 */
                        u16        wants_buckets;        /*    42     2 */
                } res;                                   /*    24    24 */
        };                                               /*    24    24 */
        struct list_head           nh_list;              /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct nexthop *           nh_parent;            /*    64     8 */
        /* size: 72, cachelines: 2, members: 6 */
        /* sum members: 65, holes: 1, sum holes: 7 */
        /* last cacheline: 8 bytes */
};
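The offsets in the layout above can be cross-checked with a small Python ctypes model. This is only a sketch: it assumes a 64-bit ABI with 8-byte pointers and 64-byte cache lines, uses c_uint64 as a stand-in for the pointer members and c_int32 for atomic_t, and names the anonymous union "u" for convenience.

```python
import ctypes

CACHELINE = 64  # assumed cache line size in bytes


class ListHead(ctypes.Structure):
    # Two pointers, modelled as 64-bit integers.
    _fields_ = [("next", ctypes.c_uint64), ("prev", ctypes.c_uint64)]


class Hthr(ctypes.Structure):
    _fields_ = [("upper_bound", ctypes.c_int32)]  # atomic_t stand-in


class Res(ctypes.Structure):
    _fields_ = [
        ("uw_nh_entry", ListHead),
        ("count_buckets", ctypes.c_uint16),
        ("wants_buckets", ctypes.c_uint16),
    ]


class GroupUnion(ctypes.Union):
    _fields_ = [("hthr", Hthr), ("res", Res)]


class NhGrpEntry(ctypes.Structure):
    _fields_ = [
        ("nh", ctypes.c_uint64),         # struct nexthop *
        ("stats", ctypes.c_uint64),      # struct nh_grp_entry_stats *
        ("weight", ctypes.c_uint8),
        ("u", GroupUnion),               # anonymous union in the C struct
        ("nh_list", ListHead),
        ("nh_parent", ctypes.c_uint64),  # struct nexthop *
    ]


def layout(struct):
    """Return {field_name: (offset, size)} for a ctypes structure."""
    return {
        name: (getattr(struct, name).offset, getattr(struct, name).size)
        for name, _ctype in struct._fields_
    }


def first_cacheline_fields(struct):
    """Names of the fields that fit entirely in the first cache line."""
    return [n for n, (off, size) in layout(struct).items() if off + size <= CACHELINE]


if __name__ == "__main__":
    for name, (off, size) in layout(NhGrpEntry).items():
        print(f"{name:10s} offset {off:3d} size {size:3d}")
    print("sizeof:", ctypes.sizeof(NhGrpEntry))
```

Under those assumptions only nh_parent lands in the second cache line, which matches the pahole output.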
Co-developed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 6 Mar 2024 12:49:16 +0000 (13:49 +0100)]
net: nexthop: Add NHA_OP_FLAGS
In order to add per-nexthop statistics, but still not increase netlink
message size for consumers that do not care about them, there needs to be a
toggle through which the user indicates their desire to get the statistics.
To that end, add a new attribute, NHA_OP_FLAGS. The idea is to be able to
use the attribute for carrying of arbitrary operation-specific flags, i.e.
not make it specific for get / dump.
Add the new attribute to get and dump policies, but do not actually allow
any flags yet -- those will come later as the flags themselves are defined.
Add the necessary parsing code.
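The mask-based validation idea can be sketched in Python as below. The attribute dictionary, parse_op_flags() and the bit positions chosen for the flags are illustrative assumptions; only the flag names come from this series.

```python
# Assumed bit positions, for illustration only.
NHA_OP_FLAG_DUMP_STATS = 1 << 0
NHA_OP_FLAG_DUMP_HW_STATS = 1 << 1


def parse_op_flags(attrs, valid_mask):
    """Return the op flags from a parsed attribute dict, rejecting unknown bits.

    attrs is a hypothetical {attribute_name: value} mapping; a missing
    NHA_OP_FLAGS attribute means that no flags were requested.
    """
    flags = attrs.get("NHA_OP_FLAGS", 0)
    unknown = flags & ~valid_mask
    if unknown:
        raise ValueError(f"unsupported op flags: {unknown:#x}")
    return flags


if __name__ == "__main__":
    # With an empty mask, as in this patch, every non-zero flag is rejected.
    print(parse_op_flags({}, valid_mask=0))
    try:
        parse_op_flags({"NHA_OP_FLAGS": NHA_OP_FLAG_DUMP_STATS}, valid_mask=0)
    except ValueError as err:
        print("rejected:", err)
```

Widening the mask in a later patch is then enough to start accepting a newly defined flag, without touching the parsing path.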
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 6 Mar 2024 12:49:15 +0000 (13:49 +0100)]
net: nexthop: Adjust netlink policy parsing for a new attribute
A following patch will introduce a new attribute, op-specific flags to
adjust the behavior of an operation. Different operations will recognize
different flags.
- To make the differentiation possible, stop sharing the policies for get
and del operations.
- To allow querying for presence of the attribute, have all the attribute
arrays sized to NHA_MAX, regardless of what is permitted by policy, and
pass the corresponding value to nlmsg_parse() as well.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sai Krishna [Tue, 5 Mar 2024 18:16:06 +0000 (23:46 +0530)]
octeontx2-pf: Add TC flower offload support for TCP flags
This patch adds TC offload support for matching TCP flags
from TCP header.
Example usage:
tc qdisc add dev eth0 ingress
TC rule to drop the TCP SYN packets:
tc filter add dev eth0 ingress protocol ip flower ip_proto tcp tcp_flags 0x02/0x3f skip_sw action drop
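To make the value/mask notation in the rule above concrete, here is a small Python sketch of the matching logic. The function name flags_match() is hypothetical; the flag bit values are the standard TCP header bits.

```python
# Standard TCP header flag bits (low six bits of the flags field).
TCP_FIN = 0x01
TCP_SYN = 0x02
TCP_RST = 0x04
TCP_PSH = 0x08
TCP_ACK = 0x10
TCP_URG = 0x20
TCP_ECE = 0x40  # outside the 0x3f mask used in the example rule


def flags_match(pkt_flags, value, mask):
    """A packet matches when its masked flags equal the masked value."""
    return (pkt_flags & mask) == (value & mask)


if __name__ == "__main__":
    value, mask = 0x02, 0x3F  # tcp_flags 0x02/0x3f from the example rule
    for name, flags in [
        ("SYN", TCP_SYN),
        ("SYN+ACK", TCP_SYN | TCP_ACK),
        ("ACK", TCP_ACK),
        ("SYN+ECE", TCP_SYN | TCP_ECE),
    ]:
        print(f"{name:8s} match={flags_match(flags, value, mask)}")
```

With mask 0x3f the rule matches packets whose only set bit among FIN, SYN, RST, PSH, ACK and URG is SYN, so a SYN+ACK is not dropped, while bits above the mask are ignored.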
Signed-off-by: Sai Krishna <saikrishnag@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
fuyuanli [Tue, 5 Mar 2024 03:04:17 +0000 (11:04 +0800)]
tcp: Add skb addr and sock addr to arguments of tracepoint tcp_probe.
It is useful to expose skb addr and sock addr to user in tracepoint
tcp_probe, so that we can get more information while monitoring
receiving of tcp data, by ebpf or other ways.
For example, we need to identify a packet by seq and end_seq when
calculating transmit latency between layer 2 and layer 4 with ebpf, but that
information is not available in tcp_probe, so we can only use a kprobe hooking
tcp_rcv_established to get it. But we can use tcp_probe directly if skb
addr and sock addr are available, which is more efficient.
Signed-off-by: fuyuanli <fuyuanli@didiglobal.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 4 Mar 2024 14:08:47 +0000 (06:08 -0800)]
net: dqs: add NIC stall detector based on BQL
softnet_data->time_squeeze is sometimes used as a proxy for
host overload or indication of scheduling problems. In practice
this statistic is very noisy and has hard to grasp units -
e.g. is 10 squeezes a second to be expected, or high?
Delaying network (NAPI) processing leads to drops on NIC queues
but also RTT bloat, impacting pacing and CA decisions.
Stalls are a little hard to detect on the Rx side, because
there may simply have not been any packets received in given
period of time. Packet timestamps help a little bit, but
again we don't know if packets are stale because we're
not keeping up or because someone (*cough* cgroups)
disabled IRQs for a long time.
We can, however, use Tx as a proxy for Rx stalls. Most drivers
use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
On the Tx side we know exactly when packets get queued,
and completed, so there is no uncertainty.
This patch adds stall checks to BQL. Why BQL? Because
it's a convenient place to add such checks, already
called by most drivers, and it has copious free space
in its structures (this patch adds no extra cache
references or dirtying to the fast path).
The algorithm takes one parameter - max delay AKA stall
threshold and increments a counter whenever NAPI got delayed
for at least that amount of time. It also records the length
of the longest stall.
To be precise every time NAPI has not polled for at least
stall thrs we check if there were any Tx packets queued
between last NAPI run and now - stall_thrs/2.
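The check described above can be modelled in a few lines of Python. This is a deliberately simplified sketch and not the BQL implementation: the class StallDetector is hypothetical, it keeps an explicit list of queue timestamps, and it takes the length of a stall to be the gap since the previous completion run, which is an assumption of this sketch.

```python
class StallDetector:
    """Toy model of a Tx-side stall check driven by completion (NAPI) runs."""

    def __init__(self, stall_thrs):
        self.stall_thrs = stall_thrs  # max tolerated delay, in arbitrary time units
        self.last_poll = 0            # time of the previous completion run
        self.queued = []              # Tx queue timestamps since that run
        self.stall_cnt = 0            # number of detected stalls
        self.stall_max = 0            # longest detected stall

    def queue(self, now):
        """Record that a Tx packet was queued at time now."""
        self.queued.append(now)

    def complete(self, now):
        """Completion run at time now: check for a stall, then reset state."""
        gap = now - self.last_poll
        if gap >= self.stall_thrs:
            # Only count a stall if something was queued early enough,
            # i.e. between the last run and now - stall_thrs/2.
            cutoff = now - self.stall_thrs / 2
            if any(self.last_poll <= t <= cutoff for t in self.queued):
                self.stall_cnt += 1
                self.stall_max = max(self.stall_max, gap)
        self.queued.clear()
        self.last_poll = now


if __name__ == "__main__":
    det = StallDetector(stall_thrs=8)
    det.queue(1)
    det.complete(3)    # polled promptly, no stall
    det.queue(4)
    det.complete(20)   # packet waited well past the threshold
    det.queue(39)
    det.complete(40)   # long idle gap, but the packet was queued just now
    print(det.stall_cnt, det.stall_max)
```

The last case shows why the queue-time check matters: a long gap between completion runs is not a stall if nothing was waiting to be completed.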
Unlike the classic Tx watchdog this mechanism does not
ignore stalls caused by Tx being disabled, or loss of link.
I don't think the check is worth the complexity, and
stall is a stall, whether due to host overload, flow
control, link down... doesn't matter much to the application.
We have been running this detector in production at Meta
for 2 years, with the threshold of 8ms. It's the lowest
value where false positives become rare. There's still
a constant stream of reported stalls (especially without
the ksoftirqd deferral patches reverted), those who like
their stall metrics to be 0 may prefer higher value.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Thu, 7 Mar 2024 11:22:37 +0000 (11:22 +0000)]
net: chelsio: remove unused function calc_tx_descs
The inlined helper function calc_tx_descs is not used and is redundant.
Remove it.
Cleans up clang scan build warning:
drivers/net/ethernet/chelsio/cxgb4/sge.c:814:28: warning: unused
function 'calc_tx_descs' [-Wunused-function]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 8 Mar 2024 05:13:28 +0000 (21:13 -0800)]
Merge branch 'netdev-add-per-queue-statistics'
Jakub Kicinski says:
====================
netdev: add per-queue statistics
Per queue stats keep coming up, so it's about time someone laid
the foundation. This series adds the uAPI, a handful of stats
and a sample support for bnxt. It's not very comprehensive in
terms of stat types or driver support. The expectation is that
the support will grow organically. If we have the basic pieces
in place it will be easy for reviewers to request new stats,
or use of the API in place of ethtool -S.
See patch 3 for sample output.
v2: https://lore.kernel.org/all/20240229010221.2408413-1-kuba@kernel.org/
v1: https://lore.kernel.org/all/20240226211015.1244807-1-kuba@kernel.org/
rfc: https://lore.kernel.org/all/20240222223629.158254-1-kuba@kernel.org/
====================
Link: https://lore.kernel.org/r/20240306195509.1502746-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 6 Mar 2024 19:55:09 +0000 (11:55 -0800)]
eth: bnxt: support per-queue statistics
Support per-queue statistics API in bnxt.
$ ethtool -S eth0
NIC statistics:
[0]: rx_ucast_packets: 1418
[0]: rx_mcast_packets: 178
[0]: rx_bcast_packets: 0
[0]: rx_discards: 0
[0]: rx_errors: 0
[0]: rx_ucast_bytes: 1141815
[0]: rx_mcast_bytes: 16766
[0]: rx_bcast_bytes: 0
[0]: tx_ucast_packets: 1734
...
$ ./cli.py --spec netlink/specs/netdev.yaml \
--dump qstats-get --json '{"scope": "queue"}'
[{'ifindex': 2,
'queue-id': 0,
'queue-type': 'rx',
'rx-alloc-fail': 0,
'rx-bytes': 1164931,
'rx-packets': 1641},
...
{'ifindex': 2,
'queue-id': 0,
'queue-type': 'tx',
'tx-bytes': 631494,
'tx-packets': 1771},
...
Reset the per queue counters:
$ ethtool -L eth0 combined 4
Inspect again:
$ ./cli.py --spec netlink/specs/netdev.yaml \
--dump qstats-get --json '{"scope": "queue"}'
[{'ifindex': 2,
'queue-id': 0,
'queue-type': 'rx',
'rx-alloc-fail': 0,
'rx-bytes': 32397,
'rx-packets': 145},
...
{'ifindex': 2,
'queue-id': 0,
'queue-type': 'tx',
'tx-bytes': 37481,
'tx-packets': 196},
...
$ ethtool -S eth0 | head
NIC statistics:
[0]: rx_ucast_packets: 174
[0]: rx_mcast_packets: 3
[0]: rx_bcast_packets: 0
[0]: rx_discards: 0
[0]: rx_errors: 0
[0]: rx_ucast_bytes: 37151
[0]: rx_mcast_bytes: 267
[0]: rx_bcast_bytes: 0
[0]: tx_ucast_packets: 267
...
Totals are still correct:
$ ./cli.py --spec netlink/specs/netdev.yaml --dump qstats-get
[{'ifindex': 2,
'rx-alloc-fail': 0,
'rx-bytes': 281949995,
'rx-packets': 216524,
'tx-bytes': 52694905,
'tx-packets': 75546}]
$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 14:23:f2:61:05:40 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
282519546 218100 0 0 0 516
TX: bytes packets errors dropped carrier collsns
53323054 77674 0 0 0 0
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240306195509.1502746-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 6 Mar 2024 19:55:08 +0000 (11:55 -0800)]
netdev: add queue stat for alloc failures
Rx alloc failures are commonly counted by drivers.
Support reporting those via netdev-genl queue stats.
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240306195509.1502746-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 6 Mar 2024 19:55:07 +0000 (11:55 -0800)]
netdev: add per-queue statistics
The ethtool-nl family does a good job exposing various protocol
related and IEEE/IETF statistics which used to get dumped under
ethtool -S, with creative names. Queue stats don't have a netlink
API, yet, and remain a lion's share of ethtool -S output for new
drivers. Not only is that bad because the names differ driver to
driver but it's also bug-prone. Intuitively drivers try to report
only the stats for active queues, but querying ethtool stats
involves multiple system calls, and the number of stats is
read separately from the stats themselves. Worse still when user
space asks for values of the stats, it doesn't inform the kernel
how big the buffer is. If number of stats increases in the meantime
kernel will overflow user buffer.
Add a netlink API for dumping queue stats. Queue information is
exposed via the netdev-genl family, so add the stats there.
Support per-queue and sum-for-device dumps. Latter will be useful
when subsequent patches add more interesting common stats than
just bytes and packets.
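The relationship between the two dump scopes can be illustrated with a short Python sketch that folds per-queue entries into one record per device. The function sum_for_device() is hypothetical and the input mimics the dictionaries a netlink client would print; it only sums the queue entries it is given, whereas the device-scope dump in the kernel may account for more than the currently active queues.

```python
from collections import defaultdict

# Keys that identify a queue rather than carry a counter.
ID_KEYS = ("ifindex", "queue-id", "queue-type")


def sum_for_device(queue_stats):
    """Fold per-queue stat dicts into one dict of summed counters per ifindex."""
    totals = defaultdict(lambda: defaultdict(int))
    for entry in queue_stats:
        dev = totals[entry["ifindex"]]
        for key, value in entry.items():
            if key not in ID_KEYS:
                dev[key] += value
    return [
        {"ifindex": ifindex, **dict(sorted(stats.items()))}
        for ifindex, stats in sorted(totals.items())
    ]


if __name__ == "__main__":
    # Made-up sample entries for one device with two Rx queues and one Tx queue.
    sample = [
        {"ifindex": 2, "queue-id": 0, "queue-type": "rx",
         "rx-bytes": 1000, "rx-packets": 10},
        {"ifindex": 2, "queue-id": 1, "queue-type": "rx",
         "rx-bytes": 500, "rx-packets": 5},
        {"ifindex": 2, "queue-id": 0, "queue-type": "tx",
         "tx-bytes": 300, "tx-packets": 3},
    ]
    print(sum_for_device(sample))
```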
The API does not currently distinguish between HW and SW stats.
The expectation is that the source of the stats will either not
matter much (good packets) or be obvious (skb alloc errors).
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240306195509.1502746-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 8 Mar 2024 05:12:45 +0000 (21:12 -0800)]
Merge branch 'net-group-together-hot-data'
Eric Dumazet says:
====================
net: group together hot data
While our recent structure reorganizations were focused
on increasing max throughput, there is still an
area where improvements are much needed.
In many cases, a cpu handles one packet at a time,
instead of a nice batch.
Hardware interrupt.
-> Software interrupt.
-> Network/Protocol stacks.
If the cpu was idle or busy in other layers,
it has to pull many cache lines.
This series adds a new net_hotdata structure, where
some critical (and read-mostly) data used in
rx and tx path is packed in a small number of cache lines.
Synthetic benchmarks will not see much difference,
but latency of single packet should improve.
net_hotdata current size on 64bit is 416 bytes,
but might grow in the future.
Also move RPS definitions to a new include file.
====================
Link: https://lore.kernel.org/r/20240306160031.874438-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Wed, 6 Mar 2024 16:00:31 +0000 (16:00 +0000)]
net: move rps_sock_flow_table to net_hotdata
rps_sock_flow_table and rps_cpu_mask are used in fast path.
Move them to net_hotdata for better cache locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-19-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>