git.maquefel.me Git - linux.git/log

net: ethernet: renesas: rcar_gen4_ptp: Break out to module

The Gen4 gPTP support will be shared between the existing Renesas
Ethernet Switch driver and the upcoming Renesas Ethernet-TSN driver. In
preparation for this break out the gPTP support to its own module.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ethernet: renesas: rcar_gen4_ptp: Get clock increment from clock rate

Instead of using hard coded clock increment values for each SoC derive
the clock increment from the module clock. This is done in preparation
to support a second platform, R-Car V4H that uses a 200Mhz clock
compared with the 320Mhz clock used on R-Car S4.

Tested on both SoCs,

S4 reports a clock of 320000000Hz which gives a value of 0x19000000.
Documentation says a 320Mhz clock is used and the correct increment for
that clock is 0x19000000.

V4H reports a clock of 199999992Hz which gives a value of 0x2800001a.
Documentation says a 200Mhz clock is used and the correct increment for
that clock is 0x28000000.

Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ethernet: renesas: rcar_gen4_ptp: Prepare for shared register layout

All known R-Car Gen4 SoC share the same register layout, rename the
R-Car S4 specific identifiers so they can be shared with the upcoming
R-Car V4H support.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ethernet: renesas: rcar_gen4_ptp: Fail on unknown register layout

Instead of printing a warning and proceeding with an unknown register
layout return an error. The only call site is already prepared to
propagate the error.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ethernet: renesas: rcar_gen4_ptp: Remove incorrect comment

The comments intent was to indicates which function uses the enum. While
upstreaming rcar_gen4_ptp the function was renamed but this comment was
left with the old function name.

Instead of correcting the comment remove it, it adds little value.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: Add support for HW-accelerated VLAN stripping

Current implementation supports driver level VLAN tag stripping only.
The features is always on if CONFIG_VLAN_8021Q is enabled in kernel
config and is not user configurable.

This patch add support to MAC level VLAN tag stripping and can be
configured through ethtool. If the rx-vlan-offload is off, the VLAN tag
will be stripped by driver. If the rx-vlan-offload is on, the VLAN tag
will be stripped by MAC.

Command: ethtool -K <interface> rx-vlan-offload off | on

Signed-off-by: Lai Peter Jun Ann <jun.ann.lai@intel.com>
Signed-off-by: Gan, Yi Fang <yi.fang.gan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hsr: Add support for MC filtering at the slave device

When MC (multicast) list is updated by the networking layer due to a
user command and as well as when allmulti flag is set, it needs to be
passed to the enslaved Ethernet devices. This patch allows this
to happen by implementing ndo_change_rx_flags() and ndo_set_rx_mode()
API calls that in turns pass it to the slave devices using
existing API calls.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Ravi Gunasekaran <r-gunasekaran@ti.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-11-21

We've added 85 non-merge commits during the last 12 day(s) which contain
a total of 63 files changed, 4464 insertions(+), 1484 deletions(-).

The main changes are:

1) Huge batch of verifier changes to improve BPF register bounds logic
   and range support along with a large test suite, and verifier log
   improvements, all from Andrii Nakryiko.

2) Add a new kfunc which acquires the associated cgroup of a task within
   a specific cgroup v1 hierarchy where the latter is identified by its id,
   from Yafang Shao.

3) Extend verifier to allow bpf_refcount_acquire() of a map value field
   obtained via direct load which is a use-case needed in sched_ext,
   from Dave Marchevsky.

4) Fix bpf_get_task_stack() helper to add the correct crosstask check
   for the get_perf_callchain(), from Jordan Rome.

5) Fix BPF task_iter internals where lockless usage of next_thread()
   was wrong. The rework also simplifies the code, from Oleg Nesterov.

6) Fix uninitialized tail padding via LIBBPF_OPTS_RESET, and another
   fix for certain BPF UAPI structs to fix verifier failures seen
   in bpf_dynptr usage, from Yonghong Song.

7) Add BPF selftest fixes for map_percpu_stats flakes due to per-CPU BPF
   memory allocator not being able to allocate per-CPU pointer successfully,
   from Hou Tao.

8) Add prep work around dynptr and string handling for kfuncs which
   is later going to be used by file verification via BPF LSM and fsverity,
   from Song Liu.

9) Improve BPF selftests to update multiple prog_tests to use ASSERT_*
   macros, from Yuran Pereira.

10) Optimize LPM trie lookup to check prefixlen before walking the trie,
    from Florian Lehner.

11) Consolidate virtio/9p configs from BPF selftests in config.vm file
    given they are needed consistently across archs, from Manu Bretelle.

12) Small BPF verifier refactor to remove register_is_const(),
    from Shung-Hsi Yu.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits)
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in vmlinux
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_obj_id
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bind_perm
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_tcp_ca
  selftests/bpf: reduce verboseness of reg_bounds selftest logs
  bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)
  bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()
  bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()
  bpf: emit frameno for PTR_TO_STACK regs if it differs from current one
  bpf: smarter verifier log number printing logic
  bpf: omit default off=0 and imm=0 in register state log
  bpf: emit map name in register state if applicable and available
  bpf: print spilled register state in stack slot
  bpf: extract register state printing
  bpf: move verifier state printing code to kernel/bpf/log.c
  bpf: move verbose_linfo() into kernel/bpf/log.c
  bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS
  bpf: Remove test for MOVSX32 with offset=32
  selftests/bpf: add iter test requiring range x range logic
  veristat: add ability to set BPF_F_TEST_SANITY_STRICT flag with -r flag
  ...
====================

Link: https://lore.kernel.org/r/20231122000500.28126-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bnxt_en-prepare-to-support-new-p7-chips'

Michael Chan says:

====================
bnxt_en: Prepare to support new P7 chips

This patchset is to prepare the driver to support the new P7 chips by
refactoring and modifying the code.  The P7 chip is built on the P5
chip and many code paths can be modified to support both chips.  The
whole patchset to have basic support for P7 chips is about 20 patches so
a follow-on patchset will complete the support and add the new PCI IDs.

The first 8 patches are changes to the backing store logic to support
both chips with mostly common code paths and datastructures.  Both
chips require host backing store memory but the relevant firmware APIs
have been modified to make it easier to support new backing store
memory types.

The next 4 patches are changes to TX and RX ring indexing logic and NAPI
logic.  The main changes are to increment the TX and RX producers
unbounded and to do any masking only when needed.  These changes are
needed to support the P7 chips which require additional higher bits in
these producer indices.  The NAPI logic is also slightly modifed.

The last patch is a rename of BNXT_FLAG_CHIP_P5 to BNXT_FLAG_P5_PLUS and
other related macro changes to make it clear that the P5_PLUS macro
applies to P5 and newer chips.
====================

Link: https://lore.kernel.org/r/20231120234405.194542-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Rename some macros for the P5 chips

In preparation to support a new P7 chip which has a lot of similarities
with the P5 chip, rename the BNXT_FLAG_CHIP_P5 flag to
BNXT_FLAG_CHIP_P5_PLUS. This will make it clear that the flag is for
P5 and newer chips.

Also, since there are no additional P5 variants in production, rename
BNXT_FLAG_CHIP_P5_THOR() to BNXT_FLAG_CHIP_P5() to keep the naming
more simple.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Signed-off-by: Randy Schacher <stuart.schacher@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-14-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Modify the NAPI logic for the new P7 chips

Modify the NAPI logic for the new doorbell mechanism on P7 chips.
These changes are compatible with the current P5 chips.

In the current logic, bnxt_poll_p5() services 1 or more CQs for each
MSIX.  Each MSIX has an associated NQ and each NQ has 1 or more
associated CQs.  If any CQ reaches NAPI budget, we'll stay in polling
mode and will unconditionally check and service all CQs until we exit
polling.  We always re-arm all CQs when we exit polling.

To be compatible with the new Toggle bit mechanism in P7 chips, we need
to modify the logic so that we service and re-arm the CQ only if we
receive an NQE notification for work for that CQ.  We add a new
had_nqe_notify bit to the cp_ring_info structure and it gets set when we
see the NQE notification for that CQ anytime during polling.  We'll
service and re-arm only the CQs with the had_nqe_notify bits set.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-13-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Modify RX ring indexing logic.

Modify the RX indexing logic for both RX ring and RX aggregation ring just
like the TX logic. Change it so that the index increments unbounded and
mask it only when needed.

Modify the existing RX macros so that the index is not masked. Add new
macros RING_RX()/RING_RX_AGG() to mask it only when needed to get the
index of rxr->rx_buf_ring[] and rxr->rx_agg_ring[].

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-12-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Modify TX ring indexing logic.

Change the TX ring logic so that the index increments unbounded and
mask it only when needed.

Modify the existing macros so that the index is not masked. Add a
new macro RING_TX() to mask it only when needed to get the index of
txr->tx_buf_ring[].

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-11-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Add db_ring_mask and related macro to bnxt_db_info struct.

This allows the doorbell related logic to mask the doorbell index
to the proper range before writing the doorbell.

The current code masks the doorbell index immediately to keep it in the
legal ranges for the most part.  Subsequent patches will change the
logic so that the index increments unbounded and it only gets masked
before use.  This is preparation work for the new chip that requires an
additional Epoch bit in the doorbell that needs to toggle when the index
has wrapped around.

This patch just adds the basic infrastructure and the logic is largely
unchanged.  We now replace RING_CMP() with the new DB_RING_IDX() at
appropriate places where we mask the completion ring index before
writing the doorbell.

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-10-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Add support for HWRM_FUNC_BACKING_STORE_CFG_V2 firmware calls

Newer chips starting with 57600 will use this new firmware HWRM call to
configure backing store memory. Add this new call if it is supported
by the firmware.

Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-9-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Add support for new backing store query firmware API

Use the new v2 firmware API if supported by the firmware. We now have the
infrastructure to support the v2 API.

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-8-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Add bnxt_setup_ctxm_pg_tbls() helper function

In bnxt_alloc_ctx_mem(), the logic to set up the context memory entries
and to allocate the context memory tables is done repetitively. Add
a helper function to simplify the code.

The setup of the Fast Path TQM entries relies on some information from
the Slow Path TQM entries. Copy the SP_TQM entries to the FP_TQM
entries to simplify the logic.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-7-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Use the pg_info field in bnxt_ctx_mem_type struct

Use the newly added pg_info field in bnxt_ctx_mem_type struct and
remove the standalone page info structures in bnxt_ctx_mem_info.
This now completes the reorganization of the context memory
structures to work better with the new and more flexible firmware
interface for newer chips.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-6-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Add page info to struct bnxt_ctx_mem_type

This will further improve the organization of the bnxt_ctx_mem_info
structure by moving the standalone page info structures into the
bnxt_ctx_mem_type array. Add the allocation and free logic first and
the next patch will migrate to use the new infrastructure.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Restructure context memory data structures

The current code uses a flat bnxt_ctx_mem_info structure to store 8
types of context memory for the NIC.  All the context memory types
are very similar and have similar parameters.  They can all share a
common structure to improve the organization.  Also, new firmware
interface will provide a new API to retrieve each type of context
memory by calling the API repeatedly.

This patch reorganizes the bnxt_ctx_mem_info structure to fit better
with the new firmware interface.  It will also work with the legacy
firmware interface.  The flat fields in bnxt_ctx_mem_info are replaced
by the bnxt_ctx_mem_type array.  The bnxt_mem_init array info will no
longer be needed.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Free bp->ctx inside bnxt_free_ctx_mem()

We always free bp->ctx right after calling bnxt_free_ctx_mem(), so just
free it at the end of that function to make things simpler.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: The caller of bnxt_alloc_ctx_mem() should always free bp->ctx

bnxt_alloc_ctx_mem() calls bnxt_hwrm_func_backing_store_qcaps() to
allocate the memory for bp->ctx.  Initialize bp->ctx with the allocated
memory and let the caller free it during unwind.  The unwind logic is
already there, we just need to always set bp->ctx to the allocated
memory so the caller will always free it.  This simplifies the logic
and makes it easier to expand on the backing store logic.

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-page_pool-add-netlink-based-introspection-part1'

Jakub Kicinski says:

====================
net: page_pool: plit the page_pool_params into fast and slow

Small refactoring in prep for adding more page pool params
which won't be needed on the fast path.

v1: https://lore.kernel.org/all/20231024160220.3973311-1-kuba@kernel.org/
RFC: https://lore.kernel.org/all/20230816234303.3786178-1-kuba@kernel.org/
====================

Link: https://lore.kernel.org/r/20231121000048.789613-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: page_pool: avoid touching slow on the fastpath

To fully benefit from previous commit add one byte of state
in the first cache line recording if we need to look at
the slow part.

The packing isn't all that impressive right now, we create
a 7B hole. I'm expecting Olek's rework will reshuffle this,
anyway.

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20231121000048.789613-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: page_pool: split the page_pool_params into fast and slow

struct page_pool is rather performance critical and we use
16B of the first cache line to store 2 pointers used only
by test code. Future patches will add more informational
(non-fast path) attributes.

It's convenient for the user of the API to not have to worry
which fields are fast and which are slow path. Use struct
groups to split the params into the two categories internally.

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20231121000048.789613-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mlxsw-preparations-for-support-of-cff-flood-mode'

Petr Machata says:

====================
mlxsw: Preparations for support of CFF flood mode

PGT is an in-HW table that maps addresses to sets of ports. Then when some
HW process needs a set of ports as an argument, instead of embedding the
actual set in the dynamic configuration, what gets configured is the
address referencing the set. The HW then works with the appropriate PGT
entry.

Among other allocations, the PGT currently contains two large blocks for
bridge flooding: one for 802.1q and one for 802.1d. Within each of these
blocks are three tables, for unknown-unicast, multicast and broadcast
flooding:

      . . . |    802.1q    |    802.1d    | . . .
            | UC | MC | BC | UC | MC | BC |
             \______ _____/ \_____ ______/
                    v             v
                   FID flood vectors

Thus each FID (which corresponds to an 802.1d bridge or one VLAN in an
802.1q bridge) uses three flood vectors spread across a fairly large region
of PGT.

This way of organizing the flood table (called "controlled") is not very
flexible. E.g. to decrease a bridge scale and store more IP MC vectors, one
would need to completely rewrite the bridge PGT blocks, or resort to hacks
such as storing individual MC flood vectors into unused part of the bridge
table.

In order to address these shortcomings, Spectrum-2 and above support what
is called CFF flood mode, for Compressed FID Flooding. In CFF flood mode,
each FID has a little table of its own, with three entries adjacent to each
other, one for unknown-UC, one for MC, one for BC. This allows for a much
more fine-grained approach to PGT management, where bits of it are
allocated on demand.

      . . . | FID | FID | FID | FID | FID | . . .
            |U|M|B|U|M|B|U|M|B|U|M|B|U|M|B|
             \_____________ _____________/
                           v
                   FID flood vectors

Besides the FID table organization, the CFF flood mode also impacts Router
Subport (RSP) table. This table contains flood vectors for rFIDs, which are
FIDs that reference front panel ports or LAGs. The RSP table contains two
entries per front panel port and LAG, one for unknown-UC traffic, and one
for everything else. Currently, the FW allocates and manages the table in
its own part of PGT. rFIDs are marked with flood_rsp bit and managed
specially. In CFF mode, rFIDs are managed as all other FIDs. The driver
therefore has to allocate and maintain the flood vectors. Like with bridge
FIDs, this is more work, but increases flexibility of the system.

The FW currently supports both the controlled and CFF flood modes. To shed
complexity, in the future it should only support CFF flood mode. Hence this
patchset, which is the first in series of two to add CFF flood mode support
to mlxsw.

There are FW versions out there that do not support CFF flood mode, and on
Spectrum-1 in particular, there is no plan to support it at all. mlxsw will
therefore have to support both controlled flood mode as well as CFF.

Another aspect is that at least on Spectrum-1, there are FW versions out
there that claim to support CFF flood mode, but then reject or ignore
configurations enabling the same. The driver thus has to have a say in
whether an attempt to configure CFF flood mode should even be made.

Much like with the LAG mode, the feature is therefore expressed in terms of
"does the driver prefer CFF flood mode?", and "what flood mode the PCI
module managed to configure the FW with". This gives to the driver a chance
to determine whether CFF flood mode configuration should be attempted.

In this patchset, we lay the ground with new definitions, registers and
their fields, and some minor code shaping. The next patchset will be more
focused on introducing necessary abstractions and implementation.

- Patches #1 and #2 add CFF-related items to the command interface.

- Patch #3 adds a new resource, for maximum number of flood profiles
  supported. (A flood profile is a mapping between traffic type and offset
  in the per-FID flood vector table.)

- Patches #4 to #8 adjust reg.h. The SFFP register is added, which is used
  for configuring the abovementioned traffic-type-to-offset mapping. The
  SFMR, register, which serves for FID configuration, is extended with
  fields specific to CFF mode. And other minor adjustments.

- Patches #9 and #10 add the plumbing for CFF mode: a way to request that
  CFF flood mode be configured, and a way to query the flood mode that was
  actually configured.

- Patch #11 removes dead code.

- Patches #12 and #13 add helpers that the next patchset will make use of.
  Patch #14 moves RIF setup ahead so that FID code can make use of it.
====================

Link: https://lore.kernel.org/r/cover.1700503643.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_router: Call RIF setup before obtaining FID

For subport RIFs, the setup initializes, among other things, RIF port and
LAG numbers. Those are important to determine where in the PGT the RIF FID
will be stored. Therefore, call the RIF setup before fid_get.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/f24d8cad7e4748b8e8e0e16894ca6a20704dea32.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_router: Add a helper to get subport number from a RIF

In the CFF flood mode, responsibility for management of the PGT entries for
rFIDs is moved from FW to the driver. All rFIDs are based off either a
front panel port, or a LAG port. The flood vectors for port-based rFIDs
enable just the port itself, the ones for LAG-based rFIDs enable all member
ports of the LAG in question.

Since all rFIDs based off the same port have the same flood vector, and
similarly for LAG-based rFIDs, the flood entries are shared. The PGT
address of the flood vector is therefore determined based on the port (or
LAG) number of the RIF connected with the rFID.

Add a helper to determine subport number given a RIF, to be used in these
calculations.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/d7ab43cf5b021f785f363f236e4b6780d10eea93.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_fid: Extract SFMR packing into a helper

Both mlxsw_sp_fid_op() and mlxsw_sp_fid_edit_op() pack the core of SFMR the
same way. Extract the common code into a helper and call that. Extract out
of that a wrapper that just calls mlxsw_reg_sfmr_pack(), because it will
be useful for the dummy family later on.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/31f32b4d767183f6cb197148d0792feab2efadba.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_fid: Drop unnecessary conditions

The caller already only calls mlxsw_sp_fid_flood_tables_init() and
mlxsw_sp_fid_flood_tables_fini() if (fid_family->flood_tables). There
is no configuration where the pointer is non-NULL, but the number of
tables is zero. So drop the conditions.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/897c6841bc756ac632b797bf67ac83c6a66ba359.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: pci: Permit enabling CFF mode

There are FW versions out there that do not support CFF flood mode, and on
Spectrum-1 in particular, there is no plan to support it at all. mlxsw will
therefore have to support both controlled flood mode as well as CFF. There
are also FW versions out there that claim to support CFF flood mode, but
then reject or ignore configurations enabling the same. The driver thus has
to have a say in whether an attempt to configure CFF flood mode should even
be made, and what to use as a fallback.

Hence express the feature in terms of "does the driver prefer CFF flood
mode?", and "what flood mode the PCI module managed to configure the FW
with". This gives to the driver a chance to determine whether CFF flood
mode configuration should be attempted.

The latter bit was added in previous patches. In this patch, add the bit
that allows the driver to determine whether CFF enablement should be
attempted, and the enablement code itself.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/41640a0ee58e0a9538f820f7b601a0e35f6449e4.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: core, pci: Add plumbing related to CFF mode

CFF mode, for Compressed FID Flooding, is a way of organizing flood vectors
in the PGT table. The bus module determines whether CFF is supported, can
configure flood mode to CFF if it is, and knows what flood mode has been
configured. Therefore add a bus callback to determine the configured flood
mode. Also add to core an API to query it.

Since after this patch, we rely on mlxsw_pci->flood_mode being set, it
becomes a coding error if a driver invokes this function with a set of
fields that misses the initialization. Warn and bail out in that case.

The CFF mode is not used as of this patch. The code to actually use it will
be added later.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/889d58759dd40f5037f2206b9fc4a78a9240da80.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: reg: Add to SFMR register the fields related to CFF flood mode

Add the field cff_mid_base, which specifies at which point in PGT the
per-FID flood table is stored. Add cff_prf_id, the profile ID, which
determines on which row of the flood table a flood vector can be found for
a given traffic type.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/3ad7ae38cf6534bedcd876f16090d109a814b3e3.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: reg: Extract flood-mode specific part of mlxsw_reg_sfmr_pack()

In CFF mode, it is necessary to set a different set of SFMR fields. Leave
in mlxsw_reg_sfmr_pack() only the common bits, and move the parts relevant
to controlled flood mode directly to the call site.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/6f29639ebc3ca0722272e6c644ca910096469413.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: reg: Drop unnecessary writes from mlxsw_reg_sfmr_pack()

The MLXSW_REG_ZERO at the beginning of the function wipes the whole
payload. There's no need to set vtfp and vv to false explicitly.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/04a51ea7cf31eea0ef7707311d8e864e2d9ef307.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: reg: Mark SFGC & some SFMR fields as reserved in CFF mode

Some existing fields and the whole register of SFGC are reserved in CFF
mode. Backport the reservation note to these fields.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/e1d5977a8cb778227e4ea2fd1515529957ce5de7.1700503643.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: reg: Add Switch FID Flooding Profiles Register

The SFFP register populates the fid flooding profile tables used for the
NVE flooding and Compressed-FID Flooding (CFF).

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/ca42eb67763bd0c7cf035afc62ef73632f3f61a6.1700503643.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: resources: Add max_cap_nve_flood_prf

max_cap_nve_flood_prf describes maximum number of NVE flooding profiles.
The same value then applies for flooding profiles for flooding in CFF mode.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/064a2e013d879e5f5494167a6c120c4bb85a2204.1700503643.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: cmd: Add MLXSW_CMD_MBOX_CONFIG_PROFILE_FLOOD_MODE_CFF

PGT, a port-group table is an in-HW block of specialized memory that holds
sets of ports. Allocated within the PGT are series of flood tables that
describe to which ports traffic of various types (unknown UC, BC, MC)
should be flooded from which FID. The hitherto-used layout of these flood
tables is being replaced with a more flexible scheme, called compressed FID
flooding (CFF). CFF can be configured through CONFIG_PROFILE.flood_mode.

In this patch, add MLXSW_CMD_MBOX_CONFIG_PROFILE_FLOOD_MODE_CFF, the value
to use to enable the CFF mode.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/fc2e063742856492f8f22b0b87abf431ea6d53d0.1700503643.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: cmd: Add cmd_mbox.query_fw.cff_support

PGT, a port-group table is an in-HW block of specialized memory that holds
sets of ports. Allocated within the PGT are series of flood tables that
describe to which ports traffic of various types (unknown UC, BC, MC)
should be flooded from which FID. The hitherto-used layout of these flood
tables is being replaced with a more flexible scheme, called compressed FID
flooding (CFF). CFF can be configured through CONFIG_PROFILE.flood_mode.

cff_support determines whether CONFIG_PROFILE.flood_mode can be set to CFF.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/af727d0e1095e30fa45c7e60404637cdc491aeec.1700503643.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: do not send a MOVE event when netdev changes netns

Networking supports changing netdevice's netns and name
at the same time. This allows avoiding name conflicts
and having to rename the interface in multiple steps.
E.g. netns1={eth0, eth1}, netns2={eth1} - we want
to move netns1:eth1 to netns2 and call it eth0 there.
If we can't rename "in flight" we'd need to (1) rename
eth1 -> $tmp, (2) change netns, (3) rename $tmp -> eth0.

To rename the underlying struct device we have to call
device_rename(). The rename()'s MOVE event, however, doesn't
"belong" to either the old or the new namespace.
If there are conflicts on both sides it's actually impossible
to issue a real MOVE (old name -> new name) without confusing
user space. And Daniel reports that such confusions do in fact
happen for systemd, in real life.

Since we already issue explicit REMOVE and ADD events
manually - suppress the MOVE event completely. Move
the ADD after the rename, so that the REMOVE uses
the old name, and the ADD the new one.

If there is no rename this changes the picture as follows:

Before:

old ns | KERNEL[213.399289] remove   /devices/virtual/net/eth0 (net)
new ns | KERNEL[213.401302] add      /devices/virtual/net/eth0 (net)
new ns | KERNEL[213.401397] move     /devices/virtual/net/eth0 (net)

After:

old ns | KERNEL[266.774257] remove   /devices/virtual/net/eth0 (net)
new ns | KERNEL[266.774509] add      /devices/virtual/net/eth0 (net)

If there is a rename and a conflict (using the exact eth0/eth1
example explained above) we get this:

Before:

old ns | KERNEL[224.316833] remove   /devices/virtual/net/eth1 (net)
new ns | KERNEL[224.318551] add      /devices/virtual/net/eth1 (net)
new ns | KERNEL[224.319662] move     /devices/virtual/net/eth0 (net)

After:

old ns | KERNEL[333.033166] remove   /devices/virtual/net/eth1 (net)
new ns | KERNEL[333.035098] add      /devices/virtual/net/eth0 (net)

Note that "in flight" rename is only performed when needed.
If there is no conflict for old name in the target netns -
the rename will be performed separately by dev_change_name(),
as if the rename was a different command, and there will still
be a MOVE event for the rename:

Before:

old ns | KERNEL[194.416429] remove   /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.418809] add      /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.418869] move     /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.420866] move     /devices/virtual/net/eth1 (net)

After:

old ns | KERNEL[71.917520] remove   /devices/virtual/net/eth0 (net)
new ns | KERNEL[71.919155] add      /devices/virtual/net/eth0 (net)
new ns | KERNEL[71.920729] move     /devices/virtual/net/eth1 (net)

If deleting the MOVE event breaks some user space we should insert
an explicit kobject_uevent(MOVE) after the ADD, like this:

@@ -11192,6 +11192,12 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net,
kobject_uevent(&dev->dev.kobj, KOBJ_ADD);
netdev_adjacent_add_links(dev);

+ /* User space wants an explicit MOVE event, issue one unless
+ * dev_change_name() will get called later and issue one.
+ */
+ if (!pat || new_name[0])
+ kobject_uevent(&dev->dev.kobj, KOBJ_MOVE);
+
/* Adapt owner in case owning user namespace of target network
* namespace is different from the original one.
*/

Reported-by: Daniel Gröber <dxld@darkboxed.org>
Link: https://lore.kernel.org/all/20231010121003.x3yi6fihecewjy4e@House.clients.dxld.at/
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/all/20231120184140.578375-1-kuba@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: ax88179_178a: avoid two consecutive device resets

The device is always reset two consecutive times (ax88179_reset is called
twice), one from usbnet_probe during the device binding and the other from
usbnet_open.

Remove the non-necessary reset during the device binding and let the reset
operation from open to keep the normal behavior (tested with generic ASIX
Electronics Corp. AX88179 Gigabit Ethernet device).

Reported-by: Herb Wei <weihao.bj@ieisystem.com>
Tested-by: Herb Wei <weihao.bj@ieisystem.com>
Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Link: https://lore.kernel.org/r/20231120121239.54504-1-jtornosm@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-bpf-update-multiple-prog_tests-to-use-assert_-macros'

Yuran Pereira says:

====================
selftests/bpf: Update multiple prog_tests to use ASSERT_ macros

Multiple files/programs in `tools/testing/selftests/bpf/prog_tests/` still
heavily use the `CHECK` macro, even when better `ASSERT_` alternatives are
available.

As it was already pointed out by Yonghong Song [1] in the bpf selftests the use
of the ASSERT_* series of macros is preferred over the CHECK macro.

This patchset replaces the usage of `CHECK(` macros to the equivalent `ASSERT_`
family of macros in the following prog_tests:
- bind_perm.c
- bpf_obj_id.c
- bpf_tcp_ca.c
- vmlinux.c

[1] https://lore.kernel.org/lkml/0a142924-633c-44e6-9a92-2dc019656bf2@linux.dev

Changes in v3:
- Addressed the following points mentioned by Yonghong Song
- Improved `bpf_map_lookup_elem` assertion in bpf_tcp_ca.
- Replaced assertion introduced in v2 with one that checks `thread_ret`
  instead of `pthread_join`. This ensures that `server`'s return value
  (thread_ret) is the one being checked, as oposed to `pthread_join`'s
  return value, since the latter one is less likely to fail.

Changes in v2:
- Fixed pthread_join assertion that broke the previous test

Previous version:
v2 - https://lore.kernel.org/lkml/GV1PR10MB6563AECF8E94798A1E5B36A4E8B6A@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM
v1 - https://lore.kernel.org/lkml/GV1PR10MB6563FCFF1C5DEBE84FEA985FE8B0A@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM
====================

Link: https://lore.kernel.org/r/GV1PR10MB6563BEFEA4269E1DDBC264B1E8BBA@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in vmlinux

vmlinux.c uses the `CHECK` calls even though the use of ASSERT_ series
of macros is preferred in the bpf selftests.

This patch replaces all `CHECK` calls for equivalent `ASSERT_`
macro calls.

Signed-off-by: Yuran Pereira <yuran.pereira@hotmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/GV1PR10MB6563ED1023A2A3AEF30BDA5DE8BBA@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM

selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_obj_id

bpf_obj_id uses the `CHECK` calls even though the use of
ASSERT_ series of macros is preferred in the bpf selftests.

This patch replaces all `CHECK` calls for equivalent `ASSERT_`
macro calls.

Signed-off-by: Yuran Pereira <yuran.pereira@hotmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/GV1PR10MB65639AA3A10B4BBAA79952C7E8BBA@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM

selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bind_perm

bind_perm uses the `CHECK` calls even though the use of
ASSERT_ series of macros is preferred in the bpf selftests.

This patch replaces all `CHECK` calls for equivalent `ASSERT_`
macro calls.

Signed-off-by: Yuran Pereira <yuran.pereira@hotmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/GV1PR10MB656314F467E075A106CA02BFE8BBA@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM

selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_tcp_ca

bpf_tcp_ca uses the `CHECK` calls even though the use of
ASSERT_ series of macros is preferred in the bpf selftests.

This patch replaces all `CHECK` calls for equivalent `ASSERT_`
macro calls.

Signed-off-by: Yuran Pereira <yuran.pereira@hotmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/GV1PR10MB6563F180C0F2BB4F6CFA5130E8BBA@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM

net: phylink: use for_each_set_bit()

Use for_each_set_bit() rather than open coding the for() test_bit()
loop.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://lore.kernel.org/r/E1r4p15-00Cpxe-C7@rmk-PC.armlinux.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: reduce dma ring display code duplication

The code to show extended descriptor is identical to normal one.
Consolidate the code to remove duplication.

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Link: https://lore.kernel.org/r/a2a5c5ce9338bdea60ec71d7eeb00fe757281557.1700372381.git.baruch@tkos.co.il
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: remove extra newline from descriptors display

One newline per line should be enough. Reduce the verbosity of
descriptors dump.

Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Link: https://lore.kernel.org/r/444f3b1dd409fdb14ed2a1ae7679a86b110dadcd.1700372381.git.baruch@tkos.co.il
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bonding: return -ENOMEM instead of BUG in alb_upper_dev_walk

If failed to allocate "tags" or could not find the final upper device from
start_dev's upper list in bond_verify_device_path(), only the loopback
detection of the current upper device should be affected, and the system is
no need to be panic.
So return -ENOMEM in alb_upper_dev_walk to stop walking, print some warn
information when failed to allocate memory for vlan tags in
bond_verify_device_path.

I also think that the following function calls
netdev_walk_all_upper_dev_rcu
---->>>alb_upper_dev_walk
---------->>>bond_verify_device_path
From this way, "end device" can eventually be obtained from "start device"
in bond_verify_device_path, IS_ERR(tags) could be instead of
IS_ERR_OR_NULL(tags) in alb_upper_dev_walk.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Link: https://lore.kernel.org/r/20231118081653.1481260-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

octeon_ep: support Octeon CN10K devices

Add PCI Endpoint NIC support for Octeon CN10K devices.
CN10K devices are part of Octeon 10 family products with
similar PCI NIC characteristics. These include:
- CN10KA
- CNF10KA
- CNF10KB
- CN10KB

Update supported device list in Documentation

Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
Link: https://lore.kernel.org/r/20231117103817.2468176-1-srasheed@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ethernet: mtk_wed: add support for devices with more than 4GB of dram

Introduce WED offloading support for boards with more than 4GB of
memory.

Co-developed-by: Sujuan Chen <sujuan.chen@mediatek.com>
Signed-off-by: Sujuan Chen <sujuan.chen@mediatek.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/1c7efdf5d384ea7af3c0209723e40b2ee0f956bf.1700239272.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-tc-testing-more-updates-to-tdc'

Pedro Tammela says:

====================
selftests: tc-testing: more updates to tdc

Address the issues making tdc timeout on downstream CIs like lkp and
tuxsuite.
====================

Link: https://lore.kernel.org/r/20231117171208.2066136-1-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tc-testing: report number of workers in use

Report the number of workers in use to process the test batches.
Since the number is now subject to a limit, avoid users getting
confused.

Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20231117171208.2066136-7-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tc-testing: timeout on unbounded loops

In the spirit of failing early, timeout on unbounded loops that take
longer than 20 ticks to complete. Such loops are to ensure that objects
created are already visible so tests can proceed without any issues.

If a test setup takes more than 20 ticks to see an object, there's
definetely something wrong.

Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20231117171208.2066136-6-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tc-testing: leverage -all in suite ns teardown

Instead of listing lingering ns pinned files and delete them one by one, leverage '-all'
from iproute2 to do it in a single process fork.

Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20231117171208.2066136-5-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tc-testing: use netns delete from pyroute2

When pyroute2 is available, use the native netns delete routine instead
of calling iproute2 to do it. As forks are expensive with some kernel
configs, minimize its usage to avoid kselftests timeouts.

Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20231117171208.2066136-4-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tc-testing: move back to per test ns setup

Surprisingly in kernel configs with most of the debug knobs turned on,
pre-allocating the test resources makes tdc run much slower overall than
when allocating resources on a per test basis.

As these knobs are used in kselftests in downstream CIs, let's go back
to the old way of doing things to avoid kselftests timeouts.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202311161129.3b45ed53-oliver.sang@intel.com
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20231117171208.2066136-3-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tc-testing: cap parallel tdc to 4 cores

We have observed a lot of lock contention and test instability when running with >8 cores.
Enough to actually make the tests run slower than with fewer cores.

Cap the maximum cores of parallel tdc to 4 which showed in testing to
be a reasonable number for efficiency and stability in different kernel
config scenarios.

Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20231117171208.2066136-2-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'nfp-add-flow-steering-support'

Louis Peens says:

====================
nfp: add flow-steering support

This short series adds flow steering support for the nfp driver.
The first patch adds the part to communicate with ethtool but
stubs out the HW offload parts. The second patch implements the
HW communication and offloads flow steering.

After this series the user can now use 'ethtool -N/-n' to configure
and display rx classification rules.
====================

Link: https://lore.kernel.org/r/20231117071114.10667-1-louis.peens@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nfp: offload flow steering to the nfp

This is the second part to implement flow steering. Mailbox is used
for the communication between driver and HW.

Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231117071114.10667-3-louis.peens@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nfp: add ethtool flow steering callbacks

This is the first part to implement flow steering. The communication
between ethtool and driver is done. User can use following commands
to display and set flows:

ethtool -n <netdev>
ethtool -N <netdev> flow-type ...

Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231117071114.10667-2-louis.peens@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-axienet-introduce-dmaengine'

Radhey Shyam Pandey says:

====================
net: axienet: Introduce dmaengine

The axiethernet driver can use the dmaengine framework to communicate
with the xilinx DMAengine driver(AXIDMA, MCDMA). The inspiration behind
this dmaengine adoption is to reuse the in-kernel xilinx dma engine
driver[1] and remove redundant dma programming sequence[2] from the
ethernet driver. This simplifies the ethernet driver and also makes
it generic to be hooked to any complaint dma IP i.e AXIDMA, MCDMA
without any modification.

The dmaengine framework was extended for metadata API support during
the axidma RFC[3] discussion. However, it still needs further
enhancements to make it well suited for ethernet usecases.

Comments, suggestions, thoughts to implement remaining functional
features are very welcome!

[1]: https://github.com/torvalds/linux/blob/master/drivers/dma/xilinx/xilinx_dma.c
[2]: https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/xilinx/xilinx_axienet_main.c#L238
[3]: http://lkml.iu.edu/hypermail/linux/kernel/1804.0/00367.html
[4]: https://lore.kernel.org/all/20221124102745.2620370-1-sarath.babu.naidu.gaddam@amd.com
====================

Link: https://lore.kernel.org/r/1700074613-1977070-1-git-send-email-radhey.shyam.pandey@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: axienet: Introduce dmaengine support

Add dmaengine framework to communicate with the xilinx DMAengine
driver(AXIDMA).

Axi ethernet driver uses separate channels for transmit and receive.
Add support for these channels to handle TX and RX with skb and
appropriate callbacks. Also add axi ethernet core interrupt for
dmaengine framework support.

The dmaengine framework was extended for metadata API support.
However it still needs further enhancements to make it well suited for
ethernet usecases. The ethernet features i.e ethtool set/get of DMA IP
properties, ndo_poll_controller,(mentioned in TODO) are not supported
and it requires follow-up discussions.

dmaengine support has a dependency on xilinx_dma as it uses
xilinx_vdma_channel_set_config() API to reset the DMA IP
which internally reset MAC prior to accessing MDIO.

Benchmark with netperf:

xilinx-zcu102-20232:~$ netperf -H 192.168.10.20 -t TCP_STREAM
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 192.168.10.20 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    10.02     886.69

xilinx-zcu102-20232:~$ netperf -H 192.168.10.20 -t UDP_STREAM
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 192.168.10.20 () port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   10.00       15851      0     830.66
212992           10.00       15851            830.66

Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Link: https://lore.kernel.org/r/1700074613-1977070-4-git-send-email-radhey.shyam.pandey@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: axienet: Preparatory changes for dmaengine support

The axiethernet driver has inbuilt dma programming. In order to add
dmaengine support and make it's integration seamless the current axidma
inbuilt programming code is put under use_dmaengine check.

It also performs minor code reordering to minimize conditional
use_dmaengine checks and there is no functional change. It uses
"dmas" property to identify whether it should use a dmaengine
framework or inbuilt axidma programming.

Signed-off-by: Sarath Babu Naidu Gaddam <sarath.babu.naidu.gaddam@amd.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Link: https://lore.kernel.org/r/1700074613-1977070-3-git-send-email-radhey.shyam.pandey@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: xlnx,axi-ethernet: Introduce DMA support

Xilinx 1G/2.5G Ethernet Subsystem provides 32-bit AXI4-Stream buses to
move transmit and receive Ethernet data to and from the subsystem.

These buses are designed to be used with an AXI Direct Memory Access(DMA)
IP or AXI Multichannel Direct Memory Access (MCDMA) IP core, AXI4-Stream
Data FIFO, or any other custom logic in any supported device.

Primary high-speed DMA data movement between system memory and stream
target is through the AXI4 Read Master to AXI4 memory-mapped to stream
(MM2S) Master, and AXI stream to memory-mapped (S2MM) Slave to AXI4
Write Master. AXI DMA/MCDMA enables channel of data movement on both
MM2S and S2MM paths in scatter/gather mode.

AXI DMA has two channels where as MCDMA has 16 Tx and 16 Rx channels.
To uniquely identify each channel use 'chan' suffix. Depending on the
usecase AXI ethernet driver can request any combination of multichannel
DMA channels using generic dmas, dma-names properties.

Example:
dma-names = tx_chan0, rx_chan0, tx_chan1, rx_chan1;

Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/1700074613-1977070-2-git-send-email-radhey.shyam.pandey@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: verify fq per-band packet limit

Commit 29f834aa326e ("net_sched: sch_fq: add 3 bands and WRR
scheduling") introduces multiple traffic bands, and per-band maximum
packet count.

Per-band limits ensures that packets in one class cannot fill the
entire qdisc and so cause DoS to the traffic in the other classes.

Verify this behavior:
  1. set the limit to 10 per band
  2. send 20 pkts on band A: verify that 10 are queued, 10 dropped
  3. send 20 pkts on band A: verify that  0 are queued, 20 dropped
  4. send 20 pkts on band B: verify that 10 are queued, 10 dropped

Packets must remain queued for a period to trigger this behavior.
Use SO_TXTIME to store packets for 100 msec.

The test reuses existing upstream test infra. The script is a fork of
cmsg_time.sh. The scripts call cmsg_sender.

The test extends cmsg_sender with two arguments:

* '-P' SO_PRIORITY
  There is a subtle difference between IPv4 and IPv6 stack behavior:
  PF_INET/IP_TOS        sets IP header bits and sk_priority
  PF_INET6/IPV6_TCLASS  sets IP header bits BUT NOT sk_priority

* '-n' num pkts
  Send multiple packets in quick succession.
  I first attempted a for loop in the script, but this is too slow in
  virtualized environments, causing flakiness as the 100ms timeout is
  reached and packets are dequeued.

Also do not wait for timestamps to be queued unless timestamps are
requested.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231116203449.2627525-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: microchip: lan743x : bidirectional throughput improvement

The LAN743x/PCI11xxx DMA descriptors are always 4 dwords long, but the
device supports placing the descriptors in memory back to back or
reserving space in between them using its DMA_DESCRIPTOR_SPACE (DSPACE)
configurable hardware setting. Currently DSPACE is unnecessarily set to
match the host's L1 cache line size, resulting in space reserved in
between descriptors in most platforms and causing a suboptimal behavior
(single PCIe Mem transaction per descriptor). By changing the setting
to DSPACE=16 many descriptors can be packed in a single PCIe Mem
transaction resulting in a massive performance improvement in
bidirectional tests without any negative effects.
Tested and verified improvements on x64 PC and several ARM platforms
(typical data below)

Test setup 1: x64 PC with LAN7430 ---> x64 PC

iperf3 UDP bidirectional with DSPACE set to L1 CACHE Size:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval           Transfer     Bitrate
[  5][TX-C]   0.00-10.00  sec   170 MBytes   143 Mbits/sec  sender
[  5][TX-C]   0.00-10.04  sec   169 MBytes   141 Mbits/sec  receiver
[  7][RX-C]   0.00-10.00  sec  1.02 GBytes   876 Mbits/sec  sender
[  7][RX-C]   0.00-10.04  sec  1.02 GBytes   870 Mbits/sec  receiver

iperf3 UDP bidirectional with DSPACE set to 16 Bytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval           Transfer     Bitrate
[  5][TX-C]   0.00-10.00  sec  1.11 GBytes   956 Mbits/sec  sender
[  5][TX-C]   0.00-10.04  sec  1.11 GBytes   951 Mbits/sec  receiver
[  7][RX-C]   0.00-10.00  sec  1.10 GBytes   948 Mbits/sec  sender
[  7][RX-C]   0.00-10.04  sec  1.10 GBytes   942 Mbits/sec  receiver

Test setup 2 : RK3399 with LAN7430 ---> x64 PC

RK3399 Spec:
The SOM-RK3399 is ARM module designed and developed by FriendlyElec.
Cores: 64-bit Dual Core Cortex-A72 + Quad Core Cortex-A53
Frequency: Cortex-A72(up to 2.0GHz), Cortex-A53(up to 1.5GHz)
PCIe: PCIe x4, compatible with PCIe 2.1, Dual operation mode

iperf3 UDP bidirectional with DSPACE set to L1 CACHE Size:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval           Transfer     Bitrate
[  5][TX-C]   0.00-10.00  sec   534 MBytes   448 Mbits/sec  sender
[  5][TX-C]   0.00-10.05  sec   534 MBytes   446 Mbits/sec  receiver
[  7][RX-C]   0.00-10.00  sec  1.12 GBytes   961 Mbits/sec  sender
[  7][RX-C]   0.00-10.05  sec  1.11 GBytes   946 Mbits/sec  receiver

iperf3 UDP bidirectional with DSPACE set to 16 Bytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval           Transfer     Bitrate
[  5][TX-C]   0.00-10.00  sec   966 MBytes   810 Mbits/sec   sender
[  5][TX-C]   0.00-10.04  sec   965 MBytes   806 Mbits/sec   receiver
[  7][RX-C]   0.00-10.00  sec  1.11 GBytes   956 Mbits/sec   sender
[  7][RX-C]   0.00-10.04  sec  1.07 GBytes   919 Mbits/sec   receiver

Signed-off-by: Vishvambar Panth S <vishvambarpanth.s@microchip.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20231116054350.620420-1-vishvambarpanth.s@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/bpf: reduce verboseness of reg_bounds selftest logs

Reduce verboseness of test_progs' output in reg_bounds set of tests with
two changes.

First, instead of each different operator (<, <=, >, ...) being it's own
subtest, combine all different ops for the same (x, y, init_t, cond_t)
values into single subtest. Instead of getting 6 subtests, we get one
generic one, e.g.:

#192/53 reg_bounds_crafted/(s64)[0xffffffffffffffff; 0] (s64)<op> 0xffffffff00000000:OK

Second, for random generated test cases, treat all of them as a single
test to eliminate very verbose output with random values in them. So now
we'll just get one line per each combination of (init_t, cond_t),
instead of 6 x 25 = 150 subtests before this change:

#225 reg_bounds_rand_consts_s32_s32:OK

Given we reduce verboseness so much, it makes sense to do a bit more
random testing, so we also bump default number of random tests to 100,
up from 25. This doesn't increase runtime significantly, especially in
parallelized mode.

With all the above changes we still make sure that we have all the
information necessary for reproducing test case if it happens to fail.
That includes reporting random seed and specific operator that is
failing. Those will only be printed to console if related test/subtest
fails, so it doesn't have any added verboseness implications.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231120180452.145849-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

net: ethernet: mtk_wed: rely on __dev_alloc_page in mtk_wed_tx_buffer_alloc

Simplify the code and use __dev_alloc_page() instead of __dev_alloc_pages()
with order 0 in mtk_wed_tx_buffer_alloc routine

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'am65-cpsw-ethtool-mac-stats'

Roger Quadros says:

===================
net: eth: am65-cpsw: add ethtool MAC stats

Gets 'ethtool -S eth0 --groups eth-mac' command to work.

Also set default TX channels to maximum available and does
cleanup in am65_cpsw_nuss_common_open() error path.

Changelog:
v2:
- add __iomem to *stats, to prevent sparse warning
- clean up RX descriptors and free up SKB in error handling of
am65_cpsw_nuss_common_open()
- Re-arrange some funcitons to avoid forward declaration
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: am65-cpsw: Fix error handling in am65_cpsw_nuss_common_open()

k3_udma_glue_enable_rx/tx_chn returns error code on failure.
Bail out on error while enabling TX/RX channel.

In the error path, clean up the RX descriptors and SKBs.
Get rid of kmemleak_not_leak() as it seems unnecessary now.

Fixes: 93a76530316a ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: am65-cpsw: Set default TX channels to maximum

am65-cpsw supports 8 TX hardware queues. Set this as default.

The rationale is that some am65-cpsw devices can have up to 4 ethernet
ports. If the number of TX channels have to be changed then all
interfaces have to be brought down and up as the old default of 1
TX channel is too restrictive for any mqprio/taprio usage.

Another reason for this change is to allow testing using
kselftest:net/forwarding:ethtool_mm.sh out of the box.

Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: am65-cpsw: Re-arrange functions to avoid forward declaration

Re-arrange am65_cpsw_nuss_rx_cleanup(), am65_cpsw_nuss_xmit_free() and
am65_cpsw_nuss_tx_cleanup() to avoid forward declaration.

No functional change.

Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: am65-cpsw: Add standard Ethernet MAC stats to ethtool

Gets 'ethtool -S eth0 --groups eth-mac' command to work.

Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bpf-kernel-bpf-task_iter-c-don-t-abuse-next_thread'

Oleg Nesterov says:

====================
bpf: kernel/bpf/task_iter.c: don't abuse next_thread()

Compile tested.

Every lockless usage of next_thread() was wrong, bpf/task_iter.c is
the last user and is no exception.

====================

Link: https://lore.kernel.org/r/20231114163211.GA874@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)

This looks more clear and simplifies the code. While at it, remove the
unnecessary initialization of pos/task at the start of bpf_iter_task_new().

Note that we can even kill kit->task, we can just use pos->group_leader,
but I don't understand the BUILD_BUG_ON() checks in bpf_iter_task_new().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163239.GA903@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()

Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c
is the last user and the usage is wrong.

bpf_iter_task_next() can loop forever, "kit->pos == kit->task" can never
happen if kit->pos execs. Change this code to use __next_thread().

With or without this change the usage of kit->pos/task and next_task()
doesn't look nice, see the next patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163237.GA897@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()

Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c
is the last user and the usage is wrong.

task_group_seq_get_next() can return the group leader twice if it races
with mt-thread exec which changes the group->leader's pid.

Change the main loop to use __next_thread(), kill "next_tid == common->pid"
check.

__next_thread() can't loop forever, we can also change this code to retry
if next_tid == 0.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163234.GA890@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rtnetlink: introduce nlmsg_new_large and use it in rtnl_getlink

if a PF has 256 or more VFs, ip link command will allocate an order 3
memory or more, and maybe trigger OOM due to memory fragment,
the VFs needed memory size is computed in rtnl_vfinfo_size.

so introduce nlmsg_new_large which calls netlink_alloc_large_skb in
which vmalloc is used for large memory, to avoid the failure of
allocating memory

    ip invoked oom-killer: gfp_mask=0xc2cc0(GFP_KERNEL|__GFP_NOWARN|\
__GFP_COMP|__GFP_NOMEMALLOC), order=3, oom_score_adj=0
    CPU: 74 PID: 204414 Comm: ip Kdump: loaded Tainted: P           OE
    Call Trace:
    dump_stack+0x57/0x6a
    dump_header+0x4a/0x210
    oom_kill_process+0xe4/0x140
    out_of_memory+0x3e8/0x790
    __alloc_pages_slowpath.constprop.116+0x953/0xc50
    __alloc_pages_nodemask+0x2af/0x310
    kmalloc_large_node+0x38/0xf0
    __kmalloc_node_track_caller+0x417/0x4d0
    __kmalloc_reserve.isra.61+0x2e/0x80
    __alloc_skb+0x82/0x1c0
    rtnl_getlink+0x24f/0x370
    rtnetlink_rcv_msg+0x12c/0x350
    netlink_rcv_skb+0x50/0x100
    netlink_unicast+0x1b2/0x280
    netlink_sendmsg+0x355/0x4a0
    sock_sendmsg+0x5b/0x60
    ____sys_sendmsg+0x1ea/0x250
    ___sys_sendmsg+0x88/0xd0
    __sys_sendmsg+0x5e/0xa0
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f95a65a5b70

Cc: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/r/20231115120108.3711-1-lirongqing@baidu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch '100GbE' of git://git./linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
ice: one by one port representors creation

Michal Swiatkowski says:

Currently ice supports creating port representors only for VFs. For that
use case they can be created and removed in one step.

This patchset is refactoring current flow to support port representor
creation also for subfunctions and SIOV. In this case port representors
need to be created and removed one by one. Also, they can be added and
removed while other port representors are running.

To achieve that we need to change the switchdev configuration flow.
Three first patches are only cosmetic (renaming, removing not used code).
Next few ones are preparation for new flow. The most important one
is "add VF representor one by one". It fully implements new flow.

New type of port representor (for subfunction) will be introduced in
follow up patchset.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: reserve number of CP queues
  ice: adjust switchdev rebuild path
  ice: add VF representors one by one
  ice: realloc VSI stats arrays
  ice: set Tx topology every time new repr is added
  ice: allow changing SWITCHDEV_CTRL VSI queues
  ice: return pointer to representor
  ice: make representor code generic
  ice: remove VF pointer reference in eswitch code
  ice: track port representors in xarray
  ice: use repr instead of vf->repr
  ice: track q_id in representor
  ice: remove unused control VSI parameter
  ice: remove redundant max_vsi_num variable
  ice: rename switchdev to eswitch
====================

Link: https://lore.kernel.org/r/20231114181449.1290117-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch '1GbE' of git://git./linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
igc: Add support for physical + free-running timers

Vinicius Costa Gomes says:

The objective is to allow having functionality that depends on the
physical timer (taprio and ETF offloads, for example) and vclocks
operating together.

The "big" missing piece is the implementation of the .getcyclesx64()
function in igc, as i225/i226 have multiple timers, we use one of
those timers (timer 1) as a free-running (non adjustable) timer.

The complication is that only implementing .getcyclesx64() and nothing
else will break synchronization when using vclocks, as reading the clock
will retrieve the free-running value but timnestamps will come from the
adjustable timer. The solution is to modify "in one go" the timestamping
code to be able to retrieve the timestamp from the correct timer (if a
socket is "phc_bound" to a vclock the timestamp will come from the
free-running timer).

I was debating whether or not to do the adjustments for the internal latencies
for the free-running timestamps, decided to do the adjustments so the path
delay when using vclocks is similar to the one when using the physical clock.

One future improvement is to implement the .getcrosscycles() function.

* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
igc: Add support for PTP .getcyclesx64()
igc: Simplify setting flags in the TX data descriptor
====================

Link: https://lore.kernel.org/r/20231114183640.1303163-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-sched-cls_u32-use-proper-refcounts'

Pedro Tammela says:

====================
net/sched: cls_u32: use proper refcounts

In u32 we are open coding refcounts of hashtables with integers which is
far from ideal. Update those with proper refcount and add a couple of
tests to tdc that exercise the refcounts explicitly.
====================

Link: https://lore.kernel.org/r/20231114141856.974326-1-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/tc-testing: add hashtable tests for u32

Add tests to specifically check for the refcount interactions of
hashtables created by u32. These tables should not be deleted when
referenced and the flush order should respect a tree like composition.

Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20231114141856.974326-3-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: cls_u32: replace int refcounts with proper refcounts

Proper refcounts will always warn splat when something goes wrong,
be it underflow, saturation or object resurrection. As these are always
a source of bugs, use it in cls_u32 as a safeguard to prevent/catch issues.
Another benefit is that the refcount API self documents the code, making
clear when transitions to dead are expected.

For such an update we had to make minor adaptations on u32 to fit the refcount
API. First we set explicitly to '1' when objects are created, then the
objects are alive until a 1 -> 0 happens, which is then released appropriately.

The above made clear some redundant operations in the u32 code
around the root_ht handling that were removed. The root_ht is created
with a refcnt set to 1. Then when it's associated with tcf_proto it increments the refcnt to 2.
Throughout the entire code the root_ht is an exceptional case and can never be referenced,
therefore the refcnt never incremented/decremented.
Its lifetime is always bound to tcf_proto, meaning if you delete tcf_proto
the root_ht is deleted as well. The code made up for the fact that root_ht refcnt is 2 and did
a double decrement to free it, which is not a fit for the refcount API.

Even though refcount_t is implemented using atomics, we should observe
a negligible control plane impact.

Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20231114141856.974326-2-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: partial revert of the "Make timestamping selectable: series

Revert following commits:

commit acec05fb78ab ("net_tstamp: Add TIMESTAMPING SOFTWARE and HARDWARE mask")
commit 11d55be06df0 ("net: ethtool: Add a command to expose current time stamping layer")
commit bb8645b00ced ("netlink: specs: Introduce new netlink command to get current timestamp")
commit d905f9c75329 ("net: ethtool: Add a command to list available time stamping layers")
commit aed5004ee7a0 ("netlink: specs: Introduce new netlink command to list available time stamping layers")
commit 51bdf3165f01 ("net: Replace hwtstamp_source by timestamping layer")
commit 0f7f463d4821 ("net: Change the API of PHY default timestamp to MAC")
commit 091fab122869 ("net: ethtool: ts: Update GET_TS to reply the current selected timestamp")
commit 152c75e1d002 ("net: ethtool: ts: Let the active time stamping layer be selectable")
commit ee60ea6be0d3 ("netlink: specs: Introduce time stamping set command")

They need more time for reviews.

Link: https://lore.kernel.org/all/20231118183529.6e67100c@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: improve RTL8411b phy-down fixup

Mirsad proposed a patch to reduce the number of spinlock lock/unlock
operations and the function code size. This can be further improved
because the function sets a consecutive register block.

Suggested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bpf-verifier-log-improvements'

Andrii Nakryiko says:

====================
BPF verifier log improvements

This patch set moves a big chunk of verifier log related code from gigantic
verifier.c file into more focused kernel/bpf/log.c. This is not essential to
the rest of functionality in this patch set, so I can undo it, but it felt
like it's good to start chipping away from 20K+ verifier.c whenever we can.

The main purpose of the patch set, though, is in improving verifier log
further.

Patches #3-#4 start printing out register state even if that register is
spilled into stack slot. Previously we'd get only spilled register type, but
no additional information, like SCALAR_VALUE's ranges. Super limiting during
debugging. For cases of register spills smaller than 8 bytes, we also print
out STACK_MISC/STACK_ZERO/STACK_INVALID markers. This, among other things,
will make it easier to write tests for these mixed spill/misc cases.

Patch #5 prints map name for PTR_TO_MAP_VALUE/PTR_TO_MAP_KEY/CONST_PTR_TO_MAP
registers. In big production BPF programs, it's important to map assembly to
actual map, and it's often non-trivial. Having map name helps.

Patch #6 just removes visual noise in form of ubiquitous imm=0 and off=0. They
are default values, omit them.

Patch #7 is probably the most controversial, but it reworks how verifier log
prints numbers. For small valued integers we use decimals, but for large ones
we switch to hexadecimal. From personal experience this is a much more useful
convention. We can tune what consitutes "small value", for now it's 16-bit
range.

Patch #8 prints frame number for PTR_TO_CTX registers, if that frame is
different from the "current" one. This removes ambiguity and confusion,
especially in complicated cases with multiple subprogs passing around
pointers.

v2->v3:
  - adjust reg_bounds tester to parse hex form of reg state as well;
  - print reg->range as unsigned (Alexei);
v1->v2:
  - use verbose_snum() for range and offset in register state (Eduard);
  - fixed typos and added acks from Eduard and Stanislav.
====================

Link: https://lore.kernel.org/r/20231118034623.3320920-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: emit frameno for PTR_TO_STACK regs if it differs from current one

It's possible to pass a pointer to parent's stack to child subprogs. In
such case verifier state output is ambiguous not showing whether
register container a pointer to "current" stack, belonging to current
subprog (frame), or it's actually a pointer to one of parent frames.

So emit this information if frame number differs between the state which
register is part of. E.g., if current state is in frame 2 and it has
a register pointing to stack in grand parent state (frame #0), we'll see
something like 'R1=fp[0]-16', while "local stack pointer" will be just
'R2=fp-16'.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: smarter verifier log number printing logic

Instead of always printing numbers as either decimals (and in some
cases, like for "imm=%llx", in hexadecimals), decide the form based on
actual values. For numbers in a reasonably small range (currently,
[0, U16_MAX] for unsigned values, and [S16_MIN, S16_MAX] for signed ones),
emit them as decimals. In all other cases, even for signed values,
emit them in hexadecimals.

For large values hex form is often times way more useful: it's easier to
see an exact difference between 0xffffffff80000000 and 0xffffffff7fffffff,
than between 18446744071562067966 and 18446744071562067967, as one
particular example.

Small values representing small pointer offsets or application
constants, on the other hand, are way more useful to be represented in
decimal notation.

Adjust reg_bounds register state parsing logic to take into account this
change.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: omit default off=0 and imm=0 in register state log

Simplify BPF verifier log further by omitting default (and frequently
irrelevant) off=0 and imm=0 parts for non-SCALAR_VALUE registers. As can
be seen from fixed tests, this is often a visual noise for PTR_TO_CTX
register and even for PTR_TO_PACKET registers.

Omitting default values follows the rest of register state logic: we
omit default values to keep verifier log succinct and to highlight
interesting state that deviates from default one. E.g., we do the same
for var_off, when it's unknown, which gives no additional information.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: emit map name in register state if applicable and available

In complicated real-world applications, whenever debugging some
verification error through verifier log, it often would be very useful
to see map name for PTR_TO_MAP_VALUE register. Usually this needs to be
inferred from key/value sizes and maybe trying to guess C code location,
but it's not always clear.

Given verifier has the name, and it's never too long, let's just emit it
for ptr_to_map_key, ptr_to_map_value, and const_ptr_to_map registers. We
reshuffle the order a bit, so that map name, key size, and value size
appear before offset and immediate values, which seems like a more
logical order.

Current output:

R1_w=map_ptr(map=array_map,ks=4,vs=8,off=0,imm=0)

But we'll get rid of useless off=0 and imm=0 parts in the next patch.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: print spilled register state in stack slot

Print the same register state representation when printing stack state,
as we do for normal registers. Note that if stack slot contains
subregister spill (1, 2, or 4 byte long), we'll still emit "m0?" mask
for those bytes that are not part of spilled register.

While means we can get something like fp-8=0000scalar() for a 4-byte
spill with other 4 bytes still being STACK_ZERO.

Some example before and after, taken from the log of
pyperf_subprogs.bpf.o:

49: (7b) *(u64 *)(r10 -256) = r1      ; frame1: R1_w=ctx(off=0,imm=0) R10=fp0 fp-256_w=ctx
49: (7b) *(u64 *)(r10 -256) = r1      ; frame1: R1_w=ctx(off=0,imm=0) R10=fp0 fp-256_w=ctx(off=0,imm=0)

150: (7b) *(u64 *)(r10 -264) = r0     ; frame1: R0_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) R10=fp0 fp-264_w=map_value_or_null
150: (7b) *(u64 *)(r10 -264) = r0     ; frame1: R0_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) R10=fp0 fp-264_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0)

5192: (61) r1 = *(u32 *)(r10 -272)    ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) R10=fp0 fp-272=
5192: (61) r1 = *(u32 *)(r10 -272)    ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) R10=fp0 fp-272=????scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf))

While at it, do a few other simple clean ups:
  - skip slot if it's not scratched before detecting whether it's valid;
  - move taking spilled_reg pointer outside of switch (only DYNPTR has
    to adjust that to get to the "main" slot);
  - don't recalculate types_buf second time for MISC/ZERO/default case.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: extract register state printing

Extract printing register state representation logic into a separate
helper, as we are going to reuse it for spilled register state printing
in the next patch. This also nicely reduces code nestedness.

No functional changes.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: move verifier state printing code to kernel/bpf/log.c

Move a good chunk of code from verifier.c to log.c: verifier state
verbose printing logic. This is an important and very much
logging/debugging oriented code. It fits the overlall log.c's focus on
verifier logging, and moving it allows to keep growing it without
unnecessarily adding to verifier.c code that otherwise contains a core
verification logic.

There are not many shared dependencies between this code and the rest of
verifier.c code, except a few single-line helpers for various register
type checks and a bit of state "scratching" helpers. We move all such
trivial helpers into include/bpf/bpf_verifier.h as static inlines.

No functional changes in this patch.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: move verbose_linfo() into kernel/bpf/log.c

verifier.c is huge. Let's try to move out parts that are logging-related
into log.c, as we previously did with bpf_log() and other related stuff.
This patch moves line info verbose output routines: it's pretty
self-contained and isolated code, so there is no problem with this.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge tag 'mlx5-updates-2023-11-13' of git://git./linux/kernel/git/saeed/linux

mlx5-updates-2023-11-13

1) Cleanup patches, leftovers from previous cycle

2) Allow sync reset flow when BF MGT interface device is present

3) Trivial ptp refactorings and improvements

4) Add local loopback counter to vport rep stats

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'batadv-next-pullrequest-20231115' of git://git.open-mesh.org/linux-merge

This feature/cleanup patchset includes the following patches:

- bump version strings, by Simon Wunderlich

- Implement new multicast packet type, including its transmission,
forwarding and parsing, by Linus Lüssing (3 patches)

- Switch to new headers for sprintf and array size,
by Sven Eckelmann (2 patches)

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-new-reset-flow'

Petr Machata says:

====================
mlxsw: Add support for new reset flow

Ido Schimmel writes:

This patchset changes mlxsw to issue a PCI reset during probe and
devlink reload so that the PCI firmware could be upgraded without a
reboot.

Unlike the old version of this patchset [1], in this version the driver
no longer tries to issue a PCI reset by triggering a PCI link toggle on
its own, but instead calls the PCI core to issue the reset.

The PCI APIs require the device lock to be held which is why patches

Patches #7 adds reset method quirk for NVIDIA Spectrum devices.

Patch #8 adds a debug level print in PCI core so that device ready delay
will be printed even if it is shorter than one second.

Patches #9-#11 are straightforward preparations in mlxsw.

Patch #12 finally implements the new reset flow in mlxsw.

Patch #13 adds PCI reset handlers in mlxsw to avoid user space from
resetting the device from underneath an unaware driver. Instead, the
driver is gracefully de-initialized before the PCI reset and then
initialized again after it.

Patch #14 adds a PCI reset selftest to make sure this code path does not
regress.

[1] https://lore.kernel.org/netdev/cover.1679502371.git.petrm@nvidia.com/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>