Quan Tian [Wed, 6 Mar 2024 17:24:02 +0000 (01:24 +0800)]
netfilter: nf_tables: Fix a memory leak in nf_tables_updchain
If nft_netdev_register_hooks() fails, the memory associated with
nft_stats is not freed, causing a memory leak.
This patch fixes it by moving nft_stats_alloc() down after
nft_netdev_register_hooks() succeeds.
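The effect of the reordering can be shown with a minimal userspace C sketch. It assumes hypothetical stand-ins (stats_alloc_sketch(), register_hooks_sketch(), update_chain_sketch()) rather than the real nf_tables helpers, and uses a live-allocation counter to make a leak observable. Because the allocation happens only after the fallible registration step succeeds, that error path has nothing to free.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins: not the real nf_tables types or helpers. */
struct stats_sketch { unsigned long pkts, bytes; };

static int live_stats;   /* number of outstanding stats allocations */

static struct stats_sketch *stats_alloc_sketch(void)
{
	struct stats_sketch *s = calloc(1, sizeof(*s));
	if (s)
		live_stats++;
	return s;
}

static void stats_free_sketch(struct stats_sketch *s)
{
	if (s) {
		live_stats--;
		free(s);
	}
}

/* Simulates hook registration; fails on request. */
static int register_hooks_sketch(bool fail)
{
	return fail ? -ENOMEM : 0;
}

/* Fixed ordering: allocate stats only after registration succeeded,
 * so a registration failure cannot leak the stats object. */
static int update_chain_sketch(bool hooks_fail, struct stats_sketch **out)
{
	int err = register_hooks_sketch(hooks_fail);
	if (err < 0)
		return err;

	*out = stats_alloc_sketch();
	if (!*out)
		return -ENOMEM;  /* a complete version would also undo registration */
	return 0;
}
```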
Fixes: b9703ed44ffb ("netfilter: nf_tables: support for adding new devices to an existing netdev chain")
Signed-off-by: Quan Tian <tianquan23@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Thu, 14 Mar 2024 17:51:38 +0000 (18:51 +0100)]
netfilter: nf_tables: do not compare internal table flags on updates
Restore skipping the transaction if a table update does not modify flags.
Fixes: 179d9ba5559a ("netfilter: nf_tables: fix table flag updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 10 Mar 2024 09:02:41 +0000 (10:02 +0100)]
netfilter: nft_set_pipapo: release elements in clone only from destroy path
The clone always provides a current view of the lookup table; use it to
destroy the set, otherwise elements can be destroyed twice.
This fix requires:
212ed75dc5fb ("netfilter: nf_tables: integrate pipapo into commit protocol")
which came after:
9827a0e6e23b ("netfilter: nft_set_pipapo: release elements in clone from abort path").
Fixes: 9827a0e6e23b ("netfilter: nft_set_pipapo: release elements in clone from abort path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David S. Miller [Wed, 20 Mar 2024 10:49:08 +0000 (10:49 +0000)]
Merge branch 'octeontx2-pf-mbox-fixes'
Subbaraya Sundeep says:
====================
octeontx2-pf: RVU Mailbox fixes
This patchset fixes the problems related to RVU mailbox.
During long-running tests, VF commands such as setting the MTU or
toggling the interface sometimes fail because the VF mailbox times out
waiting for a response from the PF.
The fixes are:
Patch 1: There are two types of messages in the RVU mailbox: up and down
messages. Down messages are synchronous: a PF/VF sends a message to the
AF and the AF replies with a response. UP messages are asynchronous
notifications, such as the AF sending link events to the PF. When a VF
sends a down message to the PF, the PF forwards it to the AF and sends
the AF's response back to the VF. The PF has to forward VF messages
because there is no hardware path for a VF to send directly to the AF.
A single mailbox interrupt from AF to PF can mean one of two things:
the AF is sending a reply to a down message sent by the PF, or the AF is
asynchronously sending an up message because the PF's link changed.
Receiving the up message interrupt while the PF is in the middle of
forwarding a down message causes mailbox errors.
Fix this by having the receiver detect the message type from the mbox
data register set by the sender.
Patch 2:
During VF driver removal, the VF has to wait until the last message is
completed and only then turn off mailbox interrupts from the PF.
Patch 3:
Do not use an ordered workqueue for message processing, since multiple
works are queued simultaneously by all the VFs and by PF link UP messages.
Patch 4:
When the PF sends a link event to a VF, check whether the VF is actually
up to receive the message.
Patch 5:
In the AF driver, use separate interrupt handlers for the AF-VF and AF-PF
interrupts. Sometimes both interrupts are raised on two CPUs at the same
time, and both CPUs execute the same function concurrently, corrupting
the data.
v2 changes:
Added missing mutex unlock in error path in patch 1
Refactored if else logic in patch 1 as suggested by Paolo Abeni
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Subbaraya Sundeep [Mon, 18 Mar 2024 09:29:58 +0000 (14:59 +0530)]
octeontx2-af: Use separate handlers for interrupts
The same interrupt handler is registered for both the PF-to-AF and the
VF-to-AF interrupt vectors, which causes a race condition. When the two
interrupts are raised on two CPUs at the same time, two cores serve the
same event and corrupt the data.
Fixes: 7304ac4567bc ("octeontx2-af: Add mailbox IRQ and msg handlers")
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subbaraya Sundeep [Mon, 18 Mar 2024 09:29:57 +0000 (14:59 +0530)]
octeontx2-pf: Send UP messages to VF only when VF is up.
When the PF sends link status messages to a VF, the VF might have been
brought down by the time the link_event_task work function executes.
Hence, check whether the VF is up before sending it a link status
message.
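A minimal userspace C sketch of the check, assuming a hypothetical per-VF "up" bitmask (struct pf_sketch, vf_set_up(), send_link_event_sketch() are illustrative names, not the octeontx2 driver's). The point it shows is that the deferred work re-checks VF state when it runs, not when it was queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per VF (up to 64), set while the VF is up. */
struct pf_sketch {
	uint64_t vf_up_mask;
	int msgs_sent[64];
};

static void vf_set_up(struct pf_sketch *pf, int vf, bool up)
{
	if (up)
		pf->vf_up_mask |= UINT64_C(1) << vf;
	else
		pf->vf_up_mask &= ~(UINT64_C(1) << vf);
}

/* Deferred link-event work: check VF state at execution time, because
 * the VF may have gone down after the work was queued. */
static bool send_link_event_sketch(struct pf_sketch *pf, int vf)
{
	if (!(pf->vf_up_mask & (UINT64_C(1) << vf)))
		return false;          /* VF is down: skip the UP message */
	pf->msgs_sent[vf]++;
	return true;
}
```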
Fixes: ad513ed938c9 ("octeontx2-vf: Link event notification support")
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subbaraya Sundeep [Mon, 18 Mar 2024 09:29:56 +0000 (14:59 +0530)]
octeontx2-pf: Use default max_active works instead of one
Using only one execution context for the workqueue that handles PF and
VF mailbox communication is incorrect, since multiple works are queued
simultaneously by all the VFs and by PF link UP messages.
Hence, use the default number of execution contexts by passing zero
as max_active to alloc_workqueue(). With this fix in place, also
modify UP messages to wait until completion.
Fixes: d424b6c02415 ("octeontx2-pf: Enable SRIOV and added VF mbox handling")
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subbaraya Sundeep [Mon, 18 Mar 2024 09:29:55 +0000 (14:59 +0530)]
octeontx2-pf: Wait till detach_resources msg is complete
During VF driver removal, a message is sent to the PF to detach VF
resources, but the VF does not wait until the message is complete.
Also, mailbox interrupts need to be turned off only after the detach
resources message is complete. This patch fixes that problem.
Fixes: 05fcc9e08955 ("octeontx2-pf: Attach NIX and NPA block LFs")
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subbaraya Sundeep [Mon, 18 Mar 2024 09:29:54 +0000 (14:59 +0530)]
octeontx2: Detect the mbox up or down message via register
A single interrupt line is used to receive both up notifications and
down reply messages from AF to PF (and similarly from PF to its VFs).
The PF acts as a bridge: it forwards VF messages to the AF and sends the
responses from the AF back to the VF. When an async event such as a link
event arrives as an up message while the PF is in the middle of
forwarding a VF message, mailbox errors occur because the PF state
machine is corrupted.
Since the VF is a separate driver, possibly running in a VM, it is not
possible to serialize communication starting at the VF.
Hence, to differentiate between the message types at the PF, this patch
makes the sender set the mbox data register to distinct values for up
and down messages. The sender also checks whether the previous interrupt
has been received before triggering the current one, by waiting for the
mailbox data register to become zero.
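The handshake described above can be modelled in userspace C. This is a sketch under stated assumptions: the register is a plain volatile variable, and the encodings MBOX_DOWN_SKETCH / MBOX_UP_SKETCH and all function names are hypothetical, not the driver's real constants or register layout. The sender tags each interrupt with its type and refuses to send while the previous tag is still pending; the receiver reads the tag, clears it, and dispatches on it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encodings; the driver's real values may differ. */
#define MBOX_IDLE_SKETCH 0u
#define MBOX_DOWN_SKETCH 1u
#define MBOX_UP_SKETCH   2u

/* Models the mailbox data register shared by sender and receiver. */
struct mbox_reg_sketch {
	volatile uint32_t data;
};

/* Sender: only trigger a new interrupt once the receiver has cleared
 * the register for the previous one. A fuller version would wait with
 * a timeout here instead of failing immediately. */
static bool mbox_send_sketch(struct mbox_reg_sketch *reg, uint32_t type)
{
	if (reg->data != MBOX_IDLE_SKETCH)
		return false;          /* previous interrupt not consumed yet */
	reg->data = type;          /* tag the interrupt with its type */
	/* ...raise the interrupt here... */
	return true;
}

/* Receiver (interrupt handler): read the tag, clear it, dispatch. */
static uint32_t mbox_irq_sketch(struct mbox_reg_sketch *reg)
{
	uint32_t type = reg->data;
	reg->data = MBOX_IDLE_SKETCH;
	return type;               /* caller handles UP vs DOWN separately */
}
```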
Fixes: 5a6d7c9daef3 ("octeontx2-pf: Mailbox communication with AF")
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 20 Mar 2024 02:44:02 +0000 (19:44 -0700)]
Merge tag 'ipsec-2024-03-19' of git://git./linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2024-03-19
1) Fix possible page_pool leak triggered by esp_output.
From Dragos Tatulea.
2) Fix UDP encapsulation in software GSO path.
From Leon Romanovsky.
* tag 'ipsec-2024-03-19' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: Allow UDP encapsulation only in offload modes
net: esp: fix bad handling of pages from page_pool
====================
Link: https://lore.kernel.org/r/20240319110151.409825-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiri Pirko [Mon, 18 Mar 2024 09:19:08 +0000 (10:19 +0100)]
devlink: fix port new reply cmd type
Due to a copy-and-paste error, the port new reply fills cmd with the
wrong value, unlike all other existing port command replies and
notifications.
Fix it by filling cmd with the value DEVLINK_CMD_PORT_NEW.
I skimmed through devlink userspace implementations; none of them cares
about this cmd value.
Reported-by: Chenyuan Yang <chenyuan0y@gmail.com>
Closes: https://lore.kernel.org/all/ZfZcDxGV3tSy4qsV@cy-server/
Fixes: cd76dcd68d96 ("devlink: Support add and delete devlink port")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://lore.kernel.org/r/20240318091908.2736542-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Fri, 15 Mar 2024 22:47:10 +0000 (15:47 -0700)]
tcp: Clear req->syncookie in reqsk_alloc().
syzkaller reported a read of uninit req->syncookie. [0]
Originally, req->syncookie was used only in tcp_conn_request()
to indicate whether we need to encode a SYN cookie in the SYN+ACK, so
the field remains uninitialised in other places.
Commit 695751e31a63 ("bpf: tcp: Handle BPF SYN Cookie in
cookie_v[46]_check().") added another meaning in the ACK path:
req->syncookie is set to true if a SYN cookie is validated by a BPF
kfunc.
After that change, cookie_v[46]_check() always reads req->syncookie,
but it is not initialised in the normal SYN cookie case, as reported
by KMSAN.
Let's make sure we always initialise req->syncookie in reqsk_alloc().
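A minimal userspace C sketch of the pattern, assuming a heavily reduced, hypothetical request struct (struct reqsk_sketch and reqsk_alloc_sketch() are illustrative, not the kernel's request_sock layout). Like a slab allocation, malloc() returns uninitialised memory, so a field that some later path reads unconditionally must be written by the allocator itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced request sock; not the kernel layout. */
struct reqsk_sketch {
	unsigned int mss;
	bool syncookie;   /* written only on some paths, read on others */
};

/* malloc() memory is uninitialised, like a slab object. Any field a
 * later path may read without writing first must be set here. */
static struct reqsk_sketch *reqsk_alloc_sketch(void)
{
	struct reqsk_sketch *req = malloc(sizeof(*req));
	if (!req)
		return NULL;
	req->syncookie = false;   /* the fix: always initialise */
	return req;
}
```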
[0]:
BUG: KMSAN: uninit-value in cookie_v4_check+0x22b7/0x29e0 net/ipv4/syncookies.c:477
cookie_v4_check+0x22b7/0x29e0 net/ipv4/syncookies.c:477
tcp_v4_cookie_check net/ipv4/tcp_ipv4.c:1855 [inline]
tcp_v4_do_rcv+0xb17/0x10b0 net/ipv4/tcp_ipv4.c:1914
tcp_v4_rcv+0x4ce4/0x5420 net/ipv4/tcp_ipv4.c:2322
ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x332/0x500 net/ipv4/ip_input.c:233
NF_HOOK include/linux/netfilter.h:314 [inline]
ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254
dst_input include/net/dst.h:460 [inline]
ip_rcv_finish+0x4a2/0x520 net/ipv4/ip_input.c:449
NF_HOOK include/linux/netfilter.h:314 [inline]
ip_rcv+0xcd/0x380 net/ipv4/ip_input.c:569
__netif_receive_skb_one_core net/core/dev.c:5538 [inline]
__netif_receive_skb+0x319/0x9e0 net/core/dev.c:5652
process_backlog+0x480/0x8b0 net/core/dev.c:5981
__napi_poll+0xe7/0x980 net/core/dev.c:6632
napi_poll net/core/dev.c:6701 [inline]
net_rx_action+0x89d/0x1820 net/core/dev.c:6813
__do_softirq+0x1c0/0x7d7 kernel/softirq.c:554
do_softirq+0x9a/0x100 kernel/softirq.c:455
__local_bh_enable_ip+0x9f/0xb0 kernel/softirq.c:382
local_bh_enable include/linux/bottom_half.h:33 [inline]
rcu_read_unlock_bh include/linux/rcupdate.h:820 [inline]
__dev_queue_xmit+0x2776/0x52c0 net/core/dev.c:4362
dev_queue_xmit include/linux/netdevice.h:3091 [inline]
neigh_hh_output include/net/neighbour.h:526 [inline]
neigh_output include/net/neighbour.h:540 [inline]
ip_finish_output2+0x187a/0x1b70 net/ipv4/ip_output.c:235
__ip_finish_output+0x287/0x810
ip_finish_output+0x4b/0x550 net/ipv4/ip_output.c:323
NF_HOOK_COND include/linux/netfilter.h:303 [inline]
ip_output+0x15f/0x3f0 net/ipv4/ip_output.c:433
dst_output include/net/dst.h:450 [inline]
ip_local_out net/ipv4/ip_output.c:129 [inline]
__ip_queue_xmit+0x1e93/0x2030 net/ipv4/ip_output.c:535
ip_queue_xmit+0x60/0x80 net/ipv4/ip_output.c:549
__tcp_transmit_skb+0x3c70/0x4890 net/ipv4/tcp_output.c:1462
tcp_transmit_skb net/ipv4/tcp_output.c:1480 [inline]
tcp_write_xmit+0x3ee1/0x8900 net/ipv4/tcp_output.c:2792
__tcp_push_pending_frames net/ipv4/tcp_output.c:2977 [inline]
tcp_send_fin+0xa90/0x12e0 net/ipv4/tcp_output.c:3578
tcp_shutdown+0x198/0x1f0 net/ipv4/tcp.c:2716
inet_shutdown+0x33f/0x5b0 net/ipv4/af_inet.c:923
__sys_shutdown_sock net/socket.c:2425 [inline]
__sys_shutdown net/socket.c:2437 [inline]
__do_sys_shutdown net/socket.c:2445 [inline]
__se_sys_shutdown+0x2a4/0x440 net/socket.c:2443
__x64_sys_shutdown+0x6c/0xa0 net/socket.c:2443
do_syscall_64+0xd5/0x1f0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
Uninit was stored to memory at:
reqsk_alloc include/net/request_sock.h:148 [inline]
inet_reqsk_alloc+0x651/0x7a0 net/ipv4/tcp_input.c:6978
cookie_tcp_reqsk_alloc+0xd4/0x900 net/ipv4/syncookies.c:328
cookie_tcp_check net/ipv4/syncookies.c:388 [inline]
cookie_v4_check+0x289f/0x29e0 net/ipv4/syncookies.c:420
tcp_v4_cookie_check net/ipv4/tcp_ipv4.c:1855 [inline]
tcp_v4_do_rcv+0xb17/0x10b0 net/ipv4/tcp_ipv4.c:1914
tcp_v4_rcv+0x4ce4/0x5420 net/ipv4/tcp_ipv4.c:2322
ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x332/0x500 net/ipv4/ip_input.c:233
NF_HOOK include/linux/netfilter.h:314 [inline]
ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254
dst_input include/net/dst.h:460 [inline]
ip_rcv_finish+0x4a2/0x520 net/ipv4/ip_input.c:449
NF_HOOK include/linux/netfilter.h:314 [inline]
ip_rcv+0xcd/0x380 net/ipv4/ip_input.c:569
__netif_receive_skb_one_core net/core/dev.c:5538 [inline]
__netif_receive_skb+0x319/0x9e0 net/core/dev.c:5652
process_backlog+0x480/0x8b0 net/core/dev.c:5981
__napi_poll+0xe7/0x980 net/core/dev.c:6632
napi_poll net/core/dev.c:6701 [inline]
net_rx_action+0x89d/0x1820 net/core/dev.c:6813
__do_softirq+0x1c0/0x7d7 kernel/softirq.c:554
Uninit was created at:
__alloc_pages+0x9a7/0xe00 mm/page_alloc.c:4592
__alloc_pages_node include/linux/gfp.h:238 [inline]
alloc_pages_node include/linux/gfp.h:261 [inline]
alloc_slab_page mm/slub.c:2175 [inline]
allocate_slab mm/slub.c:2338 [inline]
new_slab+0x2de/0x1400 mm/slub.c:2391
___slab_alloc+0x1184/0x33d0 mm/slub.c:3525
__slab_alloc mm/slub.c:3610 [inline]
__slab_alloc_node mm/slub.c:3663 [inline]
slab_alloc_node mm/slub.c:3835 [inline]
kmem_cache_alloc+0x6d3/0xbe0 mm/slub.c:3852
reqsk_alloc include/net/request_sock.h:131 [inline]
inet_reqsk_alloc+0x66/0x7a0 net/ipv4/tcp_input.c:6978
tcp_conn_request+0x484/0x44e0 net/ipv4/tcp_input.c:7135
tcp_v4_conn_request+0x16f/0x1d0 net/ipv4/tcp_ipv4.c:1716
tcp_rcv_state_process+0x2e5/0x4bb0 net/ipv4/tcp_input.c:6655
tcp_v4_do_rcv+0xbfd/0x10b0 net/ipv4/tcp_ipv4.c:1929
tcp_v4_rcv+0x4ce4/0x5420 net/ipv4/tcp_ipv4.c:2322
ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x332/0x500 net/ipv4/ip_input.c:233
NF_HOOK include/linux/netfilter.h:314 [inline]
ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254
dst_input include/net/dst.h:460 [inline]
ip_sublist_rcv_finish net/ipv4/ip_input.c:580 [inline]
ip_list_rcv_finish net/ipv4/ip_input.c:631 [inline]
ip_sublist_rcv+0x15f3/0x17f0 net/ipv4/ip_input.c:639
ip_list_rcv+0x9ef/0xa40 net/ipv4/ip_input.c:674
__netif_receive_skb_list_ptype net/core/dev.c:5581 [inline]
__netif_receive_skb_list_core+0x15c5/0x1670 net/core/dev.c:5629
__netif_receive_skb_list net/core/dev.c:5681 [inline]
netif_receive_skb_list_internal+0x106c/0x16f0 net/core/dev.c:5773
gro_normal_list include/net/gro.h:438 [inline]
napi_complete_done+0x425/0x880 net/core/dev.c:6113
virtqueue_napi_complete drivers/net/virtio_net.c:465 [inline]
virtnet_poll+0x149d/0x2240 drivers/net/virtio_net.c:2211
__napi_poll+0xe7/0x980 net/core/dev.c:6632
napi_poll net/core/dev.c:6701 [inline]
net_rx_action+0x89d/0x1820 net/core/dev.c:6813
__do_softirq+0x1c0/0x7d7 kernel/softirq.c:554
CPU: 0 PID: 16792 Comm: syz-executor.2 Not tainted 6.8.0-syzkaller-05562-g61387b8dcf1d #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
Fixes: 695751e31a63 ("bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/bpf/CANn89iKdN9c+C_2JAUbc+VY3DDQjAQukMtiBbormAmAk9CdvQA@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240315224710.55209-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Thinh Tran [Fri, 15 Mar 2024 20:55:35 +0000 (15:55 -0500)]
net/bnx2x: Prevent access to a freed page in page_pool
Fix a race condition leading to a system crash during EEH error handling.
During EEH error recovery, the bnx2x driver's transmit timeout logic
could cause a race condition when handling reset tasks. The
bnx2x_tx_timeout() function schedules reset tasks via
bnx2x_sp_rtnl_task(), which ultimately leads to bnx2x_nic_unload(). In
bnx2x_nic_unload(), SGEs are freed using bnx2x_free_rx_sge_range().
However, this could overlap with the EEH driver's attempt to reset the
device using bnx2x_io_slot_reset(), which also tries to free SGEs. This
race condition can result in system crashes due to accessing freed
memory locations in bnx2x_free_rx_sge():
static inline void bnx2x_free_rx_sge(struct bnx2x *bp,
                                     struct bnx2x_fastpath *fp, u16 index)
{
        struct sw_rx_page *sw_buf = &fp->rx_page_ring[index];
        struct page *page = sw_buf->page;
....
where sw_buf was set to NULL after the call to dma_unmap_page()
by the preceding thread.
EEH: Beginning: 'slot_reset'
PCI 0011:01:00.0#10000: EEH: Invoking bnx2x->slot_reset()
bnx2x: [bnx2x_io_slot_reset:14228(eth1)]IO slot reset initializing...
bnx2x 0011:01:00.0: enabling device (0140 -> 0142)
bnx2x: [bnx2x_io_slot_reset:14244(eth1)]IO slot reset --> driver unload
Kernel attempted to read user page (0) - exploit attempt? (uid: 0)
BUG: Kernel NULL pointer dereference on read at 0x00000000
Faulting instruction address: 0xc0080000025065fc
Oops: Kernel access of bad area, sig: 11 [#1]
.....
Call Trace:
[c000000003c67a20] [c00800000250658c] bnx2x_io_slot_reset+0x204/0x610 [bnx2x] (unreliable)
[c000000003c67af0] [c0000000000518a8] eeh_report_reset+0xb8/0xf0
[c000000003c67b60] [c000000000052130] eeh_pe_report+0x180/0x550
[c000000003c67c70] [c00000000005318c] eeh_handle_normal_event+0x84c/0xa60
[c000000003c67d50] [c000000000053a84] eeh_event_handler+0xf4/0x170
[c000000003c67da0] [c000000000194c58] kthread+0x1c8/0x1d0
[c000000003c67e10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64
To solve this issue, we need to verify page pool allocations before
freeing.
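The defensive check can be sketched in single-threaded userspace C. The names (struct sw_rx_page_sketch, free_rx_sge_sketch()) are hypothetical and the sketch uses malloc()/free() in place of page_pool; it only shows that a free routine which verifies the page pointer, and clears it after freeing, tolerates an entry that another path already released. It does not model any locking a concurrent fix may also need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical reduced ring entry; the real driver's sw_rx_page and
 * page_pool handling are more involved. */
struct sw_rx_page_sketch {
	void *page;
};

static int pages_freed;

/* Verify the page before freeing: if another path already released
 * this entry and cleared the pointer, there is nothing to do. */
static void free_rx_sge_sketch(struct sw_rx_page_sketch *sw_buf)
{
	void *page = sw_buf->page;

	if (!page)
		return;
	sw_buf->page = NULL;
	free(page);
	pages_freed++;
}
```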
Fixes: 4cace675d687 ("bnx2x: Alloc 4k fragment for each rx ring buffer element")
Signed-off-by: Thinh Tran <thinhtr@linux.ibm.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240315205535.1321-1-thinhtr@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nikita Kiryushin [Fri, 15 Mar 2024 17:50:52 +0000 (20:50 +0300)]
net: phy: fix phy_read_poll_timeout argument type in genphy_loopback
read_poll_timeout inside phy_read_poll_timeout can set val to a negative
value in some cases (for example, __mdiobus_read inside phy_read can
return -EOPNOTSUPP).
Supposedly, commit 4ec732951702 ("net: phylib: fix
phy_read*_poll_timeout()") should fix problems with wrong-signed vals,
but I do not see how, as val is passed to phy_read as is and
__val = phy_read (not val) is checked for sign.
Change the val type to signed to allow better error handling, as done in
other phy_read_poll_timeout callers. This will not fix any error
handling by itself, but it allows, for example, modifying cond with an
appropriate sign check or checking the resulting val separately.
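Why the type matters can be shown in plain C. phy_read_sketch() is a hypothetical stand-in for phy_read() that returns either a 16-bit register value or a negative errno. Stored into a u16, the error silently becomes a large positive "register value"; stored into an int, the sign survives and a caller can test val < 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for phy_read(): a register value in
 * [0, 0xffff] on success, or a negative errno on failure. */
static int phy_read_sketch(int fail)
{
	return fail ? -EOPNOTSUPP : 0x1234;
}

/* Unsigned storage: -EOPNOTSUPP wraps to 65536 - EOPNOTSUPP. */
static uint16_t read_into_u16(int fail)
{
	uint16_t val = (uint16_t)phy_read_sketch(fail);
	return val;
}

/* Signed storage: the error stays negative and is detectable. */
static int read_into_int(int fail)
{
	int val = phy_read_sketch(fail);
	return val;
}
```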
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 014068dcb5b1 ("net: phy: genphy_loopback: add link speed configuration")
Signed-off-by: Nikita Kiryushin <kiryushin@ancud.ru>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20240315175052.8049-1-kiryushin@ancud.ru
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Michal Koutný [Fri, 15 Mar 2024 16:02:10 +0000 (17:02 +0100)]
net/sched: Add module alias for sch_fq_pie
Commit 2c15a5aee2f3 ("net/sched: Load modules via their alias") started
loading modules via aliases rather than canonical names. The new aliases
were added in commit 241a94abcf46 ("net/sched: Add module aliases for
cls_,sch_,act_ modules") via a Coccinelle script.
sch_fq_pie.c is missing the module.h header, so Coccinelle did not patch
it. Add the include and module alias manually so that autoloading works
for sch_fq_pie too.
(Note: the commit message of commit 241a94abcf46 ("net/sched: Add module
aliases for cls_,sch_,act_ modules") was mangled due to '#'
misinterpretation. The predicate haskernel is:
| @ haskernel @
| @@
|
| #include <linux/module.h>
|
.)
Fixes: 241a94abcf46 ("net/sched: Add module aliases for cls_,sch_,act_ modules")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Link: https://lore.kernel.org/r/20240315160210.8379-1-mkoutny@suse.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tobias Brunner [Fri, 15 Mar 2024 14:35:40 +0000 (15:35 +0100)]
ipv4: raw: Fix sending packets from raw sockets via IPsec tunnels
Since the referenced commit, xfrm_inner_extract_output() uses the
protocol field to determine the address family. Not setting that field
for IPv4 raw sockets meant that such packets could no longer be tunneled
via IPsec.
IPv6 raw sockets are not affected, as they have set the protocol since
commit 9c9c9ad5fae7 ("ipv6: set skb->protocol on tcp, raw and
ip6_append_data genereated skbs").
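The failure mode can be sketched in userspace C, assuming a hypothetical reduced skb with only a network-byte-order protocol field (struct skb_sketch, family_from_protocol_sketch() are illustrative, not the xfrm code). If the family is derived from the protocol field, a packet that never had the field set maps to AF_UNSPEC instead of AF_INET.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <net/ethernet.h>
#include <stdint.h>
#include <sys/socket.h>

/* Hypothetical reduced skb: only the field the output path inspects. */
struct skb_sketch {
	uint16_t protocol;   /* network byte order, like skb->protocol */
};

/* Derive the inner address family from the protocol field; an unset
 * field (0) yields AF_UNSPEC, so the packet cannot be tunneled. */
static int family_from_protocol_sketch(const struct skb_sketch *skb)
{
	switch (ntohs(skb->protocol)) {
	case ETHERTYPE_IP:
		return AF_INET;
	case ETHERTYPE_IPV6:
		return AF_INET6;
	default:
		return AF_UNSPEC;
	}
}
```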
Fixes: f4796398f21b ("xfrm: Remove inner/outer modes from output path")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/c5d9a947-eb19-4164-ac99-468ea814ce20@strongswan.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Felix Maurer [Fri, 15 Mar 2024 12:04:52 +0000 (13:04 +0100)]
hsr: Handle failures in module init
A failure during registration of the netdev notifier was not handled at
all. A failure during netlink initialization did not unregister the netdev
notifier.
Handle failures of netdev notifier registration and netlink initialization.
Both functions should only return negative values on failure and thereby
lead to the hsr module not being loaded.
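The usual goto-based unwinding for this kind of module init can be sketched in userspace C. register_notifier_sketch(), netlink_init_sketch() and hsr_init_sketch() are hypothetical stand-ins with injectable failures, not the hsr module's code. Each failure undoes the steps that already succeeded and returns the negative errno, which is what keeps a module from loading.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two registration steps. */
static bool notifier_registered;
static bool fail_notifier, fail_netlink;

static int register_notifier_sketch(void)
{
	if (fail_notifier)
		return -ENOMEM;
	notifier_registered = true;
	return 0;
}

static void unregister_notifier_sketch(void)
{
	notifier_registered = false;
}

static int netlink_init_sketch(void)
{
	return fail_netlink ? -ENOMEM : 0;
}

/* Init with unwinding: undo earlier steps on failure and propagate
 * the negative errno to the caller. */
static int hsr_init_sketch(void)
{
	int err;

	err = register_notifier_sketch();
	if (err)
		return err;

	err = netlink_init_sketch();
	if (err)
		goto err_netlink;

	return 0;

err_netlink:
	unregister_notifier_sketch();
	return err;
}
```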
Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Reviewed-by: Shigeru Yoshida <syoshida@redhat.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/3ce097c15e3f7ace98fc7fd9bcbf299f092e63d1.1710504184.git.fmaurer@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Yewon Choi [Fri, 15 Mar 2024 09:28:38 +0000 (18:28 +0900)]
rds: introduce acquire/release ordering in acquire/release_in_xmit()
acquire/release_in_xmit() work as a bit lock in rds_send_xmit(), so
they are expected to provide acquire/release memory ordering semantics.
However, test_and_set_bit()/clear_bit() do not imply such semantics. On
top of this, the following smp_mb__after_atomic() does not guarantee
release ordering (the memory barrier would actually have to be placed
before clear_bit()).
Instead, use clear_bit_unlock()/test_and_set_bit_lock() here.
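For readers outside the kernel, the same acquire/release bit-lock pattern can be sketched with C11 atomics as a userspace analogue; this is not the kernel bitops API. acquire_in_xmit_sketch() behaves like test_and_set_bit_lock() (but returns true when the lock was taken), and release_in_xmit_sketch() behaves like clear_bit_unlock(). The struct and bit names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit 0 of flags acts as the "in_xmit" lock in this sketch. */
#define IN_XMIT_BIT 1UL

struct conn_sketch {
	atomic_ulong flags;
};

/* Acquire ordering: accesses in the critical section cannot move
 * before the bit is taken. Returns true if we took the lock. */
static bool acquire_in_xmit_sketch(struct conn_sketch *cp)
{
	unsigned long old = atomic_fetch_or_explicit(&cp->flags, IN_XMIT_BIT,
						     memory_order_acquire);
	return !(old & IN_XMIT_BIT);
}

/* Release ordering: accesses in the critical section cannot move
 * after the bit is dropped. No separate barrier is needed. */
static void release_in_xmit_sketch(struct conn_sketch *cp)
{
	atomic_fetch_and_explicit(&cp->flags, ~IN_XMIT_BIT,
				  memory_order_release);
}
```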
Fixes: 0f4b1c7e89e6 ("rds: fix rds_send_xmit() serialization")
Fixes: 1f9ecd7eacfd ("RDS: Pass rds_conn_path to rds_send_xmit()")
Signed-off-by: Yewon Choi <woni9911@gmail.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://lore.kernel.org/r/ZfQUxnNTO9AJmzwc@libra05
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Fri, 15 Mar 2024 00:21:08 +0000 (17:21 -0700)]
tools: ynl: add header guards for nlctrl
I "extracted" YNL C into a GitHub repo to make it easier
to use in other projects: https://github.com/linux-netdev/ynl-c
GitHub Actions use Ubuntu by default, and the kernel headers there are
missing commit f329a0ebeaba ("genetlink: correct uAPI defines").
Add the direct include workaround for nlctrl.
Fixes: 768e044a5fd4 ("doc/netlink/specs: Add spec for nlctrl netlink family")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240315002108.523232-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Tue, 19 Mar 2024 10:22:54 +0000 (11:22 +0100)]
Merge branch 'wireguard-fixes-for-6-9-rc1'
Jason A. Donenfeld says:
====================
wireguard fixes for 6.9-rc1
This series has four WireGuard fixes:
1) Annotate a data race that KCSAN found by using READ_ONCE/WRITE_ONCE,
which has been causing syzkaller noise.
2) Use the generic netdev tstats allocation and stats getters instead of
doing this within the driver.
3) Explicitly check a flag variable instead of an empty list in the
netlink code, to prevent a UaF situation when paging through GET
results during a remove-all SET operation.
4) Set a flag in the RISC-V CI config so the selftests continue to boot.
====================
Link: https://lore.kernel.org/r/20240314224911.6653-1-Jason@zx2c4.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jason A. Donenfeld [Thu, 14 Mar 2024 22:49:11 +0000 (16:49 -0600)]
wireguard: selftests: set RISCV_ISA_FALLBACK on riscv{32,64}
This option is needed to continue booting with QEMU. Recent changes that
made this optional meant that it gets unset in the test harness, and so
WireGuard CI has been broken. Fix this by simply setting this option.
Cc: stable@vger.kernel.org
Fixes: 496ea826d1e1 ("RISC-V: provide Kconfig & commandline options to control parsing "riscv,isa"")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jason A. Donenfeld [Thu, 14 Mar 2024 22:49:10 +0000 (16:49 -0600)]
wireguard: netlink: access device through ctx instead of peer
The previous commit fixed a bug that led to a NULL peer->device being
dereferenced. It's actually easier and faster performance-wise to
instead get the device from ctx->wg. This semantically makes more sense
too, since ctx->wg->peer_allowedips.seq is compared with
ctx->allowedips_seq, basing them both in ctx. This also acts as a
defence in depth provision against freed peers.
Cc: stable@vger.kernel.org
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jason A. Donenfeld [Thu, 14 Mar 2024 22:49:09 +0000 (16:49 -0600)]
wireguard: netlink: check for dangling peer via is_dead instead of empty list
If all peers are removed via wg_peer_remove_all(), rather than setting
peer_list to empty, the peer is added to a temporary list with a head on
the stack of wg_peer_remove_all(). If a netlink dump is resumed and the
cursored peer is one that has been removed via wg_peer_remove_all(), it
will iterate from that peer and then attempt to dump freed peers.
Fix this by checking peer->is_dead instead, which was explicitly created
for this purpose. Also move up the device_update_lock lockdep assertion,
since reading is_dead relies on that lock.
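A minimal userspace C sketch of the resume decision, assuming hypothetical reduced types (struct peer_sketch, struct dump_ctx_sketch, resume_point_sketch() are illustrative, not WireGuard's). A removed peer parked on a temporary list can still look linked, so the cursor is validated through is_dead, which the real code reads under device_update_lock. In this sketch the dump simply ends when the cursor peer is dead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced peer and dump cursor. */
struct peer_sketch {
	bool is_dead;                /* set when the peer is removed */
	struct peer_sketch *next;
};

struct dump_ctx_sketch {
	struct peer_sketch *next_peer;   /* where a resumed dump continues */
};

/* List emptiness says nothing about a peer moved to a temporary list;
 * is_dead does. Returns NULL when the dump must not continue. */
static struct peer_sketch *resume_point_sketch(const struct dump_ctx_sketch *ctx)
{
	if (!ctx->next_peer || ctx->next_peer->is_dead)
		return NULL;
	return ctx->next_peer;
}
```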
It can be reproduced by a small script like:
echo "Setting config..."
ip link add dev wg0 type wireguard
wg setconf wg0 /big-config
(
while true; do
echo "Showing config..."
wg showconf wg0 > /dev/null
done
) &
sleep 4
wg setconf wg0 <(printf "[Peer]\nPublicKey=$(wg genkey)\n")
Resulting in:
BUG: KASAN: slab-use-after-free in __lock_acquire+0x182a/0x1b20
Read of size 8 at addr ffff88811956ec70 by task wg/59
CPU: 2 PID: 59 Comm: wg Not tainted 6.8.0-rc2-debug+ #5
Call Trace:
<TASK>
dump_stack_lvl+0x47/0x70
print_address_description.constprop.0+0x2c/0x380
print_report+0xab/0x250
kasan_report+0xba/0xf0
__lock_acquire+0x182a/0x1b20
lock_acquire+0x191/0x4b0
down_read+0x80/0x440
get_peer+0x140/0xcb0
wg_get_device_dump+0x471/0x1130
Cc: stable@vger.kernel.org
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Reported-by: Lillian Berry <lillian@star-ark.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Breno Leitao [Thu, 14 Mar 2024 22:49:08 +0000 (16:49 -0600)]
wireguard: device: remove generic .ndo_get_stats64
Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to the net core,
so unless the driver does some custom stats collection, it does not
need to set .ndo_get_stats64.
Since this driver now relies on NETDEV_PCPU_STAT_TSTATS, it doesn't need
to set the generic dev_get_tstats64() as its .ndo_get_stats64 function
pointer.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Breno Leitao [Thu, 14 Mar 2024 22:49:07 +0000 (16:49 -0600)]
wireguard: device: leverage core stats allocator
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core
and convert veth & vrf"), stats allocation can be done in the net core
instead of in this driver.
With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure the free happens in
the right spot, etc.). This is now the core's responsibility.
Remove the allocation in this driver and leverage the network core
allocation instead.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Nikita Zhandarovich [Thu, 14 Mar 2024 22:49:06 +0000 (16:49 -0600)]
wireguard: receive: annotate data-race around receiving_counter.counter
Syzkaller with KCSAN identified a data-race issue when accessing
keypair->receiving_counter.counter. Use READ_ONCE() and WRITE_ONCE()
annotations to mark the data race as intentional.
BUG: KCSAN: data-race in wg_packet_decrypt_worker / wg_packet_rx_poll
write to 0xffff888107765888 of 8 bytes by interrupt on cpu 0:
counter_validate drivers/net/wireguard/receive.c:321 [inline]
wg_packet_rx_poll+0x3ac/0xf00 drivers/net/wireguard/receive.c:461
__napi_poll+0x60/0x3b0 net/core/dev.c:6536
napi_poll net/core/dev.c:6605 [inline]
net_rx_action+0x32b/0x750 net/core/dev.c:6738
__do_softirq+0xc4/0x279 kernel/softirq.c:553
do_softirq+0x5e/0x90 kernel/softirq.c:454
__local_bh_enable_ip+0x64/0x70 kernel/softirq.c:381
__raw_spin_unlock_bh include/linux/spinlock_api_smp.h:167 [inline]
_raw_spin_unlock_bh+0x36/0x40 kernel/locking/spinlock.c:210
spin_unlock_bh include/linux/spinlock.h:396 [inline]
ptr_ring_consume_bh include/linux/ptr_ring.h:367 [inline]
wg_packet_decrypt_worker+0x6c5/0x700 drivers/net/wireguard/receive.c:499
process_one_work kernel/workqueue.c:2633 [inline]
...
read to 0xffff888107765888 of 8 bytes by task 3196 on cpu 1:
decrypt_packet drivers/net/wireguard/receive.c:252 [inline]
wg_packet_decrypt_worker+0x220/0x700 drivers/net/wireguard/receive.c:501
process_one_work kernel/workqueue.c:2633 [inline]
process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2706
worker_thread+0x525/0x730 kernel/workqueue.c:2787
...
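READ_ONCE()/WRITE_ONCE() are kernel macros; the closest portable analogue in userspace C11 is a relaxed atomic load/store, sketched below with a hypothetical counter struct. Like the annotations, the relaxed accesses stop the compiler from splitting or re-fetching the access and mark the race as intentional, but they impose no ordering on other memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counter written in one context and read locklessly
 * in another, as in the KCSAN report. */
struct rx_counter_sketch {
	_Atomic uint64_t counter;
};

/* Analogue of WRITE_ONCE(): a single store, no ordering implied. */
static void counter_store_sketch(struct rx_counter_sketch *c, uint64_t v)
{
	atomic_store_explicit(&c->counter, v, memory_order_relaxed);
}

/* Analogue of READ_ONCE(): a single load, no ordering implied. */
static uint64_t counter_load_sketch(struct rx_counter_sketch *c)
{
	return atomic_load_explicit(&c->counter, memory_order_relaxed);
}
```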
Fixes: a9e90d9931f3 ("wireguard: noise: separate receive counter from send counter")
Reported-by: syzbot+d1de830e4ecdaac83d89@syzkaller.appspotmail.com
Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Eric Dumazet [Thu, 14 Mar 2024 20:08:45 +0000 (20:08 +0000)]
net: move dev->state into net_device_read_txrx group
dev->state can be read in rx and tx fast paths.
netif_running() which needs dev->state is called from
- enqueue_to_backlog() [RX path]
- __dev_direct_xmit() [TX path]
Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240314200845.3050179-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Abhishek Chauhan [Thu, 14 Mar 2024 19:24:04 +0000 (12:24 -0700)]
Revert "net: Re-use and set mono_delivery_time bit for userspace tstamp packets"
This reverts commit 885c36e59f46375c138de18ff1692f18eff67b7f.
The patch broke the bpf selftest test_tc_dtime because the uapi field
__sk_buff->tstamp_type depends on skb->mono_delivery_time, which, with
the original fix, no longer necessarily means mono: the bit was re-used
for userspace timestamps as well, to avoid a tstamp reset in the
forwarding path. To solve this, we need to keep mono_delivery_time as
is, introduce another bit called user_delivery_time, and fall back to
the initial proposal of setting the user_delivery_time bit based on
sk_clockid set from userspace.
Fixes: 885c36e59f46 ("net: Re-use and set mono_delivery_time bit for userspace tstamp packets")
Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arınç ÜNAL [Thu, 14 Mar 2024 09:28:35 +0000 (12:28 +0300)]
net: dsa: mt7530: prevent possible incorrect XTAL frequency selection
On MT7530, the HT_XTAL_FSEL field of the HWTRAP register stores a 2-bit
value that represents the frequency of the crystal oscillator connected to
the switch IC. The field is populated by the state of the ESW_P4_LED_0 and
ESW_P3_LED_0 pins, which is done right after reset is deasserted.
ESW_P4_LED_0  ESW_P3_LED_0  Frequency
-------------------------------------
0             0             Reserved
0             1             20MHz
1             0             40MHz
1             1             25MHz
On MT7531, the XTAL25 bit of the STRAP register stores this. The LAN0LED0
pin is used to populate the bit. 25MHz when the pin is high, 40MHz when
it's low.
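As a reading aid, the two strap encodings above can be written as small C
helpers. This is a hypothetical userspace sketch, not driver code: the
function names are invented, and ESW_P4_LED_0 is assumed to be the high bit
of HT_XTAL_FSEL, which matches the 0b11 = 25MHz and 0b10 = 40MHz readings
quoted in this message.

```c
#include <stdbool.h>

/* Hypothetical helper: decode the MT7530 HT_XTAL_FSEL strap.
 * Assumption: ESW_P4_LED_0 is bit 1 and ESW_P3_LED_0 is bit 0.
 * Returns the crystal frequency in MHz, or 0 for the reserved value. */
static unsigned int model_mt7530_xtal_mhz(bool p4_led0, bool p3_led0)
{
	unsigned int fsel = ((unsigned int)p4_led0 << 1) | (unsigned int)p3_led0;

	switch (fsel) {
	case 0x1:
		return 20;
	case 0x2:
		return 40;
	case 0x3:
		return 25;
	default:
		return 0; /* 0b00 is reserved */
	}
}

/* Hypothetical helper: decode the MT7531 XTAL25 strap, which follows
 * the LAN0LED0 pin: high selects 25MHz, low selects 40MHz. */
static unsigned int model_mt7531_xtal_mhz(bool lan0led0)
{
	return lan0led0 ? 25 : 40;
}
```

With a link already up on port 3, both MT7530 pins can read high when the
strap is sampled, so model_mt7530_xtal_mhz(true, true) returns 25 although
a 40MHz crystal is fitted; the bootstrapped levels give 40.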
These pins are also used to drive LEDs; therefore, their state can be set to
something other than the bootstrapping configuration. For example, a link
may be established on port 3 before the DSA subdriver takes control of the
switch, which would set ESW_P3_LED_0 high.
Currently, mt7530_setup() and mt7531_setup() use a delay of 1000 - 1100 usec
between reset assertion and deassertion. Under real-life conditions, some
switch ICs cannot always get these pins back to the bootstrapping
configuration within this delay. This causes the wrong crystal frequency to
be selected, which leaves the switch in a nonfunctional state after reset
deassertion.
The tests below were conducted by Justin Swartz on an MT7530 with a 40MHz
crystal oscillator.
With a cable from an active peer connected to port 3 before reset, an
incorrect crystal frequency (0b11 = 25MHz) is selected:
(Timing diagram of ESW_P4_LED_0 and ESW_P3_LED_0, marked [1] to [5] as
described below; its ASCII alignment was lost. Both pins are still high
when the strap is sampled, giving 0b11.)
[1] Reset is asserted.
[2] Period of 1000 - 1100 usec.
[3] Reset is deasserted.
[4] Period of 315 usec. HWTRAP register is populated with incorrect
XTAL frequency.
[5] Signals reflect the bootstrapped configuration.
Increase the delay between reset_control_assert() and
reset_control_deassert(), and between gpiod_set_value_cansleep(priv->reset, 0)
and gpiod_set_value_cansleep(priv->reset, 1), to 5000 - 5100 usec. This
makes it more likely that the switch IC has these pins back in the
bootstrapping configuration before reset deassertion.
With a cable from an active peer connected to port 3 before reset, the
correct crystal frequency (0b10 = 40MHz) is selected:
(Timing diagram of ESW_P4_LED_0 and ESW_P3_LED_0, marked [1] to [5] as
described below; its ASCII alignment was lost. ESW_P3_LED_0 goes low at
[2-1], while reset is still asserted, so the strap is sampled as 0b10.)
[1] Reset is asserted.
[2] Period of 5000 - 5100 usec.
[2-1] ESW_P3_LED_0 goes low.
[2-2] Remaining period of 5000 - 5100 usec.
[3] Reset is deasserted.
[4] Period of 310 usec. HWTRAP register is populated with bootstrapped
XTAL frequency.
[5] Signals reflect the bootstrapped configuration.
ESW_P3_LED_0 low period before reset deassertion:
         5000 - 5100 usec
TEST     RESET HOLD
#        (usec)
------------------------
1        5410
2        5440
3        4375
4        5490
5        5475
6        4335
7        4370
8        5435
9        4205
10       4335
11       3750
12       3170
13       4395
14       4375
15       3515
16       4335
17       4220
18       4175
19       4175
20       4350
Min      3170
Max      5490
Median   4342.500
Avg      4466.500
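The summary rows can be recomputed from the 20 samples. The C sketch below
(helper names are illustrative) sorts a copy of the data; with an even
sample count the median is the mean of the 10th and 11th sorted values,
which gives the 4342.500 shown, and the mean works out to
89330 / 20 = 4466.500.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* ESW_P3_LED_0 low period (usec) for tests 1-20, copied from the table. */
static const int low_period_usec[] = {
	5410, 5440, 4375, 5490, 5475, 4335, 4370, 5435, 4205, 4335,
	3750, 3170, 4395, 4375, 3515, 4335, 4220, 4175, 4175, 4350,
};

struct summary {
	int min;
	int max;
	double median;
	double avg;
};

static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;

	return (x > y) - (x < y);
}

/* Min/max/median/average of n samples; requires 0 < n <= 64. */
static struct summary summarize(const int *samples, size_t n)
{
	int sorted[64];
	struct summary s;
	double sum = 0;

	memcpy(sorted, samples, n * sizeof(*samples));
	qsort(sorted, n, sizeof(*sorted), cmp_int);
	for (size_t i = 0; i < n; i++)
		sum += sorted[i];

	s.min = sorted[0];
	s.max = sorted[n - 1];
	/* Even n: average the two middle values. */
	s.median = (n % 2) ? sorted[n / 2]
			   : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
	s.avg = sum / n;
	return s;
}
```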
Revert commit 2920dd92b980 ("net: dsa: mt7530: disable LEDs before reset").
Changing the state of pins via reset assertion is simpler and more
efficient than doing so by setting the LED controller off.
Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
Fixes: c288575f7810 ("net: dsa: mt7530: Add the support of MT7531 switch")
Co-developed-by: Justin Swartz <justin.swartz@risingedge.co.za>
Signed-off-by: Justin Swartz <justin.swartz@risingedge.co.za>
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 18 Mar 2024 12:25:52 +0000 (12:25 +0000)]
Merge branch 'veth-xdp-gro'
Ignat Korchagin says:
====================
net: veth: ability to toggle GRO and XDP independently
It is rather confusing that GRO is automatically enabled when an XDP program
is attached to a veth interface. Moreover, it is not possible to disable GRO
on a veth if an XDP program is attached (which might be desirable in some use
cases).
Make GRO and XDP independent for a veth interface.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ignat Korchagin [Wed, 13 Mar 2024 18:37:59 +0000 (19:37 +0100)]
selftests: net: veth: test the ability to independently manipulate GRO and XDP
We should be able to flip either the XDP or the GRO state independently, and
toggling one should not affect the other.
Also adjust other tests that implicitly expected GRO to be enabled
automatically.
Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ignat Korchagin [Wed, 13 Mar 2024 18:37:58 +0000 (19:37 +0100)]
net: veth: do not manipulate GRO when using XDP
Commit d3256efd8e8b ("veth: allow enabling NAPI even without XDP") tried to
fix the fact that GRO was not possible without XDP, because veth did not use
NAPI without XDP. However, it also introduced the behaviour that GRO is always
enabled when XDP is enabled.
While this might be desired in most cases, it is confusing for the user at
best, as the GRO flag suddenly changes when an XDP program is attached. It
also introduces some complexity in state management, as partially addressed
in commit fe9f801355f0 ("net: veth: clear GRO when clearing XDP even when
down").
But the biggest problem is that it is not possible to disable GRO at all when
an XDP program is attached, which might be needed for some use cases.
Fix this by not touching the GRO flag on XDP enable/disable, as the code
already supports switching to NAPI if either GRO or XDP is requested.
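The resulting behaviour can be modelled as two independent flags, with NAPI
in use whenever either one is set. This is an illustrative sketch;
struct veth_model and its helpers are invented for the example and do not
reflect the driver's data structures.

```c
#include <stdbool.h>

/* Hypothetical model of the veth state after this change. */
struct veth_model {
	bool gro; /* GRO requested by the user */
	bool xdp; /* an XDP program is attached */
};

/* NAPI is needed if either GRO or XDP is requested. */
static bool veth_model_use_napi(const struct veth_model *v)
{
	return v->gro || v->xdp;
}

/* Attaching or detaching XDP no longer touches the GRO flag. */
static void veth_model_set_xdp(struct veth_model *v, bool attached)
{
	v->xdp = attached;
}

/* Toggling GRO leaves the XDP state alone. */
static void veth_model_set_gro(struct veth_model *v, bool enabled)
{
	v->gro = enabled;
}
```

For example, attaching XDP to a device with GRO off leaves gro == false while
veth_model_use_napi() still returns true.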
Link: https://lore.kernel.org/lkml/20240311124015.38106-1-ignat@cloudflare.com/
Fixes: d3256efd8e8b ("veth: allow enabling NAPI even without XDP")
Fixes: fe9f801355f0 ("net: veth: clear GRO when clearing XDP even when down")
Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Leon Romanovsky [Tue, 12 Mar 2024 11:55:22 +0000 (13:55 +0200)]
xfrm: Allow UDP encapsulation only in offload modes
The missing check of x->encap led to GSO packets being created with UDP
encapsulation.
Fix this by restoring the encap check for non-offloaded SAs.
Fixes: 983a73da1f99 ("xfrm: Pass UDP encapsulation in TX packet offload")
Closes: https://lore.kernel.org/all/a650221ae500f0c7cf496c61c96c1b103dcb6f67.camel@redhat.com
Reported-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Dragos Tatulea [Fri, 8 Mar 2024 15:26:00 +0000 (17:26 +0200)]
net: esp: fix bad handling of pages from page_pool
When the skb is reorganized during esp_output (!esp->inline), the pages
coming from the original skb fragments are supposed to be released back
to the system through put_page. But if the skb fragment pages originate
from a page_pool, calling put_page on them will trigger a page_pool leak,
which will eventually result in a crash.
This leak can be easily observed when using CONFIG_DEBUG_VM and doing
ipsec + gre (non offloaded) forwarding:
BUG: Bad page state in process ksoftirqd/16 pfn:1451b6
page:00000000de2b8d32 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1451b6000 pfn:0x1451b6
flags: 0x200000000000000(node=0|zone=2)
page_type: 0xffffffff()
raw: 0200000000000000 dead000000000040 ffff88810d23c000 0000000000000000
raw: 00000001451b6000 0000000000000001 00000000ffffffff 0000000000000000
page dumped because: page_pool leak
Modules linked in: ip_gre gre mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat xt_addrtype br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core overlay zram zsmalloc fuse [last unloaded: mlx5_core]
CPU: 16 PID: 96 Comm: ksoftirqd/16 Not tainted 6.8.0-rc4+ #22
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x36/0x50
bad_page+0x70/0xf0
free_unref_page_prepare+0x27a/0x460
free_unref_page+0x38/0x120
esp_ssg_unref.isra.0+0x15f/0x200
esp_output_tail+0x66d/0x780
esp_xmit+0x2c5/0x360
validate_xmit_xfrm+0x313/0x370
? validate_xmit_skb+0x1d/0x330
validate_xmit_skb_list+0x4c/0x70
sch_direct_xmit+0x23e/0x350
__dev_queue_xmit+0x337/0xba0
? nf_hook_slow+0x3f/0xd0
ip_finish_output2+0x25e/0x580
iptunnel_xmit+0x19b/0x240
ip_tunnel_xmit+0x5fb/0xb60
ipgre_xmit+0x14d/0x280 [ip_gre]
dev_hard_start_xmit+0xc3/0x1c0
__dev_queue_xmit+0x208/0xba0
? nf_hook_slow+0x3f/0xd0
ip_finish_output2+0x1ca/0x580
ip_sublist_rcv_finish+0x32/0x40
ip_sublist_rcv+0x1b2/0x1f0
? ip_rcv_finish_core.constprop.0+0x460/0x460
ip_list_rcv+0x103/0x130
__netif_receive_skb_list_core+0x181/0x1e0
netif_receive_skb_list_internal+0x1b3/0x2c0
napi_gro_receive+0xc8/0x200
gro_cell_poll+0x52/0x90
__napi_poll+0x25/0x1a0
net_rx_action+0x28e/0x300
__do_softirq+0xc3/0x276
? sort_range+0x20/0x20
run_ksoftirqd+0x1e/0x30
smpboot_thread_fn+0xa6/0x130
kthread+0xcd/0x100
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x31/0x50
? kthread_complete_and_exit+0x20/0x20
ret_from_fork_asm+0x11/0x20
</TASK>
The suggested fix is to introduce a new wrapper (skb_page_unref) that
covers page refcounting for page_pool pages as well.
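The idea behind the wrapper can be sketched in userspace with invented
types: a page that belongs to a page_pool, attached to an skb marked for
page_pool recycling, is handed back to the pool, and only other pages take
the plain put_page() path. This is a model of the dispatch, not the kernel
implementation of skb_page_unref().

```c
#include <stdbool.h>

/* Hypothetical page model; not the kernel's struct page. */
struct model_page {
	bool pp_owned;        /* page belongs to a page_pool */
	int refcount;         /* generic page refcount */
	int returned_to_pool; /* times handed back to the pool */
	int freed;            /* times released to the page allocator */
};

/* put_page() analogue: only knows about the page allocator. */
static void model_put_page(struct model_page *p)
{
	if (--p->refcount == 0)
		p->freed++;
}

/* Wrapper analogue: page_pool pages go back to their pool,
 * everything else takes the ordinary put_page() path. */
static void model_page_unref(struct model_page *p, bool skb_pp_recycle)
{
	if (skb_pp_recycle && p->pp_owned) {
		p->returned_to_pool++;
		return;
	}
	model_put_page(p);
}
```

Calling model_put_page() directly on a pool-owned page is the analogue of the
bug: the page reaches the allocator while the pool still accounts for it.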
Cc: stable@vger.kernel.org
Fixes: 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling")
Reported-and-tested-by: Anatoli N.Chechelnickiy <Anatoli.Chechelnickiy@m.interpipe.biz>
Reported-by: Ian Kumlien <ian.kumlien@gmail.com>
Link: https://lore.kernel.org/netdev/CAA85sZvvHtrpTQRqdaOx6gd55zPAVsqMYk_Lwh4Md5knTq7AyA@mail.gmail.com
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Eric Dumazet [Thu, 14 Mar 2024 14:18:16 +0000 (14:18 +0000)]
packet: annotate data-races around ignore_outgoing
ignore_outgoing is read locklessly from dev_queue_xmit_nit()
and packet_getsockopt().
Add appropriate READ_ONCE()/WRITE_ONCE() annotations.
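READ_ONCE() and WRITE_ONCE() are kernel macros; the closest portable
userspace analogue is a C11 relaxed atomic access, which likewise gives a
single untorn load or store without imposing ordering. The sketch below
illustrates the pattern with an invented struct; it is an analogy, not the
af_packet code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for a flag written by setsockopt() and read
 * locklessly from the transmit path. */
struct model_sock {
	atomic_bool ignore_outgoing;
};

/* Writer side: plays the role of WRITE_ONCE(). */
static void model_set_ignore_outgoing(struct model_sock *sk, bool val)
{
	atomic_store_explicit(&sk->ignore_outgoing, val, memory_order_relaxed);
}

/* Reader side: plays the role of READ_ONCE(). */
static bool model_get_ignore_outgoing(struct model_sock *sk)
{
	return atomic_load_explicit(&sk->ignore_outgoing, memory_order_relaxed);
}
```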
syzbot reported:
BUG: KCSAN: data-race in dev_queue_xmit_nit / packet_setsockopt
write to 0xffff888107804542 of 1 bytes by task 22618 on cpu 0:
packet_setsockopt+0xd83/0xfd0 net/packet/af_packet.c:4003
do_sock_setsockopt net/socket.c:2311 [inline]
__sys_setsockopt+0x1d8/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2340
do_syscall_64+0xd3/0x1d0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
read to 0xffff888107804542 of 1 bytes by task 27 on cpu 1:
dev_queue_xmit_nit+0x82/0x620 net/core/dev.c:2248
xmit_one net/core/dev.c:3527 [inline]
dev_hard_start_xmit+0xcc/0x3f0 net/core/dev.c:3547
__dev_queue_xmit+0xf24/0x1dd0 net/core/dev.c:4335
dev_queue_xmit include/linux/netdevice.h:3091 [inline]
batadv_send_skb_packet+0x264/0x300 net/batman-adv/send.c:108
batadv_send_broadcast_skb+0x24/0x30 net/batman-adv/send.c:127
batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:392 [inline]
batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:420 [inline]
batadv_iv_send_outstanding_bat_ogm_packet+0x3f0/0x4b0 net/batman-adv/bat_iv_ogm.c:1700
process_one_work kernel/workqueue.c:3254 [inline]
process_scheduled_works+0x465/0x990 kernel/workqueue.c:3335
worker_thread+0x526/0x730 kernel/workqueue.c:3416
kthread+0x1d1/0x210 kernel/kthread.c:388
ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
value changed: 0x00 -> 0x01
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 27 Comm: kworker/u8:1 Tainted: G W 6.8.0-syzkaller-08073-g480e035fc4c7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet
Fixes: fa788d986a3a ("packet: add sockopt to ignore outgoing packets")
Reported-by: syzbot+c669c1136495a2e7c31f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/CANn89i+Z7MfbkBLOv=p7KZ7=K1rKHO4P1OL5LYDCtBiyqsa9oQ@mail.gmail.com/T/#t
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herve Codina [Thu, 14 Mar 2024 12:33:46 +0000 (13:33 +0100)]
net: wan: fsl_qmc_hdlc: Fix module compilation
The fsl_qmc_hdlc driver does not compile as a module:
error: ‘qmc_hdlc_driver’ undeclared here (not in a function);
  405 | MODULE_DEVICE_TABLE(of, qmc_hdlc_driver);
      |                         ^~~~~~~~~~~~~~~
Fix the typo.
Fixes: b40f00ecd463 ("net: wan: Add support for QMC HDLC")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Closes: https://lore.kernel.org/linux-kernel/87ttl93f7i.fsf@mail.lhotse/
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Golle [Wed, 13 Mar 2024 22:50:40 +0000 (22:50 +0000)]
net: ethernet: mtk_eth_soc: fix PPE hanging issue
A patch to resolve an issue was found in MediaTek's GPL-licensed SDK:
In the mtk_ppe_stop() function, the PPE scan mode is not disabled before
disabling the PPE. This can potentially lead to a hang during the process
of disabling the PPE.
Without this patch, the PPE may experience a hang during the reboot test.
Link: https://git01.mediatek.com/plugins/gitiles/openwrt/feeds/mtk-openwrt-feeds/+/b40da332dfe763932a82f9f62a4709457a15dd6c
Fixes: ba37b7caf1ed ("net: ethernet: mtk_eth_soc: add support for initializing the PPE")
Suggested-by: Bc-bocun Chen <bc-bocun.chen@mediatek.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Golle [Wed, 13 Mar 2024 22:50:18 +0000 (22:50 +0000)]
net: mediatek: mtk_eth_soc: clear MAC_MCR_FORCE_LINK only when MAC is up
Clearing the MAC_MCR_FORCE_LINK bit, which forces the link down, too early
can leave the MAC in a broken/blocked state.
Fix this by handling this bit in the .mac_link_up and .mac_link_down
calls instead of in .mac_finish.
Fixes: b8fc9f30821e ("net: ethernet: mediatek: Add basic PHYLINK support")
Suggested-by: Mason-cw Chang <Mason-cw.Chang@mediatek.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jens Axboe [Tue, 12 Mar 2024 15:55:45 +0000 (09:55 -0600)]
net: remove {recv,send}msg_copy_msghdr() from exports
The only user of these was io_uring, and it's not using them anymore.
Make them static and remove them from the socket header file.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/1b6089d3-c1cf-464a-abd3-b0f0b6bb2523@kernel.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Duanqiang Wen [Wed, 13 Mar 2024 08:06:34 +0000 (16:06 +0800)]
net: txgbe: fix clk_name exceed MAX_DEV_ID limits
txgbe registers a clock whose name is "i2c_designware." followed by
pci_dev_id(), and clk_name is stored in clk_lookup_alloc. If the PCIe bus
number is larger than 0x39, clk_name becomes larger than 20 bytes, which
exceeds the clk_lookup_alloc MAX_DEV_ID limit. Shorten clk_name in the
driver to fix this.
Fixes: b63f20485e43 ("net: txgbe: Register fixed rate clock")
Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://lore.kernel.org/r/20240313080634.459523-1-duanqiangwen@net-swift.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Wed, 13 Mar 2024 03:23:29 +0000 (20:23 -0700)]
docs: networking: fix indentation errors in multi-pf-netdev
Stephen reports new warnings in the docs:
Documentation/networking/multi-pf-netdev.rst:94: ERROR: Unexpected indentation.
Documentation/networking/multi-pf-netdev.rst:106: ERROR: Unexpected indentation.
Fixes: 77d9ec3f6c8c ("Documentation: networking: Add description for multi-pf netdev")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lore.kernel.org/all/20240312153304.0ef1b78e@canb.auug.org.au/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240313032329.3919036-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 14 Mar 2024 12:09:53 +0000 (13:09 +0100)]
Merge branch 'rxrpc-fixes-for-af_rxrpc'
David Howells says:
====================
rxrpc: Fixes for AF_RXRPC
Here are a couple of fixes for the AF_RXRPC changes[1] in net-next.
(1) Fix a runtime warning introduced by a patch that changed how
page_frag_alloc_align() works.
(2) Fix an is-NULL vs IS_ERR error handling bug.
The patches are tagged here:
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/rxrpc-iothread-20240312
And can be found on this branch:
http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-iothread
Link: https://lore.kernel.org/r/20240306000655.1100294-1-dhowells@redhat.com/
====================
Link: https://lore.kernel.org/r/20240312233723.2984928-1-dhowells@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
David Howells [Tue, 12 Mar 2024 23:37:18 +0000 (23:37 +0000)]
rxrpc: Fix error check on ->alloc_txbuf()
rxrpc_alloc_*_txbuf() and ->alloc_txbuf() return NULL to indicate no
memory, but rxrpc_send_data() uses IS_ERR().
Fix rxrpc_send_data() to check for NULL only and set -ENOMEM if it sees
that.
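The key point is that IS_ERR(NULL) is false, so an IS_ERR() test never
catches a NULL return and the allocation failure slips through. A userspace
sketch of both conventions follows; the model_* names are invented, and
model_is_err() copies the kernel's encoding of errors in the top MAX_ERRNO
(4095) pointer values.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace copy of the IS_ERR() convention. */
#define MODEL_MAX_ERRNO 4095
static int model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Hypothetical allocator that, like ->alloc_txbuf(), returns NULL on failure. */
static void *model_alloc_txbuf(size_t size, int fail)
{
	return fail ? NULL : malloc(size);
}

/* Fixed caller: test for NULL and turn it into -ENOMEM. */
static int model_send_data(size_t size, int fail)
{
	void *txb = model_alloc_txbuf(size, fail);

	if (!txb)
		return -ENOMEM;
	free(txb);
	return 0;
}
```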
Fixes: 49489bb03a50 ("rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags")
Signed-off-by: David Howells <dhowells@redhat.com>
Reported-by: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
David Howells [Tue, 12 Mar 2024 23:37:17 +0000 (23:37 +0000)]
rxrpc: Fix use of changed alignment param to page_frag_alloc_align()
Commit 411c5f36805c ("mm/page_alloc: modify page_frag_alloc_align() to
accept align as an argument") changed the way page_frag_alloc_align()
works, but it didn't fix AF_RXRPC, as that use of the allocator function
hadn't been merged yet at the time. Now, when the AFS filesystem is used,
this results in:
WARNING: CPU: 4 PID: 379 at include/linux/gfp.h:323 rxrpc_alloc_data_txbuf+0x9d/0x2b0 [rxrpc]
Fix this by using __page_frag_alloc_align() instead.
Note that it might be better to use an order-based alignment rather than a
mask-based alignment.
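The difference between the two conventions is easy to see in isolation. The
sketch assumes that, since 411c5f36805c, page_frag_alloc_align() expects a
power-of-two alignment (warning otherwise) and derives the mask as -align,
while __page_frag_alloc_align() still takes the mask directly. A caller that
passes a mask where an alignment is expected trips the power-of-two check.
The helpers are illustrative userspace code, not the mm API.

```c
#include <stdbool.h>
#include <stdint.h>

static bool is_power_of_2_u32(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Alignment -> mask, as assumed for the align-based API: -align. */
static uint32_t align_to_mask(uint32_t align)
{
	return (uint32_t)0 - align;
}

/* Order-based alternative mentioned above: align = 1 << order. */
static uint32_t order_to_align(unsigned int order)
{
	return UINT32_C(1) << order;
}

/* Round an offset down to the alignment described by mask. */
static uint32_t align_down(uint32_t offset, uint32_t mask)
{
	return offset & mask;
}
```

For instance, align_to_mask(64) is 0xFFFFFFC0, which is not a power of two,
so handing that value to an align-taking function would trigger the warning.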
Fixes: 49489bb03a50 ("rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags")
Signed-off-by: David Howells <dhowells@redhat.com>
Reported-by: Marc Dionne <marc.dionne@auristor.com>
cc: Yunsheng Lin <linyunsheng@huawei.com>
cc: Alexander Duyck <alexander.duyck@gmail.com>
cc: Michael S. Tsirkin <mst@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Shigeru Yoshida [Tue, 12 Mar 2024 15:27:19 +0000 (00:27 +0900)]
hsr: Fix uninit-value access in hsr_get_node()
KMSAN reported the following uninit-value access issue [1]:
=====================================================
BUG: KMSAN: uninit-value in hsr_get_node+0xa2e/0xa40 net/hsr/hsr_framereg.c:246
hsr_get_node+0xa2e/0xa40 net/hsr/hsr_framereg.c:246
fill_frame_info net/hsr/hsr_forward.c:577 [inline]
hsr_forward_skb+0xe12/0x30e0 net/hsr/hsr_forward.c:615
hsr_dev_xmit+0x1a1/0x270 net/hsr/hsr_device.c:223
__netdev_start_xmit include/linux/netdevice.h:4940 [inline]
netdev_start_xmit include/linux/netdevice.h:4954 [inline]
xmit_one net/core/dev.c:3548 [inline]
dev_hard_start_xmit+0x247/0xa10 net/core/dev.c:3564
__dev_queue_xmit+0x33b8/0x5130 net/core/dev.c:4349
dev_queue_xmit include/linux/netdevice.h:3134 [inline]
packet_xmit+0x9c/0x6b0 net/packet/af_packet.c:276
packet_snd net/packet/af_packet.c:3087 [inline]
packet_sendmsg+0x8b1d/0x9f30 net/packet/af_packet.c:3119
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
__sys_sendto+0x735/0xa10 net/socket.c:2191
__do_sys_sendto net/socket.c:2203 [inline]
__se_sys_sendto net/socket.c:2199 [inline]
__x64_sys_sendto+0x125/0x1c0 net/socket.c:2199
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x6d/0x140 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b
Uninit was created at:
slab_post_alloc_hook+0x129/0xa70 mm/slab.h:768
slab_alloc_node mm/slub.c:3478 [inline]
kmem_cache_alloc_node+0x5e9/0xb10 mm/slub.c:3523
kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:560
__alloc_skb+0x318/0x740 net/core/skbuff.c:651
alloc_skb include/linux/skbuff.h:1286 [inline]
alloc_skb_with_frags+0xc8/0xbd0 net/core/skbuff.c:6334
sock_alloc_send_pskb+0xa80/0xbf0 net/core/sock.c:2787
packet_alloc_skb net/packet/af_packet.c:2936 [inline]
packet_snd net/packet/af_packet.c:3030 [inline]
packet_sendmsg+0x70e8/0x9f30 net/packet/af_packet.c:3119
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
__sys_sendto+0x735/0xa10 net/socket.c:2191
__do_sys_sendto net/socket.c:2203 [inline]
__se_sys_sendto net/socket.c:2199 [inline]
__x64_sys_sendto+0x125/0x1c0 net/socket.c:2199
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x6d/0x140 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b
CPU: 1 PID: 5033 Comm: syz-executor334 Not tainted 6.7.0-syzkaller-00562-g9f8413c4a66f #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
=====================================================
If the packet type ID field in the Ethernet header is either ETH_P_PRP or
ETH_P_HSR, but it is not followed by an HSR tag, hsr_get_skb_sequence_nr()
reads an invalid value as a sequence number. This causes the above issue.
This patch fixes the issue by returning NULL if the Ethernet header is not
followed by an HSR tag.
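The shape of the check can be shown with a simplified frame parser. This
sketch assumes the standard HSR tag layout (6 bytes after the EtherType: a
path/LSDU size field, the sequence number, then the encapsulated protocol)
and works on a flat byte buffer; the function and macro names are invented,
and the kernel fix compares against its own header structures instead.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_ETH_HLEN   14
#define MODEL_HSR_TAGLEN 6      /* path/LSDU size, sequence nr, encap proto */
#define MODEL_ETH_P_PRP  0x88FB
#define MODEL_ETH_P_HSR  0x892F

/* Return 0 and store the sequence number, or -1 when an HSR/PRP
 * EtherType is not followed by a complete tag. */
static int model_hsr_seqnr(const uint8_t *frame, size_t len, uint16_t *seq)
{
	uint16_t proto;

	if (len < MODEL_ETH_HLEN)
		return -1;
	proto = (uint16_t)((frame[12] << 8) | frame[13]);
	if (proto != MODEL_ETH_P_PRP && proto != MODEL_ETH_P_HSR)
		return -1; /* not handled in this simplified model */
	if (len < MODEL_ETH_HLEN + MODEL_HSR_TAGLEN)
		return -1; /* EtherType claims a tag, but there is none */
	/* The sequence number follows the 2-byte path/LSDU size field. */
	*seq = (uint16_t)((frame[16] << 8) | frame[17]);
	return 0;
}
```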
Fixes: f266a683a480 ("net/hsr: Better frame dispatch")
Reported-and-tested-by: syzbot+2ef3a8ce8e91b5a50098@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2ef3a8ce8e91b5a50098 [1]
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Link: https://lore.kernel.org/r/20240312152719.724530-1-syoshida@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
William Tu [Sat, 9 Mar 2024 18:31:47 +0000 (20:31 +0200)]
vmxnet3: Fix missing reserved tailroom
Use rbi->len instead of rcd->len for non-dataring packets.
Found issue:
XDP_WARN: xdp_update_frame_from_buff(line:278): Driver BUG: missing reserved tailroom
WARNING: CPU: 0 PID: 0 at net/core/xdp.c:586 xdp_warn+0xf/0x20
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W O 6.5.1 #1
RIP: 0010:xdp_warn+0xf/0x20
...
? xdp_warn+0xf/0x20
xdp_do_redirect+0x15f/0x1c0
vmxnet3_run_xdp+0x17a/0x400 [vmxnet3]
vmxnet3_process_xdp+0xe4/0x760 [vmxnet3]
? vmxnet3_tq_tx_complete.isra.0+0x21e/0x2c0 [vmxnet3]
vmxnet3_rq_rx_complete+0x7ad/0x1120 [vmxnet3]
vmxnet3_poll_rx_only+0x2d/0xa0 [vmxnet3]
__napi_poll+0x20/0x180
net_rx_action+0x177/0x390
Reported-by: Martin Zaharinov <micron10@gmail.com>
Tested-by: Martin Zaharinov <micron10@gmail.com>
Link: https://lore.kernel.org/netdev/74BF3CC8-2A3A-44FF-98C2-1E20F110A92E@gmail.com/
Fixes: 54f00cce1178 ("vmxnet3: Add XDP support.")
Signed-off-by: William Tu <witu@nvidia.com>
Link: https://lore.kernel.org/r/20240309183147.28222-1-witu@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Kuniyuki Iwashima [Fri, 8 Mar 2024 20:16:23 +0000 (12:16 -0800)]
tcp: Fix refcnt handling in __inet_hash_connect().
syzbot reported a warning in sk_nulls_del_node_init_rcu().
Commit 66b60b0c8c4a ("dccp/tcp: Unhash sk from ehash for tb2 alloc
failure after check_estalblished().") tried to fix an issue where an
unconnected socket occupies an ehash entry when bhash2 allocation fails.
In such a case, we need to revert the changes done by check_established(),
which does not hold a refcnt when inserting the socket into ehash.
So, to revert the change, we need to use __sk_nulls_del_node_init_rcu()
instead of sk_nulls_del_node_init_rcu().
Otherwise, sock_put() will cause refcnt underflow and leak the socket.
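The rule is that an undo must mirror the refcounting of the operation it
reverts. A small model with invented names (the comments say which kernel
helper each one imitates) shows how pairing a non-refcounting insert with a
refcounting delete drops a reference that was never taken.

```c
#include <stdbool.h>

/* Hypothetical socket model; not struct sock. */
struct model_sk {
	int refcnt;
	bool hashed;
};

/* Like __sk_nulls_add_node_rcu(): insert without taking a reference. */
static void model_add_noref(struct model_sk *sk)
{
	sk->hashed = true;
}

/* Like __sk_nulls_del_node_init_rcu(): unhash without dropping a reference. */
static bool model_del_noref(struct model_sk *sk)
{
	bool was_hashed = sk->hashed;

	sk->hashed = false;
	return was_hashed;
}

/* Like sk_nulls_del_node_init_rcu(): unhash and drop the reference
 * that the refcounting insert would have taken. */
static bool model_del_ref(struct model_sk *sk)
{
	bool was_hashed = model_del_noref(sk);

	if (was_hashed)
		sk->refcnt--;
	return was_hashed;
}
```

In the model, add_noref followed by del_ref takes the count from 1 to 0 even
though the caller still owns its reference, mirroring the underflow described
above.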
[0]:
WARNING: CPU: 0 PID: 23948 at include/net/sock.h:799 sk_nulls_del_node_init_rcu+0x166/0x1a0 include/net/sock.h:799
Modules linked in:
CPU: 0 PID: 23948 Comm: syz-executor.2 Not tainted 6.8.0-rc6-syzkaller-00159-gc055fc00c07b #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
RIP: 0010:sk_nulls_del_node_init_rcu+0x166/0x1a0 include/net/sock.h:799
Code: e8 7f 71 c6 f7 83 fb 02 7c 25 e8 35 6d c6 f7 4d 85 f6 0f 95 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 1b 6d c6 f7 90 <0f> 0b 90 eb b2 e8 10 6d c6 f7 4c 89 e7 be 04 00 00 00 e8 63 e7 d2
RSP: 0018:ffffc900032d7848 EFLAGS: 00010246
RAX: ffffffff89cd0035 RBX: 0000000000000001 RCX: 0000000000040000
RDX: ffffc90004de1000 RSI: 000000000003ffff RDI: 0000000000040000
RBP: 1ffff1100439ac26 R08: ffffffff89ccffe3 R09: 1ffff1100439ac28
R10: dffffc0000000000 R11: ffffed100439ac29 R12: ffff888021cd6140
R13: dffffc0000000000 R14: ffff88802a9bf5c0 R15: ffff888021cd6130
FS: 00007f3b823f16c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3b823f0ff8 CR3: 000000004674a000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__inet_hash_connect+0x140f/0x20b0 net/ipv4/inet_hashtables.c:1139
dccp_v6_connect+0xcb9/0x1480 net/dccp/ipv6.c:956
__inet_stream_connect+0x262/0xf30 net/ipv4/af_inet.c:678
inet_stream_connect+0x65/0xa0 net/ipv4/af_inet.c:749
__sys_connect_file net/socket.c:2048 [inline]
__sys_connect+0x2df/0x310 net/socket.c:2065
__do_sys_connect net/socket.c:2075 [inline]
__se_sys_connect net/socket.c:2072 [inline]
__x64_sys_connect+0x7a/0x90 net/socket.c:2072
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7f3b8167dda9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f3b823f10c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 00007f3b817abf80 RCX: 00007f3b8167dda9
RDX: 000000000000001c RSI: 0000000020000040 RDI: 0000000000000003
RBP: 00007f3b823f1120 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 000000000000000b R14: 00007f3b817abf80 R15: 00007ffd3beb57b8
</TASK>
Reported-by: syzbot+12c506c1aae251e70449@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=12c506c1aae251e70449
Fixes: 66b60b0c8c4a ("dccp/tcp: Unhash sk from ehash for tb2 alloc failure after check_estalblished().")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240308201623.65448-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Shay Drory [Tue, 12 Mar 2024 10:52:38 +0000 (12:52 +0200)]
devlink: Fix devlink parallel commands processing
Commit 870c7ad4a52b ("devlink: protect devlink->dev by the instance lock")
added devlink instance locking inside a loop that iterates over all the
registered devlink instances on the machine in the pre-doit phase. This can
lead to serialization of devlink commands across different devlink
instances.
For example, while the first devlink instance is executing a firmware
flash, all commands to other devlink instances on the machine are forced
to wait until the first one finishes.
Therefore, in the pre-doit phase, take the devlink instance lock only
for the devlink instance the command is targeting. The devlink layer
takes a reference on the devlink instance, ensuring the devlink->dev
pointer is valid. This reference taking was introduced by commit
a380687200e0 ("devlink: take device reference for devlink object").
Without that commit, it would not be safe to access devlink->dev
locklessly.
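The locking change can be modelled with one mutex per instance: the pre-doit
step looks up the targeted instance and locks only that one, so a
long-running command on one instance does not block commands to another.
struct model_devlink and the helpers are invented for the example; the
model omits the instance reference that keeps devlink->dev valid.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a registered devlink instance. */
struct model_devlink {
	unsigned int index;
	pthread_mutex_t lock;
};

/* Pre-doit analogue: find the target by index and lock only that
 * instance; returns NULL if no instance matches. */
static struct model_devlink *model_pre_doit(struct model_devlink *all,
					    size_t n, unsigned int index)
{
	for (size_t i = 0; i < n; i++) {
		if (all[i].index == index) {
			pthread_mutex_lock(&all[i].lock);
			return &all[i];
		}
	}
	return NULL;
}

/* Post-doit analogue: release the targeted instance. */
static void model_post_doit(struct model_devlink *dl)
{
	pthread_mutex_unlock(&dl->lock);
}
```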
Fixes: 870c7ad4a52b ("devlink: protect devlink->dev by the instance lock")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 11 Mar 2024 20:46:28 +0000 (20:46 +0000)]
net/sched: taprio: proper TCA_TAPRIO_TC_ENTRY_INDEX check
taprio_parse_tc_entry() is not correctly checking the
TCA_TAPRIO_TC_ENTRY_INDEX attribute:
    int tc; // Signed value
    tc = nla_get_u32(tb[TCA_TAPRIO_TC_ENTRY_INDEX]);
    if (tc >= TC_QOPT_MAX_QUEUE) {
        NL_SET_ERR_MSG_MOD(extack, "TC entry index out of range");
        return -ERANGE;
    }
syzbot reported that it could feed arbitrary negative values:
UBSAN: shift-out-of-bounds in net/sched/sch_taprio.c:1722:18
shift exponent -2147418108 is negative
CPU: 0 PID: 5066 Comm: syz-executor367 Not tainted 6.8.0-rc7-syzkaller-00136-gc8a5c731fd12 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
ubsan_epilogue lib/ubsan.c:217 [inline]
__ubsan_handle_shift_out_of_bounds+0x3c7/0x420 lib/ubsan.c:386
taprio_parse_tc_entry net/sched/sch_taprio.c:1722 [inline]
taprio_parse_tc_entries net/sched/sch_taprio.c:1768 [inline]
taprio_change+0xb87/0x57d0 net/sched/sch_taprio.c:1877
taprio_init+0x9da/0xc80 net/sched/sch_taprio.c:2134
qdisc_create+0x9d4/0x1190 net/sched/sch_api.c:1355
tc_modify_qdisc+0xa26/0x1e40 net/sched/sch_api.c:1776
rtnetlink_rcv_msg+0x885/0x1040 net/core/rtnetlink.c:6617
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2543
netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1367
netlink_sendmsg+0xa3b/0xd70 net/netlink/af_netlink.c:1908
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2584
___sys_sendmsg net/socket.c:2638 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7f1b2dea3759
Code: 48 83 c4 28 c3 e8 d7 19 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd4de452f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f1b2def0390 RCX: 00007f1b2dea3759
RDX: 0000000000000000 RSI: 00000000200007c0 RDI: 0000000000000004
RBP: 0000000000000003 R08: 0000555500000000 R09: 0000555500000000
R10: 0000555500000000 R11: 0000000000000246 R12: 00007ffd4de45340
R13: 00007ffd4de45310 R14: 0000000000000001 R15: 00007ffd4de45340
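The signed comparison can be reproduced in userspace. In this sketch (names
invented, TC_QOPT_MAX_QUEUE assumed to be 16), an attribute of 0x80010004
stored in a 32-bit int becomes -2147418108, the exponent in the UBSAN
report, and passes the upper-bound check; keeping the index unsigned rejects
it. This illustrates the bug class and is not necessarily how the upstream
patch closes it. The unsigned-to-signed conversion is implementation-defined
in C; the comment assumes the usual two's-complement targets.

```c
#include <stdint.h>

#define MODEL_TC_QOPT_MAX_QUEUE 16 /* assumed value of TC_QOPT_MAX_QUEUE */

/* Buggy shape: the u32 attribute lands in a signed int, so values with
 * the top bit set become negative (on two's-complement targets) and
 * pass the upper-bound check. Returns 0 if accepted, -1 if rejected. */
static int check_tc_signed(uint32_t attr)
{
	int tc = (int)attr;

	return tc >= MODEL_TC_QOPT_MAX_QUEUE ? -1 : 0;
}

/* Keeping the index unsigned makes one comparison reject both large
 * and would-be-negative values. */
static int check_tc_unsigned(uint32_t attr)
{
	uint32_t tc = attr;

	return tc >= MODEL_TC_QOPT_MAX_QUEUE ? -1 : 0;
}
```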
Fixes: a54fc09e4cba ("net/sched: taprio: allow user input of per-tc max SDU")
Reported-and-tested-by: syzbot+a340daa06412d6028918@syzkaller.appspotmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linu Cherian [Tue, 12 Mar 2024 07:06:22 +0000 (12:36 +0530)]
octeontx2-af: Use matching wake_up API variant in CGX command interface
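Use wake_up() instead of wake_up_interruptible(), since wait_event_timeout()
is used to wait for command completion. wait_event_timeout() sleeps in
TASK_UNINTERRUPTIBLE state, and wake_up_interruptible() does not wake such sleepers.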
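The following is a minimal userspace model of the mismatch. It assumes the kernel's wake-up filtering works like this: a wake-up affects only sleepers whose state is in the wake mode's mask. wake_up() passes TASK_NORMAL (both bits), while wake_up_interruptible() passes only TASK_INTERRUPTIBLE. The struct and the model_* functions are hypothetical and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Sleep-state bits, mirroring the kernel's names (userspace model only). */
#define TASK_INTERRUPTIBLE   0x1u
#define TASK_UNINTERRUPTIBLE 0x2u
#define TASK_NORMAL          (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

/* Hypothetical waiter: the state it sleeps in and whether it was woken. */
struct model_waiter {
	unsigned int state;
	bool woken;
};

/* Wake the waiter only if its sleep state is covered by the wake mode. */
static void model_wake(struct model_waiter *w, unsigned int mode)
{
	assert(mode != 0);
	if (w->state & mode)
		w->woken = true;
}

/* Like wake_up(): wakes interruptible and uninterruptible sleepers. */
static void model_wake_up(struct model_waiter *w)
{
	model_wake(w, TASK_NORMAL);
}

/* Like wake_up_interruptible(): wakes interruptible sleepers only. */
static void model_wake_up_interruptible(struct model_waiter *w)
{
	model_wake(w, TASK_INTERRUPTIBLE);
}
```

A waiter that models wait_event_timeout() starts in TASK_UNINTERRUPTIBLE. model_wake_up_interruptible() leaves it asleep, so under this model it would only re-check its condition when the timeout expires. model_wake_up() wakes it immediately.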
Fixes: 1463f382f58d ("octeontx2-af: Add support for CGX link management")
Signed-off-by: Linu Cherian <lcherian@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Mon, 11 Mar 2024 16:38:30 +0000 (12:38 -0400)]
soc: fsl: qbman: Use raw spinlock for cgr_lock
smp_call_function always runs its callback in hard IRQ context, even on
PREEMPT_RT, where spinlocks can sleep. So we need to use a raw spinlock
for cgr_lock to ensure we aren't waiting on a sleeping task.
Although this bug has existed for a while, it was not apparent until
commit ef2a8d5478b9 ("net: dpaa: Adjust queue depth on rate change"),
which invokes smp_call_function_single via qman_update_cgr_safe every
time a link goes up or down.
Fixes: 96f413f47677 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()")
CC: stable@vger.kernel.org
Reported-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Closes: https://lore.kernel.org/all/20230323153935.nofnjucqjqnz34ej@skbuf/
Reported-by: Steffen Trumtrar <s.trumtrar@pengutronix.de>
Closes: https://lore.kernel.org/linux-arm-kernel/87wmsyvclu.fsf@pengutronix.de/
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed-by: Camelia Groza <camelia.groza@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Mon, 11 Mar 2024 16:38:29 +0000 (12:38 -0400)]
soc: fsl: qbman: Always disable interrupts when taking cgr_lock
smp_call_function_single disables IRQs when executing the callback. To
prevent deadlocks, we must disable IRQs when taking cgr_lock elsewhere.
This is already done by qman_update_cgr and qman_delete_cgr; fix the
other lockers.
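Below is a toy single-CPU model of the deadlock this avoids. It assumes a simplified machine where an IPI is delivered immediately when IRQs are enabled and stays pending otherwise. The struct and functions are hypothetical stand-ins for spin_lock() and spin_lock_irqsave(), not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-CPU state: are IRQs enabled, and does this CPU
 * currently hold the cgr_lock-like spinlock? */
struct model_cpu {
	bool irqs_enabled;
	bool lock_held;
};

/* Like spin_lock(): takes the lock but leaves IRQs as they are. */
static void lock_plain(struct model_cpu *c)
{
	assert(!c->lock_held);
	c->lock_held = true;
}

static void unlock_plain(struct model_cpu *c)
{
	assert(c->lock_held);
	c->lock_held = false;
}

/* Like spin_lock_irqsave(): saves IRQ state, disables IRQs, then locks. */
static void lock_irqsave(struct model_cpu *c, bool *flags)
{
	*flags = c->irqs_enabled;
	c->irqs_enabled = false;
	lock_plain(c);
}

/* Like spin_unlock_irqrestore(): unlocks, then restores saved IRQ state. */
static void unlock_irqrestore(struct model_cpu *c, bool flags)
{
	unlock_plain(c);
	c->irqs_enabled = flags;
}

/* An IPI whose callback takes the same lock is delivered at once if IRQs
 * are enabled. If this CPU already holds the lock, the callback spins on it
 * forever: the interrupted holder cannot run until the callback returns. */
static bool ipi_would_deadlock(const struct model_cpu *c)
{
	return c->irqs_enabled && c->lock_held;
}
```

With the plain lock, the IPI callback lands on top of the holder and spins. With the irqsave variant, the IPI stays pending until unlock_irqrestore() re-enables IRQs.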
Fixes: 96f413f47677 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()")
CC: stable@vger.kernel.org
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed-by: Camelia Groza <camelia.groza@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 13 Mar 2024 01:56:18 +0000 (18:56 -0700)]
Merge branch 'tcp-rds-fix-use-after-free-around-kernel-tcp-reqsk'
Kuniyuki Iwashima says:
====================
tcp/rds: Fix use-after-free around kernel TCP reqsk.
syzkaller reported a warning of netns ref tracker for RDS TCP listener,
which commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
inet_twsk_purge()") fixed for per-netns ehash.
This series fixes the bug in the partial fix and fixes the reported bug
in the global ehash.
v4: https://lore.kernel.org/netdev/20240307232151.55963-1-kuniyu@amazon.com/
v3: https://lore.kernel.org/netdev/20240307224423.53315-1-kuniyu@amazon.com/
v2: https://lore.kernel.org/netdev/20240227011041.97375-1-kuniyu@amazon.com/
v1: https://lore.kernel.org/netdev/20240223172448.94084-1-kuniyu@amazon.com/
====================
Link: https://lore.kernel.org/r/20240308200122.64357-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Fri, 8 Mar 2024 20:01:22 +0000 (12:01 -0800)]
rds: tcp: Fix use-after-free of net in reqsk_timer_handler().
syzkaller reported a warning of netns tracker [0] followed by KASAN
splat [1] and another ref tracker warning [2].
syzkaller could not find a repro, but in the log, the only suspicious
sequence was as follows:
18:26:22 executing program 1:
r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
...
connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0, @loopback}, 0x1c) (async)
The notable thing here is 0x4001 in connect(), which is RDS_TCP_PORT.
So, the scenario would be:
1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
rds_tcp_listen_init().
2. syz-executor connect()s to it and creates a reqsk.
3. syz-executor exit()s immediately.
4. netns is dismantled. [0]
5. reqsk timer is fired, and UAF happens while freeing reqsk. [1]
6. listener is freed after RCU grace period. [2]
Basically, reqsk assumes that the listener guarantees netns safety
until all reqsk timers are expired by holding the listener's refcount.
However, this was not the case for kernel sockets.
Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
inet_twsk_purge()") fixed this issue only for per-netns ehash.
Let's apply the same fix for the global ehash.
[0]:
ref_tracker: net notrefcnt@0000000065449cc3 has 1/1 users at
sk_alloc (./include/net/net_namespace.h:337 net/core/sock.c:2146)
inet6_create (net/ipv6/af_inet6.c:192 net/ipv6/af_inet6.c:119)
__sock_create (net/socket.c:1572)
rds_tcp_listen_init (net/rds/tcp_listen.c:279)
rds_tcp_init_net (net/rds/tcp.c:577)
ops_init (net/core/net_namespace.c:137)
setup_net (net/core/net_namespace.c:340)
copy_net_ns (net/core/net_namespace.c:497)
create_new_namespaces (kernel/nsproxy.c:110)
unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4))
ksys_unshare (kernel/fork.c:3429)
__x64_sys_unshare (kernel/fork.c:3496)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
...
WARNING: CPU: 0 PID: 27 at lib/ref_tracker.c:179 ref_tracker_dir_exit (lib/ref_tracker.c:179)
[1]:
BUG: KASAN: slab-use-after-free in inet_csk_reqsk_queue_drop (./include/net/inet_hashtables.h:180 net/ipv4/inet_connection_sock.c:952 net/ipv4/inet_connection_sock.c:966)
Read of size 8 at addr ffff88801b370400 by task swapper/0/0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
print_report (mm/kasan/report.c:378 mm/kasan/report.c:488)
kasan_report (mm/kasan/report.c:603)
inet_csk_reqsk_queue_drop (./include/net/inet_hashtables.h:180 net/ipv4/inet_connection_sock.c:952 net/ipv4/inet_connection_sock.c:966)
reqsk_timer_handler (net/ipv4/inet_connection_sock.c:979 net/ipv4/inet_connection_sock.c:1092)
call_timer_fn (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/timer.h:127 kernel/time/timer.c:1701)
__run_timers.part.0 (kernel/time/timer.c:1752 kernel/time/timer.c:2038)
run_timer_softirq (kernel/time/timer.c:2053)
__do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:554)
irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:644)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076 (discriminator 14))
</IRQ>
Allocated by task 258 on cpu 0 at 83.612050s:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:68)
__kasan_slab_alloc (mm/kasan/common.c:343)
kmem_cache_alloc (mm/slub.c:3813 mm/slub.c:3860 mm/slub.c:3867)
copy_net_ns (./include/linux/slab.h:701 net/core/net_namespace.c:421 net/core/net_namespace.c:480)
create_new_namespaces (kernel/nsproxy.c:110)
unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4))
ksys_unshare (kernel/fork.c:3429)
__x64_sys_unshare (kernel/fork.c:3496)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
Freed by task 27 on cpu 0 at 329.158864s:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:68)
kasan_save_free_info (mm/kasan/generic.c:643)
__kasan_slab_free (mm/kasan/common.c:265)
kmem_cache_free (mm/slub.c:4299 mm/slub.c:4363)
cleanup_net (net/core/net_namespace.c:456 net/core/net_namespace.c:446 net/core/net_namespace.c:639)
process_one_work (kernel/workqueue.c:2638)
worker_thread (kernel/workqueue.c:2700 kernel/workqueue.c:2787)
kthread (kernel/kthread.c:388)
ret_from_fork (arch/x86/kernel/process.c:153)
ret_from_fork_asm (arch/x86/entry/entry_64.S:250)
The buggy address belongs to the object at ffff88801b370000
which belongs to the cache net_namespace of size 4352
The buggy address is located 1024 bytes inside of
freed 4352-byte region [ffff88801b370000, ffff88801b371100)
[2]:
WARNING: CPU: 0 PID: 95 at lib/ref_tracker.c:228 ref_tracker_free (lib/ref_tracker.c:228 (discriminator 1))
Modules linked in:
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:ref_tracker_free (lib/ref_tracker.c:228 (discriminator 1))
...
Call Trace:
<IRQ>
__sk_destruct (./include/net/net_namespace.h:353 net/core/sock.c:2204)
rcu_core (./arch/x86/include/asm/preempt.h:26 kernel/rcu/tree.c:2165 kernel/rcu/tree.c:2433)
__do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:554)
irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:644)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076 (discriminator 14))
</IRQ>
Reported-by: syzkaller <syzkaller@googlegroups.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240308200122.64357-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Fri, 8 Mar 2024 20:01:21 +0000 (12:01 -0800)]
tcp: Fix NEW_SYN_RECV handling in inet_twsk_purge()
inet_twsk_purge() uses rcu to find TIME_WAIT and NEW_SYN_RECV
objects to purge.
These objects use SLAB_TYPESAFE_BY_RCU semantics and need special
care. We need to use refcount_inc_not_zero(&sk->sk_refcnt).
Reuse the existing correct logic I wrote for TIME_WAIT,
because both structures have common locations for
sk_state, sk_family, and netns pointer.
If, after the refcount_inc_not_zero(), the object fields no longer match
the keys, use sock_gen_put(sk) to release the refcount.
Then we can call inet_twsk_deschedule_put() for TIME_WAIT,
inet_csk_reqsk_queue_drop_and_put() for NEW_SYN_RECV sockets,
with BH disabled.
Then we need to restart the loop because we had to drop rcu_read_lock().
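The lookup pattern above can be sketched in single-threaded C11. The sketch assumes a simplified object with a refcount and two lookup keys, which are hypothetical stand-ins for sk_state/sk_family and the netns pointer. ref_inc_not_zero() plays the role of refcount_inc_not_zero(). The re-check after pinning is what guards against a SLAB_TYPESAFE_BY_RCU slot being reused for a different object. A real implementation also needs RCU protection and READ_ONCE()-style accesses for the key reads, which this sketch omits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object whose memory may be recycled for another object of
 * the same type while a lockless reader still holds a pointer to it. */
struct model_obj {
	atomic_uint refcnt;
	int family;	/* lookup key, stand-in for sk_family */
	int state;	/* lookup key, stand-in for sk_state */
};

/* Take a reference only if the object is still live (refcnt != 0). */
static bool ref_inc_not_zero(atomic_uint *r)
{
	unsigned int old = atomic_load(r);

	do {
		if (old == 0)
			return false;	/* being freed: must not resurrect it */
	} while (!atomic_compare_exchange_weak(r, &old, old + 1));
	return true;
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool ref_put(atomic_uint *r)
{
	unsigned int old = atomic_fetch_sub(r, 1);

	assert(old > 0);
	return old == 1;
}

/* Pin first, then re-validate the keys: between the first check and the
 * increment, the slot may have been freed and reused for another object. */
static struct model_obj *model_grab(struct model_obj *o, int family, int state)
{
	if (o->family != family || o->state != state)
		return NULL;
	if (!ref_inc_not_zero(&o->refcnt))
		return NULL;
	if (o->family != family || o->state != state) {
		/* Recycled: drop our reference. A real caller must free the
		 * object if this was the last reference, as sock_gen_put()
		 * does; the model just discards the result. */
		(void)ref_put(&o->refcnt);
		return NULL;
	}
	return o;
}
```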
Fixes: 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in inet_twsk_purge()")
Link: https://lore.kernel.org/netdev/CANn89iLvFuuihCtt9PME2uS1WJATnf5fKjDToa1WzVnRzHnPfg@mail.gmail.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240308200122.64357-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Wed, 13 Mar 2024 00:44:08 +0000 (17:44 -0700)]
Merge tag 'net-next-6.9' of git://git./linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Large effort by Eric to lower rtnl_lock pressure and remove locks:
- Make commonly used parts of rtnetlink (address, route dumps
etc) lockless, protected by RCU instead of rtnl_lock.
- Add a netns exit callback which already holds rtnl_lock,
allowing netns exit to take rtnl_lock once in the core instead
of once for each driver / callback.
- Remove locks / serialization in the socket diag interface.
- Remove 6 calls to synchronize_rcu() while holding rtnl_lock.
- Remove the dev_base_lock, depend on RCU where necessary.
- Support busy polling on a per-epoll context basis. Poll length and
budget parameters can be set independently of system defaults.
- Introduce struct net_hotdata, to make sure read-mostly global
config variables fit in as few cache lines as possible.
- Add optional per-nexthop statistics to ease monitoring / debug of
ECMP imbalance problems.
- Support TCP_NOTSENT_LOWAT in MPTCP.
- Ensure that IPv6 temporary addresses' preferred lifetimes are long
enough, compared to other configured lifetimes, and at least 2 sec.
- Support forwarding of ICMP Error messages in IPSec, per RFC 4301.
- Add support for the independent control state machine for bonding
per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled
control state machine.
- Add "network ID" to MCTP socket APIs to support hosts with multiple
disjoint MCTP networks.
- Re-use the mono_delivery_time skbuff bit for packets which user
space wants to be sent at a specified time. Maintain the timing
information while traversing veth links, bridge etc.
- Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets.
- Simplify many places iterating over netdevs by using an xarray
instead of a hash table walk (hash table remains in place, for use
on fastpaths).
- Speed up scanning for expired routes by keeping a dedicated list.
- Speed up "generic" XDP by trying harder to avoid large allocations.
- Support attaching arbitrary metadata to netconsole messages.
Things we sprinkled into general kernel code:
- Enforce VM_IOREMAP flag and range in ioremap_page_range and
introduce VM_SPARSE kind and vm_area_[un]map_pages (used by
bpf_arena).
- Rework selftest harness to enable the use of the full range of ksft
exit code (pass, fail, skip, xfail, xpass).
Netfilter:
- Allow userspace to define a table that is exclusively owned by a
daemon (via netlink socket aliveness) without auto-removing this
table when the userspace program exits. Such table gets marked as
orphaned and a restarting management daemon can re-attach/regain
ownership.
- Speed up element insertions to nftables' concatenated-ranges set
type. Compact a few related data structures.
BPF:
- Add BPF token support for delegating a subset of BPF subsystem
functionality from privileged system-wide daemons such as systemd
through special mount options for userns-bound BPF fs to a trusted
& unprivileged application.
- Introduce bpf_arena, which is a sparse shared memory region between
BPF program and user space where structures inside the arena can
have pointers to other areas of the arena, and pointers work
seamlessly for both user-space programs and BPF programs.
- Introduce may_goto instruction that is a contract between the
verifier and the program. The verifier allows the program to loop
assuming it's behaving well, but reserves the right to terminate
it.
- Extend the BPF verifier to enable static subprog calls in spin lock
critical sections.
- Support registration of struct_ops types from modules, which helps
projects like fuse-bpf that seek to implement a new struct_ops
type.
- Add support for retrieval of cookies for perf/kprobe multi links.
- Support arbitrary TCP SYN cookie generation / validation in the TC
layer with BPF to allow creating SYN flood handling in BPF
firewalls.
- Add code generation to inline the bpf_kptr_xchg() helper which
improves performance when stashing/popping the allocated BPF
objects.
Wireless:
- Add SPP (signaling and payload protected) AMSDU support.
- Support wider bandwidth OFDMA, as required for EHT operation.
Driver API:
- Major overhaul of the Energy Efficient Ethernet internals to
support new link modes (2.5GE, 5GE), share more code between
drivers (especially those using phylib), and encourage more
uniform behavior. Convert and clean up drivers.
- Define an API for querying per netdev queue statistics from
drivers.
- IPSec: account in global stats for fully offloaded sessions.
- Create a concept of Ethernet PHY Packages at the Device Tree level,
to allow parameterizing the existing PHY package code.
- Enable Rx hashing (RSS) on GTP protocol fields.
Misc:
- Improvements and refactoring all over networking selftests.
- Create uniform module aliases for TC classifiers, actions, and
packet schedulers to simplify creating modprobe policies.
- Address all missing MODULE_DESCRIPTION() warnings in networking.
- Extend the Netlink descriptions in YAML to cover message
encapsulation or "Netlink polymorphism", where interpretation of
nested attributes depends on link type, classifier type or some
other "class type".
Drivers:
- Ethernet high-speed NICs:
- Add a new driver for Marvell's Octeon PCI Endpoint NIC VF.
- Intel (100G, ice, idpf):
- support E825-C devices
- nVidia/Mellanox:
- support devices with one port and multiple PCIe links
- Broadcom (bnxt):
- support n-tuple filters
- support configuring the RSS key
- Wangxun (ngbe/txgbe):
- implement irq_domain for TXGBE's sub-interrupts
- Pensando/AMD:
- support XDP
- optimize queue submission and wakeup handling (+17% bps)
- optimize struct layout, saving 28% of memory on queues
- Ethernet NICs embedded and virtual:
- Google cloud vNIC:
- refactor driver to perform memory allocations for new queue
config before stopping and freeing the old queue memory
- Synopsys (stmmac):
- obey queueMaxSDU and implement counters required by 802.1Qbv
- Renesas (ravb):
- support packet checksum offload
- suspend to RAM and runtime PM support
- Ethernet switches:
- nVidia/Mellanox:
- support for nexthop group statistics
- Microchip:
- ksz8: implement PHY loopback
- add support for KSZ8567, a 7-port 10/100Mbps switch
- PTP:
- New driver for RENESAS FemtoClock3 Wireless clock generator.
- Support OCP PTP cards designed and built by Adva.
- CAN:
- Support recvmsg() flags for own, local and remote traffic on CAN
BCM sockets.
- Support for esd GmbH PCIe/402 CAN device family.
- m_can:
- Rx/Tx submission coalescing
- wake on frame Rx
- WiFi:
- Intel (iwlwifi):
- enable signaling and payload protected A-MSDUs
- support wider-bandwidth OFDMA
- support for new devices
- bump FW API to 89 for AX devices; 90 for BZ/SC devices
- MediaTek (mt76):
- mt7915: newer ADIE version support
- mt7925: radio temperature sensor support
- Qualcomm (ath11k):
- support 6 GHz station power modes: Low Power Indoor (LPI),
Standard Power (SP) and Very Low Power (VLP)
- QCA6390 & WCN6855: support 2 concurrent station interfaces
- QCA2066 support
- Qualcomm (ath12k):
- refactoring in preparation for Multi-Link Operation (MLO)
support
- 1024 Block Ack window size support
- firmware-2.bin support
- support having multiple identical PCI devices (firmware needs
to have ATH12K_FW_FEATURE_MULTI_QRTR_ID)
- QCN9274: support split-PHY devices
- WCN7850: enable Power Save Mode in station mode
- WCN7850: P2P support
- RealTek:
- rtw88: support for more rtw8811cu and rtw8821cu devices
- rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL
- rtlwifi: speed up USB firmware initialization
- rtl8xxxu:
- RTL8188F: concurrent interface support
- Channel Switch Announcement (CSA) support in AP mode
- Broadcom (brcmfmac):
- per-vendor feature support
- per-vendor SAE password setup
- DMI nvram filename quirk for ACEPC W5 Pro"
* tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits)
nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y
nexthop: Fix out-of-bounds access during attribute validation
nexthop: Only parse NHA_OP_FLAGS for dump messages that require it
nexthop: Only parse NHA_OP_FLAGS for get messages that require it
bpf: move sleepable flag from bpf_prog_aux to bpf_prog
bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()
selftests/bpf: Add kprobe multi triggering benchmarks
ptp: Move from simple ida to xarray
vxlan: Remove generic .ndo_get_stats64
vxlan: Do not alloc tstats manually
devlink: Add comments to use netlink gen tool
nfp: flower: handle acti_netdevs allocation failure
net/packet: Add getsockopt support for PACKET_COPY_THRESH
net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID
selftests/bpf: Add bpf_arena_htab test.
selftests/bpf: Add bpf_arena_list test.
selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
bpf: Add helper macro bpf_addr_space_cast()
libbpf: Recognize __arena global variables.
bpftool: Recognize arena map type
...
Linus Torvalds [Tue, 12 Mar 2024 22:18:34 +0000 (15:18 -0700)]
Merge tag 'docs-6.9' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"A moderately busy cycle for development this time around.
- Some cleanup of the main index page for easier navigation
- Rework some of the other top-level pages for better readability
and, with luck, fewer merge conflicts in the future.
- Submit-checklist improvements, hopefully the first of many.
- New Italian translations
- A fair number of kernel-doc fixes and improvements. We have also
dropped the recommendation to use an old version of Sphinx.
- A new document from Thorsten on bisection
... and lots of fixes and updates"
* tag 'docs-6.9' of git://git.lwn.net/linux: (54 commits)
docs: verify/bisect: fixes, finetuning, and support for Arch
docs: Makefile: Add dependency to $(YNL_INDEX) for targets other than htmldocs
docs: Move ja_JP/howto.rst to ja_JP/process/howto.rst
docs: submit-checklist: use subheadings
docs: submit-checklist: structure by category
docs: new text on bisecting which also covers bug validation
docs: drop the version constraints for sphinx and dependencies
docs: kerneldoc-preamble.sty: Remove code for Sphinx <2.4
docs: Restore "smart quotes" for quotes
docs/zh_CN: accurate translation of "function"
docs: Include simplified link titles in main index
docs: Correct formatting of title in admin-guide/index.rst
docs: kernel_feat.py: fix build error for missing files
MAINTAINERS: Set the field name for subsystem profile section
kasan: Add documentation for CONFIG_KASAN_EXTRA_INFO
Fixed case issue with 'fault-injection' in documentation
kernel-doc: handle #if in enums as well
Documentation: update mailing list addresses
doc: kerneldoc.py: fix indentation
scripts/kernel-doc: simplify signature printing
...
Linus Torvalds [Tue, 12 Mar 2024 22:10:51 +0000 (15:10 -0700)]
Merge tag 'audit-pr-20240312' of git://git./linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"Two small audit patches:
- Use the KMEM_CACHE() macro instead of kmem_cache_create()
The guidance appears to be to use the KMEM_CACHE() macro when
possible and there is no reason why we can't use the macro, so
let's use it.
- Remove an unnecessary assignment in audit_dupe_lsm_field()
A return value variable was assigned a value in its declaration,
but the declaration value is overwritten before the return value
variable is ever referenced; drop the assignment at declaration
time"
* tag 'audit-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: use KMEM_CACHE() instead of kmem_cache_create()
audit: remove unnecessary assignment in audit_dupe_lsm_field()
Linus Torvalds [Tue, 12 Mar 2024 22:08:06 +0000 (15:08 -0700)]
Merge tag 'Smack-for-6.9' of https://github.com/cschaufler/smack-next
Pull smack updates from Casey Schaufler:
- Improvements to the initialization of in-memory inodes
- A fix in ramfs to properly ensure the initialization of in-memory
inodes
- Removal of duplicated code in smack_cred_transfer()
* tag 'Smack-for-6.9' of https://github.com/cschaufler/smack-next:
Smack: use init_task_smack() in smack_cred_transfer()
ramfs: Initialize security of in-memory inodes
smack: Initialize the in-memory inode in smack_inode_init_security()
smack: Always determine inode labels in smack_inode_init_security()
smack: Handle SMACK64TRANSMUTE in smack_inode_setsecurity()
smack: Set SMACK64TRANSMUTE only for dirs in smack_inode_setxattr()
Linus Torvalds [Tue, 12 Mar 2024 22:05:27 +0000 (15:05 -0700)]
Merge tag 'seccomp-v6.9-rc1' of git://git./linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:
"There are no core kernel changes here; it's entirely selftests and
samples:
- Improve reliability of selftests (Terry Tritton, Kees Cook)
- Fix strict-aliasing warning in samples (Arnd Bergmann)"
* tag 'seccomp-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
samples: user-trap: fix strict-aliasing warning
selftests/seccomp: Pin benchmark to single CPU
selftests/seccomp: user_notification_addfd check nextfd is available
selftests/seccomp: Change the syscall used in KILL_THREAD test
selftests/seccomp: Handle EINVAL on unshare(CLONE_NEWPID)
Linus Torvalds [Tue, 12 Mar 2024 21:49:30 +0000 (14:49 -0700)]
Merge tag 'hardening-v6.9-rc1' of git://git./linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"As is pretty normal for this tree, there are changes all over the
place, especially for small fixes, selftest improvements, and improved
macro usability.
Some header changes ended up landing via this tree as they depended on
the string header cleanups. Also, a notable set of changes is the work
for the reintroduction of the UBSAN signed integer overflow sanitizer
so that we can continue to make improvements on the compiler side to
make this sanitizer a more viable future security hardening option.
Summary:
- string.h and related header cleanups (Tanzir Hasan, Andy
Shevchenko)
- VMCI memcpy() usage and struct_size() cleanups (Vasiliy Kovalev,
Harshit Mogalapalli)
- selftests/powerpc: Fix load_unaligned_zeropad build failure
(Michael Ellerman)
- hardened Kconfig fragment updates (Marco Elver, Lukas Bulwahn)
- Handle tail call optimization better in LKDTM (Douglas Anderson)
- Use long form types in overflow.h (Andy Shevchenko)
- Add flags param to string_get_size() (Andy Shevchenko)
- Add Coccinelle script for potential struct_size() use (Jacob
Keller)
- Fix objtool corner case under KCFI (Josh Poimboeuf)
- Drop 13 year old backward compat CAP_SYS_ADMIN check (Jingzi Meng)
- Add str_plural() helper (Michal Wajdeczko, Kees Cook)
- Ignore relocations in .notes section
- Add comments to explain how __is_constexpr() works
- Fix m68k stack alignment expectations in stackinit Kunit test
- Convert string selftests to KUnit
- Add KUnit tests for fortified string functions
- Improve reporting during fortified string warnings
- Allow non-type arg to type_max() and type_min()
- Allow strscpy() to be called with only 2 arguments
- Add binary mode to leaking_addresses scanner
- Various small cleanups to leaking_addresses scanner
- Add wrapping_*() arithmetic helpers
- Annotate initial signed integer wrap-around in refcount_t
- Add explicit UBSAN section to MAINTAINERS
- Fix UBSAN self-test warnings
- Simplify UBSAN build via removal of CONFIG_UBSAN_SANITIZE_ALL
- Reintroduce UBSAN's signed overflow sanitizer"
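The kernel's wrapping_*() helpers are macros. The fixed-type functions below are hypothetical userspace stand-ins. They assume the result is computed with the GCC/Clang __builtin_add_overflow() builtin and that its overflow flag is deliberately ignored. This makes the wrap-around explicit and, for signed types, avoids undefined behaviour, so a signed-overflow sanitizer has nothing to report. A str_plural()-like helper is included too, assuming it simply returns "" for exactly one and "s" otherwise. It is named plural_suffix() here to avoid implying it is the kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a u8 wrapping add: result modulo 2^8. */
static uint8_t wrapping_add_u8(uint8_t a, uint8_t b)
{
	uint8_t sum;

	/* The builtin stores the wrapped result; its bool return (overflow
	 * happened) is ignored because wrapping is the intended behaviour. */
	(void)__builtin_add_overflow(a, b, &sum);
	return sum;
}

/* Hypothetical stand-in for an s32 wrapping add: two's-complement wrap
 * without signed-overflow UB. */
static int32_t wrapping_add_s32(int32_t a, int32_t b)
{
	int32_t sum;

	(void)__builtin_add_overflow(a, b, &sum);
	return sum;
}

/* Hypothetical str_plural()-style helper: "" for exactly one, else "s". */
static const char *plural_suffix(size_t num)
{
	return num == 1 ? "" : "s";
}
```

For example, wrapping_add_u8(250, 10) yields 4, and wrapping_add_s32(INT32_MAX, 1) yields INT32_MIN without tripping a signed-overflow check.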
* tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (51 commits)
selftests/powerpc: Fix load_unaligned_zeropad build failure
string: Convert helpers selftest to KUnit
string: Convert selftest to KUnit
sh: Fix build with CONFIG_UBSAN=y
compiler.h: Explain how __is_constexpr() works
overflow: Allow non-type arg to type_max() and type_min()
VMCI: Fix possible memcpy() run-time warning in vmci_datagram_invoke_guest_handler()
lib/string_helpers: Add flags param to string_get_size()
x86, relocs: Ignore relocations in .notes section
objtool: Fix UNWIND_HINT_{SAVE,RESTORE} across basic blocks
overflow: Use POD in check_shl_overflow()
lib: stackinit: Adjust target string to 8 bytes for m68k
sparc: vdso: Disable UBSAN instrumentation
kernel.h: Move lib/cmdline.c prototypes to string.h
leaking_addresses: Provide mechanism to scan binary files
leaking_addresses: Ignore input device status lines
leaking_addresses: Use File::Temp for /tmp files
MAINTAINERS: Update LEAKING_ADDRESSES details
fortify: Improve buffer overflow reporting
fortify: Add KUnit tests for runtime overflows
...
Linus Torvalds [Tue, 12 Mar 2024 21:45:12 +0000 (14:45 -0700)]
Merge tag 'execve-v6.9-rc1' of git://git./linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:
- Drop needless error path code in remove_arg_zero() (Li kunyu, Kees
Cook)
- binfmt_elf_efpic: Don't use missing interpreter's properties (Max
Filippov)
- Use /bin/bash for execveat selftests
* tag 'execve-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
exec: Simplify remove_arg_zero() error path
selftests/exec: Perform script checks with /bin/bash
exec: Delete unnecessary statements in remove_arg_zero()
fs: binfmt_elf_efpic: don't use missing interpreter's properties
Linus Torvalds [Tue, 12 Mar 2024 21:36:18 +0000 (14:36 -0700)]
Merge tag 'pstore-v6.9-rc1' of git://git./linux/kernel/git/kees/linux
Pull pstore updates from Kees Cook:
- Make PSTORE_RAM available by default on arm64 (Nícolas F R A Prado)
- Allow for dynamic initialization in modular build (Guilherme G
Piccoli)
- Add missing allocation failure check (Kunwu Chan)
- Avoid duplicate memory zeroing (Christophe JAILLET)
- Avoid potential double-free during pstorefs umount
* tag 'pstore-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore/zone: Don't clear memory twice
pstore/zone: Add a null pointer check to the psz_kmsg_read
efi: pstore: Allow dynamic initialization based on module parameter
arm64: defconfig: Enable PSTORE_RAM
pstore/ram: Register to module device table
pstore: inode: Only d_invalidate() is needed
Linus Torvalds [Tue, 12 Mar 2024 21:27:37 +0000 (14:27 -0700)]
Merge tag 'nfsd-6.9' of git://git./linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"The bulk of the patches for this release are optimizations, code
clean-ups, and minor bug fixes.
One new feature to mention is that NFSD administrators now have the
ability to revoke NFSv4 open and lock state. NFSD's NFSv3 support has
had this capability for some time.
As always I am grateful to NFSD contributors, reviewers, and testers"
* tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (75 commits)
NFSD: Clean up nfsd4_encode_replay()
NFSD: send OP_CB_RECALL_ANY to clients when number of delegations reaches its limit
NFSD: Document nfsd_setattr() fill-attributes behavior
nfsd: Fix NFSv3 atomicity bugs in nfsd_setattr()
nfsd: Fix a regression in nfsd_setattr()
NFSD: OP_CB_RECALL_ANY should recall both read and write delegations
NFSD: handle GETATTR conflict with write delegation
NFSD: add support for CB_GETATTR callback
NFSD: Document the phases of CREATE_SESSION
NFSD: Fix the NFSv4.1 CREATE_SESSION operation
nfsd: clean up comments over nfs4_client definition
svcrdma: Add Write chunk WRs to the RPC's Send WR chain
svcrdma: Post WRs for Write chunks in svc_rdma_sendto()
svcrdma: Post the Reply chunk and Send WR together
svcrdma: Move write_info for Reply chunks into struct svc_rdma_send_ctxt
svcrdma: Post Send WR chain
svcrdma: Fix retry loop in svc_rdma_send()
svcrdma: Prevent a UAF in svc_rdma_send()
svcrdma: Fix SQ wake-ups
svcrdma: Increase the per-transport rw_ctx count
...
Linus Torvalds [Tue, 12 Mar 2024 20:25:53 +0000 (13:25 -0700)]
Merge tag 'erofs-for-6.9-rc1' of git://git./linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, we introduce compressed inode support over fscache
since a lot of native EROFS images are explicitly compressed so that
EROFS over fscache can be more widely used even without Dragonfly
Nydus [1].
Apart from that, there are some folio conversions for compressed
inodes available as well as a lockdep false positive fix.
Summary:
- Some folio conversions for compressed inodes;
- Add compressed inode support over fscache;
- Fix lockdep false positives of erofs_pseudo_mnt"
Link: https://nydus.dev
* tag 'erofs-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: support compressed inodes over fscache
erofs: make iov_iter describe target buffers over fscache
erofs: fix lockdep false positives on initializing erofs_pseudo_mnt
erofs: refine managed cache operations to folios
erofs: convert z_erofs_submissionqueue_endio() to folios
erofs: convert z_erofs_fill_bio_vec() to folios
erofs: get rid of `justfound` debugging tag
erofs: convert z_erofs_do_read_page() to folios
erofs: convert z_erofs_onlinepage_.* to folios
Linus Torvalds [Tue, 12 Mar 2024 20:22:10 +0000 (13:22 -0700)]
Merge tag 'fsverity-for-linus' of git://git./fs/fsverity/linux
Pull fsverity update from Eric Biggers:
"Slightly improve data verification performance by eliminating an
unnecessary lock"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: remove hash page spin lock
Linus Torvalds [Tue, 12 Mar 2024 20:17:36 +0000 (13:17 -0700)]
Merge tag 'fscrypt-for-linus' of git://git./fs/fscrypt/linux
Pull fscrypt updates from Eric Biggers:
"Fix flakiness in a test by releasing the quota synchronously when a
key is removed, and other minor cleanups"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
fscrypt: shrink the size of struct fscrypt_inode_info slightly
fscrypt: write CBC-CTS instead of CTS-CBC
fscrypt: clear keyring before calling key_put()
fscrypt: explicitly require that inode->i_blkbits be set
Linus Torvalds [Tue, 12 Mar 2024 19:35:42 +0000 (12:35 -0700)]
Merge tag 'affs-for-6.9' of git://git./linux/kernel/git/kdave/linux
Pull affs update from David Sterba:
"One change to AFFS that removes use of SLAB_MEM_SPREAD, which is going
to be removed from MM code"
* tag 'affs-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
affs: remove SLAB_MEM_SPREAD flag usage
Linus Torvalds [Tue, 12 Mar 2024 19:28:34 +0000 (12:28 -0700)]
Merge tag 'for-6.9-tag' of git://git./linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
Mostly stabilization, refactoring and cleanup changes. The rest are
minor performance optimizations due to caching or lock contention
reduction and a few notable fixes.
Performance improvements:
- minor speedup in logging when a repeatedly allocated structure is
preallocated only once, improves latency and decreases lock
contention
- minor throughput increase (+6%), reduced lock contention after
clearing delayed allocation bits, applies to several common
workload types
- skip full quota rescan if a new relation is added in the same
transaction
Fixes:
- zstd fix for inline compressed files in subpage mode, an updated
version of the fix from the 6.8 cycle
- proper qgroup inheritance ioctl parameter validation
- more fiemap followup fixes after reduced locking done in 6.8:
- fix race when detecting delalloc ranges
Core changes:
- more debugging code:
- added assertions for a very rare crash in raid56 calculation
- tree-checker dumps page state to give more insights into
possible reference counting issues
- add a sysfs knob for checksum calculation offloading, for now enabled
only under DEBUG, to help find a good heuristic for choosing between
offloaded and synchronous checksumming; the choice depends on various
factors (block group profile, device speed) and is not as clear-cut as
initially thought (checksum type)
- error handling improvements, added assertions
- more page to folio conversion (defrag, truncate), cached size and
shift
- preparation for more fine grained locking of sectors in subpage
mode
- cleanups and refactoring:
- include cleanups, forward declarations
- pointer-to-structure helpers
- redundant argument removals
- removed unused code
- slab cache updates, last use of SLAB_MEM_SPREAD removed"
* tag 'for-6.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations
btrfs: fix race when detecting delalloc ranges during fiemap
btrfs: fix off-by-one chunk length calculation at contains_pending_extent()
btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent
btrfs: qgroup: validate btrfs_qgroup_inherit parameter
btrfs: include device major and minor numbers in the device scan notice
btrfs: mark btrfs_put_caching_control() static
btrfs: remove SLAB_MEM_SPREAD flag use
btrfs: qgroup: always free reserved space for extent records
btrfs: tree-checker: dump the page status if hit something wrong
btrfs: compression: remove dead comments in btrfs_compress_heuristic()
btrfs: subpage: make writer lock utilize bitmap
btrfs: subpage: make reader lock utilize bitmap
btrfs: unexport btrfs_subpage_start_writer() and btrfs_subpage_end_and_test_writer()
btrfs: pass a valid extent map cache pointer to __get_extent_map()
btrfs: merge btrfs_del_delalloc_inode() helpers
btrfs: pass btrfs_device to btrfs_scratch_superblocks()
btrfs: handle transaction commit errors in flush_reservations()
btrfs: use KMEM_CACHE() to create btrfs_free_space cache
btrfs: use KMEM_CACHE() to create delayed ref caches
...
Linus Torvalds [Tue, 12 Mar 2024 19:24:40 +0000 (12:24 -0700)]
Merge tag 'zonefs-6.9-rc1' of git://git./linux/kernel/git/dlemoal/zonefs
Pull zonefs update from Damien Le Moal:
- A single change for this cycle to convert zonefs to use the new
mount API
* tag 'zonefs-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: convert zonefs to use the new mount api
Linus Torvalds [Tue, 12 Mar 2024 17:56:28 +0000 (10:56 -0700)]
Merge tag 'asm-generic-6.9' of git://git./linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"Just two small updates this time:
- A series I did to unify the definition of PAGE_SIZE through
Kconfig, intended to help with a vdso rework that needs the
constant but cannot include the normal kernel headers when building
the compat VDSO on arm64 and potentially others
- a patch from Yan Zhao to remove the pfn_to_virt() definitions from
a couple of architectures after finding they were both incorrect
and entirely unused"
* tag 'asm-generic-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
arch: define CONFIG_PAGE_SIZE_*KB on all architectures
arch: simplify architecture specific page size configuration
arch: consolidate existing CONFIG_PAGE_SIZE_*KB definitions
mm: Remove broken pfn_to_virt() on arch csky/hexagon/openrisc
Linus Torvalds [Tue, 12 Mar 2024 17:52:18 +0000 (10:52 -0700)]
Merge tag 'soc-defconfig-6.9' of git://git./linux/kernel/git/soc/soc
Pull ARM defconfig updates from Arnd Bergmann:
"This has the usual updates to enable platform specific driver modules
as new hardware gets supported, as well as an update to the
virt.config fragment so we disable all newly added platforms again"
* tag 'soc-defconfig-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (24 commits)
arm64: defconfig: Enable support for cbmem entries in the coreboot table
ARM: defconfig: enable STMicroelectronics accelerometer and gyro for Exynos
arm64: defconfig: drop ext2 filesystem and redundant ext3
arm64: defconfig: Enable Rockchip HDMI/eDP Combo PHY
arm64: defconfig: Enable Wave5 Video Encoder/Decoder
arm64: config: disable new platforms in virt.config
arm64: defconfig: Enable QCOM PBS
arm64: deconfig: enable Goodix Berlin SPI touchscreen driver as module
arm64: defconfig: Enable X1E80100 multimedia clock controllers configs
arm64: defconfig: Enable GCC and interconnect for QDU1000/QRU1000
arm64: defconfig: enable i.MX8MP ldb bridge
arm64: defconfig: enable the vf610 gpio driver
ARM: imx_v6_v7_defconfig: enable the vf610 gpio driver
ARM: multi_v7_defconfig: Add more TI Keystone support
arm64: defconfig: enable WCD939x USBSS driver as module
arm64: defconfig: enable audio drivers for SM8650 QRD board
arm64: defconfig: Enable Qualcomm interconnect providers
ARM: multi_v7_defconfig: Enable BACKLIGHT_CLASS_DEVICE
arm64: defconfig: Enable i.MX8QXP device drivers
ARM: multi_v7_defconfig: Add more TI Keystone support
...
Linus Torvalds [Tue, 12 Mar 2024 17:47:33 +0000 (10:47 -0700)]
Merge tag 'soc-arm-6.9' of git://git./linux/kernel/git/soc/soc
Pull ARM SoC code updates from Arnd Bergmann:
"These are mostly minor updates, including a number of kerneldoc fixes
from Randy Dunlap across multiple platforms. OMAP gets a few bugfixes,
and the MAINTAINERS file gets updated for AMD Zynq and NXP S32G"
* tag 'soc-arm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (23 commits)
ARM: s32c: update MAINTAINERS entry
ARM: AM33xx: PRM: Implement REBOOT_COLD
ARM: AM33xx: PRM: Remove redundand defines
ARM: omap1: remove duplicated 'select ARCH_OMAP'
ARM: s3c64xx: make bus_type const
ARM: imx: Remove usage of the deprecated ida_simple_xx() API
ARM: OMAP2+: fix kernel-doc warnings
ARM: OMAP2+: fix kernel-doc warnings
ARM: OMAP2+: fix a kernel-doc warning
ARM: OMAP2+: PRM: fix kernel-doc warnings
ARM: OMAP2+: prm44xx: fix a kernel-doc warning
ARM: OMAP2+: pmic-cpcap: fix kernel-doc warnings
ARM: OMAP2+: hwmod: fix kernel-doc warnings
ARM: OMAP2+: hwmod: remove misuse of kernel-doc
ARM: OMAP2+: CMINST: use matching function name in kernel-doc
ARM: OMAP2+: cm33xx: use matching function name in kernel-doc
ARM: OMAP2+: clock: fix a function name in kernel-doc
ARM: OMAP2+: clockdomain: fix kernel-doc warnings
ARM: OMAP2+: am33xx-restart: fix function name in kernel-doc
soc: xilinx: update maintainer of event manager driver
...
Linus Torvalds [Tue, 12 Mar 2024 17:35:24 +0000 (10:35 -0700)]
Merge tag 'soc-drivers-6.9' of git://git./linux/kernel/git/soc/soc
Pull ARM SoC driver updates from Arnd Bergmann:
"This is the usual mix of updates for drivers that are used on (mostly
ARM) SoCs with no other top-level subsystem tree, including:
- The SCMI firmware subsystem gains support for version 3.2 of the
specification and updates to the notification code
- Feature updates for Tegra and Qualcomm platforms for added hardware
support
- A number of platforms get soc_device additions for identifying
newly added chips from Renesas, Qualcomm, Mediatek and Google
- Trivial improvements for firmware and memory drivers amongst
others, in particular 'const' annotations throughout multiple
subsystems"
* tag 'soc-drivers-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (96 commits)
tee: make tee_bus_type const
soc: qcom: aoss: add missing kerneldoc for qmp members
soc: qcom: geni-se: drop unused kerneldoc struct geni_wrapper param
soc: qcom: spm: fix building with CONFIG_REGULATOR=n
bus: ti-sysc: constify the struct device_type usage
memory: stm32-fmc2-ebi: keep power domain on
memory: stm32-fmc2-ebi: add MP25 RIF support
memory: stm32-fmc2-ebi: add MP25 support
memory: stm32-fmc2-ebi: check regmap_read return value
dt-bindings: memory-controller: st,stm32: add MP25 support
dt-bindings: bus: imx-weim: convert to YAML
watchdog: s3c2410_wdt: use exynos_get_pmu_regmap_by_phandle() for PMU regs
soc: samsung: exynos-pmu: Add regmap support for SoCs that protect PMU regs
MAINTAINERS: Update SCMI entry with HWMON driver
MAINTAINERS: samsung: gs101: match patches touching Google Tensor SoC
memory: tegra: Fix indentation
memory: tegra: Add BPMP and ICC info for DLA clients
memory: tegra: Correct DLA client names
dt-bindings: memory: renesas,rpc-if: Document R-Car V4M support
firmware: arm_scmi: Update the supported clock protocol version
...
Linus Torvalds [Tue, 12 Mar 2024 17:29:57 +0000 (10:29 -0700)]
Merge tag 'soc-dt-6.9' of git://git./linux/kernel/git/soc/soc
Pull SoC device tree updates from Arnd Bergmann:
"There is very little going on with new SoC support this time, all the
new chips are variations of others that we already support, and they
are all based on ARMv8 cores:
- Mediatek MT7981B (Filogic 820) and MT7988A (Filogic 880) are
networking SoCs designed to be used in wireless routers, similar to
the already supported MT7986A (Filogic 830).
- NXP i.MX8DXP is a variant of i.MX8QXP, with two fewer CPU cores.
These are used in many embedded and industrial applications.
- Renesas R8A779G2 (R-Car V4H ES2.0) and R8A779H0 (R-Car V4M) are
automotive SoCs.
- TI J722S is another automotive variant of its K3 family, related to
the AM62 series.
There are a total of 7 new arm32 machines and 45 arm64 ones, including
- Two Android phones based on the old Tegra30 chip
- Two machines using Cortex-A53 SoCs from Allwinner, a mini PC and a
SoM development board
- A set-top box using Amlogic Meson G12A S905X2
- Eight embedded boards using NXP i.MX6/8/9
- Three machines using Mediatek network router chips
- Ten Chromebooks, all based on Mediatek MT8186
- One development board based on Mediatek MT8395 (Genio 1200)
- Seven tablets and phones based on Qualcomm SoCs, most of them from
Samsung.
- A third development board for Qualcomm SM8550 (Snapdragon 8 Gen 2)
- Three variants of the "White Hawk" board for Renesas automotive
SoCs
- Ten Rockchip RK35xx based machines, including NAS, tablet, game
console and industrial form factors.
- Three evaluation boards for TI K3 based SoCs
The other changes are mainly the usual feature additions for existing
hardware, cleanups, and dtc compile time fixes. One notable change is
the inclusion of PowerVR SGX GPU nodes on TI SoCs"
* tag 'soc-dt-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (824 commits)
riscv: dts: Move BUILTIN_DTB_SOURCE to common Kconfig
riscv: dts: starfive: jh7100: fix root clock names
ARM: dts: samsung: exynos4412: decrease memory to account for unusable region
arm64: dts: qcom: sm8250-xiaomi-elish: set rotation
arm64: dts: qcom: sm8650: Fix SPMI channels size
arm64: dts: qcom: sm8550: Fix SPMI channels size
arm64: dts: rockchip: Fix name for UART pin header on qnap-ts433
arm: dts: marvell: clearfog-gtr-l8: align port numbers with enclosure
arm: dts: marvell: clearfog-gtr-l8: add support for second sfp connector
dt-bindings: soc: renesas: renesas-soc: Add pattern for gray-hawk
dtc: Enable dtc interrupt_provider check
arm64: dts: st: add video encoder support to stm32mp255
arm64: dts: st: add video decoder support to stm32mp255
ARM: dts: stm32: enable crypto accelerator on stm32mp135f-dk
ARM: dts: stm32: enable CRC on stm32mp135f-dk
ARM: dts: stm32: add CRC on stm32mp131
ARM: dts: add stm32f769-disco-mb1166-reva09
ARM: dts: stm32: add display support on stm32f769-disco
ARM: dts: stm32: rename mmc_vcard to vcc-3v3 on stm32f769-disco
ARM: dts: stm32: add DSI support on stm32f769
...
Linus Torvalds [Tue, 12 Mar 2024 17:27:52 +0000 (10:27 -0700)]
Merge tag 'm68k-for-v6.9-tag1' of git://git./linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven:
- Make the Zorro bus type constant
- defconfig updates
* tag 'm68k-for-v6.9-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k: defconfig: Update defconfigs for v6.8-rc1
zorro: Make zorro_bus_type const
Linus Torvalds [Tue, 12 Mar 2024 17:14:22 +0000 (10:14 -0700)]
Merge tag 's390-6.9-1' of git://git./linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Various virtual vs physical address usage fixes
- Fix error handling in Processor Activity Instrumentation device
driver, and export number of counters with a sysfs file
- Allow for multiple events when Processor Activity Instrumentation
counters are monitored in system wide sampling
- Change multiplier and shift values of the Time-of-Day clock source to
improve steering precision
- Remove a couple of unneeded GFP_DMA flags from allocations
- Disable mmap alignment if randomize_va_space is also disabled, to
avoid a heap that is too small
- Various changes to allow s390 to be compiled with LLVM=1, since
ld.lld and llvm-objcopy will have proper s390 support with clang 19
- Add __uninitialized macro to Compiler Attributes. This is helpful
with s390's FPU code where some users have up to 520 byte stack
frames. Clearing such stack frames (if INIT_STACK_ALL_PATTERN or
INIT_STACK_ALL_ZERO is enabled) before they are used contradicts the
intention (performance improvement) of such code sections.
- Convert switch_to() to an out-of-line function, and use the generic
switch_to header file
- Replace the usage of s390's debug feature with pr_debug() calls
within the zcrypt device driver
- Improve hotplug support of the Adjunct Processor device driver
- Improve retry handling in the zcrypt device driver
- Various changes to the in-kernel FPU code:
- Make in-kernel FPU sections preemptible
- Convert various larger inline assemblies and assembler files to
C, mainly by using single-instruction inline assemblies. This
increases readability, and also makes it easier to add proper
instrumentation hooks
- Cleanup of the header files
- Provide fast variants of csum_partial() and
csum_partial_copy_nocheck() based on vector instructions
- Introduce and use a lock to synchronize accesses to zpci device data
structures to avoid inconsistent states caused by concurrent accesses
- Compile the kernel without -fPIE. This addresses the following
problems if the kernel is compiled with -fPIE:
- It uses dynamic symbols (.dynsym), for which the linker refuses
to allow more than 64k sections. This can break features which
use '-ffunction-sections' and '-fdata-sections', including
kpatch-build and function granular KASLR
- It unnecessarily uses GOT relocations, adding an extra layer of
indirection for many memory accesses
- Fix shared_cpu_list for CPU private L2 caches, which incorrectly were
reported as globally shared
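The Time-of-Day clock source item above concerns the multiplier and shift pair that clocksources use to turn counter cycles into nanoseconds, ns = (cycles * mult) >> shift, the same shape as the kernel's clocksource_cyc2ns(). The sketch below is a userspace illustration only: the demo_* helpers are hypothetical, and it does not reproduce the values the s390 patch actually chose. It treats the TOD clock as a 4.096 GHz counter (bit 51 ticks once per microsecond) and shows that a larger shift leaves less truncation error in mult.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper: choose mult so that (cycles * mult) >> shift
 * approximates cycles * 1e9 / freq_hz. The integer division truncates;
 * a larger shift makes that truncation relatively smaller. shift <= 32
 * keeps DEMO_NSEC_PER_SEC << shift within 64 bits, and the caller must
 * pick a shift for which the result still fits in 32 bits.
 */
static uint32_t demo_calc_mult(uint64_t freq_hz, unsigned int shift)
{
	assert(freq_hz != 0 && shift <= 32);
	return (uint32_t)((DEMO_NSEC_PER_SEC << shift) / freq_hz);
}

/* Same shape as the kernel's clocksource_cyc2ns(): multiply, then shift. */
static uint64_t demo_cyc2ns(uint64_t cycles, uint32_t mult, unsigned int shift)
{
	return (cycles * mult) >> shift;
}
```

With a 3 GHz counter, one second of cycles converts to 996093750 ns with shift 8 but 999999940 ns with shift 24; a larger shift buys precision at the cost of a smaller cycle range before cycles * mult overflows 64 bits.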
* tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (117 commits)
s390/tools: handle rela R_390_GOTPCDBL/R_390_GOTOFF64
s390/cache: prevent rebuild of shared_cpu_list
s390/crypto: remove retry loop with sleep from PAES pkey invocation
s390/pkey: improve pkey retry behavior
s390/zcrypt: improve zcrypt retry behavior
s390/zcrypt: introduce retries on in-kernel send CPRB functions
s390/ap: introduce mutex to lock the AP bus scan
s390/ap: rework ap_scan_bus() to return true on config change
s390/ap: clarify AP scan bus related functions and variables
s390/ap: rearm APQNs bindings complete completion
s390/configs: increase number of LOCKDEP_BITS
s390/vfio-ap: handle hardware checkstop state on queue reset operation
s390/pai: change sampling event assignment for PMU device driver
s390/boot: fix minor comment style damages
s390/boot: do not check for zero-termination relocation entry
s390/boot: make type of __vmlinux_relocs_64_start|end consistent
s390/boot: sanitize kaslr_adjust_relocs() function prototype
s390/boot: simplify GOT handling
s390: vmlinux.lds.S: fix .got.plt assertion
s390/boot: workaround current 'llvm-objdump -t -j ...' behavior
...
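The __uninitialized item in the s390 summary above exists because INIT_STACK_ALL_PATTERN or INIT_STACK_ALL_ZERO make the compiler pre-fill stack variables, which is wasted work for a large buffer that is fully written before it is read. Below is a minimal sketch of such an opt-out annotation, assuming a compiler that may provide the __uninitialized__ variable attribute (recent clang and GCC releases do) and falling back to nothing otherwise. The macro name demo_uninitialized and the 512-byte buffer are illustrative; this is not the kernel's actual definition or the s390 FPU code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Opt a local variable out of automatic stack initialisation, if supported. */
#ifdef __has_attribute
# if __has_attribute(__uninitialized__)
#  define demo_uninitialized __attribute__((__uninitialized__))
# endif
#endif
#ifndef demo_uninitialized
# define demo_uninitialized	/* compiler lacks the attribute: no-op */
#endif

/* Hypothetical size, in the same range as the large frames mentioned above. */
#define DEMO_STATE_BYTES 512

static uint32_t demo_sum_state(uint8_t seed)
{
	/*
	 * Every byte is written by memset() before it is read, so having the
	 * compiler pre-clear this buffer on each call would be pure overhead.
	 */
	uint8_t state[DEMO_STATE_BYTES] demo_uninitialized;
	uint32_t sum = 0;

	memset(state, seed, sizeof(state));
	for (size_t i = 0; i < sizeof(state); i++)
		sum += state[i];
	return sum;
}
```

The annotation is only safe when the code provably writes the buffer before reading it; otherwise it reintroduces the information leaks that automatic initialisation is meant to prevent.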
Linus Torvalds [Tue, 12 Mar 2024 16:58:57 +0000 (09:58 -0700)]
Merge tag 'x86-boot-2024-03-12' of git://git./linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:
- Continuing work by Ard Biesheuvel to improve the x86 early startup
code, with the long-term goal to make it position independent:
- Get rid of early accesses to global objects, either by moving
them to the stack, deferring the access until later, or dropping
the globals entirely
- Move all code that runs early via the 1:1 mapping into
.head.text, and move code that does not out of it, so that build
time checks can be added later to ensure that no inadvertent
absolute references were emitted into code that does not
tolerate them
- Remove fixup_pointer() and occurrences of __pa_symbol(), which
rely on the compiler emitting absolute references, which is not
guaranteed
- Improve the early console code
- Add early console message about ignored NMIs, so that users are at
least warned about their existence - even if we cannot do anything
about them
- Improve the kexec code's kernel load address handling
- Enable more X86S (simplified x86) bits
- Simplify early boot GDT handling
- Micro-optimize the boot code a bit
- Misc cleanups
* tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
x86/sev: Move early startup code into .head.text section
x86/sme: Move early SME kernel encryption handling into .head.text
x86/boot: Move mem_encrypt= parsing to the decompressor
efi/libstub: Add generic support for parsing mem_encrypt=
x86/startup_64: Simplify virtual switch on primary boot
x86/startup_64: Simplify calculation of initial page table address
x86/startup_64: Defer assignment of 5-level paging global variables
x86/startup_64: Simplify CR4 handling in startup code
x86/boot: Use 32-bit XOR to clear registers
efi/x86: Set the PE/COFF header's NX compat flag unconditionally
x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[]
x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]
x86/boot/64: Use RIP_REL_REF() to access early page tables
x86/boot/64: Use RIP_REL_REF() to access '__supported_pte_mask'
x86/boot/64: Use RIP_REL_REF() to access early_dynamic_pgts[]
x86/boot/64: Use RIP_REL_REF() to assign 'phys_base'
x86/boot/64: Simplify global variable accesses in GDT/IDT programming
x86/trampoline: Bypass compat mode in trampoline_start64() if not needed
kexec: Allocate kernel above bzImage's pref_address
x86/boot: Add a message about ignored early NMIs
...
Linus Torvalds [Tue, 12 Mar 2024 16:45:34 +0000 (09:45 -0700)]
Merge tag 'x86-apic-2024-03-12' of git://git./linux/kernel/git/tip/tip
Pull x86 APIC fixup from Dave Hansen:
"Revert VERW fixed addressing patch.
The reverted commit is not x86/apic material and was cruft left over
from a merge.
I believe the sequence of events went something like this:
- The commit in question was added to x86/urgent
- x86/urgent was merged into x86/apic to resolve a conflict
- The commit was zapped from x86/urgent, but *not* from x86/apic
- x86/apic got pulled (yesterday)
I think we need to be a bit more vigilant when zapping things to make
sure none of the other branches are depending on the zapped material"
* tag 'x86-apic-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "x86/bugs: Use fixed addressing for VERW operand"
Linus Torvalds [Tue, 12 Mar 2024 16:31:39 +0000 (09:31 -0700)]
Merge tag 'rfds-for-linus-2024-03-11' of git://git./linux/kernel/git/tip/tip
Pull x86 RFDS mitigation from Dave Hansen:
"RFDS is a CPU vulnerability that may allow a malicious userspace to
infer stale register values from kernel space. Kernel registers can
have all kinds of secrets in them so the mitigation is basically to
wait until the kernel is about to return to userspace and has user
values in the registers. At that point there is little chance of
kernel secrets ending up in the registers and the microarchitectural
state can be cleared.
This leverages some recent robustness fixes for the existing MDS
vulnerability. Both MDS and RFDS use the VERW instruction for
mitigation"
* tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests
x86/rfds: Mitigate Register File Data Sampling (RFDS)
Documentation/hw-vuln: Add documentation for RFDS
x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set
Dave Hansen [Tue, 12 Mar 2024 14:27:57 +0000 (07:27 -0700)]
Revert "x86/bugs: Use fixed addressing for VERW operand"
This reverts commit 8009479ee919b9a91674f48050ccbff64eafedaa.
It was originally in x86/urgent, but was deemed wrong so got zapped.
But in the meantime, x86/urgent had been merged into x86/apic to
resolve a conflict. I didn't notice the merge, so I didn't zap it
from x86/apic, and it made it upstream along with the x86/apic
material.
The reverted commit is known to cause some KASAN problems.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Ingo Molnar [Tue, 12 Mar 2024 08:49:52 +0000 (09:49 +0100)]
Merge branch 'linus' into x86/boot, to resolve conflict
There's a new conflict with Linus's upstream tree, because
in the following merge conflict resolution in <asm/coco.h>:
38b334fc767e Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Linus has resolved the conflicting placement of 'cc_mask' better
than the original commit:
1c811d403afd x86/sev: Fix position dependent variable references in startup code
... which was also done by an internal merge resolution:
2e5fc4786b7a Merge branch 'x86/sev' into x86/boot, to resolve conflicts and to pick up dependent tree
But Linus is right in 38b334fc767e: the 'cc_mask' declaration is
sufficient within the #ifdef CONFIG_ARCH_HAS_CC_PLATFORM block.
So instead of forcing Linus to do the same resolution again, merge in Linus's
tree and follow his conflict resolution.
Conflicts:
arch/x86/include/asm/coco.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Jakub Kicinski [Tue, 12 Mar 2024 03:37:53 +0000 (20:37 -0700)]
Merge git://git./linux/kernel/git/netdev/net
Merge in late fixes to prepare for the 6.9 net-next PR.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 12 Mar 2024 03:35:22 +0000 (20:35 -0700)]
Merge branch 'nexthop-fix-two-nexthop-group-statistics-issues'
Ido Schimmel says:
====================
nexthop: Fix two nexthop group statistics issues
Fix two issues that were introduced as part of the recent nexthop group
statistics submission. See the commit messages for more details.
====================
Link: https://lore.kernel.org/r/20240311162307.545385-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Mon, 11 Mar 2024 16:23:07 +0000 (18:23 +0200)]
nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y
Locally generated packets can increment the new nexthop statistics from
process context, resulting in the following splat [1] due to preemption
being enabled. Fix by using get_cpu_ptr() / put_cpu_ptr(), which
take care of disabling / enabling preemption.
[1]
BUG: using smp_processor_id() in preemptible [00000000] code: ping/949
caller is nexthop_select_path+0xcf8/0x1e30
CPU: 12 PID: 949 Comm: ping Not tainted 6.8.0-rc7-custom-gcb450f605fae #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xbd/0xe0
check_preemption_disabled+0xce/0xe0
nexthop_select_path+0xcf8/0x1e30
fib_select_multipath+0x865/0x18b0
fib_select_path+0x311/0x1160
ip_route_output_key_hash_rcu+0xe54/0x2720
ip_route_output_key_hash+0x193/0x380
ip_route_output_flow+0x25/0x130
raw_sendmsg+0xbab/0x34a0
inet_sendmsg+0xa2/0xe0
__sys_sendto+0x2ad/0x430
__x64_sys_sendto+0xe5/0x1c0
do_syscall_64+0xc5/0x1d0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
[...]
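To illustrate the fix described above, here is a single-threaded userspace model of the per-CPU update pattern. A counter stands in for preempt_count, an assert() stands in for the CONFIG_DEBUG_PREEMPT check that produced the splat, and all demo_* names are hypothetical; the real get_cpu_ptr()/put_cpu_ptr() are kernel macros and are not called here.

```c
#include <assert.h>

/* Userspace model only: no real preemption or CPUs are involved. */
#define DEMO_NR_CPUS 4

static int demo_preempt_count;                 /* stands in for preempt_count */
static int demo_cpu;                           /* CPU the task is "running" on */
static unsigned long demo_pcpu_stats[DEMO_NR_CPUS];

/* Models this_cpu_ptr() under CONFIG_DEBUG_PREEMPT: caller must not be preemptible. */
static unsigned long *demo_this_cpu_ptr(unsigned long *base)
{
	assert(demo_preempt_count > 0 && "smp_processor_id() in preemptible code");
	return &base[demo_cpu];
}

/* Models get_cpu_ptr(): disable preemption, then resolve this CPU's slot. */
static unsigned long *demo_get_cpu_ptr(unsigned long *base)
{
	demo_preempt_count++;
	return demo_this_cpu_ptr(base);
}

/* Models put_cpu_ptr(): re-enable preemption once the update is done. */
static void demo_put_cpu_ptr(void)
{
	demo_preempt_count--;
}

/* The fixed pattern: the task cannot migrate between lookup and increment. */
static void demo_stats_inc(void)
{
	unsigned long *slot = demo_get_cpu_ptr(demo_pcpu_stats);

	(*slot)++;
	demo_put_cpu_ptr();
}
```

Calling demo_this_cpu_ptr() directly while demo_preempt_count is zero trips the assertion, which is this model's analogue of the "using smp_processor_id() in preemptible" report.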
Fixes: f4676ea74b85 ("net: nexthop: Add nexthop group entry stats")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240311162307.545385-5-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Mon, 11 Mar 2024 16:23:06 +0000 (18:23 +0200)]
nexthop: Fix out-of-bounds access during attribute validation
Passing a maximum attribute type to nlmsg_parse() that is larger than
the size of the passed policy will result in an out-of-bounds access [1]
when the attribute type is used as an index into the policy array.
Fix by setting the maximum attribute type according to the policy size,
as is already done for RTM_NEWNEXTHOP messages. Add a test case that
triggers the bug.
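The bug class here is an array index bounded by the wrong limit: nlmsg_parse() uses attribute types up to maxtype as indexes into the policy array, so maxtype must not exceed the last valid policy index. The following self-contained sketch models that invariant with a hypothetical mini-policy; it is not the netlink API, only an illustration of why deriving maxtype from the policy size prevents the out-of-bounds read.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical attribute space; type 0 is unused, as in netlink. */
enum {
	DEMO_ATTR_UNSPEC,
	DEMO_ATTR_ID,
	DEMO_ATTR_OP_FLAGS,
	DEMO_ATTR_COUNT,
	DEMO_ATTR_MAX = DEMO_ATTR_COUNT - 1,	/* max type of the whole family */
};

enum demo_policy_type { DEMO_REJECT = 0, DEMO_U32 };

/* Per-message policy covering only types 0..DEMO_ATTR_ID (2 entries). */
static const enum demo_policy_type demo_del_policy[] = {
	[DEMO_ATTR_ID] = DEMO_U32,
};

/*
 * Index policy[] only for types in 1..maxtype. Passing DEMO_ATTR_MAX here
 * together with demo_del_policy would read policy[DEMO_ATTR_OP_FLAGS], one
 * past the end of the array: the shape of the reported out-of-bounds access.
 */
static int demo_validate(const int *types, size_t n,
			 const enum demo_policy_type *policy, int maxtype)
{
	for (size_t i = 0; i < n; i++) {
		int type = types[i];

		if (type <= 0 || type > maxtype)
			return -1;	/* unknown type: rejected, never used as an index */
		if (policy[type] == DEMO_REJECT)
			return -1;	/* known type, but not allowed in this message */
	}
	return 0;
}
```

Deriving maxtype as DEMO_ARRAY_SIZE(policy) - 1 keeps the bound and the array in sync even if the policy later grows.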
No regressions in fib nexthops tests:
# ./fib_nexthops.sh
[...]
Tests passed: 236
Tests failed: 0
[1]
BUG: KASAN: global-out-of-bounds in __nla_validate_parse+0x1e53/0x2940
Read of size 1 at addr ffffffff99ab4d20 by task ip/610
CPU: 3 PID: 610 Comm: ip Not tainted 6.8.0-rc7-custom-gd435d6e3e161 #9
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x8f/0xe0
print_report+0xcf/0x670
kasan_report+0xd8/0x110
__nla_validate_parse+0x1e53/0x2940
__nla_parse+0x40/0x50
rtm_del_nexthop+0x1bd/0x400
rtnetlink_rcv_msg+0x3cc/0xf20
netlink_rcv_skb+0x170/0x440
netlink_unicast+0x540/0x820
netlink_sendmsg+0x8d3/0xdb0
____sys_sendmsg+0x31f/0xa60
___sys_sendmsg+0x13a/0x1e0
__sys_sendmsg+0x11c/0x1f0
do_syscall_64+0xc5/0x1d0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
[...]
The buggy address belongs to the variable:
rtm_nh_policy_del+0x20/0x40
Fixes: 2118f9390d83 ("net: nexthop: Adjust netlink policy parsing for a new attribute")
Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/netdev/CANn89i+UNcG0PJMW5X7gOMunF38ryMh=L1aeZUKH3kL4UdUqag@mail.gmail.com/
Reported-by: syzbot+65bb09a7208ce3d4a633@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/00000000000088981b06133bc07b@google.com/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240311162307.545385-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Mon, 11 Mar 2024 16:23:05 +0000 (18:23 +0200)]
nexthop: Only parse NHA_OP_FLAGS for dump messages that require it
The attribute is parsed in __nh_valid_dump_req() which is called by the
dump handlers of RTM_GETNEXTHOP and RTM_GETNEXTHOPBUCKET although it is
only used by the former and rejected by the policy of the latter.
Move the parsing to nh_valid_dump_req() which is only called by the dump
handler of RTM_GETNEXTHOP.
This is a preparation for a subsequent patch.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240311162307.545385-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Mon, 11 Mar 2024 16:23:04 +0000 (18:23 +0200)]
nexthop: Only parse NHA_OP_FLAGS for get messages that require it
The attribute is parsed into 'op_flags' in nh_valid_get_del_req() which
is called from the handlers of three message types: RTM_DELNEXTHOP,
RTM_GETNEXTHOPBUCKET and RTM_GETNEXTHOP. The attribute is only used by
the last one and rejected by the policies of the other two.
Pass 'op_flags' as NULL from the handlers of the other two and only
parse the attribute when the argument is not NULL.
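The optional-parameter approach described above can be sketched as follows. struct demo_attrs and demo_valid_get_del_req() are hypothetical stand-ins for the parsed attribute table and nh_valid_get_del_req(); the point is only that callers which do not need op_flags pass NULL, so the attribute is never consumed on their behalf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, already-parsed attribute table: NULL means "absent". */
struct demo_attrs {
	const uint32_t *id;
	const uint32_t *op_flags;
};

/*
 * Shared validator for get/del requests. Only the caller that needs
 * op_flags passes a non-NULL pointer; the others pass NULL and the
 * attribute is ignored here (their policies reject it anyway).
 */
static int demo_valid_get_del_req(const struct demo_attrs *tb, uint32_t *id,
				  uint32_t *op_flags)
{
	if (!tb->id || *tb->id == 0)
		return -1;		/* an ID is mandatory */
	*id = *tb->id;

	if (op_flags)
		*op_flags = tb->op_flags ? *tb->op_flags : 0;

	return 0;
}
```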
This is a preparation for a subsequent patch.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240311162307.545385-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Tue, 12 Mar 2024 03:20:36 +0000 (20:20 -0700)]
Merge tag 'x86_tdx_for_6.9' of git://git./linux/kernel/git/tip/tip
Pull x86 tdx update from Dave Hansen:
- Fix sparse warning from TDX use of movdir64b()
* tag 'x86_tdx_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Remove the __iomem annotation of movdir64b()'s dst argument
Linus Torvalds [Tue, 12 Mar 2024 03:07:52 +0000 (20:07 -0700)]
Merge tag 'x86_mm_for_6.9' of git://git./linux/kernel/git/tip/tip
Pull x86 mm updates from Dave Hansen:
- Add a warning when memory encryption conversions fail. These
operations require VMM cooperation, even in CoCo environments where
the VMM is untrusted. While it's _possible_ that memory pressure
could trigger the new warning, the odds are that a guest would only
see this from an attacking VMM.
- Simplify page fault code by re-enabling interrupts unconditionally
- Avoid truncation issues when pfns are passed in to pfn_to_kaddr()
with small (<64-bit) types.
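The pfn_to_kaddr() item is a classic C integer-width issue: if the page frame number has a 32-bit type, shifting it by PAGE_SHIFT happens in 32 bits and silently drops the high bits before the result is widened. A minimal sketch, assuming 4 KiB pages, a 32-bit int, and hypothetical demo_* helpers rather than the kernel's actual macro:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Buggy shape: the shift is evaluated in 32 bits, so bits above 31 are lost. */
static uint64_t demo_pfn_to_phys_narrow(uint32_t pfn)
{
	return pfn << DEMO_PAGE_SHIFT;
}

/* Fixed shape: widen to 64 bits first, then shift. */
static uint64_t demo_pfn_to_phys_wide(uint32_t pfn)
{
	return (uint64_t)pfn << DEMO_PAGE_SHIFT;
}
```

For pfn 0x100000 (the page at 4 GiB), the narrow version returns 0 while the widened one returns 0x100000000.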
* tag 'x86_mm_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/cpa: Warn for set_memory_XXcrypted() VMM fails
x86/mm: Get rid of conditional IF flag handling in page fault path
x86/mm: Ensure input to pfn_to_kaddr() is treated as a 64-bit type
Linus Torvalds [Tue, 12 Mar 2024 02:53:15 +0000 (19:53 -0700)]
Merge tag 'x86-core-2024-03-11' of git://git./linux/kernel/git/tip/tip
Pull core x86 updates from Ingo Molnar:
- The biggest change is the rework of the percpu code, to support the
'Named Address Spaces' GCC feature, by Uros Bizjak:
- This allows C code to access GS and FS segment relative memory
via variables declared with such attributes, which allows the
compiler to better optimize those accesses than the previous
inline assembly code.
- The series also includes a number of micro-optimizations for
various percpu access methods, plus a number of cleanups of %gs
accesses in assembly code.
- These changes have been exposed to linux-next testing for the
last ~5 months, with no known regressions in this area.
- Fix/clean up __switch_to()'s broken but accidentally working handling
of FPU switching - which also generates better code
- Propagate more RIP-relative addressing in assembly code, to generate
slightly better code
- Rework the CPU mitigations Kconfig space to be less idiosyncratic, to
make it easier for distros to follow & maintain these options
- Rework the x86 idle code to cure RCU violations and to clean up the
logic
- Clean up the vDSO Makefile logic
- Misc cleanups and fixes
* tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
x86/idle: Select idle routine only once
x86/idle: Let prefer_mwait_c1_over_halt() return bool
x86/idle: Cleanup idle_setup()
x86/idle: Clean up idle selection
x86/idle: Sanitize X86_BUG_AMD_E400 handling
sched/idle: Conditionally handle tick broadcast in default_idle_call()
x86: Increase brk randomness entropy for 64-bit systems
x86/vdso: Move vDSO to mmap region
x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
x86/retpoline: Ensure default return thunk isn't used at runtime
x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
x86/vdso: Use $(addprefix ) instead of $(foreach )
x86/vdso: Simplify obj-y addition
x86/vdso: Consolidate targets and clean-files
x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK
x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO
x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY
x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY
x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS
...
Linus Torvalds [Tue, 12 Mar 2024 02:37:56 +0000 (19:37 -0700)]
Merge tag 'x86-cleanups-2024-03-11' of git://git./linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups, including a large series from Thomas Gleixner to cure
sparse warnings"
* tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Drop unused declaration of proc_nmi_enabled()
x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables
x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation
x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current
x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address()
x86/percpu: Cure per CPU madness on UP
smp: Consolidate smp_prepare_boot_cpu()
x86/msr: Add missing __percpu annotations
x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h>
perf/x86/amd/uncore: Fix __percpu annotation
x86/nmi: Remove an unnecessary IS_ENABLED(CONFIG_SMP)
x86/apm_32: Remove dead function apm_get_battery_status()
x86/insn-eval: Fix function param name in get_eff_addr_sib()
Linus Torvalds [Tue, 12 Mar 2024 02:23:16 +0000 (19:23 -0700)]
Merge tag 'x86-build-2024-03-11' of git://git./linux/kernel/git/tip/tip
Pull x86 build updates from Ingo Molnar:
- Reduce <asm/bootparam.h> dependencies
- Simplify <asm/efi.h>
- Unify *_setup_data definitions into <asm/setup_data.h>
- Reduce the size of <asm/bootparam.h>
* tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Do not include <asm/bootparam.h> in several files
x86/efi: Implement arch_ima_efi_boot_mode() in source file
x86/setup: Move internal setup_data structures into setup_data.h
x86/setup: Move UAPI setup structures into setup_data.h
Linus Torvalds [Tue, 12 Mar 2024 02:13:06 +0000 (19:13 -0700)]
Merge tag 'x86-asm-2024-03-11' of git://git./linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
"Two changes to simplify the x86 decoder logic a bit"
* tag 'x86-asm-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/insn: Directly assign x86_64 state in insn_init()
x86/insn: Remove superfluous checks from instruction decoding routines
Linus Torvalds [Tue, 12 Mar 2024 01:45:16 +0000 (18:45 -0700)]
Merge tag 'sched-core-2024-03-11' of git://git./linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Fix inconsistency in misfit task load-balancing
- Fix CPU isolation bugs in the task-wakeup logic
- Rework and unify the sched_use_asym_prio() and sched_asym_prefer()
logic
- Clean up and simplify ->avg_* accesses
- Misc cleanups and fixes
* tag 'sched-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLC
sched/fair: Check the SD_ASYM_PACKING flag in sched_use_asym_prio()
sched/fair: Rework sched_use_asym_prio() and sched_asym_prefer()
sched/fair: Remove unused parameter from sched_asym()
sched/topology: Remove duplicate descriptions from TOPOLOGY_SD_FLAGS
sched/fair: Simplify the update_sd_pick_busiest() logic
sched/fair: Do strict inequality check for busiest misfit task group
sched/fair: Remove unnecessary goto in update_sd_lb_stats()
sched/fair: Take the scheduling domain into account in select_idle_core()
sched/fair: Take the scheduling domain into account in select_idle_smt()
sched/fair: Add READ_ONCE() and use existing helper function to access ->avg_irq
sched/fair: Use existing helper functions to access ->avg_rt and ->avg_dl
sched/core: Simplify code by removing duplicate #ifdefs
Linus Torvalds [Tue, 12 Mar 2024 01:33:03 +0000 (18:33 -0700)]
Merge tag 'locking-core-2024-03-11' of git://git./linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Micro-optimize local_xchg() and the rtmutex code on x86
- Fix percpu-rwsem contention tracepoints
- Simplify debugging Kconfig dependencies
- Update/clarify the documentation of atomic primitives
- Misc cleanups
* tag 'locking-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rtmutex: Use try_cmpxchg_relaxed() in mark_rt_mutex_waiters()
locking/x86: Implement local_xchg() using CMPXCHG without the LOCK prefix
locking/percpu-rwsem: Trigger contention tracepoints only if contended
locking/rwsem: Make DEBUG_RWSEMS and PREEMPT_RT mutually exclusive
locking/rwsem: Clarify that RWSEM_READER_OWNED is just a hint
locking/mutex: Simplify <linux/mutex.h>
locking/qspinlock: Fix 'wait_early' set but not used warning
locking/atomic: scripts: Clarify ordering of conditional atomics
Linus Torvalds [Tue, 12 Mar 2024 01:14:06 +0000 (18:14 -0700)]
Merge tag 'edac_updates_for_v6.9' of git://git./linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Add a FRU (Field Replaceable Unit) memory poison manager which
collects and manages previously encountered hw errors in order to
save them to persistent storage across reboots. Previously recorded
errors are "replayed" upon reboot in order to poison memory which has
caused said errors in the past.
The main use case is stacked, on-chip memory which cannot simply be
replaced, so poisoning faulty areas of it and thus making them
inaccessible is the only strategy to prolong its lifetime.
- Add an AMD address translation library glue which converts the
reported addresses of hw errors into system physical addresses in
order to be used by other subsystems like memory failure, for
example. Add support for MI300 accelerators to that library.
- igen6: Add support for Alder Lake-N SoC
- i10nm: Add Grand Ridge support
- The usual fixlets and cleanups
* tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/versal: Convert to platform remove callback returning void
RAS/AMD/FMPM: Fix off by one when unwinding on error
RAS/AMD/FMPM: Add debugfs interface to print record entries
RAS/AMD/FMPM: Save SPA values
RAS: Export helper to get ras_debugfs_dir
RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
RAS: Introduce a FRU memory poison manager
RAS/AMD/ATL: Add MI300 row retirement support
Documentation: Move RAS section to admin-guide
EDAC/versal: Make the bit position of injected errors configurable
EDAC/i10nm: Add Intel Grand Ridge micro-server support
EDAC/igen6: Add one more Intel Alder Lake-N SoC support
RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300()
RAS/AMD/ATL: Add MI300 support
Documentation: RAS: Add index and address translation section
EDAC/amd64: Use new AMD Address Translation Library
RAS: Introduce AMD Address Translation Library
EDAC/synopsys: Convert to devm_platform_ioremap_resource()
Jakub Kicinski [Tue, 12 Mar 2024 01:06:04 +0000 (18:06 -0700)]
Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2024-03-11
We've added 59 non-merge commits during the last 9 day(s) which contain
a total of 88 files changed, 4181 insertions(+), 590 deletions(-).
The main changes are:
1) Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce
VM_SPARSE kind and vm_area_[un]map_pages to be used in bpf_arena,
from Alexei.
2) Introduce bpf_arena, which is a sparse shared memory region between bpf
program and user space where structures inside the arena can have
pointers to other areas of the arena, and pointers work seamlessly for
both user-space programs and bpf programs, from Alexei and Andrii.
3) Introduce may_goto instruction that is a contract between the verifier
and the program. The verifier allows the program to loop assuming it's
behaving well, but reserves the right to terminate it, from Alexei.
4) Use IETF format for field definitions in the BPF standard
document, from Dave.
5) Extend struct_ops libbpf APIs to allow specifying version suffixes for
struct_ops map types, sharing the same BPF program between several map
definitions, and other improvements, from Eduard.
6) Enable struct_ops support for more than one page in trampolines,
from Kui-Feng.
7) Support kCFI + BPF on riscv64, from Puranjay.
8) Use bpf_prog_pack for arm64 bpf trampoline, from Puranjay.
9) Fix roundup_pow_of_two undefined behavior on 32-bit archs, from Toke.
====================
Link: https://lore.kernel.org/r/20240312003646.8692-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Tue, 12 Mar 2024 01:02:44 +0000 (18:02 -0700)]
Merge tag 'x86_misc_for_v6.9_rc1' of git://git./linux/kernel/git/tip/tip
Pull misc x86 fixes from Borislav Petkov:
- Fix a wrong check in the function reporting whether a CPU is
executing an NMI handler
- Rate-limit unknown NMI messages so that a flood of them does not
slow down the machine
- Other fixlets
* tag 'x86_misc_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Fix the inverse "in NMI handler" check
Documentation/maintainer-tip: Add C++ tail comments exception
Documentation/maintainer-tip: Add Closes tag
x86/nmi: Rate limit unknown NMI messages
Documentation/kernel-parameters: Add spec_rstack_overflow to mitigations=off
Linus Torvalds [Tue, 12 Mar 2024 00:44:11 +0000 (17:44 -0700)]
Merge tag 'x86_sev_for_v6.9_rc1' of git://git./linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Add the x86 part of the SEV-SNP host support.
This will allow the kernel to be used as a KVM hypervisor capable of
running SNP (Secure Nested Paging) guests. Roughly speaking, SEV-SNP
is the ultimate goal of the AMD confidential computing side,
providing the most comprehensive confidential computing environment
to date.
This is the x86 part. There is also a KVM part, which was not ready
in time for the merge window, so the latter will follow in the
next cycle.
- Rework the early code's position-dependent SEV variable references in
order to allow building the kernel with clang and -fPIE/-fPIC and
-mcmodel=kernel
- The usual set of fixes, cleanups and improvements all over the place
* tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
x86/sev: Disable KMSAN for memory encryption TUs
x86/sev: Dump SEV_STATUS
crypto: ccp - Have it depend on AMD_IOMMU
iommu/amd: Fix failure return from snp_lookup_rmpentry()
x86/sev: Fix position dependent variable references in startup code
crypto: ccp: Make snp_range_list static
x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT
Documentation: virt: Fix up pre-formatted text block for SEV ioctls
crypto: ccp: Add the SNP_SET_CONFIG command
crypto: ccp: Add the SNP_COMMIT command
crypto: ccp: Add the SNP_PLATFORM_STATUS command
x86/cpufeatures: Enable/unmask SEV-SNP CPU feature
KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump
iommu/amd: Clean up RMP entries for IOMMU pages during SNP shutdown
crypto: ccp: Handle legacy SEV commands when SNP is enabled
crypto: ccp: Handle non-volatile INIT_EX data when SNP is enabled
crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
x86/sev: Introduce an SNP leaked pages list
crypto: ccp: Provide an API to issue SEV and SNP commands
...