Jens Axboe [Wed, 13 Nov 2019 20:54:49 +0000 (13:54 -0700)]
io-wq: ensure free/busy list browsing see all items
We have two lists for workers in io-wq, a busy and a free list. For
certain operations we want to browse all workers, and we currently do
that by browsing the two separate lists. But since these lists are RCU
protected, we can potentially miss workers if they move between the two
lists while we're browsing them.
Add a third list, all_list, that simply holds all workers. A worker is
added to that list when it starts, and removed when it exits. This makes
the worker iteration cleaner, too.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 13 Nov 2019 16:43:34 +0000 (09:43 -0700)]
io-wq: ensure we have a stable view of ->cur_work for cancellations
worker->cur_work is currently protected by the lock of the wqe that the
worker belongs to. When we send a signal to a worker, we need a stable
view of ->cur_work, so we need to hold that lock. But this doesn't work
so well, since we have the opposite order potentially on queueing work.
If POLL_ADD is used with a signalfd, then io_poll_wake() is called with
the signal lock, and that sometimes needs to insert work items.
Add a specific worker lock that protects the current work item. Then we
can guarantee that the task we're sending a signal is currently
processing the exact work we think it is.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 13 Nov 2019 05:31:31 +0000 (22:31 -0700)]
io_wq: add get/put_work handlers to io_wq_create()
For cancellation, we need to ensure that the work item stays valid for
as long as ->cur_work is valid. Right now we can't safely dereference
the work item even under the wqe->lock, because while the ->cur_work
pointer will remain valid, the work could be completing and be freed
in parallel.
Only invoke ->get/put_work() on items we know that the caller queued
themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed
when we're queueing a flush item, for instance.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 13 Nov 2019 16:09:23 +0000 (09:09 -0700)]
io_uring: check for validity of ->rings in teardown
Normally the rings are always valid, the exception is if we failed to
allocate the rings at setup time. syzbot reports this:
RSP: 002b:
00007ffd6e8aa078 EFLAGS:
00000246 ORIG_RAX:
00000000000001a9
RAX:
ffffffffffffffda RBX:
0000000000000000 RCX:
0000000000441229
RDX:
0000000000000002 RSI:
0000000020000140 RDI:
0000000000000d0d
RBP:
00007ffd6e8aa090 R08:
0000000000000001 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000246 R12:
ffffffffffffffff
R13:
0000000000000003 R14:
0000000000000000 R15:
0000000000000000
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8903 Comm: syz-executor410 Not tainted 5.4.0-rc7-next-
20191113
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
RSP: 0018:
ffff88808f51fc08 EFLAGS:
00010006
RAX:
dffffc0000000000 RBX:
0000000000000000 RCX:
ffffffff815abe4a
RDX:
0000000000000018 RSI:
ffffffff81d168d5 RDI:
ffff8880a9166100
RBP:
ffff88808f51fc70 R08:
0000000000000004 R09:
ffffed1011ea3f7d
R10:
ffffed1011ea3f7c R11:
0000000000000003 R12:
00000000000000c0
R13:
ffff8880a91661c0 R14:
1ffff1101522cc10 R15:
0000000000000000
FS:
0000000001e7a880(0000) GS:
ffff8880ae900000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000020000140 CR3:
000000009a74c000 CR4:
00000000001406e0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
io_cqring_overflow_flush+0x6b9/0xa90 fs/io_uring.c:673
io_ring_ctx_wait_and_kill+0x24f/0x7c0 fs/io_uring.c:4260
io_uring_create fs/io_uring.c:4600 [inline]
io_uring_setup+0x1256/0x1cc0 fs/io_uring.c:4626
__do_sys_io_uring_setup fs/io_uring.c:4639 [inline]
__se_sys_io_uring_setup fs/io_uring.c:4636 [inline]
__x64_sys_io_uring_setup+0x54/0x80 fs/io_uring.c:4636
do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441229
Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:
00007ffd6e8aa078 EFLAGS:
00000246 ORIG_RAX:
00000000000001a9
RAX:
ffffffffffffffda RBX:
0000000000000000 RCX:
0000000000441229
RDX:
0000000000000002 RSI:
0000000020000140 RDI:
0000000000000d0d
RBP:
00007ffd6e8aa090 R08:
0000000000000001 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000246 R12:
ffffffffffffffff
R13:
0000000000000003 R14:
0000000000000000 R15:
0000000000000000
Modules linked in:
---[ end trace
b0f5b127a57f623f ]---
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
RSP: 0018:
ffff88808f51fc08 EFLAGS:
00010006
RAX:
dffffc0000000000 RBX:
0000000000000000 RCX:
ffffffff815abe4a
RDX:
0000000000000018 RSI:
ffffffff81d168d5 RDI:
ffff8880a9166100
RBP:
ffff88808f51fc70 R08:
0000000000000004 R09:
ffffed1011ea3f7d
R10:
ffffed1011ea3f7c R11:
0000000000000003 R12:
00000000000000c0
R13:
ffff8880a91661c0 R14:
1ffff1101522cc10 R15:
0000000000000000
FS:
0000000001e7a880(0000) GS:
ffff8880ae900000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000020000140 CR3:
000000009a74c000 CR4:
00000000001406e0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
which is exactly the case of failing to allocate the SQ/CQ rings, and
then entering shutdown. Check if the rings are valid before trying to
access them at shutdown time.
Reported-by: syzbot+21147d79607d724bd6f3@syzkaller.appspotmail.com
Fixes: 1d7bb1d50fb4 ("io_uring: add support for backlogged CQ ring")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 12 Nov 2019 15:15:53 +0000 (08:15 -0700)]
io_uring: fix potential deadlock in io_poll_wake()
We attempt to run the poll completion inline, but we're using trylock to
do so. This avoids a deadlock since we're grabbing the locks in reverse
order at this point, we already hold the poll wq lock and we're trying
to grab the completion lock, while the normal rules are the reverse of
that order.
IO completion for a timeout link will need to grab the completion lock,
but that's not safe from this context. Put the completion under the
completion_lock in io_poll_wake(), and mark the request as entering
the completion with the completion_lock already held.
Fixes: 2665abfd757f ("io_uring: add support for linked SQE timeouts")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 12 Nov 2019 14:56:39 +0000 (07:56 -0700)]
io_uring: use correct "is IO worker" helper
Since we switched to io-wq, the dependent link optimization for when to
pass back work inline has been broken. Fix this by providing a suitable
io-wq helper for io_uring to use to detect when to do this.
Fixes: 561fb04a6a22 ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 11 Nov 2019 06:34:16 +0000 (23:34 -0700)]
io_uring: fix -ENOENT issue with linked timer with short timeout
If you prep a read (for example) that needs to get punted to async
context with a timer, if the timeout is sufficiently short, the timer
request will get completed with -ENOENT as it could not find the read.
The issue is that we prep and start the timer before we start the read.
Hence the timer can trigger before the read is even started, and the end
result is then that the timer completes with -ENOENT, while the read
starts instead of being cancelled by the timer.
Fix this by splitting the linked timer into two parts:
1) Prep and validate the linked timer
2) Start timer
The read is then started between steps 1 and 2, so we know that the
timer will always have a consistent view of the read request state.
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 11 Nov 2019 03:30:53 +0000 (20:30 -0700)]
io_uring: don't do flush cancel under inflight_lock
We can't safely cancel under the inflight lock. If the work hasn't been
started yet, then io_wq_cancel_work() simply marks the work as cancelled
and invokes the work handler. But if the work completion needs to grab
the inflight lock because it's grabbing user files, then we'll deadlock
trying to finish the work as we already hold that lock.
Instead grab a reference to the request, if it isn't already zero. If
it's zero, then we know it's going through completion anyway, and we
can safely ignore it. If it's not zero, then we can drop the lock and
attempt to cancel from there.
This also fixes a missing finish_wait() at the end of
io_uring_cancel_files().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sun, 10 Nov 2019 23:56:04 +0000 (16:56 -0700)]
io_uring: flag SQPOLL busy condition to userspace
Now that we have backpressure, for SQPOLL, we have one more condition
that warrants flagging that the application needs to enter the kernel:
we failed to submit IO due to backpressure. Make sure we catch that
and flag it appropriately.
If we run into backpressure issues with the SQPOLL thread, flag it
as such to the application by setting IORING_SQ_NEED_WAKEUP. This will
cause the application to enter the kernel, and that will flush the
backlog and clear the condition.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sun, 10 Nov 2019 00:43:02 +0000 (17:43 -0700)]
io_uring: make ASYNC_CANCEL work with poll and timeout
It's a little confusing that we have multiple types of command
cancellation opcodes now that we have a generic one. Make the generic
one work with POLL_ADD and TIMEOUT commands as well, that makes for an
easier to use API for the application. The fact that they currently
don't is a bit confusing.
Add a helper that takes care of it, so we can user it from both
IORING_OP_ASYNC_CANCEL and from the linked timeout cancellation.
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 8 Nov 2019 15:52:53 +0000 (08:52 -0700)]
io_uring: provide fallback request for OOM situations
One thing that really sucks for userspace APIs is if the kernel passes
back -ENOMEM/-EAGAIN for resource shortages. The application really has
no idea of what to do in those cases. Should it try and reap
completions? Probably a good idea. Will it solve the issue? Who knows.
This patch adds a simple fallback mechanism if we fail to allocate
memory for a request. If we fail allocating memory from the slab for a
request, we punt to a pre-allocated request. There's just one of these
per io_ring_ctx, but the important part is if we ever return -EBUSY to
the application, the applications knows that it can wait for events and
make forward progress when events have completed. This is the important
part.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sun, 10 Nov 2019 02:52:33 +0000 (19:52 -0700)]
io_uring: convert accept4() -ERESTARTSYS into -EINTR
If we cancel a pending accept operating with a signal, we get
-ERESTARTSYS returned. Turn that into -EINTR for userspace, we should
not be return -ERESTARTSYS.
Fixes: 17f2fe35d080 ("io_uring: add support for IORING_OP_ACCEPT")
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sun, 10 Nov 2019 15:40:53 +0000 (08:40 -0700)]
io_uring: fix error clear of ->file_table in io_sqe_files_register()
syzbot reports that when using failslab and friends, we can get a double
free in io_sqe_files_unregister():
BUG: KASAN: double-free or invalid-free in
io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185
CPU: 1 PID: 8819 Comm: syz-executor452 Not tainted 5.4.0-rc6-next-
20191108
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x197/0x210 lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
kasan_report_invalid_free+0x65/0xa0 mm/kasan/report.c:468
__kasan_slab_free+0x13a/0x150 mm/kasan/common.c:450
kasan_slab_free+0xe/0x10 mm/kasan/common.c:480
__cache_free mm/slab.c:3426 [inline]
kfree+0x10a/0x2c0 mm/slab.c:3757
io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185
io_ring_ctx_free fs/io_uring.c:3998 [inline]
io_ring_ctx_wait_and_kill+0x348/0x700 fs/io_uring.c:4060
io_uring_release+0x42/0x50 fs/io_uring.c:4068
__fput+0x2ff/0x890 fs/file_table.c:280
____fput+0x16/0x20 fs/file_table.c:313
task_work_run+0x145/0x1c0 kernel/task_work.c:113
exit_task_work include/linux/task_work.h:22 [inline]
do_exit+0x904/0x2e60 kernel/exit.c:817
do_group_exit+0x135/0x360 kernel/exit.c:921
__do_sys_exit_group kernel/exit.c:932 [inline]
__se_sys_exit_group kernel/exit.c:930 [inline]
__x64_sys_exit_group+0x44/0x50 kernel/exit.c:930
do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43f2c8
Code: 31 b8 c5 f7 ff ff 48 8b 5c 24 28 48 8b 6c 24 30 4c 8b 64 24 38 4c 8b
6c 24 40 4c 8b 74 24 48 4c 8b 7c 24 50 48 83 c4 58 c3 66 <0f> 1f 84 00 00
00 00 00 48 8d 35 59 ca 00 00 0f b6 d2 48 89 fb 48
RSP: 002b:
00007ffd5b976008 EFLAGS:
00000246 ORIG_RAX:
00000000000000e7
RAX:
ffffffffffffffda RBX:
0000000000000000 RCX:
000000000043f2c8
RDX:
0000000000000000 RSI:
000000000000003c RDI:
0000000000000000
RBP:
00000000004bf0a8 R08:
00000000000000e7 R09:
ffffffffffffffd0
R10:
0000000000000001 R11:
0000000000000246 R12:
0000000000000001
R13:
00000000006d1180 R14:
0000000000000000 R15:
0000000000000000
This happens if we fail allocating the file tables. For that case we do
free the file table correctly, but we forget to set it to NULL. This
means that ring teardown will see it as being non-NULL, and attempt to
free it again.
Fix this by clearing the file_table pointer if we free the table.
Reported-by: syzbot+3254bc44113ae1e331ee@syzkaller.appspotmail.com
Fixes: 65e19f54d29c ("io_uring: support for larger fixed file sets")
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jackie Liu [Sat, 9 Nov 2019 03:00:08 +0000 (11:00 +0800)]
io_uring: separate the io_free_req and io_free_req_find_next interface
Similar to the distinction between io_put_req and io_put_req_find_next,
io_free_req has been modified similarly, with no functional changes.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jackie Liu [Fri, 8 Nov 2019 15:50:36 +0000 (23:50 +0800)]
io_uring: keep io_put_req only responsible for release and put req
We already have io_put_req_find_next to find the next req of the link.
we should not use the io_put_req function to find them. They should be
functions of the same level.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jackie Liu [Fri, 8 Nov 2019 15:09:12 +0000 (08:09 -0700)]
io_uring: remove passed in 'ctx' function parameter ctx if possible
Many times, the core of the function is req, and req has already set
req->ctx at initialization time, so there is no need to pass in the
ctx from the caller.
Cleanup, no functional change.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 8 Nov 2019 01:27:42 +0000 (18:27 -0700)]
io_uring: reduce/pack size of io_ring_ctx
With the recent flurry of additions and changes to io_uring, the
layout of io_ring_ctx has become a bit stale. We're right now at
704 bytes in size on my x86-64 build, or 11 cachelines. This
patch does two things:
- We have to completion structs embedded, that we only use for
quiesce of the ctx (or shutdown) and for sqthread init cases.
That 2x32 bytes right there, let's dynamically allocate them.
- Reorder the struct a bit with an eye on cachelines, use cases,
and holes.
With this patch, we're down to 512 bytes, or 8 cachelines.
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 7 Nov 2019 17:57:36 +0000 (10:57 -0700)]
io_uring: properly mark async work as bounded vs unbounded
Now that io-wq supports separating the two request lifetime types, mark
the following IO as having unbounded runtimes:
- Any read/write to a non-regular file
- Any specific networked IO
- Any poll command
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 7 Nov 2019 18:41:16 +0000 (11:41 -0700)]
io-wq: add support for bounded vs unbunded work
io_uring supports request types that basically have two different
lifetimes:
1) Bounded completion time. These are requests like disk reads or writes,
which we know will finish in a finite amount of time.
2) Unbounded completion time. These are generally networked IO, where we
have no idea how long they will take to complete. Another example is
POLL commands.
This patch provides support for io-wq to handle these differently, so we
don't starve bounded requests by tying up workers for too long. By default
all work is bounded, unless otherwise specified in the work item.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 7 Nov 2019 16:17:36 +0000 (09:17 -0700)]
io-wq: io_wqe_run_queue() doesn't need to use list_empty_careful()
We hold the wqe lock at this point (which is also annotated), so there's
no need to use the careful variant of list_empty().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 6 Nov 2019 18:31:17 +0000 (11:31 -0700)]
io_uring: add support for backlogged CQ ring
Currently we drop completion events, if the CQ ring is full. That's fine
for requests with bounded completion times, but it may make it harder or
impossible to use io_uring with networked IO where request completion
times are generally unbounded. Or with POLL, for example, which is also
unbounded.
After this patch, we never overflow the ring, we simply store requests
in a backlog for later flushing. This flushing is done automatically by
the kernel. To prevent the backlog from growing indefinitely, if the
backlog is non-empty, we apply back pressure on IO submissions. Any
attempt to submit new IO with a non-empty backlog will get an -EBUSY
return from the kernel. This is a signal to the application that it has
backlogged CQ events, and that it must reap those before being allowed
to submit more IO.
Note that if we do return -EBUSY, we will have filled whatever
backlogged events into the CQ ring first, if there's room. This means
the application can safely reap events WITHOUT entering the kernel and
waiting for them, they are already available in the CQ ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 6 Nov 2019 22:21:34 +0000 (15:21 -0700)]
io_uring: pass in io_kiocb to fill/add CQ handlers
This is in preparation for handling CQ ring overflow a bit smarter. We
should not have any functional changes in this patch. Most of the
changes are fairly straight forward, the only ones that stick out a bit
are the ones that change __io_free_req() to take the reference count
into account. If the request hasn't been submitted yet, we know it's
safe to simply ignore references and free it. But let's clean these up
too, as later patches will depend on the caller doing the right thing if
the completion logging grabs a reference to the request.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 6 Nov 2019 18:27:53 +0000 (11:27 -0700)]
io_uring: make io_cqring_events() take 'ctx' as argument
The rings can be derived from the ctx, and we need the ctx there for
a future change.
No functional changes in this patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 5 Nov 2019 19:40:47 +0000 (12:40 -0700)]
io_uring: add support for linked SQE timeouts
While we have support for generic timeouts, we don't have a way to tie
a timeout to a specific SQE. The generic timeouts simply trigger wakeups
on the CQ ring.
This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
as a link to a previous command. The timeout specific can be either
relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
the timeout triggers before the dependent command completes, it will
attempt to cancel that command. Likewise, if the dependent command
completes before the timeout triggers, it will cancel the timeout.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 5 Nov 2019 19:39:45 +0000 (12:39 -0700)]
io_uring: abstract out io_async_cancel_one() helper
We're going to need this helper in a future patch, so move it out
of io_async_cancel() and into its own separate function.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 6 Nov 2019 22:41:08 +0000 (01:41 +0300)]
io_uring: use inlined struct sqe_submit
req->submit is always up-to-date, use it directly
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 6 Nov 2019 22:41:07 +0000 (01:41 +0300)]
io_uring: Use submit info inlined into req
Stack allocated struct sqe_submit is passed down to the submission path
along with a request (a.k.a. struct io_kiocb), and will be copied into
req->submit for async requests.
As space for it is already allocated, fill req->submit in the first
place instead of using on-stack one. As a result:
1. sqe->submit is the only place for sqe_submit and is always valid,
so we don't need to track which one to use.
2. don't need to copy in case of async
3. allows to simplify the code by not carrying it as an argument all
the way down
4. allows to reduce number of function arguments / potentially improve
spilling
The downside is that stack is most probably be cached, that's not true
for just allocated memory for a request. Another concern is cache
pollution. Though, a request would be touched and fetched along with
req->submit at some point anyway, so shouldn't be a problem.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 6 Nov 2019 22:41:06 +0000 (01:41 +0300)]
io_uring: allocate io_kiocb upfront
Let io_submit_sqes() to allocate io_kiocb before fetching an sqe.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 5 Nov 2019 21:22:15 +0000 (00:22 +0300)]
io_uring: io_queue_link*() right after submit
After a call to io_submit_sqe(), it's already known whether it needs
to queue a link or not. Do it there, as it's simplier and doesn't keep
an extra variable across the loop.
Reviewed-by:Bob Liu <bob.liu@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 5 Nov 2019 21:22:14 +0000 (00:22 +0300)]
io_uring: Merge io_submit_sqes and io_ring_submit
io_submit_sqes() and io_ring_submit() are doing the same stuff with
a little difference. Deduplicate them.
Reviewed-by:Bob Liu <bob.liu@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 6 Nov 2019 03:34:32 +0000 (20:34 -0700)]
io_uring: kill dead REQ_F_LINK_DONE flag
We had no more use for this flag after the conversion to io-wq, kill it
off.
Fixes: 561fb04a6a22 ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 6 Nov 2019 03:33:16 +0000 (20:33 -0700)]
io_uring: fixup a few spots where link failure isn't flagged
If a request fails, we need to ensure we set REQ_F_FAIL_LINK on it if
REQ_F_LINK is set. Any failure in the chain should break the chain.
We were missing a few spots where this should be done. It might be nice
to generalize this somewhat at some point, as long as we factor in the
fact that failure looks different for each request type.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 5 Nov 2019 22:32:58 +0000 (15:32 -0700)]
io_uring: enable optimized link handling for IORING_OP_POLL_ADD
As introduced by commit:
ba816ad61fdf ("io_uring: run dependent links inline if possible")
enable inline dependent link running for poll commands.
io_poll_complete_work() is the most important change, as it allows a
linked sequence of { POLL, READ } (for example) to proceed inline
instead of needing to get punted to another async context. The
submission side only potentially matters for sqthread, but may as well
include that bit.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 5 Nov 2019 20:51:51 +0000 (13:51 -0700)]
io-wq: use proper nesting IRQ disabling spinlocks for cancel
We don't know what context we'll be called in for cancel, it could very
well be with IRQs disabled already. Use the IRQ saving variants of the
locking primitives.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 4 Nov 2019 15:50:02 +0000 (08:50 -0700)]
MAINTAINERS: update io_uring entry
We now have a list that's appropriate for both kernel and userspace
discussions on io_uring usage and development, add that to the
MAINTAINERS entry.
Also add the io-wq files.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sun, 3 Nov 2019 13:52:50 +0000 (06:52 -0700)]
io_uring: add completion trace event
We currently don't have a completion event trace, add one of those. And
to better be able to match up submissions and completions, add user_data
to the submission trace as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
YueHaibing [Sat, 2 Nov 2019 07:55:01 +0000 (15:55 +0800)]
io-wq: use kfree_rcu() to simplify the code
The callback function of call_rcu() just calls kfree(), so we can use
kfree_rcu() instead of call_rcu() + callback function.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 1 Nov 2019 18:44:40 +0000 (12:44 -0600)]
io_uring: remove io_uring_add_to_prev() trace event
This internal logic was killed with the conversion to io-wq, so we no
longer have a need for this particular trace. Kill it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jackie Liu [Tue, 29 Oct 2019 03:16:42 +0000 (11:16 +0800)]
io_uring: set -EINTR directly when a signal wakes up in io_cqring_wait
We didn't use -ERESTARTSYS to tell the application layer to restart the
system call, but instead return -EINTR. we can set -EINTR directly when
wakeup by the signal, which can help us save an assignment operation and
comparison operation.
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 29 Oct 2019 03:49:21 +0000 (21:49 -0600)]
io_uring: support for generic async request cancel
This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
cancel requests that have been punted to async context and are now
in-flight. This works for regular read/write requests to files, as
long as they haven't been started yet. For socket based IO (or things
like accept4(2)), we can cancel work that is already running as well.
To cancel a request, the sqe must have ->addr set to the user_data of
the request it wishes to cancel. If the request is cancelled
successfully, the original request is completed with -ECANCELED
and the cancel request is completed with a result of 0. If the
request was already running, the original may or may not complete
in error. The cancel request will complete with -EALREADY for that
case. And finally, if the request to cancel wasn't found, the cancel
request is completed with -ENOENT.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 30 Oct 2019 14:42:56 +0000 (08:42 -0600)]
io_uring: io_wq_create() returns an error pointer, not NULL
syzbot reported an issue where we crash at setup time if failslab is
used. The issue is that io_wq_create() returns an error pointer on
failure, not NULL. Hence io_uring thought the io-wq was setup just
fine, but in reality it's a garbage error pointer.
Use IS_ERR() instead of a NULL check, and assign ret appropriately.
Reported-by: syzbot+221cc24572a2fed23b6b@syzkaller.appspotmail.com
Fixes: 561fb04a6a22 ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 29 Oct 2019 18:34:10 +0000 (12:34 -0600)]
io_uring: fix race with canceling timeouts
If we get -1 from hrtimer_try_to_cancel(), we know that the timer
is running. Hence leave all completion to the timeout handler. If
we don't, we can corrupt the list and miss a completion.
Fixes: 11365043e527 ("io_uring: add support for canceling timeout requests")
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 26 Oct 2019 13:20:21 +0000 (07:20 -0600)]
io_uring: support for larger fixed file sets
There's been a few requests for supporting more fixed files than 1024.
This isn't really tricky to do, we just need to split up the file table
into multiple tables and index appropriately. As we do so, reduce the
max single file table to 512. This enables us to do single page allocs
always for the tables, which is an improvement over the situation prior.
This patch adds support for up to 64K files, which should be enough for
everyone.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 26 Oct 2019 13:22:55 +0000 (07:22 -0600)]
io_uring: protect fixed file indexing with array_index_nospec()
We index the file tables with a user given value. After we check
it's within our limits, use array_index_nospec() to prevent any
spectre attacks here.
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 17 Oct 2019 20:42:58 +0000 (14:42 -0600)]
io_uring: add support for IORING_OP_ACCEPT
This allows an application to call accept4() in an async fashion. Like
other opcodes, we first try a non-blocking accept, then punt to async
context if we have to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 17 Oct 2019 20:41:29 +0000 (14:41 -0600)]
net: add __sys_accept4_file() helper
This is identical to __sys_accept4(), except it takes a struct file
instead of an fd, and it also allows passing in extra file->f_flags
flags. The latter is done to support masking in O_NONBLOCK without
manipulating the original file flags.
No functional changes in this patch.
Cc: netdev@vger.kernel.org
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 24 Oct 2019 18:39:47 +0000 (12:39 -0600)]
io_uring: io_uring: add support for async work inheriting files
This is in preparation for adding opcodes that need to add new files
in a process file table, system calls like open(2) or accept4(2).
If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work
item. If work that needs to get punted to async context have this
set, the async worker will assume the original task file table before
executing the work.
Note that opcodes that need access to the current files of an
application cannot be done through IORING_SETUP_SQPOLL.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 24 Oct 2019 13:25:42 +0000 (07:25 -0600)]
io_uring: replace workqueue usage with io-wq
Drop various work-arounds we have for workqueues:
- We no longer need the async_list for tracking sequential IO.
- We don't have to maintain our own mm tracking/setting.
- We don't need a separate workqueue for buffered writes. This didn't
even work that well to begin with, as it was suboptimal for multiple
buffered writers on multiple files.
- We can properly cancel pending interruptible work. This fixes
deadlocks with particularly socket IO, where we cannot cancel them
when the io_uring is closed. Hence the ring will wait forever for
these requests to complete, which may never happen. This is different
from disk IO where we know requests will complete in a finite amount
of time.
- Due to being able to cancel work interruptible work that is already
running, we can implement file table support for work. We need that
for supporting system calls that add to a process file table.
- It gets us one step closer to adding async support for any system
call.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 22 Oct 2019 16:25:58 +0000 (10:25 -0600)]
io-wq: small threadpool implementation for io_uring
This adds support for io-wq, a smaller and specialized thread pool
implementation. This is meant to replace workqueues for io_uring. Among
the reasons for this addition are:
- We can assign memory context smarter and more persistently if we
manage the life time of threads.
- We can drop various work-arounds we have in io_uring, like the
async_list.
- We can implement hashed work insertion, to manage concurrency of
buffered writes without needing a) an extra workqueue, or b)
needlessly making the concurrency of said workqueue very low
which hurts performance of multiple buffered file writers.
- We can implement cancel through signals, for cancelling
interruptible work like read/write (or send/recv) to/from sockets.
- We need the above cancel for being able to assign and use file tables
from a process.
- We can implement a more thorough cancel operation in general.
- We need it to move towards a syslet/threadlet model for even faster
async execution. For that we need to take ownership of the used
threads.
This list is just off the top of my head. Performance should be the
same, or better, at least that's what I've seen in my testing. io-wq
supports basic NUMA functionality, setting up a pool per node.
io-wq hooks up to the scheduler schedule in/out just like workqueue
and uses that to drive the need for more/less workers.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 27 Oct 2019 20:15:41 +0000 (23:15 +0300)]
io_uring: Fix mm_fault with READ/WRITE_FIXED
Commit
fb5ccc98782f ("io_uring: Fix broken links with offloading")
introduced a potential performance regression with unconditionally
taking mm even for READ/WRITE_FIXED operations.
Return the logic handling it back. mm-faulted requests will go through
the generic submission path, so honoring links and drains, but will
fail further on req->has_user check.
Fixes: fb5ccc98782f ("io_uring: Fix broken links with offloading")
Cc: stable@vger.kernel.org # v5.4
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 27 Oct 2019 15:52:20 +0000 (18:52 +0300)]
io_uring: remove index from sqe_submit
submit->index is used only for inbound check in submission path (i.e.
head < ctx->sq_entries). However, it always will be true, as
1. it's already validated by io_get_sqring()
2. ctx->sq_entries can't be changedd in between, because of held
ctx->uring_lock and ctx->refs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dmitrii Dolgov [Tue, 15 Oct 2019 17:02:01 +0000 (19:02 +0200)]
io_uring: add set of tracing events
To trace io_uring activity one can get an information from workqueue and
io trace events, but looks like some parts could be hard to identify via
this approach. Making what happens inside io_uring more transparent is
important to be able to reason about many aspects of it, hence introduce
the set of tracing events.
All such events could be roughly divided into two categories:
* those, that are helping to understand correctness (from both kernel
and an application point of view). E.g. a ring creation, file
registration, or waiting for available CQE. Proposed approach is to
get a pointer to an original structure of interest (ring context, or
request), and then find relevant events. io_uring_queue_async_work
also exposes a pointer to work_struct, to be able to track down
corresponding workqueue events.
* those, that provide performance related information. Mostly it's about
events that change the flow of requests, e.g. whether an async work
was queued, or delayed due to some dependencies. Another important
case is how io_uring optimizations (e.g. registered files) are
utilized.
Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 16 Oct 2019 15:08:32 +0000 (09:08 -0600)]
io_uring: add support for canceling timeout requests
We might have cases where the need for a specific timeout is gone, add
support for canceling an existing timeout operation. This works like the
POLL_REMOVE command, where the application passes in the user_data of
the timeout it wishes to cancel in the sqe->addr field.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 15 Oct 2019 22:48:15 +0000 (16:48 -0600)]
io_uring: add support for absolute timeouts
This is a pretty trivial addition on top of the relative timeouts
we have now, but it's handy for ensuring tighter timing for those
that are building scheduling primitives on top of io_uring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jackie Liu [Wed, 9 Oct 2019 01:19:59 +0000 (09:19 +0800)]
io_uring: replace s->needs_lock with s->in_async
There is no function change, just to clean up the code, use s->in_async
to make the code know where it is.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 4 Oct 2019 18:10:03 +0000 (12:10 -0600)]
io_uring: allow application controlled CQ ring size
We currently size the CQ ring as twice the SQ ring, to allow some
flexibility in not overflowing the CQ ring. This is done because the
SQE life time is different than that of the IO request itself, the SQE
is consumed as soon as the kernel has seen the entry.
Certain application don't need a huge SQ ring size, since they just
submit IO in batches. But they may have a lot of requests pending, and
hence need a big CQ ring to hold them all. By allowing the application
to control the CQ ring size multiplier, we can cater to those
applications more efficiently.
If an application wants to define its own CQ ring size, it must set
IORING_SETUP_CQSIZE in the setup flags, and fill out
io_uring_params->cq_entries. The value must be a power of two.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 3 Oct 2019 19:59:56 +0000 (13:59 -0600)]
io_uring: add support for IORING_REGISTER_FILES_UPDATE
Allows the application to remove/replace/add files to/from a file set.
Passes in a struct:
struct io_uring_files_update {
__u32 offset;
__s32 *fds;
};
that holds an array of fds, size of array passed in through the usual
nr_args part of the io_uring_register() system call. The logic is as
follows:
1) If ->fds[i] is -1, the existing file at i + ->offset is removed from
the set.
2) If ->fds[i] is a valid fd, the existing file at i + ->offset is
replaced with ->fds[i].
For case #2, is the existing file is currently empty (fd == -1), the
new fd is simply added to the array.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 3 Oct 2019 14:11:03 +0000 (08:11 -0600)]
io_uring: allow sparse fixed file sets
This is in preparation for allowing updates to fixed file sets without
requiring a full unregister+register.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 28 Sep 2019 17:36:45 +0000 (11:36 -0600)]
io_uring: run dependent links inline if possible
Currently any dependent link is executed from a new workqueue context,
which means that we'll be doing a context switch per link in the chain.
If we are running the completion of the current request from our async
workqueue and find that the next request is a link, then run it directly
from the workqueue context instead of forcing another switch.
This improves the performance of linked SQEs, and reduces the CPU
overhead.
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Anton Ivanov [Tue, 29 Oct 2019 09:13:34 +0000 (09:13 +0000)]
um-ubd: Entrust re-queue to the upper layers
Fixes crashes due to ubd requeue logic conflicting with the block-mq
logic. Crash is reproducible in 5.0 - 5.3.
Fixes: 53766defb8c8 ("um: Clean-up command processing in UML UBD driver")
Cc: stable@vger.kernel.org # v5.0+
Signed-off-by: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Anton Eidelman [Fri, 18 Oct 2019 18:32:51 +0000 (11:32 -0700)]
nvme-multipath: remove unused groups_only mode in ana log
groups_only mode in nvme_read_ana_log() is no longer used: remove it.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Anton Eidelman [Fri, 18 Oct 2019 18:32:50 +0000 (11:32 -0700)]
nvme-multipath: fix possible io hang after ctrl reconnect
The following scenario results in an IO hang:
1) ctrl completes a request with NVME_SC_ANA_TRANSITION.
NVME_NS_ANA_PENDING bit in ns->flags is set and ana_work is triggered.
2) ana_work: nvme_read_ana_log() tries to get the ANA log page from the ctrl.
This fails because ctrl disconnects.
Therefore nvme_update_ns_ana_state() is not called
and NVME_NS_ANA_PENDING bit in ns->flags is not cleared.
3) ctrl reconnects: nvme_mpath_init(ctrl,...) calls
nvme_read_ana_log(ctrl, groups_only=true).
However, nvme_update_ana_state() does not update namespaces
because nr_nsids = 0 (due to groups_only mode).
4) scan_work calls nvme_validate_ns() finds the ns and re-validates OK.
Result:
The ctrl is now live but NVME_NS_ANA_PENDING bit in ns->flags is still set.
Consequently ctrl will never be considered a viable path by __nvme_find_path().
IO will hang if ctrl is the only or the last path to the namespace.
More generally, while ctrl is reconnecting, its ANA state may change.
And because nvme_mpath_init() requests ANA log in groups_only mode,
these changes are not propagated to the existing ctrl namespaces.
This may result in a mal-function or an IO hang.
Solution:
nvme_mpath_init() will nvme_read_ana_log() with groups_only set to false.
This will not harm the new ctrl case (no namespaces present),
and will make sure the ANA state of namespaces gets updated after reconnect.
Note: Another option would be for nvme_mpath_init() to invoke
nvme_parse_ana_log(..., nvme_set_ns_ana_state) for each existing namespace.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 28 Oct 2019 15:15:33 +0000 (09:15 -0600)]
io_uring: don't touch ctx in setup after ring fd install
syzkaller reported an issue where it looks like a malicious app can
trigger a use-after-free of reading the ctx ->sq_array and ->rings
value right after having installed the ring fd in the process file
table.
Defer ring fd installation until after we're done reading those
values.
Fixes: 75b28affdd6a ("io_uring: allocate the two rings together")
Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 27 Oct 2019 19:10:36 +0000 (22:10 +0300)]
io_uring: Fix leaked shadow_req
io_queue_link_head() owns shadow_req after taking it as an argument.
By not freeing it in case of an error, it can leak the request along
with taken ctx->refs.
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 27 Oct 2019 17:19:19 +0000 (13:19 -0400)]
Linux 5.4-rc5
Linus Torvalds [Sun, 27 Oct 2019 11:14:40 +0000 (07:14 -0400)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Two fixes for the VMWare guest support:
- Unbreak VMWare platform detection which got wreckaged by converting
an integer constant to a string constant.
- Fix the clang build of the VMWAre hypercall by explicitely
specifying the ouput register for INL instead of using the short
form"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/vmware: Fix platform detection VMWARE_PORT macro
x86/cpu/vmware: Use the full form of INL in VMWARE_HYPERCALL, for clang/llvm
Linus Torvalds [Sun, 27 Oct 2019 11:04:22 +0000 (07:04 -0400)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"A small set of fixes for time(keeping):
- Add a missing include to prevent compiler warnings.
- Make the VDSO implementation of clock_getres() POSIX compliant
again. A recent change dropped the NULL pointer guard which is
required as NULL is a valid pointer value for this function.
- Fix two function documentation typos"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-cpu-timers: Fix two trivial comments
timers/sched_clock: Include local timekeeping.h for missing declarations
lib/vdso: Make clock_getres() POSIX compliant again
Linus Torvalds [Sun, 27 Oct 2019 10:59:34 +0000 (06:59 -0400)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"A set of perf fixes:
kernel:
- Unbreak the tracking of auxiliary buffer allocations which got
imbalanced causing recource limit failures.
- Fix the fallout of splitting of ToPA entries which missed to shift
the base entry PA correctly.
- Use the correct context to lookup the AUX event when unmapping the
associated AUX buffer so the event can be stopped and the buffer
reference dropped.
tools:
- Fix buildiid-cache mode setting in copyfile_mode_ns() when copying
/proc/kcore
- Fix freeing id arrays in the event list so the correct event is
closed.
- Sync sched.h anc kvm.h headers with the kernel sources.
- Link jvmti against tools/lib/ctype.o to have weak strlcpy().
- Fix multiple memory and file descriptor leaks, found by coverity in
perf annotate.
- Fix leaks in error handling paths in 'perf c2c', 'perf kmem', found
by a static analysis tool"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/aux: Fix AUX output stopping
perf/aux: Fix tracking of auxiliary trace buffer allocation
perf/x86/intel/pt: Fix base for single entry topa
perf kmem: Fix memory leak in compact_gfp_flags()
tools headers UAPI: Sync sched.h with the kernel
tools headers kvm: Sync kvm.h headers with the kernel sources
tools headers kvm: Sync kvm headers with the kernel sources
tools headers kvm: Sync kvm headers with the kernel sources
perf c2c: Fix memory leak in build_cl_output()
perf tools: Fix mode setting in copyfile_mode_ns()
perf annotate: Fix multiple memory and file descriptor leaks
perf tools: Fix resource leak of closedir() on the error paths
perf evlist: Fix fix for freed id arrays
perf jvmti: Link against tools/lib/ctype.h to have weak strlcpy()
Linus Torvalds [Sun, 27 Oct 2019 10:55:55 +0000 (06:55 -0400)]
Merge branch 'irq-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"Two fixes for interrupt controller drivers:
- Skip IRQ_M_EXT entries in the device tree when initializing the
RISCV PLIC controller to avoid a double init attempt.
- Use the correct ITS list when issuing the VMOVP synchronization
command so the operation works only on the ITS instances which are
associated to a VM"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/sifive-plic: Skip contexts except supervisor in plic_init()
irqchip/gic-v3-its: Use the exact ITSList for VMOVP
Linus Torvalds [Sun, 27 Oct 2019 10:41:52 +0000 (06:41 -0400)]
Merge tag '5.4-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Seven cifs/smb3 fixes, including three for stable"
* tag '5.4-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix cifsInodeInfo lock_sem deadlock when reconnect occurs
CIFS: Fix use after free of file info structures
CIFS: Fix retry mid list corruption on reconnects
cifs: Fix missed free operations
CIFS: avoid using MID 0xFFFF
cifs: clarify comment about timestamp granularity for old servers
cifs: Handle -EINPROGRESS only when noblockcnt is set
Linus Torvalds [Sun, 27 Oct 2019 10:36:57 +0000 (06:36 -0400)]
Merge tag 'riscv/for-v5.4-rc5-b' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Paul Walmsley:
"Several minor fixes and cleanups for v5.4-rc5:
- Three build fixes for various SPARSEMEM-related kernel
configurations
- Two cleanup patches for the kernel bug and breakpoint trap handler
code"
* tag 'riscv/for-v5.4-rc5-b' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: cleanup do_trap_break
riscv: cleanup <asm/bug.h>
riscv: Fix undefined reference to vmemmap_populate_basepages
riscv: Fix implicit declaration of 'page_to_section'
riscv: fix fs/proc/kcore.c compilation with sparsemem enabled
Linus Torvalds [Sat, 26 Oct 2019 23:43:12 +0000 (19:43 -0400)]
Merge tag 'mips_fixes_5.4_3' of git://git./linux/kernel/git/mips/linux
Pull MIPS fixes from Paul Burton:
"A few MIPS fixes:
- Fix VDSO time-related function behavior for systems where we need
to fall back to syscalls, but were instead returning bogus results.
- A fix to TLB exception handlers for Cavium Octeon systems where
they would inadvertently clobber the $1/$at register.
- A build fix for bcm63xx configurations.
- Switch to using my @kernel.org email address"
* tag 'mips_fixes_5.4_3' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: tlbex: Fix build_restore_pagemask KScratch restore
MIPS: bmips: mark exception vectors as char arrays
mips: vdso: Fix __arch_get_hw_counter()
MAINTAINERS: Use @kernel.org address for Paul Burton
Linus Torvalds [Sat, 26 Oct 2019 20:40:04 +0000 (16:40 -0400)]
Merge tag 'tty-5.4-rc5' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial driver fix from Greg KH:
"Here is a single tty/serial driver fix for 5.4-rc5 that resolves a
reported issue.
It has been in linux-next for a while with no problems"
* tag 'tty-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
8250-men-mcb: fix error checking when get_num_ports returns -ENODEV
Linus Torvalds [Sat, 26 Oct 2019 20:36:47 +0000 (16:36 -0400)]
Merge tag 'staging-5.4-rc5' of git://git./linux/kernel/git/gregkh/staging
Pull staging driver fix from Greg KH:
"Here is a single staging driver fix, for the wlan-ng driver, that
resolves a reported issue.
It is been in linux-next for a while with no reported issues"
* tag 'staging-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: wlan-ng: fix exit return when sme->key_idx >= NUM_WEPKEYS
Linus Torvalds [Sat, 26 Oct 2019 19:23:08 +0000 (15:23 -0400)]
Merge tag 'driver-core-5.4-rc5' of git://git./linux/kernel/git/gregkh/driver-core
Pull driver core fix from Greg KH:
"Here is a single sysfs fix for 5.4-rc5.
It resolves an error if you actually try to use the __BIN_ATTR_WO()
macro, seems I never tested it properly before :(
This has been in linux-next for a while with no reported issues"
* tag 'driver-core-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
sysfs: Fixes __BIN_ATTR_WO() macro
Linus Torvalds [Sat, 26 Oct 2019 19:17:54 +0000 (15:17 -0400)]
Merge tag 'char-misc-5.4-rc5' of git://git./linux/kernel/git/gregkh/char-misc
Pull binder fix from Greg KH:
"This is a single binder fix to resolve a reported issue by Jann. It's
been in linux-next for a while with no reported issues"
* tag 'char-misc-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
binder: Don't modify VMA bounds in ->mmap handler
Linus Torvalds [Sat, 26 Oct 2019 19:14:55 +0000 (15:14 -0400)]
Merge tag 'usb-5.4-rc5' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a number of small USB driver fixes for 5.4-rc5.
More "fun" with some of the misc USB drivers as found by syzbot, and
there are a number of other small bugfixes in here for reported
issues.
All have been in linux-next for a while with no reported issues"
* tag 'usb-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: cdns3: Error out if USB_DR_MODE_UNKNOWN in cdns3_core_init_role()
USB: ldusb: fix read info leaks
USB: serial: ti_usb_3410_5052: clean up serial data access
USB: serial: ti_usb_3410_5052: fix port-close races
USB: usblp: fix use-after-free on disconnect
usb: udc: lpc32xx: fix bad bit shift operation
usb: cdns3: Fix dequeue implementation.
USB: legousbtower: fix a signedness bug in tower_probe()
USB: legousbtower: fix memleak on disconnect
USB: ldusb: fix memleak on disconnect
Linus Torvalds [Sat, 26 Oct 2019 19:06:58 +0000 (15:06 -0400)]
Merge branch 'i2c/for-current-fixed' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"A few driver fixes for the I2C subsystem"
* 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: stm32f7: remove warning when compiling with W=1
i2c: stm32f7: fix a race in slave mode with arbitration loss irq
i2c: stm32f7: fix first byte to send in slave mode
i2c: mt65xx: fix NULL ptr dereference
i2c: aspeed: fix master pending state handling
Linus Torvalds [Sat, 26 Oct 2019 18:59:51 +0000 (14:59 -0400)]
Merge tag 'for-linus-2019-10-26' of git://git.kernel.dk/linux-block
Pull block and io_uring fixes from Jens Axboe:
"A bit bigger than usual at this point in time, mostly due to some good
bug hunting work by Pavel that resulted in three io_uring fixes from
him and two from me. Anyway, this pull request contains:
- Revert of the submit-and-wait optimization for io_uring, it can't
always be done safely. It depends on commands always making
progress on their own, which isn't necessarily the case outside of
strict file IO. (me)
- Series of two patches from me and three from Pavel, fixing issues
with shared data and sequencing for io_uring.
- Lastly, two timeout sequence fixes for io_uring (zhangyi)
- Two nbd patches fixing races (Josef)
- libahci regulator_get_optional() fix (Mark)"
* tag 'for-linus-2019-10-26' of git://git.kernel.dk/linux-block:
nbd: verify socket is supported during setup
ata: libahci_platform: Fix regulator_get_optional() misuse
nbd: handle racing with error'ed out commands
nbd: protect cmd->status with cmd->lock
io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
io_uring: used cached copies of sq->dropped and cq->overflow
io_uring: Fix race for sqes with userspace
io_uring: Fix broken links with offloading
io_uring: Fix corrupted user_data
io_uring: correct timeout req sequence when inserting a new entry
io_uring : correct timeout req sequence when waiting timeout
io_uring: revert "io_uring: optimize submit_and_wait API"
Linus Torvalds [Sat, 26 Oct 2019 10:35:46 +0000 (06:35 -0400)]
Merge tag 's390-5.4-5' of git://git./linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Add R_390_GLOB_DAT relocation type support. This fixes boot problem
on linux-next.
- Fix memory leak in zcrypt
* tag 's390-5.4-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/kaslr: add support for R_390_GLOB_DAT relocation type
s390/zcrypt: fix memleak at release
Linus Torvalds [Sat, 26 Oct 2019 10:32:12 +0000 (06:32 -0400)]
Merge tag 'for-linus-5.4-rc5-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixlet from Juergen Gross:
"Just one patch for issuing a deprecation warning for 32-bit Xen pv
guests"
* tag 'for-linus-5.4-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: issue deprecation warning for 32-bit pv guest
Linus Torvalds [Sat, 26 Oct 2019 10:29:04 +0000 (06:29 -0400)]
Merge tag 'dma-mapping-5.4-2' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:
"Fix a regression in the intel-iommu get_required_mask conversion
(Arvind Sankar)"
* tag 'dma-mapping-5.4-2' of git://git.infradead.org/users/hch/dma-mapping:
iommu/vt-d: Return the correct dma mask when we are bypassing the IOMMU
Linus Torvalds [Sat, 26 Oct 2019 10:26:04 +0000 (06:26 -0400)]
Merge tag 'dax-fix-5.4-rc5' of git://git./linux/kernel/git/nvdimm/nvdimm
Pull dax fix from Dan Williams:
"Fix a performance regression that followed from a fix to the
conversion of the fsdax implementation to the xarray. v5.3 users
report that they stop seeing huge page mappings on an application +
filesystem layout that was seeing huge pages previously on v5.2"
* tag 'dax-fix-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
fs/dax: Fix pmd vs pte conflict detection
Linus Torvalds [Sat, 26 Oct 2019 00:11:33 +0000 (20:11 -0400)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Nine changes, eight to drivers (qla2xxx, hpsa, lpfc, alua, ch,
53c710[x2], target) and one core change that tries to close a race
between sysfs delete and module removal"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: lpfc: remove left-over BUILD_NVME defines
scsi: core: try to get module before removing device
scsi: hpsa: add missing hunks in reset-patch
scsi: target: core: Do not overwrite CDB byte 1
scsi: ch: Make it possible to open a ch device multiple times again
scsi: fix kconfig dependency warning related to 53C700_LE_ON_BE
scsi: sni_53c710: fix compilation error
scsi: scsi_dh_alua: handle RTPG sense code correctly during state transitions
scsi: qla2xxx: fix a potential NULL pointer dereference
Christoph Hellwig [Thu, 17 Oct 2019 17:37:30 +0000 (19:37 +0200)]
riscv: cleanup do_trap_break
If we always compile the get_break_insn_length inline function we can
remove the ifdefs and let dead code elimination take care of the warn
branch that is now unreadable because the report_bug stub always
returns BUG_TRAP_TYPE_BUG.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Linus Torvalds [Fri, 25 Oct 2019 21:31:53 +0000 (17:31 -0400)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input
Pull input fix from Dmitry Torokhov:
"A fix for st1232 driver to properly report coordinates for 2nd and
subsequent fingers when more than one is on the surface"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: st1232 - fix reporting multitouch coordinates
Mike Christie [Thu, 17 Oct 2019 21:27:34 +0000 (16:27 -0500)]
nbd: verify socket is supported during setup
nbd requires socket families to support the shutdown method so the nbd
recv workqueue can be woken up from its sock_recvmsg call. If the socket
does not support the callout we will leave recv works running or get hangs
later when the device or module is removed.
This adds a check during socket connection/reconnection to make sure the
socket being passed in supports the needed callout.
Reported-by: syzbot+24c12fa8d218ed26011a@syzkaller.appspotmail.com
Fixes: e9e006f5fcf2 ("nbd: fix max number of supported devs")
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mark Brown [Wed, 16 Oct 2019 10:51:05 +0000 (11:51 +0100)]
ata: libahci_platform: Fix regulator_get_optional() misuse
This driver is using regulator_get_optional() to handle all the supplies
that it handles, and only ever enables and disables all supplies en masse
without ever doing any other configuration of the device to handle missing
power. These are clear signs that the API is being misused - it should only
be used for supplies that may be physically absent from the system and in
these cases the hardware usually needs different configuration if the
supply is missing. Instead use normal regualtor_get(), if the supply is
not described in DT then the framework will substitute a dummy regulator in
so no special handling is needed by the consumer driver.
In the case of the PHY regulator the handling in the driver is a hack to
deal with integrated PHYs; the supplies are only optional in the sense
that that there's some confusion in the code about where they're bound to.
From a code point of view they function exactly as normal supplies so can
be treated as such. It'd probably be better to model this by instantiating
a PHY object for integrated PHYs.
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Josef Bacik [Mon, 21 Oct 2019 19:56:28 +0000 (15:56 -0400)]
nbd: handle racing with error'ed out commands
We hit the following warning in production
print_req_error: I/O error, dev nbd0, sector
7213934408 flags 80700
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 25 PID: 32407 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60
Workqueue: knbd-recv recv_work [nbd]
RIP: 0010:refcount_sub_and_test_checked+0x53/0x60
Call Trace:
blk_mq_free_request+0xb7/0xf0
blk_mq_complete_request+0x62/0xf0
recv_work+0x29/0xa1 [nbd]
process_one_work+0x1f5/0x3f0
worker_thread+0x2d/0x3d0
? rescuer_thread+0x340/0x340
kthread+0x111/0x130
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x1f/0x30
---[ end trace
b079c3c67f98bb7c ]---
This was preceded by us timing out everything and shutting down the
sockets for the device. The problem is we had a request in the queue at
the same time, so we completed the request twice. This can actually
happen in a lot of cases, we fail to get a ref on our config, we only
have one connection and just error out the command, etc.
Fix this by checking cmd->status in nbd_read_stat. We only change this
under the cmd->lock, so we are safe to check this here and see if we've
already error'ed this command out, which would indicate that we've
completed it as well.
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Josef Bacik [Mon, 21 Oct 2019 19:56:27 +0000 (15:56 -0400)]
nbd: protect cmd->status with cmd->lock
We already do this for the most part, except in timeout and clear_req.
For the timeout case we take the lock after we grab a ref on the config,
but that isn't really necessary because we're safe to touch the cmd at
this point, so just move the order around.
For the clear_req cause this is initiated by the user, so again is safe.
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 25 Oct 2019 20:11:55 +0000 (16:11 -0400)]
Merge tag 'modules-for-v5.4-rc5' of git://git./linux/kernel/git/jeyu/linux
Pull modules fixes from Jessica Yu:
- Revert __ksymtab_$namespace.$symbol naming scheme back to
__ksymtab_$symbol, as it was causing issues with depmod.
Instead, have modpost extract a symbol's namespace from __kstrtabns
and __ksymtab_strings.
- Fix `make nsdeps` for out of tree kernel builds (make O=...) caused
by unescaped '/'.
Use a different sed delimiter to avoid this problem.
* tag 'modules-for-v5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
scripts/nsdeps: use alternative sed delimiter
symbol namespaces: revert to previous __ksymtab name scheme
modpost: make updating the symbol namespace explicit
modpost: delegate updating namespaces to separate function
Linus Torvalds [Fri, 25 Oct 2019 20:00:47 +0000 (16:00 -0400)]
Merge tag 'armsoc-fixes' of git://git./linux/kernel/git/soc/soc
Pull ARM SoC fixes from Olof Johansson:
"A slightly larger set of fixes have accrued in the last two weeks.
Mostly a collection of the usual smaller fixes:
- Marvell Armada: USB phy setup issues on Turris Mox
- Broadcom: GPIO/pinmux DT mapping corrections for Stingray, MMC bus
width fix for RPi Zero W, GPIO LED removal for RPI CM3. Also some
maintainer updates.
- OMAP: Fixlets for display config, interrupt settings for wifi, some
clock/PM pieces. Also IOMMU regression fix and a ti-sysc
no-watchdog regression fix.
- i.MX: A few fixes around PM/settings, some devicetree fixlets and
catching up with config option changes in DRM
- Rockchip: RockRro64 misc DT fixups, Hugsun X99 USB-C, Kevin display
panel settings
... and some smaller fixes for Davinci (backlight, McBSP DMA),
Allwinner (phy regulators, PMU removal on A64, etc)"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (42 commits)
ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157
MAINTAINERS: Update the Spreadtrum SoC maintainer
MAINTAINERS: Remove Gregory and Brian for ARCH_BRCMSTB
ARM: dts: bcm2837-rpi-cm3: Avoid leds-gpio probing issue
bus: ti-sysc: Fix watchdog quirk handling
ARM: OMAP2+: Add pdata for OMAP3 ISP IOMMU
ARM: OMAP2+: Plug in device_enable/idle ops for IOMMUs
ARM: davinci_all_defconfig: enable GPIO backlight
ARM: davinci: dm365: Fix McBSP dma_slave_map entry
ARM: dts: bcm2835-rpi-zero-w: Fix bus-width of sdhci
ARM: imx_v6_v7_defconfig: Enable CONFIG_DRM_MSM
arm64: dts: imx8mn: Use correct clock for usdhc's ipg clk
arm64: dts: imx8mm: Use correct clock for usdhc's ipg clk
arm64: dts: imx8mq: Use correct clock for usdhc's ipg clk
ARM: dts: imx7s: Correct GPT's ipg clock source
ARM: dts: vf610-zii-scu4-aib: Specify 'i2c-mux-idle-disconnect'
ARM: dts: imx6q-logicpd: Re-Enable SNVS power key
arm64: dts: lx2160a: Correct CPU core idle state name
mailmap: Add Simon Arlott (replacement for expired email address)
arm64: dts: rockchip: Fix override mode for rk3399-kevin panel
...
Linus Torvalds [Fri, 25 Oct 2019 19:52:05 +0000 (15:52 -0400)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"Bugfixes for ARM, PPC and x86, plus selftest improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: nVMX: Don't leak L1 MMIO regions to L2
KVM: SVM: Fix potential wrong physical id in avic_handle_ldr_update
kvm: clear kvmclock MSR on reset
KVM: x86: fix bugon.cocci warnings
KVM: VMX: Remove specialized handling of unexpected exit-reasons
selftests: kvm: fix sync_regs_test with newer gccs
selftests: kvm: vmx_dirty_log_test: skip the test when VMX is not supported
selftests: kvm: consolidate VMX support checks
selftests: kvm: vmx_set_nested_state_test: don't check for VMX support twice
KVM: Don't shrink/grow vCPU halt_poll_ns if host side polling is disabled
selftests: kvm: synchronize .gitignore to Makefile
kvm: x86: Expose RDPID in KVM_GET_SUPPORTED_CPUID
KVM: arm64: pmu: Reset sample period on overflow handling
KVM: arm64: pmu: Set the CHAINED attribute before creating the in-kernel event
arm64: KVM: Handle PMCR_EL0.LC as RES1 on pure AArch64 systems
KVM: arm64: pmu: Fix cycle counter truncation
KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use
Linus Torvalds [Fri, 25 Oct 2019 19:41:14 +0000 (15:41 -0400)]
Merge tag 'drm-fixes-2019-10-25' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Quiet week this week, which I suspect means some people just didn't
get around to sending me fixes pulls in time. This has 2 komeda and a
bunch of amdgpu fixes in it:
komeda:
- typo fixes
- flushing pipes fix
amdgpu:
- Fix suspend/resume issue related to multi-media engines
- Fix memory leak in user ptr code related to hmm conversion
- Fix possible VM faults when allocating page table memory
- Fix error handling in bo list ioctl"
* tag 'drm-fixes-2019-10-25' of git://anongit.freedesktop.org/drm/drm:
drm/komeda: Fix typos in komeda_splitter_validate
drm/komeda: Don't flush inactive pipes
drm/amdgpu/vce: fix allocation size in enc ring test
drm/amdgpu: fix error handling in amdgpu_bo_list_create
drm/amdgpu: fix potential VM faults
drm/amdgpu: user pages array memory leak fix
drm/amdgpu/vcn: fix allocation size in enc ring test
drm/amdgpu/uvd7: fix allocation size in enc ring test (v2)
drm/amdgpu/uvd6: fix allocation size in enc ring test (v2)
Linus Torvalds [Fri, 25 Oct 2019 19:25:51 +0000 (15:25 -0400)]
Merge tag 'mmc-v5.4-rc4' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC host fixes:
- mxs: Fix flags passed to dmaengine_prep_slave_sg
- cqhci: Add a missing memory barrier
- sdhci-omap: Fix tuning procedure for temperatures < -20C"
* tag 'mmc-v5.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: mxs: fix flags passed to dmaengine_prep_slave_sg
mmc: cqhci: Commit descriptors before setting the doorbell
mmc: sdhci-omap: Fix Tuning procedure for temperatures < -20C
Jens Axboe [Fri, 25 Oct 2019 16:06:15 +0000 (10:06 -0600)]
io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
We currently assume that submissions from the sqthread are successful,
and if IO polling is enabled, we use that value for knowing how many
completions to look for. But if we overflowed the CQ ring or some
requests simply got errored and already completed, they won't be
available for polling.
For the case of IO polling and SQTHREAD usage, look at the pending
poll list. If it ever hits empty then we know that we don't have
anymore pollable requests inflight. For that case, simply reset
the inflight count to zero.
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 25 Oct 2019 16:04:25 +0000 (10:04 -0600)]
io_uring: used cached copies of sq->dropped and cq->overflow
We currently use the ring values directly, but that can lead to issues
if the application is malicious and changes these values on our behalf.
Created in-kernel cached versions of them, and just overwrite the user
side when we update them. This is similar to how we treat the sq/cq
ring tail/head updates.
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Patrice Chotard [Fri, 25 Oct 2019 13:01:22 +0000 (15:01 +0200)]
ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157
Relax qspi pins slew-rate to minimize peak currents.
Fixes: 844030057339 ("ARM: dts: stm32: add flash nor support on stm32mp157c eval board")
Link: https://lore.kernel.org/r/20191025130122.11407-1-alexandre.torgue@st.com
Signed-off-by: Patrice Chotard <patrice.chotard@st.com>
Signed-off-by: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: Olof Johansson <olof@lixom.net>
Pavel Begunkov [Fri, 25 Oct 2019 09:31:31 +0000 (12:31 +0300)]
io_uring: Fix race for sqes with userspace
io_ring_submit() finalises with
1. io_commit_sqring(), which releases sqes to the userspace
2. Then calls to io_queue_link_head(), accessing released head's sqe
Reorder them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 25 Oct 2019 09:31:30 +0000 (12:31 +0300)]
io_uring: Fix broken links with offloading
io_sq_thread() processes sqes by 8 without considering links. As a
result, links will be randomely subdivided.
The easiest way to fix it is to call io_get_sqring() inside
io_submit_sqes() as do io_ring_submit().
Downsides:
1. This removes optimisation of not grabbing mm_struct for fixed files
2. It submitting all sqes in one go, without finer-grained sheduling
with cq processing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>