From: Linus Torvalds Date: Fri, 8 Mar 2019 22:48:40 +0000 (-0800) Subject: Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block X-Git-Url: http://git.maquefel.me/?a=commitdiff_plain;h=38e7571c07be01f9f19b355a9306a4e3d5cb0f5b;p=linux.git Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block Pull io_uring IO interface from Jens Axboe: "Second attempt at adding the io_uring interface. Since the first one, we've added basic unit testing of the three system calls, that resides in liburing like the other unit tests that we have so far. It'll take a while to get full coverage of it, but we're working towards it. I've also added two basic test programs to tools/io_uring. One uses the raw interface and has support for all the various features that io_uring supports outside of standard IO, like fixed files, fixed IO buffers, and polled IO. The other uses the liburing API, and is a simplified version of cp(1). This adds support for a new IO interface, io_uring. io_uring allows an application to communicate with the kernel through two rings, the submission queue (SQ) and completion queue (CQ) ring. This allows for very efficient handling of IOs, see the v5 posting for some basic numbers: https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/ Outside of just efficiency, the interface is also flexible and extendable, and allows for future use cases like the upcoming NVMe key-value store API, networked IO, and so on. It also supports async buffered IO, something that we've always failed to support in the kernel. Outside of basic IO features, it supports async polled IO as well. This particular feature has already been tested at Facebook months ago for flash storage boxes, with 25-33% improvements. It makes polled IO actually useful for real world use cases, where even basic flash sees a nice win in terms of efficiency, latency, and performance. These boxes were IOPS bound before, now they are not. This series adds three new system calls. One for setting up an io_uring instance (io_uring_setup(2)), one for submitting/completing IO (io_uring_enter(2)), and one for aux functions like registrating file sets, buffers, etc (io_uring_register(2)). Through the help of Arnd, I've coordinated the syscall numbers so merge on that front should be painless. Jon did a writeup of the interface a while back, which (except for minor details that have been tweaked) is still accurate. Find that here: https://lwn.net/Articles/776703/ Huge thanks to Al Viro for helping getting the reference cycle code correct, and to Jann Horn for his extensive reviews focused on both security and bugs in general. There's a userspace library that provides basic functionality for applications that don't need or want to care about how to fiddle with the rings directly. It has helpers to allow applications to easily set up an io_uring instance, and submit/complete IO through it without knowing about the intricacies of the rings. It also includes man pages (thanks to Jeff Moyer), and will continue to grow support helper functions and features as time progresses. Find it here: git://git.kernel.dk/liburing Fio has full support for the raw interface, both in the form of an IO engine (io_uring), but also with a small test application (t/io_uring) that can exercise and benchmark the interface" * tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block: io_uring: add a few test tools io_uring: allow workqueue item to handle multiple buffered requests io_uring: add support for IORING_OP_POLL io_uring: add io_kiocb ref count io_uring: add submission polling io_uring: add file set registration net: split out functions related to registering inflight socket files io_uring: add support for pre-mapped user IO buffers block: implement bio helper to add iter bvec pages to bio io_uring: batch io_kiocb allocation io_uring: use fget/fput_many() for file references fs: add fget_many() and fput_many() io_uring: support for IO polling io_uring: add fsync support Add io_uring IO interface --- 38e7571c07be01f9f19b355a9306a4e3d5cb0f5b diff --cc arch/x86/entry/syscalls/syscall_32.tbl index 955ab6a3b61f5,2eefd2a7c1cef..8da78595d69db --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@@ -396,36 -396,8 +396,39 @@@ 382 i386 pkey_free sys_pkey_free __ia32_sys_pkey_free 383 i386 statx sys_statx __ia32_sys_statx 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl -385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents +385 i386 io_pgetevents sys_io_pgetevents_time32 __ia32_compat_sys_io_pgetevents 386 i386 rseq sys_rseq __ia32_sys_rseq +# don't use numbers 387 through 392, add new calls at the end +393 i386 semget sys_semget __ia32_sys_semget +394 i386 semctl sys_semctl __ia32_compat_sys_semctl +395 i386 shmget sys_shmget __ia32_sys_shmget +396 i386 shmctl sys_shmctl __ia32_compat_sys_shmctl +397 i386 shmat sys_shmat __ia32_compat_sys_shmat +398 i386 shmdt sys_shmdt __ia32_sys_shmdt +399 i386 msgget sys_msgget __ia32_sys_msgget +400 i386 msgsnd sys_msgsnd __ia32_compat_sys_msgsnd +401 i386 msgrcv sys_msgrcv __ia32_compat_sys_msgrcv +402 i386 msgctl sys_msgctl __ia32_compat_sys_msgctl +403 i386 clock_gettime64 sys_clock_gettime __ia32_sys_clock_gettime +404 i386 clock_settime64 sys_clock_settime __ia32_sys_clock_settime +405 i386 clock_adjtime64 sys_clock_adjtime __ia32_sys_clock_adjtime +406 i386 clock_getres_time64 sys_clock_getres __ia32_sys_clock_getres +407 i386 clock_nanosleep_time64 sys_clock_nanosleep __ia32_sys_clock_nanosleep +408 i386 timer_gettime64 sys_timer_gettime __ia32_sys_timer_gettime +409 i386 timer_settime64 sys_timer_settime __ia32_sys_timer_settime +410 i386 timerfd_gettime64 sys_timerfd_gettime __ia32_sys_timerfd_gettime +411 i386 timerfd_settime64 sys_timerfd_settime __ia32_sys_timerfd_settime +412 i386 utimensat_time64 sys_utimensat __ia32_sys_utimensat +413 i386 pselect6_time64 sys_pselect6 __ia32_compat_sys_pselect6_time64 +414 i386 ppoll_time64 sys_ppoll __ia32_compat_sys_ppoll_time64 +416 i386 io_pgetevents_time64 sys_io_pgetevents __ia32_sys_io_pgetevents +417 i386 recvmmsg_time64 sys_recvmmsg __ia32_compat_sys_recvmmsg_time64 +418 i386 mq_timedsend_time64 sys_mq_timedsend __ia32_sys_mq_timedsend +419 i386 mq_timedreceive_time64 sys_mq_timedreceive __ia32_sys_mq_timedreceive +420 i386 semtimedop_time64 sys_semtimedop __ia32_sys_semtimedop +421 i386 rt_sigtimedwait_time64 sys_rt_sigtimedwait __ia32_compat_sys_rt_sigtimedwait_time64 +422 i386 futex_time64 sys_futex __ia32_sys_futex +423 i386 sched_rr_get_interval_time64 sys_sched_rr_get_interval __ia32_sys_sched_rr_get_interval + 425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup + 426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter + 427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register diff --cc arch/x86/entry/syscalls/syscall_64.tbl index 2ae92fddb6d5f,65c026185e612..c768447f97ec6 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@@ -343,8 -343,9 +343,11 @@@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq +# don't use numbers 387 through 423, add new calls after the last +# 'common' entry + 425 common io_uring_setup __x64_sys_io_uring_setup + 426 common io_uring_enter __x64_sys_io_uring_enter + 427 common io_uring_register __x64_sys_io_uring_register # # x32-specific system call numbers start at 512 to avoid cache impact diff --cc include/uapi/asm-generic/unistd.h index 12cdf611d217e,d346229a1eb0d..bf4624efe5e62 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@@ -780,52 -740,15 +780,59 @@@ __SC_COMP_3264(__NR_io_pgetevents, sys_ __SYSCALL(__NR_rseq, sys_rseq) #define __NR_kexec_file_load 294 __SYSCALL(__NR_kexec_file_load, sys_kexec_file_load) +/* 295 through 402 are unassigned to sync up with generic numbers, don't use */ +#if __BITS_PER_LONG == 32 +#define __NR_clock_gettime64 403 +__SYSCALL(__NR_clock_gettime64, sys_clock_gettime) +#define __NR_clock_settime64 404 +__SYSCALL(__NR_clock_settime64, sys_clock_settime) +#define __NR_clock_adjtime64 405 +__SYSCALL(__NR_clock_adjtime64, sys_clock_adjtime) +#define __NR_clock_getres_time64 406 +__SYSCALL(__NR_clock_getres_time64, sys_clock_getres) +#define __NR_clock_nanosleep_time64 407 +__SYSCALL(__NR_clock_nanosleep_time64, sys_clock_nanosleep) +#define __NR_timer_gettime64 408 +__SYSCALL(__NR_timer_gettime64, sys_timer_gettime) +#define __NR_timer_settime64 409 +__SYSCALL(__NR_timer_settime64, sys_timer_settime) +#define __NR_timerfd_gettime64 410 +__SYSCALL(__NR_timerfd_gettime64, sys_timerfd_gettime) +#define __NR_timerfd_settime64 411 +__SYSCALL(__NR_timerfd_settime64, sys_timerfd_settime) +#define __NR_utimensat_time64 412 +__SYSCALL(__NR_utimensat_time64, sys_utimensat) +#define __NR_pselect6_time64 413 +__SC_COMP(__NR_pselect6_time64, sys_pselect6, compat_sys_pselect6_time64) +#define __NR_ppoll_time64 414 +__SC_COMP(__NR_ppoll_time64, sys_ppoll, compat_sys_ppoll_time64) +#define __NR_io_pgetevents_time64 416 +__SYSCALL(__NR_io_pgetevents_time64, sys_io_pgetevents) +#define __NR_recvmmsg_time64 417 +__SC_COMP(__NR_recvmmsg_time64, sys_recvmmsg, compat_sys_recvmmsg_time64) +#define __NR_mq_timedsend_time64 418 +__SYSCALL(__NR_mq_timedsend_time64, sys_mq_timedsend) +#define __NR_mq_timedreceive_time64 419 +__SYSCALL(__NR_mq_timedreceive_time64, sys_mq_timedreceive) +#define __NR_semtimedop_time64 420 +__SYSCALL(__NR_semtimedop_time64, sys_semtimedop) +#define __NR_rt_sigtimedwait_time64 421 +__SC_COMP(__NR_rt_sigtimedwait_time64, sys_rt_sigtimedwait, compat_sys_rt_sigtimedwait_time64) +#define __NR_futex_time64 422 +__SYSCALL(__NR_futex_time64, sys_futex) +#define __NR_sched_rr_get_interval_time64 423 +__SYSCALL(__NR_sched_rr_get_interval_time64, sys_sched_rr_get_interval) +#endif + + #define __NR_io_uring_setup 425 + __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup) + #define __NR_io_uring_enter 426 + __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter) + #define __NR_io_uring_register 427 + __SYSCALL(__NR_io_uring_register, sys_io_uring_register) + #undef __NR_syscalls - #define __NR_syscalls 424 + #define __NR_syscalls 428 /* * 32 bit systems traditionally used different diff --cc kernel/sys_ni.c index 62a6c87077994,1bb6604dc19fc..51d7c6794bf11 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@@ -42,12 -42,13 +42,15 @@@ COND_SYSCALL(io_destroy) COND_SYSCALL(io_submit); COND_SYSCALL_COMPAT(io_submit); COND_SYSCALL(io_cancel); +COND_SYSCALL(io_getevents_time32); COND_SYSCALL(io_getevents); +COND_SYSCALL(io_pgetevents_time32); COND_SYSCALL(io_pgetevents); -COND_SYSCALL_COMPAT(io_getevents); +COND_SYSCALL_COMPAT(io_pgetevents_time32); COND_SYSCALL_COMPAT(io_pgetevents); + COND_SYSCALL(io_uring_setup); + COND_SYSCALL(io_uring_enter); + COND_SYSCALL(io_uring_register); /* fs/xattr.c */