1. 21 Jan, 2020 22 commits
    • Jens Axboe's avatar
      io_uring: add IORING_OP_FADVISE · 4840e418
      Jens Axboe authored
      This adds support for doing fadvise through io_uring. We assume that
      WILLNEED doesn't block, but that DONTNEED may block.
      Reviewed-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4840e418
    • Jens Axboe's avatar
      io_uring: allow use of offset == -1 to mean file position · ba04291e
      Jens Axboe authored
      This behaves like preadv2/pwritev2 with offset == -1, it'll use (and
      update) the current file position. This obviously comes with the caveat
      that if the application has multiple read/writes in flight, then the
      end result will not be as expected. This is similar to threads sharing
      a file descriptor and doing IO using the current file position.
      
      Since this feature isn't easily detectable by doing a read or write,
      add a feature flags, IORING_FEAT_RW_CUR_POS, to allow applications to
      detect presence of this feature.
      Reported-by: default avatar李通洲 <carter.li@eoitek.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ba04291e
    • Jens Axboe's avatar
      io_uring: add non-vectored read/write commands · 3a6820f2
      Jens Axboe authored
      For uses cases that don't already naturally have an iovec, it's easier
      (or more convenient) to just use a buffer address + length. This is
      particular true if the use case is from languages that want to create
      a memory safe abstraction on top of io_uring, and where introducing
      the need for the iovec may impose an ownership issue. For those cases,
      they currently need an indirection buffer, which means allocating data
      just for this purpose.
      
      Add basic read/write that don't require the iovec.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3a6820f2
    • Jens Axboe's avatar
      io_uring: improve poll completion performance · e94f141b
      Jens Axboe authored
      For busy IORING_OP_POLL_ADD workloads, we can have enough contention
      on the completion lock that we fail the inline completion path quite
      often as we fail the trylock on that lock. Add a list for deferred
      completions that we can use in that case. This helps reduce the number
      of async offloads we have to do, as if we get multiple completions in
      a row, we'll piggy back on to the poll_llist instead of having to queue
      our own offload.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e94f141b
    • Jens Axboe's avatar
      io_uring: split overflow state into SQ and CQ side · ad3eb2c8
      Jens Axboe authored
      We currently check ->cq_overflow_list from both SQ and CQ context, which
      causes some bouncing of that cache line. Add separate bits of state for
      this instead, so that the SQ side can check using its own state, and
      likewise for the CQ side.
      
      This adds ->sq_check_overflow with the SQ state, and ->cq_check_overflow
      with the CQ state. If we hit an overflow condition, both of these bits
      are set. Likewise for overflow flush clear, we clear both bits. For the
      fast path of just checking if there's an overflow condition on either
      the SQ or CQ side, we can use our own private bit for this.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ad3eb2c8
    • Jens Axboe's avatar
      io_uring: add lookup table for various opcode needs · d3656344
      Jens Axboe authored
      We currently have various switch statements that check if an opcode needs
      a file, mm, etc. These are hard to keep in sync as opcodes are added. Add
      a struct io_op_def that holds all of this information, so we have just
      one spot to update when opcodes are added.
      
      This also enables us to NOT allocate req->io if a deferred command
      doesn't need it, and corrects some mistakes we had in terms of what
      commands need mm context.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d3656344
    • Jens Axboe's avatar
      io_uring: remove two unnecessary function declarations · add7b6b8
      Jens Axboe authored
      __io_free_req() and io_double_put_req() aren't used before they are
      defined, so we can kill these two forwards.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      add7b6b8
    • Pavel Begunkov's avatar
      io_uring: move *queue_link_head() from common path · 32fe525b
      Pavel Begunkov authored
      Move io_queue_link_head() to links handling code in io_submit_sqe(),
      so it wouldn't need extra checks and would have better data locality.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      32fe525b
    • Pavel Begunkov's avatar
      io_uring: rename prev to head · 9d76377f
      Pavel Begunkov authored
      Calling "prev" a head of a link is a bit misleading. Rename it
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9d76377f
    • Jens Axboe's avatar
      io_uring: add IOSQE_ASYNC · ce35a47a
      Jens Axboe authored
      io_uring defaults to always doing inline submissions, if at all
      possible. But for larger copies, even if the data is fully cached, that
      can take a long time. Add an IOSQE_ASYNC flag that the application can
      set on the SQE - if set, it'll ensure that we always go async for those
      kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we
      get the concurrency we desire for this case.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ce35a47a
    • Jens Axboe's avatar
      io-wq: support concurrent non-blocking work · 895e2ca0
      Jens Axboe authored
      io-wq assumes that work will complete fast (and not block), so it
      doesn't create a new worker when work is enqueued, if we already have
      at least one worker running. This is done on the assumption that if work
      is running, then it will complete fast.
      
      Add an option to force io-wq to fork a new worker for work queued. This
      is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that
      case, io-wq will create a new worker, even though workers are already
      running.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      895e2ca0
    • Jens Axboe's avatar
      io_uring: add support for IORING_OP_STATX · eddc7ef5
      Jens Axboe authored
      This provides support for async statx(2) through io_uring.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      eddc7ef5
    • Jens Axboe's avatar
      fs: make two stat prep helpers available · 3934e36f
      Jens Axboe authored
      To implement an async stat, we need to provide the flags mapping and
      the statx user copy. Make them available internally, through
      fs/internal.h.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3934e36f
    • Jens Axboe's avatar
      io_uring: avoid ring quiesce for fixed file set unregister and update · 05f3fb3c
      Jens Axboe authored
      We currently fully quiesce the ring before an unregister or update of
      the fixed fileset. This is very expensive, and we can be a bit smarter
      about this.
      
      Add a percpu refcount for the file tables as a whole. Grab a percpu ref
      when we use a registered file, and put it on completion. This is cheap
      to do. Upon removal of a file from a set, switch the ref count to atomic
      mode. When we hit zero ref on the completion side, then we know we can
      drop the previously registered files. When the old files have been
      dropped, switch the ref back to percpu mode for normal operation.
      
      Since there's a period between doing the update and the kernel being
      done with it, add a IORING_OP_FILES_UPDATE opcode that can perform the
      same action. The application knows the update has completed when it gets
      the CQE for it. Between doing the update and receiving this completion,
      the application must continue to use the unregistered fd if submitting
      IO on this particular file.
      
      This takes the runtime of test/file-register from liburing from 14s to
      about 0.7s.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      05f3fb3c
    • Jens Axboe's avatar
      io_uring: add support for IORING_OP_CLOSE · b5dba59e
      Jens Axboe authored
      This works just like close(2), unsurprisingly. We remove the file
      descriptor and post the completion inline, then offload the actual
      (potential) last file put to async context.
      
      Mark the async part of this work as uncancellable, as we really must
      guarantee that the latter part of the close is run.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b5dba59e
    • Jens Axboe's avatar
      io-wq: add support for uncancellable work · 0c9d5ccd
      Jens Axboe authored
      Not all work can be cancelled, some of it we may need to guarantee
      that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL
      on work that must not be cancelled. Note that the caller work function
      must also check for IO_WQ_WORK_NO_CANCEL on work that is marked
      IO_WQ_WORK_CANCEL.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0c9d5ccd
    • Jens Axboe's avatar
      fs: move filp_close() outside of __close_fd_get_file() · 6e802a4b
      Jens Axboe authored
      Just one caller of this, and just use filp_close() there manually.
      This is important to allow async close/removal of the fd.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6e802a4b
    • Jens Axboe's avatar
      io_uring: add support for IORING_OP_OPENAT · 15b71abe
      Jens Axboe authored
      This works just like openat(2), except it can be performed async. For
      the normal case of a non-blocking path lookup this will complete
      inline. If we have to do IO to perform the open, it'll be done from
      async context.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      15b71abe
    • Jens Axboe's avatar
      fs: make build_open_flags() available internally · 35cb6d54
      Jens Axboe authored
      This is a prep patch for supporting non-blocking open from io_uring.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      35cb6d54
    • Jens Axboe's avatar
      io_uring: add support for fallocate() · d63d1b5e
      Jens Axboe authored
      This exposes fallocate(2) through io_uring.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d63d1b5e
    • Jens Axboe's avatar
      Merge branch 'io_uring-5.5' into for-5.6/io_uring-vfs · 4d927483
      Jens Axboe authored
      Pull in compatability fix for the files_update command.
      
      * io_uring-5.5:
        io_uring: fix compat for IORING_REGISTER_FILES_UPDATE
      4d927483
    • Eugene Syromiatnikov's avatar
      io_uring: fix compat for IORING_REGISTER_FILES_UPDATE · 1292e972
      Eugene Syromiatnikov authored
      fds field of struct io_uring_files_update is problematic with regards
      to compat user space, as pointer size is different in 32-bit, 32-on-64-bit,
      and 64-bit user space.  In order to avoid custom handling of compat in
      the syscall implementation, make fds __u64 and use u64_to_user_ptr in
      order to retrieve it.  Also, align the field naturally and check that
      no garbage is passed there.
      
      Fixes: c3a31e60 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
      Signed-off-by: default avatarEugene Syromiatnikov <esyr@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1292e972
  2. 20 Jan, 2020 2 commits
    • Jens Axboe's avatar
      Merge branch 'work.openat2' of... · fa7773de
      Jens Axboe authored
      Merge branch 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-5.6/io_uring-vfs
      
      Pull in Al's openat2 branch, since we'll need that for the openat2
      support.
      
      * 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        Documentation: path-lookup: include new LOOKUP flags
        selftests: add openat2(2) selftests
        open: introduce openat2(2) syscall
        namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
        namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
        namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
        namei: LOOKUP_NO_XDEV: block mountpoint crossing
        namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
        namei: LOOKUP_NO_SYMLINKS: block symlink resolution
        namei: allow set_root() to produce errors
        namei: allow nd_jump_link() to produce errors
        nsfs: clean-up ns_get_path() signature to return int
        namei: only return -ECHILD from follow_dotdot_rcu()
      fa7773de
    • Linus Torvalds's avatar
      Linux 5.5-rc7 · def9d278
      Linus Torvalds authored
      def9d278
  3. 19 Jan, 2020 8 commits
    • Linus Torvalds's avatar
      Merge tag 'riscv/for-v5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 7008ee12
      Linus Torvalds authored
      Pull RISC-V fixes from Paul Walmsley:
       "Three fixes for RISC-V:
      
         - Don't free and reuse memory containing the code that CPUs parked at
           boot reside in.
      
         - Fix rv64 build problems for ubsan and some modules by adding
           logical and arithmetic shift helpers for 128-bit values. These are
           from libgcc and are similar to what's present for ARM64.
      
         - Fix vDSO builds to clean up their own temporary files"
      
      * tag 'riscv/for-v5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: Less inefficient gcc tishift helpers (and export their symbols)
        riscv: delete temporary files
        riscv: make sure the cores stay looping in .Lsecondary_park
      7008ee12
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 11a82729
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix non-blocking connect() in x25, from Martin Schiller.
      
       2) Fix spurious decryption errors in kTLS, from Jakub Kicinski.
      
       3) Netfilter use-after-free in mtype_destroy(), from Cong Wang.
      
       4) Limit size of TSO packets properly in lan78xx driver, from Eric
          Dumazet.
      
       5) r8152 probe needs an endpoint sanity check, from Johan Hovold.
      
       6) Prevent looping in tcp_bpf_unhash() during sockmap/tls free, from
          John Fastabend.
      
       7) hns3 needs short frames padded on transmit, from Yunsheng Lin.
      
       8) Fix netfilter ICMP header corruption, from Eyal Birger.
      
       9) Fix soft lockup when low on memory in hns3, from Yonglong Liu.
      
      10) Fix NTUPLE firmware command failures in bnxt_en, from Michael Chan.
      
      11) Fix memory leak in act_ctinfo, from Eric Dumazet.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
        cxgb4: reject overlapped queues in TC-MQPRIO offload
        cxgb4: fix Tx multi channel port rate limit
        net: sched: act_ctinfo: fix memory leak
        bnxt_en: Do not treat DSN (Digital Serial Number) read failure as fatal.
        bnxt_en: Fix ipv6 RFS filter matching logic.
        bnxt_en: Fix NTUPLE firmware command failures.
        net: systemport: Fixed queue mapping in internal ring map
        net: dsa: bcm_sf2: Configure IMP port for 2Gb/sec
        net: dsa: sja1105: Don't error out on disabled ports with no phy-mode
        net: phy: dp83867: Set FORCE_LINK_GOOD to default after reset
        net: hns: fix soft lockup when there is not enough memory
        net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()
        net/sched: act_ife: initalize ife->metalist earlier
        netfilter: nat: fix ICMP header corruption on ICMP errors
        net: wan: lapbether.c: Use built-in RCU list checking
        netfilter: nf_tables: fix flowtable list del corruption
        netfilter: nf_tables: fix memory leak in nf_tables_parse_netdev_hooks()
        netfilter: nf_tables: remove WARN and add NLA_STRING upper limits
        netfilter: nft_tunnel: ERSPAN_VERSION must not be null
        netfilter: nft_tunnel: fix null-attribute check
        ...
      11a82729
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 5f436443
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "Two runtime PM fixes and one leak fix"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: iop3xx: Fix memory leak in probe error path
        i2c: tegra: Properly disable runtime PM on driver's probe error
        i2c: tegra: Fix suspending in active runtime PM state
      5f436443
    • Rahul Lakkireddy's avatar
      cxgb4: reject overlapped queues in TC-MQPRIO offload · b2383ad9
      Rahul Lakkireddy authored
      A queue can't belong to multiple traffic classes. So, reject
      any such configuration that results in overlapped queues for a
      traffic class.
      
      Fixes: b1396c2b ("cxgb4: parse and configure TC-MQPRIO offload")
      Signed-off-by: default avatarRahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b2383ad9
    • Rahul Lakkireddy's avatar
      cxgb4: fix Tx multi channel port rate limit · c856e2b6
      Rahul Lakkireddy authored
      T6 can support 2 egress traffic management channels per port to
      double the total number of traffic classes that can be configured.
      In this configuration, if the class belongs to the other channel,
      then all the queues must be bound again explicitly to the new class,
      for the rate limit parameters on the other channel to take effect.
      
      So, always explicitly bind all queues to the port rate limit traffic
      class, regardless of the traffic management channel that it belongs
      to. Also, only bind queues to port rate limit traffic class, if all
      the queues don't already belong to an existing different traffic
      class.
      
      Fixes: 4ec4762d ("cxgb4: add TC-MATCHALL classifier egress offload")
      Signed-off-by: default avatarRahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c856e2b6
    • Eric Dumazet's avatar
      net: sched: act_ctinfo: fix memory leak · 09d4f10a
      Eric Dumazet authored
      Implement a cleanup method to properly free ci->params
      
      BUG: memory leak
      unreferenced object 0xffff88811746e2c0 (size 64):
        comm "syz-executor617", pid 7106, jiffies 4294943055 (age 14.250s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          c0 34 60 84 ff ff ff ff 00 00 00 00 00 00 00 00  .4`.............
        backtrace:
          [<0000000015aa236f>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
          [<0000000015aa236f>] slab_post_alloc_hook mm/slab.h:586 [inline]
          [<0000000015aa236f>] slab_alloc mm/slab.c:3320 [inline]
          [<0000000015aa236f>] kmem_cache_alloc_trace+0x145/0x2c0 mm/slab.c:3549
          [<000000002c946bd1>] kmalloc include/linux/slab.h:556 [inline]
          [<000000002c946bd1>] kzalloc include/linux/slab.h:670 [inline]
          [<000000002c946bd1>] tcf_ctinfo_init+0x21a/0x530 net/sched/act_ctinfo.c:236
          [<0000000086952cca>] tcf_action_init_1+0x400/0x5b0 net/sched/act_api.c:944
          [<000000005ab29bf8>] tcf_action_init+0x135/0x1c0 net/sched/act_api.c:1000
          [<00000000392f56f9>] tcf_action_add+0x9a/0x200 net/sched/act_api.c:1410
          [<0000000088f3c5dd>] tc_ctl_action+0x14d/0x1bb net/sched/act_api.c:1465
          [<000000006b39d986>] rtnetlink_rcv_msg+0x178/0x4b0 net/core/rtnetlink.c:5424
          [<00000000fd6ecace>] netlink_rcv_skb+0x61/0x170 net/netlink/af_netlink.c:2477
          [<0000000047493d02>] rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5442
          [<00000000bdcf8286>] netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
          [<00000000bdcf8286>] netlink_unicast+0x223/0x310 net/netlink/af_netlink.c:1328
          [<00000000fc5b92d9>] netlink_sendmsg+0x2c0/0x570 net/netlink/af_netlink.c:1917
          [<00000000da84d076>] sock_sendmsg_nosec net/socket.c:639 [inline]
          [<00000000da84d076>] sock_sendmsg+0x54/0x70 net/socket.c:659
          [<0000000042fb2eee>] ____sys_sendmsg+0x2d0/0x300 net/socket.c:2330
          [<000000008f23f67e>] ___sys_sendmsg+0x8a/0xd0 net/socket.c:2384
          [<00000000d838e4f6>] __sys_sendmsg+0x80/0xf0 net/socket.c:2417
          [<00000000289a9cb1>] __do_sys_sendmsg net/socket.c:2426 [inline]
          [<00000000289a9cb1>] __se_sys_sendmsg net/socket.c:2424 [inline]
          [<00000000289a9cb1>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2424
      
      Fixes: 24ec483c ("net: sched: Introduce act_ctinfo action")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Cc: Kevin 'ldir' Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: default avatarKevin 'ldir' Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      09d4f10a
    • Olof Johansson's avatar
      riscv: Less inefficient gcc tishift helpers (and export their symbols) · fc585d4a
      Olof Johansson authored
      The existing __lshrti3 was really inefficient, and the other two helpers
      are also needed to compile some modules.
      
      Add the missing versions, and export all of the symbols like arm64
      already does.
      
      This code is based on the assembly generated by libgcc builds.
      
      This fixes a build break triggered by ubsan:
      
      riscv64-unknown-linux-gnu-ld: lib/ubsan.o: in function `.L2':
      ubsan.c:(.text.unlikely+0x38): undefined reference to `__ashlti3'
      riscv64-unknown-linux-gnu-ld: ubsan.c:(.text.unlikely+0x42): undefined reference to `__ashrti3'
      Signed-off-by: default avatarOlof Johansson <olof@lixom.net>
      [paul.walmsley@sifive.com: use SYM_FUNC_{START,END} instead of
       ENTRY/ENDPROC; note libgcc origin]
      Signed-off-by: default avatarPaul Walmsley <paul.walmsley@sifive.com>
      fc585d4a
    • Linus Torvalds's avatar
      Merge tag 'mtd/fixes-for-5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux · 8f8972a3
      Linus Torvalds authored
      Pull MTD fixes from Miquel Raynal:
       "Raw NAND:
         - GPMI: Fix the suspend/resume
      
        SPI-NOR:
         - Fix quad enable on Spansion like flashes
         - Fix selection of 4-byte addressing opcodes on Spansion"
      
      * tag 'mtd/fixes-for-5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
        mtd: rawnand: gpmi: Restore nfc timing setup after suspend/resume
        mtd: rawnand: gpmi: Fix suspend/resume problem
        mtd: spi-nor: Fix quad enable for Spansion like flashes
        mtd: spi-nor: Fix selection of 4-byte addressing opcodes on Spansion
      8f8972a3
  4. 18 Jan, 2020 8 commits