  1. 23 Nov, 2022 2 commits
    • fuse: allow non-extending parallel direct writes on the same file · 15352405
      Dharmendra Singh authored
      
      In general, as of now, in FUSE, direct writes on the same file are
      serialized over the inode lock, i.e. we hold the inode lock for the full
      duration of the write request.  I could not find in the fuse code or git
      history a comment which clearly explains why this exclusive lock is taken
      for direct writes.  The following might be the reasons for acquiring an
      exclusive lock, though the list may not be exhaustive:
      
       1) Our guess is that some userspace fuse implementations might be relying
          on this lock for serialization.
      
       2) The lock protects against file read/write size races.
      
       3) Ruling out any issues arising from partial write failures.
      
      This patch relaxes the exclusive lock for direct non-extending writes only.
      File size extending writes might not need the lock either, but we are not
      entirely sure whether relaxing it would risk introducing a regression.
      Furthermore, benchmarking with fio shows no difference between patch
      versions that, on file size extension, take a) an exclusive lock and b) a
      shared lock.
      
      A possible example of an issue with i_size extending writes is the write
      error case.  Some writes might succeed and others might fail for file system
      internal reasons - for example ENOSPC.  With parallel file size extending
      writes it _might_ be difficult to revert the action of the failing write,
      especially to restore the right i_size.
      
      With these changes, we allow non-extending parallel direct writes on the
      same file with the help of a flag called FOPEN_PARALLEL_DIRECT_WRITES.  If
      this flag is set on the file (flag is passed from libfuse to fuse kernel as
      part of file open/create), we do not take exclusive lock anymore, but
      instead use a shared lock that allows non-extending writes to run in
      parallel.  FUSE implementations which rely on this inode lock for
      serialization can continue to do so and serialized direct writes are still
      the default.  Implementations that do not do write serialization need to be
      updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in their file
      open/create reply.
      
      On patch review there were concerns that network file systems (or vfs
      multiple mounts of the same file system) might have issues with parallel
      writes.  We believe this is not the case, as this is just a local lock,
      which network file systems could not rely on anyway.  I.e. this lock is
      just for local consistency.
      Signed-off-by: Dharmendra Singh <dsingh@ddn.com>
      Signed-off-by: Bernd Schubert <bschubert@ddn.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
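The locking rule described in the commit above can be modelled in a few lines of userspace C. This is only a sketch, not the kernel code: every `demo_`/`DEMO_` name and the flag value are hypothetical stand-ins, and treating an append-mode write as size-extending is an assumption that the commit message does not spell out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the open flag; not the kernel's definition. */
#define DEMO_FOPEN_PARALLEL_DIRECT_WRITES (1u << 0)

enum demo_lock_mode { DEMO_LOCK_SHARED, DEMO_LOCK_EXCLUSIVE };

struct demo_write {
	uint32_t open_flags; /* flags from the server's open/create reply */
	bool     append;     /* write issued in append mode (assumed extending) */
	uint64_t pos;        /* starting file offset */
	uint64_t len;        /* number of bytes to write */
};

/* A write extends the file if it ends beyond the current size. */
static bool demo_write_extends_size(const struct demo_write *w, uint64_t i_size)
{
	return w->pos + w->len > i_size;
}

/*
 * Serialized writes stay the default: the shared lock is used only when
 * the server opted in and the write does not change the file size.
 */
static enum demo_lock_mode demo_pick_lock(const struct demo_write *w,
					  uint64_t i_size)
{
	if (!(w->open_flags & DEMO_FOPEN_PARALLEL_DIRECT_WRITES))
		return DEMO_LOCK_EXCLUSIVE;
	if (w->append || demo_write_extends_size(w, i_size))
		return DEMO_LOCK_EXCLUSIVE;
	return DEMO_LOCK_SHARED;
}
```

With the flag set, a 4096-byte write at offset 0 into an 8192-byte file would take the shared lock, while the same write into a 1024-byte file would fall back to the exclusive lock.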
    • fuse: lock inode unconditionally in fuse_fallocate() · 44361e8c
      Miklos Szeredi authored
      
      file_modified() must be called with inode lock held.  fuse_fallocate()
      didn't lock the inode when the flags value was just FALLOC_FL_KEEP_SIZE,
      which resulted in a kernel warning in notify_change().
      
      Lock the inode unconditionally, like all other fallocate implementations
      do.
      Reported-by: Pengfei Xu <pengfei.xu@intel.com>
      Reported-and-tested-by: syzbot+462da39f0667b357c4b6@syzkaller.appspotmail.com
      Fixes: 4a6f278d ("fuse: add file_modified() to fallocate")
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  2. 28 Oct, 2022 1 commit
  3. 09 Aug, 2022 2 commits
    • iov_iter: advancing variants of iov_iter_get_pages{,_alloc}() · 1ef255e2
      Al Viro authored
      
      Most of the users immediately follow successful iov_iter_get_pages()
      with advancing by the amount it had returned.
      
      Provide inline wrappers doing that, convert trivial open-coded
      uses of those.
      
      BTW, iov_iter_get_pages() never returns more than it had been asked
      to; such checks in cifs ought to be removed someday...
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • new iov_iter flavour - ITER_UBUF · fcb14cb1
      Al Viro authored
      
      Equivalent of single-segment iovec.  Initialized by iov_iter_ubuf(),
      checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
      ones.
      
      We are going to expose things like ->write_iter() et al. to those
      in subsequent commits.
      
      New predicate (user_backed_iter()) that is true for ITER_IOVEC and
      ITER_UBUF; places like direct-IO handling should use that for
      checking that pages we modify after getting them from iov_iter_get_pages()
      would need to be dirtied.
      
      DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
      will solve all problems - there's code that uses iter_is_iovec() to
      decide how to poke around in iov_iter guts and for that the predicate
      replacement obviously won't suffice.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
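The distinction drawn in the commit above between "is user-backed" and "is an iovec" can be illustrated with a toy model. This is a sketch under stated assumptions: the `demo_` types and functions are hypothetical and only mirror the roles of iov_iter_ubuf(), iter_is_ubuf(), iter_is_iovec() and user_backed_iter(); the real struct iov_iter is laid out differently.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy iterator flavours; DEMO_ITER_KVEC stands for any kernel-backed one. */
enum demo_iter_type { DEMO_ITER_IOVEC, DEMO_ITER_KVEC, DEMO_ITER_UBUF };

struct demo_iter {
	enum demo_iter_type type;
	void  *ubuf;   /* single user buffer, used only by DEMO_ITER_UBUF */
	size_t count;  /* bytes remaining */
};

/* Initialize a single-segment, user-backed iterator. */
static void demo_iter_ubuf(struct demo_iter *it, void *buf, size_t count)
{
	it->type = DEMO_ITER_UBUF;
	it->ubuf = buf;
	it->count = count;
}

static bool demo_iter_is_ubuf(const struct demo_iter *it)
{
	return it->type == DEMO_ITER_UBUF;
}

static bool demo_iter_is_iovec(const struct demo_iter *it)
{
	return it->type == DEMO_ITER_IOVEC;
}

/* True for both user-backed flavours: use this to decide about dirtying pages. */
static bool demo_user_backed_iter(const struct demo_iter *it)
{
	return demo_iter_is_iovec(it) || demo_iter_is_ubuf(it);
}
```

A UBUF iterator passes the user-backed check but fails the iovec check, which is why code that pokes at the iovec array cannot simply swap one predicate for the other.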
  4. 21 Jul, 2022 2 commits
    • fuse: fix deadlock between atomic O_TRUNC and page invalidation · 2fdbb8dd
      Miklos Szeredi authored
      fuse_finish_open() will be called with FUSE_NOWRITE set in case of atomic
      O_TRUNC open(), so commit 76224355 ("fuse: truncate pagecache on
      atomic_o_trunc") replaced invalidate_inode_pages2() by truncate_pagecache()
      in such a case to avoid the A-A deadlock. However, we found another A-B-B-A
      deadlock related to the case above, which causes the xfstests generic/464
      testcase to hang in our virtio-fs test environment.
      
      For example, consider two processes concurrently opening the same file, one
      with O_TRUNC and the other without. The deadlock case is described below:
      if open(O_TRUNC) has already called fuse_set_nowrite() (acquired A) and is
      trying to lock a page (acquiring B), while open() already holds the page
      lock (acquired B) and is waiting on the page writeback (acquiring A), the
      two processes deadlock.
      
      open(O_TRUNC)
      ----------------------------------------------------------------
      fuse_open_common
        inode_lock            [C acquire]
        fuse_set_nowrite      [A acquire]
      
        fuse_finish_open
          truncate_pagecache
            lock_page         [B acquire]
            truncate_inode_page
            unlock_page       [B release]
      
        fuse_release_nowrite  [A release]
        inode_unlock          [C release]
      ----------------------------------------------------------------
      
      open()
      ----------------------------------------------------------------
      fuse_open_common
        fuse_finish_open
          invalidate_inode_pages2
            lock_page         [B acquire]
              fuse_launder_page
                fuse_wait_on_page_writeback [A acquire & release]
            unlock_page       [B release]
      ----------------------------------------------------------------
      
      Besides this case, all calls of invalidate_inode_pages2() and
      invalidate_inode_pages2_range() in fuse code also can deadlock with
      open(O_TRUNC).
      
      Fix by moving the truncate_pagecache() call outside the nowrite protected
      region.  The nowrite protection is only for delayed writeback
      (writeback_cache) case, where inode lock does not protect against
      truncation racing with writes on the server.  Write syscalls racing with
      page cache truncation still get the inode lock protection.
      
      This patch also changes the order of filemap_invalidate_lock()
      vs. fuse_set_nowrite() in fuse_open_common().  This new order matches the
      order found in fuse_file_fallocate() and fuse_do_setattr().
      Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
      Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
      Fixes: e4648309 ("fuse: truncate pending writes on O_TRUNC")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: write inode in fuse_release() · 035ff33c
      Miklos Szeredi authored
      A race between write(2) and close(2) allows pages to be dirtied after
      fuse_flush -> write_inode_now().  If these pages are not flushed from
      fuse_release(), then there might not be a writable open file later.  So any
      remaining dirty pages must be written back before the file is released.
      
      This is a partial revert of the blamed commit.
      
      Reported-by: syzbot+6e1efbd8efaaa6860e91@syzkaller.appspotmail.com
      Fixes: 36ea2337 ("fuse: write inode in fuse_vma_close() instead of fuse_release()")
      Cc: <stable@vger.kernel.org> # v5.16
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  5. 10 Jun, 2022 1 commit
  6. 09 May, 2022 1 commit
  7. 08 May, 2022 2 commits
  8. 22 Mar, 2022 1 commit
    • fuse: remove reliance on bdi congestion · 670d21c6
      NeilBrown authored
      The bdi congestion tracking is not widely used and will be removed.
      
      Fuse is one of a small number of filesystems that uses it, setting both
      the sync (read) and async (write) congestion flags at what it determines
      are appropriate times.
      
      The only remaining effect of the sync flag is to cause read-ahead to be
      skipped.  The only remaining effect of the async flag is to cause (some)
      WB_SYNC_NONE writes to be skipped.
      
      So instead of setting the flags, change:
      
       - .readahead to stop when it has submitted all non-async pages for
         read.
      
       - .writepages to do nothing if WB_SYNC_NONE and the flag would be set
      
       - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE and the
         flag would be set.
      
      The writepages change causes a behavioural change in that pageout() can
      now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will be
      called on the page which (I think) will further delay the next attempt at
      writeout.  This might be a good thing.
      
      Link: https://lkml.kernel.org/r/164549983737.9187.2627117501000365074.stgit@noble.brown
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Paolo Valente <paolo.valente@linaro.org>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
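The replacement behaviour for .writepages in the commit above ("do nothing if WB_SYNC_NONE and the flag would be set") can be sketched as a pure decision function. This is a simplified model, not the fuse implementation: the `demo_` names are hypothetical, and the condition used for "the flag would be set" (background requests at or above a threshold) is an assumption, since the commit message does not state it.

```c
#include <stdbool.h>

enum demo_sync_mode { DEMO_WB_SYNC_NONE, DEMO_WB_SYNC_ALL };

struct demo_conn {
	unsigned num_background;       /* background requests in flight */
	unsigned congestion_threshold; /* assumed level at which the old
					* congestion flag would have been set */
};

/* Stand-in for "the congestion flag would be set" (assumed condition). */
static bool demo_would_be_congested(const struct demo_conn *fc)
{
	return fc->num_background >= fc->congestion_threshold;
}

/*
 * Opportunistic writeback (WB_SYNC_NONE) is skipped while congested;
 * data-integrity writeback (WB_SYNC_ALL) always proceeds.
 */
static bool demo_skip_writepages(const struct demo_conn *fc,
				 enum demo_sync_mode mode)
{
	return mode == DEMO_WB_SYNC_NONE && demo_would_be_congested(fc);
}
```

The point of the change is that the check moves from a shared flag consulted by generic code into the filesystem's own entry points, so the decision is made where the state lives.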
  9. 15 Mar, 2022 2 commits
  10. 07 Mar, 2022 1 commit
    • fuse: fix pipe buffer lifetime for direct_io · 0c4bcfde
      Miklos Szeredi authored
      
      In FOPEN_DIRECT_IO mode, fuse_file_write_iter() calls
      fuse_direct_write_iter(), which normally calls fuse_direct_io(), which then
      imports the write buffer with fuse_get_user_pages(), which uses
      iov_iter_get_pages() to grab references to userspace pages instead of
      actually copying memory.
      
      On the filesystem device side, these pages can then either be read to
      userspace (via fuse_dev_read()), or splice()d over into a pipe using
      fuse_dev_splice_read() as pipe buffers with &nosteal_pipe_buf_ops.
      
      This is wrong because after fuse_dev_do_read() unlocks the FUSE request,
      the userspace filesystem can mark the request as completed, causing write()
      to return. At that point, the userspace filesystem should no longer have
      access to the pipe buffer.
      
      Fix by copying pages coming from the user address space to new pipe
      buffers.
      Reported-by: Jann Horn <jannh@google.com>
      Fixes: c3021629 ("fuse: support splice() reading from fuse device")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  11. 14 Dec, 2021 1 commit
    • fuse: enable per inode DAX · 93a497b9
      Jeffle Xu authored
      
      DAX may be limited in some specific situations. When the number of usable
      DAX windows is under the watermark, the reclaim routine will be triggered
      to reclaim some DAX windows. This may have a negative impact on
      performance, since some processes may need to wait for DAX windows to be
      reclaimed and reused. To mitigate the performance degradation, the
      overall DAX window needs to be expanded.
      
      However, simply expanding the DAX window may not be a good deal in some
      scenarios. To maintain one DAX window chunk (i.e., 2MB in size), a 32KB
      (512 * 64 bytes) memory footprint will be consumed for page descriptors
      inside the guest, which is greater than the memory footprint if the file
      uses the guest page cache with DAX disabled. Thus it is better to disable
      DAX for files smaller than 32KB, to reduce the demand for DAX windows and
      thus avoid the unworthy memory overhead.
      
      The per inode DAX feature is introduced to address this issue, by offering
      finer grained control over DAX to users, trying to achieve a balance
      between performance and memory overhead.
      
      The FUSE_ATTR_DAX flag in the FUSE_LOOKUP reply is used to indicate whether
      DAX should be enabled or not for the corresponding file. Currently the
      state of whether DAX is enabled or not for the file is initialized only
      when the inode is instantiated.
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
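The 32KB figure in the commit above follows from simple arithmetic, worked through here as a small C sketch. The 4096-byte page size is an assumption implied by the "512 * 64 bytes" in the message (2MB / 512 pages); the `demo_` helpers are hypothetical and exist only to show the calculation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bytes of page descriptors needed to map one DAX window chunk. */
static uint64_t demo_descriptor_bytes(uint64_t window_bytes,
				      uint64_t page_bytes,
				      uint64_t desc_bytes)
{
	return (window_bytes / page_bytes) * desc_bytes;
}

/*
 * Rule of thumb from the commit message: DAX is not worth it for files
 * smaller than the descriptor overhead of a single window chunk.
 */
static bool demo_dax_worthwhile(uint64_t file_bytes, uint64_t overhead_bytes)
{
	return file_bytes >= overhead_bytes;
}
```

For a 2MB chunk with 4096-byte pages and 64-byte descriptors this gives 512 * 64 = 32768 bytes, so a 16KB file would cost more in descriptors than it occupies in the page cache.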
  12. 07 Dec, 2021 1 commit
  13. 28 Oct, 2021 8 commits
  14. 25 Oct, 2021 1 commit
  15. 22 Oct, 2021 4 commits
  16. 18 Oct, 2021 1 commit
    • iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable · a6294593
      Andreas Gruenbacher authored
      
      Turn iov_iter_fault_in_readable into a function that returns the number
      of bytes not faulted in, similar to copy_to_user, instead of returning a
      non-zero value when any of the requested pages couldn't be faulted in.
      This supports the existing users that require all pages to be faulted in
      as well as new users that are happy if any pages can be faulted in.
      
      Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
      sure this change doesn't silently break things.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
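The new return convention in the commit above (number of bytes not faulted in) supports two kinds of callers, which can be sketched with a toy model. Everything here is hypothetical: `demo_fault_in_readable()` pretends that only the first `resident` bytes of a buffer can be faulted in, and is not the kernel function.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Toy model: only the first `resident` bytes can be faulted in.
 * Returns the number of bytes NOT faulted in, like copy_to_user().
 */
static size_t demo_fault_in_readable(size_t resident, size_t size)
{
	return size > resident ? size - resident : 0;
}

/* Caller that requires all pages: any leftover byte is a failure. */
static int demo_need_all(size_t resident, size_t size)
{
	return demo_fault_in_readable(resident, size) ? -EFAULT : 0;
}

/* Caller happy with partial progress: fail only if nothing was faulted in. */
static int demo_need_some(size_t resident, size_t size)
{
	return demo_fault_in_readable(resident, size) == size ? -EFAULT : 0;
}
```

Under the old boolean-style return, the second kind of caller could not tell "partly faulted in" from "nothing faulted in"; with a byte count both checks are one comparison.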
  17. 06 Sep, 2021 2 commits
    • fuse: remove unused arg in fuse_write_file_get() · a9667ac8
      Miklos Szeredi authored
      
      The struct fuse_conn argument is not used and can be removed.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: wait for writepages in syncfs · 660585b5
      Miklos Szeredi authored
      
      In case of fuse the MM subsystem doesn't guarantee that page writeback
      completes by the time ->sync_fs() is called.  This is because fuse
      completes page writeback immediately to prevent DoS of memory reclaim by
      the userspace file server.
      
      This means that fuse itself must ensure that writes are synced before
      sending the SYNCFS request to the server.
      
      Introduce sync buckets, that hold a counter for the number of outstanding
      write requests.  On syncfs replace the current bucket with a new one and
      wait until the old bucket's counter goes down to zero.
      
      It is possible to have multiple syncfs calls in parallel, in which case
      there could be more than one waited-on bucket.  Descendant buckets must
      not complete until the parent completes.  Add a count to the child (new)
      bucket until the (parent) old bucket completes.
      
      Use RCU protection to dereference the current bucket and to wake up an
      emptied bucket.  Use fc->lock to protect against parallel assignments to
      the current bucket.
      
      This leaves just the counter to be a possible scalability issue.  The
      fc->num_waiting counter has a similar issue, so both should be addressed at
      the same time.
      Reported-by: Amir Goldstein <amir73il@gmail.com>
      Fixes: 2d82ab25 ("virtiofs: propagate sync() to file server")
      Cc: <stable@vger.kernel.org> # v5.14
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
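The sync bucket scheme in the commit above can be illustrated with a single-threaded toy model. This is a sketch under heavy simplification, not the kernel code: all `demo_` names are hypothetical, there is no RCU, locking or sleeping, the caller supplies bucket storage, and a new bucket is swapped in on every syncfs even when the old one has no writes.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_bucket {
	int count;                 /* outstanding writes plus holds */
	struct demo_bucket *child; /* bucket that replaced this one */
};

struct demo_conn { struct demo_bucket *curr; };

/* A fresh bucket holds one count for as long as it is the current one. */
static void demo_bucket_init(struct demo_bucket *b)
{
	b->count = 1;
	b->child = NULL;
}

static void demo_conn_init(struct demo_conn *fc, struct demo_bucket *first)
{
	demo_bucket_init(first);
	fc->curr = first;
}

/* Drop one count; an emptied bucket releases its hold on its child. */
static void demo_bucket_put(struct demo_bucket *b)
{
	if (--b->count == 0 && b->child)
		demo_bucket_put(b->child);
}

/* A write is accounted to whichever bucket is current when it starts. */
static struct demo_bucket *demo_write_start(struct demo_conn *fc)
{
	fc->curr->count++;
	return fc->curr;
}

static void demo_write_end(struct demo_bucket *b)
{
	demo_bucket_put(b);
}

/* syncfs: swap in `next` and return the old bucket to be waited on. */
static struct demo_bucket *demo_syncfs_begin(struct demo_conn *fc,
					     struct demo_bucket *next)
{
	struct demo_bucket *old = fc->curr;

	demo_bucket_init(next);
	next->count++;        /* held until the parent (old) completes */
	old->child = next;
	fc->curr = next;
	demo_bucket_put(old); /* old is no longer current */
	return old;
}

static bool demo_bucket_done(const struct demo_bucket *b)
{
	return b->count == 0;
}
```

In this model a second syncfs issued while the first is still waiting gets a bucket that cannot reach zero until the first bucket does, even if its own writes finish earlier, which is the "descendant must not complete until the parent completes" rule.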
  18. 31 Aug, 2021 1 commit
    • fuse: flush extending writes · 59bda8ec
      Miklos Szeredi authored
      Callers of fuse_writeback_range() assume that the file is ready for
      modification by the server in the supplied byte range after the call
      returns.
      
      If there's a write that extends the file beyond the end of the supplied
      range, then the file needs to be extended to at least the end of the range,
      but currently that's not done.
      
      There are at least two cases where this can cause problems:
      
       - copy_file_range() will return short count if the file is not extended
         up to end of the source range.
      
       - FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE will not extend the file,
         hence the region may not be fully allocated.
      
      Fix by flushing writes from the start of the range up to the end of the
      file.  This could be optimized if the writes are non-extending, etc, but
      it's probably not worth the trouble.
      
      Fixes: a2bc9236 ("fuse: fix copy_file_range() in the writeback case")
      Fixes: 6b1bdb56 ("fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)")
      Cc: <stable@vger.kernel.org>  # v5.2
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  19. 17 Aug, 2021 1 commit
    • fuse: truncate pagecache on atomic_o_trunc · 76224355
      Miklos Szeredi authored
      
      fuse_finish_open() will be called with FUSE_NOWRITE in case of atomic
      O_TRUNC.  This can deadlock with fuse_wait_on_page_writeback() in
      fuse_launder_page() triggered by invalidate_inode_pages2().
      
      Fix by replacing invalidate_inode_pages2() in fuse_finish_open() with a
      truncate_pagecache() call.  This makes sense regardless of FOPEN_KEEP_CACHE
      or fc->writeback_cache, so do it unconditionally.
      Reported-by: Xie Yongji <xieyongji@bytedance.com>
      Reported-and-tested-by: syzbot+bea44a5189836d956894@syzkaller.appspotmail.com
      Fixes: e4648309 ("fuse: truncate pending writes on O_TRUNC")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  20. 13 Jul, 2021 1 commit
  21. 22 Jun, 2021 3 commits
  22. 10 Jun, 2021 1 commit