1. 24 Apr, 2019 7 commits
    • fuse: Add FOPEN_STREAM to use stream_open() · bbd84f33
      Kirill Smelkov authored
      Starting from commit 9c225f26 ("vfs: atomic f_pos accesses as per
      POSIX"), files opened even via nonseekable_open gate read and write behind
      a lock and do not allow them to run simultaneously. This can create a
      read vs write deadlock if a filesystem is trying to implement a
      socket-like file that is intended to be used simultaneously for both
      read and write by a filesystem client.  See commit 10dce8af ("fs:
      stream_open - opener for stream-like files so that read and write can
      run simultaneously without deadlock") for details and e.g. commit
      581d21a2 ("xenbus: fix deadlock on writes to /proc/xen/xenbus") for a
      similar deadlock example on /proc/xen/xenbus.
      
      To avoid such a deadlock it was tempting to adjust fuse_finish_open to use
      stream_open instead of nonseekable_open whenever FOPEN_NONSEEKABLE is set,
      but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
      and in particular GVFS, which actually uses the offset in its read and
      write handlers:
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
      so such a change would break a real user.
      
      Add another flag (FOPEN_STREAM) for filesystem servers to indicate that the
      opened handle has stream-like semantics: it does not use file position,
      and thus the kernel is free to issue simultaneous read and write requests
      on the opened file handle.
      
      This patch together with stream_open() should be added to stable kernels
      from v3.14 onwards. This will allow patching OSSPD and other FUSE
      filesystems that provide stream-like files to return FOPEN_STREAM |
      FOPEN_NONSEEKABLE in the open handler and this way avoid the deadlock on
      all kernel versions. This should work because fuse_finish_open ignores
      unknown open flags returned from a filesystem, and so passing FOPEN_STREAM
      to a kernel that is not aware of this flag cannot hurt. In turn, a kernel
      that is not aware of FOPEN_STREAM will be < v3.14, where just
      FOPEN_NONSEEKABLE is sufficient to implement streams without the read vs
      write deadlock.
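
A minimal userspace sketch of the flag handling described above, for illustration only. The two flag values are the ones from include/uapi/linux/fuse.h; the enum, the function names and the kernel_knows_stream parameter are hypothetical and merely model the behaviour, they are not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as in include/uapi/linux/fuse.h. */
#define FOPEN_NONSEEKABLE (1u << 2)
#define FOPEN_STREAM      (1u << 4)

/* Hypothetical model of how the opened file ends up being treated. */
enum open_mode { MODE_REGULAR, MODE_NONSEEKABLE, MODE_STREAM };

/*
 * Model of the kernel-side choice. kernel_knows_stream is 0 for a kernel
 * without this patch; such a kernel ignores the unknown FOPEN_STREAM bit.
 */
static enum open_mode choose_open_mode(uint32_t open_flags,
				       int kernel_knows_stream)
{
	if (kernel_knows_stream && (open_flags & FOPEN_STREAM))
		return MODE_STREAM;
	if (open_flags & FOPEN_NONSEEKABLE)
		return MODE_NONSEEKABLE;
	return MODE_REGULAR;
}

static void demo_stream_flags(void)
{
	/* What a stream-like server would return from its open handler. */
	uint32_t flags = FOPEN_STREAM | FOPEN_NONSEEKABLE;

	/* Patched kernel: read and write may run simultaneously. */
	assert(choose_open_mode(flags, 1) == MODE_STREAM);
	/* Unaware kernel: falls back to plain nonseekable behaviour. */
	assert(choose_open_mode(flags, 0) == MODE_NONSEEKABLE);
}
```

Returning both flags lets one server binary behave sensibly on patched and unpatched kernels alike.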
      
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: require /dev/fuse reads to have enough buffer capacity · d4b13963
      Kirill Smelkov authored
      A FUSE filesystem server queues /dev/fuse sys_read calls to get
      filesystem requests to handle. It does not know in advance what that
      request will be, as it can be anything the client issues - LOOKUP, READ,
      WRITE, ... Many requests are short and retrieve data from the
      filesystem. However, WRITE and NOTIFY_REPLY write data into the filesystem.
      
      Before getting into the operation phase, the FUSE filesystem server and
      the kernel client negotiate the maximum write size the client will
      ever issue. After negotiation the contract between server and client is
      that the filesystem server should queue /dev/fuse sys_read calls with
      enough buffer capacity to receive any client request, WRITE in
      particular, while the FUSE client should not send WRITE requests with
      a payload > negotiated max_write. The FUSE client in the kernel and
      libfuse historically reserve 4K for the request header. This way the
      contract is that the filesystem server should queue sys_reads with a
      4K+max_write buffer.
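
The buffer-size side of that contract can be sketched as follows. This is only an illustration of the arithmetic; HEADER_RESERVE and read_buffer_size are hypothetical names, not libfuse or kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Space historically reserved for the request header (4K). */
#define HEADER_RESERVE 4096u

/* Buffer capacity a server should pass to each read on /dev/fuse. */
static size_t read_buffer_size(size_t max_write)
{
	return HEADER_RESERVE + max_write;
}

static void demo_read_buffer_size(void)
{
	/* 64K max_write -> 68K buffer */
	assert(read_buffer_size(64 * 1024) == 68 * 1024);
	/* 128K max_write -> 132K buffer */
	assert(read_buffer_size(128 * 1024) == 132 * 1024);
}
```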
      
      If the filesystem server does not follow this contract, what can happen
      is that fuse_dev_do_read sees that the request size is > buffer size,
      and then returns EIO to the client that issued the request, but does not
      indicate in any way to the filesystem server that there is a problem.
      This can be hard to diagnose because for some requests, e.g. for
      NOTIFY_REPLY which mimics WRITE, there is no client thread waiting
      for request completion and that EIO goes nowhere, while on the
      filesystem server side things look as if the kernel is not replying
      after a successful NOTIFY_RETRIEVE request made by the server.
      
      We can make the problem easy to diagnose if we indicate via error return to
      the filesystem server when it is violating the contract.  This should not
      cause problems in practice because if a filesystem server is using a
      shorter buffer, writes to it were already very likely to cause EIO, and if
      the filesystem is read-only it should also be following the
      FUSE_MIN_READ_BUFFER minimum buffer size.
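
A userspace model of such a check, as a sketch under stated assumptions: the message above does not name the error code, so -EINVAL is assumed here; FUSE_MIN_READ_BUFFER is taken as 8192; and the two header sizes stand in for sizeof(struct fuse_in_header) and sizeof(struct fuse_write_in). check_read_buffer is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define FUSE_MIN_READ_BUFFER 8192u /* assumed value of the uapi constant */
#define IN_HEADER_SIZE 40u         /* stand-in for the request header size */
#define WRITE_IN_SIZE 40u          /* stand-in for the WRITE argument size */

/* Reject a read whose buffer cannot hold the largest possible request. */
static int check_read_buffer(size_t nbytes, size_t max_write)
{
	size_t need = IN_HEADER_SIZE + WRITE_IN_SIZE + max_write;

	if (need < FUSE_MIN_READ_BUFFER)
		need = FUSE_MIN_READ_BUFFER;
	return nbytes < need ? -EINVAL : 0;
}

static void demo_check_read_buffer(void)
{
	/* A 4K+max_write buffer satisfies the contract. */
	assert(check_read_buffer(4096 + 65536, 65536) == 0);
	/* A buffer of exactly max_write bytes has no room for the header. */
	assert(check_read_buffer(65536, 65536) == -EINVAL);
}
```

With an explicit error on the read itself, the server learns about the short buffer immediately instead of observing a missing reply later.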
      
      Please see [1] for context where the problem of a stuck filesystem was hit
      for real (because the kernel client was incorrectly sending more than
      max_write data with NOTIFY_REPLY; see also the previous patch), how the
      situation was traced, and for a more involved patch that did not make it
      into the tree.
      
      [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
      
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: retrieve: cap requested size to negotiated max_write · 7640682e
      Kirill Smelkov authored
      The FUSE filesystem server and the kernel client negotiate, during the
      initialization phase, the maximum write size the client will ever issue.
      Correspondingly the filesystem server then queues sys_read calls to read
      requests with buffer capacity large enough to carry the request header +
      that max_write bytes. A filesystem server is free to set its max_write
      anywhere in the range [1*page, fc->max_pages*page]. In particular
      go-fuse[2] sets max_write by default to 64K, whereas the default
      fc->max_pages corresponds to 128K. Libfuse also allows users to configure
      max_write, but by default presets it to the possible maximum.
      
      If max_write is < fc->max_pages*page, and in the NOTIFY_RETRIEVE handler we
      allow retrieving more than max_write bytes, the corresponding prepared
      NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the
      filesystem server, in full accordance with the server/client contract,
      will only be queuing sys_read with ~max_write buffer capacity, and
      fuse_dev_do_read throws away requests that cannot fit into the server
      request buffer. In turn the filesystem server could get stuck waiting
      indefinitely for NOTIFY_REPLY, since the NOTIFY_RETRIEVE handler returned
      OK, which is understood by clients to mean that NOTIFY_REPLY was queued
      and will be sent back.
      
      Cap the requested size to the negotiated max_write to avoid the problem.
      This aligns with the way the NOTIFY_RETRIEVE handler works, which already
      unconditionally caps the requested retrieve size to fuse_conn->max_pages.
      This way it should not hurt NOTIFY_RETRIEVE semantics if we return less
      data than was originally requested.
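
The capping can be modelled in userspace roughly as below. This is a sketch, not the kernel handler: retrieve_size and its parameters are hypothetical names, and the clamping to the file size is an assumption about how the handler bounds the range.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bytes a retrieve reply would carry after capping. */
static uint32_t retrieve_size(uint32_t requested, uint64_t offset,
			      uint64_t file_size, uint32_t max_write)
{
	/* Never prepare more than the server's read buffer can take. */
	uint64_t num = requested < max_write ? requested : max_write;

	/* Assumed: the range is also clamped to the end of the file. */
	if (offset > file_size)
		num = 0;
	else if (offset + num > file_size)
		num = file_size - offset;
	return (uint32_t)num;
}

static void demo_retrieve_size(void)
{
	uint64_t one_mib = 1024 * 1024;

	/* 128K requested with a 64K max_write: reply is capped to 64K. */
	assert(retrieve_size(128 * 1024, 0, one_mib, 64 * 1024) == 64 * 1024);
	/* Small requests pass through unchanged. */
	assert(retrieve_size(4096, 0, one_mib, 64 * 1024) == 4096);
}
```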
      
      Please see [1] for context where the problem of a stuck filesystem was hit
      for real, how the situation was traced, and for a more involved patch that
      did not make it into the tree.
      
      [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
      [2] https://github.com/hanwen/go-fuse
      
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: allow filesystems to have precise control over data cache · ad2ba64d
      Kirill Smelkov authored
      On networked filesystems file data can be changed externally.  FUSE
      provides notification messages for a filesystem to inform the kernel that
      the metadata or a data region of a file needs to be invalidated in the
      local page cache. That provides the basis for filesystem implementations
      to invalidate the kernel cache explicitly based on observed
      filesystem-specific events.
      
      FUSE also has an "automatic" invalidation mode(*) in which the kernel
      automatically invalidates the data cache of a file if it sees an mtime
      change.  It also automatically invalidates the whole data cache of a file
      if it sees the file size change.
      
      The automatic mode has a corresponding capability - FUSE_AUTO_INVAL_DATA.
      However, probably for historical reasons, that capability controls only
      whether an mtime change should result in automatic invalidation or
      not. A change in file size always results in invalidating the whole data
      cache of a file regardless of whether FUSE_AUTO_INVAL_DATA was
      negotiated(+).
      
      The filesystem I write[1] represents data arrays stored in a networked
      database as local files suitable for mmap. It is a read-only filesystem -
      changes to data are committed externally via database interfaces and the
      filesystem only glues data into contiguous file streams suitable for mmap
      and traditional array processing. The files are big - hundreds of
      gigabytes and more. The files change regularly, frequently by data being
      appended to their end. The size of the files thus changes frequently.
      
      If a file was accessed locally and some part of its data got into the page
      cache, we want that data to stay cached unless there is memory pressure, or
      unless the corresponding part of the file was actually changed. However,
      current FUSE behaviour - when it sees a file size change - is to invalidate
      the whole file. The data cache of the file is thus completely lost even on
      a small size change, even though the filesystem server is careful to
      accurately translate database changes into FUSE invalidation messages to
      the kernel.
      
      Let's fix it: if a filesystem, through the new FUSE_EXPLICIT_INVAL_DATA
      capability, indicates to the kernel that it is fully responsible for data
      cache invalidation, then the kernel won't invalidate the file's data cache
      on size change and will only truncate that cache to the new size in case
      the size decreased.
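
The resulting decision can be summarised with a small userspace model. It is a sketch of the behaviour described in this message only: the enum, struct and function names are hypothetical, and writeback mode (see the (+) note) is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum cache_action { CACHE_KEEP, CACHE_TRUNCATE, CACHE_INVALIDATE_ALL };

struct conn_caps {
	bool auto_inval_data;     /* FUSE_AUTO_INVAL_DATA negotiated */
	bool explicit_inval_data; /* FUSE_EXPLICIT_INVAL_DATA negotiated */
};

/* What happens to a file's data cache when new attributes arrive. */
static enum cache_action on_attr_update(struct conn_caps caps,
					uint64_t old_size, uint64_t new_size,
					bool mtime_changed)
{
	if (old_size != new_size) {
		if (!caps.explicit_inval_data)
			return CACHE_INVALIDATE_ALL;
		/* Server is in charge: only drop what lies past the new end. */
		return new_size < old_size ? CACHE_TRUNCATE : CACHE_KEEP;
	}
	if (caps.auto_inval_data && mtime_changed)
		return CACHE_INVALIDATE_ALL;
	return CACHE_KEEP;
}

static void demo_on_attr_update(void)
{
	struct conn_caps legacy = { false, false };
	struct conn_caps explicit_caps = { false, true };

	/* Append without the capability: the whole cache is lost. */
	assert(on_attr_update(legacy, 100, 200, false) == CACHE_INVALIDATE_ALL);
	/* Append with the capability: cached data stays. */
	assert(on_attr_update(explicit_caps, 100, 200, false) == CACHE_KEEP);
	/* Shrink with the capability: cache is truncated to the new size. */
	assert(on_attr_update(explicit_caps, 200, 100, false) == CACHE_TRUNCATE);
}
```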
      
      (*) see 72d0d248 "fuse: add FUSE_AUTO_INVAL_DATA init flag",
      eed2179e "fuse: invalidate inode mapping if mtime changes"
      
      (+) in writeback mode the kernel does not invalidate the data cache on
      file size change, but neither does it allow the filesystem to set the size
      due to an external event (see 8373200b "fuse: Trust kernel i_size only")
      
      [1] https://lab.nexedi.com/kirr/wendelin.core/blob/a50f1d9f/wcfs/wcfs.go#L20
      
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: convert printk -> pr_* · f2294482
      Kirill Smelkov authored
      Functions like pr_err are a more modern way of printing than
      printk. They can be used to denoise sources by putting the needed level
      in the print function name, and by automatically inserting a per-driver /
      function / ... print prefix as defined by the pr_fmt macro. pr_* are also
      mentioned in Documentation/process/coding-style.rst, and more
      recent code - for example overlayfs - uses them instead of printk.
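
How a pr_fmt prefix ends up in every message can be shown with a userspace stand-in. This is a sketch, not the kernel macros: pr_err_buf is a hypothetical helper that formats into a caller-supplied char array instead of the kernel log, the second message is made up for the demo, and `##__VA_ARGS__` is a GNU extension accepted by gcc and clang.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Per-file prefix, in the style of the kernel's pr_fmt convention. */
#define pr_fmt(fmt) "fuse: " fmt

/* Stand-in for pr_err(): buf must be a char array, not a pointer. */
#define pr_err_buf(buf, fmt, ...) \
	snprintf((buf), sizeof(buf), pr_fmt(fmt), ##__VA_ARGS__)

static void demo_pr_fmt(void)
{
	char buf[64];

	/* The prefix is added without repeating it at each call site. */
	pr_err_buf(buf, "trying to steal weird page\n");
	assert(strcmp(buf, "fuse: trying to steal weird page\n") == 0);

	/* Hypothetical message with arguments. */
	pr_err_buf(buf, "bad value %d\n", 7);
	assert(strcmp(buf, "fuse: bad value 7\n") == 0);
}
```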
      
      Convert CUSE and FUSE to use the new pr_* functions.
      
      CUSE output stays completely unchanged, while FUSE output is amended a
      bit for "trying to steal weird page" warning - the second line now comes
      also with "fuse:" prefix. I hope it is ok.
      
      Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: honor RLIMIT_FSIZE in fuse_file_fallocate · 0cbade02
      Liu Bo authored
      fstests generic/228 reported a failure: fuse fallocate does not
      honor the limit that 'ulimit -f' has set.
      
      This adds the necessary inode_newsize_ok() check.
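
A userspace model of that check, as a sketch only. newsize_ok stands in for inode_newsize_ok() and covers just the RLIMIT_FSIZE part; fallocate_check and its parameters are hypothetical names, and skipping the check for KEEP_SIZE or non-growing requests is an assumption about where the check applies.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define KEEP_SIZE 0x01 /* stands in for FALLOC_FL_KEEP_SIZE */

/* Reject a new file size above the process file size limit. */
static int newsize_ok(uint64_t new_size, uint64_t fsize_limit)
{
	return new_size > fsize_limit ? -EFBIG : 0;
}

/* Only requests that would grow the file are checked against the limit. */
static int fallocate_check(int mode, uint64_t offset, uint64_t length,
			   uint64_t cur_size, uint64_t fsize_limit)
{
	if (!(mode & KEEP_SIZE) && offset + length > cur_size)
		return newsize_ok(offset + length, fsize_limit);
	return 0;
}

static void demo_fallocate_check(void)
{
	/* Growing an empty file to 2048 bytes under a 1024-byte limit fails. */
	assert(fallocate_check(0, 0, 2048, 0, 1024) == -EFBIG);
	/* Staying within the limit succeeds. */
	assert(fallocate_check(0, 0, 512, 0, 1024) == 0);
}
```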
      
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Fixes: 05ba1f08 ("fuse: add FALLOCATE operation")
      Cc: <stable@vger.kernel.org> # v3.5
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: fix writepages on 32bit · 9de5be06
      Miklos Szeredi authored
      Writepage requests were cropped to i_size & 0xffffffff, which meant that
      mmapped writes to any file larger than 4G might be silently discarded.
      
      Fix by storing the file size in a properly sized variable (loff_t instead
      of size_t).
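
The effect of the too-narrow variable can be reproduced in userspace. This is a sketch of the arithmetic only: crop_len and the two wrappers are hypothetical names, with uint32_t playing the role of size_t on a 32-bit machine and int64_t the role of loff_t.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of a write at page_off that fall inside a file of the given size. */
static uint64_t crop_len(uint64_t page_off, uint64_t len, uint64_t size)
{
	if (page_off >= size)
		return 0;
	return page_off + len > size ? size - page_off : len;
}

/* Buggy variant: the file size is squeezed into 32 bits first. */
static uint64_t crop_len_32(uint64_t page_off, uint64_t len, int64_t i_size)
{
	uint32_t size = (uint32_t)i_size; /* same as i_size & 0xffffffff */

	return crop_len(page_off, len, size);
}

/* Fixed variant: the file size keeps its full width. */
static uint64_t crop_len_64(uint64_t page_off, uint64_t len, int64_t i_size)
{
	return crop_len(page_off, len, (uint64_t)i_size);
}

static void demo_crop(void)
{
	int64_t five_gib = INT64_C(5) << 30;
	uint64_t two_gib = UINT64_C(2) << 30;

	/* 5G truncates to 1G, so a page at 2G looks past EOF and is dropped. */
	assert(crop_len_32(two_gib, 4096, five_gib) == 0);
	/* With the full-width size the page is written in full. */
	assert(crop_len_64(two_gib, 4096, five_gib) == 4096);
}
```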
      
      Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
      Fixes: 6eaf4782 ("fuse: writepages: crop secondary requests")
      Cc: <stable@vger.kernel.org> # v3.13
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  2. 21 Apr, 2019 1 commit
  3. 20 Apr, 2019 11 commits
  4. 19 Apr, 2019 21 commits