1. 26 Mar, 2019 1 commit
    • Kirill Smelkov's avatar
      fs: stream_open - opener for stream-like files so that read and write can run... · c44fcf87
      Kirill Smelkov authored
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      
      Commit 9c225f26 (vfs: atomic f_pos accesses as per POSIX) added locking for
      file.f_pos access and in particular made concurrent read and write impossible:
      both functions now take the f_pos lock for the whole run, so if e.g. a read is
      blocked waiting for data, a write will deadlock waiting for that read to
      complete. This caused a regression for stream-like files, where read and write
      could previously run simultaneously but can no longer do so after that patch.
      See e.g. 581d21a2 (xenbus: fix deadlock on writes to /proc/xen/xenbus), which
      fixes this regression for the particular case of /proc/xen/xenbus.
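The blocking dependency described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `pos_file`, `model_read_begin`, `model_read_end` and `model_write_would_block` are hypothetical names, and a plain flag stands in for the mutex that the kernel actually uses.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a file with an f_pos lock. */
struct pos_file {
	bool atomic_pos;	/* true: read/write serialize on the position lock */
	bool pos_locked;	/* true while some operation holds that lock */
};

/* A read takes the position lock (if the file uses one) for its whole run. */
void model_read_begin(struct pos_file *f)
{
	if (f->atomic_pos)
		f->pos_locked = true;
}

/* The lock is released only when the read finally completes. */
void model_read_end(struct pos_file *f)
{
	if (f->atomic_pos)
		f->pos_locked = false;
}

/* A write needs the same lock, so it would block while a read holds it. */
bool model_write_would_block(const struct pos_file *f)
{
	return f->atomic_pos && f->pos_locked;
}
```

If the read in this model never ends because it waits for an external event, a write on a file with `atomic_pos` set can never start, which is the deadlock pattern in question.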
      
      The patch that added the f_pos lock in 2014 (see https://lkml.org/lkml/2014/2/17/324
      for background discussion) did so to guarantee POSIX thread safety for
      read/write/lseek and added the locking to file descriptors of all regular
      files. In 2014 that thread-safety problem was not new, as it had already been
      discussed in 2006: https://lwn.net/Articles/180387. However, even though the 2006
      version of Linus's patch (https://lwn.net/Articles/180396) added f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding
      the stream-like objects like pipes and sockets)", the 2014 version - the one that
      actually made it into the tree as 9c225f26 - does so regardless of whether
      a file is seekable or not. The reason is, probably, that there are
      many files that are marked non-seekable, but whose read implementation
      actually depends on knowing the current position to handle the read correctly.
      Some examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
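To make the distinction concrete, here is a minimal userspace sketch of the two kinds of read handler. The names `pos_read` and `stream_read` and their signatures are hypothetical and are not taken from any of the files listed above; they only contrast a read that relies on the position with one that ignores it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model: a read that depends on the current position.
 * It copies from a fixed buffer starting at *ppos and advances *ppos,
 * so it returns correct data only if the position is tracked reliably. */
size_t pos_read(char *dst, size_t count, long *ppos,
		const char *src, size_t src_len)
{
	size_t pos, n;

	if (*ppos < 0 || (size_t)*ppos >= src_len)
		return 0;		/* at or past the end: nothing to read */
	pos = (size_t)*ppos;
	n = src_len - pos;
	if (n > count)
		n = count;
	memcpy(dst, src + pos, n);
	*ppos += (long)n;		/* the handler relies on this update */
	return n;
}

/* Hypothetical model: a pure stream read. It consumes whatever is queued
 * and never looks at a file position. */
size_t stream_read(char *dst, size_t count, char *queue, size_t *queued)
{
	size_t n = *queued < count ? *queued : count;

	memcpy(dst, queue, n);
	memmove(queue, queue + n, *queued - n);	/* drop the consumed bytes */
	*queued -= n;
	return n;
}
```

Only handlers of the first kind need the position to be protected against concurrent access.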
      
      Despite that, many nonseekable_open users implement read and write with pure
      stream semantics - they don't depend on the passed ppos at all. For those cases
      where read can wait for something inside, this creates a situation similar to
      xenbus: the write cannot proceed until the read is done, and the read is
      waiting for some, potentially external, event for a potentially unbounded time
      -> deadlock. Besides xenbus, there are 14 such places in the kernel that I've
      found with a semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above, another regression caused by f_pos locking is
      that FUSE filesystems that implement open with the FOPEN_NONSEEKABLE flag can
      no longer implement bidirectional stream-like files - for the same reason
      as above, e.g. a read can deadlock a write on the file.f_pos lock in the kernel.
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 (fuse: implement
      nonseekable open) to support OSSPD (https://github.com/libfuse/osspd;
      https://lwn.net/Articles/308445). OSSPD implements /dev/dsp in userspace with
      the FOPEN_NONSEEKABLE flag, with the corresponding read and write routines not
      depending on the current position at all, and with both read and write being
      potentially blocking operations:
      
      	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
      	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
      	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      The corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with a read handler that does not use the offset.
      However, that test implements only read without write and cannot exercise the
      deadlock scenario:
      
      	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
      	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
      	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing my
      FUSE filesystem, where there is a /head/watch file for which open creates a
      separate bidirectional socket-like stream between the filesystem and its user,
      with both read and write later being performed simultaneously. There it is
      semantically not easy to split the stream into two separate read-only and
      write-only channels:
      
      	https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would
         break many in-kernel nonseekable_open users which actually use ppos in
         read/write handlers.
      
      2. Add stream_open() to the kernel to open stream-like non-seekable file descriptors.
         Read and write on such file descriptors would never use or change ppos. With
         that property, read and write on stream-like files will run without
         taking the f_pos lock - i.e. read and write can run simultaneously.
      
      3. With a semantic patch, search for and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not depend on ppos and
         where there are no other methods in file_operations which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file descriptors via stream_open
         if that bit is present in the filesystem's open reply.
      
         It was tempting to change the fs/fuse/ open handler to use stream_open instead of
         nonseekable_open whenever just FOPEN_NONSEEKABLE is set, but grepping through Debian
         codesearch shows users of FOPEN_NONSEEKABLE, in particular GVFS, which actually
         uses the offset in its read and write handlers:
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
         so such a change would break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from
         v3.14 (the kernel where 9c225f26 first appeared). This will allow patching
         OSSPD and other FUSE filesystems that provide stream-like files to return
         FOPEN_STREAM | FOPEN_NONSEEKABLE in the open handler and this way avoid the deadlock on
         all kernel versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem, so passing FOPEN_STREAM to a kernel that
         is not aware of this flag cannot hurt. In turn, a kernel that is not aware of
         FOPEN_STREAM will be < v3.14, where just FOPEN_NONSEEKABLE is sufficient to
         implement streams without the read vs write deadlock.
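The difference between the two openers in the plan can be sketched as a userspace model of the mode bits. This is a simplified illustration, not the kernel implementation: the `MODEL_` bit values are placeholders, `mode_file` is a hypothetical stand-in for the kernel file object, and only the two mode bits named in this message are modelled.

```c
#include <stdbool.h>

/* Placeholder bit values for illustration; the kernel's values differ. */
#define MODEL_FMODE_LSEEK      0x1u
#define MODEL_FMODE_ATOMIC_POS 0x2u

/* Hypothetical stand-in for the kernel's file object. */
struct mode_file {
	unsigned int f_mode;
};

/* Models nonseekable_open: the file stops being seekable, but position
 * locking stays on because some handlers still use ppos. */
int model_nonseekable_open(struct mode_file *filp)
{
	filp->f_mode &= ~MODEL_FMODE_LSEEK;
	return 0;
}

/* Models the proposed stream_open: additionally drop position locking,
 * which is safe only if read and write never use or change ppos. */
int model_stream_open(struct mode_file *filp)
{
	filp->f_mode &= ~(MODEL_FMODE_LSEEK | MODEL_FMODE_ATOMIC_POS);
	return 0;
}

/* Whether read and write would serialize on the position lock. */
bool model_takes_pos_lock(const struct mode_file *filp)
{
	return (filp->f_mode & MODEL_FMODE_ATOMIC_POS) != 0;
}
```

Keeping the two openers separate is what lets existing nonseekable_open users that still rely on ppos stay unchanged.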
      
      This patch: adds stream_open, converts /proc/xen/xenbus to it and adds semantic
      patch to automatically locate in-kernel places that are either required to be
      converted due to read vs write deadlock, or that are just safe to be converted
      because read and write do not use ppos and there are no other funky methods in
      file_operations.
      
      Followup patches are:
      
      - apply the result of semantic patch;
      - add FOPEN_STREAM to fs/fuse.
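The compatibility argument for FOPEN_STREAM can be sketched as follows. This is a hedged userspace model, not fs/fuse code: the `MODEL_FOPEN_` values are placeholders, and `model_new_kernel_open` / `model_old_kernel_open` are hypothetical names for how a flag-aware and a flag-unaware kernel would choose an opener from the open reply.

```c
/* Placeholder flag values for illustration only. */
#define MODEL_FOPEN_NONSEEKABLE (1u << 0)
#define MODEL_FOPEN_STREAM      (1u << 1)

enum model_opener {
	MODEL_OPEN_DEFAULT,
	MODEL_OPEN_NONSEEKABLE,
	MODEL_OPEN_STREAM,
};

/* A kernel that knows FOPEN_STREAM prefers it over FOPEN_NONSEEKABLE. */
enum model_opener model_new_kernel_open(unsigned int open_flags)
{
	if (open_flags & MODEL_FOPEN_STREAM)
		return MODEL_OPEN_STREAM;
	if (open_flags & MODEL_FOPEN_NONSEEKABLE)
		return MODEL_OPEN_NONSEEKABLE;
	return MODEL_OPEN_DEFAULT;
}

/* A kernel unaware of FOPEN_STREAM ignores the unknown bit and still
 * honours FOPEN_NONSEEKABLE. */
enum model_opener model_old_kernel_open(unsigned int open_flags)
{
	if (open_flags & MODEL_FOPEN_NONSEEKABLE)
		return MODEL_OPEN_NONSEEKABLE;
	return MODEL_OPEN_DEFAULT;
}
```

A filesystem that returns both flags therefore gets a stream opener where it is available and a non-seekable opener elsewhere, while a filesystem that returns only FOPEN_NONSEEKABLE keeps its current behaviour.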
      
      Regarding the semantic patch: I've manually verified each generated change - that
      it is correct to convert - and each remaining nonseekable_open instance - that it
      is either not correct to convert, or is not converted due to current
      stream_open.cocci limitations. The script also does not convert files that should
      be valid to convert but that currently have .llseek = noop_llseek or
      generic_file_llseek for an unknown reason, despite the file being opened with
      nonseekable_open (e.g. drivers/input/mousedev.c).
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
      c44fcf87
  2. 28 Feb, 2019 30 commits
    • David Howells's avatar
      afs: Use fs_context to pass parameters over automount · c99c2171
      David Howells authored
      Alter the AFS automounting code to create and modify an fs_context struct
      when parameterising a new mount triggered by an AFS mountpoint rather than
      constructing device name and option strings.
      
      Also remove the cell=, vol= and rwpath options as they are then redundant.
      The reason they existed is because the 'device name' may be derived
      literally from a mountpoint object in the filesystem, so default cell and
      parent-type information needed to be passed in by some other method from
      the automount routines.  The vol= option didn't end up being used.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Eric W. Biederman <ebiederm@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c99c2171
    • David Howells's avatar
      afs: Add fs_context support · 13fcc683
      David Howells authored
      Add fs_context support to the AFS filesystem, converting the parameter
      parsing to store options there.
      
      This will form the basis for namespace propagation over mountpoints within
      the AFS model, thereby allowing AFS to be used in containers more easily.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      13fcc683
    • David Howells's avatar
      vfs: Add some logging to the core users of the fs_context log · 06a2ae56
      David Howells authored
      Add some logging to the core users of the fs_context log so that
      information can be extracted from them as to the reason for failure.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      06a2ae56
    • David Howells's avatar
      vfs: Implement logging through fs_context · e7582e16
      David Howells authored
      Implement the ability for filesystems to log error, warning and
      informational messages through the fs_context.  In the future, these will
      be extractable by userspace by reading from an fd created by the fsopen()
      syscall.
      
      Error messages are prefixed with "e ", warnings with "w " and informational
      messages with "i ".
      
      In the future, inside the kernel, formatted messages will be malloc'd but
      unformatted messages will not be copied if they're either in the core .rodata
      section or in the .rodata section of the filesystem module pinned by
      fs_context::fs_type.  The messages will only be good till the fs_type is
      released.
      
      Note that the logging object will be shared between duplicated fs_context
      structures.  This is so that filesystems such as NFS, which do a mount within
      a mount, can get at least some of the errors from the inner mount.
      
      Five logging functions are provided for this:
      
       (1) void logfc(struct fs_context *fc, const char *fmt, ...);
      
           This logs a message into the context.  If the buffer is full, the
           earliest message is discarded.
      
       (2) void errorf(fc, fmt, ...);
      
           This wraps logfc() to log an error.
      
       (3) int invalf(fc, fmt, ...);
      
           This wraps errorf() and returns -EINVAL for convenience.
      
       (4) void warnf(fc, fmt, ...);
      
           This wraps logfc() to log a warning.
      
       (5) void infof(fc, fmt, ...);
      
           This wraps logfc() to log an informational message.
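The behaviour of these five functions can be sketched with a small userspace ring buffer. This is a minimal model under stated assumptions, not the kernel implementation: the `model_` names, the slot count and the message length are hypothetical, and the log is passed directly instead of being reached through an fs_context.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>

#define MODEL_LOG_SLOTS   4	/* hypothetical capacity */
#define MODEL_LOG_MSG_LEN 64	/* hypothetical per-message limit */

struct model_fc_log {
	char msgs[MODEL_LOG_SLOTS][MODEL_LOG_MSG_LEN];
	unsigned int head;	/* index of the oldest stored message */
	unsigned int count;	/* number of stored messages */
};

/* Store prefix + formatted text; when full, discard the earliest message. */
static void model_vlog(struct model_fc_log *log, const char *prefix,
		       const char *fmt, va_list ap)
{
	char *dst;
	int n;

	if (log->count == MODEL_LOG_SLOTS) {
		log->head = (log->head + 1) % MODEL_LOG_SLOTS;
		log->count--;
	}
	dst = log->msgs[(log->head + log->count) % MODEL_LOG_SLOTS];
	n = snprintf(dst, MODEL_LOG_MSG_LEN, "%s", prefix);
	vsnprintf(dst + n, MODEL_LOG_MSG_LEN - (size_t)n, fmt, ap);
	log->count++;
}

/* Log a message without a severity prefix. */
void model_logfc(struct model_fc_log *log, const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	model_vlog(log, "", fmt, ap);
	va_end(ap);
}

/* Log an error, prefixed with "e ". */
void model_errorf(struct model_fc_log *log, const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	model_vlog(log, "e ", fmt, ap);
	va_end(ap);
}

/* Log an error and return -EINVAL so callers can "return model_invalf(...)". */
int model_invalf(struct model_fc_log *log, const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	model_vlog(log, "e ", fmt, ap);
	va_end(ap);
	return -EINVAL;
}

/* Log a warning, prefixed with "w ". */
void model_warnf(struct model_fc_log *log, const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	model_vlog(log, "w ", fmt, ap);
	va_end(ap);
}

/* Log an informational message, prefixed with "i ". */
void model_infof(struct model_fc_log *log, const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	model_vlog(log, "i ", fmt, ap);
	va_end(ap);
}

/* Return the i-th oldest message, or NULL if there is none. */
const char *model_log_get(const struct model_fc_log *log, unsigned int i)
{
	return i < log->count ?
		log->msgs[(log->head + i) % MODEL_LOG_SLOTS] : NULL;
}
```

A parameter parser in this model could end with `return model_invalf(&log, "unknown option %s", name);`, which records the message and reports the failure in one statement.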
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      e7582e16
    • David Howells's avatar
      vfs: Provide documentation for new mount API · 5fe1890d
      David Howells authored
      Provide documentation for the new mount API.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      5fe1890d
    • David Howells's avatar
      vfs: Remove kern_mount_data() · d911b458
      David Howells authored
      The kern_mount_data() isn't used any more so remove it.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      d911b458
    • David Howells's avatar
      hugetlbfs: Convert to fs_context · 32021982
      David Howells authored
      Convert the hugetlbfs to use the fs_context during mount.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      32021982
    • David Howells's avatar
      cpuset: Use fs_context · a1875374
      David Howells authored
      Make the cpuset filesystem use the filesystem context.  This is potentially
      tricky as the cpuset fs is almost an alias for the cgroup filesystem, but
      with some special parameters.
      
      This can, however, be handled by setting up an appropriate cgroup
      filesystem and returning the root directory of that as the root dir of this
      one.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      a1875374
    • David Howells's avatar
      kernfs, sysfs, cgroup, intel_rdt: Support fs_context · 23bf1b6b
      David Howells authored
      Make kernfs support superblock creation/mount/remount with fs_context.
      
      This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
      be made to support fs_context also.
      
      Notes:
      
       (1) A kernfs_fs_context struct is created to wrap fs_context and the
           kernfs mount parameters are moved in here (or are in fs_context).
      
       (2) kernfs_mount{,_ns}() are made into kernfs_get_tree().  The extra
            namespace tag parameter is passed in the context if desired.
      
       (3) kernfs_free_fs_context() is provided as a destructor for the
           kernfs_fs_context struct, but for the moment it does nothing except
           get called in the right places.
      
       (4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
           pass, but possibly this should be done anyway in case someone wants to
           add a parameter in future.
      
       (5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
           the cgroup v1 and v2 mount parameters are all moved there.
      
       (6) cgroup1 parameter parsing error messages are now handled by invalf(),
           which allows userspace to collect them directly.
      
       (7) cgroup1 parameter cleanup is now done in the context destructor rather
           than in the mount/get_tree and remount functions.
      
      Weirdies:
      
       (*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
           but then uses the resulting pointer after dropping the locks.  I'm
           told this is okay and needs commenting.
      
       (*) The cgroup refcount web.  This really needs documenting.
      
       (*) cgroup2 only has one root?
      
      Add a suggestion from Thomas Gleixner in which the RDT enablement code is
      placed into its own function.
      
      [folded a leak fix from Andrey Vagin]
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cc: Tejun Heo <tj@kernel.org>
      cc: Li Zefan <lizefan@huawei.com>
      cc: Johannes Weiner <hannes@cmpxchg.org>
      cc: cgroups@vger.kernel.org
      cc: fenghua.yu@intel.com
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      23bf1b6b
    • Al Viro's avatar
      cgroup: store a reference to cgroup_ns into cgroup_fs_context · cca8f327
      Al Viro authored
      ... and trim cgroup_do_mount() arguments (renaming it to cgroup_do_get_tree())
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      cca8f327
    • Al Viro's avatar
      cgroup_do_mount(): massage calling conventions · 71d883c3
      Al Viro authored
      pass it fs_context instead of fs_type/flags/root triple, have
      it return int instead of dentry and make it deal with setting
      fc->root.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      71d883c3
    • Al Viro's avatar
      cgroup: stash cgroup_root reference into cgroup_fs_context · cf6299b1
      Al Viro authored
      Note that this reference is *NOT* contributing to refcount of
      cgroup_root in question and is valid only until cgroup_do_mount()
      returns.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      cf6299b1
    • Al Viro's avatar
      cgroup2: switch to option-by-option parsing · e34a98d5
      Al Viro authored
      [again, carved out of patch by dhowells]
      [NB: we probably want to handle "source" in parse_param here]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      e34a98d5
    • Al Viro's avatar
      cgroup1: switch to option-by-option parsing · 8d2451f4
      Al Viro authored
      [dhowells should be the author - it's carved out of his patch]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      8d2451f4
    • Al Viro's avatar
      cgroup: take options parsing into ->parse_monolithic() · f5dfb531
      Al Viro authored
      Store the results in cgroup_fs_context.  There's a nasty twist caused
      by the enabling/disabling subsystems - we can't do the checks sensitive
      to that until cgroup_mutex gets grabbed.  Frankly, these checks are
      complete bullshit (e.g. all,none combination is accepted if all subsystems
      are disabled; so's cpusets,none and all,cpusets when cpusets is disabled,
      etc.), but touching that would be a userland-visible behaviour change ;-/
      
      So we do parsing in ->parse_monolithic() and have the consistency checks
      done in check_cgroupfs_options(), with the latter called (on already parsed
      options) from cgroup1_get_tree() and cgroup1_reconfigure().
      
      Freeing the strdup'ed strings is done from fs_context destructor, which
      somewhat simplifies the life for cgroup1_{get_tree,reconfigure}().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f5dfb531
    • Al Viro's avatar
      7feeef58
    • Al Viro's avatar
      cgroup: start switching to fs_context · 90129625
      Al Viro authored
      Unfortunately, cgroup is tangled into kernfs infrastructure.
      To avoid converting all kernfs-based filesystems at once,
      we need to untangle the remount part of things, instead of
      having it go through kernfs_sop_remount_fs().  Fortunately,
      it's not hard to do.
      
      This commit just gets cgroup/cgroup1 to use fs_context to
      deliver options on mount and remount paths.  Parsing those
      is going to be done in the next commits; for now we do
      pretty much what legacy case does.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      90129625
    • David Howells's avatar
      ipc: Convert mqueue fs to fs_context · 935c6912
      David Howells authored
      Convert the mqueue filesystem to use the filesystem context stuff.
      
      Notes:
      
       (1) The relevant ipc namespace is selected when the context is
           initialised (and it defaults to the current task's ipc namespace).
           The caller can override this before calling vfs_get_tree().
      
       (2) Rather than simply calling kern_mount_data(), mq_init_ns() and
           mq_internal_mount() create a context, adjust it and then do the rest
           of the mount procedure.
      
       (3) The lazy mqueue mounting on creation of a new namespace is retained
           from a previous patch, but the avoidance of sget() if no superblock
           yet exists is reverted and the superblock is again keyed on the
           namespace pointer.
      
           Yes, there was a performance gain in not searching the superblock
           hash, but it's only paid once per ipc namespace - and only if someone
           uses mqueue within that namespace, so I'm not sure it's worth it,
           especially as calling sget() allows avoidance of recursion.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      935c6912
    • David Howells's avatar
      proc: Add fs_context support to procfs · 66f592e2
      David Howells authored
      Add fs_context support to procfs.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      66f592e2
    • David Howells's avatar
      procfs: Move proc_fill_super() to fs/proc/root.c · 60a3c3a5
      David Howells authored
      Move proc_fill_super() to fs/proc/root.c as that's where the other
      superblock stuff is.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
      cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      60a3c3a5
    • Al Viro's avatar
      introduce cloning of fs_context · 0b52075e
      Al Viro authored
      new primitive: vfs_dup_fs_context().  Comes with fs_context
      method (->dup()) for copying the filesystem-specific parts
      of fs_context, along with LSM one (->fs_context_dup()) for
      doing the same to LSM parts.
      
      [needs better commit message, and change of Author:, anyway]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      0b52075e
    • Al Viro's avatar
      convenience helpers: vfs_get_super() and sget_fc() · cb50b348
      Al Viro authored
      the former is an analogue of mount_{single,nodev} for use in
      ->get_tree() instances, the latter - analogue of sget() for the
      same.
      
      These are fairly similar to the originals, but the callback signature
      for sget_fc() is different from sget() ones, so getting bits and
      pieces shared would be too convoluted; we might get around to that
      later, but for now let's just remember to keep them in sync.  They
      do live next to each other, and changes in either won't be hard
      to spot.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      cb50b348
    • David Howells's avatar
      vfs: Implement a filesystem superblock creation/configuration context · 3e1aeb00
      David Howells authored
      [AV - unfuck kern_mount_data(); we want non-NULL ->mnt_ns on long-living
      mounts]
      [AV - reordering fs/namespace.c is badly overdue, but let's keep it
      separate from that series]
      [AV - drop simple_pin_fs() change]
      [AV - clean vfs_kern_mount() failure exits up]
      
      Implement a filesystem context concept to be used during superblock
      creation for mount and superblock reconfiguration for remount.
      
      The mounting procedure then becomes:
      
       (1) Allocate new fs_context context.
      
       (2) Configure the context.
      
       (3) Create superblock.
      
       (4) Query the superblock.
      
       (5) Create a mount for the superblock.
      
       (6) Destroy the context.
      
      Rather than calling fs_type->mount(), an fs_context struct is created and
      fs_type->init_fs_context() is called to set it up.  Pointers exist for the
      filesystem and LSM to hang their private data off.
      
      A set of operations has to be set by ->init_fs_context() to provide
      freeing, duplication, option parsing, binary data parsing, validation,
      mounting and superblock filling.
      
      Legacy filesystems are supported by the provision of a set of legacy
      fs_context operations that build up a list of mount options and then invoke
      fs_type->mount() from within the fs_context ->get_tree() operation.  This
      allows all filesystems to be accessed using fs_context.
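The legacy path described above can be pictured with a short userspace sketch. It assumes, purely for illustration, that the accumulated options are kept as a comma-separated "key=val" string; `model_legacy_ctx`, `model_legacy_add`, `model_legacy_get_tree` and the buffer size are hypothetical and not taken from the patch.

```c
#include <stdio.h>
#include <string.h>

#define MODEL_LEGACY_BUF 128	/* hypothetical capacity */

struct model_legacy_ctx {
	char data[MODEL_LEGACY_BUF];	/* accumulated "key=val,flag,..." string */
	size_t len;			/* bytes used, excluding the terminator */
};

/* Append one parameter; val may be NULL for a flag-style option.
 * Returns 0 on success, or -1 if the option does not fit. */
int model_legacy_add(struct model_legacy_ctx *ctx, const char *key,
		     const char *val)
{
	size_t need = strlen(key) + (val ? strlen(val) + 1 : 0) +
		      (ctx->len ? 1 : 0);
	int n;

	if (ctx->len + need >= MODEL_LEGACY_BUF)
		return -1;
	n = snprintf(ctx->data + ctx->len, MODEL_LEGACY_BUF - ctx->len,
		     "%s%s%s%s", ctx->len ? "," : "", key,
		     val ? "=" : "", val ? val : "");
	ctx->len += (size_t)n;
	return 0;
}

/* Stand-in for the point where the legacy mount callback would be handed
 * the accumulated option string. */
const char *model_legacy_get_tree(const struct model_legacy_ctx *ctx)
{
	return ctx->data;
}
```

In this model, options arrive one at a time through the new-style interface and are handed over as a single string only at the end.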
      
      It should be noted that, whilst this patch adds a lot of lines of code,
      there is quite a bit of duplication with existing code that can be
      eliminated should all filesystems be converted over.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3e1aeb00
    • David Howells's avatar
      vfs: Put security flags into the fs_context struct · 846e5662
      David Howells authored
      Put security flags, such as SECURITY_LSM_NATIVE_LABELS, into the filesystem
      context so that the filesystem can communicate them to the LSM more easily.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      846e5662
    • David Howells's avatar
      smack: Implement filesystem context security hooks · 2febd254
      David Howells authored
      Implement filesystem context security hooks for the smack LSM.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Casey Schaufler <casey@schaufler-ca.com>
      cc: linux-security-module@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2febd254
    • David Howells's avatar
      selinux: Implement the new mount API LSM hooks · 442155c1
      David Howells authored
      Implement the new mount API LSM hooks for SELinux.  At some point the old
      hooks will need to be removed.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Paul Moore <paul@paul-moore.com>
      cc: Stephen Smalley <sds@tycho.nsa.gov>
      cc: selinux@tycho.nsa.gov
      cc: linux-security-module@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      442155c1
    • David Howells's avatar
      vfs: Add LSM hooks for the new mount API · da2441fd
      David Howells authored
      Add LSM hooks for use by the new mount API and filesystem context code.
      This includes:
      
       (1) Hooks to handle allocation, duplication and freeing of the security
           record attached to a filesystem context.
      
       (2) A hook to snoop source specifications.  There may be multiple of these
           if the filesystem supports it.  They will be local files/devices if
           fs_context::source_is_dev is true and will be something else, possibly
           remote server specifications, if false.
      
       (3) A hook to snoop superblock configuration options in key[=val] form.
           If the LSM decides it wants to handle it, it can suppress the option
           being passed to the filesystem.  Note that 'val' may include commas
           and binary data with the fsopen patch.
      
       (4) A hook to perform validation and allocation after the configuration
           has been done but before the superblock is allocated and set up.
      
       (5) A hook to transfer the security from the context to a newly created
           superblock.
      
       (6) A hook to rule on whether a path point can be used as a mountpoint.
      
      These are intended to replace:
      
      	security_sb_copy_data
      	security_sb_kern_mount
      	security_sb_mount
      	security_sb_set_mnt_opts
      	security_sb_clone_mnt_opts
      	security_sb_parse_opts_str
      
      [AV -- some of the methods being replaced are already gone, some of the
      methods are not added for the lack of need]
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-security-module@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      da2441fd
    • David Howells's avatar
      vfs: Add configuration parser helpers · 31d921c7
      David Howells authored
      Because the new API passes in key,value parameters, match_token() cannot be
      used with it.  Instead, provide three new helpers to aid with parsing:
      
       (1) fs_parse().  This takes a parameter and a simple static description of
           all the parameters and maps the key name to an ID.  It returns 1 on a
           match, 0 on no match if unknowns should be ignored and a negative
           error code on a parse error.
      
           The parameter description includes a list of key names to IDs, desired
           parameter types and a list of enumeration name -> ID mappings.
      
           [!] Note that for the moment I've required that the key->ID mapping
           array be sorted and unterminated.  The size of the
           array is noted in the fsconfig_parser struct.  This allows me to use
           bsearch(), but I'm not sure any performance gain is worth the hassle
           of requiring people to keep the array sorted.
      
           The parameter type array is sized according to the number of parameter
           IDs and is indexed directly.  The optional enum mapping array is an
           unterminated, unsorted list and the size goes into the fsconfig_parser
           struct.
      
           The function can do some additional things:
      
      	(a) If it's not ambiguous and no value is given, the prefix "no" on
      	    a key name is permitted to indicate that the parameter should
      	    be considered negatory.
      
      	(b) If the desired type is a single simple integer, it will perform
      	    an appropriate conversion and store the result in a union in
      	    the parse result.
      
      	(c) If the desired type is an enumeration, {key ID, name} will be
      	    looked up in the enumeration list and the matching value will
      	    be stored in the parse result union.
      
      	(d) Optionally generate an error if the key is unrecognised.
      
           This is called something like:
      
      	enum rdt_param {
      		Opt_cdp,
      		Opt_cdpl2,
      		Opt_mba_mpbs,
      		nr__rdt_params
      	};
      
      	const struct fs_parameter_spec rdt_param_specs[nr__rdt_params] = {
      		[Opt_cdp]	= { fs_param_is_bool },
      		[Opt_cdpl2]	= { fs_param_is_bool },
      		[Opt_mba_mpbs]	= { fs_param_is_bool },
      	};
      
      	const char *const rdt_param_keys[nr__rdt_params] = {
      		[Opt_cdp]	= "cdp",
      		[Opt_cdpl2]	= "cdpl2",
      		[Opt_mba_mpbs]	= "mba_mbps",
      	};
      
      	const struct fs_parameter_description rdt_parser = {
      		.name		= "rdt",
      		.nr_params	= nr__rdt_params,
      		.keys		= rdt_param_keys,
      		.specs		= rdt_param_specs,
      		.no_source	= true,
      	};
      
      	int rdt_parse_param(struct fs_context *fc,
      			    struct fs_parameter *param)
      	{
      		struct fs_parse_result parse;
      		struct rdt_fs_context *ctx = rdt_fc2context(fc);
      		int ret;
      
      		ret = fs_parse(fc, &rdt_parser, param, &parse);
      		if (ret < 0)
      			return ret;
      
      		switch (parse.key) {
      		case Opt_cdp:
      			ctx->enable_cdpl3 = true;
      			return 0;
      		case Opt_cdpl2:
      			ctx->enable_cdpl2 = true;
      			return 0;
      		case Opt_mba_mpbs:
      			ctx->enable_mba_mbps = true;
      			return 0;
      		}
      
      		return -EINVAL;
      	}
      
       (2) fs_lookup_param().  This takes a { dirfd, path, LOOKUP_EMPTY? } or
           string value and performs an appropriate path lookup to convert it
           into a path object, which it will then return.
      
           If the desired type was a blockdev, the type of the looked up inode
           will be checked to make sure it is one.
      
           This can be used like:
      
      	enum foo_param {
      		Opt_source,
      		nr__foo_params
      	};
      
      	const struct fs_parameter_spec foo_param_specs[nr__foo_params] = {
      		[Opt_source]	= { fs_param_is_blockdev },
      	};
      
      	const char *const foo_param_keys[nr__foo_params] = {
      		[Opt_source]	= "source",
      	};
      
      	const struct constant_table foo_param_alt_keys[] = {
      		{ "device",	Opt_source },
      	};
      
      	const struct fs_parameter_description foo_parser = {
      		.name		= "foo",
      		.nr_params	= nr__foo_params,
      		.nr_alt_keys	= ARRAY_SIZE(foo_param_alt_keys),
      		.keys		= foo_param_keys,
      		.alt_keys	= foo_param_alt_keys,
      		.specs		= foo_param_specs,
      	};
      
      	int foo_parse_param(struct fs_context *fc,
      			    struct fs_parameter *param)
      	{
      		struct fs_parse_result parse;
      		struct foo_fs_context *ctx = foo_fc2context(fc);
      		int ret;
      
      		ret = fs_parse(fc, &foo_parser, param, &parse);
      		if (ret < 0)
      			return ret;
      
      		switch (parse.key) {
      		case Opt_source:
      			return fs_lookup_param(fc, &foo_parser, param,
      					       &parse, &ctx->source);
      		default:
      			return -EINVAL;
      		}
      	}
      
       (3) lookup_constant().  This takes a table of named constants and looks up
           the given name within it.  The table is expected to be sorted such
           that bsearch() can be used upon it.
      
           Possibly I should require the table be terminated and just use a
           for-loop to scan it instead of using bsearch() to reduce hassle.
      
           Tables look something like:
      
      	static const struct constant_table bool_names[] = {
      		{ "0",		false },
      		{ "1",		true },
      		{ "false",	false },
      		{ "no",		false },
      		{ "true",	true },
      		{ "yes",	true },
      	};
      
           and a lookup is done with something like:
      
      	b = lookup_constant(bool_names, param->string, -1);
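
      As a rough userspace illustration of the sorted-table lookup described
      in (3), the sketch below uses the C library bsearch().  The struct
      layout, the comparator and the name lookup_constant_sketch() are
      assumptions made for this example; they are not the kernel's actual
      definitions, and the explicit table-size argument is likewise only a
      convenience of the sketch.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct constant_table. */
struct constant_table {
	const char *name;
	int value;
};

/* bsearch() comparator: key is a name string, elem is a table entry. */
static int cmp_constant(const void *key, const void *elem)
{
	const struct constant_table *e = elem;

	return strcmp(key, e->name);
}

/*
 * Return the value mapped to name, or not_found if the name is absent.
 * The table must be sorted by name in strcmp() order for bsearch() to work.
 */
static int lookup_constant_sketch(const struct constant_table *tbl, size_t n,
				  const char *name, int not_found)
{
	const struct constant_table *e =
		bsearch(name, tbl, n, sizeof(*tbl), cmp_constant);

	return e ? e->value : not_found;
}

/* Same entries as the bool_names table above, already in sorted order. */
static const struct constant_table bool_names[] = {
	{ "0",		0 },
	{ "1",		1 },
	{ "false",	0 },
	{ "no",		0 },
	{ "true",	1 },
	{ "yes",	1 },
};
```

      A caller would pass the number of entries explicitly, for instance
      lookup_constant_sketch(bool_names, 6, "yes", -1).  An unsorted table
      would make bsearch() miss entries silently, which is the hassle the
      text above worries about.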
      
      Additionally, optional validation routines for the parameter description
      are provided that can be enabled at compile time.  A later patch will
      invoke these when a filesystem is registered.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      31d921c7
  3. 26 Feb, 2019 1 commit
    • Eric Dumazet's avatar
      iov_iter: optimize page_copy_sane() · 6daef95b
      Eric Dumazet authored
      Avoid cache line miss dereferencing struct page if we can.
      
      page_copy_sane() mostly deals with order-0 pages.
      
      Extra cache line miss is visible on TCP recvmsg() calls dealing
      with GRO packets (typically 45 page frags are attached to one skb).
      
      Bringing the 45 struct pages into cpu cache while copying the data
      is not free, since the freeing of the skb (and associated
      page frags put_page()) can happen after cache lines have been evicted.
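
      The idea can be modelled in userspace roughly as follows.  This is a
      simplified sketch, not the kernel function: struct sketch_page, the
      4096-byte page size, the dereference counter and the omission of
      tail-page offsets are all assumptions made for illustration.  It only
      shows that a range fitting inside one base page can be accepted
      without reading the page structure at all.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed base page size */

/* Hypothetical stand-in for struct page: only the order matters here. */
struct sketch_page {
	unsigned int order;	/* compound order; 0 for a single page */
};

/* Counts how often the (possibly cache-cold) page struct is read. */
static unsigned long sketch_page_derefs;

static bool sketch_copy_sane(const struct sketch_page *page,
			     size_t offset, size_t n)
{
	size_t end = offset + n;

	/*
	 * Fast path: the range fits in one base page, so it is valid for
	 * every page order and the page struct need not be touched.
	 * The n <= end test rejects ranges whose sum overflowed.
	 */
	if (n <= end && end <= SKETCH_PAGE_SIZE)
		return true;

	/* Slow path: the page order is needed to compute the real size. */
	sketch_page_derefs++;
	return n <= end && end <= ((size_t)SKETCH_PAGE_SIZE << page->order);
}
```

      With mostly order-0 pages, nearly every call takes the fast path, so
      the counter stays at zero for them.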
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6daef95b
  4. 22 Feb, 2019 2 commits
    • Geert Uytterhoeven's avatar
      vfs: Make __vfs_write() static · 12e1e7af
      Geert Uytterhoeven authored
      __vfs_write() was unexported, and removed from <linux/fs.h>, but
      forgotten to be made static.
      
      Fixes: eb031849 ("fs: unexport __vfs_read/__vfs_write")
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      12e1e7af
    • Bart Van Assche's avatar
      aio: Fix locking in aio_poll() · d3d6a18d
      Bart Van Assche authored
      wake_up_locked() may, but does not have to, be called with interrupts
      disabled. Since the fuse filesystem calls wake_up_locked() without
      disabling interrupts, aio_poll_wake() may be called with interrupts
      enabled. Since the kioctx.ctx_lock may be acquired from IRQ context,
      all code that acquires that lock from thread context must disable
      interrupts. Hence change the spin_trylock() call in aio_poll_wake()
      into a spin_trylock_irqsave() call. This patch fixes the following
      lockdep complaint:
      
      =====================================================
      WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
      5.0.0-rc4-next-20190131 #23 Not tainted
      -----------------------------------------------------
      syz-executor2/13779 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      0000000098ac1230 (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:329 [inline]
      0000000098ac1230 (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1772 [inline]
      0000000098ac1230 (&fiq->waitq){+.+.}, at: __io_submit_one fs/aio.c:1875 [inline]
      0000000098ac1230 (&fiq->waitq){+.+.}, at: io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
      
      and this task is already holding:
      000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
      000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
      000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
      000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908
      which would create a new lock dependency:
       (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}
      
      but this new dependency connects a SOFTIRQ-irq-safe lock:
       (&(&ctx->ctx_lock)->rlock){..-.}
      
      ... which became SOFTIRQ-irq-safe at:
        lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
        __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
        _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
        spin_lock_irq include/linux/spinlock.h:354 [inline]
        free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
        percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
        percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
        percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
        percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
        __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
        rcu_do_batch kernel/rcu/tree.c:2486 [inline]
        invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
        rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
        __do_softirq+0x266/0x95a kernel/softirq.c:292
        run_ksoftirqd kernel/softirq.c:654 [inline]
        run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
        smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
        kthread+0x357/0x430 kernel/kthread.c:247
        ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
      
      to a SOFTIRQ-irq-unsafe lock:
       (&fiq->waitq){+.+.}
      
      ... which became SOFTIRQ-irq-unsafe at:
      ...
        lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
        __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
        _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
        spin_lock include/linux/spinlock.h:329 [inline]
        flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
        fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
        fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
        fuse_send_init fs/fuse/inode.c:989 [inline]
        fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
        mount_nodev+0x68/0x110 fs/super.c:1392
        fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
        legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
        vfs_get_tree+0x123/0x450 fs/super.c:1481
        do_new_mount fs/namespace.c:2610 [inline]
        do_mount+0x1436/0x2c40 fs/namespace.c:2932
        ksys_mount+0xdb/0x150 fs/namespace.c:3148
        __do_sys_mount fs/namespace.c:3162 [inline]
        __se_sys_mount fs/namespace.c:3159 [inline]
        __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
        do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      other info that might help us debug this:
      
       Possible interrupt unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&fiq->waitq);
                                     local_irq_disable();
                                     lock(&(&ctx->ctx_lock)->rlock);
                                     lock(&fiq->waitq);
        <Interrupt>
          lock(&(&ctx->ctx_lock)->rlock);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor2/13779:
       #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
       #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
       #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
       #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908
      
      the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
      -> (&(&ctx->ctx_lock)->rlock){..-.} {
         IN-SOFTIRQ-W at:
                          lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                          __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
                          _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
                          spin_lock_irq include/linux/spinlock.h:354 [inline]
                          free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
                          percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
                          percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
                          percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
                          percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
                          __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
                          rcu_do_batch kernel/rcu/tree.c:2486 [inline]
                          invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
                          rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
                          __do_softirq+0x266/0x95a kernel/softirq.c:292
                          run_ksoftirqd kernel/softirq.c:654 [inline]
                          run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
                          smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
                          kthread+0x357/0x430 kernel/kthread.c:247
                          ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
         INITIAL USE at:
                         lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                         __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
                         _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
                         spin_lock_irq include/linux/spinlock.h:354 [inline]
                         __do_sys_io_cancel fs/aio.c:2052 [inline]
                         __se_sys_io_cancel fs/aio.c:2035 [inline]
                         __x64_sys_io_cancel+0xd5/0x5a0 fs/aio.c:2035
                         do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
                         entry_SYSCALL_64_after_hwframe+0x49/0xbe
       }
       ... key      at: [<ffffffff8a574140>] __key.52370+0x0/0x40
       ... acquired at:
         lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
         __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
         _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
         spin_lock include/linux/spinlock.h:329 [inline]
         aio_poll fs/aio.c:1772 [inline]
         __io_submit_one fs/aio.c:1875 [inline]
         io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
         __do_sys_io_submit fs/aio.c:1953 [inline]
         __se_sys_io_submit fs/aio.c:1923 [inline]
         __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
         do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      the dependencies between the lock to be acquired
       and SOFTIRQ-irq-unsafe lock:
      -> (&fiq->waitq){+.+.} {
         HARDIRQ-ON-W at:
                          lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                          __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
                          _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
                          spin_lock include/linux/spinlock.h:329 [inline]
                          flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
                          fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
                          fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
                          fuse_send_init fs/fuse/inode.c:989 [inline]
                          fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
                          mount_nodev+0x68/0x110 fs/super.c:1392
                          fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
                          legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
                          vfs_get_tree+0x123/0x450 fs/super.c:1481
                          do_new_mount fs/namespace.c:2610 [inline]
                          do_mount+0x1436/0x2c40 fs/namespace.c:2932
                          ksys_mount+0xdb/0x150 fs/namespace.c:3148
                          __do_sys_mount fs/namespace.c:3162 [inline]
                          __se_sys_mount fs/namespace.c:3159 [inline]
                          __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
                          do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
                          entry_SYSCALL_64_after_hwframe+0x49/0xbe
         SOFTIRQ-ON-W at:
                          lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                          __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
                          _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
                          spin_lock include/linux/spinlock.h:329 [inline]
                          flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
                          fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
                          fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
                          fuse_send_init fs/fuse/inode.c:989 [inline]
                          fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
                          mount_nodev+0x68/0x110 fs/super.c:1392
                          fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
                          legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
                          vfs_get_tree+0x123/0x450 fs/super.c:1481
                          do_new_mount fs/namespace.c:2610 [inline]
                          do_mount+0x1436/0x2c40 fs/namespace.c:2932
                          ksys_mount+0xdb/0x150 fs/namespace.c:3148
                          __do_sys_mount fs/namespace.c:3162 [inline]
                          __se_sys_mount fs/namespace.c:3159 [inline]
                          __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
                          do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
                          entry_SYSCALL_64_after_hwframe+0x49/0xbe
         INITIAL USE at:
                         lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                         __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
                         _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
                         spin_lock include/linux/spinlock.h:329 [inline]
                         flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
                         fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
                         fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
                         fuse_send_init fs/fuse/inode.c:989 [inline]
                         fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
                         mount_nodev+0x68/0x110 fs/super.c:1392
                         fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
                         legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
                         vfs_get_tree+0x123/0x450 fs/super.c:1481
                         do_new_mount fs/namespace.c:2610 [inline]
                         do_mount+0x1436/0x2c40 fs/namespace.c:2932
                         ksys_mount+0xdb/0x150 fs/namespace.c:3148
                         __do_sys_mount fs/namespace.c:3162 [inline]
                         __se_sys_mount fs/namespace.c:3159 [inline]
                         __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
                         do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
                         entry_SYSCALL_64_after_hwframe+0x49/0xbe
       }
       ... key      at: [<ffffffff8a60dec0>] __key.43450+0x0/0x40
       ... acquired at:
         lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
         __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
         _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
         spin_lock include/linux/spinlock.h:329 [inline]
         aio_poll fs/aio.c:1772 [inline]
         __io_submit_one fs/aio.c:1875 [inline]
         io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
         __do_sys_io_submit fs/aio.c:1953 [inline]
         __se_sys_io_submit fs/aio.c:1923 [inline]
         __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
         do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      stack backtrace:
      CPU: 0 PID: 13779 Comm: syz-executor2 Not tainted 5.0.0-rc4-next-20190131 #23
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_bad_irq_dependency kernel/locking/lockdep.c:1573 [inline]
       check_usage.cold+0x60f/0x940 kernel/locking/lockdep.c:1605
       check_irq_usage kernel/locking/lockdep.c:1650 [inline]
       check_prev_add_irq kernel/locking/lockdep_states.h:8 [inline]
       check_prev_add kernel/locking/lockdep.c:1860 [inline]
       check_prevs_add kernel/locking/lockdep.c:1968 [inline]
       validate_chain kernel/locking/lockdep.c:2339 [inline]
       __lock_acquire+0x1f12/0x4790 kernel/locking/lockdep.c:3320
       lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
       __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
       _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
       spin_lock include/linux/spinlock.h:329 [inline]
       aio_poll fs/aio.c:1772 [inline]
       __io_submit_one fs/aio.c:1875 [inline]
       io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
       __do_sys_io_submit fs/aio.c:1953 [inline]
       __se_sys_io_submit fs/aio.c:1923 [inline]
       __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: <stable@vger.kernel.org>
      Fixes: e8693bcf ("aio: allow direct aio poll comletions for keyed wakeups") # v4.19
      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      [ bvanassche: added a comment ]
      Reluctantly-Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      d3d6a18d
  5. 19 Feb, 2019 1 commit
    • YueHaibing's avatar
      exec: Fix mem leak in kernel_read_file · f612acfa
      YueHaibing authored
      syzkaller reports this:
      BUG: memory leak
      unreferenced object 0xffffc9000488d000 (size 9195520):
        comm "syz-executor.0", pid 2752, jiffies 4294787496 (age 18.757s)
        hex dump (first 32 bytes):
          ff ff ff ff ff ff ff ff a8 00 00 00 01 00 00 00  ................
          02 00 00 00 00 00 00 00 80 a1 7a c1 ff ff ff ff  ..........z.....
        backtrace:
          [<000000000863775c>] __vmalloc_node mm/vmalloc.c:1795 [inline]
          [<000000000863775c>] __vmalloc_node_flags mm/vmalloc.c:1809 [inline]
          [<000000000863775c>] vmalloc+0x8c/0xb0 mm/vmalloc.c:1831
          [<000000003f668111>] kernel_read_file+0x58f/0x7d0 fs/exec.c:924
          [<000000002385813f>] kernel_read_file_from_fd+0x49/0x80 fs/exec.c:993
          [<0000000011953ff1>] __do_sys_finit_module+0x13b/0x2a0 kernel/module.c:3895
          [<000000006f58491f>] do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
          [<00000000ee78baf4>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<00000000241f889b>] 0xffffffffffffffff
      
      It should go to the 'out_free' label to free the allocated buf when
      kernel_read fails.
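
      The cleanup pattern involved looks roughly like the userspace sketch
      below.  read_all_sketch() and its use of malloc()/fread() are
      hypothetical stand-ins chosen for illustration; the real function
      uses vmalloc() and kernel_read().  The point is only that a failed
      read must leave through the label that frees the buffer.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read exactly size bytes from f into a freshly
 * allocated buffer.  Returns 0 and sets *out on success, -1 on failure.
 */
static int read_all_sketch(FILE *f, size_t size, void **out)
{
	void *buf;

	*out = NULL;

	buf = malloc(size ? size : 1);
	if (!buf)
		return -1;

	/* A short or failed read must not return directly: buf would leak. */
	if (fread(buf, 1, size, f) != size)
		goto out_free;

	*out = buf;	/* ownership passes to the caller */
	return 0;

out_free:
	free(buf);
	return -1;
}
```

      Returning straight from the failed-read branch instead of jumping to
      out_free would reproduce the kind of leak reported above.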
      
      Fixes: 39d637af ("vfs: forbid write access when reading a file into memory")
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f612acfa
  6. 16 Feb, 2019 1 commit
    • Aurelien Jarno's avatar
      vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1 · cc4b1242
      Aurelien Jarno authored
      The preadv2 and pwritev2 syscalls are supposed to emulate the readv and
      writev syscalls when offset == -1. Therefore the compat code should
      check the offset before calling do_compat_preadv64 and
      do_compat_pwritev64. This is the case for the preadv2 and pwritev2
      syscalls, but handling of offset == -1 is missing in their 64-bit
      equivalents.
      
      This patch fixes that, calling do_compat_readv and do_compat_writev when
      offset == -1. This fixes the following glibc tests on x32:
       - misc/tst-preadvwritev2
       - misc/tst-preadvwritev64v2
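
      The dispatch rule can be illustrated in userspace with the sketch
      below.  preadv2_dispatch_sketch() is a hypothetical name, and the use
      of the libc readv() and preadv() wrappers is an assumption made for
      this example; the kernel patch works on the compat syscall helpers
      named above instead.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical model of the offset check: offset == -1 means "use and
 * update the current file position", so fall back to readv(); any other
 * offset is a positional read handled by preadv().
 */
static ssize_t preadv2_dispatch_sketch(int fd, const struct iovec *iov,
				       int iovcnt, off_t offset)
{
	if (offset == -1)
		return readv(fd, iov, iovcnt);

	return preadv(fd, iov, iovcnt, offset);
}
```

      Without the check, offset == -1 would be passed on as a file
      position, which is not a valid one.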
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      cc4b1242
  7. 04 Feb, 2019 1 commit
  8. 01 Feb, 2019 3 commits
    • Jann Horn's avatar
      pipe: stop using ->can_merge · 01e7187b
      Jann Horn authored
      Al Viro pointed out that since there is only one pipe buffer type to which
      new data can be appended, it isn't necessary to have a ->can_merge field in
      struct pipe_buf_operations; we can just check for a magic type.
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      01e7187b
    • Jann Horn's avatar
      splice: don't merge into linked buffers · a0ce2f0a
      Jann Horn authored
      Before this patch, it was possible for two pipes to affect each other after
      data had been transferred between them with tee():
      
      ============
      $ cat tee_test.c
      #define _GNU_SOURCE
      #include <err.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void) {
        int pipe_a[2];
        if (pipe(pipe_a)) err(1, "pipe");
        int pipe_b[2];
        if (pipe(pipe_b)) err(1, "pipe");
        if (write(pipe_a[1], "abcd", 4) != 4) err(1, "write");
        if (tee(pipe_a[0], pipe_b[1], 2, 0) != 2) err(1, "tee");
        if (write(pipe_b[1], "xx", 2) != 2) err(1, "write");
      
        char buf[5];
        if (read(pipe_a[0], buf, 4) != 4) err(1, "read");
        buf[4] = 0;
        printf("got back: '%s'\n", buf);
      }
      $ gcc -o tee_test tee_test.c
      $ ./tee_test
      got back: 'abxx'
      $
      ============
      
      As suggested by Al Viro, fix it by creating a separate type for
      non-mergeable pipe buffers, then changing the types of buffers in
      splice_pipe_to_pipe() and link_pipe().
      
      Cc: <stable@vger.kernel.org>
      Fixes: 7c77f0b3 ("splice: implement pipe to pipe splicing")
      Fixes: 70524490 ("[PATCH] splice: add support for sys_tee()")
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      a0ce2f0a
    • Chandan Rajendra's avatar
      copy_mount_string: Limit string length to PATH_MAX · fbdb4401
      Chandan Rajendra authored
      On ppc64le, when a string with PAGE_SIZE - 1 (i.e. 64k-1) length is
      passed as a "filesystem type" argument to the mount(2) syscall,
      copy_mount_string() ends up allocating 64k (the PAGE_SIZE on ppc64le)
      worth of space for holding the string in the kernel's address space.
      
      Later, in set_precision() (invoked by get_fs_type() ->
      __request_module() -> vsnprintf()), we end up assigning
      strlen(fs-type-string) i.e. 65535 as the
      value to 'struct printf_spec'->precision member. This field has a width
      of 16 bits and it is a signed data type. Hence an invalid value ends
      up getting assigned. This causes the "WARN_ONCE(spec->precision != prec,
      "precision %d too large", prec)" statement inside set_precision() to be
      executed.
      
      This commit fixes the bug by limiting the length of the string passed by
      copy_mount_string() to strndup_user() to PATH_MAX.
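
      The narrowing problem can be reproduced in userspace with the sketch
      below.  struct spec_sketch and precision_fits() are hypothetical
      names for this example; only the 16-bit signed field width is taken
      from the description above.  The result of storing an out-of-range
      value is implementation-defined in C, so the sketch only checks
      whether the value round-trips.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the 16-bit signed precision field. */
struct spec_sketch {
	signed int precision:16;
};

/* Store prec in the narrow field and report whether it survived intact. */
static bool precision_fits(int prec)
{
	struct spec_sketch spec;

	/* Values above 32767 cannot be represented in a signed 16-bit field. */
	spec.precision = prec;
	return spec.precision == prec;
}
```

      A length of 65535 does not round-trip, whereas any length below
      PATH_MAX (4096 on Linux) does, which is why capping the string
      length avoids the warning.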
      Signed-off-by: Chandan Rajendra <chandan@linux.ibm.com>
      Reported-by: Abdul Haleem <abdhalee@linux.ibm.com>
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      fbdb4401