1. 30 Nov, 2017 40 commits
    • fsnotify: pin both inode and vfsmount mark · 9e9569f0
      Miklos Szeredi authored
      commit 0d6ec079 upstream.
      
      We may fail to pin one of the marks in fsnotify_prepare_user_wait() when
      dropping the srcu read lock, resulting in use after free at the next
      iteration.
      
      Solution is to store both marks in iter_info instead of just the one we'll
      be sending the event for.
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Fixes: 9385a84d ("fsnotify: Pass fsnotify_iter_info into handle_event handler")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fsnotify: clean up fsnotify_prepare/finish_user_wait() · 47b02dca
      Miklos Szeredi authored
      commit 24c20305 upstream.
      
      This patch doesn't actually fix any bug, just paves the way for fixing mark
      and group pinning.
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • md/bitmap: revert a patch · a6ff2fb4
      Shaohua Li authored
      commit 938b533d upstream.
      
      This reverts commit 8031c3dd. That patch doesn't work well if PAGE_SIZE >
      4k. We will fix the original problem with a different approach.
      
      Fix: 8031c3dd(md/bitmap: copy correct data for bitmap super)
      Reported-by: Joshua Kinard <kumba@gentoo.org>
      Suggested-by: Neil Brown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Bluetooth: btqcomsmd: Add support for BD address setup · 5912d9ca
      Loic Poulain authored
      commit 6e518111 upstream.
      
      This patch implements the hdev setup function since wcnss-bt does not have
      persistent memory to store an allocated BD address. The device is therefore
      marked as unconfigured if no BD address has been previously retrieved.
      Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • md: don't check MD_SB_CHANGE_CLEAN in md_allow_write · 7cd7a7aa
      Artur Paszkiewicz authored
      commit b90f6ff0 upstream.
      
      Only MD_SB_CHANGE_PENDING should be used to wait for transition from
      clean to dirty. Checking also MD_SB_CHANGE_CLEAN is unnecessary and can
      race with e.g. md_do_sync(). This sporadically causes a hang when
      changing consistency policy during resync:
      
      INFO: task mdadm:6183 blocked for more than 30 seconds.
            Not tainted 4.14.0-rc3+ #391
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdadm           D12752  6183   6022 0x00000000
      Call Trace:
       __schedule+0x93f/0x990
       schedule+0x6b/0x90
       md_allow_write+0x100/0x130 [md_mod]
       ? do_wait_intr_irq+0x90/0x90
       resize_stripes+0x3a/0x5b0 [raid456]
       ? kernfs_fop_write+0xbe/0x180
       raid5_change_consistency_policy+0xa6/0x200 [raid456]
       consistency_policy_store+0x2e/0x70 [md_mod]
       md_attr_store+0x90/0xc0 [md_mod]
       sysfs_kf_write+0x42/0x50
       kernfs_fop_write+0x119/0x180
       __vfs_write+0x28/0x110
       ? rcu_sync_lockdep_assert+0x12/0x60
       ? __sb_start_write+0x15a/0x1c0
       ? vfs_write+0xa3/0x1a0
       vfs_write+0xb4/0x1a0
       SyS_write+0x49/0xa0
       entry_SYSCALL_64_fastpath+0x18/0xad
      
      Fixes: 2214c260 ("md: don't return -EAGAIN in md_allow_write for external metadata arrays")
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • md: fix deadlock error in recent patch. · 459aad50
      NeilBrown authored
      commit d47c8ad2 upstream.
      
      A recent patch aimed to cause md_write_start() to fail (rather than
      block) when the mddev was suspending, so as to avoid deadlocks.
      Unfortunately the test in wait_event() was wrong, and it didn't change
      behaviour at all.
      
      The wait_event() must wait until the metadata is written OR the array is
      suspending.
      
      Fixes: cc27b0c7 ("md: fix deadlock between mddev_suspend() and md_write_start()")
      Reported-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
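The wake-up condition described in this commit can be modelled as a plain boolean predicate. The following is a simplified Python sketch, not the kernel code: both function names and both flags are hypothetical stand-ins for "a metadata write is still pending" and "the array is suspending", and the shape of the faulty test is an assumption made for illustration.

```python
def wake_assumed_faulty(sb_change_pending: bool, suspended: bool) -> bool:
    # Assumed shape of the faulty test: a suspending array can never
    # satisfy it while a metadata write is pending, so the waiter blocks.
    return (not sb_change_pending) and (not suspended)


def wake_fixed(sb_change_pending: bool, suspended: bool) -> bool:
    # Stop waiting once the metadata is written OR the array is suspending.
    return (not sb_change_pending) or suspended


# Metadata write still pending while the array is suspending:
print(wake_assumed_faulty(True, True))  # False: the waiter keeps blocking
print(wake_fixed(True, True))           # True: the waiter wakes and can fail the write
```

With the second predicate the caller gets a chance to return an error instead of sleeping through the suspend.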
    • iwlwifi: fix firmware names for 9000 and A000 series hw · 55e357bc
      Thomas Backlund authored
      commit c2c48ddf upstream.
      
      iwlwifi 9000 and A000 series hw contain an extra dash in the firmware
      file name, as seen in modinfo output for kernel 4.14:
      
      firmware:       iwlwifi-9260-th-b0-jf-b0--34.ucode
      firmware:       iwlwifi-9260-th-a0-jf-a0--34.ucode
      firmware:       iwlwifi-9000-pu-a0-jf-b0--34.ucode
      firmware:       iwlwifi-9000-pu-a0-jf-a0--34.ucode
      firmware:       iwlwifi-QuQnj-a0-hr-a0--34.ucode
      firmware:       iwlwifi-QuQnj-a0-jf-b0--34.ucode
      firmware:       iwlwifi-QuQnj-f0-hr-a0--34.ucode
      firmware:       iwlwifi-Qu-a0-jf-b0--34.ucode
      firmware:       iwlwifi-Qu-a0-hr-a0--34.ucode
      
      Fix that by no longer appending the extra '"-"'.
      Signed-off-by: Thomas Backlund <tmb@mageia.org>
      Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
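A small Python sketch of how a doubled separator can arise when a prefix that already ends in a dash is joined with another dash. The helper name and its parameters are hypothetical and only model the string building; the prefix and API version come from the modinfo listing in the commit message.

```python
def firmware_name(prefix: str, api_version: int, extra_dash: bool) -> str:
    # The prefix already ends in "-", so adding another separator doubles it.
    separator = "-" if extra_dash else ""
    return f"{prefix}{separator}{api_version}.ucode"


PREFIX = "iwlwifi-9260-th-b0-jf-b0-"  # prefix as seen in the modinfo output

print(firmware_name(PREFIX, 34, extra_dash=True))   # iwlwifi-9260-th-b0-jf-b0--34.ucode
print(firmware_name(PREFIX, 34, extra_dash=False))  # iwlwifi-9260-th-b0-jf-b0-34.ucode
```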
    • rtlwifi: fix uninitialized rtlhal->last_suspend_sec time · 404dcc55
      Arnd Bergmann authored
      commit 3f2a162f upstream.
      
      We set rtlhal->last_suspend_sec to an uninitialized stack variable,
      but unfortunately gcc never warned about this, I only found it
      while working on another patch. I opened a gcc bug for this.
      
      Presumably the value of rtlhal->last_suspend_sec is not all that
      important, but it does get used, so we probably want the
      patch backported to stable kernels.
      
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82839
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Larry Finger <Larry.Finger@lwfinger.net>
      Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • rtlwifi: rtl8192ee: Fix memory leak when loading firmware · 9f724960
      Larry Finger authored
      commit 519ce2f9 upstream.
      
      In routine rtl92ee_set_fw_rsvdpagepkt(), the driver allocates an skb, but
      never calls rtl_cmd_send_packet(), which will free the buffer. All other
      rtlwifi drivers perform this operation correctly.
      
      This problem has been in the driver since it was included in the kernel.
      Fortunately, each firmware load only leaks 4 buffers, which likely
      explains why it has not previously been detected.
      Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
      Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • nfsd: deal with revoked delegations appropriately · 584f0bb5
      Andrew Elble authored
      commit 95da1b3a upstream.
      
      If a delegation has been revoked by the server, operations using that
      delegation should error out with NFS4ERR_DELEG_REVOKED in the >4.1
      case, and NFS4ERR_BAD_STATEID otherwise.
      
      The server needs NFSv4.1 clients to explicitly free revoked delegations.
      If the server returns NFS4ERR_DELEG_REVOKED, the client will do that;
      otherwise it may just forget about the delegation and be unable to
      recover when it later sees SEQ4_STATUS_RECALLABLE_STATE_REVOKED set on a
      SEQUENCE reply.  That can cause the Linux 4.1 client to loop in its
      state manager.
      Signed-off-by: Andrew Elble <aweits@rit.edu>
      Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • NFS: revalidate "." etc correctly on "open". · 57567073
      NeilBrown authored
      commit b688741c upstream.
      
      For correct close-to-open semantics, NFS must validate
      the change attribute of a directory (or file) on open.
      
      Since commit ecf3d1f1 ("vfs: kill FS_REVAL_DOT by adding a
      d_weak_revalidate dentry op"), open() of "." or a path ending ".." is
      not revalidated reliably (except when that directory is a mount point).
      
      Prior to that commit, "." was revalidated using nfs_lookup_revalidate()
      which checks the LOOKUP_OPEN flag and forces revalidation if the flag is
      set.
      Since that commit, nfs_weak_revalidate() is used for NFSv3 (which
      ignores the flags) and nothing is used for NFSv4.
      
      This is fixed by using nfs_lookup_verify_inode() in
      nfs_weak_revalidate().  This does the revalidation exactly when needed.
      Also, add a definition of .d_weak_revalidate for NFSv4.
      
      The incorrect behavior is easily demonstrated by running "echo *" in
      some non-mountpoint NFS directory while watching network traffic.
      Without this patch, "echo *" sometimes doesn't produce any traffic.
      With the patch it always does.
      
      Fixes: ecf3d1f1 ("vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op")
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • NFS: Avoid RCU usage in tracepoints · 2deb8945
      Anna Schumaker authored
      commit 3944369d upstream.
      
      There isn't an obvious way to acquire and release the RCU lock during a
      tracepoint, so we can't use the rpc_peeraddr2str() function here.
      Instead, rely on the client's cl_hostname, which should have similar
      enough information without needing an rcu_dereference().
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • nfs: Fix ugly referral attributes · aed1a433
      Chuck Lever authored
      commit c05cefcc upstream.
      
      Before traversing a referral and performing a mount, the mounted-on
      directory looks strange:
      
      dr-xr-xr-x. 2 4294967294 4294967294 0 Dec 31  1969 dir.0
      
      nfs4_get_referral is wiping out any cached attributes with what was
      returned via GETATTR(fs_locations), but the bit mask for that
      operation does not request any file attributes.
      
      Retrieve owner and timestamp information so that the memcpy in
      nfs4_get_referral fills in more attributes.
      
      Changes since v1:
      - Don't request attributes that the client unconditionally replaces
      - Request only MOUNTED_ON_FILEID or FILEID attribute, not both
      - encode_fs_locations() doesn't use the third bitmask word
      
      Fixes: 6b97fd3d ("NFSv4: Follow a referral")
      Suggested-by: Pradeep Thomas <pradeepthomas@gmail.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • NFS: Revert "NFS: Move the flock open mode check into nfs_flock()" · 57f3c05d
      Benjamin Coddington authored
      commit fcfa4470 upstream.
      
      Commit e1293727 "NFS: Move the flock open mode check into nfs_flock()"
      changed NFSv3 behavior for flock() such that the open mode must match the
      lock type, however that requirement shouldn't be enforced for flock().
      Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • NFS: Fix typo in nomigration mount option · afaacc00
      Joshua Watt authored
      commit f02fee22 upstream.
      
      The option was incorrectly masking off all other options.
      Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
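One common way a "clear this flag" operation ends up masking off every other flag is a missing bitwise complement. The Python sketch below illustrates that class of typo; the flag names and values are hypothetical, and it is an assumption that the typo in this commit was of this form.

```python
# Hypothetical option bits, chosen only for illustration.
OPT_FSCACHE = 0x1
OPT_MIGRATION = 0x2


def clear_migration_typo(options: int) -> int:
    # Missing "~": keeps ONLY the migration bit and drops everything else.
    return options & OPT_MIGRATION


def clear_migration(options: int) -> int:
    # Clears the migration bit and leaves the other options untouched.
    return options & ~OPT_MIGRATION


options = OPT_FSCACHE | OPT_MIGRATION
print(hex(clear_migration_typo(options)))  # 0x2
print(hex(clear_migration(options)))       # 0x1
```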
    • f2fs: expose some sectors to user in inline data or dentry case · d628ac8a
      Jaegeuk Kim authored
      commit 5b4267d1 upstream.
      
      If some data has been written through inline data or an inline dentry, we
      need to show st_blocks. This fixes reporting zero blocks even though a
      small amount of data has been written.
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      [Jaegeuk Kim: avoid link file for quotacheck]
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: change how we decide to commit transactions during flushing · f1117628
      Josef Bacik authored
      commit 996478ca upstream.
      
      Nikolay reported that generic/273 was failing currently with ENOSPC.
      Turns out this is because we get to the point where the outstanding
      reservations are greater than the pinned space on the fs.  This is a
      mistake, previously we used the current reservation amount in
      may_commit_transaction, not the entire outstanding reservation amount.
      Fix this to find the minimum byte size needed to make progress in
      flushing, and pass that into may_commit_transaction.  From there we can
      make a smarter decision on whether to commit the transaction or not.
      This fixes the failure in generic/273.
      
      From Nikolai, IOW: when we go to the final stage of deciding whether to
      do trans commit, instead of passing all the reservations from all
      tickets we just pass the reservation for the current ticket. Otherwise,
      in case all reservations exceed pinned space, then we don't commit
      transaction and fail prematurely. Before we passed num_bytes from
      flush_space, where num_bytes was the sum of all pending reservations,
      but now all we do is take the first ticket and commit the trans if we
      can satisfy that.
      
      Fixes: 957780eb ("Btrfs: introduce ticketed enospc infrastructure")
      Reported-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      [ added Nikolai's comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
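The difference between passing the sum of all tickets and passing only the first ticket can be shown with a toy model in Python. This is a deliberately simplified sketch: the helper name, the ticket sizes and the pinned amount are hypothetical, and the real decision in may_commit_transaction takes more state into account.

```python
def may_commit(bytes_needed: int, pinned_bytes: int) -> bool:
    # Committing only helps if it can free at least what we need.
    return bytes_needed <= pinned_bytes


# Hypothetical pending reservation tickets, oldest first, in bytes.
tickets = [64 * 1024, 512 * 1024, 2 * 1024 * 1024]
pinned = 1024 * 1024

# Old behaviour: the sum of all tickets exceeds pinned space, so no commit.
print(may_commit(sum(tickets), pinned))  # False

# New behaviour: only the first ticket has to be satisfiable.
print(may_commit(tickets[0], pinned))    # True
```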
    • isofs: fix timestamps beyond 2027 · f2122d66
      Arnd Bergmann authored
      commit 34be4dbf upstream.
      
      isofs uses a 'char' variable to load the number of years since
      1900 for an inode timestamp. On architectures that use a signed
      char type by default, this results in an invalid date for
      anything beyond 2027.
      
      This changes the function argument to a 'u8' array, which
      is defined the same way on all architectures, and unambiguously
      lets us use years until 2155.
      
      This should be backported to all kernels that might still be
      in use by that date.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
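The 2027 and 2155 limits follow directly from reading the year byte as signed or unsigned. The Python sketch below reproduces the arithmetic with the struct module; the helper name is hypothetical and it models only the year byte, not the rest of the timestamp.

```python
import struct


def year_from_byte(raw: bytes, signed: bool) -> int:
    # "b" decodes a signed 8-bit value, "B" an unsigned one.
    (offset,) = struct.unpack("b" if signed else "B", raw)
    return 1900 + offset


print(year_from_byte(bytes([127]), signed=True))   # 2027: last year that fits a signed char
print(year_from_byte(bytes([128]), signed=True))   # 1772: the byte for 2028 read as -128
print(year_from_byte(bytes([128]), signed=False))  # 2028
print(year_from_byte(bytes([255]), signed=False))  # 2155: last year that fits a u8
```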
    • fanotify: fix fsnotify_prepare_user_wait() failure · 1dd7dd07
      Miklos Szeredi authored
      commit f37650f1 upstream.
      
      If fsnotify_prepare_user_wait() fails, we leave the event on the
      notification list, which results in a warning in
      fsnotify_destroy_event() and a later use-after-free.
      
      Instead of adding a new helper to remove the event from the list in this
      case, I opted to move the prepare/finish up into fanotify_handle_event().
      
      This will allow these to be moved further out into the generic code later,
      and perhaps let us move to non-sleeping RCU.
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Fixes: 05f0e387 ("fanotify: Release SRCU lock when waiting for userspace response")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fs: guard_bio_eod() needs to consider partitions · 5c21c3dd
      Greg Edwards authored
      commit 67f2519f upstream.
      
      guard_bio_eod() needs to look at the partition capacity, not just the
      capacity of the whole device, when determining if truncation is
      necessary.
      
      [   60.268688] attempt to access beyond end of device
      [   60.268690] unknown-block(9,1): rw=0, want=67103509, limit=67103506
      [   60.268693] buffer_io_error: 2 callbacks suppressed
      [   60.268696] Buffer I/O error on dev md1p7, logical block 4524305, async page read
      
      Fixes: 74d46992 ("block: replace bi_bdev with a gendisk pointer and partitions index")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Greg Edwards <gedwards@ddn.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
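A minimal Python model of the capacity check, assuming partition-relative sector numbers. The helper name, its parameters and the whole-device size are hypothetical; the partition limit and the end sector reuse the figures from the log above (want=67103509, limit=67103506), and the start sector and length are chosen so the request ends there.

```python
from typing import Optional


def sectors_to_trim(start: int, count: int, disk_sectors: int,
                    part_sectors: Optional[int] = None) -> int:
    # Use the partition capacity when the I/O targets a partition,
    # otherwise fall back to the capacity of the whole device.
    limit = part_sectors if part_sectors is not None else disk_sectors
    if start >= limit:
        return 0  # entirely past the end: nothing sensible to truncate
    return max(0, start + count - limit)


# Checking only the whole device misses the overrun ...
print(sectors_to_trim(67103501, 8, disk_sectors=10**9))                          # 0
# ... while checking the partition trims the 3 sectors past its end.
print(sectors_to_trim(67103501, 8, disk_sectors=10**9, part_sectors=67103506))   # 3
```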
    • bcache: check ca->alloc_thread initialized before wake up it · e9c80881
      Coly Li authored
      commit 91af8300 upstream.
      
      In bcache code, sysfs entries are created before all resources get
      allocated, e.g. allocation thread of a cache set.
      
      There is a possibility of a NULL pointer dereference if a resource is
      accessed before it is initialized. Indeed, Jorg Bornschein hit one on the
      cache set allocation thread and got a kernel oops.
      
      The reason for this bug is that when bch_bucket_alloc() is called during
      cache set registration and attaching, ca->alloc_thread is not yet
      allocated and initialized, so calling wake_up_process() on
      ca->alloc_thread triggers a NULL pointer dereference. A simple and fast
      fix is to check whether ca->alloc_thread is allocated before waking it
      up, and to wake it up only when it is not NULL.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reported-by: Jorg Bornschein <jb@capsec.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • libceph: don't WARN() if user tries to add invalid key · bcae2363
      Eric Biggers authored
      commit b1127085 upstream.
      
      The WARN_ON(!key->len) in set_secret() in net/ceph/crypto.c is hit if a
      user tries to add a key of type "ceph" with an invalid payload as
      follows (assuming CONFIG_CEPH_LIB=y):
      
          echo -e -n '\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' \
      	| keyctl padd ceph desc @s
      
      This can be hit by fuzzers.  As this is merely bad input and not a
      kernel bug, replace the WARN_ON() with return -EINVAL.
      
      Fixes: 7af3ea18 ("libceph: stop allocating a new cipher on every crypto request")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
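Treating a zero-length key as bad input rather than as a kernel bug can be sketched in Python as follows. The 12-byte header layout used here (16-bit type, two 32-bit timestamp fields, 16-bit length, all little-endian) is an assumption made for illustration, as is the helper name; the payload is the one from the keyctl example in the commit message.

```python
import errno
import struct

HEADER = struct.Struct("<HIIH")  # assumed layout: type, sec, nsec, key length


def decode_key(payload: bytes):
    if len(payload) < HEADER.size:
        return -errno.EINVAL
    _key_type, _sec, _nsec, key_len = HEADER.unpack_from(payload)
    if key_len == 0:
        # Bad user input: reject it quietly instead of warning.
        return -errno.EINVAL
    if len(payload) < HEADER.size + key_len:
        return -errno.EINVAL
    return payload[HEADER.size:HEADER.size + key_len]


bad = b"\x01" + b"\x00" * 11  # payload from the keyctl example
print(decode_key(bad) == -errno.EINVAL)  # True
```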
    • eCryptfs: use after free in ecryptfs_release_messaging() · bc6e8968
      Dan Carpenter authored
      commit db86be3a upstream.
      
      We're freeing the list iterator so we should be using the _safe()
      version of hlist_for_each_entry().
      
      Fixes: 88b4a07e ("[PATCH] eCryptfs: Public key transport mechanism")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
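Why the _safe() iterator matters when entries are freed inside the loop can be modelled in Python with a hand-rolled singly linked list. This is an analogy, not the kernel hlist API: all names are hypothetical, and "freeing" is modelled by wiping the node's link so that reading it afterwards gives a wrong answer.

```python
class Node:
    def __init__(self, name, next_node=None):
        self.name = name
        self.next = next_node


def free(node):
    # Model a freed object: its link can no longer be trusted.
    node.next = None


def build(names):
    head = None
    for name in reversed(names):
        head = Node(name, head)
    return head


def release_unsafe(head):
    released, node = [], head
    while node is not None:
        released.append(node.name)
        free(node)
        node = node.next  # reads the link after the node was "freed"
    return released


def release_safe(head):
    released, node = [], head
    while node is not None:
        following = node.next  # save the link before freeing
        released.append(node.name)
        free(node)
        node = following
    return released


print(release_unsafe(build(["a", "b", "c"])))  # ['a']
print(release_safe(build(["a", "b", "c"])))    # ['a', 'b', 'c']
```

In the kernel the stale read is undefined behaviour rather than an early stop, which is why it shows up as a use after free.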
    • fscrypt: lock mutex before checking for bounce page pool · ddf1264e
      Eric Biggers authored
      commit a0b3bc85 upstream.
      
      fscrypt_initialize(), which allocates the global bounce page pool when
      an encrypted file is first accessed, uses "double-checked locking" to
      try to avoid locking fscrypt_init_mutex.  However, it doesn't use any
      memory barriers, so it's theoretically possible for a thread to observe
      a bounce page pool which has not been fully initialized.  This is a
      classic bug with "double-checked locking".
      
      While "only a theoretical issue" in the latest kernel, in pre-4.8
      kernels the pointer that was checked was not even the last to be
      initialized, so it was easily possible for a crash (NULL pointer
      dereference) to happen.  This was changed only incidentally by the large
      refactor to use fs/crypto/.
      
      Solve both problems in a trivial way that can easily be backported: just
      always take the mutex.  It's theoretically less efficient, but it
      shouldn't be noticeable in practice as the mutex is only acquired very
      briefly once per encrypted file.
      
      Later I'd like to make this use a helper macro like DO_ONCE().  However,
      DO_ONCE() runs in atomic context, so we'd need to add a new macro that
      allows blocking.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
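The "just always take the mutex" approach looks like this as a Python sketch. The names, the pool size and the page size are hypothetical; the point is only that the check for an existing pool happens after the lock is taken, with no unlocked fast path.

```python
import threading

_init_mutex = threading.Lock()
_bounce_pool = None  # hypothetical stand-in for the global bounce page pool


def initialize(pool_pages: int = 4):
    global _bounce_pool
    # No unlocked fast-path check: always take the mutex first.
    with _init_mutex:
        if _bounce_pool is None:
            _bounce_pool = [bytearray(4096) for _ in range(pool_pages)]
        return _bounce_pool


first = initialize()
print(initialize() is first)  # True: later callers see the fully built pool
```

The lock is uncontended after the first call, which matches the commit's argument that the extra cost should not be noticeable.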
    • nilfs2: fix race condition that causes file system corruption · f9478266
      Andreas Rohner authored
      commit 31ccb1f7 upstream.
      
      There is a race condition between nilfs_dirty_inode() and
      nilfs_set_file_dirty().
      
      When a file is opened, nilfs_dirty_inode() is called to update the
      access timestamp in the inode.  It calls __nilfs_mark_inode_dirty() in a
      separate transaction.  __nilfs_mark_inode_dirty() caches the ifile
      buffer_head in the i_bh field of the inode info structure and marks it
      as dirty.
      
      After some data was written to the file in another transaction, the
      function nilfs_set_file_dirty() is called, which adds the inode to the
      ns_dirty_files list.
      
      Then the segment construction calls nilfs_segctor_collect_dirty_files(),
      which goes through the ns_dirty_files list and checks the i_bh field.
      If there is a cached buffer_head in i_bh it is not marked as dirty
      again.
      
      Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
      transactions, it is possible that a segment construction that writes out
      the ifile occurs in-between the two.  If this happens the inode is not
      on the ns_dirty_files list, but its ifile block is still marked as dirty
      and written out.
      
      In the next segment construction, the data for the file is written out
      and nilfs_bmap_propagate() updates the b-tree.  Eventually the bmap root
      is written into the i_bh block, which is not dirty, because it was
      written out in another segment construction.
      
      As a result the bmap update can be lost, which leads to file system
      corruption.  Either the virtual block address points to an unallocated
      DAT block, or the DAT entry will be reused for something different.
      
      The error can remain undetected for a long time.  A typical error
      message would be one of the "bad btree" errors or a warning that a DAT
      entry could not be found.
      
      This bug can be reproduced reliably by a simple benchmark that creates
      and overwrites millions of 4k files.
      
      Link: http://lkml.kernel.org/r/1509367935-3086-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
      Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: Andreas Rohner <andreas.rohner@gmx.net>
      Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • autofs: don't fail mount for transient error · 7b7f5437
      NeilBrown authored
      commit ecc0c469 upstream.
      
      Currently if the autofs kernel module gets an error when writing to the
      pipe which links to the daemon, then it marks the whole mountpoint as
      catatonic, and it will stop working.
      
      It is possible that the error is transient.  This can happen if the
      daemon is slow and more than 16 requests queue up.  If a subsequent
      process tries to queue a request, and is then signalled, the write to
      the pipe will return -ERESTARTSYS and autofs will take that as total
      failure.
      
      So change the code to assess -ERESTARTSYS and -ENOMEM as transient
      failures which only abort the current request, not the whole mountpoint.
      
      It isn't a crash or a data corruption, but having autofs mountpoints
      suddenly stop working is rather inconvenient.
      
      Ian said:
      
      : And given the problems with a half dozen (or so) user space applications
      : consuming large amounts of CPU under heavy mount and umount activity this
      : could happen more easily than we expect.
      
      Link: http://lkml.kernel.org/r/87y3norvgp.fsf@notabene.neil.brown.name
      Signed-off-by: NeilBrown <neilb@suse.com>
      Acked-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
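The classification this commit describes, where some write errors abort only the current request and all others still mark the mountpoint catatonic, can be sketched in Python. The function name and the returned labels are hypothetical; ERESTARTSYS is defined by hand because it is a kernel-internal code that Python's errno module does not export.

```python
import errno

ERESTARTSYS = 512  # kernel-internal errno value, not exported to user space

TRANSIENT_ERRORS = {ERESTARTSYS, errno.ENOMEM}


def handle_pipe_write(ret: int) -> str:
    if ret >= 0:
        return "ok"
    if -ret in TRANSIENT_ERRORS:
        return "abort-request"  # only the current request fails
    return "catatonic"          # the whole mountpoint stops working


print(handle_pipe_write(-ERESTARTSYS))  # abort-request
print(handle_pipe_write(-errno.EPIPE))  # catatonic
```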
    • mm/z3fold.c: use kref to prevent page free/compact race · c1a14af3
      Vitaly Wool authored
      commit 5d03a661 upstream.
      
      There is a race in the current z3fold implementation between
      do_compact() called in a work queue context and the page release
      procedure when page's kref goes to 0.
      
      do_compact() may be waiting for page lock, which is released by
      release_z3fold_page_locked right before putting the page onto the
      "stale" list, and then the page may be freed as do_compact() modifies
      its contents.
      
      The mechanism currently implemented to handle that (checking the
      PAGE_STALE flag) is not reliable enough.  Instead, we'll use page's kref
      counter to guarantee that the page is not released if its compaction is
      scheduled.  It then becomes compaction function's responsibility to
      decrease the counter and quit immediately if the page was actually
      freed.
      
      Link: http://lkml.kernel.org/r/20171117092032.00ea56f42affbed19f4fcc6c@gmail.com
      Signed-off-by: Vitaly Wool <vitaly.wool@sonymobile.com>
      Cc: <Oleksiy.Avramchenko@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • rt2x00usb: mark device removed when get ENOENT usb error · 769bfea5
      Stanislaw Gruszka authored
      commit bfa62a52 upstream.
      
      The ENOENT USB error means "specified interface or endpoint does not exist
      or is not enabled". Mark the device as not present when we encounter this
      error, as we already do for the ENODEV error.
      
      Otherwise we can loop forever in rt2x00usb_work_rxdone(), because
      we keep removing RX entries and putting them back on the queue.
      
      A similar situation can arise when URB submission keeps failing
      with some other error, so we should consider limiting the number of
      entries processed by the rxdone work. But for now, since the patch fixes
      a reproducible soft lockup on single-processor systems, and given what
      the ENOENT error means, let's apply this fix.
      
      The patch adds the ENOENT check not only in the rx kick routine, but
      also in the other places where we check for the ENODEV error.
      Reported-by: Richard Genoud <richard.genoud@gmail.com>
      Debugged-by: Richard Genoud <richard.genoud@gmail.com>
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Tested-by: Richard Genoud <richard.genoud@gmail.com>
      Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • MIPS: math-emu: Fix final emulation phase for certain instructions · 085d6651
      Aleksandar Markovic authored
      commit 409fcace upstream.
      
      Fix final phase of <CLASS|MADDF|MSUBF|MAX|MIN|MAXA|MINA>.<D|S>
      emulation. Provide proper generation of SIGFPE signal and updating
      debugfs FP exception stats in cases of any exception flags set in
      preceding phases of emulation.
      
      CLASS.<D|S> instruction may generate "Unimplemented Operation" FP
      exception. <MADDF|MSUBF>.<D|S> instructions may generate "Inexact",
      "Unimplemented Operation", "Invalid Operation", "Overflow", and
      "Underflow" FP exceptions. <MAX|MIN|MAXA|MINA>.<D|S> instructions
      can generate "Unimplemented Operation" and "Invalid Operation" FP
      exceptions.
      
      The proper final processing of the cases when any FP exception
      flag is set is achieved by replacing the "break" statement with a
      "goto copcsr" statement. With this change, the final phase of
      emulation of the above instructions becomes consistent with that
      of the previously implemented emulation of other related FPU
      instructions (ADD, SUB, etc.).
      
      Fixes: 38db37ba ("MIPS: math-emu: Add support for the MIPS R6 CLASS FPU instruction")
      Fixes: e24c3bec ("MIPS: math-emu: Add support for the MIPS R6 MADDF FPU instruction")
      Fixes: 83d43305 ("MIPS: math-emu: Add support for the MIPS R6 MSUBF FPU instruction")
      Fixes: a79f5f9b ("MIPS: math-emu: Add support for the MIPS R6 MAX{, A} FPU instruction")
      Fixes: 4e9561b2 ("MIPS: math-emu: Add support for the MIPS R6 MIN{, A} FPU instruction")
      Signed-off-by: Aleksandar Markovic <aleksandar.markovic@mips.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Douglas Leung <douglas.leung@mips.com>
      Cc: Goran Ferenc <goran.ferenc@mips.com>
      Cc: "Maciej W. Rozycki" <macro@imgtec.com>
      Cc: Miodrag Dinic <miodrag.dinic@mips.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Petar Jovanovic <petar.jovanovic@mips.com>
      Cc: Raghu Gandham <raghu.gandham@mips.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/17581/
      Signed-off-by: James Hogan <jhogan@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      085d6651
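      The difference between "break" and "goto copcsr" in the math-emu
      commit above can be illustrated with a toy dispatcher. Everything here
      is a simplified assumption: the demo_* names, the single flag bit and
      the "fixed" switch are hypothetical, and the real emulator also checks
      which exceptions are enabled, which this sketch leaves out.

```c
#include <stdbool.h>

enum demo_op { DEMO_ADD, DEMO_MADDF };

#define DEMO_FLAG_INVALID 0x1u	/* hypothetical exception flag bit */

struct demo_result {
	bool sigfpe;		/* would a SIGFPE be raised? */
	unsigned int stats;	/* exceptions counted for the statistics */
};

/* "fixed" selects between the old (break) and new (goto) behaviour. */
static struct demo_result demo_emulate(enum demo_op op, unsigned int flags,
				       bool fixed)
{
	struct demo_result res = { false, 0 };

	switch (op) {
	case DEMO_ADD:
		goto copcsr;	/* older instructions already did this */
	case DEMO_MADDF:
		if (fixed)
			goto copcsr;
		break;		/* old behaviour: flag processing is skipped */
	}
	return res;

copcsr:
	/* Common final phase: account for and report raised exceptions. */
	if (flags) {
		res.stats++;
		res.sigfpe = true;
	}
	return res;
}
```

      With a flag set, the DEMO_MADDF path reports nothing when "fixed" is
      false and reports the exception when it is true, matching the DEMO_ADD
      path.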
    • Mirko Parthey's avatar
      MIPS: BCM47XX: Fix LED inversion for WRT54GSv1 · 8d187fa8
      Mirko Parthey authored
      commit 56a46acf upstream.
      
      The WLAN LED on the Linksys WRT54GSv1 is active low, but the software
      treats it as active high. Fix the inverted logic.
      
      Fixes: 7bb26b16 ("MIPS: BCM47xx: Fix LEDs on WRT54GS V1.0")
      Signed-off-by: Mirko Parthey <mirko.parthey@web.de>
      Looks-ok-by: Rafał Miłecki <zajec5@gmail.com>
      Cc: Hauke Mehrtens <hauke@hauke-m.de>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/16071/
      Signed-off-by: James Hogan <jhogan@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8d187fa8
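      As a small illustration of what "active low" means in the LED commit
      above: the electrical level driven on the pin is the inverse of the
      logical on/off state. The struct and function below are hypothetical
      and are not the BCM47XX board code.

```c
#include <stdbool.h>

/* Hypothetical LED description; only the polarity matters here. */
struct demo_led {
	bool active_low;
};

/* Electrical level to drive on the pin for a requested logical state. */
static bool demo_led_pin_level(const struct demo_led *led, bool on)
{
	return led->active_low ? !on : on;
}
```

      Treating an active-low LED as active high, as the commit describes,
      makes it light up exactly when it should be off.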
    • Maciej W. Rozycki's avatar
      MIPS: Fix an n32 core file generation regset support regression · dc3aceed
      Maciej W. Rozycki authored
      commit 547da673 upstream.
      
      Fix a regression introduced by commit 7aeb753b ("MIPS: Implement
      task_user_regset_view.") and then activated by commit 6a9c001b ("MIPS:
      Switch ELF core dumper to use regsets."), which caused n32 processes
      to dump o32 core files by failing to set the EF_MIPS_ABI2 flag in the
      ELF core file header's `e_flags' member:
      
      $ file tls-core
      tls-core: ELF 32-bit MSB executable, MIPS, N32 MIPS64 rel2 version 1 (SYSV), [...]
      $ ./tls-core
      Aborted (core dumped)
      $ file core
      core: ELF 32-bit MSB core file MIPS, MIPS-I version 1 (SYSV), SVR4-style
      $
      
      Previously the flag was set as the result of a:
      
      statement placed in arch/mips/kernel/binfmt_elfn32.c, however in the
      regset case, i.e. when CORE_DUMP_USE_REGSET is set, ELF_CORE_EFLAGS is
      no longer used by `fill_note_info' in fs/binfmt_elf.c, and instead the
      `->e_flags' member of the regset view chosen is.  We have the views
      defined in arch/mips/kernel/ptrace.c, however only an o32 and an n64
      one, and the latter is used for n32 as well.  Consequently an o32 core
      file is incorrectly dumped from n32 processes (the ELF32 vs ELF64 class
      is chosen elsewhere, and the 32-bit one is correctly selected for n32).
      
      Correct the issue then by defining an n32 regset view and using it as
      appropriate.  Issue discovered in GDB testing.
      
      Fixes: 7aeb753b ("MIPS: Implement task_user_regset_view.")
      Signed-off-by: Maciej W. Rozycki <macro@mips.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Djordje Todorovic <djordje.todorovic@rt-rk.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/17617/
      Signed-off-by: James Hogan <jhogan@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dc3aceed
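      The way a tool can tell an n32 core file from an o32 one, as in the
      `file' output quoted in the commit above, can be sketched as follows.
      The DEMO_* constants are local copies whose values are believed to
      match ELFCLASS32, ELFCLASS64 and EF_MIPS_ABI2 from <elf.h>; the
      classification function itself is a hypothetical simplification.

```c
#include <stdint.h>

#define DEMO_ELFCLASS32   1		/* assumed equal to ELFCLASS32 */
#define DEMO_ELFCLASS64   2		/* assumed equal to ELFCLASS64 */
#define DEMO_EF_MIPS_ABI2 0x00000020u	/* assumed equal to EF_MIPS_ABI2 */

enum demo_abi { DEMO_ABI_O32, DEMO_ABI_N32, DEMO_ABI_N64 };

/*
 * The ELF class separates n64 from the 32-bit ABIs; among 32-bit files
 * the EF_MIPS_ABI2 bit in e_flags separates n32 from o32.
 */
static enum demo_abi demo_classify(uint8_t elf_class, uint32_t e_flags)
{
	if (elf_class == DEMO_ELFCLASS64)
		return DEMO_ABI_N64;
	return (e_flags & DEMO_EF_MIPS_ABI2) ? DEMO_ABI_N32 : DEMO_ABI_O32;
}
```

      A 32-bit core file written without the flag is classified as o32
      here, which is the symptom the commit describes.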
    • Masahiro Yamada's avatar
      MIPS: dts: remove bogus bcm96358nb4ser.dtb from dtb-y entry · 43bce9f2
      Masahiro Yamada authored
      commit 3cad14d5 upstream.
      
      arch/mips/boot/dts/brcm/bcm96358nb4ser.dts does not exist, so
      we cannot build bcm96358nb4ser.dtb.
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Fixes: 69583551 ("MIPS: BMIPS: rename bcm96358nb4ser to bcm6358-neufbox4-sercom")
      Acked-by: James Hogan <jhogan@kernel.org>
      Signed-off-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      43bce9f2
    • James Hogan's avatar
      MIPS: Fix MIPS64 FP save/restore on 32-bit kernels · d6353404
      James Hogan authored
      commit 22b8ba76 upstream.
      
      32-bit kernels can be configured to support MIPS64, in which case
      neither CONFIG_64BIT nor CONFIG_CPU_MIPS32_R* will be set. This causes
      the CP0_Status.FR checks at the point of floating point register save
      and restore to be compiled out, which results in odd FP registers not
      being saved or restored to the task or signal context even when
      CP0_Status.FR is set.
      
      Fix the ifdefs to use CONFIG_CPU_MIPSR2 and CONFIG_CPU_MIPSR6, which are
      enabled for the relevant revisions of either MIPS32 or MIPS64, along
      with some other CPUs such as Octeon (r2), Loongson1 (r2), XLP (r2) and
      Loongson 3A R2.
      
      The suspect code originates from commit 597ce172 ("MIPS: Support for
      64-bit FP with O32 binaries") in v3.14, however the code in
      __enable_fpu() was consistent and refused to set FR=1, falling back to
      software FPU emulation. This was suboptimal but should be functionally
      correct.
      
      Commit fcc53b5f ("MIPS: fpu.h: Allow 64-bit FPU on a 64-bit MIPS R6
      CPU") in v4.2 (and stable tagged back to 4.0) later introduced the bug
      by updating __enable_fpu() to set FR=1 but failing to update the other
      similar ifdefs to enable FR=1 state handling.
      
      Fixes: fcc53b5f ("MIPS: fpu.h: Allow 64-bit FPU on a 64-bit MIPS R6 CPU")
      Signed-off-by: James Hogan <jhogan@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/16739/
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d6353404
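      The effect of the ifdef change in the commit above can be reproduced
      with the preprocessor alone. The sketch simulates a 32-bit kernel
      built for a MIPS64r2 CPU by defining only CONFIG_CPU_MIPSR2. The exact
      form of the old guard is an assumption reconstructed from the commit
      message, and the DEMO_* macros and demo_* functions are hypothetical.

```c
/* Simulated configuration: 32-bit kernel on a MIPS64r2 CPU (assumed). */
#define CONFIG_CPU_MIPSR2 1
/* CONFIG_64BIT and CONFIG_CPU_MIPS32_R2/R6 are deliberately not defined. */

/* Old-style guard (assumed form): compiled out in this configuration. */
#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
    defined(CONFIG_CPU_MIPS32_R6)
#define DEMO_OLD_GUARD_ACTIVE 1
#else
#define DEMO_OLD_GUARD_ACTIVE 0
#endif

/* New-style guard: keyed on the ISA revision, not on the kernel width. */
#if defined(CONFIG_CPU_MIPSR2) || defined(CONFIG_CPU_MIPSR6)
#define DEMO_NEW_GUARD_ACTIVE 1
#else
#define DEMO_NEW_GUARD_ACTIVE 0
#endif

static int demo_old_guard_active(void)
{
	return DEMO_OLD_GUARD_ACTIVE;
}

static int demo_new_guard_active(void)
{
	return DEMO_NEW_GUARD_ACTIVE;
}
```

      In this configuration only the new guard is active, so code behind it
      (such as the CP0_Status.FR handling) would be compiled in.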
    • James Hogan's avatar
      MIPS: Fix odd fp register warnings with MIPS64r2 · 43292e65
      James Hogan authored
      commit c7fd89a6 upstream.
      
      Building 32-bit MIPS64r2 kernels produces warnings like the following
      on certain toolchains (such as GNU assembler 2.24.90, but not GNU
      assembler 2.28.51) since commit 22b8ba76 ("MIPS: Fix MIPS64 FP
      save/restore on 32-bit kernels"), due to the exposure of fpu_save_16odd
      from fpu_save_double and fpu_restore_16odd from fpu_restore_double:
      
      arch/mips/kernel/r4k_fpu.S:47: Warning: float register should be even, was 1
      ...
      arch/mips/kernel/r4k_fpu.S:59: Warning: float register should be even, was 1
      ...
      
      This appears to be because .set mips64r2 does not change the FPU ABI to
      64-bit when -march=mips64r2 (or e.g. -march=xlp) is provided on the
      command line on that toolchain, from the default FPU ABI of 32-bit due
      to the -mabi=32. This makes access to the odd FPU registers invalid.
      
      Fix by explicitly changing the FPU ABI with .set fp=64 directives in
      fpu_save_16odd and fpu_restore_16odd, and moving the undefine of fp up
      in asmmacro.h so fp doesn't turn into $30.
      
      Fixes: 22b8ba76 ("MIPS: Fix MIPS64 FP save/restore on 32-bit kernels")
      Signed-off-by: James Hogan <jhogan@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/17656/
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      43292e65
    • Mike Snitzer's avatar
      dm: discard support requires all targets in a table support discards · e39516d2
      Mike Snitzer authored
      commit 8a74d29d upstream.
      
      A DM device with a mix of discard capabilities (due to some underlying
      devices not having discard support) _should_ just return -EOPNOTSUPP for
      the region of the device that doesn't support discards (even if only by
      way of the underlying driver formally not supporting discards).  BUT,
      that does ask the underlying driver to handle something that it never
      advertised support for.  In doing so we're exposing users to the
      potential for an underlying disk driver hanging if/when a discard is
      issued to a device that is incapable of, and never claimed to support,
      discards.
      
      Fix this by requiring that each DM target in a DM table provide discard
      support as a prereq for a DM device to advertise support for discards.
      
      This may cause some configurations that were happily supporting discards
      (even in the face of a mix of discard support) to stop supporting
      discards -- but the risk of users hitting driver hangs, and forced
      reboots, outweighs supporting those fringe mixed discard
      configurations.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e39516d2
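      The rule introduced by the dm discard commit above, that a device
      advertises discards only when every target in its table supports
      them, amounts to an "all of" check. A minimal sketch with hypothetical
      types follows; it is not the device-mapper implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-target capability record. */
struct demo_target {
	bool supports_discard;
};

/*
 * A table supports discards only if every target does. Treating an
 * empty table as unsupported is a choice made for this sketch.
 */
static bool demo_table_supports_discards(const struct demo_target *targets,
					 size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (!targets[i].supports_discard)
			return false;
	}
	return count > 0;
}
```

      A single target without discard support is enough to disable discards
      for the whole device, which is the behaviour change the commit warns
      about.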
    • Hou Tao's avatar
      dm: fix race between dm_get_from_kobject() and __dm_destroy() · 3bfb87ec
      Hou Tao authored
      commit b9a41d21 upstream.
      
      The following BUG_ON was hit when testing repeated creation and removal
      of DM devices:
      
          kernel BUG at drivers/md/dm.c:2919!
          CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
          Call Trace:
           [<ffffffff81649e8b>] dm_get_from_kobject+0x34/0x3a
           [<ffffffff81650ef1>] dm_attr_show+0x2b/0x5e
           [<ffffffff817b46d1>] ? mutex_lock+0x26/0x44
           [<ffffffff811df7f5>] sysfs_kf_seq_show+0x83/0xcf
           [<ffffffff811de257>] kernfs_seq_show+0x23/0x25
           [<ffffffff81199118>] seq_read+0x16f/0x325
           [<ffffffff811de994>] kernfs_fop_read+0x3a/0x13f
           [<ffffffff8117b625>] __vfs_read+0x26/0x9d
           [<ffffffff8130eb59>] ? security_file_permission+0x3c/0x44
           [<ffffffff8117bdb8>] ? rw_verify_area+0x83/0xd9
           [<ffffffff8117be9d>] vfs_read+0x8f/0xcf
           [<ffffffff81193e34>] ? __fdget_pos+0x12/0x41
           [<ffffffff8117c686>] SyS_read+0x4b/0x76
           [<ffffffff817b606e>] system_call_fastpath+0x12/0x71
      
      The bug can be easily triggered, if an extra delay (e.g. 10ms) is added
      between the test of DMF_FREEING & DMF_DELETING and dm_get() in
      dm_get_from_kobject().
      
      To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
      dm_get() are done in an atomic way, so _minor_lock is used.
      
      The other callers of dm_get() have also been checked to be OK: some
      callers invoke dm_get() under _minor_lock, some callers invoke it under
      _hash_lock, and dm_start_request() invokes it after increasing
      md->open_count.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3bfb87ec
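      The fix in the dm race commit above is to make the flag test and the
      reference grab one atomic step under a lock. A user-space sketch of
      that pattern with a pthread mutex follows. All names are hypothetical
      stand-ins (demo_minor_lock for _minor_lock, the two booleans for
      DMF_FREEING and DMF_DELETING), and this is not the dm.c code.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mapped_device. */
struct demo_md {
	bool freeing;
	bool deleting;
	int holders;
};

static pthread_mutex_t demo_minor_lock = PTHREAD_MUTEX_INITIALIZER;

/* Test the flags and take the reference without dropping the lock. */
static struct demo_md *demo_get_checked(struct demo_md *md)
{
	struct demo_md *ret = md;

	pthread_mutex_lock(&demo_minor_lock);
	if (md->freeing || md->deleting)
		ret = NULL;		/* too late: teardown has begun */
	else
		md->holders++;		/* safe: teardown cannot start here */
	pthread_mutex_unlock(&demo_minor_lock);
	return ret;
}

/* The teardown side sets its flag under the same lock. */
static void demo_mark_freeing(struct demo_md *md)
{
	pthread_mutex_lock(&demo_minor_lock);
	md->freeing = true;
	pthread_mutex_unlock(&demo_minor_lock);
}
```

      Because both sides take the same lock, no delay between the test and
      the increment can let a teardown slip in between them.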
    • John Crispin's avatar
      MIPS: pci: Remove KERN_WARN instance inside the mt7620 driver · 9be341ed
      John Crispin authored
      commit 8593b18a upstream.
      
      Switch the printk() call to the preferred pr_warn() API.
      
      Fixes: 7e5873d3 ("MIPS: pci: Add MT7620a PCIE driver")
      Signed-off-by: John Crispin <john@phrozen.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/15321/
      Signed-off-by: James Hogan <jhogan@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9be341ed
    • Steven Rostedt (Red Hat)'s avatar
      sched/rt: Simplify the IPI based RT balancing logic · f17c786b
      Steven Rostedt (Red Hat) authored
      commit 4bdced5c upstream.
      
      When a CPU lowers its priority (schedules out a high priority task for a
      lower priority one), a check is made to see if any other CPU has overloaded
      RT tasks (more than one). It checks the rto_mask to determine this and if so
      it will request to pull one of those tasks to itself if the non running RT
      task is of higher priority than the new priority of the next task to run on
      the current CPU.
      
      When we deal with large number of CPUs, the original pull logic suffered
      from large lock contention on a single CPU run queue, which caused a huge
      latency across all CPUs. This was caused by only having one CPU having
      overloaded RT tasks and a bunch of other CPUs lowering their priority. To
      solve this issue, commit:
      
        b6366f04 ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
      
      changed the way to request a pull. Instead of grabbing the lock of the
      overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.
      
      Although the IPI logic worked very well in removing the large latency build
      up, it still could suffer from a large number of IPIs being sent to a single
      CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
      when I tested this on a 120 CPU box, with a stress test that had lots of
      RT tasks scheduling on all CPUs, it actually triggered the hard lockup
      detector! One CPU had so many IPIs sent to it, and due to the restart
      mechanism that is triggered when the source run queue has a priority status
      change, the CPU spent minutes! processing the IPIs.
      
      Thinking about this further, I realized there's no reason for each run queue
      to send its own IPI. As all CPUs with overloaded tasks must be scanned
      regardless if there's one or many CPUs lowering their priority, because
      there's no current way to find the CPU with the highest priority task that
      can schedule to one of these CPUs, there really only needs to be one IPI
      being sent around at a time.
      
      This greatly simplifies the code!
      
      The new approach is to have each root domain have its own irq work, as the
      rto_mask is per root domain. The root domain has the following fields
      attached to it:
      
        rto_push_work  - the irq work to process each CPU set in rto_mask
        rto_lock       - the lock to protect some of the other rto fields
        rto_loop_start - an atomic that keeps contention down on rto_lock;
                         the first CPU scheduling in a lower priority task
                         is the one to kick off the process.
        rto_loop_next  - an atomic that gets incremented for each CPU that
                         schedules in a lower priority task.
        rto_loop       - a variable protected by rto_lock that is used to
                         compare against rto_loop_next
        rto_cpu        - the CPU to send the next IPI to, also protected by
                         rto_lock.
      
      When a CPU schedules in a lower priority task and wants to make sure
      overloaded CPUs know about it, it increments rto_loop_next. Then it
      atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
      then it is done, as another CPU is kicking off the IPI loop. If the old
      value is "0", then it will take the rto_lock to synchronize with a possible
      IPI being sent around to the overloaded CPUs.
      
      If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
      IPI being sent around, or one is about to finish. Then rto_cpu is set to the
      first CPU in rto_mask and an IPI is sent to that CPU. If there are no CPUs
      set in rto_mask, then there's nothing to be done.
      
      When the CPU receives the IPI, it will first try to push any RT tasks that
      are queued on the CPU but can't run because a higher priority RT task is
      currently running on that CPU.
      
      Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
      finds one, it simply sends an IPI to that CPU and the process continues.
      
      If there are no more CPUs in the rto_mask, then rto_loop is compared with
      rto_loop_next. If they match, everything is done and the process is over. If
      they do not match, then a CPU scheduled in a lower priority task as the IPI
      was being passed around, and the process needs to start again. The first CPU
      in rto_mask is sent the IPI.
      
      This change removes this duplication of work in the IPI logic, and greatly
      lowers the latency caused by the IPIs. This removed the lockup happening on
      the 120 CPU machine. It also simplifies the code tremendously. What else
      could anyone ask for?
      
      Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
      supplying me with the rto_start_trylock() and rto_start_unlock() helper
      functions.
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott Wood <swood@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f17c786b
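      The single-IPI loop described in the sched/rt commit above can be
      modelled in user space. The sketch below is a single-threaded
      simulation under several assumptions: rto_mask is a plain bitmask,
      rto_lock is only marked by comments, sending an IPI is represented by
      returning the target CPU number, and all demo_* names are
      hypothetical. It follows the commit message, not the kernel source.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 8

struct demo_root_domain {
	unsigned int rto_mask;		/* bit n set: CPU n is overloaded */
	atomic_int rto_loop_start;	/* non-zero while a CPU kicks off */
	atomic_int rto_loop_next;	/* bumped on every priority drop */
	int rto_loop;			/* last rto_loop_next value handled */
	int rto_cpu;			/* >= DEMO_NR_CPUS: no IPI in flight */
};

/* Next CPU set in mask after prev (prev = -1 starts at CPU 0), or -1. */
static int demo_next_in_mask(unsigned int mask, int prev)
{
	for (int cpu = prev + 1; cpu < DEMO_NR_CPUS; cpu++) {
		if (mask & (1u << cpu))
			return cpu;
	}
	return -1;
}

/* Pick the CPU for the next IPI, or -1 when the whole process is over. */
static int demo_rto_next_cpu(struct demo_root_domain *rd)
{
	for (;;) {
		int prev = rd->rto_cpu >= DEMO_NR_CPUS ? -1 : rd->rto_cpu;
		int cpu = demo_next_in_mask(rd->rto_mask, prev);
		int next;

		if (cpu >= 0) {
			rd->rto_cpu = cpu;
			return cpu;
		}
		rd->rto_cpu = DEMO_NR_CPUS;	/* this pass is finished */

		/* Did another CPU lower its priority during the pass? */
		next = atomic_load(&rd->rto_loop_next);
		if (rd->rto_loop == next)
			return -1;
		rd->rto_loop = next;		/* yes: start another pass */
	}
}

/* Called by a CPU that has just scheduled in a lower priority task. */
static int demo_tell_cpu_to_push(struct demo_root_domain *rd)
{
	int expected = 0;
	int cpu = -1;

	atomic_fetch_add(&rd->rto_loop_next, 1);

	/* Only one CPU at a time gets to kick off the loop. */
	if (!atomic_compare_exchange_strong(&rd->rto_loop_start, &expected, 1))
		return -1;

	/* rto_lock would be held from here ... */
	if (rd->rto_cpu >= DEMO_NR_CPUS)
		cpu = demo_rto_next_cpu(rd);
	/* ... to here. */

	atomic_store(&rd->rto_loop_start, 0);
	return cpu;
}
```

      With CPUs 2 and 5 overloaded, the first demo_tell_cpu_to_push() call
      returns 2 and a second call made while the IPI is in flight returns
      -1. Calling demo_rto_next_cpu() as the IPI handler then yields 5, a
      restarted pass of 2 and 5 (because rto_loop_next moved), and finally
      -1.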
    • Mikulas Patocka's avatar
      dm: allocate struct mapped_device with kvzalloc · 2bf483c9
      Mikulas Patocka authored
      commit 856eb091 upstream.
      
      The structure srcu_struct can be very big; its size is proportional to the
      value of CONFIG_NR_CPUS. The Fedora kernel has CONFIG_NR_CPUS 8192, so the
      field io_barrier in struct mapped_device takes 84kB in the debugging kernel
      and 50kB in the non-debugging kernel. The large size may result in failure
      of the function kzalloc_node.
      
      In order to avoid the allocation failure, we use the function
      kvzalloc_node, which falls back to vmalloc if a large contiguous
      chunk of memory is not available. This patch also moves the field
      io_barrier to the last position of struct mapped_device - the reason is
      that on many processor architectures, short memory offsets result in
      smaller code than long memory offsets - on x86-64 it reduces code size by
      320 bytes.
      
      Note to stable kernel maintainers - the kernels 4.11 and older don't have
      the function kvzalloc_node, you can use the function vzalloc_node instead.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2bf483c9
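      The "try a contiguous allocation, then fall back" idea behind
      kvzalloc_node in the commit above can be mimicked in user space. The
      sketch uses calloc() for both paths and an artificial size limit to
      force the fallback. The limit, the enum and all demo_* names are
      hypothetical, and no kernel allocator is involved.

```c
#include <stddef.h>
#include <stdlib.h>

#define DEMO_CONTIG_LIMIT (32 * 1024)	/* hypothetical contiguous limit */

enum demo_source { DEMO_NONE, DEMO_CONTIGUOUS, DEMO_FALLBACK };

/* Stand-in for a contiguous allocator: refuses large requests. */
static void *demo_alloc_contiguous(size_t size)
{
	if (size > DEMO_CONTIG_LIMIT)
		return NULL;
	return calloc(1, size);
}

/* Stand-in for the vmalloc-style path used as a fallback. */
static void *demo_alloc_fallback(size_t size)
{
	return calloc(1, size);
}

/* Zeroed allocation that records which path satisfied the request. */
static void *demo_kvzalloc(size_t size, enum demo_source *src)
{
	void *p = demo_alloc_contiguous(size);

	*src = DEMO_CONTIGUOUS;
	if (!p) {
		p = demo_alloc_fallback(size);
		*src = p ? DEMO_FALLBACK : DEMO_NONE;
	}
	return p;
}
```

      An 84kB request, the size quoted for io_barrier in the debugging
      kernel, takes the fallback path in this sketch, while a small request
      stays on the contiguous path.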
    • Vivek Goyal's avatar
      ovl: Put upperdentry if ovl_check_origin() fails · 13e65600
      Vivek Goyal authored
      commit 5455f92b upstream.
      
      If ovl_check_origin() fails, we should put upperdentry. We have a reference
      on it by now. So goto out_put_upper instead of out.
      
      Fixes: a9d01957 ("ovl: lookup non-dir copy-up-origin by file handle")
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      13e65600
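      The error-path fix in the ovl commit above follows the usual goto
      cleanup pattern: once a reference is held, every failure after that
      point must jump to a label that drops it. A minimal sketch with a
      hypothetical reference-counted object follows; origin_err simulates
      the result of ovl_check_origin(), and none of this is overlayfs code.

```c
/* Hypothetical reference-counted object standing in for a dentry. */
struct demo_dentry {
	int refcount;
};

static void demo_dput(struct demo_dentry *d)
{
	if (d)
		d->refcount--;
}

/* Returns 0 on success, or the simulated negative error on failure. */
static int demo_lookup(struct demo_dentry *upper, int origin_err)
{
	int err;

	upper->refcount++;	/* a reference on upper is now held */

	err = origin_err;	/* simulated origin check result */
	if (err)
		goto out_put_upper;	/* a bare "out" would leak it */

	return 0;		/* success: the caller keeps the reference */

out_put_upper:
	demo_dput(upper);
	return err;
}
```

      On failure the reference count returns to its starting value; on
      success it stays raised for the caller.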