1. 21 Sep, 2016 18 commits
    • raid5: fix to detect failure of register_shrinker · 6a0f53ff
      Chao Yu authored
      register_shrinker can fail since commit 1d3d4437 ("vmscan: per-node
      deferred work"), so we should detect that failure. Otherwise the shrinker
      may fail to register even though the raid5 configuration was set up successfully.
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
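The check described above can be sketched in userspace C. This is a minimal model, not the raid5 code: `demo_register_shrinker`, `demo_setup_conf`, `demo_free_conf` and the `simulate_oom` switch are hypothetical stand-ins introduced only to show a setup routine that treats a failed registration as a setup failure and unwinds.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the kernel's types. */
struct demo_shrinker { int registered; };
struct demo_conf {
    struct demo_shrinker shrinker;
    int *stripes;
};

/* Models a registration step that may fail with -ENOMEM. */
static int demo_register_shrinker(struct demo_shrinker *s, int simulate_oom)
{
    if (simulate_oom)
        return -ENOMEM;
    s->registered = 1;
    return 0;
}

static void demo_free_conf(struct demo_conf *conf)
{
    if (!conf)
        return;
    free(conf->stripes);
    free(conf);
}

/* Returns a fully set-up configuration, or NULL if any step failed. */
static struct demo_conf *demo_setup_conf(int simulate_oom)
{
    struct demo_conf *conf = calloc(1, sizeof(*conf));

    if (!conf)
        return NULL;
    conf->stripes = calloc(16, sizeof(*conf->stripes));
    if (!conf->stripes)
        goto abort;
    /* The point of the fix: do not ignore the return value. */
    if (demo_register_shrinker(&conf->shrinker, simulate_oom) != 0)
        goto abort;
    return conf;

abort:
    demo_free_conf(conf);
    return NULL;
}
```

With this shape, a caller never sees a configuration whose shrinker is silently missing: either everything is registered or setup reports failure.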
    • md: fix a potential deadlock · 90bcf133
      Shaohua Li authored
      lockdep reports a potential deadlock. Fix this by dropping the mutex
      before md_import_device.
      
      [ 1137.126601] ======================================================
      [ 1137.127013] [ INFO: possible circular locking dependency detected ]
      [ 1137.127013] 4.8.0-rc4+ #538 Not tainted
      [ 1137.127013] -------------------------------------------------------
      [ 1137.127013] mdadm/16675 is trying to acquire lock:
      [ 1137.127013]  (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff81243cf3>] __blkdev_get+0x63/0x450
      [ 1137.127013]
      but task is already holding lock:
      [ 1137.127013]  (detected_devices_mutex){+.+.+.}, at: [<ffffffff81a5138c>] md_ioctl+0x2ac/0x1f50
      [ 1137.127013]
      which lock already depends on the new lock.
      
      [ 1137.127013]
      the existing dependency chain (in reverse order) is:
      [ 1137.127013]
      -> #1 (detected_devices_mutex){+.+.+.}:
      [ 1137.127013]        [<ffffffff810b6f19>] lock_acquire+0xb9/0x220
      [ 1137.127013]        [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0
      [ 1137.127013]        [<ffffffff81a4eeaf>] md_autodetect_dev+0x3f/0x90
      [ 1137.127013]        [<ffffffff81595be8>] rescan_partitions+0x1a8/0x2c0
      [ 1137.127013]        [<ffffffff81590081>] __blkdev_reread_part+0x71/0xb0
      [ 1137.127013]        [<ffffffff815900e5>] blkdev_reread_part+0x25/0x40
      [ 1137.127013]        [<ffffffff81590c4b>] blkdev_ioctl+0x51b/0xa30
      [ 1137.127013]        [<ffffffff81242bf1>] block_ioctl+0x41/0x50
      [ 1137.127013]        [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0
      [ 1137.127013]        [<ffffffff81215321>] SyS_ioctl+0x41/0x70
      [ 1137.127013]        [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [ 1137.127013]
      -> #0 (&bdev->bd_mutex){+.+.+.}:
      [ 1137.127013]        [<ffffffff810b6af2>] __lock_acquire+0x1662/0x1690
      [ 1137.127013]        [<ffffffff810b6f19>] lock_acquire+0xb9/0x220
      [ 1137.127013]        [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0
      [ 1137.127013]        [<ffffffff81243cf3>] __blkdev_get+0x63/0x450
      [ 1137.127013]        [<ffffffff81244307>] blkdev_get+0x227/0x350
      [ 1137.127013]        [<ffffffff812444f6>] blkdev_get_by_dev+0x36/0x50
      [ 1137.127013]        [<ffffffff81a46d65>] lock_rdev+0x35/0x80
      [ 1137.127013]        [<ffffffff81a49bb4>] md_import_device+0xb4/0x1b0
      [ 1137.127013]        [<ffffffff81a513d6>] md_ioctl+0x2f6/0x1f50
      [ 1137.127013]        [<ffffffff815909b3>] blkdev_ioctl+0x283/0xa30
      [ 1137.127013]        [<ffffffff81242bf1>] block_ioctl+0x41/0x50
      [ 1137.127013]        [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0
      [ 1137.127013]        [<ffffffff81215321>] SyS_ioctl+0x41/0x70
      [ 1137.127013]        [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [ 1137.127013]
      other info that might help us debug this:
      
      [ 1137.127013]  Possible unsafe locking scenario:
      
      [ 1137.127013]        CPU0                    CPU1
      [ 1137.127013]        ----                    ----
      [ 1137.127013]   lock(detected_devices_mutex);
      [ 1137.127013]                                lock(&bdev->bd_mutex);
      [ 1137.127013]                                lock(detected_devices_mutex);
      [ 1137.127013]   lock(&bdev->bd_mutex);
      [ 1137.127013]
       *** DEADLOCK ***
      
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/bitmap: fix wrong cleanup · f71f1cf9
      Shaohua Li authored
      If bitmap_create fails, the bitmap is already cleaned up and the returned value
      is an error number. We can't do the cleanup again.
      Reported-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: Shaohua Li <shli@fb.com>
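The rule stated above (a constructor that cleans up after itself and returns an encoded error must not be followed by a second cleanup) can be sketched in userspace C. The `demo_*` helpers below are hypothetical and only imitate the kernel's ERR_PTR convention; they are not the md/bitmap code.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MAX_ERRNO 4095

/* Userspace imitation of the error-pointer convention (assumed names). */
static void *demo_err_ptr(long err) { return (void *)(intptr_t)err; }
static long demo_ptr_err(const void *p) { return (long)(intptr_t)p; }
static int demo_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_bitmap { unsigned char *pages; };
static int demo_live_bitmaps; /* counts objects not yet freed */

static void demo_bitmap_free(struct demo_bitmap *b)
{
    free(b->pages);
    free(b);
    demo_live_bitmaps--;
}

/* On failure, frees what it allocated and returns an error pointer. */
static struct demo_bitmap *demo_bitmap_create(int fail)
{
    struct demo_bitmap *b = calloc(1, sizeof(*b));

    if (!b)
        return demo_err_ptr(-ENOMEM);
    demo_live_bitmaps++;
    b->pages = fail ? NULL : calloc(1, 64);
    if (!b->pages) {
        demo_bitmap_free(b); /* cleanup happens here, exactly once */
        return demo_err_ptr(-ENOMEM);
    }
    return b;
}

static int demo_load(int fail)
{
    struct demo_bitmap *b = demo_bitmap_create(fail);

    if (demo_is_err(b))
        return (int)demo_ptr_err(b); /* no second cleanup on this path */
    demo_bitmap_free(b);
    return 0;
}
```

Calling `demo_bitmap_free` on the error path of `demo_load` would dereference an error pointer, which is the kind of mistake the commit message warns about.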
    • raid5: allow arbitrary max_hw_sectors · 1dffdddd
      Shaohua Li authored
      raid5 splits bios to the proper size internally, so there is no point in
      using the underlying disks' max_hw_sectors. In my qemu system, without this
      change, raid5 only receives bios of 128k, which reduces the chance of bios
      being merged when they are sent to the underlying disks.
      Signed-off-by: Shaohua Li <shli@fb.com>
    • lib/raid6: Add AVX512 optimized xor_syndrome functions · 694dda62
      Gayatri Kammela authored
      Optimize RAID6 xor_syndrome functions to take advantage of the 512-bit
      ZMM integer instructions introduced in AVX512.
      
      The AVX512 optimized xor_syndrome functions are simply based on sse2.c
      written by hpa.
      
      The patch was tested and benchmarked before submission on
      hardware that has the AVX512 flags to support such instructions.
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Megha Dey <megha.dey@linux.intel.com>
      Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
      Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions · 161db5d1
      Gayatri Kammela authored
      Add the avx512 gen_syndrome and recovery functions so that the code can
      be compiled and tested successfully in userspace.
      
      This patch was tested in userspace and an improvement in performance was
      observed.
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
      Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
      Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • lib/raid6: Add AVX512 optimized recovery functions · 13c520b2
      Gayatri Kammela authored
      Optimize RAID6 recovery functions to take advantage of
      the 512-bit ZMM integer instructions introduced in AVX512.
      
      The AVX512 optimized recovery functions are simply based
      on recov_avx2.c written by Jim Kukunas.
      
      This patch was tested and benchmarked before submission on
      hardware that has the AVX512 flags to support such instructions.
      
      Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
      Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
      Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • lib/raid6: Add AVX512 optimized gen_syndrome functions · e0a491c1
      Gayatri Kammela authored
      Optimize RAID6 gen_syndrome functions to take advantage of
      the 512-bit ZMM integer instructions introduced in AVX512.
      
      The AVX512 optimized gen_syndrome functions are simply based
      on avx2.c written by Yuanhan Liu and sse2.c written by hpa.
      
      The patch was tested and benchmarked before submission on
      hardware that has the AVX512 flags to support such instructions.
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
      Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
      Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: make resync lock also could be interruptted · d6385db9
      Guoqing Jiang authored
      When one node is performing resync or recovery, other nodes
      can't get the resync lock and may block for a while before
      acquiring it, so we can't stop the array immediately in
      this scenario.
      
      To allow the array to be stopped quickly, check MD_CLOSING
      in dlm_lock_sync_interruptible so that the lock request
      can be interrupted.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang · 7bcda714
      Guoqing Jiang authored
      When a node leaves the cluster, its bitmap needs to be
      synced by another node, so the "md*_recover" thread is triggered
      for that purpose. However, with the steps below, we can find
      hung tasks on either B or C.
      
      1. Node A creates a resyncing cluster raid1, which is assembled on
         two other nodes (B and C).
      2. Stop the array on B and C.
      3. Stop the array on A.
      
      linux44:~ # ps aux|grep md|grep D
      root	5938	0.0  0.1  19852  1964 pts/0    D+   14:52   0:00 mdadm -S md0
      root	5939	0.0  0.0      0     0 ?        D    14:52   0:00 [md0_recover]
      
      linux44:~ # cat /proc/5939/stack
      [<ffffffffa04cf321>] dlm_lock_sync+0x71/0x90 [md_cluster]
      [<ffffffffa04d0705>] recover_bitmaps+0x125/0x220 [md_cluster]
      [<ffffffffa052105d>] md_thread+0x16d/0x180 [md_mod]
      [<ffffffff8107ad94>] kthread+0xb4/0xc0
      [<ffffffff8152a518>] ret_from_fork+0x58/0x90
      
      linux44:~ # cat /proc/5938/stack
      [<ffffffff8107afde>] kthread_stop+0x6e/0x120
      [<ffffffffa0519da0>] md_unregister_thread+0x40/0x80 [md_mod]
      [<ffffffffa04cfd20>] leave+0x70/0x120 [md_cluster]
      [<ffffffffa0525e24>] md_cluster_stop+0x14/0x30 [md_mod]
      [<ffffffffa05269ab>] bitmap_free+0x14b/0x150 [md_mod]
      [<ffffffffa0523f3b>] do_md_stop+0x35b/0x5a0 [md_mod]
      [<ffffffffa0524e83>] md_ioctl+0x873/0x1590 [md_mod]
      [<ffffffff81288464>] blkdev_ioctl+0x214/0x7d0
      [<ffffffff811dd3dd>] block_ioctl+0x3d/0x40
      [<ffffffff811b92d4>] do_vfs_ioctl+0x2d4/0x4b0
      [<ffffffff811b9538>] SyS_ioctl+0x88/0xa0
      [<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b
      
      The problem is that recover_bitmaps can't reliably abort
      when the thread is unregistered. So dlm_lock_sync_interruptible
      is introduced to detect the thread's state and fix the problem.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: convert the completion to wait queue · fccb60a4
      Guoqing Jiang authored
      Previously, we used a completion to synchronize between requesting the
      dlm lock and sync_ast. However, we would have to expose completion.wait
      and completion.done in dlm_lock_sync_interruptible (introduced
      later), which is not a common usage of completions, so convert
      the related code to a wait queue.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: protect md_find_rdev_nr_rcu with rcu lock · 5f0aa21d
      Guoqing Jiang authored
      We need to use rcu_read_lock/unlock to avoid a potential
      race.
      Reported-by: Shaohua Li <shli@fb.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: clean related infos of cluster · c20c33f0
      Guoqing Jiang authored
      cluster_info and bitmap_info.nodes also need to be
      cleared when the array is stopped.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: changes for MD_STILL_CLOSED flag · af8d8e6f
      Guoqing Jiang authored
      When stopping a clustered raid while it is pending on resync,
      the MD_STILL_CLOSED flag could be cleared because a udev rule
      is triggered to open the mddev. The array then can't
      be stopped promptly and returns EBUSY.
      
          mdadm -Ss                   md-raid-arrays.rules
          set MD_STILL_CLOSED         md_open()
          ... ... ...                 clear MD_STILL_CLOSED
          do_md_stop
      
      We make the changes below to resolve this issue:
      
      1. Rename MD_STILL_CLOSED to MD_CLOSING, since it is set
         when stopping the array and means we are stopping the array.
      2. Let md_open return early if CLOSING is set, so no
         other thread will open the array while one thread is trying
         to close it.
      3. There is no need to clear the CLOSING bit in md_open because 1 has
         ensured the bit is cleared; then we also don't need to
         test the CLOSING bit in do_md_stop.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
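The early return in step 2 can be sketched as a single-threaded userspace model. Everything here is hypothetical (`demo_mddev`, `demo_md_open`, `demo_md_release`, `demo_begin_stop`), locking is deliberately left out, and the error codes are assumptions chosen for the sketch rather than values confirmed by the commit text.

```c
#include <errno.h>

/* Hypothetical model of an array device; real code would hold a lock. */
struct demo_mddev {
    int closing; /* plays the role of the MD_CLOSING bit */
    int openers; /* number of current openers */
};

/* Refuse new opens once a stop is in progress. */
static int demo_md_open(struct demo_mddev *m)
{
    if (m->closing)
        return -ENODEV; /* assumed error code for this sketch */
    m->openers++;
    return 0;
}

static void demo_md_release(struct demo_mddev *m)
{
    m->openers--;
}

/* Mark the array as closing, unless someone still has it open. */
static int demo_begin_stop(struct demo_mddev *m)
{
    if (m->openers > 0)
        return -EBUSY;
    m->closing = 1;
    return 0;
}
```

Because an opener that arrives after `demo_begin_stop` is turned away before it touches `openers`, nothing on the open path needs to clear the flag, which mirrors the reasoning in step 3.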
    • md-cluster: remove some unnecessary dlm_unlock_sync · e3f924d3
      Guoqing Jiang authored
      Since DLM_LKF_FORCEUNLOCK is used in lockres_free,
      we don't need to call dlm_unlock_sync before freeing
      the lock resource.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: use FORCEUNLOCK in lockres_free · 400cb454
      Guoqing Jiang authored
      For dlm_unlock, we need to pass the flag as the
      third parameter instead of setting res->flags.
      
      Also, DLM_LKF_FORCEUNLOCK is more suitable for dlm_unlock
      since it works even when the lock is on the waiting or convert queue.
      Acked-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: call md_kick_rdev_from_array once ack failed · e566aef1
      Guoqing Jiang authored
      new_disk_ack could return failure if WAITING_FOR_NEWDISK
      is not set, so we need to kick the dev from the array if a
      failure happens.
      
      We also failed to check err before calling new_disk_ack; otherwise
      we could kick an rdev which isn't in the array. Thanks to Shaohua
      for the reminder.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • Merge tag 'usercopy-v4.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 7d1e0423
      Linus Torvalds authored
      Pull usercopy hardening fix from Kees Cook:
       "Expand the arm64 vmalloc check to include skipping the module space
        too"
      
      * tag 'usercopy-v4.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        mm: usercopy: Check for module addresses
  2. 20 Sep, 2016 4 commits
    • fix fault_in_multipages_...() on architectures with no-op access_ok() · e23d4159
      Al Viro authored
      Switching iov_iter fault-in to multipages variants has exposed an old
      bug in underlying fault_in_multipages_...(); they break if the range
      passed to them wraps around.  Normally access_ok() done by callers will
      prevent such (and it's a guaranteed EFAULT - ERR_PTR() values fall into
      such a range and they should not point to any valid objects).
      
      However, on architectures where userland and kernel live in different
      MMU contexts (e.g. s390) access_ok() is a no-op and on those a range
      with a wraparound can reach fault_in_multipages_...().
      
      Since any wraparound means EFAULT there, the fix is trivial - turn
      those
      
          while (uaddr <= end)
      	    ...
      into
      
          if (unlikely(uaddr > end))
      	    return -EFAULT;
          do
      	    ...
          while (uaddr <= end);
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Tested-by: Jan Stancek <jstancek@redhat.com>
      Cc: stable@vger.kernel.org # v3.5+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
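The effect of the loop change described above can be shown with a small userspace model. It only counts the pages that would be touched and never dereferences the addresses; `demo_fault_in_old`, `demo_fault_in_fixed` and `DEMO_PAGE_SIZE` are hypothetical names, and the extra guard against stepping past the top of the address space is an addition made for this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE ((uintptr_t)4096)

/* Old shape: a wrapped range skips the loop and reports success. */
static int demo_fault_in_old(uintptr_t uaddr, size_t size, unsigned *touched)
{
    uintptr_t end = uaddr + (uintptr_t)size - 1; /* may wrap around */

    *touched = 0;
    if (size == 0)
        return 0;
    while (uaddr <= end) {
        uintptr_t prev = uaddr;

        ++*touched;
        uaddr += DEMO_PAGE_SIZE;
        if (uaddr < prev) /* sketch-only guard: stepped past the top */
            break;
    }
    return 0;
}

/* Fixed shape: reject the wrapped range first, then do/while. */
static int demo_fault_in_fixed(uintptr_t uaddr, size_t size, unsigned *touched)
{
    uintptr_t end = uaddr + (uintptr_t)size - 1; /* may wrap around */

    *touched = 0;
    if (size == 0)
        return 0;
    if (uaddr > end)
        return -EFAULT;
    do {
        uintptr_t prev = uaddr;

        ++*touched;
        uaddr += DEMO_PAGE_SIZE;
        if (uaddr < prev) /* sketch-only guard: stepped past the top */
            break;
    } while (uaddr <= end);
    return 0;
}
```

For a range that starts 16 bytes below the top of the address space and is 64 bytes long, the old shape touches nothing and still returns 0, while the fixed shape returns -EFAULT.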
    • mm: usercopy: Check for module addresses · aa4f0601
      Laura Abbott authored
      While running a compile on arm64, I hit a memory exposure:
      
      usercopy: kernel memory exposure attempt detected from fffffc0000f3b1a8 (buffer_head) (1 bytes)
      ------------[ cut here ]------------
      kernel BUG at mm/usercopy.c:75!
      Internal error: Oops - BUG: 0 [#1] SMP
      Modules linked in: ip6t_rpfilter ip6t_REJECT
      nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp
      llc ebtable_nat ip6table_security ip6table_raw ip6table_nat
      nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
      iptable_security iptable_raw iptable_nat nf_conntrack_ipv4
      nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
      ebtable_filter ebtables ip6table_filter ip6_tables vfat fat xgene_edac
      xgene_enet edac_core i2c_xgene_slimpro i2c_core at803x realtek xgene_dma
      mdio_xgene gpio_dwapb gpio_xgene_sb xgene_rng mailbox_xgene_slimpro nfsd
      auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c sdhci_of_arasan
      sdhci_pltfm sdhci mmc_core xhci_plat_hcd gpio_keys
      CPU: 0 PID: 19744 Comm: updatedb Tainted: G        W 4.8.0-rc3-threadinfo+ #1
      Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene Mustang Board, BIOS 3.06.12 Aug 12 2016
      task: fffffe03df944c00 task.stack: fffffe00d128c000
      PC is at __check_object_size+0x70/0x3f0
      LR is at __check_object_size+0x70/0x3f0
      ...
      [<fffffc00082b4280>] __check_object_size+0x70/0x3f0
      [<fffffc00082cdc30>] filldir64+0x158/0x1a0
      [<fffffc0000f327e8>] __fat_readdir+0x4a0/0x558 [fat]
      [<fffffc0000f328d4>] fat_readdir+0x34/0x40 [fat]
      [<fffffc00082cd8f8>] iterate_dir+0x190/0x1e0
      [<fffffc00082cde58>] SyS_getdents64+0x88/0x120
      [<fffffc0008082c70>] el0_svc_naked+0x24/0x28
      
      fffffc0000f3b1a8 is a module address. Modules may have compiled in
      strings which could get copied to userspace. In this instance, it
      looks like "." which matches with a size of 1 byte. Extend the
      is_vmalloc_addr check to be is_vmalloc_or_module_addr to cover
      all possible cases.
      Signed-off-by: Laura Abbott <labbott@redhat.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • fs/proc/kcore.c: Add bounce buffer for ktext data · df04abfd
      Jiri Olsa authored
      We hit the hardened usercopy check for kernel text access when reading
      the kcore file:
      
        usercopy: kernel memory exposure attempt detected from ffffffff8179a01f (<kernel text>) (4065 bytes)
        kernel BUG at mm/usercopy.c:75!
      
      Bypass this check for kcore by adding a bounce buffer for ktext data.
      Reported-by: Steve Best <sbest@redhat.com>
      Fixes: f5509cc1 ("mm: Hardened usercopy")
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/kcore.c: Make bounce buffer global for read · f5beeb18
      Jiri Olsa authored
      The next patch adds a bounce buffer for the ktext area, so it's
      convenient to have a single bounce buffer for both the
      vmalloc/module and ktext cases.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 19 Sep, 2016 18 commits