1. 11 Aug, 2017 13 commits
    • Dima Zavin's avatar
      cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() · 45a636ec
      Dima Zavin authored
      commit 89affbf5 upstream.
      
      In codepaths that use the begin/retry interface for reading
      mems_allowed_seq with irqs disabled, there exists a race condition that
      stalls the patch process after only modifying a subset of the
      static_branch call sites.
      
      This problem manifested itself as a deadlock in the slub allocator,
      inside get_any_partial.  The loop reads mems_allowed_seq value (via
      read_mems_allowed_begin), performs the defrag operation, and then
      verifies the consistency of mem_allowed via the read_mems_allowed_retry
      and the cookie returned by xxx_begin.
      
      The issue here is that both begin and retry first check if cpusets are
      enabled via cpusets_enabled() static branch.  This branch can be
      rewritted dynamically (via cpuset_inc) if a new cpuset is created.  The
      x86 jump label code fully synchronizes across all CPUs for every entry
      it rewrites.  If it rewrites only one of the callsites (specifically the
      one in read_mems_allowed_retry) and then waits for the
      smp_call_function(do_sync_core) to complete while a CPU is inside the
      begin/retry section with IRQs off and the mems_allowed value is changed,
      we can hang.
      
      This is because begin() will always return 0 (since it wasn't patched
      yet) while retry() will test the 0 against the actual value of the seq
      counter.
      
      The fix is to use two different static keys: one for begin
      (pre_enable_key) and one for retry (enable_key).  In cpuset_inc(), we
      first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
      always return a valid seqcount if are enabling cpusets.  Similarly, when
      disabling cpusets via cpuset_dec(), we first ensure that callers of
      cpuset_mems_allowed_retry() will start ignoring the seqcount value
      before we let cpuset_mems_allowed_begin() return 0.
      
      The relevant stack traces of the two stuck threads:
      
        CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
        RIP: smp_call_function_many+0x1f9/0x260
        Call Trace:
          smp_call_function+0x3b/0x70
          on_each_cpu+0x2f/0x90
          text_poke_bp+0x87/0xd0
          arch_jump_label_transform+0x93/0x100
          __jump_label_update+0x77/0x90
          jump_label_update+0xaa/0xc0
          static_key_slow_inc+0x9e/0xb0
          cpuset_css_online+0x70/0x2e0
          online_css+0x2c/0xa0
          cgroup_apply_control_enable+0x27f/0x3d0
          cgroup_mkdir+0x2b7/0x420
          kernfs_iop_mkdir+0x5a/0x80
          vfs_mkdir+0xf6/0x1a0
          SyS_mkdir+0xb7/0xe0
          entry_SYSCALL_64_fastpath+0x18/0xad
      
        ...
      
        CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8818087c0000 task.stack: ffffc90000030000
        RIP: int3+0x39/0x70
        Call Trace:
          <#DB> ? ___slab_alloc+0x28b/0x5a0
          <EOE> ? copy_process.part.40+0xf7/0x1de0
          __slab_alloc.isra.80+0x54/0x90
          copy_process.part.40+0xf7/0x1de0
          copy_process.part.40+0xf7/0x1de0
          kmem_cache_alloc_node+0x8a/0x280
          copy_process.part.40+0xf7/0x1de0
          _do_fork+0xe7/0x6c0
          _raw_spin_unlock_irq+0x2d/0x60
          trace_hardirqs_on_caller+0x136/0x1d0
          entry_SYSCALL_64_fastpath+0x5/0xad
          do_syscall_64+0x27/0x350
          SyS_clone+0x19/0x20
          do_syscall_64+0x60/0x350
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
      Fixes: 46e700ab ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
      Signed-off-by: default avatarDima Zavin <dmitriyz@waymo.com>
      Reported-by: default avatarCliff Spradlin <cspradlin@waymo.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      45a636ec
    • Mel Gorman's avatar
      mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries · 5a1eef71
      Mel Gorman authored
      commit 3ea27719 upstream.
      
      Nadav Amit identified a theoritical race between page reclaim and
      mprotect due to TLB flushes being batched outside of the PTL being held.
      
      He described the race as follows:
      
              CPU0                            CPU1
              ----                            ----
                                              user accesses memory using RW PTE
                                              [PTE now cached in TLB]
              try_to_unmap_one()
              ==> ptep_get_and_clear()
              ==> set_tlb_ubc_flush_pending()
                                              mprotect(addr, PROT_READ)
                                              ==> change_pte_range()
                                              ==> [ PTE non-present - no flush ]
      
                                              user writes using cached RW PTE
              ...
      
              try_to_unmap_flush()
      
      The same type of race exists for reads when protecting for PROT_NONE and
      also exists for operations that can leave an old TLB entry behind such
      as munmap, mremap and madvise.
      
      For some operations like mprotect, it's not necessarily a data integrity
      issue but it is a correctness issue as there is a window where an
      mprotect that limits access still allows access.  For munmap, it's
      potentially a data integrity issue although the race is massive as an
      munmap, mmap and return to userspace must all complete between the
      window when reclaim drops the PTL and flushes the TLB.  However, it's
      theoritically possible so handle this issue by flushing the mm if
      reclaim is potentially currently batching TLB flushes.
      
      Other instances where a flush is required for a present pte should be ok
      as either the page lock is held preventing parallel reclaim or a page
      reference count is elevated preventing a parallel free leading to
      corruption.  In the case of page_mkclean there isn't an obvious path
      that userspace could take advantage of without using the operations that
      are guarded by this patch.  Other users such as gup as a race with
      reclaim looks just at PTEs.  huge page variants should be ok as they
      don't race with reclaim.  mincore only looks at PTEs.  userfault also
      should be ok as if a parallel reclaim takes place, it will either fault
      the page back in or read some of the data before the flush occurs
      triggering a fault.
      
      Note that a variant of this patch was acked by Andy Lutomirski but this
      was for the x86 parts on top of his PCID work which didn't make the 4.13
      merge window as expected.  His ack is dropped from this version and
      there will be a follow-on patch on top of PCID that will include his
      ack.
      
      [akpm@linux-foundation.org: tweak comments]
      [akpm@linux-foundation.org: fix spello]
      Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.deReported-by: default avatarNadav Amit <nadav.amit@gmail.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5a1eef71
    • Guenter Roeck's avatar
      mmc: core: Fix access to HS400-ES devices · 943281eb
      Guenter Roeck authored
      commit 773dc118 upstream.
      
      HS400-ES devices fail to initialize with the following error messages.
      
      mmc1: power class selection to bus width 8 ddr 0 failed
      mmc1: error -110 whilst initialising MMC card
      
      This was seen on Samsung Chromebook Plus. Code analysis points to
      commit 3d4ef329 ("mmc: core: fix multi-bit bus width without
      high-speed mode"), which attempts to set the bus width for all but
      HS200 devices unconditionally. However, for HS400-ES, the bus width
      is already selected.
      
      Cc: Anssi Hannula <anssi.hannula@bitwise.fi>
      Cc: Douglas Anderson <dianders@chromium.org>
      Cc: Brian Norris <briannorris@chromium.org>
      Fixes: 3d4ef329 ("mmc: core: fix multi-bit bus width ...")
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Reviewed-by: default avatarDouglas Anderson <dianders@chromium.org>
      Reviewed-by: default avatarShawn Lin <shawn.lin@rock-chip.com>
      Tested-by: default avatarHeiko Stuebner <heiko@sntech.de>
      Signed-off-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      943281eb
    • Sakari Ailus's avatar
      device property: Make dev_fwnode() public · 1f32e67a
      Sakari Ailus authored
      commit e44bb0cb upstream.
      
      The function to obtain a fwnode related to a struct device is useful for
      drivers that use the fwnode property API: it allows not being aware of the
      underlying firmware implementation.
      Signed-off-by: default avatarSakari Ailus <sakari.ailus@linux.intel.com>
      Reviewed-by: default avatarMika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1f32e67a
    • Ludovic Desroches's avatar
      mmc: sdhci-of-at91: force card detect value for non removable devices · 347be00b
      Ludovic Desroches authored
      commit 7a1e3f14 upstream.
      
      When the device is non removable, the card detect signal is often used
      for another purpose i.e. muxed to another SoC peripheral or used as a
      GPIO. It could lead to wrong behaviors depending the default value of
      this signal if not muxed to the SDHCI controller.
      
      Fixes: bb5f8ea4 ("mmc: sdhci-of-at91: introduce driver for the Atmel SDMMC")
      Signed-off-by: default avatarLudovic Desroches <ludovic.desroches@microchip.com>
      Acked-by: default avatarAdrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      347be00b
    • Trond Myklebust's avatar
      NFSv4: Fix EXCHANGE_ID corrupt verifier issue · f7d3e54f
      Trond Myklebust authored
      commit fd40559c upstream.
      
      The verifier is allocated on the stack, but the EXCHANGE_ID RPC call was
      changed to be asynchronous by commit 8d89bd70. If we interrrupt
      the call to rpc_wait_for_completion_task(), we can therefore end up
      transmitting random stack contents in lieu of the verifier.
      
      Fixes: 8d89bd70 ("NFS setup async exchange_id")
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f7d3e54f
    • Arend Van Spriel's avatar
      brcmfmac: fix memleak due to calling brcmf_sdiod_sgtable_alloc() twice · f5214eb4
      Arend Van Spriel authored
      commit 5f5d0314 upstream.
      
      Due to a bugfix in wireless tree and the commit mentioned below a merge
      was needed which went haywire. So the submitted change resulted in the
      function brcmf_sdiod_sgtable_alloc() being called twice during the probe
      thus leaking the memory of the first call.
      
      Fixes: 4d792895 ("brcmfmac: switch to new platform data")
      Reported-by: default avatarStefan Wahren <stefan.wahren@i2se.com>
      Tested-by: default avatarStefan Wahren <stefan.wahren@i2se.com>
      Reviewed-by: default avatarHante Meuleman <hante.meuleman@broadcom.com>
      Signed-off-by: default avatarArend van Spriel <arend.vanspriel@broadcom.com>
      Signed-off-by: default avatarKalle Valo <kvalo@codeaurora.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f5214eb4
    • Emmanuel Grumbach's avatar
      iwlwifi: dvm: prevent an out of bounds access · de8c3329
      Emmanuel Grumbach authored
      commit 0b0f934e upstream.
      
      iwlagn_check_ratid_empty takes the tid as a parameter, but
      it doesn't check that it is not IWL_TID_NON_QOS.
      Since IWL_TID_NON_QOS = 8 and iwl_priv::tid_data is an array
      with 8 entries, accessing iwl_priv::tid_data[IWL_TID_NON_QOS]
      is a bad idea.
      This happened in iwlagn_rx_reply_tx. Since
      iwlagn_check_ratid_empty is relevant only to check whether
      we can open A-MPDU, this flow is irrelevant if tid is
      IWL_TID_NON_QOS. Call iwlagn_check_ratid_empty only inside
      the
      	if (tid != IWL_TID_NON_QOS)
      
      a few lines earlier in the function.
      Reported-by: default avatarSeraphime Kirkovski <kirkseraph@gmail.com>
      Tested-by: default avatarSeraphime Kirkovski <kirkseraph@gmail.com>
      Signed-off-by: default avatarEmmanuel Grumbach <emmanuel.grumbach@intel.com>
      Signed-off-by: default avatarLuca Coelho <luciano.coelho@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      de8c3329
    • Tejun Heo's avatar
      workqueue: restore WQ_UNBOUND/max_active==1 to be ordered · 61a0adbf
      Tejun Heo authored
      commit 5c0338c6 upstream.
      
      The combination of WQ_UNBOUND and max_active == 1 used to imply
      ordered execution.  After NUMA affinity 4c16bd32 ("workqueue:
      implement NUMA affinity for unbound workqueues"), this is no longer
      true due to per-node worker pools.
      
      While the right way to create an ordered workqueue is
      alloc_ordered_workqueue(), the documentation has been misleading for a
      long time and people do use WQ_UNBOUND and max_active == 1 for ordered
      workqueues which can lead to subtle bugs which are very difficult to
      trigger.
      
      It's unlikely that we'd see noticeable performance impact by enforcing
      ordering on WQ_UNBOUND / max_active == 1 workqueues.  Let's
      automatically set __WQ_ORDERED for those workqueues.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarChristoph Hellwig <hch@infradead.org>
      Reported-by: default avatarAlexei Potashnik <alexei@purestorage.com>
      Fixes: 4c16bd32 ("workqueue: implement NUMA affinity for unbound workqueues")
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      61a0adbf
    • Dan Carpenter's avatar
      libata: array underflow in ata_find_dev() · 804b1a9f
      Dan Carpenter authored
      commit 59a5e266 upstream.
      
      My static checker complains that "devno" can be negative, meaning that
      we read before the start of the loop.  I've looked at the code, and I
      think the warning is right.  This come from /proc so it's root only or
      it would be quite a quite a serious bug.  The call tree looks like this:
      
      proc_scsi_write() <- gets id and channel from simple_strtoul()
      -> scsi_add_single_device() <- calls shost->transportt->user_scan()
         -> ata_scsi_user_scan()
            -> ata_find_dev()
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      804b1a9f
    • Tejun Heo's avatar
      cgroup: fix error return value from cgroup_subtree_control() · 445ee6cd
      Tejun Heo authored
      commit 3c745417 upstream.
      
      While refactoring, f7b2814b ("cgroup: factor out
      cgroup_{apply|finalize}_control() from
      cgroup_subtree_control_write()") broke error return value from the
      function.  The return value from the last operation is always
      overridden to zero.  Fix it.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      445ee6cd
    • Tejun Heo's avatar
      cgroup: create dfl_root files on subsys registration · 4a99eac8
      Tejun Heo authored
      commit 7af608e4 upstream.
      
      On subsystem registration, css_populate_dir() is not called on the new
      root css, so the interface files for the subsystem on cgrp_dfl_root
      aren't created on registration.  This is a residue from the days when
      cgrp_dfl_root was used only as the parking spot for unused subsystems,
      which no longer is true as it's used as the root for cgroup2.
      
      This is often fine as later operations tend to create them as a part
      of mount (cgroup1) or subtree_control operations (cgroup2); however,
      it's not difficult to mount cgroup2 with the controller interface
      files missing as Waiman found out.
      
      Fix it by invoking css_populate_dir() on the root css on subsys
      registration.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarWaiman Long <longman@redhat.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4a99eac8
    • John David Anglin's avatar
      parisc: Handle vma's whose context is not current in flush_cache_range · 5d23e4f3
      John David Anglin authored
      commit 13d57093 upstream.
      
      In testing James' patch to drivers/parisc/pdc_stable.c, I hit the BUG
      statement in flush_cache_range() during a system shutdown:
      
      kernel BUG at arch/parisc/kernel/cache.c:595!
      CPU: 2 PID: 6532 Comm: kworker/2:0 Not tainted 4.13.0-rc2+ #1
      Workqueue: events free_ioctx
      
       IAOQ[0]: flush_cache_range+0x144/0x148
       IAOQ[1]: flush_cache_page+0x0/0x1a8
       RP(r2): flush_cache_range+0xec/0x148
      Backtrace:
       [<00000000402910ac>] unmap_page_range+0x84/0x880
       [<00000000402918f4>] unmap_single_vma+0x4c/0x60
       [<0000000040291a18>] zap_page_range_single+0x110/0x160
       [<0000000040291c34>] unmap_mapping_range+0x174/0x1a8
       [<000000004026ccd8>] truncate_pagecache+0x50/0xa8
       [<000000004026cd84>] truncate_setsize+0x54/0x70
       [<000000004033d534>] put_aio_ring_file+0x44/0xb0
       [<000000004033d5d8>] aio_free_ring+0x38/0x140
       [<000000004033d714>] free_ioctx+0x34/0xa8
       [<00000000401b0028>] process_one_work+0x1b8/0x4d0
       [<00000000401b04f4>] worker_thread+0x1b4/0x648
       [<00000000401b9128>] kthread+0x1b0/0x208
       [<0000000040150020>] end_fault_vector+0x20/0x28
       [<0000000040639518>] nf_ip_reroute+0x50/0xa8
       [<0000000040638ed0>] nf_ip_route+0x10/0x78
       [<0000000040638c90>] xfrm4_mode_tunnel_input+0x180/0x1f8
      
      CPU: 2 PID: 6532 Comm: kworker/2:0 Not tainted 4.13.0-rc2+ #1
      Workqueue: events free_ioctx
      Backtrace:
       [<0000000040163bf0>] show_stack+0x20/0x38
       [<0000000040688480>] dump_stack+0xa8/0x120
       [<0000000040163dc4>] die_if_kernel+0x19c/0x2b0
       [<0000000040164d0c>] handle_interruption+0xa24/0xa48
      
      This patch modifies flush_cache_range() to handle non current contexts.
      In as much as this occurs infrequently, the simplest approach is to
      flush the entire cache when this happens.
      Signed-off-by: default avatarJohn David Anglin <dave.anglin@bell.net>
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5d23e4f3
  2. 07 Aug, 2017 27 commits