1. 04 Aug, 2017 14 commits
  2. 03 Aug, 2017 19 commits
    • Merge tag 'vfio-v4.13-rc4' of git://github.com/awilliam/linux-vfio · 869c058f
      Linus Torvalds authored
      Pull VFIO fixes from Alex Williamson:
      
       - SPAPR/EEH config build fix (Murilo Opsfelder Araujo)
      
       - Fix possible device lock deadlock (Alex Williamson)
      
       - Correctly size integrated endpoint PCIe capabilities (Alex
         Williamson)
      
      * tag 'vfio-v4.13-rc4' of git://github.com/awilliam/linux-vfio:
        vfio/pci: Fix handling of RC integrated endpoint PCIe capability size
        vfio/pci: Use pci_try_reset_function() on initial open
        include/linux/vfio.h: Guard powerpc-specific functions with CONFIG_VFIO_SPAPR_EEH
    • Merge branch 'akpm' (patches from Andrew) · 995d03ae
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "15 fixes"
      
      [ This does not merge the "fortify: use WARN instead of BUG for now"
        patch, which needs a bit of extra work to build cleanly with all
        configurations. Arnd is on it.   - Linus ]
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        ocfs2: don't clear SGID when inheriting ACLs
        mm: allow page_cache_get_speculative in interrupt context
        userfaultfd: non-cooperative: flush event_wqh at release time
        ipc: add missing container_of()s for randstruct
        cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
        userfaultfd_zeropage: return -ENOSPC in case mm has gone
        mm: take memory hotplug lock within numa_zonelist_order_handler()
        mm/page_io.c: fix oops during block io poll in swapin path
        zram: do not free pool->size_class
        kthread: fix documentation build warning
        kasan: avoid -Wmaybe-uninitialized warning
        userfaultfd: non-cooperative: notify about unmap of destination during mremap
        mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
        pid: kill pidhash_size in pidhash_init()
        mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors
    • Merge tag 'acpi-4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 8d3fe85f
      Linus Torvalds authored
      Pull ACPI fixes from Rafael Wysocki:
       "These fix two issues in the ACPI SoC drivers (Intel LPSS and AMD APD),
        a crash in the PCC mailbox initialization code and a WDAT watchdog
        initialization failure.
      
        Specifics:
      
         - Fix a device ID of Hisilicon Hip07/08 in the ACPI APD (AMD SoC)
           driver (Hanjun Guo).
      
         - Fix list corruption (introduced during the 4.11 cycle) in the ACPI
           LPSS (Intel SoC) driver (Hans de Goede).
      
         - Fix PCC mailbox handling code crash during initialization when PCCT
           is not present and PCC channel 0 is requested (Hoan Tran).
      
         - Fix a WDAT watchdog initialization issue causing platform device
           creation to fail due to partially overlapping address ranges in
           resources (Ryan Kennedy)"
      
      * tag 'acpi-4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI: APD: Fix HID for Hisilicon Hip07/08
        mailbox: pcc: Fix crash when request PCC channel 0
        ACPI / watchdog: Fix init failure with overlapping register regions
        ACPI / LPSS: Only call pwm_add_table() for the first PWM controller
    • Merge tag 'pm-4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 73784fb7
      Linus Torvalds authored
      Pull power management fixes from Rafael Wysocki:
       "These fix two cpufreq issues, one introduced recently and one related
        to recent changes, fix cpufreq documentation, fix up recently added
        code in the Thunderbolt driver and update runtime PM framework
        documentation.
      
        Specifics:
      
         - Fix the handling of the scaling_cur_freq cpufreq policy attribute
           on x86 systems with the MPERF/APERF registers present to make it
           behave more as expected after recent changes (Rafael Wysocki).
      
         - Drop a leftover callback from the intel_pstate driver which also
           prevents the cpuinfo_cur_freq cpufreq policy attribute from being
           incorrectly exposed when intel_pstate works in the active mode
           (Rafael Wysocki).
      
         - Add a missing piece describing the cpuinfo_cur_freq policy
           attribute to cpufreq documentation (Rafael Wysocki).
      
         - Fix up a recently added part of the Thunderbolt driver to avoid
           aborting system suspends if its mailbox commands time out (Rafael
           Wysocki).
      
         - Update device runtime PM framework documentation to reflect the
           current behavior of the code (Johan Hovold)"
      
      * tag 'pm-4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        thunderbolt: icm: Ignore mailbox errors in icm_suspend()
        cpufreq: x86: Make scaling_cur_freq behave more as expected
        PM / runtime: Document new pm_runtime_set_suspended() constraint
        cpufreq: docs: Add missing cpuinfo_cur_freq description
        cpufreq: intel_pstate: Drop ->get from intel_pstate structure
    • Merge branches 'acpi-soc', 'acpi-wdat' and 'acpi-cppc' · 3de559d4
      Rafael J. Wysocki authored
      * acpi-soc:
        ACPI: APD: Fix HID for Hisilicon Hip07/08
        ACPI / LPSS: Only call pwm_add_table() for the first PWM controller
      
      * acpi-wdat:
        ACPI / watchdog: Fix init failure with overlapping register regions
      
      * acpi-cppc:
        mailbox: pcc: Fix crash when request PCC channel 0
    • Merge branches 'pm-core' and 'pm-misc' · 78aa904a
      Rafael J. Wysocki authored
      * pm-core:
        PM / runtime: Document new pm_runtime_set_suspended() constraint
      
      * pm-misc:
        thunderbolt: icm: Ignore mailbox errors in icm_suspend()
    • Merge branches 'pm-cpufreq-x86', 'pm-cpufreq-docs' and 'intel_pstate' · 8a05c311
      Rafael J. Wysocki authored
      * pm-cpufreq-x86:
        cpufreq: x86: Make scaling_cur_freq behave more as expected
      
      * pm-cpufreq-docs:
        cpufreq: docs: Add missing cpuinfo_cur_freq description
      
      * intel_pstate:
        cpufreq: intel_pstate: Drop ->get from intel_pstate structure
    • mmc: block: bypass the queue even if usage is present for hotplug · 7c84b8b4
      Shawn Lin authored
      Commit 304419d8 ("mmc: core: Allocate per-request data using the
      block layer core") refactored the queue-handling mechanism so that
      mmc_init_request() can be called just after mmc_cleanup_queue(),
      causing a NULL pointer dereference.

      Commit bbdc74dc ("mmc: block: Prevent new req entering queue
      after its cleanup") tried to fix the problem, but it misses one
      corner case.

      The issue can still be reproduced with these steps:
      (1) insert an SD card and mount it
      (2) hot-unplug it, which leaves md->usage still counted
      (3) reboot the system, which syncs data and unmounts the card
      
      Unable to handle kernel NULL pointer dereference at virtual address 00000000
      user pgtable: 4k pages, 48-bit VAs, pgd = ffff80007bab3000
      [0000000000000000] *pgd=000000007a828003, *pud=0000000078dce003, *pmd=000000007aab6003, *pte=0000000000000000
      Internal error: Oops: 96000007 [#1] PREEMPT SMP
      Modules linked in:
      CPU: 3 PID: 3507 Comm: umount Tainted: G        W       4.13.0-rc1-next-20170720-00012-g9d9bf45 #33
      Hardware name: Firefly-RK3399 Board (DT)
      task: ffff80007a1de200 task.stack: ffff80007a01c000
      PC is at mmc_init_request+0x14/0xc4
      LR is at alloc_request_size+0x4c/0x74
      pc : [<ffff0000087d7150>] lr : [<ffff000008378fe0>] pstate: 600001c5
      sp : ffff80007a01f8f0

      ....

      [<ffff0000087d7150>] mmc_init_request+0x14/0xc4
      [<ffff000008378fe0>] alloc_request_size+0x4c/0x74
      [<ffff00000817ac28>] mempool_create_node+0xb8/0x17c
      [<ffff00000837aadc>] blk_init_rl+0x9c/0x120
      [<ffff000008396580>] blkg_alloc+0x110/0x234
      [<ffff000008396ac8>] blkg_create+0x424/0x468
      [<ffff00000839877c>] blkg_lookup_create+0xd8/0x14c
      [<ffff0000083796bc>] generic_make_request_checks+0x368/0x3b0
      [<ffff00000837b050>] generic_make_request+0x1c/0x240
      
      Because md->usage is still counted, mmc_blk_put() does not call
      blk_cleanup_queue(), which is where QUEUE_FLAG_DYING and QUEUE_FLAG_BYPASS
      would be set. The block core expects blk_queue_bypass_{start,end}() to be
      used internally to bypass and drain the queue before marking it dying, so
      it does not expose an API for setting queue bypass. We should therefore
      set QUEUE_FLAG_BYPASS whenever the queue is removed, even though md->usage
      is still counted, since no dispatch queue can be found at that point.
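      A minimal userspace model of the reasoning above. The flag bits are named after QUEUE_FLAG_BYPASS and QUEUE_FLAG_DYING but their values are illustrative, and `struct model_queue`, `model_remove()` and `model_lookup_create()` are hypothetical. The lookup is assumed to mirror, in simplified form, the bypass check that blkg_lookup_create() performs before creating a blkg, which is the path in the trace above that reaches mmc_init_request().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag bits, named after QUEUE_FLAG_BYPASS/QUEUE_FLAG_DYING. */
enum { Q_BYPASS = 1u << 0, Q_DYING = 1u << 1 };

struct model_queue {
    unsigned flags;
    int usage;        /* stands in for md->usage */
    bool pdu_valid;   /* is the per-request data still set up? */
};

/* Card removal: mark the queue bypassed and dying even while usage is
 * still held, because the per-request data is being torn down. */
static void model_remove(struct model_queue *q)
{
    q->flags |= Q_BYPASS | Q_DYING;
    q->pdu_valid = false;
}

/* Simplified bypass check done before any request (and therefore any
 * mmc_init_request()-style setup) would be allocated. */
static int model_lookup_create(const struct model_queue *q)
{
    if (q->flags & Q_BYPASS)
        return (q->flags & Q_DYING) ? -ENODEV : -EBUSY;
    /* Only past this point would per-request data be touched;
     * -EFAULT stands in for the NULL dereference in the oops. */
    return q->pdu_valid ? 0 : -EFAULT;
}
```

      With the flags set at removal time, a late I/O is refused with -ENODEV instead of reaching the torn-down per-request data, regardless of md->usage.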
      
      Fixes: 304419d8 ("mmc: core: Allocate per-request data using the block layer core")
      Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
    • mmc: sdhci-of-at91: force card detect value for non removable devices · 7a1e3f14
      Ludovic Desroches authored
      When the device is non-removable, the card detect signal is often used
      for another purpose, i.e. muxed to another SoC peripheral or used as a
      GPIO. Depending on the default value of this signal, this can lead to
      wrong behavior if the signal is not muxed to the SDHCI controller.
      
      Fixes: bb5f8ea4 ("mmc: sdhci-of-at91: introduce driver for the Atmel SDMMC")
      Signed-off-by: Ludovic Desroches <ludovic.desroches@microchip.com>
      Acked-by: Adrian Hunter <adrian.hunter@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
    • Merge tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs · 19ec50a4
      Linus Torvalds authored
      Pull NFS client fixes from Anna Schumaker:
       "Two fixes from Trond this time, now that he's back from his vacation.
        The first is a stable fix for the EXCHANGE_ID issue on the mailing
        list, and the other fixes a double-free situation that he found at the
        same time.
      
        Stable fix:
         - Fix EXCHANGE_ID corrupt verifier issue
      
        Other fix:
         - Fix double frees in nfs4_test_session_trunk()"
      
      * tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
        NFSv4: Fix double frees in nfs4_test_session_trunk()
        NFSv4: Fix EXCHANGE_ID corrupt verifier issue
    • isdn/i4l: fix buffer overflow · 9f5af546
      Annie Cherkaev authored
      This fixes a potential buffer overflow in isdn_net.c caused by an
      unbounded strcpy.
      
      [ ISDN seems to be effectively unmaintained, and the I4L driver in
        particular is long deprecated, but in case somebody uses this..
          - Linus ]
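      The commit text does not show the patch itself, so the following is only a generic illustration of replacing an unbounded strcpy() with a bounded copy. `bounded_copy()` is a hypothetical userspace helper in the spirit of the kernel's strscpy() (which returns -E2BIG on truncation); here truncation is reported as -1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy at most dst_size - 1 bytes from src and always NUL-terminate.
 * Returns the copied length, or -1 if src did not fit and was truncated. */
static long bounded_copy(char *dst, const char *src, size_t dst_size)
{
    size_t len;

    if (dst_size == 0)
        return -1;
    /* Measure src without reading past dst_size bytes. */
    for (len = 0; len < dst_size && src[len] != '\0'; len++)
        ;
    if (len == dst_size) {              /* no room for the terminator */
        memcpy(dst, src, dst_size - 1);
        dst[dst_size - 1] = '\0';
        return -1;
    }
    memcpy(dst, src, len + 1);          /* includes the terminator */
    return (long)len;
}
```

      Unlike strcpy(), the destination size bounds every write, so an over-long source string is truncated instead of overflowing the buffer.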
      Signed-off-by: Jiten Thakkar <jitenmt@gmail.com>
      Signed-off-by: Annie Cherkaev <annie.cherk@gmail.com>
      Cc: Karsten Keil <isdn@linux-pingi.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: don't clear SGID when inheriting ACLs · 19ec8e48
      Jan Kara authored
      When a new directory 'DIR1' is created in a directory 'DIR0' that has the
      SGID bit set, DIR1 is expected to have the SGID bit set (and an owning
      group equal to the owning group of 'DIR0').  However, when 'DIR0' also
      has default ACLs that 'DIR1' inherits, setting these ACLs causes the SGID
      bit on 'DIR1' to be cleared if the user is not a member of the owning group.
      
      Fix the problem by moving posix_acl_update_mode() out of ocfs2_set_acl()
      into ocfs2_iop_set_acl().  That way the function is not called when
      inheriting ACLs, which is what we want: it prevents the SGID bit from
      being cleared, and the mode has already been set properly by
      posix_acl_create() anyway.  Also, posix_acl_chmod(), which calls
      ocfs2_set_acl(), takes care of updating the mode itself.
      
      Fixes: 07393101 ("posix_acl: Clear SGID bit when setting file permissions")
      Link: http://lkml.kernel.org/r/20170801141252.19675-3-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: allow page_cache_get_speculative in interrupt context · 1ee1c3f5
      Kan Liang authored
      The kernel panics when the IRQ-safe __get_user_pages_fast() is called
      from an NMI handler.

      The bug was introduced by commit 2947ba05 ("x86/mm/gup: Switch GUP
      to the generic get_user_page_fast() implementation").

      The original x86 __get_user_pages_fast() used plain get_page() or
      page_ref_add().  However, the generic __get_user_pages_fast() uses
      page_cache_get_speculative(), which has VM_BUG_ON(in_interrupt()).

      There is no reason to prevent page_cache_get_speculative() from being
      used in interrupt context.  According to its author, the BUG_ON was put
      there only because the code had not been verified to be correct against
      interrupt races.  I ran some tests in interrupt context and found no
      issues.

      Remove VM_BUG_ON(in_interrupt()) from page_cache_get_speculative().
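      The core of a speculative page reference is "take a reference only if the count is not already zero", as in get_page_unless_zero(). The sketch below shows that idea with C11 atomics in userspace; `get_ref_unless_zero()` is a hypothetical name, and this is not the kernel code. The loop neither sleeps nor takes a lock, which is consistent with the argument above that nothing in the technique itself requires process context.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Increment *count unless it is zero. Returns true if a reference was
 * taken, false if the object is already on its way to being freed. */
static bool get_ref_unless_zero(atomic_int *count)
{
    int old = atomic_load(count);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(count, &old, old + 1))
            return true;
    }
    return false;   /* caller must back off and retry the lookup */
}
```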
      
      Link: http://lkml.kernel.org/r/1501609146-59730-1-git-send-email-kan.liang@intel.com
      Fixes: 2947ba05 ("x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation")
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: non-cooperative: flush event_wqh at release time · 5a18b64e
      Mike Rapoport authored
      There may still be threads waiting on event_wqh at the time the
      userfault file descriptor is closed.  Flush the events wait-queue to
      prevent waiting threads from hanging.
      
      Link: http://lkml.kernel.org/r/1501398127-30419-1-git-send-email-rppt@linux.vnet.ibm.com
      Fixes: 9cd75c3c ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc: add missing container_of()s for randstruct · ade9f91b
      Kees Cook authored
      When building with the randstruct gcc plugin, the layout of the IPC
      structs will be randomized, which requires any sub-structure accesses to
      use container_of().  The proc display handlers were missing the needed
      container_of()s since the iterator is passing in the top-level struct
      kern_ipc_perm.
      
      This would lead to crashes when running the "lsipc" program after the
      system had IPC registered (e.g. after starting up Gnome):
      
        general protection fault: 0000 [#1] PREEMPT SMP
        ...
        RIP: 0010:shm_add_rss_swap.isra.1+0x13/0xa0
        ...
        Call Trace:
          sysvipc_shm_proc_show+0x5e/0x150
          sysvipc_proc_show+0x1a/0x30
          seq_read+0x2e9/0x3f0
        ...
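      The container_of() requirement above can be illustrated in userspace. The macro below is a simplified form of the kernel's (which additionally type-checks `ptr`), and `struct perm_model`/`struct shm_model` are hypothetical stand-ins for struct kern_ipc_perm and struct shmid_kernel. A plain cast from the embedded member pointer is only correct while that member sits at offset 0, which randstruct no longer guarantees; offsetof() keeps the conversion correct for any layout.

```c
#include <assert.h>
#include <stddef.h>

/* Recover the enclosing struct from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct perm_model { int id; };

struct shm_model {
    long segsz;                /* deliberately placed first */
    struct perm_model perm;    /* so perm is NOT at offset 0 */
};

/* A proc-style show handler that is handed only the embedded perm. */
static long show_segsz(struct perm_model *p)
{
    struct shm_model *shp = container_of(p, struct shm_model, perm);

    return shp->segsz;
}
```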
      
      Link: http://lkml.kernel.org/r/20170730205950.GA55841@beast
      Fixes: 3859a271 ("randstruct: Mark various structs for randomization")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Acked-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() · 89affbf5
      Dima Zavin authored
      In codepaths that use the begin/retry interface for reading
      mems_allowed_seq with irqs disabled, there exists a race condition that
      stalls the patch process after only modifying a subset of the
      static_branch call sites.
      
      This problem manifested itself as a deadlock in the slub allocator,
      inside get_any_partial.  The loop reads mems_allowed_seq value (via
      read_mems_allowed_begin), performs the defrag operation, and then
      verifies the consistency of mems_allowed via read_mems_allowed_retry()
      and the cookie returned by xxx_begin.
      
      The issue here is that both begin and retry first check if cpusets are
      enabled via cpusets_enabled() static branch.  This branch can be
      rewritten dynamically (via cpuset_inc) if a new cpuset is created.  The
      x86 jump label code fully synchronizes across all CPUs for every entry
      it rewrites.  If it rewrites only one of the callsites (specifically the
      one in read_mems_allowed_retry) and then waits for the
      smp_call_function(do_sync_core) to complete while a CPU is inside the
      begin/retry section with IRQs off and the mems_allowed value is changed,
      we can hang.
      
      This is because begin() will always return 0 (since it wasn't patched
      yet) while retry() will test the 0 against the actual value of the seq
      counter.
      
      The fix is to use two different static keys: one for begin
      (pre_enable_key) and one for retry (enable_key).  In cpuset_inc(), we
      first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
      always returns a valid seqcount if we are enabling cpusets.  Similarly, when
      disabling cpusets via cpuset_dec(), we first ensure that callers of
      cpuset_mems_allowed_retry() will start ignoring the seqcount value
      before we let cpuset_mems_allowed_begin() return 0.
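      A toy, single-threaded model of the two-key ordering described above. Plain ints stand in for the static keys (named after the commit's pre_enable_key and enable_key) and `seq` for the mems_allowed seqcount; `struct cpuset_model` and the `model_*` functions are hypothetical and ignore real seqcount details such as odd values during a write. The point is that the ordering in inc/dec never produces the state "begin() unpatched, retry() patched", which is the state that loops forever.

```c
#include <assert.h>
#include <stdbool.h>

struct cpuset_model {
    int pre_enable_key;   /* consulted by begin() */
    int enable_key;       /* consulted by retry() */
    unsigned seq;         /* stands in for the mems_allowed seqcount */
};

static unsigned model_begin(const struct cpuset_model *m)
{
    if (!m->pre_enable_key)
        return 0;
    return m->seq;
}

/* Returns true when the caller must loop again. */
static bool model_retry(const struct cpuset_model *m, unsigned cookie)
{
    if (!m->enable_key)
        return false;
    return m->seq != cookie;
}

/* Enabling: flip the key read by begin() first ... */
static void model_inc(struct cpuset_model *m)
{
    m->pre_enable_key++;
    m->enable_key++;
}

/* ... and when disabling, stop retry() from checking before begin()
 * starts returning 0. */
static void model_dec(struct cpuset_model *m)
{
    m->enable_key--;
    m->pre_enable_key--;
}
```

      Any intermediate state reachable through model_inc()/model_dec() has pre_enable_key set whenever enable_key is set, so a begin/retry pair cannot compare a stale 0 cookie against a live sequence value.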
      
      The relevant stack traces of the two stuck threads:
      
        CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
        RIP: smp_call_function_many+0x1f9/0x260
        Call Trace:
          smp_call_function+0x3b/0x70
          on_each_cpu+0x2f/0x90
          text_poke_bp+0x87/0xd0
          arch_jump_label_transform+0x93/0x100
          __jump_label_update+0x77/0x90
          jump_label_update+0xaa/0xc0
          static_key_slow_inc+0x9e/0xb0
          cpuset_css_online+0x70/0x2e0
          online_css+0x2c/0xa0
          cgroup_apply_control_enable+0x27f/0x3d0
          cgroup_mkdir+0x2b7/0x420
          kernfs_iop_mkdir+0x5a/0x80
          vfs_mkdir+0xf6/0x1a0
          SyS_mkdir+0xb7/0xe0
          entry_SYSCALL_64_fastpath+0x18/0xad
      
        ...
      
        CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8818087c0000 task.stack: ffffc90000030000
        RIP: int3+0x39/0x70
        Call Trace:
          <#DB> ? ___slab_alloc+0x28b/0x5a0
          <EOE> ? copy_process.part.40+0xf7/0x1de0
          __slab_alloc.isra.80+0x54/0x90
          copy_process.part.40+0xf7/0x1de0
          copy_process.part.40+0xf7/0x1de0
          kmem_cache_alloc_node+0x8a/0x280
          copy_process.part.40+0xf7/0x1de0
          _do_fork+0xe7/0x6c0
          _raw_spin_unlock_irq+0x2d/0x60
          trace_hardirqs_on_caller+0x136/0x1d0
          entry_SYSCALL_64_fastpath+0x5/0xad
          do_syscall_64+0x27/0x350
          SyS_clone+0x19/0x20
          do_syscall_64+0x60/0x350
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
      Fixes: 46e700ab ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
      Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
      Reported-by: Cliff Spradlin <cspradlin@waymo.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd_zeropage: return -ENOSPC in case mm has gone · 9d95aa4b
      Mike Rapoport authored
      In the non-cooperative userfaultfd case, the process exit may race with
      outstanding mcopy_atomic called by the uffd monitor.  Returning -ENOSPC
      instead of -EINVAL when mm is already gone will allow uffd monitor to
      distinguish this case from other error conditions.
      
      Unfortunately, I overlooked userfaultfd_zeropage() when updating
      userfaultfd_copy().
      
      Link: http://lkml.kernel.org/r/1501136819-21857-1-git-send-email-rppt@linux.vnet.ibm.com
      Fixes: 96333187 ("userfaultfd_copy: return -ENOSPC in case mm has gone")
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: take memory hotplug lock within numa_zonelist_order_handler() · 167d0f25
      Heiko Carstens authored
      Andre Wild reported the following warning:
      
        WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60
        Modules linked in:
        CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57 #10
        Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
        task: 00000000701d8100 task.stack: 0000000073594000
        Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60)
        ...
        Call Trace:
         lockdep_assert_cpus_held+0x42/0x60)
         stop_machine_cpuslocked+0x62/0xf0
         build_all_zonelists+0x92/0x150
         numa_zonelist_order_handler+0x102/0x150
         proc_sys_call_handler.isra.12+0xda/0x118
         proc_sys_write+0x34/0x48
         __vfs_write+0x3c/0x178
         vfs_write+0xbc/0x1a0
         SyS_write+0x66/0xc0
         system_call+0xc4/0x2b0
         locks held by bash/1205:
         #0:  (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0
         #1:  (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150
         #2:  (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150
        Last Breaking-Event-Address:
          lockdep_assert_cpus_held+0x48/0x60
      
      This can be easily triggered with e.g.
      
          echo n > /proc/sys/vm/numa_zonelist_order
      
      In commit 3f906ba2 ("mm/memory-hotplug: switch locking to a percpu
      rwsem") memory hotplug locking was changed to fix a potential deadlock.
      
      This also switched the stop_machine() invocation within
      build_all_zonelists() to stop_machine_cpuslocked() which now expects
      that online cpus are locked when being called.
      
      This assumption is not true if build_all_zonelists() is being called
      from numa_zonelist_order_handler().
      
      To fix this, simply add a mem_hotplug_begin()/mem_hotplug_done()
      pair to numa_zonelist_order_handler().
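      A toy model of the locking contract described above. A counter stands in for "online CPUs are read-locked", and `model_hotplug_begin()`/`model_hotplug_done()` are hypothetical stand-ins for mem_hotplug_begin()/mem_hotplug_done(), assumed here (as the commit implies) to take and release that lock. Where lockdep would print the WARNING shown above, this model simply returns false.

```c
#include <assert.h>
#include <stdbool.h>

static int cpus_read_locked;   /* >0 means the CPU hotplug read lock is held */

static void model_hotplug_begin(void) { cpus_read_locked++; }
static void model_hotplug_done(void)  { cpus_read_locked--; }

/* Like stop_machine_cpuslocked(): the caller must already hold the lock. */
static bool model_stop_machine_cpuslocked(void)
{
    return cpus_read_locked > 0;
}

static bool model_build_all_zonelists(void)
{
    return model_stop_machine_cpuslocked();
}

/* Sysctl handler; with_fix adds the begin/done bracket from the patch. */
static bool model_zonelist_order_handler(bool with_fix)
{
    bool ok;

    if (with_fix)
        model_hotplug_begin();
    ok = model_build_all_zonelists();
    if (with_fix)
        model_hotplug_done();
    return ok;
}
```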
      
      Link: http://lkml.kernel.org/r/20170726111738.38768-1-heiko.carstens@de.ibm.com
      Fixes: 3f906ba2 ("mm/memory-hotplug: switch locking to a percpu rwsem")
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reported-by: Andre Wild <wild@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_io.c: fix oops during block io poll in swapin path · b0ba2d0f
      Tetsuo Handa authored
      When a thread is OOM-killed during swap_readpage() operation, an oops
      occurs because end_swap_bio_read() is calling wake_up_process() based on
      an assumption that the thread which called swap_readpage() is still
      alive.
      
        Out of memory: Kill process 525 (polkitd) score 0 or sacrifice child
        Killed process 525 (polkitd) total-vm:528128kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB
        oom_reaper: reaped process 525 (polkitd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
        Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon sg shpchp vmw_vmci parport_pc parport i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi vmwgfx ahci libahci drm_kms_helper ata_piix syscopyarea sysfillrect sysimgblt fb_sys_fops mptspi scsi_transport_spi ttm e1000 mptscsih drm mptbase i2c_core libata serio_raw
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0-rc2-next-20170725 #129
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
        task: ffffffffb7c16500 task.stack: ffffffffb7c00000
        RIP: 0010:__lock_acquire+0x151/0x12f0
        Call Trace:
         <IRQ>
         lock_acquire+0x59/0x80
         _raw_spin_lock_irqsave+0x3b/0x4f
         try_to_wake_up+0x3b/0x410
         wake_up_process+0x10/0x20
         end_swap_bio_read+0x6f/0xf0
         bio_endio+0x92/0xb0
         blk_update_request+0x88/0x270
         scsi_end_request+0x32/0x1c0
         scsi_io_completion+0x209/0x680
         scsi_finish_command+0xd4/0x120
         scsi_softirq_done+0x120/0x140
         __blk_mq_complete_request_remote+0xe/0x10
         flush_smp_call_function_queue+0x51/0x120
         generic_smp_call_function_single_interrupt+0xe/0x20
         smp_trace_call_function_single_interrupt+0x22/0x30
         smp_call_function_single_interrupt+0x9/0x10
         call_function_single_interrupt+0xa7/0xb0
         </IRQ>
        RIP: 0010:native_safe_halt+0x6/0x10
         default_idle+0xe/0x20
         arch_cpu_idle+0xa/0x10
         default_idle_call+0x1e/0x30
         do_idle+0x187/0x200
         cpu_startup_entry+0x6e/0x70
         rest_init+0xd0/0xe0
         start_kernel+0x456/0x477
         x86_64_start_reservations+0x24/0x26
         x86_64_start_kernel+0xf7/0x11a
         secondary_startup_64+0xa5/0xa5
        Code: c3 49 81 3f 20 9e 0b b8 41 bc 00 00 00 00 44 0f 45 e2 83 fe 01 0f 87 62 ff ff ff 89 f0 49 8b 44 c7 08 48 85 c0 0f 84 52 ff ff ff <f0> ff 80 98 01 00 00 8b 3d 5a 49 c4 01 45 8b b3 18 0c 00 00 85
        RIP: __lock_acquire+0x151/0x12f0 RSP: ffffa01f39e03c50
        ---[ end trace 6c441db499169b1e ]---
        Kernel panic - not syncing: Fatal exception in interrupt
        Kernel Offset: 0x36000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
        ---[ end Kernel panic - not syncing: Fatal exception in interrupt
      
      Fix it by holding a reference to the thread.
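      A minimal userspace sketch of "holding a reference to the thread", assuming a hypothetical refcounted `struct task_model` in place of task_struct with get_task_struct()/put_task_struct(). The submitter pins itself before the I/O is issued, and the completion path drops that pin only after the wakeup, so a task that is OOM-killed and exits first cannot be freed underneath the completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct task_model {
    int refs;
    bool woken;
};

static int tasks_freed;

static struct task_model *task_get(struct task_model *t)
{
    t->refs++;
    return t;
}

static void task_put(struct task_model *t)
{
    if (--t->refs == 0) {
        free(t);
        tasks_freed++;
    }
}

/* Submitter (swap_readpage()-style): the in-flight I/O owns a reference. */
static struct task_model *submit_read(struct task_model *self)
{
    return task_get(self);
}

/* Completion (end_swap_bio_read()-style): wake, then drop the I/O's ref. */
static void complete_read(struct task_model *waiter)
{
    waiter->woken = true;    /* wake_up_process() in the real code */
    task_put(waiter);
}
```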
      
      [akpm@linux-foundation.org: add comment]
      Fixes: 23955622 ("swap: add block io poll in swapin path")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: Shaohua Li <shli@fb.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 02 Aug, 2017 7 commits
    • zram: do not free pool->size_class · 3189c820
      Minchan Kim authored
      Mike reported that the kernel oopses with the ltp:zram03 testcase.
      
        zram: Added device: zram0
        zram0: detected capacity change from 0 to 107374182400
        BUG: unable to handle kernel paging request at 0000306d61727a77
        IP: zs_map_object+0xb9/0x260
        PGD 0
        P4D 0
        Oops: 0000 [#1] SMP
        Dumping ftrace buffer:
           (ftrace buffer empty)
        Modules linked in: zram(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) loop(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) af_packet(E) br_netfilter(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) intel_powerclamp(E) coretemp(E) cdc_ether(E) kvm_intel(E) usbnet(E) mii(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) iTCO_wdt(E) ghash_clmulni_intel(E) bnx2(E) iTCO_vendor_support(E) pcbc(E) ioatdma(E) ipmi_ssif(E) aesni_intel(E) i5500_temp(E) i2c_i801(E) aes_x86_64(E) lpc_ich(E) shpchp(E) mfd_core(E) crypto_simd(E) i7core_edac(E) dca(E) glue_helper(E) cryptd(E) ipmi_si(E) button(E) acpi_cpufreq(E) ipmi_devintf(E) pcspkr(E) ipmi_msghandler(E)
         nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ext4(E) crc16(E) mbcache(E) jbd2(E) sd_mod(E) ata_generic(E) i2c_algo_bit(E) ata_piix(E) drm_kms_helper(E) ahci(E) syscopyarea(E) sysfillrect(E) libahci(E) sysimgblt(E) fb_sys_fops(E) uhci_hcd(E) ehci_pci(E) ttm(E) ehci_hcd(E) libata(E) drm(E) megaraid_sas(E) usbcore(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E) [last unloaded: zram]
        CPU: 6 PID: 12356 Comm: swapon Tainted: G            E   4.13.0.g87b2c3fc-default #194
        Hardware name: IBM System x3550 M3 -[7944K3G]-/69Y5698     , BIOS -[D6E150AUS-1.10]- 12/15/2010
        task: ffff880158d2c4c0 task.stack: ffffc90001680000
        RIP: 0010:zs_map_object+0xb9/0x260
        Call Trace:
         zram_bvec_rw.isra.26+0xe8/0x780 [zram]
         zram_rw_page+0x6e/0xa0 [zram]
         bdev_read_page+0x81/0xb0
         do_mpage_readpage+0x51a/0x710
         mpage_readpages+0x122/0x1a0
         blkdev_readpages+0x1d/0x20
         __do_page_cache_readahead+0x1b2/0x270
         ondemand_readahead+0x180/0x2c0
         page_cache_sync_readahead+0x31/0x50
         generic_file_read_iter+0x7e7/0xaf0
         blkdev_read_iter+0x37/0x40
         __vfs_read+0xce/0x140
         vfs_read+0x9e/0x150
         SyS_read+0x46/0xa0
         entry_SYSCALL_64_fastpath+0x1a/0xa5
        Code: 81 e6 00 c0 3f 00 81 fe 00 00 16 00 0f 85 9f 01 00 00 0f b7 13 65 ff 05 5e 07 dc 7e 66 c1 ea 02 81 e2 ff 01 00 00 49 8b 54 d4 08 <8b> 4a 48 41 0f af ce 81 e1 ff 0f 00 00 41 89 c9 48 c7 c3 a0 70
        RIP: zs_map_object+0xb9/0x260 RSP: ffffc90001683988
        CR2: 0000306d61727a77
      
      He bisected the problem to commit cf8e0fed ("mm/zsmalloc: simplify
      zs_max_alloc_size handling").
      
      Since that commit, zs_create_pool no longer allocates pool->size_class
      separately through a double pointer, so its counterpart,
      zs_destroy_pool, must not free it either.
      
      Otherwise, kfree() is called on the wrong address and the kernel
      oopses.
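A simplified userspace sketch of the ownership rule: once the per-class pointer array is embedded in the pool, destroy frees each class and then the pool, and never the array itself. `struct pool`, `NR_CLASSES`, `pool_create()` and `pool_destroy()` are hypothetical; the real zsmalloc pool is considerably more involved.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CLASSES 4 /* hypothetical; the real pool has many more classes */

struct size_class {
	int size;
};

/* The pointer array lives inside the pool: it is not its own allocation. */
struct pool {
	struct size_class *size_class[NR_CLASSES];
};

static int frees; /* counts free() calls for the test */

static struct pool *pool_create(void)
{
	struct pool *p = calloc(1, sizeof(*p)); /* entries start as NULL */

	if (!p)
		return NULL;
	for (int i = 0; i < NR_CLASSES; i++) {
		p->size_class[i] = malloc(sizeof(struct size_class));
		if (!p->size_class[i])
			goto fail;
		p->size_class[i]->size = 32 * (i + 1);
	}
	return p;
fail:
	for (int i = 0; i < NR_CLASSES; i++)
		free(p->size_class[i]); /* free(NULL) is a no-op */
	free(p);
	return NULL;
}

static void pool_destroy(struct pool *p)
{
	for (int i = 0; i < NR_CLASSES; i++) {
		assert(p->size_class[i] != NULL);
		free(p->size_class[i]);
		frees++;
	}
	/* No free(p->size_class): that address points inside *p and was never
	 * returned by malloc(). Freeing the pool releases the array with it. */
	free(p);
	frees++;
}
```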
      
      Link: http://lkml.kernel.org/r/20170725062650.GA12134@bbox
      Fixes: cf8e0fed ("mm/zsmalloc: simplify zs_max_alloc_size handling")
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Mike Galbraith <efault@gmx.de>
      Tested-by: Mike Galbraith <efault@gmx.de>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kthread: fix documentation build warning · d16977f3
      Jonathan Corbet authored
      The kerneldoc comment for kthread_create() had an incorrect argument
      name, leading to a warning in the docs build.
      
      Correct it, and make one more small step toward a warning-free build.
      
      Link: http://lkml.kernel.org/r/20170724135916.7f486c6f@lwn.net
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kasan: avoid -Wmaybe-uninitialized warning · e7701557
      Arnd Bergmann authored
      gcc-7 produces this warning:
      
        mm/kasan/report.c: In function 'kasan_report':
        mm/kasan/report.c:351:3: error: 'info.first_bad_addr' may be used uninitialized in this function [-Werror=maybe-uninitialized]
           print_shadow_for_address(info->first_bad_addr);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        mm/kasan/report.c:360:27: note: 'info.first_bad_addr' was declared here
      
      The code seems fine as we only print info.first_bad_addr when there is a
      shadow, and we always initialize it in that case, but this is relatively
      hard for gcc to figure out after the latest rework.
      
      Initializing first_bad_addr to its most likely value, together with the
      other struct members, silences the warning.
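A minimal sketch of that kind of fix, assuming the "most likely value" is the access address itself (the message does not say which value is used). `struct access_info` and `make_info()` are hypothetical, cut-down stand-ins, not the kernel's KASAN structures: every member is set in one designated initializer, so no path can read first_bad_addr uninitialized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down report descriptor (not the kernel's struct). */
struct access_info {
	const void *access_addr;
	const void *first_bad_addr;
	size_t access_size;
};

/* Initialize all members together; first_bad_addr defaults to the access
 * address (assumed most likely value) instead of being left unset. */
static struct access_info make_info(const void *addr, size_t size)
{
	struct access_info info = {
		.access_addr = addr,
		.first_bad_addr = addr,
		.access_size = size,
	};

	return info;
}
```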
      
      Fixes: b235b9808664 ("kasan: unify report headers")
      Link: https://patchwork.kernel.org/patch/9641417/
      Link: http://lkml.kernel.org/r/20170725152739.4176967-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Suggested-by: Alexander Potapenko <glider@google.com>
      Suggested-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: non-cooperative: notify about unmap of destination during mremap · b2282371
      Mike Rapoport authored
      When mremap is called with MREMAP_FIXED, it unmaps memory at the
      destination address without notifying the userfaultfd monitor.
      
      If the destination were registered with userfaultfd, the monitor has no
      way to distinguish between the old and new ranges and to properly relate
      the page faults that would occur in the destination region.
      
      Fixes: 897ab3e0 ("userfaultfd: non-cooperative: add event for memory unmaps")
      Link: http://lkml.kernel.org/r/1500276876-3350-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries · 3ea27719
      Mel Gorman authored
      Nadav Amit identified a theoretical race between page reclaim and
      mprotect due to TLB flushes being batched outside the page table lock
      (PTL).
      
      He described the race as follows:
      
              CPU0                            CPU1
              ----                            ----
                                              user accesses memory using RW PTE
                                              [PTE now cached in TLB]
              try_to_unmap_one()
              ==> ptep_get_and_clear()
              ==> set_tlb_ubc_flush_pending()
                                              mprotect(addr, PROT_READ)
                                              ==> change_pte_range()
                                              ==> [ PTE non-present - no flush ]
      
                                              user writes using cached RW PTE
              ...
      
              try_to_unmap_flush()
      
      The same type of race exists for reads when protecting for PROT_NONE and
      also exists for operations that can leave an old TLB entry behind such
      as munmap, mremap and madvise.
      
      For some operations like mprotect, it's not necessarily a data integrity
      issue, but it is a correctness issue, as there is a window where an
      mprotect that limits access still allows access.  For munmap, it's
      potentially a data integrity issue, although the race is very hard to
      win, as an munmap, an mmap and a return to userspace must all complete
      in the window between reclaim dropping the PTL and flushing the TLB.
      However, it's theoretically possible, so handle this issue by flushing
      the mm if reclaim is potentially batching TLB flushes at the time.
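The rule in the last sentence can be modelled as a per-mm flag that reclaim sets when it defers a flush, and that mprotect-style paths check before relying on "PTE non-present, no flush". This is a hypothetical, single-threaded sketch: `struct mm_model`, `reclaim_unmap_batched()` and `flush_if_batched_pending()` are illustrative names, and the model ignores the memory ordering a real concurrent implementation needs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-mm state: set while reclaim has cleared PTEs but deferred
 * the TLB flush; full_flushes counts whole-mm flushes for the test. */
struct mm_model {
	bool tlb_flush_batched;
	int full_flushes;
};

static void flush_whole_mm(struct mm_model *mm)
{
	mm->full_flushes++;
}

/* Reclaim: PTE cleared under the PTL, TLB flush batched for later. */
static void reclaim_unmap_batched(struct mm_model *mm)
{
	mm->tlb_flush_batched = true;
}

/* mprotect/munmap/mremap/madvise path: a non-present PTE does not prove that
 * no stale TLB entry exists while reclaim's flush is still pending. */
static void flush_if_batched_pending(struct mm_model *mm)
{
	if (mm->tlb_flush_batched) {
		flush_whole_mm(mm);
		mm->tlb_flush_batched = false;
	}
}
```

The flush is paid only when reclaim may actually be batching, so the common case, with no pending batch, adds a single flag check.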
      
      Other instances where a flush is required for a present PTE should be ok,
      as either the page lock is held, preventing parallel reclaim, or a page
      reference count is elevated, preventing a parallel free that would lead
      to corruption.  In the case of page_mkclean, there isn't an obvious path
      that userspace could take advantage of without using the operations
      guarded by this patch.  Other users such as gup should be ok, as a race
      with reclaim only looks at PTEs.  Huge page variants should be ok as
      they don't race with reclaim.  mincore only looks at PTEs.  userfault
      should also be ok: if a parallel reclaim takes place, it will either
      fault the page back in or read some of the data before the flush
      occurs, triggering a fault.
      
      Note that a variant of this patch was acked by Andy Lutomirski but this
      was for the x86 parts on top of his PCID work which didn't make the 4.13
      merge window as expected.  His ack is dropped from this version and
      there will be a follow-on patch on top of PCID that will include his
      ack.
      
      [akpm@linux-foundation.org: tweak comments]
      [akpm@linux-foundation.org: fix spello]
      Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: <stable@vger.kernel.org>	[v4.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pid: kill pidhash_size in pidhash_init() · 27e37d84
      Kefeng Wang authored
      Since commit 3d375d78 ("mm: update callers to use HASH_ZERO flag"),
      pidhash_size in pidhash_init() is unused; drop it.
      
      Link: http://lkml.kernel.org/r/1500389267-49222-1-git-send-email-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Pavel Tatashin <Pasha.Tatashin@Oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors · 2be7cfed
      Daniel Jordan authored
      Commit 9a291a7c ("mm/hugetlb: report -EHWPOISON not -EFAULT when
      FOLL_HWPOISON is specified") causes __get_user_pages to ignore certain
      errors from follow_hugetlb_page.  After such error, __get_user_pages
      subsequently calls faultin_page on the same VMA and start address that
      follow_hugetlb_page failed on instead of returning the error immediately
      as it should.
      
      In follow_hugetlb_page, when hugetlb_fault returns a value covered under
      VM_FAULT_ERROR, follow_hugetlb_page returns it without setting nr_pages
      to 0 as __get_user_pages expects in this case, which causes the
      following to happen in __get_user_pages: the "while (nr_pages)" check
      succeeds, we skip the "if (!vma..." check because we got a VMA the last
      time around, we find no page with follow_page_mask, and we call
      faultin_page, which calls hugetlb_fault for the second time.
      
      This issue also slightly changes how __get_user_pages works.  Before, it
      only returned error if it had made no progress (i = 0).  But now,
      follow_hugetlb_page can clobber "i" with an error code since its new
      return path doesn't check for progress.  So if "i" is nonzero before a
      failing call to follow_hugetlb_page, that indication of progress is lost
      and __get_user_pages can return error even if some pages were
      successfully pinned.
      
      To fix this, change follow_hugetlb_page so that it updates nr_pages,
      allowing __get_user_pages to fail immediately and restoring the "error
      only if no progress" behavior to __get_user_pages.
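The "error only if no progress" contract can be sketched as follows. `struct fake_range`, `follow_pages()` and `get_pages()` are hypothetical stand-ins for follow_hugetlb_page() and __get_user_pages(), reduced to a page counter. On error the helper zeroes `*nr_pages` so the caller's loop stops at once, and the caller reports the error only when no page was pinned.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical fault source: pages at index >= fail_at fail with -EFAULT. */
struct fake_range {
	long fail_at;
};

/* Inner helper: pins pages from index i onward. On error it sets *nr_pages
 * to 0 so the caller stops, stores the error in *err, and returns the
 * progress made so far instead of clobbering it with the error code. */
static long follow_pages(const struct fake_range *r, long i, long *nr_pages,
			 int *err)
{
	while (*nr_pages > 0) {
		if (i >= r->fail_at) {
			*nr_pages = 0;
			*err = -EFAULT;
			return i;
		}
		i++;
		(*nr_pages)--;
	}
	return i;
}

/* Outer loop: return the number of pages pinned, or the error only if no
 * page was pinned at all. */
static long get_pages(const struct fake_range *r, long nr_pages)
{
	long i = 0;
	int err = 0;

	while (nr_pages > 0)
		i = follow_pages(r, i, &nr_pages, &err);
	return i ? i : err;
}
```

With fail_at = 3 and 5 pages requested, get_pages() returns 3 rather than -EFAULT, so the caller still learns that three pages were pinned.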
      
      Tested that __get_user_pages returns when expected on error from
      hugetlb_fault in follow_hugetlb_page.
      
      Fixes: 9a291a7c ("mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified")
      Link: http://lkml.kernel.org/r/1500406795-58462-1-git-send-email-daniel.m.jordan@oracle.com
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Acked-by: Punit Agrawal <punit.agrawal@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: <stable@vger.kernel.org>	[4.12.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>