1. 10 Mar, 2017 40 commits
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-for-4.11-rc2' of git://people.freedesktop.org/~airlied/linux · 7c7fba98
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Intel, amd and mxsfb fixes.
      
        These are the drm fixes I've collected for rc2. Mostly i915 GVT only
        fixes, along with a single EDID fix, some mxsfb fixes and a few minor
        amd fixes"
      
      * tag 'drm-fixes-for-4.11-rc2' of git://people.freedesktop.org/~airlied/linux: (38 commits)
        drm: mxsfb: Implement drm_panel handling
        drm: mxsfb_crtc: Fix the framebuffer misplacement
        drm: mxsfb: Fix crash when provided invalid DT bindings
        drm: mxsfb: fix pixel clock polarity
        drm: mxsfb: use bus_format to determine LCD bus width
        drm/amdgpu: bump driver version for some new features
        drm/amdgpu: validate paramaters in the gem ioctl
        drm/amd/amdgpu: fix console deadlock if late init failed
        drm/i915/gvt: change some gvt_err to gvt_dbg_cmd
        drm/i915/gvt: protect RO and Rsvd bits of virtual vgpu configuration space
        drm/i915/gvt: handle workload lifecycle properly
        drm/edid: Add EDID_QUIRK_FORCE_8BPC quirk for Rotel RSX-1058
        drm/i915/gvt: fix an error for F_RO flag
        drm/i915/gvt: use pfn_valid for better checking
        drm/i915/gvt: set SFUSE_STRAP properly for vitual monitor detection
        drm/i915/gvt: fix an error for one register
        drm/i915/gvt: add more registers into handlers list
        drm/i915/gvt: have more registers with F_CMD_ACCESS flags set
        drm/i915/gvt: add some new MMIOs to cmd_access white list
        drm/i915/gvt: fix pcode mailbox write emulation of BDW
        ...
      7c7fba98
    • Linus Torvalds's avatar
      Merge branch 'prep-for-5level' · baeedc71
      Linus Torvalds authored
      Merge 5-level page table prep from Kirill Shutemov:
       "Here's relatively low-risk part of 5-level paging patchset. Merging it
        now will make enabling x86 5-level paging in v4.12 easier.
      
        The first patch is actually x86-specific: detect 5-level paging
        support. It boils down to a single define.
      
        The rest of the patchset converts the Linux MMU abstraction from 4- to
        5-level paging.
      
        Enabling the new abstraction in most cases requires adding a single
        line of code in arch-specific code. The rest is taken care of by
        asm-generic/.
      
        Changes to mm/ code are mostly mechanical: add support for the new page
        table level -- p4d_t -- where we deal with pud_t now.
      
        v2:
         - fix build on microblaze (Michal);
         - comment for __ARCH_HAS_5LEVEL_HACK in kasan_populate_zero_shadow();
         - acks from Michal"
      
      * emailed patches from Kirill A Shutemov <kirill.shutemov@linux.intel.com>:
        mm: introduce __p4d_alloc()
        mm: convert generic code to 5-level paging
        asm-generic: introduce <asm-generic/pgtable-nop4d.h>
        arch, mm: convert all architectures to use 5level-fixup.h
        asm-generic: introduce __ARCH_USE_5LEVEL_HACK
        asm-generic: introduce 5level-fixup.h
        x86/cpufeature: Add 5-level paging detection
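      
      To illustrate what the extra p4d level means for address translation,
      here is a minimal userspace sketch. It assumes the x86-64 layout of
      4 KiB pages and 9 index bits per level, so five levels cover 57 bits
      of virtual address. The struct and function names are hypothetical
      and are not kernel API.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT      12  /* 4 KiB pages (assumed) */
#define BITS_PER_LEVEL  9   /* 512 entries per table (assumed) */
#define LEVEL_MASK      ((1u << BITS_PER_LEVEL) - 1)

/* Hypothetical holder for the per-level table indexes of one address. */
struct va_indexes {
	unsigned pgd, p4d, pud, pmd, pte;
};

/* Split a virtual address into its five page table indexes. */
static struct va_indexes split_va_5level(uint64_t va)
{
	struct va_indexes ix;

	/* Five 9-bit indexes plus the 12-bit page offset span 57 bits. */
	assert(PAGE_SHIFT + 5 * BITS_PER_LEVEL == 57);

	ix.pte = (va >> PAGE_SHIFT) & LEVEL_MASK;                         /* bits 12..20 */
	ix.pmd = (va >> (PAGE_SHIFT + 1 * BITS_PER_LEVEL)) & LEVEL_MASK;  /* bits 21..29 */
	ix.pud = (va >> (PAGE_SHIFT + 2 * BITS_PER_LEVEL)) & LEVEL_MASK;  /* bits 30..38 */
	ix.p4d = (va >> (PAGE_SHIFT + 3 * BITS_PER_LEVEL)) & LEVEL_MASK;  /* bits 39..47 */
	ix.pgd = (va >> (PAGE_SHIFT + 4 * BITS_PER_LEVEL)) & LEVEL_MASK;  /* bits 48..56 */
	return ix;
}
```

      With only four levels the p4d index does not exist and the pgd index
      takes bits 39..47 instead.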
      baeedc71
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 8fe3ccae
      Linus Torvalds authored
      Merge fixes from Andrew Morton:
       "26 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
        userfaultfd: remove wrong comment from userfaultfd_ctx_get()
        fat: fix using uninitialized fields of fat_inode/fsinfo_inode
        sh: cayman: IDE support fix
        kasan: fix races in quarantine_remove_cache()
        kasan: resched in quarantine_remove_cache()
        mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()
        thp: fix another corner case of munlock() vs. THPs
        rmap: fix NULL-pointer dereference on THP munlocking
        mm/memblock.c: fix memblock_next_valid_pfn()
        userfaultfd: selftest: vm: allow to build in vm/ directory
        userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED
        userfaultfd: non-cooperative: fix fork fctx->new memleak
        mm/cgroup: avoid panic when init with low memory
        drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.h
        mm/vmstats: add thp_split_pud event for clarity
        include/linux/fs.h: fix unsigned enum warning with gcc-4.2
        userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete
        userfaultfd: non-cooperative: robustness check
        userfaultfd: non-cooperative: rollback userfaultfd_exit
        x86, mm: unify exit paths in gup_pte_range()
        ...
      8fe3ccae
    • Linus Torvalds's avatar
      Merge tag 'xfs-4.11-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 9db61d6f
      Linus Torvalds authored
      Pull xfs fixes from Darrick Wong:
       "Here are some bug fixes for -rc2 to clean up the copy on write
        handling and to remove a cause of hangs.
      
         - Fix various iomap bugs
      
         - Fix overly aggressive CoW preallocation garbage collection
      
         - Fixes to CoW endio error handling
      
         - Fix some incorrect geometry calculations
      
         - Remove a potential system hang in bulkstat
      
         - Try to allocate blocks more aggressively to reduce ENOSPC errors"
      
      * tag 'xfs-4.11-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        xfs: try any AG when allocating the first btree block when reflinking
        xfs: use iomap new flag for newly allocated delalloc blocks
        xfs: remove kmem_zalloc_greedy
        xfs: Use xfs_icluster_size_fsb() to calculate inode alignment mask
        xfs: fix and streamline error handling in xfs_end_io
        xfs: only reclaim unwritten COW extents periodically
        iomap: invalidate page caches should be after iomap_dio_complete() in direct write
      9db61d6f
    • Linus Torvalds's avatar
      Merge tag 'gcc-plugins-v4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 794fe789
      Linus Torvalds authored
      Pull gcc-plugins fix from Kees Cook:
       "Fixes a typo in sancov plugin, exposed in earlier compiler versions"
      
      * tag 'gcc-plugins-v4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        gcc-plugins: fix sancov_plugin for gcc-5
      794fe789
    • Fabio Estevam's avatar
      drm: mxsfb: Implement drm_panel handling · 3f81e134
      Fabio Estevam authored
      Currently when the 'power-supply' regulator is passed via device tree
      it does not actually work since drm_panel_prepare()/drm_panel_enable()
      are never called.
      
      Quoting Thierry Reding: "It should really call drm_panel_prepare() and
      drm_panel_enable() while switching on the display pipeline and
      drm_panel_disable(), followed by drm_panel_unprepare() while switching
      off the display pipeline."
      
      So do as suggested, so that the 'power-supply' regulator can be functional.
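      
      The call ordering quoted above can be sketched in userspace as
      follows. The panel_*() stubs only stand in for drm_panel_prepare()
      and friends, and struct fake_panel, pipeline_on() and pipeline_off()
      are hypothetical names used for illustration; this is not the
      driver's code.

```c
#include <assert.h>

/* Stand-in for a panel; 'log' records the order of the calls. */
struct fake_panel {
	char log[8];
	unsigned n;
};

static void record(struct fake_panel *p, char step)
{
	assert(p->n < sizeof(p->log) - 1);
	p->log[p->n++] = step;
	p->log[p->n] = '\0';
}

/* Stubs standing in for the drm_panel_*() calls named in the text. */
static void panel_prepare(struct fake_panel *p)   { record(p, 'P'); }
static void panel_enable(struct fake_panel *p)    { record(p, 'E'); }
static void panel_disable(struct fake_panel *p)   { record(p, 'D'); }
static void panel_unprepare(struct fake_panel *p) { record(p, 'U'); }

/* Switching the display pipeline on: prepare, then enable. */
static void pipeline_on(struct fake_panel *p)
{
	panel_prepare(p);
	panel_enable(p);
}

/* Switching the display pipeline off: disable, then unprepare. */
static void pipeline_off(struct fake_panel *p)
{
	panel_disable(p);
	panel_unprepare(p);
}
```

      Calling pipeline_on() followed by pipeline_off() on a zeroed
      struct fake_panel leaves the four steps in the log in the order
      prepare, enable, disable, unprepare.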
      Reported-by: Breno Lima <breno.lima@nxp.com>
      Suggested-by: Thierry Reding <thierry.reding@gmail.com>
      Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
      Tested-by: Marek Vasut <marex@denx.de>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      3f81e134
    • Fabio Estevam's avatar
      drm: mxsfb_crtc: Fix the framebuffer misplacement · d42986b6
      Fabio Estevam authored
      Currently the framebuffer content is displayed with incorrect offsets
      in both the vertical and horizontal directions.
      
      The fbdev version of the driver does not show this problem. Breno Lima
      dumped the eLCDIF controller registers on both the drm and fbdev drivers
      and noticed that the VDCTRL3 register is configured incorrectly in the
      drm driver.
      
      The fbdev driver calculates the vertical and horizontal wait counts
      of the VDCTRL3 register by doing: back porch + sync length.
      
      Looking at the horizontal and vertical timing diagram from
      include/drm/drm_modes.h this value corresponds to:
      
      crtc_[hv]total - crtc_[hv]sync_start
      
      So fix the VDCTRL3 register setting accordingly so that the eLCDIF
      controller can properly show the framebuffer content in the correct
      position.
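      
      The wait count arithmetic described above can be checked with a
      small sketch. struct timing is a hypothetical stand-in for the
      horizontal or vertical fields of a display mode, and the mode
      numbers mentioned after the block are made up for illustration.

```c
#include <assert.h>

/* Stand-in for the horizontal (or vertical) fields of a display mode. */
struct timing {
	unsigned display;     /* active pixels or lines */
	unsigned sync_start;  /* display + front porch */
	unsigned sync_end;    /* sync_start + sync length */
	unsigned total;       /* sync_end + back porch */
};

/* Wait count as described above: back porch + sync length. */
static unsigned wait_count(const struct timing *t)
{
	assert(t->display <= t->sync_start &&
	       t->sync_start <= t->sync_end &&
	       t->sync_end <= t->total);
	return t->total - t->sync_start;
}
```

      For a made-up horizontal timing of 800/840/888/928 the sync length
      is 48 and the back porch is 40, and wait_count() returns 88, which
      is the sum of the two.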
      Reported-by: Breno Lima <breno.lima@nxp.com>
      Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
      Tested-by: Breno Lima <breno.lima@nxp.com>
      Tested-by: Marek Vasut <marex@denx.de>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      d42986b6
    • Marek Vasut's avatar
      drm: mxsfb: Fix crash when provided invalid DT bindings · 7ad7a5ac
      Marek Vasut authored
      The mxsfb driver will crash if the mxsfb DT node has a subnode
      whose content is not an of-graph binding with an endpoint linking
      to a panel. The crash was triggered by providing old-style panel
      bindings to the mxsfb driver instead of the new of-graph ones.
      
      The problem happens in mxsfb_create_output(), which is invoked
      from mxsfb_load(). mxsfb_create_output() iterates over all
      mxsfb DT subnode endpoints and tries to bind a panel on each
      endpoint. If there is any problem binding the panel, that is,
      mxsfb->panel == NULL, this function returns an error code;
      otherwise it returns 0 for success.
      
      If the subnodes do not specify an of-graph binding with an endpoint,
      the iteration over endpoints in mxsfb_create_output() runs for
      zero cycles and the function immediately returns 0, but
      mxsfb->panel remains NULL. This is propagated back into
      mxsfb_load(), which does not detect any problem and expects
      mxsfb->panel to be valid, and thus calls mxsfb_panel_attach(). But
      since mxsfb->panel == NULL, mxsfb_panel_attach() is called with
      a NULL first argument and this crashes the kernel.
      
      This patch fixes the problem by explicitly checking for valid
      mxsfb->panel at the end of the iteration in mxsfb_create_output().
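      
      The failure mode and the added check can be sketched in userspace
      as follows. struct fake_dev, struct fake_panel and create_output()
      are hypothetical stand-ins, and -ENODEV is an error code chosen for
      illustration only; the value the driver returns is not stated above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct fake_panel { int id; };

struct fake_dev {
	struct fake_panel *panel;
};

/*
 * Try to bind a panel for each endpoint. 'panels' stands in for the
 * result of the per-endpoint lookup; a NULL entry means that lookup
 * found no panel.
 */
static int create_output(struct fake_dev *dev,
			 struct fake_panel **panels, size_t n_endpoints)
{
	size_t i;

	assert(dev != NULL);
	for (i = 0; i < n_endpoints; i++) {
		if (panels[i]) {
			dev->panel = panels[i];
			break;
		}
	}

	/* The explicit check: with zero endpoints the loop never runs. */
	if (!dev->panel)
		return -ENODEV;
	return 0;
}
```

      Without the final check, a call with zero endpoints would return 0
      while dev->panel is still NULL, which is the situation the commit
      message describes.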
      Signed-off-by: Marek Vasut <marex@denx.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Stefan Agner <stefan@agner.ch>
      Cc: Breno Matheus Lima <brenomatheus@gmail.com>
      Tested-by: Breno Lima <breno.lima@nxp.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      7ad7a5ac
    • Stefan Agner's avatar
      drm: mxsfb: fix pixel clock polarity · 53990e41
      Stefan Agner authored
      The DRM subsystem specifies the pixel clock polarity from the
      controller's perspective: DRM_BUS_FLAG_PIXDATA_NEGEDGE means
      the controller drives the data on the pixel clock's falling edge.
      That is the controller's DOTCLK_POL=0 (the default is data launched
      at the negative edge).
      
      Also change the data enable logic to be high active by default
      and only change it if explicitly requested via bus_flags. With
      that, the defaults are:
      - Data enable: high active
      - Pixel clock polarity: controller drives data on negative edge
      Signed-off-by: Stefan Agner <stefan@agner.ch>
      Acked-by: Marek Vasut <marex@denx.de>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      53990e41
    • Stefan Agner's avatar
      drm: mxsfb: use bus_format to determine LCD bus width · 10f2889b
      Stefan Agner authored
      The LCD bus width does not need to align with the pixel format. The
      LCDIF controller automatically converts between pixel formats and
      bus width by padding or dropping LSBs.
      
      The DRM subsystem has the notion of bus_format, which allows
      determining which bus formats are supported by the display. Choose
      the first available one, or fall back to 24 bit if none are available.
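      
      A minimal sketch of that selection rule, assuming the display
      advertises its formats as an array. The enum values are
      hypothetical placeholders (the kernel uses MEDIA_BUS_FMT_* codes),
      and lcd_bus_width() is an illustrative name, not the driver's
      function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical format codes standing in for MEDIA_BUS_FMT_* values. */
enum bus_fmt { FMT_RGB565_1X16 = 1, FMT_RGB666_1X18, FMT_RGB888_1X24 };

/*
 * Return the LCD bus width in bits for the first advertised format,
 * or 24 if the display advertises none.
 */
static unsigned lcd_bus_width(const enum bus_fmt *fmts, size_t n)
{
	enum bus_fmt fmt = FMT_RGB888_1X24;   /* fallback: 24 bit */

	assert(n == 0 || fmts != NULL);
	if (n > 0)
		fmt = fmts[0];                /* first available */

	switch (fmt) {
	case FMT_RGB565_1X16: return 16;
	case FMT_RGB666_1X18: return 18;
	case FMT_RGB888_1X24: return 24;
	}
	return 24;  /* unknown code: treated like the fallback (an assumption) */
}
```

      The bus width is picked independently of the framebuffer pixel
      format, which matches the point above that the two do not need to
      align.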
      Signed-off-by: Stefan Agner <stefan@agner.ch>
      Acked-by: Marek Vasut <marex@denx.de>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      10f2889b
    • Dave Airlie's avatar
      Merge branch 'drm-fixes-4.11' of git://people.freedesktop.org/~agd5f/linux into drm-fixes · 9813527a
      Dave Airlie authored
      * 'drm-fixes-4.11' of git://people.freedesktop.org/~agd5f/linux:
        drm/amdgpu: bump driver version for some new features
        drm/amdgpu: validate paramaters in the gem ioctl
        drm/amd/amdgpu: fix console deadlock if late init failed
      9813527a
    • Dave Airlie's avatar
      Merge tag 'drm-intel-fixes-2017-03-09' of... · 31aec642
      Dave Airlie authored
      Merge tag 'drm-intel-fixes-2017-03-09' of git://anongit.freedesktop.org/git/drm-intel into drm-fixes
      
      flushing out gvt-g fixes
      
      * tag 'drm-intel-fixes-2017-03-09' of git://anongit.freedesktop.org/git/drm-intel: (29 commits)
        drm/i915/gvt: change some gvt_err to gvt_dbg_cmd
        drm/i915/gvt: protect RO and Rsvd bits of virtual vgpu configuration space
        drm/i915/gvt: handle workload lifecycle properly
        drm/i915/gvt: fix an error for F_RO flag
        drm/i915/gvt: use pfn_valid for better checking
        drm/i915/gvt: set SFUSE_STRAP properly for vitual monitor detection
        drm/i915/gvt: fix an error for one register
        drm/i915/gvt: add more registers into handlers list
        drm/i915/gvt: have more registers with F_CMD_ACCESS flags set
        drm/i915/gvt: add some new MMIOs to cmd_access white list
        drm/i915/gvt: fix pcode mailbox write emulation of BDW
        drm/i915/gvt: add resolution definition for vGPU type
        drm/i915/gvt: Add more edid definition support
        drm/i915/gvt: adjust to fixed vGPU types
        drm/i915/gvt: remove unnecessary error msg from gtt write
        drm/i915/gvt: refine pcode write emulation
        drm/i915/gvt: clear the vGPU reset logic
        drm/i915/gvt: decrease priority of output msg for untracked mmio
        drm/i915/gvt: set default value to 0 for unhandled mmio regs
        drm/i915/gvt: add cmd_access to GEN7_HALF_SLICE_CHICKEN1
        ...
      31aec642
    • Dave Airlie's avatar
      Merge tag 'drm-misc-fixes-2017-03-06' of git://anongit.freedesktop.org/git/drm-misc into drm-fixes · aa717ae1
      Dave Airlie authored
      Just 1 8bpc quirk from Ville, cc: stable
      
      * tag 'drm-misc-fixes-2017-03-06' of git://anongit.freedesktop.org/git/drm-misc:
        drm/edid: Add EDID_QUIRK_FORCE_8BPC quirk for Rotel RSX-1058
      aa717ae1
    • David Hildenbrand's avatar
      userfaultfd: remove wrong comment from userfaultfd_ctx_get() · 2378cd61
      David Hildenbrand authored
      It's a void function, so there is no return value.
      
      Link: http://lkml.kernel.org/r/20170309150817.7510-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2378cd61
    • OGAWA Hirofumi's avatar
      fat: fix using uninitialized fields of fat_inode/fsinfo_inode · c0d0e351
      OGAWA Hirofumi authored
      The recently merged fallocate patch uses
      MSDOS_I(inode)->mmu_private in fat_evict_inode().  However, the
      fat_inode/fsinfo_inode that was introduced earlier didn't initialize
      MSDOS_I(inode) properly.
      
      The combination of the two caused accesses to random entries
      in the FAT area.
      
      Link: http://lkml.kernel.org/r/87pohrj4i8.fsf@mail.parknet.co.jp
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Reported-by: Moreno Bartalucci <moreno.bartalucci@tecnorama.it>
      Tested-by: Moreno Bartalucci <moreno.bartalucci@tecnorama.it>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0d0e351
    • Bartlomiej Zolnierkiewicz's avatar
      sh: cayman: IDE support fix · ca5b58ea
      Bartlomiej Zolnierkiewicz authored
      Remove incorrect CONFIG_IDE ifdef (CONFIG_IDE config option is for
      internal drivers/ide/ use) and make IDE hardware interface always
      initialized (not only when IDE subsystem is built-in).
      
      This patch allows Cayman board to work with modular IDE subsystem
      support and removes the requirement of having the whole core IDE
      subsystem built-in when using libata PATA support.
      
      Link: http://lkml.kernel.org/r/1990884.yFoE6lSB9G@amdc3058
      Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca5b58ea
    • Dmitry Vyukov's avatar
      kasan: fix races in quarantine_remove_cache() · ce5bec54
      Dmitry Vyukov authored
      quarantine_remove_cache() frees all pending objects that belong to the
      cache, before we destroy the cache itself.  However, there are currently
      two ways it can fail to do so.
      
      First, another thread can hold some of the objects from the cache in a
      temp list in quarantine_put().  quarantine_put() has a window of
      enabled interrupts, and on_each_cpu() in quarantine_remove_cache() can
      finish right in that window.  These objects will be later freed into the
      destroyed cache.
      
      Then, quarantine_reduce() has the same problem.  It grabs a batch of
      objects from the global quarantine, then unlocks quarantine_lock and
      then frees the batch.  quarantine_remove_cache() can finish while some
      objects from the cache are still in the local to_free list in
      quarantine_reduce().
      
      Fix the race with quarantine_put() by disabling interrupts for the whole
      duration of quarantine_put().  In combination with on_each_cpu() in
      quarantine_remove_cache() it ensures that quarantine_remove_cache()
      either sees the objects in the per-cpu list or in the global list.
      
      Fix the race with quarantine_reduce() by protecting quarantine_reduce()
      with srcu critical section and then doing synchronize_srcu() at the end
      of quarantine_remove_cache().
      
      I've done some assessment of how well synchronize_srcu() works in this
      case.  On a 4 CPU VM I see that it blocks waiting for pending read
      critical sections in about 2-3% of cases, which looks good to me.
      
      I suspect that these races are the root cause of some GPFs that I
      episodically hit.  Previously I did not have any explanation for them.
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000c8
        IP: qlist_free_all+0x2e/0xc0 mm/kasan/quarantine.c:155
        PGD 6aeea067
        PUD 60ed7067
        PMD 0
        Oops: 0000 [#1] SMP KASAN
        Dumping ftrace buffer:
           (ftrace buffer empty)
        Modules linked in:
        CPU: 0 PID: 13667 Comm: syz-executor2 Not tainted 4.10.0+ #60
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        task: ffff88005f948040 task.stack: ffff880069818000
        RIP: 0010:qlist_free_all+0x2e/0xc0 mm/kasan/quarantine.c:155
        RSP: 0018:ffff88006981f298 EFLAGS: 00010246
        RAX: ffffea0000ffff00 RBX: 0000000000000000 RCX: ffffea0000ffff1f
        RDX: 0000000000000000 RSI: ffff88003fffc3e0 RDI: 0000000000000000
        RBP: ffff88006981f2c0 R08: ffff88002fed7bd8 R09: 00000001001f000d
        R10: 00000000001f000d R11: ffff88006981f000 R12: ffff88003fffc3e0
        R13: ffff88006981f2d0 R14: ffffffff81877fae R15: 0000000080000000
        FS:  00007fb911a2d700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000c8 CR3: 0000000060ed6000 CR4: 00000000000006f0
        Call Trace:
         quarantine_reduce+0x10e/0x120 mm/kasan/quarantine.c:239
         kasan_kmalloc+0xca/0xe0 mm/kasan/kasan.c:590
         kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:544
         slab_post_alloc_hook mm/slab.h:456 [inline]
         slab_alloc_node mm/slub.c:2718 [inline]
         kmem_cache_alloc_node+0x1d3/0x280 mm/slub.c:2754
         __alloc_skb+0x10f/0x770 net/core/skbuff.c:219
         alloc_skb include/linux/skbuff.h:932 [inline]
         _sctp_make_chunk+0x3b/0x260 net/sctp/sm_make_chunk.c:1388
         sctp_make_data net/sctp/sm_make_chunk.c:1420 [inline]
         sctp_make_datafrag_empty+0x208/0x360 net/sctp/sm_make_chunk.c:746
         sctp_datamsg_from_user+0x7e8/0x11d0 net/sctp/chunk.c:266
         sctp_sendmsg+0x2611/0x3970 net/sctp/socket.c:1962
         inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761
         sock_sendmsg_nosec net/socket.c:633 [inline]
         sock_sendmsg+0xca/0x110 net/socket.c:643
         SYSC_sendto+0x660/0x810 net/socket.c:1685
         SyS_sendto+0x40/0x50 net/socket.c:1653
      
      I am not sure about backporting.  The bug is quite hard to trigger; I've
      seen it a few times during our massive continuous testing (however, it
      could be the cause of some other episodic stray crashes as it leads to
      memory corruption...).  If it is triggered, the consequences are very
      bad -- almost definite bad memory corruption.  The fix is non-trivial
      and has a chance of introducing new bugs.  I am also not sure how
      actively people use KASAN on older releases.
      
      [dvyukov@google.com: - sorted includes]
        Link: http://lkml.kernel.org/r/20170309094028.51088-1-dvyukov@google.com
      Link: http://lkml.kernel.org/r/20170308151532.5070-1-dvyukov@google.com
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce5bec54
    • Dmitry Vyukov's avatar
      kasan: resched in quarantine_remove_cache() · 68fd814a
      Dmitry Vyukov authored
      We see reported stalls/lockups in quarantine_remove_cache() on machines
      with large amounts of RAM.  quarantine_remove_cache() needs to scan the
      whole quarantine in order to take out all objects belonging to the
      cache.  The quarantine is currently 1/32 of RAM, e.g. on a machine with
      256GB of memory that will be 8GB.  Moreover, quarantine scanning is a
      walk over an uncached linked list, which is slow.
      
      Add cond_resched() after scanning each non-empty batch of objects.
      Batches are specifically kept at a reasonable size for quarantine_put().
      On a machine with 256GB of RAM we should have ~512 non-empty batches,
      each with 16MB of objects.
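      
      The figures quoted above follow from simple arithmetic, sketched
      here with hypothetical helper names. The 1/32 ratio and the 16MB
      batch size are taken from the text; GB and MB are read as binary
      units, which is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define GIB (UINT64_C(1) << 30)
#define MIB (UINT64_C(1) << 20)

/* Quarantine size: 1/32 of RAM, per the text above. */
static uint64_t quarantine_bytes(uint64_t ram_bytes)
{
	return ram_bytes / 32;
}

/* Number of full batches of 'batch_bytes' that fit in the quarantine. */
static uint64_t quarantine_batches(uint64_t ram_bytes, uint64_t batch_bytes)
{
	assert(batch_bytes > 0);
	return quarantine_bytes(ram_bytes) / batch_bytes;
}
```

      For 256 GiB of RAM this gives an 8 GiB quarantine and 512 batches
      of 16 MiB, so the added cond_resched() would run on the order of
      512 times during one full scan.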
      
      Link: http://lkml.kernel.org/r/20170308154239.25440-1-dvyukov@google.com
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68fd814a
    • Tahsin Erdogan's avatar
      mm: do not call mem_cgroup_free() from within mem_cgroup_alloc() · 40e952f9
      Tahsin Erdogan authored
      mem_cgroup_free() indirectly calls wb_domain_exit() which is not
      prepared to deal with a struct wb_domain object that hasn't executed
      wb_domain_init().  For instance, the following warning message is
      printed by lockdep if alloc_percpu() fails in mem_cgroup_alloc():
      
        INFO: trying to register non-static key.
        the code is fine but needs lockdep annotation.
        turning off the locking correctness validator.
        CPU: 1 PID: 1950 Comm: mkdir Not tainted 4.10.0+ #151
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
         dump_stack+0x67/0x99
         register_lock_class+0x36d/0x540
         __lock_acquire+0x7f/0x1a30
         lock_acquire+0xcc/0x200
         del_timer_sync+0x3c/0xc0
         wb_domain_exit+0x14/0x20
         mem_cgroup_free+0x14/0x40
         mem_cgroup_css_alloc+0x3f9/0x620
         cgroup_apply_control_enable+0x190/0x390
         cgroup_mkdir+0x290/0x3d0
         kernfs_iop_mkdir+0x58/0x80
         vfs_mkdir+0x10e/0x1a0
         SyS_mkdirat+0xa8/0xd0
         SyS_mkdir+0x14/0x20
         entry_SYSCALL_64_fastpath+0x18/0xad
      
      Add __mem_cgroup_free() which skips wb_domain_exit().  This is used by
      both mem_cgroup_free() and mem_cgroup_alloc() clean up.
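      
      The pattern of splitting the free path can be sketched in userspace
      as follows. All names here (struct fake_memcg, domain_exit(),
      memcg_free_raw() and so on) are hypothetical stand-ins, and the
      'fail_stats' flag only simulates the alloc_percpu() failure
      mentioned above; this is not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct fake_memcg {
	int *stats;          /* stands in for the percpu allocation */
	bool domain_ready;   /* set once the domain part is initialized */
};

static int domain_exit_calls;  /* counts teardown calls, for demonstration */

/* Teardown that is only valid on an initialized domain. */
static void domain_exit(struct fake_memcg *m)
{
	assert(m->domain_ready);
	m->domain_ready = false;
	domain_exit_calls++;
}

/* Frees memory only; safe on a partially constructed object. */
static void memcg_free_raw(struct fake_memcg *m)
{
	free(m->stats);
	free(m);
}

/* Full teardown for a fully constructed object. */
static void memcg_free(struct fake_memcg *m)
{
	domain_exit(m);
	memcg_free_raw(m);
}

/* 'fail_stats' simulates the allocation failure from the report. */
static struct fake_memcg *memcg_alloc(bool fail_stats)
{
	struct fake_memcg *m = calloc(1, sizeof(*m));

	if (!m)
		return NULL;
	m->stats = fail_stats ? NULL : calloc(4, sizeof(*m->stats));
	if (!m->stats) {
		memcg_free_raw(m);   /* domain not initialized yet: skip its exit */
		return NULL;
	}
	m->domain_ready = true;
	return m;
}
```

      The error path in memcg_alloc() never reaches domain_exit(), while
      the normal memcg_free() path still runs it exactly once.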
      
      Fixes: 0b8f73e1 ("mm: memcontrol: clean up alloc, online, offline, free functions")
      Link: http://lkml.kernel.org/r/20170306192122.24262-1-tahsin@google.com
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      40e952f9
    • Kirill A. Shutemov's avatar
      thp: fix another corner case of munlock() vs. THPs · 6ebb4a1b
      Kirill A. Shutemov authored
      The following test case triggers BUG() in munlock_vma_pages_range():
      
      	#include <fcntl.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      
      	int main(int argc, char *argv[])
      	{
      		int fd;
      
      		system("mount -t tmpfs -o huge=always none /mnt");
      		/* O_CREAT requires a mode argument */
      		fd = open("/mnt/test", O_CREAT | O_RDWR, 0600);
      		ftruncate(fd, 4UL << 20);
      		mmap(NULL, 4UL << 20, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_FIXED | MAP_LOCKED, fd, 0);
      		mmap(NULL, 4096, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_LOCKED, fd, 0);
      		munlockall();
      		return 0;
      	}
      
      The second mmap() creates a PTE-mapping of the first huge page in the
      file.  It makes the kernel munlock the page as we never keep a
      PTE-mapped page mlocked.
      
      On munlockall(), when we handle the vma created by the first mmap(),
      munlock_vma_page() returns page_mask == 0, as the page is not mlocked
      anymore.  On the next iteration follow_page_mask() returns a tail page,
      but page_mask is HPAGE_NR_PAGES - 1.  It makes us skip to the first tail
      page of the next huge page and step on
      VM_BUG_ON_PAGE(PageMlocked(page)).
      
      The fix is to not use the page_mask from follow_page_mask() at all.  It
      has no use for us.
      
      Link: http://lkml.kernel.org/r/20170302150252.34120-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>    [4.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ebb4a1b
    • Kirill A. Shutemov's avatar
      rmap: fix NULL-pointer dereference on THP munlocking · 8346242a
      Kirill A. Shutemov authored
      The following test case triggers a NULL-pointer dereference in
      try_to_unmap_one():
      
      	#include <fcntl.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      
      	int main(int argc, char *argv[])
      	{
      		int fd;
      
      		system("mount -t tmpfs -o huge=always none /mnt");
      		fd = open("/mnt/test", O_CREAT | O_RDWR, 0600);
      		ftruncate(fd, 2UL << 20);
      		mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_FIXED | MAP_LOCKED, fd, 0);
      		mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_LOCKED, fd, 0);
      		munlockall();
      		return 0;
      	}
      
      Apparently, there's a case when we call try_to_unmap() on huge PMDs:
      it's TTU_MUNLOCK.
      
      Let's handle this case correctly.
      
      Fixes: c7ab0d2f ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
      Link: http://lkml.kernel.org/r/20170302151159.30592-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8346242a
    • AKASHI Takahiro's avatar
      mm/memblock.c: fix memblock_next_valid_pfn() · c9a1b80d
      AKASHI Takahiro authored
      Obviously, we should not access memblock.memory.regions[right] if
      'right' is outside of [0..memblock.memory.cnt>.
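      
      The bounds problem can be illustrated with a simplified binary
      search over sorted regions. This is a userspace sketch under stated
      assumptions: struct region and next_valid_addr() are hypothetical
      names, it works on plain addresses rather than pfns, and it is not
      the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct region { uint64_t base, size; };   /* sorted, non-overlapping */

/*
 * Return the smallest address >= addr that lies inside a region,
 * capped at 'limit', or 'limit' if there is no such address.
 */
static uint64_t next_valid_addr(const struct region *r, size_t cnt,
				uint64_t addr, uint64_t limit)
{
	size_t left = 0, right = cnt, mid;

	assert(cnt == 0 || r != NULL);
	while (left < right) {
		mid = left + (right - left) / 2;
		if (addr < r[mid].base)
			right = mid;
		else if (addr >= r[mid].base + r[mid].size)
			left = mid + 1;
		else
			return addr < limit ? addr : limit;  /* inside r[mid] */
	}

	/* 'right' may equal cnt here: addr is past the last region. */
	if (right == cnt)
		return limit;
	return r[right].base < limit ? r[right].base : limit;
}
```

      The check against cnt before indexing r[right] is the point of the
      sketch: when the address lies beyond the last region, 'right' is
      still equal to cnt and must not be used as an index.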
      
      Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible")
      Link: http://lkml.kernel.org/r/20170303023745.9104-1-takahiro.akashi@linaro.org
      Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9a1b80d
    • Andrea Arcangeli's avatar
      userfaultfd: selftest: vm: allow to build in vm/ directory · 46aa6a30
      Andrea Arcangeli authored
      linux/tools/testing/selftests/vm $ make
      
        gcc -Wall -I ../../../../usr/include     compaction_test.c -lrt -o /compaction_test
        /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/../../../../x86_64-pc-linux-gnu/bin/ld: cannot open output file /compaction_test: Permission denied
        collect2: error: ld returned 1 exit status
        make: *** [../lib.mk:54: /compaction_test] Error 1
      
      Since commit a8ba798b ("selftests: enable O and KBUILD_OUTPUT")
      selftests/vm build fails if run from the "selftests/vm" directory, but
      it works in the selftests/ directory.  It's quicker to be able to do a
      local vm-only build after a tree wipe and this patch allows for it
      again.
      
      Link: http://lkml.kernel.org/r/20170302173738.18994-4-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46aa6a30
    • Andrea Arcangeli's avatar
      userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED · 70ccb92f
      Andrea Arcangeli authored
      userfaultfd_remove() has to be executed before zapping the pagetables or
      UFFDIO_COPY could keep filling pages after zap_page_range() returned,
      which would result in non-zero data after a MADV_DONTNEED.
      
      However, userfaultfd_remove() may have to release the mmap_sem.  This was
      handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
      potentially stale vma (the very vma passed to zap_page_range(vma, ...)).
      
      The fix consists of revalidating the vma in case userfaultfd_remove()
      had to release the mmap_sem.
      
      This also optimizes away an unnecessary down_read/up_read in the
      MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.
      
      It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as
      userfaultfd_remove() will be defined as "true" at build time.
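
      The revalidation step described above can be modelled in userspace
      as follows.  This is only a sketch: struct area, find_area() and
      discard_len() are hypothetical names for illustration, the areas
      array is assumed to be sorted by address, and the lock_was_kept
      flag stands in for what userfaultfd_remove() reports to its caller.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a mapped area covering [start, end). */
struct area {
	unsigned long start;
	unsigned long end;
};

/* First area whose end lies above addr, or NULL (areas sorted by address). */
static struct area *find_area(struct area *areas, size_t n, unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < areas[i].end)
			return &areas[i];
	return NULL;
}

/*
 * Returns how many bytes starting at `start` may be discarded, or 0 if
 * the range is no longer mapped.  When the lock was dropped, the `area`
 * pointer obtained earlier is treated as stale and looked up again.
 */
unsigned long discard_len(struct area *areas, size_t n, struct area *area,
			  unsigned long start, unsigned long end,
			  bool lock_was_kept)
{
	if (!lock_was_kept) {
		area = find_area(areas, n, start);
		if (!area || start < area->start)
			return 0;		/* range vanished while unlocked */
		if (end > area->end)
			end = area->end;	/* area shrank while unlocked */
	}
	return end - start;
}
```

      When the lock was never released, no second lookup is done, so the
      extra work is only paid on the path that actually needs it.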
      
      Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70ccb92f
    • Mike Rapoport's avatar
      userfaultfd: non-cooperative: fix fork fctx->new memleak · 7eb76d45
      Mike Rapoport authored
      We have a memleak in the ->new ctx if the uffd of the parent is closed
      before the fork event is read: nothing frees the new context.
      
      Link: http://lkml.kernel.org/r/20170302173738.18994-2-aarcange@redhat.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7eb76d45
    • Laurent Dufour's avatar
      mm/cgroup: avoid panic when init with low memory · bfc7228b
      Laurent Dufour authored
      The system may panic during initialisation when almost all the
      memory is assigned to huge pages using the kernel command line
      parameter hugepages=xxxx.  The panic may look like this:
      
        Unable to handle kernel paging request for data at address 0x00000000
        Faulting instruction address: 0xc000000000302b88
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 [    0.082424] NUMA
        pSeries
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-15-generic #16-Ubuntu
        task: c00000021ed01600 task.stack: c00000010d108000
        NIP: c000000000302b88 LR: c000000000270e04 CTR: c00000000016cfd0
        REGS: c00000010d10b2c0 TRAP: 0300   Not tainted (4.9.0-15-generic)
        MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>[ 0.082770]   CR: 28424422  XER: 00000000
        CFAR: c0000000003d28b8 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
        GPR00: c000000000270e04 c00000010d10b540 c00000000141a300 c00000010fff6300
        GPR04: 0000000000000000 00000000026012c0 c00000010d10b630 0000000487ab0000
        GPR08: 000000010ee90000 c000000001454fd8 0000000000000000 0000000000000000
        GPR12: 0000000000004400 c00000000fb80000 00000000026012c0 00000000026012c0
        GPR16: 00000000026012c0 0000000000000000 0000000000000000 0000000000000002
        GPR20: 000000000000000c 0000000000000000 0000000000000000 00000000024200c0
        GPR24: c0000000016eef48 0000000000000000 c00000010fff7d00 00000000026012c0
        GPR28: 0000000000000000 c00000010fff7d00 c00000010fff6300 c00000010d10b6d0
        NIP mem_cgroup_soft_limit_reclaim+0xf8/0x4f0
        LR do_try_to_free_pages+0x1b4/0x450
        Call Trace:
          do_try_to_free_pages+0x1b4/0x450
          try_to_free_pages+0xf8/0x270
          __alloc_pages_nodemask+0x7a8/0xff0
          new_slab+0x104/0x8e0
          ___slab_alloc+0x620/0x700
          __slab_alloc+0x34/0x60
          kmem_cache_alloc_node_trace+0xdc/0x310
          mem_cgroup_init+0x158/0x1c8
          do_one_initcall+0x68/0x1d0
          kernel_init_freeable+0x278/0x360
          kernel_init+0x24/0x170
          ret_from_kernel_thread+0x5c/0x74
        Instruction dump:
        eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3d230001 e9499a42 3d220004
        3929acd8 794a1f24 7d295214 eac90100 <e9360000> 2fa90000 419eff74 3b200000
        ---[ end trace 342f5208b00d01b6 ]---
      
      This is a chicken and egg issue where the kernel tries to get free memory
      when allocating per node data in mem_cgroup_init(), but in that path
      mem_cgroup_soft_limit_reclaim() is called, which assumes that these data
      are allocated.
      
      As mem_cgroup_soft_limit_reclaim() is best effort, it should return when
      these data are not yet allocated.
      
      This patch also fixes potential null pointer access in
      mem_cgroup_remove_from_trees() and mem_cgroup_update_tree().
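
      The early return described above can be sketched in userspace as
      follows.  This is an illustration only: struct soft_limit_tree,
      node_tree[] and soft_limit_reclaim() are hypothetical stand-ins,
      not the memcg data structures, and "reclaim" is reduced to
      zeroing a counter.

```c
#include <stddef.h>

#define MODEL_MAX_NODES 4

/* Hypothetical stand-in for the per-node soft limit data. */
struct soft_limit_tree {
	unsigned long excess;
};

/* Entries stay NULL until init code has allocated the per-node data. */
static struct soft_limit_tree *node_tree[MODEL_MAX_NODES];

/*
 * Best-effort reclaim: if the per-node data does not exist yet, report
 * that nothing was reclaimed instead of dereferencing a NULL pointer.
 */
unsigned long soft_limit_reclaim(int nid)
{
	struct soft_limit_tree *tree;
	unsigned long reclaimed;

	if (nid < 0 || nid >= MODEL_MAX_NODES)
		return 0;

	tree = node_tree[nid];
	if (!tree)
		return 0;	/* called before init finished */

	reclaimed = tree->excess;
	tree->excess = 0;
	return reclaimed;
}
```

      Returning 0 is safe for a best-effort helper because callers
      already have to cope with the case where nothing could be
      reclaimed.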
      
      Link: http://lkml.kernel.org/r/1487856999-16581-2-git-send-email-ldufour@linux.vnet.ibm.com
      Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bfc7228b
    • Masanari Iida's avatar
    • Yisheng Xie's avatar
      mm/vmstats: add thp_split_pud event for clarity · ce9311cf
      Yisheng Xie authored
      We added support for PUD-sized transparent hugepages; however, we count
      the "thp split pud" event in the thp_split_pmd event.
      
      To separate the event count of thp split pud from pmd, add a new event
      named thp_split_pud.
      
      Link: http://lkml.kernel.org/r/1488282380-5076-1-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce9311cf
    • Arnd Bergmann's avatar
      include/linux/fs.h: fix unsigned enum warning with gcc-4.2 · cbfd0c10
      Arnd Bergmann authored
      With arm-linux-gcc-4.2, almost every file we build in the kernel ends up
      with this warning:
      
        include/linux/fs.h:2648: warning: comparison of unsigned expression < 0 is always false
      
      Later versions don't have this problem, but it's easy enough to work
      around.
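
      The commit text does not show the change itself.  Assuming the
      warning comes from a range check of the form "id < 0 || id >= MAX"
      on an enum that the compiler treats as unsigned, one common way to
      avoid it is a single unsigned comparison.  The enum, the string
      table and file_id_name() below are hypothetical names for this
      sketch, not the ones in include/linux/fs.h.

```c
#include <stddef.h>

/* Hypothetical identifiers for this sketch. */
enum file_id {
	ID_UNKNOWN,
	ID_FIRMWARE,
	ID_MODULE,
	ID_MAX
};

static const char *const file_id_str[] = {
	"unknown",
	"firmware",
	"module",
};

const char *file_id_name(enum file_id id)
{
	/*
	 * One unsigned comparison rejects both "negative" and too-large
	 * values, so there is no "id < 0" test left to warn about.
	 */
	if ((unsigned int)id >= (unsigned int)ID_MAX)
		return file_id_str[ID_UNKNOWN];

	return file_id_str[id];
}
```

      An out-of-range value, including one produced by casting -1 to the
      enum type, maps to the "unknown" entry.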
      
      Link: http://lkml.kernel.org/r/20161216105634.235457-12-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cbfd0c10
    • Andrea Arcangeli's avatar
      userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete · 8c9e7bb7
      Andrea Arcangeli authored
      Don't stop running dup_fctx() even if userfaultfd_event_wait_completion
      fails as it has to run userfaultfd_ctx_put on all ctx to pair against
      the userfaultfd_ctx_get that was run on all fctx->orig in
      dup_userfaultfd.
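
      The pairing rule above can be illustrated with a small userspace
      model.  All names here (struct ctx, struct fork_ctx, complete_one,
      complete_all) are hypothetical, and remembering the first error is
      an assumption of the sketch rather than something the commit text
      states.

```c
#include <stddef.h>

/* Hypothetical refcounted context. */
struct ctx {
	int refcount;
};

/* Hypothetical list entry; a reference on orig was taken when queued. */
struct fork_ctx {
	struct ctx *orig;
	int fail;	/* makes the model's wait step fail for this entry */
};

/* Always drops the reference taken at queue time, even on failure. */
static int complete_one(struct fork_ctx *f)
{
	int ret = f->fail ? -1 : 0;

	f->orig->refcount--;
	return ret;
}

/* Visit every entry; a failure is recorded but does not stop the loop. */
int complete_all(struct fork_ctx *list, size_t n)
{
	int err = 0;

	for (size_t i = 0; i < n; i++) {
		int ret = complete_one(&list[i]);

		if (ret && !err)
			err = ret;
	}
	return err;
}
```

      Breaking out of the loop at the first failure would leave the
      remaining entries with an unbalanced reference count.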
      
      Link: http://lkml.kernel.org/r/20170224181957.19736-4-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c9e7bb7
    • Andrea Arcangeli's avatar
      userfaultfd: non-cooperative: robustness check · 9a69a829
      Andrea Arcangeli authored
      Similar to the handle_userfault() case, also make sure to never attempt
      to send any event past the PF_EXITING point of no return.
      
      This is purely a robustness check.
      
      Link: http://lkml.kernel.org/r/20170224181957.19736-3-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a69a829
    • Andrea Arcangeli's avatar
      userfaultfd: non-cooperative: rollback userfaultfd_exit · dd0db88d
      Andrea Arcangeli authored
      Patch series "userfaultfd non-cooperative further update for 4.11 merge
      window".
      
      Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
      more testing.  I've been doing testing before and this was also tested
      by kbuild bot and exercised by the selftest, but this bug never
      reproduced before.
      
      I dropped userfaultfd_exit as a result.  I dropped it because of the
      implementation difficulty in receiving signals in __mmput and because I
      think -ENOSPC as the result from the background UFFDIO_COPY should be
      enough already.
      
      Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
      wasn't exercised by the selftest and when I tried to exercise it, after
      moving it to a more correct place in __mmput where it would make more
      sense and where the vma list is stable, it resulted in the
      event_wait_completion in D state.  So then I added the second patch to
      be sure that even if we call userfaultfd_event_wait_completion too late
      during task exit(), we won't risk generating tasks in D state.  The
      same check exists in handle_userfault() for the same reason, except it
      makes a difference there, while here it is just a robustness check and
      it's run under WARN_ON_ONCE.
      
      While looking at the userfaultfd_event_wait_completion() function I
      looked back at its callers too and I think it's not ok to stop
      executing dup_fctx on the fcs list, because we rely on
      userfaultfd_event_wait_completion to execute
      userfaultfd_ctx_put(fctx->orig), which is paired against
      userfaultfd_ctx_get(fctx->orig) in dup_userfaultfd just before
      list_add(fcs).  This change only takes care of fctx->orig but this area
      also needs further review looking for similar problems in fctx->new.
      
      The only patch that is urgent is the first because it's a use after
      free during an SMP race condition that affects all processes if
      CONFIG_USERFAULTFD=y.  Very hard to reproduce though and probably
      impossible without SLUB poisoning enabled.
      
      This patch (of 3):
      
      I once reproduced this oops with the userfaultfd selftest.  It's not
      easily reproducible and it requires SLUB poisoning to reproduce.
      
          general protection fault: 0000 [#1] SMP
          Modules linked in:
          CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G               ------------ T 3.10.0+ #15
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
          task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
          RIP: 0010:[<ffffffff81451299>]  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
          RSP: 0018:ffff8801f833fe80  EFLAGS: 00010202
          RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
          RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
          RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
          R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
          R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
          FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          Call Trace:
            do_exit+0x297/0xd10
            SyS_exit+0x17/0x20
            tracesys+0xdd/0xe2
          Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 <4c> 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
          RIP  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
           RSP <ffff8801f833fe80>
          ---[ end trace 9fecd6dcb442846a ]---
      
      In the debugger I located the "mm" pointer in the stack and walking
      mm->mmap->vm_next to the end shows the vma->vm_next list is fully
      consistent and it is a null terminated list as expected.  So this has to
      be an SMP race condition where userfaultfd_exit was running while the
      vma list was being modified by another CPU.
      
      When userfaultfd_exit() ran, one of the ->vm_next pointers pointed to
      SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).
      
      The reason is that it's not running in __mmput but while there are still
      other threads running and it's not holding the mmap_sem (it can't as it
      has to wait for the event to be received by the manager).  So this is a
      use after free that was happening for all processes.
      
      One more implementation problem aside from the race condition:
      userfaultfd_exit really has to check a flag in mm->flags before walking
      the vma or it's going to slow down the exit() path for regular tasks.
      
      One more implementation problem: at that point signals can't be
      delivered so it would also create a task in D state if the manager
      doesn't read the event.
      
      The major design issue: it overall looks superfluous as the manager can
      check for -ENOSPC in the background transfer:
      
      	if (mmget_not_zero(ctx->mm)) {
      [..]
      	} else {
      		return -ENOSPC;
      	}
      
      It's safer to roll it back and re-introduce it later if at all.
      
      [rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
        Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd0db88d
    • Dan Williams's avatar
      x86, mm: unify exit paths in gup_pte_range() · b2e593e2
      Dan Williams authored
      All exit paths from gup_pte_range() require pte_unmap() of the original
      pte page before returning.  Refactor the code to have a single exit
      point to do the unmap.
      
      This mirrors the flow of the generic gup_pte_range() in mm/gup.c.
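
      A single exit point of this kind is often written in C with a
      cleanup label.  The sketch below is a userspace model only:
      map_entries(), unmap_entries() and collect_range() are hypothetical
      names, a negative table entry stands in for "cannot take this one",
      and whether the actual patch uses goto or break is not stated here.

```c
#include <stddef.h>

/* Tracks outstanding mappings so the model can be checked for leaks. */
static int map_count;

static const int *map_entries(const int *table)
{
	map_count++;
	return table;
}

static void unmap_entries(const int *mapped)
{
	(void)mapped;
	map_count--;
}

/*
 * Copy entries into out[] until a negative one is found.
 * Returns 1 if the whole range was collected, 0 otherwise.
 * Every path leaves through the same unmap call.
 */
int collect_range(const int *table, size_t n, int *out, size_t *nr)
{
	const int *base = map_entries(table);
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		if (base[i] < 0)
			goto out_unmap;
		out[(*nr)++] = base[i];
	}
	ret = 1;

out_unmap:
	unmap_entries(base);
	return ret;
}
```

      With one exit point, adding a new early-failure check cannot
      forget the unmap.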
      
      Link: http://lkml.kernel.org/r/148804251828.36605.14910389618497006945.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2e593e2
    • Dan Williams's avatar
      x86, mm: fix gup_pte_range() vs DAX mappings · ef947b25
      Dan Williams authored
      gup_pte_range() fails to check pte_allows_gup() before translating a DAX
      pte entry, pte_devmap(), to a page.  This allows writes to read-only
      mappings, and bypasses the DAX cacheline dirty tracking due to missed
      'mkwrite' faults.  The gup_huge_pmd() path and the gup_huge_pud() path
      correctly check pte_allows_gup() before checking for _devmap() entries.
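
      The ordering issue can be shown with a small model in which the
      permission test runs before any entry is translated.  The flag
      bits, entry_allows_access() and lookup_entry() are hypothetical
      and do not reflect the real x86 pte layout or the pte_allows_gup()
      implementation.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model. */
enum {
	ENTRY_PRESENT = 1 << 0,
	ENTRY_WRITE   = 1 << 1,
	ENTRY_DEVMAP  = 1 << 2,
};

enum lookup_result {
	LOOKUP_REFUSED,		/* caller falls back to the fault path */
	LOOKUP_DEVICE_PAGE,
	LOOKUP_NORMAL_PAGE,
};

/* Stand-in for a pte_allows_gup()-style permission test. */
static bool entry_allows_access(unsigned int flags, bool write)
{
	if (!(flags & ENTRY_PRESENT))
		return false;
	if (write && !(flags & ENTRY_WRITE))
		return false;
	return true;
}

enum lookup_result lookup_entry(unsigned int flags, bool write)
{
	/* The permission check comes first, for every kind of entry. */
	if (!entry_allows_access(flags, write))
		return LOOKUP_REFUSED;

	if (flags & ENTRY_DEVMAP)
		return LOOKUP_DEVICE_PAGE;

	return LOOKUP_NORMAL_PAGE;
}
```

      If the device-mapping test ran first, a write request against a
      read-only device entry would be translated without ever reaching
      the permission check.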
      
      Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings")
      Link: http://lkml.kernel.org/r/148804251312.36605.12665024794196605053.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Xiong Zhou <xzhou@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef947b25
    • Aneesh Kumar K.V's avatar
      power/mm: update pte_write and pte_wrprotect to handle savedwrite · d19469e8
      Aneesh Kumar K.V authored
      We use pte_write() to check whether the pte entry is writable.  This is
      mostly used to later mark the pte read only if it is writable.  The other
      use of pte_write() is to check whether the pte entry is writable so that
      the hardware page table entry can be marked accordingly.  This is used in
      kvm where we look at the qemu page table entry and update the hardware
      hash page table for the guest with the correct write enable bit.
      
      With the above, for the first usage we should also check the savedwrite
      bit so that we can correctly clear the savedwrite bit.  For the latter, we
      add a new variant __pte_write().
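
      The split between the two checks can be modelled as below.  The bit
      positions and the model_* names are hypothetical; the real powerpc
      pte layout and helpers differ, and clearing both bits in one step
      is a simplification made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, not the powerpc layout. */
#define MODEL_PAGE_WRITE	(UINT64_C(1) << 0)
#define MODEL_PAGE_SAVEDWRITE	(UINT64_C(1) << 1)

/* Hardware view only: models the role of the __pte_write() variant. */
bool model_hw_pte_write(uint64_t pte)
{
	return (pte & MODEL_PAGE_WRITE) != 0;
}

bool model_pte_savedwrite(uint64_t pte)
{
	return (pte & MODEL_PAGE_SAVEDWRITE) != 0;
}

/* Generic view: writable now, or writable once the saved bit is restored. */
bool model_pte_write(uint64_t pte)
{
	return model_hw_pte_write(pte) || model_pte_savedwrite(pte);
}

/* Write protection has to drop the saved bit as well as the hardware bit. */
uint64_t model_pte_wrprotect(uint64_t pte)
{
	return pte & ~(MODEL_PAGE_WRITE | MODEL_PAGE_SAVEDWRITE);
}
```

      A caller that programs hardware tables would use the hardware-only
      check, while a caller that is about to write-protect an entry would
      use the generic one so that a saved write bit is not left behind.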
      
      With this we can revert the write_protect_page part of 595cd8f2 ("mm/ksm:
      handle protnone saved writes when making page write protect").  But I left
      it as it is, as example code for the savedwrite check.
      
      Fixes: c137a275 ("powerpc/mm/autonuma: switch ppc64 to its own implementation of saved write")
      Link: http://lkml.kernel.org/r/1488203787-17849-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d19469e8
    • Aneesh Kumar K.V's avatar
      powerpc/mm: handle protnone ptes on fork · 52c50ca7
      Aneesh Kumar K.V authored
      We need to mark pages of the parent process read only on fork.  A NUMA
      fault pte needs a protnone pte variant with the saved write flag set.  On
      fork we need to make sure we remove the saved write bit.  Instead of
      adding the protnone check in the caller, update the ptep_set_wrprotect
      variants to clear the savedwrite bit.
      
      Without this we see random segfaults in applications on fork.
      
      Fixes: c137a275 ("powerpc/mm/autonuma: switch ppc64 to its own implementation of saved write")
      Link: http://lkml.kernel.org/r/1488203787-17849-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52c50ca7
    • Masahiro Yamada's avatar
      scripts/spelling.txt: add "overide" pattern and fix typo instances · 505d3085
      Masahiro Yamada authored
      Fix typos and add the following to the scripts/spelling.txt:
      
        overide||override
      
      While we are here, fix the doubled "address" in the touched line
      Documentation/devicetree/bindings/regulator/ti-abb-regulator.txt.
      
      Also, fix the comment block style in the touched hunks in
      drivers/media/dvb-frontends/drx39xyj/drx_driver.h.
      
      Link: http://lkml.kernel.org/r/1481573103-11329-21-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      505d3085
    • Masahiro Yamada's avatar
      scripts/spelling.txt: add "disble(d)" pattern and fix typo instances · 8a1115ff
      Masahiro Yamada authored
      Fix typos and add the following to the scripts/spelling.txt:
      
        disble||disable
        disbled||disabled
      
      I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c
      untouched.  The macro is not referenced at all, but this commit is
      touching only comment blocks just in case.
      
      Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a1115ff
    • Andrea Arcangeli's avatar
      userfaultfd: shmem: __do_fault requires VM_FAULT_NOPAGE · 6bbc4a41
      Andrea Arcangeli authored
      __do_fault assumes vmf->page has been initialized and is valid if
      VM_FAULT_NOPAGE is not returned by vma->vm_ops->fault(vma, vmf).
      
      handle_userfault() in turn should return VM_FAULT_NOPAGE if it doesn't
      return VM_FAULT_SIGBUS or VM_FAULT_RETRY (the other two possibilities).
      
      This VM_FAULT_NOPAGE case is only invoked when signals are pending and it
      didn't matter for anonymous memory before.  It only started to matter
      since shmem was introduced.  hugetlbfs also takes a different path and
      doesn't exercise __do_fault.
      
      Link: http://lkml.kernel.org/r/20170228154201.GH5816@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6bbc4a41
    • Linus Torvalds's avatar
      Merge tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · c1aa905a
      Linus Torvalds authored
      Pull power management fixes from Rafael Wysocki:
       "These fix several issues in the intel_pstate driver and one issue in
        the schedutil cpufreq governor, clean up that governor a bit and hook
        up existing code for disabling cpufreq to a new kernel command line
        option.
      
        Specifics:
      
         - Three fixes for intel_pstate problems related to the passive mode
           (in which it acts as a regular cpufreq scaling driver), two for the
           handling of global P-state limits and one for the handling of the
           cpu_frequency tracepoint in that mode (Rafael Wysocki).
      
         - Three fixes for the handling of P-state limits in intel_pstate in
           the active mode (Rafael Wysocki).
      
         - Introduction of a new cpufreq.off=1 kernel command line argument
           that will disable cpufreq entirely if passed to the kernel and is
           simply hooked up to the existing code used by Xen (Len Brown).
      
         - Fix for the schedutil cpufreq governor to prevent it from using
           stale raw frequency values in configurations with multiple CPUs
           sharing one policy object and a cleanup for it reducing its
           overhead slightly (Viresh Kumar)"
      
      * tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy
        cpufreq: intel_pstate: Fix intel_pstate_verify_policy()
        cpufreq: intel_pstate: Fix global settings in active mode
        cpufreq: Add the "cpufreq.off=1" cmdline option
        cpufreq: schedutil: Pass sg_policy to get_next_freq()
        cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
        cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily
        cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()
        cpufreq: intel_pstate: Do not use performance_limits in passive mode
      c1aa905a