1. 11 Jan, 2017 26 commits
    • mm/hugetlb.c: fix reservation race when freeing surplus pages · e5bbc8a6
      Mike Kravetz authored
      return_unused_surplus_pages() decrements the global reservation count,
      and frees any unused surplus pages that were backing the reservation.
      
      Commit 7848a4bf ("mm/hugetlb.c: add cond_resched_lock() in
      return_unused_surplus_pages()") added a call to cond_resched_lock in the
      loop freeing the pages.
      
      As a result, the hugetlb_lock could be dropped, and someone else could
      use the pages that will be freed in subsequent iterations of the loop.
      This could result in inconsistent global hugetlb page state, application
      API failures (such as mmap failures) or application crashes.
      
      When dropping the lock in return_unused_surplus_pages, make sure that
      the global reservation count (resv_huge_pages) remains sufficiently
      large to prevent someone else from claiming pages about to be freed.
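
      A minimal sketch of the idea in the paragraph above, using a toy model
      rather than the kernel code.  The names struct pool, claimable(),
      lock_drop_point() and return_unused() are hypothetical; the sketch
      assumes that pages another task could claim equal free pages minus
      reserved pages, and that the lock may be dropped once per loop
      iteration.

```c
#include <assert.h>

/* Toy model of the global pool state; all names are illustrative. */
struct pool {
	long free_pages;     /* pages currently sitting in the pool */
	long resv_pages;     /* pages promised to existing reservations */
	long max_claimable;  /* largest claimable count seen at a drop point */
};

/* Pages a concurrent task could take if it acquired the lock now. */
static long claimable(const struct pool *p)
{
	return p->free_pages - p->resv_pages;
}

/* Stand-in for the point where the lock may be dropped and retaken. */
static void lock_drop_point(struct pool *p)
{
	if (claimable(p) > p->max_claimable)
		p->max_claimable = claimable(p);
}

/*
 * Release 'unused' reserved pages one at a time.  The reservation count
 * is lowered by one per freed page, not by 'unused' up front, so the
 * pages still waiting to be freed stay covered by the reservation at
 * every drop point.
 */
static void return_unused(struct pool *p, long unused)
{
	assert(unused <= p->resv_pages && unused <= p->free_pages);

	while (unused-- > 0) {
		p->resv_pages--;
		p->free_pages--;   /* models freeing one surplus page */
		lock_drop_point(p);
	}
}
```

      In this model, lowering resv_pages by the full amount before the loop
      would make the not-yet-freed pages look claimable at the first drop
      point, which is the race described above.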
      
      Analyzed by Paul Cassella.
      
      Fixes: 7848a4bf ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()")
      Link: http://lkml.kernel.org/r/1483991767-6879-1-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Paul Cassella <cassella@cray.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: <stable@vger.kernel.org>	[3.15+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab.c: fix SLAB freelist randomization duplicate entries · c4e490cf
      John Sperbeck authored
      This patch fixes a bug in the freelist randomization code.  When a high
      random number is used, the freelist will contain duplicate entries, so
      different allocations end up sharing the same chunk.

      This results in odd behaviour and crashes.  It should be uncommon, but
      how often it happens depends on the machine.  We saw it more often on
      some machines (every few hours of running tests).
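
      The message does not show the fix itself, so the following is only an
      illustrative sketch of one way to walk a pre-shuffled index list from
      a random starting point without producing duplicates.  The names
      struct walk_state, walk_init(), walk_next() and walk_is_permutation()
      are hypothetical and are not taken from mm/slab.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Walk over a pre-shuffled list of slot indices; names are illustrative. */
struct walk_state {
	const unsigned char *list;  /* a permutation of 0..count-1 */
	unsigned int count;
	unsigned int pos;           /* next position to read */
};

/* Use the random value only to choose where the walk starts. */
static void walk_init(struct walk_state *s, const unsigned char *list,
		      unsigned int count, unsigned int rand)
{
	assert(count > 0 && count <= 256);
	s->list = list;
	s->count = count;
	s->pos = rand % count;
}

/* Return the next slot index, wrapping around at the end of the list. */
static unsigned char walk_next(struct walk_state *s)
{
	if (s->pos >= s->count)
		s->pos = 0;
	return s->list[s->pos++];
}

/* True if 'count' successive slots are in range and contain no duplicate. */
static bool walk_is_permutation(struct walk_state *s)
{
	bool seen[256] = { false };
	unsigned int i;

	for (i = 0; i < s->count; i++) {
		unsigned char idx = walk_next(s);

		if (idx >= s->count || seen[idx])
			return false;
		seen[idx] = true;
	}
	return true;
}
```

      Because the random value only selects the starting position, even a
      very large value cannot make two steps of the walk return the same
      index.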
      
      Fixes: c7ce4f60 ("mm: SLAB freelist randomization")
      Link: http://lkml.kernel.org/r/20170103181908.143178-1-thgarnie@google.com
      Signed-off-by: John Sperbeck <jsperbeck@google.com>
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: support BDI_CAP_STABLE_WRITES · b09ab054
      Minchan Kim authored
      zram has used the per-cpu stream feature since v4.7.  It aims to
      increase the cache hit ratio of the scratch buffer used for compression.
      The downside of that approach is that zram must request memory for the
      compressed page in per-cpu context, which requires a stricter gfp flag,
      so the allocation can fail.  If it fails, zram retries the allocation
      outside of per-cpu context, where it can succeed, then compresses the
      data again and copies it into the allocated memory.

      In this scenario, zram assumes the data never changes, but that is not
      true without stable page support.  If the data changes under us, zram
      can overrun the buffer, which breaks the zsmalloc free object chain and
      crashes the system as in the report below:
      
         https://bugzilla.suse.com/show_bug.cgi?id=997574
      
      This patch adds BDI_CAP_STABLE_WRITES to zram for declaring "I am block
      device needing *stable write*".
      
      Fixes: da9556a2 ("zram: user per-cpu compression streams")
      Link: http://lkml.kernel.org/r/1482366980-3782-4-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Hyeoncheol Lee <cheol.lee@lge.com>
      Cc: <yjay.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: <stable@vger.kernel.org> [4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: revalidate disk under init_lock · e7ccfc4c
      Minchan Kim authored
      Commit b4c5c609 ("zram: avoid lockdep splat by revalidate_disk")
      moved the revalidate_disk call out of init_lock to avoid a lockdep
      false-positive splat.  However, commit 08eee69f ("zram: remove
      init_lock in zram_make_request") removed init_lock from the IO path, so
      there is no longer any risk of a lockdep splat.  So, let's restore it.

      This patch is needed to set BDI_CAP_STABLE_WRITES atomically in the next
      patch.
      
      Fixes: da9556a2 ("zram: user per-cpu compression streams")
      Link: http://lkml.kernel.org/r/1482366980-3782-3-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Hyeoncheol Lee <cheol.lee@lge.com>
      Cc: <yjay.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: <stable@vger.kernel.org> [4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: support anonymous stable page · f0571429
      Minchan Kim authored
      During development of zram-swap asynchronous writeback, I found strange
      corruption of a compressed page, resulting in:
      
        Modules linked in: zram(E)
        CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G            E   4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        task: ffff88007620b840 task.stack: ffff880078090000
        RIP: set_freeobj.part.43+0x1c/0x1f
        RSP: 0018:ffff880078093ca8  EFLAGS: 00010246
        RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
        RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
        RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
        R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
        R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
        FS:  0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
        Call Trace:
          obj_malloc+0x22b/0x260
          zs_malloc+0x1e4/0x580
          zram_bvec_rw+0x4cd/0x830 [zram]
          page_requests_rw+0x9c/0x130 [zram]
          zram_thread+0xe6/0x173 [zram]
          kthread+0xca/0xe0
          ret_from_fork+0x25/0x30
      
      Investigation revealed that stable pages currently do not support
      anonymous pages.  IOW, reuse_swap_page can reuse the page without
      waiting for writeback to complete, so it can overwrite a page that zram
      is compressing.

      Unfortunately, zram has used the per-cpu stream feature since v4.7.
      It aims to increase the cache hit ratio of the scratch buffer used for
      compression.  The downside of that approach is that zram must request
      memory for the compressed page in per-cpu context, which requires a
      stricter gfp flag, so the allocation can fail.  If it fails, zram
      retries the allocation outside of per-cpu context, where it can
      succeed, then compresses the data again and copies it into the
      allocated memory.

      In this scenario, zram assumes the data never changes, but that is
      not true without stable page support.  If the data changes under us,
      zram can overrun the buffer: the second compression can produce a
      larger result than the previous attempt did, and zram blindly copies
      the bigger object into the smaller buffer.  The overrun breaks the
      zsmalloc free object chain, so the system crashes as shown above.

      I think the report below is the same problem:
      https://bugzilla.suse.com/show_bug.cgi?id=997574

      Unfortunately, reuse_swap_page must be atomic, so we cannot wait on
      writeback there.  The approach in this patch is simply to return false
      if we find that the page needs to be stable.  Although this temporarily
      increases the memory footprint, it happens rarely and the memory should
      be reclaimed easily when it does.  It is also better than waiting for
      IO completion, which is on the critical path for application latency.
      
      Fixes: da9556a2 ("zram: user per-cpu compression streams")
      Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
      Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Hyeoncheol Lee <cheol.lee@lge.com>
      Cc: <yjay.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: <stable@vger.kernel.org> [4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add documentation for page fragment APIs · 4d09d0f4
      Alexander Duyck authored
      This is a first pass at trying to add documentation for the page_frag
      APIs.  They may still change over time but for now I thought I would try
      to get these documented so that as more network drivers and stack calls
      make use of them we have one central spot to document how they are meant
      to be used.
      
      Link: http://lkml.kernel.org/r/20170104024157.13451.6758.stgit@localhost.localdomain
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename __page_frag functions to __page_frag_cache, drop order from drain · 2976db80
      Alexander Duyck authored
      This patch does two things.
      
      First it goes through and renames the __page_frag prefixed functions to
      __page_frag_cache so that we can be clear that we are draining or
      refilling the cache, not the frags themselves.
      
      Second we drop the order parameter from __page_frag_cache_drain since we
      don't actually need to pass it since all fragments are either order 0 or
      must be a compound page.
      
      Link: http://lkml.kernel.org/r/20170104023954.13451.5678.stgit@localhost.localdomain
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free · 8c2dd3e4
      Alexander Duyck authored
      Patch series "Page fragment updates", v4.
      
      This patch series takes care of a few cleanups for the page fragments
      API.
      
      First we do some renames so that things are much more consistent.  We
      move the page_frag_ portion of the name to the front of the function
      names.  Secondly we split out the cache specific functions from the
      other page fragment functions by adding the word "cache" to the name.
      
      Finally I added a bit of documentation that will hopefully help to
      explain some of this.  I plan to revisit this later as we get things
      more ironed out in the near future with the changes planned for the DMA
      setup to support eXpress Data Path.
      
      This patch (of 3):
      
      This patch renames the page frag functions to be more consistent with
      other APIs.  Specifically we place the name page_frag first in the name
      and then have either an alloc or free call name that we append as the
      suffix.  This makes it a bit clearer in terms of naming.
      
      In addition we drop the leading double underscores since we are
      technically no longer a backing interface and instead the front end that
      is called from the networking APIs.
      
      Link: http://lkml.kernel.org/r/20170104023854.13451.67390.stgit@localhost.localdomain
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: fix the active list aging for lowmem requests when memcg is enabled · b4536f0c
      Michal Hocko authored
      Nils Holland and Klaus Ethgen have reported unexpected OOM killer
      invocations with 32b kernels starting with 4.8:
      
      	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
      	kworker/u4:5 cpuset=/ mems_allowed=0
      	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
      	[...]
      	Mem-Info:
      	active_anon:58685 inactive_anon:90 isolated_anon:0
      	 active_file:274324 inactive_file:281962 isolated_file:0
      	 unevictable:0 dirty:649 writeback:0 unstable:0
      	 slab_reclaimable:40662 slab_unreclaimable:17754
      	 mapped:7382 shmem:202 pagetables:351 bounce:0
      	 free:206736 free_pcp:332 free_cma:0
      	Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
      	DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
      	lowmem_reserve[]: 0 813 3474 3474
      	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
      	lowmem_reserve[]: 0 0 21292 21292
      	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
      
      The OOM killer is clearly premature because there is still a lot
      of page cache in the zone Normal which should satisfy this lowmem
      request.  Further debugging has shown that the reclaim cannot make any
      forward progress because the page cache is hidden in the active list
      which doesn't get rotated because inactive_list_is_low is not memcg
      aware.

      The code simply subtracts per-zone highmem counters from the respective
      memcg's lru sizes which doesn't make any sense.  We can simply end up
      always seeing the resulting active and inactive counts 0 and return
      false.  This issue is not limited to 32b kernels but in practice the
      effect on systems without CONFIG_HIGHMEM would be much harder to notice
      because we do not invoke the OOM killer for allocation requests
      targeting < ZONE_NORMAL.
      
      Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
      and subtract per-memcg highmem counts when memcg is enabled.  Introduce
      helper lruvec_zone_lru_size which redirects to either zone counters or
      mem_cgroup_get_zone_lru_size when appropriate.
      
      We are losing empty LRU but non-zero lru size detection introduced by
      ca707239 ("mm: update_lru_size warn and reset bad lru_size") because
      of the inherent zone vs. node discrepancy.
      
      Fixes: f8d1a311 ("mm: consider whether to decivate based on eligible zones inactive ratio")
      Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Nils Holland <nholland@tisys.org>
      Tested-by: Nils Holland <nholland@tisys.org>
      Reported-by: Klaus Ethgen <Klaus@Ethgen.de>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: don't dereference struct page fields of invalid pages · f073bdc5
      Ard Biesheuvel authored
      The VM_BUG_ON() check in move_freepages() checks whether the node id of
      a page matches the node id of its zone.  However, it does this before
      having checked whether the struct page pointer refers to a valid struct
      page to begin with.  This is guaranteed in most cases, but may not be
      the case if CONFIG_HOLES_IN_ZONE=y.
      
      So reorder the VM_BUG_ON() with the pfn_valid_within() check.
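
      The ordering can be sketched with a toy model in which a NULL slot
      stands for a hole and assert() stands in for VM_BUG_ON().  The names
      struct toy_page and count_pages_on_node() are hypothetical and do not
      come from the kernel; this is an illustration, not the patched
      function.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct page; only the node id matters here. */
struct toy_page {
	int nid;
};

/*
 * Count the valid pages in a range.  A NULL slot models a hole with no
 * struct page behind it.  The validity check comes first, so the node id
 * is only read from slots that are known to be valid.
 */
static int count_pages_on_node(struct toy_page *const *pages, size_t n,
			       int zone_nid)
{
	int counted = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!pages[i])
			continue;  /* hole: skip without dereferencing */

		/* Sanity check, safe only after the validity check above. */
		assert(pages[i]->nid == zone_nid);
		counted++;
	}
	return counted;
}
```

      With the two checks in the opposite order, the sanity check would read
      the node id through the invalid slot before the hole was detected.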
      
      Link: http://lkml.kernel.org/r/1481706707-6211-2-git-send-email-ard.biesheuvel@linaro.org
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Hanjun Guo <hanjun.guo@linaro.org>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Robert Richter <rrichter@cavium.com>
      Cc: James Morse <james.morse@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mailmap: add codeaurora.org names for nameless email commits · 9ebf73b2
      Stephen Boyd authored
      Some codeaurora.org emails have crept in but the names don't exist for
      them.  Add the names for the emails so git can match everyone up.
      
      Link: http://lkml.kernel.org/r/20170104194611.25933-1-sboyd@codeaurora.org
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Sarangdhar Joshi <spjoshi@codeaurora.org>
      Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Cc: Subhash Jadavani <subhashj@codeaurora.org>
      Cc: Thomas Pedersen <twp@codeaurora.org>
      Cc: Andy Gross <andy.gross@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signal: protect SIGNAL_UNKILLABLE from unintentional clearing. · 2d39b3cd
      Jamie Iles authored
      Since commit 00cd5c37 ("ptrace: permit ptracing of /sbin/init") we
      can now trace init processes.  init is initially protected with
      SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
      there are a number of paths during tracing where SIGNAL_UNKILLABLE can
      be implicitly cleared.
      
      This can result in init becoming stoppable/killable after tracing.  For
      example, running:
      
        while true; do kill -STOP 1; done &
        strace -p 1
      
      and then stopping strace and the kill loop will result in init being
      left in state TASK_STOPPED.  Sending SIGCONT to init will resume it, but
      init will now respond to future SIGSTOP signals rather than ignoring
      them.
      
      Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
      that we don't clear SIGNAL_UNKILLABLE.
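
      A minimal sketch of a flag update that replaces only the stop bits and
      leaves every other bit untouched.  The SIG_* constants, their values
      and the helper set_stop_flags() are hypothetical stand-ins chosen for
      illustration, not the kernel's definitions.

```c
#include <assert.h>

/* Illustrative flag values; the real kernel constants may differ. */
#define SIG_STOP_STOPPED    0x01u
#define SIG_STOP_CONTINUED  0x02u
#define SIG_UNKILLABLE      0x40u
#define SIG_STOP_MASK       (SIG_STOP_STOPPED | SIG_STOP_CONTINUED)

/*
 * Return 'cur' with its stop bits replaced by 'stop_flags'.  Bits outside
 * SIG_STOP_MASK, such as SIG_UNKILLABLE, are carried over unchanged.
 */
static unsigned int set_stop_flags(unsigned int cur, unsigned int stop_flags)
{
	/* Callers may only pass stop bits. */
	assert((stop_flags & ~SIG_STOP_MASK) == 0);

	return (cur & ~SIG_STOP_MASK) | stop_flags;
}
```

      A plain assignment of the new stop state to the flags word would drop
      the unkillable bit, which is the implicit clearing described above.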
      
      Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
      Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: pmd dirty emulation in page fault handler · 20f664aa
      Minchan Kim authored
      Andreas reported that [1] made a test in jemalloc hang in THP mode on
      arm64:

        http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de

      The problem is that the page fault handler currently doesn't support
      dirty bit emulation of pmd for architectures without a HW dirty bit, so
      the application is stuck until the VM marks the pmd dirty.

      How the emulation works depends on the architecture.  In the case of
      arm64, when it first sets up a pte, it sets PTE_RDONLY so that it gets
      a chance to mark the pte dirty via a page fault when a store access
      happens.  Once the page fault occurs, the VM marks the pmd dirty and the
      arch code for setting the pmd clears PTE_RDONLY so the application can
      proceed.

      IOW, if the VM doesn't mark the pmd dirty, the application hangs forever
      on repeated faults (i.e., store op but the pmd is PTE_RDONLY).
      
      This patch enables pmd dirty-bit emulation for those architectures.
      
      [1] b8d3c4c3, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called
      
      Fixes: b8d3c4c3 ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
      Link: http://lkml.kernel.org/r/1482506098-6149-1-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Andreas Schwab <schwab@suse.de>
      Tested-by: Andreas Schwab <schwab@suse.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jason Evans <je@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: <stable@vger.kernel.org> [4.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/sem.c: fix incorrect sem_lock pairing · c626bc46
      Manfred Spraul authored
      Based on the syzkaller test case from dvyukov:
      
        https://gist.githubusercontent.com/dvyukov/d0e5efefe4d7d6daed829f5c3ca26a40/raw/08d0a261fe3c987bed04fbf267e08ba04bd533ea/gistfile1.txt
      
      The slow (i.e.: failure to acquire) syscall exit from semtimedop()
      incorrectly assumed that the same lock is acquired as it was at the
      initial syscall entry.
      
      This is wrong:
       - thread A: single semop semop(), sleeps
       - thread B: multi semop semop(), sleeps
       - thread A: woken up by signal/timeout
      
      With this sequence, the initial sem_lock() call locks the per-semaphore
      spinlock, and it is unlocked with sem_unlock().  The call at the syscall
      return locks the global spinlock.  Because locknum is not updated, the
      following sem_unlock() call unlocks the per-semaphore spinlock, which is
      actually not locked.
      
      The fix is trivial: Use the return value from sem_lock.
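
      The pairing rule can be sketched with a toy model.  The names struct
      toy_sem_array, toy_sem_lock(), toy_sem_unlock() and LOCK_GLOBAL are
      hypothetical, and the rule for choosing between the per-semaphore lock
      and the global lock is a simplified assumption, not the logic in
      ipc/sem.c.

```c
#include <assert.h>

#define LOCK_GLOBAL (-1)
#define TOY_NSEMS   4

/* Toy semaphore array; the flags only record which lock is held. */
struct toy_sem_array {
	int complex_count;             /* > 0 while a multi-sem op sleeps */
	int global_held;
	int per_sem_held[TOY_NSEMS];
};

/*
 * Take either the per-semaphore lock or the global lock and return which
 * one was taken.  The choice can differ between two calls made by the
 * same task, so callers must keep the most recent return value.
 */
static int toy_sem_lock(struct toy_sem_array *sma, int semnum, int nsops)
{
	assert(semnum >= 0 && semnum < TOY_NSEMS);

	if (nsops == 1 && sma->complex_count == 0) {
		sma->per_sem_held[semnum] = 1;
		return semnum;
	}
	sma->global_held = 1;
	return LOCK_GLOBAL;
}

/* Release exactly the lock identified by 'locknum'. */
static void toy_sem_unlock(struct toy_sem_array *sma, int locknum)
{
	if (locknum == LOCK_GLOBAL)
		sma->global_held = 0;
	else
		sma->per_sem_held[locknum] = 0;
}
```

      A caller that reuses the value from an earlier toy_sem_lock() call
      after locking again would release a lock it does not hold and leave
      the one it does hold locked.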
      
      Fixes: 370b262c ("ipc/sem: avoid idr tree lookup for interrupted semop")
      Link: http://lkml.kernel.org/r/1482215645-22328-1-git-send-email-manfred@colorfullife.com
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: Johanna Abrahamsson <johanna@mjao.org>
      Tested-by: Johanna Abrahamsson <johanna@mjao.org>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/Kconfig.debug: fix frv build failure · da0510c4
      Sudip Mukherjee authored
      The build of frv allmodconfig was failing with errors like:
      
        /tmp/cc0JSPc3.s: Assembler messages:
        /tmp/cc0JSPc3.s:1839: Error: symbol `.LSLT0' is already defined
        /tmp/cc0JSPc3.s:1842: Error: symbol `.LASLTP0' is already defined
        /tmp/cc0JSPc3.s:1969: Error: symbol `.LELTP0' is already defined
        /tmp/cc0JSPc3.s:1970: Error: symbol `.LELT0' is already defined
      
      Commit 866ced95 ("kbuild: Support split debug info v4") introduced
      splitting the debug info and keeping that in a separate file.  Somehow,
      the frv-linux gcc did not like that and I am guessing that instead of
      splitting it started copying.  The first report about this is at:
      
        https://lists.01.org/pipermail/kbuild-all/2015-July/010527.html.
      
      I will try and see if this can work with frv and if it still fails I
      will open a bug report with gcc.  But meanwhile this is the easiest
      option to solve the build failure of frv.
      
      Fixes: 866ced95 ("kbuild: Support split debug info v4")
      Link: http://lkml.kernel.org/r/1482062348-5352-1-git-send-email-sudipm.mukherjee@gmail.com
      Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: get rid of __GFP_OTHER_NODE · 41b6167e
      Michal Hocko authored
      The flag was introduced by commit 78afd561 ("mm: add
      __GFP_OTHER_NODE flag") to allow proper accounting of remote node
      allocations done by kernel daemons on behalf of a process - e.g.
      khugepaged.
      
      After "mm: fix remote numa hits statistics" we neither need nor actually
      use the flag, so we can safely remove it because all allocations which
      are satisfied from their "home" node are accounted properly.
      
      [mhocko@suse.com: fix build]
      Link: http://lkml.kernel.org/r/20170106122225.GK5556@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170102153057.9451-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix remote numa hits statistics · 2df26639
      Michal Hocko authored
      Jia He has noticed that commit b9f00e14 ("mm, page_alloc: reduce
      branches in zone_statistics") has an unintentional side effect that
      remote node allocation requests are accounted as NUMA_MISS rather than
      NUMA_HIT and NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE.
      
      There are many of these potentially because the flag is used very rarely
      while we have many users of __alloc_pages_node.
      
      Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a
      follow up patch) and treat all allocations that were satisfied from the
      preferred zone's node as NUMA_HITS because this is the same node we
      requested the allocation from in most cases.  If this is not the local
      node then we just account it as NUMA_OTHER rather than NUMA_LOCAL.
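
      The accounting rule described above can be sketched as a small
      classifier.  The names struct numa_counts and account_alloc() are
      hypothetical, the counters are kept in one struct for brevity, and
      counting a miss when the allocation lands on a non-preferred node is
      an assumption of this sketch.

```c
/* Toy counters; real kernels keep such statistics elsewhere. */
struct numa_counts {
	unsigned long hit;    /* satisfied from the preferred node */
	unsigned long miss;   /* satisfied from some other node */
	unsigned long local;  /* satisfied from the node we run on */
	unsigned long other;  /* satisfied from a node we do not run on */
};

/*
 * Account one allocation.
 *   preferred_nid: node the request asked for
 *   alloc_nid:     node that actually satisfied it
 *   local_nid:     node the caller is running on
 * No per-request flag is consulted; only the node ids matter.
 */
static void account_alloc(struct numa_counts *c, int preferred_nid,
			  int alloc_nid, int local_nid)
{
	if (alloc_nid == preferred_nid)
		c->hit++;
	else
		c->miss++;

	if (alloc_nid == local_nid)
		c->local++;
	else
		c->other++;
}
```

      For a request that targets remote node 1 from a task running on node 0
      and is satisfied from node 1, this counts one hit and one other.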
      
      One downside would be that an allocation request for a node which is
      outside of the mempolicy nodemask would be reported as a hit which is a
      bit weird but that was the case before b9f00e14 already.
      
      Fixes: b9f00e14 ("mm, page_alloc: reduce branches in zone_statistics")
      Link: http://lkml.kernel.org/r/20170102153057.9451-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Jia He <hejianet@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz> # with cbmc[1] superpowers
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done} · f931ab47
      Dan Williams authored
      Both arch_add_memory() and arch_remove_memory() expect a single threaded
      context.
      
      For example, arch/x86/mm/init_64.c::kernel_physical_mapping_init() does
      not hold any locks over this check and branch:
      
          if (pgd_val(*pgd)) {
          	pud = (pud_t *)pgd_page_vaddr(*pgd);
          	paddr_last = phys_pud_init(pud, __pa(vaddr),
          				   __pa(vaddr_end),
          				   page_size_mask);
          	continue;
          }
      
          pud = alloc_low_page();
          paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
          			   page_size_mask);
      
      The result is that two threads calling devm_memremap_pages()
      simultaneously can end up colliding on pgd initialization.  This leads
      to crash signatures like the following where the loser of the race
      initializes the wrong pgd entry:
      
          BUG: unable to handle kernel paging request at ffff888ebfff0000
          IP: memcpy_erms+0x6/0x10
          PGD 2f8e8fc067 PUD 0 /* <---- Invalid PUD */
          Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
          CPU: 54 PID: 3818 Comm: systemd-udevd Not tainted 4.6.7+ #13
          task: ffff882fac290040 ti: ffff882f887a4000 task.ti: ffff882f887a4000
          RIP: memcpy_erms+0x6/0x10
          [..]
          Call Trace:
            ? pmem_do_bvec+0x205/0x370 [nd_pmem]
            ? blk_queue_enter+0x3a/0x280
            pmem_rw_page+0x38/0x80 [nd_pmem]
            bdev_read_page+0x84/0xb0
      
      Hold the standard memory hotplug mutex over calls to
      arch_{add,remove}_memory().
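      
      The check and branch quoted above is a check-then-act sequence.  As a
      minimal user-space sketch (not the kernel's page-table code), the
      following replays one bad interleaving deterministically and then the
      same two callers serialized, as they would be under a mutex.  The names
      slot and alloc_table() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page-table slot: NULL means "not populated". */
static int tables[2];
static int next_table;

static int *alloc_table(void)
{
	assert(next_table < 2);
	return &tables[next_table++];
}

/* Both callers check before either one stores: two tables get allocated
 * and the second store overwrites the first. Returns the allocation count. */
int racy_interleaving(void)
{
	int *slot = NULL;
	next_table = 0;

	int a_sees_empty = (slot == NULL);   /* caller A: check */
	int b_sees_empty = (slot == NULL);   /* caller B: check */
	if (a_sees_empty)
		slot = alloc_table();        /* caller A: act */
	if (b_sees_empty)
		slot = alloc_table();        /* caller B: act, clobbers A's entry */
	(void)slot;
	return next_table;
}

/* With check and act done as one unit per caller, B sees A's entry. */
int serialized(void)
{
	int *slot = NULL;
	next_table = 0;

	if (slot == NULL)                    /* caller A: check + act */
		slot = alloc_table();
	if (slot == NULL)                    /* caller B: check + act */
		slot = alloc_table();
	(void)slot;
	return next_table;
}
```

      In this toy model the racy ordering allocates twice and the serialized
      ordering allocates once, which is the property the hotplug mutex is
      meant to restore.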
      
      Fixes: 41e94a85 ("add devm_memremap_pages")
      Link: http://lkml.kernel.org/r/148357647831.9498.12606007370121652979.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f931ab47
    • Eric Ren's avatar
      ocfs2: fix crash caused by stale lvb with fsdlm plugin · e7ee2c08
      Eric Ren authored
      The crash happens rather often when we reset some cluster nodes while
      nodes contend fiercely to do truncate and append.
      
      The crash backtrace is below:
      
         dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
         dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
         ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
         ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
         ocfs2: Beginning quota recovery on device (253,18) for slot 2
         ocfs2: Finishing quota recovery on device (253,18) for slot 2
         (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
         (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
         ------------[ cut here ]------------
         kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
         invalid opcode: 0000 [#1] SMP
         Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod    iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport      joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix               drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd       usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
         Supported: No, Unsupported modules are loaded
         CPU: 1 PID: 30154 Comm: truncate Tainted: G           OE   N  4.4.21-69-default #1
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
         task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
         RIP: 0010:[<ffffffffa05c8c30>]  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
         RSP: 0018:ffff880074e6bd50  EFLAGS: 00010282
         RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
         RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
         RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
         R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
         R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
         FS:  00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
         CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
         Call Trace:
           ocfs2_setattr+0x698/0xa90 [ocfs2]
           notify_change+0x1ae/0x380
           do_truncate+0x5e/0x90
           do_sys_ftruncate.constprop.11+0x108/0x160
           entry_SYSCALL_64_fastpath+0x12/0x6d
         Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
         RIP  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
      
      This happens because ocfs2_inode_lock() gets a stale LVB in which the
      i_size is not equal to the disk i_size.  We mistakenly trust the LVB
      because the underlying fsdlm dlm_lock() doesn't set lkb_sbflags with
      DLM_SBF_VALNOTVALID properly for us.  But, why?
      
      The current code tries to downconvert the lock without the
      DLM_LKF_VALBLK flag to tell o2cb not to update the RSB's LVB if it's a
      PR->NULL conversion, even if the lock resource type needs an LVB.  This
      is not the right way for fsdlm.
      
      The fsdlm plugin behaves differently with respect to DLM_LKF_VALBLK: it
      depends on DLM_LKF_VALBLK to decide whether we care about the LVB in the
      LKB.  If DLM_LKF_VALBLK is not set, fsdlm will skip both recovering the
      RSB's LVB from this lkb and setting DLM_SBF_VALNOTVALID appropriately
      when node failure happens.
      
      The following diagram briefly illustrates how this crash happens:
      
      RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
      
      The 1st round:
      
                   Node1                                    Node2
      RSB1: PR
                                                        RSB1(master): NULL->EX
      ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
        ocfs2_dlm_lock(no DLM_LKF_VALBLK)
      
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      
      dlm_lock(no DLM_LKF_VALBLK)
        convert_lock(overwrite lkb->lkb_exflags
                     with no DLM_LKF_VALBLK)
      
      RSB1: NULL                                        RSB1: EX
                                                        reset Node2
      dlm_recover_rsbs()
        recover_lvb()
      
      /* The LVB is not trustable if the node with EX fails and
       * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
       */
      
       if(!(lkb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance
                 return;                    * to invalidate the LVB here.
                                            */
      
      The 2nd round:
      
               Node 1                                Node2
      RSB1(become master from recovery)
      
      ocfs2_setattr()
        ocfs2_inode_lock(NULL->EX)
          /* dlm_lock() returns the stale lvb without setting DLM_SBF_VALNOTVALID */
          ocfs2_meta_lvb_is_trustable() returns 1 /* so we don't refresh inode from disk */
        ocfs2_truncate_file()
            mlog_bug_on_msg(disk isize != i_size_read(inode))  /* crash! */
      
      The fix is quite straightforward.  We keep setting the DLM_LKF_VALBLK
      flag for dlm_lock() if the lock resource type needs an LVB and the fsdlm
      plugin is used.
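      
      The rule described above can be sketched as a small flag-selection
      helper.  This is an illustration only, not the actual ocfs2 code: the
      helper name, its parameters and the SKETCH_* constant values are
      assumptions made for the example.

```c
#include <stdbool.h>

/* Illustrative values only; the real constants live in the kernel headers. */
#define SKETCH_LKF_VALBLK     0x1u
#define SKETCH_TYPE_USES_LVB  0x1u

/* Hypothetical helper: flags to pass to dlm_lock() on a downconvert. */
unsigned int downconvert_flags(unsigned int lock_type_flags,
			       bool using_fsdlm, bool set_lvb)
{
	unsigned int flags = 0;

	/* fsdlm: always request the value block for lock types that carry
	 * an LVB, so that recovery gets the chance to invalidate it. */
	if (using_fsdlm && (lock_type_flags & SKETCH_TYPE_USES_LVB))
		flags |= SKETCH_LKF_VALBLK;

	/* Otherwise only when the caller wants the LVB updated. */
	if (set_lvb)
		flags |= SKETCH_LKF_VALBLK;

	return flags;
}
```

      With these assumptions, the PR->NULL downconvert from the first round
      (set_lvb == 0) would still carry the value-block flag when fsdlm is in
      use, and would not carry it otherwise.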
      
      Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
      Signed-off-by: Eric Ren <zren@suse.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7ee2c08
    • Michal Hocko's avatar
      bpf: do not use KMALLOC_SHIFT_MAX · 7984c27c
      Michal Hocko authored
      Commit 01b3f521 ("bpf: fix allocation warnings in bpf maps and
      integer overflow") added checks for the maximum allocatable size.
      It (ab)used KMALLOC_SHIFT_MAX for that purpose.
      
      While this is not incorrect, it is not very clean because we already have
      KMALLOC_MAX_SIZE for this very reason, so let's change both checks to use
      KMALLOC_MAX_SIZE instead.
      
      The original motivation for using KMALLOC_SHIFT_MAX was to work around
      an incorrect KMALLOC_MAX_SIZE which could lead to allocation warnings
      but it is no longer needed since "slab: make sure that KMALLOC_MAX_SIZE
      will fit into MAX_ORDER".
      
      Link: http://lkml.kernel.org/r/20161220130659.16461-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7984c27c
    • Michal Hocko's avatar
      mm, slab: make sure that KMALLOC_MAX_SIZE will fit into MAX_ORDER · bb1107f7
      Michal Hocko authored
      Andrey Konovalov has reported the following warning triggered by the
      syzkaller fuzzer.
      
        WARNING: CPU: 1 PID: 9935 at mm/page_alloc.c:3511 __alloc_pages_nodemask+0x159c/0x1e20
        Kernel panic - not syncing: panic_on_warn set ...
        CPU: 1 PID: 9935 Comm: syz-executor0 Not tainted 4.9.0-rc7+ #34
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
          __alloc_pages_slowpath mm/page_alloc.c:3511
          __alloc_pages_nodemask+0x159c/0x1e20 mm/page_alloc.c:3781
          alloc_pages_current+0x1c7/0x6b0 mm/mempolicy.c:2072
          alloc_pages include/linux/gfp.h:469
          kmalloc_order+0x1f/0x70 mm/slab_common.c:1015
          kmalloc_order_trace+0x1f/0x160 mm/slab_common.c:1026
          kmalloc_large include/linux/slab.h:422
          __kmalloc+0x210/0x2d0 mm/slub.c:3723
          kmalloc include/linux/slab.h:495
          ep_write_iter+0x167/0xb50 drivers/usb/gadget/legacy/inode.c:664
          new_sync_write fs/read_write.c:499
          __vfs_write+0x483/0x760 fs/read_write.c:512
          vfs_write+0x170/0x4e0 fs/read_write.c:560
          SYSC_write fs/read_write.c:607
          SyS_write+0xfb/0x230 fs/read_write.c:599
          entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      The issue is caused by a missing check on the request size in
      ep_write_iter, which should be fixed.  It, however, points to another
      problem: SLUB defines KMALLOC_MAX_SIZE too large because its
      KMALLOC_SHIFT_MAX is (MAX_ORDER + PAGE_SHIFT), which means that the
      resulting page allocator request might be MAX_ORDER, which is too large
      (see __alloc_pages_slowpath).
      
      The same applies to the SLOB allocator which allows even larger sizes.
      Make sure that they are capped properly and never request more than
      MAX_ORDER order.
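      
      The arithmetic behind this can be checked with a small stand-alone
      sketch.  The SKETCH_* values below are assumed example values (a 4 KiB
      page and MAX_ORDER of 11), and capping the shift one bit lower is shown
      as one possible way to stay within the limit, not necessarily the exact
      change made by this patch.

```c
#include <assert.h>

/* Assumed example values; not taken from the commit text. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_MAX_ORDER  11   /* valid orders are 0 .. SKETCH_MAX_ORDER - 1 */

/* Smallest order whose block (page size << order) holds 2^size_shift bytes. */
int order_for_shift(int size_shift)
{
	assert(size_shift >= 0);
	return size_shift > SKETCH_PAGE_SHIFT ? size_shift - SKETCH_PAGE_SHIFT : 0;
}

/* Uncapped limit described above: the largest size maps to order MAX_ORDER. */
int uncapped_shift_max(void)
{
	return SKETCH_MAX_ORDER + SKETCH_PAGE_SHIFT;
}

/* One possible cap: one bit less, so the largest size is order MAX_ORDER - 1. */
int capped_shift_max(void)
{
	return SKETCH_MAX_ORDER + SKETCH_PAGE_SHIFT - 1;
}
```

      Under these assumptions the uncapped maximum corresponds to an order-11
      request, which the page allocator refuses, while the capped maximum
      corresponds to order 10.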
      
      Link: http://lkml.kernel.org/r/20161220130659.16461-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb1107f7
    • Ross Zwisler's avatar
      dax: wrprotect pmd_t in dax_mapping_entry_mkclean · f729c8c9
      Ross Zwisler authored
      Currently dax_mapping_entry_mkclean() fails to clean and write protect
      the pmd_t of a DAX PMD entry during an *sync operation.  This can result
      in data loss in the following sequence:
      
      1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
         pmd_t dirty and writeable
      2) fsync, flushing out PMD data and cleaning the radix tree entry. We
         currently fail to mark the pmd_t as clean and write protected.
      3) more mmap writes to the PMD.  These don't cause any page faults since
         the pmd_t is dirty and writeable.  The radix tree entry remains clean.
      4) fsync, which fails to flush the dirty PMD data because the radix tree
         entry was clean.
      5) crash - dirty data that should have been fsync'd as part of 4) could
         still have been in the processor cache, and is lost.
      
      Fix this by marking the pmd_t clean and write protected in
      dax_mapping_entry_mkclean(), which is called as part of the fsync
      operation 2).  This will cause the writes in step 3) above to generate
      page faults where we'll re-dirty the PMD radix tree entry, resulting in
      flushes in the fsync that happens in step 4).
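      
      Steps 1) to 5) can be replayed in a toy model.  This is a user-space
      sketch with hypothetical names (struct toy_pmd, toy_store(),
      toy_fsync()), not the DAX implementation; it only tracks whether the
      mapping is writable, whether the radix tree entry is tagged dirty, and
      how many stores have not been flushed.

```c
#include <stdbool.h>

/* Toy model of one DAX PMD mapping; all names here are hypothetical. */
struct toy_pmd {
	bool writable;      /* pmd_t allows stores without faulting */
	bool tree_dirty;    /* radix tree entry is tagged dirty */
	int  unflushed;     /* stores still sitting in the CPU cache */
};

/* A store faults only when the pmd_t is write protected; the fault
 * handler is what re-dirties the radix tree entry. */
static void toy_store(struct toy_pmd *p)
{
	if (!p->writable) {
		p->writable = true;
		p->tree_dirty = true;
	}
	p->unflushed++;
}

/* fsync flushes only entries tagged dirty, then cleans the tag.
 * wrprotect selects between the buggy and the fixed behaviour. */
static void toy_fsync(struct toy_pmd *p, bool wrprotect)
{
	if (!p->tree_dirty)
		return;
	p->unflushed = 0;
	p->tree_dirty = false;
	if (wrprotect)
		p->writable = false;
}

/* Runs steps 1) to 4) and returns the number of stores a crash would lose. */
int lost_stores(bool wrprotect)
{
	struct toy_pmd p = { .writable = false, .tree_dirty = false, .unflushed = 0 };

	toy_store(&p);              /* 1) first mmap write faults, dirties entry */
	toy_fsync(&p, wrprotect);   /* 2) fsync */
	toy_store(&p);              /* 3) more mmap writes */
	toy_fsync(&p, wrprotect);   /* 4) fsync */
	return p.unflushed;         /* 5) what a crash would lose */
}
```

      In this model, leaving the mapping writable after the first fsync loses
      the store from step 3), while write protecting it forces a fault that
      re-dirties the entry so the second fsync flushes it.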
      
      Fixes: 4b4bb46d ("dax: clear dirty entry tags on cache flush")
      Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f729c8c9
    • Ross Zwisler's avatar
      mm: add follow_pte_pmd() · 09796395
      Ross Zwisler authored
      Patch series "Write protect DAX PMDs in *sync path".
      
      Currently dax_mapping_entry_mkclean() fails to clean and write protect
      the pmd_t of a DAX PMD entry during an *sync operation.  This can result
      in data loss, as detailed in patch 2.
      
      This series is based on Dan's "libnvdimm-pending" branch, which is the
      current home for Jan's "dax: Page invalidation fixes" series.  You can
      find a working tree here:
      
        https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_clean
      
      This patch (of 2):
      
      Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a
      huge page PMD leaf to be found and returned.
      
      Link: http://lkml.kernel.org/r/1482272586-21177-2-git-send-email-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09796395
    • Aneesh Kumar K.V's avatar
      mm/thp/pagecache/collapse: free the pte page table on collapse for thp page cache. · d670ffd8
      Aneesh Kumar K.V authored
      With THP page cache, when trying to build a huge page from regular pte
      pages, we just clear the pmd entry.  We will take another fault and at
      that point we will find the huge page in the radix tree, thereby using
      the huge page to complete the page fault.
      
      The second fault path will allocate the needed pgtable_t page for archs
      like ppc64, so there is no need to deposit the same in the collapse path.
      Depositing them in the collapse path results in a pgtable_t memory
      leak and also gives errors like
      
        BUG: non-zero nr_ptes on freeing mm: 3
      
      Fixes: 953c66c2 ("mm: THP page cache support for ppc64")
      Link: http://lkml.kernel.org/r/20161212163428.6780-2-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d670ffd8
    • Ross Zwisler's avatar
      dax: fix deadlock with DAX 4k holes · 965d004a
      Ross Zwisler authored
      Currently in DAX if we have three read faults on the same hole address we
      can end up with the following:
      
      Thread 0		Thread 1		Thread 2
      --------		--------		--------
      dax_iomap_fault
       grab_mapping_entry
        lock_slot
         <locks empty DAX entry>
      
        			dax_iomap_fault
      			 grab_mapping_entry
      			  get_unlocked_mapping_entry
      			   <sleeps on empty DAX entry>
      
      						dax_iomap_fault
      						 grab_mapping_entry
      						  get_unlocked_mapping_entry
      						   <sleeps on empty DAX entry>
        dax_load_hole
         find_or_create_page
         ...
          page_cache_tree_insert
           dax_wake_mapping_entry_waiter
            <wakes one sleeper>
           __radix_tree_replace
            <swaps empty DAX entry with 4k zero page>
      
      			<wakes>
      			get_page
      			lock_page
      			...
      			put_locked_mapping_entry
      			unlock_page
      			put_page
      
      						<sleeps forever on the DAX
      						 wait queue>
      
      The crux of the problem is that once we insert a 4k zero page, all
      locking from then on is done in terms of that 4k zero page and any
      additional threads sleeping on the empty DAX entry will never be woken.
      
      Fix this by waking all sleepers when we replace the DAX radix tree entry
      with a 4k zero page.  This will allow all sleeping threads to
      successfully transition from locking based on the DAX empty entry to
      locking on the 4k zero page.
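      
      The difference between waking one sleeper and waking all of them can be
      shown with a toy wait queue that only counts sleepers.  This is a
      sketch with hypothetical names, not the DAX wait queue code; it assumes,
      as described above, that nothing targets the old queue again once the
      entry has been replaced.

```c
#include <assert.h>

/* Toy wait queue: only counts sleepers; names are hypothetical. */
struct toy_waitqueue {
	int sleepers;
};

/* Wake one sleeper, or all of them, and return how many were woken. */
static int toy_wake(struct toy_waitqueue *wq, int wake_all)
{
	int woken = wake_all ? wq->sleepers : (wq->sleepers > 0);

	wq->sleepers -= woken;
	return woken;
}

/* nr_waiters threads sleep on the empty entry, which is then replaced by
 * a zero page. No later wake-up targets the old queue, so whoever is
 * still asleep after this single call stays asleep. */
int stuck_after_replace(int nr_waiters, int wake_all)
{
	struct toy_waitqueue wq = { .sleepers = nr_waiters };

	assert(nr_waiters >= 0);
	toy_wake(&wq, wake_all);
	return wq.sleepers;
}
```

      With two waiters, as in the three-thread diagram, waking a single
      sleeper leaves one thread stuck and waking all of them leaves none.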
      
      With the test case reported by Xiong this happens very regularly in my
      test setup, with some runs resulting in 9+ threads in this deadlocked
      state.  With this fix I've been able to run that same test dozens of
      times in a loop without issue.
      
      Fixes: ac401cc7 ("dax: New fault locking")
      Link: http://lkml.kernel.org/r/1483479365-13607-1-git-send-email-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: Xiong Zhou <xzhou@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      965d004a
    • Vlastimil Babka's avatar
      MAINTAINERS: remove duplicate bug filling description · 5771f6ea
      Vlastimil Babka authored
      I have noticed that two different descriptions for B: entries in
      MAINTAINERS were merged: commit 68656443 ("MAINTAINERS: Add bug
      tracking system location entry type") and 2de2bd95 ("MAINTAINERS:
      add "B:" for URI where to file bugs").
      
      This patch keeps the description from 2de2bd95.  There has been a
      discussion [1] about whether this more detailed description is useful
      and what it exactly implies.  I find it more useful and general, and the
      author of 68656443 agreed in the end that either is fine.
      
      [1] https://lkml.org/lkml/2016/12/8/71
      
      Link: http://lkml.kernel.org/r/20161219085158.12114-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5771f6ea
  2. 09 Jan, 2017 6 commits
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-for-v4.10-rc4' of git://people.freedesktop.org/~airlied/linux · bd5d7428
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "amdgpu, radeon, msm, meson, tilcdc, drm fixes.
      
        Just back online for a couple of days, gathered up the remaining fixes
        pull requests.
      
        This contains fixes for a few ARM platforms (msm, tilcdc, meson), and
        one core atomic fix. The AMD pull has some new hardware support
        (Polaris12) in it, but this is pretty limited to just hw enablement
        and shouldn't cause any problems"
      
      * tag 'drm-fixes-for-v4.10-rc4' of git://people.freedesktop.org/~airlied/linux:
        drm/amdgpu: drop verde dpm quirks
        drm/radeon: drop verde dpm quirks
        drm/radeon: update smc firmware selection for SI
        drm/amdgpu: update si kicker smc firmware
        drm/amd/powerplay: extend smu's response timeout time.
        drm/amdgpu: remove static integer for uvd pp state
        drm/amd/amdgpu: add Polaris12 PCI ID
        drm/amdgpu/powerplay: add Polaris12 support
        drm/amd/amdgpu: add Polaris12 support (v3)
        MAINTAINERS: Update mailing list for radeon and amdgpu
        drm/meson: Fix CVBS VDAC disable
        drm/meson: Fix CVBS initialization when HDMI is configured by bootloader
        drm: Clean up planes in atomic commit helper failure path
        drm: tilcdc: simplify the recovery from sync lost error on rev1
        drm/meson: Fix plane atomic check when no crtc for the plane
        drm/msm: Verify that MSM_SUBMIT_BO_FLAGS are set
        drm/msm: Put back the vaddr in submit_reloc()
        drm/msm: Ensure that the hardware write pointer is valid
      bd5d7428
    • Linus Torvalds's avatar
      Merge tag 'gpio-v4.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio · 756a7334
      Linus Torvalds authored
      Pull GPIO fixes from Linus Walleij:
      
       - move freeing of GPIO hogs to after freeing the device to get rid of a
         warning state.
      
       - a small compile warning fix
      
      * tag 'gpio-v4.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
        gpio: Move freeing of GPIO hogs before numbing of the device
        gpio: mxs: remove __init annotation
      756a7334
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · c92f5bdc
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix dumping of nft_quota entries, from Pablo Neira Ayuso.
      
       2) Fix out of bounds access in nf_tables discovered by KASAN, from
          Florian Westphal.
      
       3) Fix IRQ enabling in dp83867 driver, from Grygorii Strashko.
      
       4) Fix unicast filtering in be2net driver, from Ivan Vecera.
      
       5) tg3_get_stats64() can race with driver close and ethtool
          reconfigurations, fix from Michael Chan.
      
       6) Fix error handling when pass limit is reached in bpf code gen on
          x86. From Daniel Borkmann.
      
       7) Don't clobber switch ops and use proper MDIO nested reads and writes
          in bcm_sf2 driver, from Florian Fainelli.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (21 commits)
        net: dsa: bcm_sf2: Utilize nested MDIO read/write
        net: dsa: bcm_sf2: Do not clobber b53_switch_ops
        net: stmmac: fix maxmtu assignment to be within valid range
        bpf: change back to orig prog on too many passes
        tg3: Fix race condition in tg3_get_stats64().
        be2net: fix unicast list filling
        be2net: fix accesses to unicast list
        netlabel: add CALIPSO to the list of built-in protocols
        vti6: fix device register to report IFLA_INFO_KIND
        net: phy: dp83867: fix irq generation
        amd-xgbe: Fix IRQ processing when running in single IRQ mode
        sh_eth: R8A7740 supports packet shecksumming
        sh_eth: fix EESIPR values for SH77{34|63}
        r8169: fix the typo in the comment
        nl80211: fix sched scan netlink socket owner destruction
        bridge: netfilter: Fix dropping packets that moving through bridge interface
        netfilter: ipt_CLUSTERIP: check duplicate config when initializing
        netfilter: nft_payload: mangle ckecksum if NFT_PAYLOAD_L4CSUM_PSEUDOHDR is set
        netfilter: nf_tables: fix oob access
        netfilter: nft_queue: use raw_smp_processor_id()
        ...
      c92f5bdc
    • David S. Miller's avatar
      Merge branch 'bcm_sf2-fixes' · 03430fa1
      David S. Miller authored
      Florian Fainelli says:
      
      ====================
      net: dsa: bcm_sf2: Couple fixes
      
      Here are a couple of fixes for bcm_sf2, please queue these up for -stable
      as well, thank you very much!
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      03430fa1
    • Florian Fainelli's avatar
      net: dsa: bcm_sf2: Utilize nested MDIO read/write · 2cfe8f82
      Florian Fainelli authored
      We are implementing a MDIO bus which is behind another one, so use the
      nested version of the accessors to get lockdep annotations correct.
      
      Fixes: 461cd1b0 ("net: dsa: bcm_sf2: Register our slave MDIO bus")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2cfe8f82
    • Florian Fainelli's avatar
      net: dsa: bcm_sf2: Do not clobber b53_switch_ops · a4c61b92
      Florian Fainelli authored
      We make the bcm_sf2 driver override ds->ops which points to
      b53_switch_ops since b53_switch_alloc() did the assignment. This is all
      well and good until a second b53 switch comes in, and ends up using the
      bcm_sf2 operations. Make a proper local copy, substitute the ds->ops
      pointer and then override the operations.
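      
      The copy-then-override pattern described above can be sketched in plain
      C.  The types and functions below (struct toy_switch_ops,
      derived_setup() and so on) are hypothetical stand-ins, not the DSA or
      b53 structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ops table; a real one has many more callbacks. */
struct toy_switch_ops {
	int (*get_tag_protocol)(void);
};

struct toy_switch {
	const struct toy_switch_ops *ops;
};

static int base_proto(void)    { return 1; }
static int derived_proto(void) { return 2; }

/* Shared default table, as handed out by a common allocation helper. */
static struct toy_switch_ops base_ops = { .get_tag_protocol = base_proto };

/* The derived driver keeps its own copy, so overrides stay private to it. */
static struct toy_switch_ops derived_ops;

void derived_setup(struct toy_switch *ds)
{
	assert(ds != NULL && ds->ops != NULL);
	derived_ops = *ds->ops;                        /* make a local copy */
	derived_ops.get_tag_protocol = derived_proto;  /* override the copy */
	ds->ops = &derived_ops;                        /* substitute the pointer */
}

/* Returns the protocol a second, plain switch sees after the derived
 * driver has been set up; it should still be the base value. */
int second_switch_proto(void)
{
	struct toy_switch first = { .ops = &base_ops };
	struct toy_switch second = { .ops = &base_ops };

	derived_setup(&first);
	assert(first.ops->get_tag_protocol() == 2);
	return second.ops->get_tag_protocol();
}
```

      Because only the private copy is modified, the second switch in this
      sketch keeps using the shared defaults.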
      
      Fixes: f458995b ("net: dsa: bcm_sf2: Utilize core B53 driver when possible")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4c61b92
  3. 08 Jan, 2017 8 commits