1. 13 Dec, 2016 40 commits
    • mm: disable numa migration faults for dax vmas · c1ef8e2c
      Dan Williams authored
      Mark dax vmas as not migratable to exclude them from task_numa_work().
      This is especially relevant for device-dax which wants to ensure
      predictable access latency and not incur periodic faults.
      
      [akpm@linux-foundation.org: add comment]
      Link: http://lkml.kernel.org/r/147892450132.22062.16875659431109209179.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/pkeys: generate pkey system call code only if ARCH_HAS_PKEYS is selected · c7142aea
      Heiko Carstens authored
      Having code for the pkey_mprotect, pkey_alloc and pkey_free system calls
      only makes sense if ARCH_HAS_PKEYS is selected.  If it is not selected,
      these system calls will always return -ENOSPC or -EINVAL.

      To simplify things and generate less code, emit the pkey system call
      code only if ARCH_HAS_PKEYS is selected.
      
      For architectures which have already wired up the system calls, but do
      not select ARCH_HAS_PKEYS this will result in less generated code and a
      different return code: the three system calls will now always return
      -ENOSYS, using the cond_syscall mechanism.
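
      The fallback described above can be pictured with a small userspace
      analogy.  This is a sketch only, assuming GCC or Clang on an ELF target;
      the name sys_pkey_alloc_demo is hypothetical, and the kernel's own
      cond_syscall macro is implemented differently in detail.  The idea it
      illustrates is a weak symbol that resolves to the "not implemented" stub
      unless a real definition is linked in.

```c
#include <errno.h>

/* Fallback used for any system call that has no real implementation. */
long sys_ni_syscall(void)
{
	return -ENOSYS;
}

/*
 * Hypothetical stand-in for a cond_syscall()-style declaration: the entry
 * point is a weak alias of the fallback.  If another object file provided
 * a strong definition of sys_pkey_alloc_demo(), the linker would use that
 * one instead of the alias.
 */
long sys_pkey_alloc_demo(void) __attribute__((weak, alias("sys_ni_syscall")));
```

      With no strong definition linked in, calling sys_pkey_alloc_demo()
      returns -ENOSYS, which mirrors the new return code on architectures that
      do not select ARCH_HAS_PKEYS.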
      
      For architectures which have not wired up the system calls, less
      unreachable code will be generated.
      
      Link: http://lkml.kernel.org/r/20161114111251.70084-1-heiko.carstens@de.ibm.com
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dt: add documentation of "hotpluggable" memory property · c3352cbb
      Reza Arbab authored
      Summarize the "hotpluggable" property of dt memory nodes.
      
      Link: http://lkml.kernel.org/r/1479160961-25840-6-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alistair Popple <apopple@au1.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • of/fdt: mark hotpluggable memory · 41a9ada3
      Reza Arbab authored
      When movable nodes are enabled, any node containing only hotpluggable
      memory is made movable at boot time.
      
      On x86, hotpluggable memory is discovered by parsing the ACPI SRAT,
      making corresponding calls to memblock_mark_hotplug().
      
      If we introduce a dt property to describe memory as hotpluggable,
      configs supporting early fdt may then also do this marking and use
      movable nodes.
      
      Link: http://lkml.kernel.org/r/1479160961-25840-5-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Tested-by: Balbir Singh <bsingharora@gmail.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alistair Popple <apopple@au1.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: enable CONFIG_MOVABLE_NODE on non-x86 arches · 114cf3cc
      Reza Arbab authored
      To support movable memory nodes (CONFIG_MOVABLE_NODE), at least one of
      the following must be true:
      
      1. This config has the capability to identify movable nodes at boot.
         Right now, only x86 can do this.
      
      2. Our config supports memory hotplug, which means that a movable node
         can be created by hotplugging all of its memory into ZONE_MOVABLE.
      
      Fix the Kconfig definition of CONFIG_MOVABLE_NODE, which currently
      recognizes (1), but not (2).
      
      Link: http://lkml.kernel.org/r/1479160961-25840-4-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alistair Popple <apopple@au1.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove x86-only restriction of movable_node · 39fa104d
      Reza Arbab authored
      In commit c5320926 ("mem-hotplug: introduce movable_node boot
      option"), the memblock allocation direction is changed to bottom-up and
      then back to top-down like this:
      
      1. memblock_set_bottom_up(true), called by cmdline_parse_movable_node().
      2. memblock_set_bottom_up(false), called by x86's numa_init().
      
      Even though (1) occurs in generic mm code, it is wrapped by #ifdef
      CONFIG_MOVABLE_NODE, which depends on X86_64.
      
      This means that when we extend CONFIG_MOVABLE_NODE to non-x86 arches,
      things will be unbalanced.  (1) will happen for them, but (2) will not.
      
      This toggle was added in the first place because x86 has a delay between
      adding memblocks and marking them as hotpluggable.  Since other arches
      do this marking either immediately or not at all, they do not require
      the bottom-up toggle.
      
      So, resolve things by moving (1) from cmdline_parse_movable_node() to
      x86's setup_arch(), immediately after the movable_node parameter has
      been parsed.
      
      Link: http://lkml.kernel.org/r/1479160961-25840-3-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alistair Popple <apopple@au1.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • powerpc/mm: allow memory hotplug into a memoryless node · 4a3bac4e
      Reza Arbab authored
      Patch series "enable movable nodes on non-x86 configs", v7.
      
      This patchset allows more configs to make use of movable nodes.  When
      CONFIG_MOVABLE_NODE is selected, there are two ways to introduce such
      nodes into the system:
      
      1. Discover movable nodes at boot. Currently this is only possible on
         x86, but we will enable configs supporting fdt to do the same.
      
      2. Hotplug and online all of a node's memory using online_movable. This
         is already possible on any config supporting memory hotplug, not
         just x86, but the Kconfig doesn't say so. We will fix that.
      
      We'll also remove some cruft on power which would prevent (2).
      
      This patch (of 5):
      
      Remove the check which prevents us from hotplugging into an empty node.
      
      The original commit b226e462 ("[PATCH] powerpc: don't add memory to
      empty node/zone") states that this was intended to be a temporary measure.
      It is a workaround for an oops which no longer occurs.
      
      Link: http://lkml.kernel.org/r/1479160961-25840-2-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alistair Popple <apopple@au1.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy.c: forbid static or relative flags for local NUMA mode · 8d303e44
      Piotr Kwapulinski authored
      The MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES flags are irrelevant
      when they are set for the MPOL_LOCAL NUMA memory policy via
      set_mempolicy or mbind.

      Return "invalid argument" (-EINVAL) from set_mempolicy and mbind
      whenever either of these flags is passed along with MPOL_LOCAL.

      This is consistent with MPOL_PREFERRED passed with an empty nodemask.

      It slightly shortens the execution time in paths where these flags are
      used, e.g. when trying to rebind the NUMA nodes for changes in cgroups
      cpuset mems (mpol_rebind_preferred()) or when just printing the mempolicy
      structure (/proc/PID/numa_maps).  Isolated tests done.
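
      The rule can be sketched as a stand-alone validation function.  This is
      an illustration, not the code from mm/mempolicy.c: the DEMO_* constants
      are local copies made for the example, and demo_check_policy() is a
      hypothetical helper.

```c
#include <errno.h>

/* Local copies for illustration only; the kernel's own definitions are
 * in its mempolicy UAPI header. */
enum {
	DEMO_MPOL_DEFAULT,
	DEMO_MPOL_PREFERRED,
	DEMO_MPOL_BIND,
	DEMO_MPOL_INTERLEAVE,
	DEMO_MPOL_LOCAL,
};
#define DEMO_MPOL_F_STATIC_NODES   (1u << 15)
#define DEMO_MPOL_F_RELATIVE_NODES (1u << 14)

/*
 * Hypothetical helper: reject the static/relative node flags when the
 * requested mode is the local policy, accept every other combination.
 */
int demo_check_policy(int mode, unsigned int flags)
{
	if (mode == DEMO_MPOL_LOCAL &&
	    (flags & (DEMO_MPOL_F_STATIC_NODES | DEMO_MPOL_F_RELATIVE_NODES)))
		return -EINVAL;
	return 0;
}
```

      In this sketch a plain MPOL_LOCAL request still succeeds; only the
      combination with one of the two node flags is refused.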
      
      Link: http://lkml.kernel.org/r/20161027163037.4089-1-kwapulinski.piotr@gmail.com
      Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Liang Chen <liangchen.linux@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Nathan Zimmer <nzimmer@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix up get_user_pages* comments · 80a79516
      Lorenzo Stoakes authored
      In the previous round of get_user_pages* changes, the comments attached
      to __get_user_pages_unlocked() and get_user_pages_unlocked() were
      rendered incorrect; this patch corrects them.

      In addition, the get_user_pages_unlocked() comment seems to have already
      been outdated, as it referred to the tsk and mm parameters which were
      removed in c12d2da5 ("mm/gup: Remove the macro overload API migration
      helpers from the get_user*() APIs"); this patch fixes that as well.
      
      Link: http://lkml.kernel.org/r/20161025233435.5338-1-lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove the page size change check in tlb_remove_page · 692a68c1
      Aneesh Kumar K.V authored
      Now that we check for page size change early in the loop, we can
      partially revert e9d55e15 ("mm: change the interface for
      __tlb_remove_page").
      
      This simplifies the code considerably by removing the need to track the
      last address with which we adjusted the range.  We also go back to the
      older way of filling the mmu_gather array, i.e., we add an entry and then
      check whether the gather batch is full.
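
      The "add an entry, then check whether the batch is full" pattern can be
      sketched as follows.  The struct, the batch size and the function name
      are hypothetical stand-ins for illustration, not the mmu_gather code.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_BATCH_MAX 4

/* Hypothetical fixed-size batch of pages waiting to be freed. */
struct demo_batch {
	size_t nr;
	const void *pages[DEMO_BATCH_MAX];
};

/*
 * Store the page first, then report whether the batch is now full.
 * After a true return the caller is expected to flush and reset nr
 * to 0, so there is always room for the next entry.
 */
bool demo_batch_add(struct demo_batch *b, const void *page)
{
	b->pages[b->nr++] = page;
	return b->nr == DEMO_BATCH_MAX;
}
```

      Because the check happens after the store, the caller never has to
      remember a page that could not be queued, which is what makes the
      bookkeeping simpler in this sketch.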
      
      Link: http://lkml.kernel.org/r/20161026084839.27299-6-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add tlb_remove_check_page_size_change to track page size change · 07e32661
      Aneesh Kumar K.V authored
      With commit e77b0852 ("mm/mmu_gather: track page size with mmu
      gather and force flush if page size change") we added the ability to
      force a tlb flush when the page size changes in a mmu_gather loop.  We
      did that by checking for a page size change every time we added a page
      to mmu_gather for lazy flush/remove.  We can improve that by moving the
      page size change check earlier and not doing it every time we add a page.

      This also helps us to do a tlb flush when invalidating a range covering
      a dax mapping.  For a dax mapping we don't have a backing struct page and
      hence we don't call tlb_remove_page, which earlier forced the tlb flush
      on page size change.  Moving the page size change check earlier means we
      will do the same even for a dax mapping.

      We also avoid doing this check on architectures other than powerpc.

      In a later patch we will remove the page size check from tlb_remove_page().
      
      Link: http://lkml.kernel.org/r/20161026084839.27299-5-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb: add tlb_remove_hugetlb_entry for handling hugetlb pages · b528e4b6
      Aneesh Kumar K.V authored
      This adds tlb_remove_hugetlb_entry, similar to tlb_remove_pmd_tlb_entry.
      
      Link: http://lkml.kernel.org/r/20161026084839.27299-4-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: update mmu_gather range correctly · b5bc66b7
      Aneesh Kumar K.V authored
      We use __tlb_adjust_range to update the range covered by the mmu_gather
      struct.  We later use the 'start' and 'end' to do a
      mmu_notifier_invalidate_range in tlb_flush_mmu_tlbonly().  Update the
      'end' correctly in __tlb_adjust_range so that we call
      mmu_notifier_invalidate_range with the correct range values.
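
      A minimal sketch of the range bookkeeping, under the assumption that
      "correct" means 'end' must cover the whole mapping of the given size
      rather than a single base page.  The struct and function names here are
      hypothetical and only model the start/end tracking.

```c
/* Hypothetical model of the range tracked while gathering pages. */
struct demo_gather {
	unsigned long start;
	unsigned long end;
};

/* Start from an empty range: start above any address, end at zero. */
void demo_gather_init(struct demo_gather *g)
{
	g->start = ~0UL;
	g->end = 0;
}

/*
 * Widen the tracked range so that it covers [address, address + range_size).
 * Using range_size for the upper bound is the assumption this sketch makes
 * about what a correct 'end' looks like.
 */
void demo_adjust_range(struct demo_gather *g, unsigned long address,
		       unsigned long range_size)
{
	if (address < g->start)
		g->start = address;
	if (address + range_size > g->end)
		g->end = address + range_size;
}
```

      In this model, a 2M entry at address 0x200000 extends 'end' to 0x400000,
      so a later range-based notifier call would see the full extent.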
      
      Wrt tlbflush, this should not have any impact, because a flush with
      correct start address will flush tlb mapping for the range.
      
      Also add a comment about updating the range when we free pagetable pages.
      For now we don't support a range based page table cache flush.
      
      Link: http://lkml.kernel.org/r/20161026084839.27299-3-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use the correct page size when removing the page · c0f2e176
      Aneesh Kumar K.V authored
      We are removing a pmd hugepage here.  Use the correct page size.
      
      Link: http://lkml.kernel.org/r/20161026084839.27299-2-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • shmem: avoid maybe-uninitialized warning · 23f919d4
      Arnd Bergmann authored
      After enabling -Wmaybe-uninitialized warnings, we get a false-positive
      warning for shmem:
      
        mm/shmem.c: In function `shmem_getpage_gfp':
        include/linux/spinlock.h:332:21: error: `info' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      This can be easily avoided, since the correct 'info' pointer is known at
      the time we first enter the function, so we can simply move the
      initialization up.  Moving it before the first label avoids the warning
      and lets us remove two later initializations.
      
      Note that the function is so hard to read that it not only confuses the
      compiler, but also most readers, and without this patch it could easily
      break if one of the 'goto's changed.
      
      Link: https://www.spinics.net/lists/kernel/msg2368133.html
      Link: http://lkml.kernel.org/r/20161024205725.786455-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: fix NR_ISOLATED_* stats for pfn based migration · 6afcf8ef
      Ming Ling authored
      Since commit bda807d4 ("mm: migrate: support non-lru movable page
      migration"), isolate_migratepages_block() can isolate !PageLRU pages,
      which acct_isolated would then account as NR_ISOLATED_*.  Accounting
      these non-lru pages as NR_ISOLATED_{ANON,FILE} doesn't make any sense,
      and it can misguide heuristics based on those counters, such as
      pgdat_reclaimable_pages or too_many_isolated, which would lead to
      unexpected stalls during direct reclaim without any good reason.  Note
      that __alloc_contig_migrate_range can isolate a lot of pages at once.

      On mobile devices such as an Android phone with 512M of RAM, a big zram
      swap may be in use.  In some cases zram (zsmalloc) uses too many non-lru
      but migratable pages, such as:
      
            MemTotal: 468148 kB
            Normal free:5620kB
            Free swap:4736kB
            Total swap:409596kB
            ZRAM: 164616kB(zsmalloc non-lru pages)
            active_anon:60700kB
            inactive_anon:60744kB
            active_file:34420kB
            inactive_file:37532kB
      
      Fix this by only accounting lru pages to NR_ISOLATED_* in
      isolate_migratepages_block right after they were isolated and we still
      know they were on the LRU.  Drop acct_isolated because it is called after
      the fact, when we've lost that information.  Batching the per-cpu counter
      updates doesn't bring much improvement anyway.  Also make sure that we
      uncharge only LRU pages when putting them back on the LRU in
      putback_movable_pages, or when unmap_and_move migrates the page.
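
      The accounting rule can be modelled in a few lines.  This is a toy model
      with hypothetical types and names, not the compaction code: the counter
      is charged at isolation time, while we still know whether the page came
      off the LRU, and it is uncharged only for those same pages.

```c
#include <stdbool.h>

/* Hypothetical page descriptor for the sketch. */
struct demo_page {
	bool was_on_lru;   /* recorded when the page is isolated */
	bool file_backed;
};

/* Hypothetical stand-ins for the NR_ISOLATED_{ANON,FILE} counters. */
struct demo_stats {
	long isolated_anon;
	long isolated_file;
};

/* Charge the counter only for pages that were taken off the LRU. */
void demo_isolate(struct demo_stats *s, const struct demo_page *p)
{
	if (!p->was_on_lru)
		return;	/* non-LRU movable page (e.g. zsmalloc): not counted */
	if (p->file_backed)
		s->isolated_file++;
	else
		s->isolated_anon++;
}

/* Mirror image on putback or after migration: uncharge LRU pages only. */
void demo_putback(struct demo_stats *s, const struct demo_page *p)
{
	if (!p->was_on_lru)
		return;
	if (p->file_backed)
		s->isolated_file--;
	else
		s->isolated_anon--;
}
```

      With this symmetry the counters return to zero once every isolated page
      has been put back, regardless of how many non-LRU pages were in flight.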
      
      [mhocko@suse.com: replace acct_isolated() with direct counting]
      Fixes: bda807d4 ("mm: migrate: support non-lru movable page migration")
      Link: http://lkml.kernel.org/r/20161019080240.9682-1-mhocko@kernel.org
      Signed-off-by: Ming Ling <ming.ling@spreadtrum.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, mempolicy: clean up __GFP_THISNODE confusion in policy_zonelist · 6d840958
      Michal Hocko authored
      __GFP_THISNODE is documented to enforce the allocation to be satisfied
      from the requested node with no fallbacks or placement policy
      enforcements.  policy_zonelist seemingly breaks this semantic if the
      current policy is MPOL_BIND: instead of taking the requested node it
      will fall back to the first node in the mask if the requested one is not
      in the mask.  This is confusing to say the least, because in fact we
      shouldn't ever take that path.  First, tasks shouldn't be scheduled on
      CPUs with nodes outside of their mempolicy binding.  And secondly,
      policy_zonelist is called only from 3 places:

       - huge_zonelist - never should do __GFP_THISNODE when going this path

       - alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either

       - alloc_pages_current - which uses default_policy if __GFP_THISNODE is
         used
      
      So we shouldn't even need to care about this possibility and can drop
      the confusing code.  Let's keep a WARN_ON_ONCE in place to catch
      potential users and fix them up properly (aka use a different allocation
      function which ignores mempolicy).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20161013125958.32155-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: avoid unlikely branches for split_huge_pmd · fd60775a
      David Rientjes authored
      While doing MADV_DONTNEED on a large area of thp memory, I noticed we
      encountered many unlikely() branches in profiles for each backing
      hugepage.  This is because zap_pmd_range() would call split_huge_pmd(),
      which rechecked the conditions that were already validated, but as part
      of an unlikely() branch.
      
      Avoid the unlikely() branch when in a context where pmd is known to be
      good for __split_huge_pmd() directly.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1610181600300.84525@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc.c: simplify /proc/vmallocinfo implementation · 3f500069
      zijun_hu authored
      Many seq_file helpers exist to simplify the implementation of virtual
      files, especially for /proc nodes.  However, although helpers for
      iterating over a list_head are available, /proc/vmallocinfo does not
      currently use them.
      
      Simplify /proc/vmallocinfo implementation by using existing seq_file
      helpers.
      
      Link: http://lkml.kernel.org/r/57FDF2E5.1000201@zoho.com
      Signed-off-by: zijun_hu <zijun_hu@htc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make unreserve highatomic functions reliable · 29fac03b
      Minchan Kim authored
      Currently, unreserve_highatomic_pageblock bails out as soon as it finds
      a highatomic pageblock, regardless of whether it really moved any free
      pages out of it.  That can defeat the goal of the unreserve logic, which
      is to save a process from OOM.

      This patch makes the unreserve function bail out only if it actually
      moved some free pages out of the highatomic pageblock, to avoid such
      false positives.

      Another potential problem is that, because of a race between page
      freeing and the reserve highatomic function, pages could be on the
      highatomic free list even though the pageblock is not of the highatomic
      migratetype.  In that case, unreserve_highatomic_pageblock can have no
      effect if the count of the highatomic reserve is less than
      pageblock_nr_pages.  We could solve it simply by draining all of the
      reserved pages before the OOM.  It would act as a safeguard, exhausting
      the reserved pages before converging to OOM.
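
      The changed bail-out condition can be sketched with a toy model.  The
      types and names below are hypothetical and leave out zones, locking and
      migratetypes; the only point is that the loop keeps going until a block
      actually gave pages back.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a pageblock for the sketch. */
struct demo_block {
	bool highatomic;   /* currently reserved for high-order atomic use */
	int free_pages;    /* free pages sitting in this block */
};

/*
 * Unreserve blocks one by one and stop only after a block actually
 * released free pages.  Returns true if any pages were moved.
 */
bool demo_unreserve(struct demo_block *blocks, size_t n, int *general_free)
{
	for (size_t i = 0; i < n; i++) {
		int moved;

		if (!blocks[i].highatomic)
			continue;

		moved = blocks[i].free_pages;
		blocks[i].highatomic = false;
		blocks[i].free_pages = 0;
		*general_free += moved;

		if (moved > 0)
			return true;	/* real progress: safe to stop here */
		/* nothing was moved: keep looking instead of bailing out */
	}
	return false;
}
```

      In this model, a reserved block with no free pages no longer ends the
      search, so the caller does not mistake an empty unreserve for progress.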
      
      Link: http://lkml.kernel.org/r/1476259429-18279-5-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: try to exhaust highatomic reserve before the OOM · 04c8716f
      Minchan Kim authored
      I got an OOM report from the production team with a v4.4 kernel.  It had
      enough free memory but failed to allocate a GFP_KERNEL order-0 page and
      finally encountered an OOM kill.  It occurred during a QA process which
      launches several apps, switches between them and so on.  It happened
      rarely.  IOW, in a normal situation it was not a problem, but if we are
      unlucky and several apps use peak memory at the same time, it can
      happen.  If we manage to get past that phase, the system can go on
      working well.

      I could reproduce it easily with my test (memory spike).  Look at below.

      The reason is that the free pages (19M) of the DMA32 zone are reserved
      for HIGHORDERATOMIC and are not unreserved before the OOM.
      
        balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
        balloon cpuset=/ mems_allowed=0
        CPU: 1 PID: 8473 Comm: balloon Tainted: G        W  OE   4.8.0-rc7-00219-g3f74c9559583-dirty #3161
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          dump_header+0x5c/0x1ce
          oom_kill_process+0x22e/0x400
          out_of_memory+0x1ac/0x210
          __alloc_pages_nodemask+0x101e/0x1040
          handle_mm_fault+0xa0a/0xbf0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:383949 inactive_anon:106724 isolated_anon:0
         active_file:15 inactive_file:44 isolated_file:0
         unevictable:0 dirty:0 writeback:24 unstable:0
         slab_reclaimable:2483 slab_unreclaimable:3326
         mapped:0 shmem:0 pagetables:1906 bounce:0
         free:6898 free_pcp:291 free_cma:0
        Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
        DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
        DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
        51131 total pagecache pages
        50795 pages in swap cache
        Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
        Free swap  = 8kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12658 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      Another example, where the limit was exceeded because of the race, is:
      
        in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
        CPU: 0 PID: 476 Comm: in:imklog Tainted: G            E   4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          warn_alloc_failed+0xdb/0x130
          __alloc_pages_nodemask+0x4d6/0xdb0
          new_slab+0x339/0x490
          ___slab_alloc.constprop.74+0x367/0x480
          __slab_alloc.constprop.73+0x20/0x40
          __kmalloc+0x1a4/0x1e0
          alloc_indirect.isra.14+0x1d/0x50
          virtqueue_add_sgs+0x1c4/0x470
          __virtblk_add_req+0xae/0x1f0
          virtio_queue_rq+0x12d/0x290
          __blk_mq_run_hw_queue+0x239/0x370
          blk_mq_run_hw_queue+0x8f/0xb0
          blk_mq_insert_requests+0x18c/0x1a0
          blk_mq_flush_plug_list+0x125/0x140
          blk_flush_plug_list+0xc7/0x220
          blk_finish_plug+0x2c/0x40
          __do_page_cache_readahead+0x196/0x230
          filemap_fault+0x448/0x4f0
          ext4_filemap_fault+0x36/0x50
          __do_fault+0x75/0x140
          handle_mm_fault+0x84d/0xbe0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:363826 inactive_anon:121283 isolated_anon:32
         active_file:65 inactive_file:152 isolated_file:0
         unevictable:0 dirty:0 writeback:46 unstable:0
         slab_reclaimable:2778 slab_unreclaimable:3070
         mapped:112 shmem:0 pagetables:1822 bounce:0
         free:9469 free_pcp:231 free_cma:0
        Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
        DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
        DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
        2775 total pagecache pages
        2536 pages in swap cache
        Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
        Free swap  = 108744kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12648 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      It's weird that the zone shows enough free memory above the min
      watermark but still OOMs on a 4K GFP_KERNEL allocation because of
      reserved highatomic pages.  As a last resort, try to unreserve
      highatomic pages again and, if that moved pages to a non-highatomic
      free list, retry reclaim once more.
      
      Link: http://lkml.kernel.org/r/1476259429-18279-4-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04c8716f
    • Minchan Kim's avatar
      mm: prevent double decrease of nr_reserved_highatomic · 4855e4a7
      Minchan Kim authored
      There is a race between page freeing and unreserving highatomic
      pageblocks.
      
       CPU 0				    CPU 1
      
          free_hot_cold_page
            mt = get_pfnblock_migratetype
            set_pcppage_migratetype(page, mt)
          				    unreserve_highatomic_pageblock
          				    spin_lock_irqsave(&zone->lock)
          				    move_freepages_block
          				    set_pageblock_migratetype(page)
          				    spin_unlock_irqrestore(&zone->lock)
            free_pcppages_bulk
              __free_one_page(mt) <- mt is stale
      
      Because of the above race, a page on CPU 0 could go to a
      non-highorderatomic free list since the pageblock's type has changed.
      As a result, the highorderatomic unreserve logic can decrease the
      reserved count for the same pageblock several times, causing a mismatch
      between nr_reserved_highatomic and the number of reserved pageblocks.
      
      So, this patch verifies whether the pageblock is highatomic or not and
      decreases the count only if the pageblock is highatomic.
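
The guard described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: `zone_model`, `pageblock_model`, `PAGEBLOCK_NR_PAGES_MODEL` and `unreserve_block()` are hypothetical names, and the real check runs in unreserve_highatomic_pageblock() under zone->lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's migratetype and zone state. */
enum migratetype_model { MT_MOVABLE_MODEL, MT_HIGHATOMIC_MODEL };

struct pageblock_model {
	enum migratetype_model type;
};

struct zone_model {
	unsigned long nr_reserved_highatomic;	/* counted in pages */
};

#define PAGEBLOCK_NR_PAGES_MODEL 512UL	/* assumed pageblock size */

/*
 * Unreserve one pageblock.  The count is decreased only while the block
 * is still typed highatomic, so a second call on the same block (the
 * situation the race creates) leaves the count untouched.
 * Returns true if the count was decreased.
 */
static bool unreserve_block(struct zone_model *zone, struct pageblock_model *pb)
{
	unsigned long dec = PAGEBLOCK_NR_PAGES_MODEL;

	assert(zone && pb);
	if (pb->type != MT_HIGHATOMIC_MODEL)
		return false;	/* already unreserved: no second decrement */

	if (dec > zone->nr_reserved_highatomic)
		dec = zone->nr_reserved_highatomic;	/* never underflow */
	zone->nr_reserved_highatomic -= dec;
	pb->type = MT_MOVABLE_MODEL;
	return true;
}
```

With two reserved blocks (1024 pages), calling `unreserve_block()` twice on the same block drops the count to 512 once and then leaves it there.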
      
      Link: http://lkml.kernel.org/r/1476259429-18279-3-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4855e4a7
    • Minchan Kim's avatar
      mm: don't steal highatomic pageblock · 88ed365e
      Minchan Kim authored
      Patch series "use up highorder free pages before OOM", v3.
      
      I got an OOM report from the production team with a v4.4 kernel.  It had
      enough free memory but failed to allocate a GFP_KERNEL order-0 page and
      finally encountered an OOM kill.  It occurred during a QA process which
      launches several apps, switches between them and so on.  It happened
      rarely.  IOW, in a normal situation it was not a problem, but if we are
      unlucky and several apps use peak memory at the same time, it can
      happen.  If we manage to get past that phase, the system goes on
      working well.
      
      I could reproduce it easily with my test (memory spike).  See below.
      
      The reason is that the free pages (19M) of the DMA32 zone are reserved
      for HIGHORDERATOMIC and are not unreserved before the OOM.
      
        balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
        balloon cpuset=/ mems_allowed=0
        CPU: 1 PID: 8473 Comm: balloon Tainted: G        W  OE   4.8.0-rc7-00219-g3f74c9559583-dirty #3161
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          dump_header+0x5c/0x1ce
          oom_kill_process+0x22e/0x400
          out_of_memory+0x1ac/0x210
          __alloc_pages_nodemask+0x101e/0x1040
          handle_mm_fault+0xa0a/0xbf0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:383949 inactive_anon:106724 isolated_anon:0
         active_file:15 inactive_file:44 isolated_file:0
         unevictable:0 dirty:0 writeback:24 unstable:0
         slab_reclaimable:2483 slab_unreclaimable:3326
         mapped:0 shmem:0 pagetables:1906 bounce:0
         free:6898 free_pcp:291 free_cma:0
        Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
        DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
        DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
        51131 total pagecache pages
        50795 pages in swap cache
        Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
        Free swap  = 8kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12658 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      Another example, where the limit was exceeded because of the race, is:
      
        in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
        CPU: 0 PID: 476 Comm: in:imklog Tainted: G            E   4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          warn_alloc_failed+0xdb/0x130
          __alloc_pages_nodemask+0x4d6/0xdb0
          new_slab+0x339/0x490
          ___slab_alloc.constprop.74+0x367/0x480
          __slab_alloc.constprop.73+0x20/0x40
          __kmalloc+0x1a4/0x1e0
          alloc_indirect.isra.14+0x1d/0x50
          virtqueue_add_sgs+0x1c4/0x470
          __virtblk_add_req+0xae/0x1f0
          virtio_queue_rq+0x12d/0x290
          __blk_mq_run_hw_queue+0x239/0x370
          blk_mq_run_hw_queue+0x8f/0xb0
          blk_mq_insert_requests+0x18c/0x1a0
          blk_mq_flush_plug_list+0x125/0x140
          blk_flush_plug_list+0xc7/0x220
          blk_finish_plug+0x2c/0x40
          __do_page_cache_readahead+0x196/0x230
          filemap_fault+0x448/0x4f0
          ext4_filemap_fault+0x36/0x50
          __do_fault+0x75/0x140
          handle_mm_fault+0x84d/0xbe0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:363826 inactive_anon:121283 isolated_anon:32
         active_file:65 inactive_file:152 isolated_file:0
         unevictable:0 dirty:0 writeback:46 unstable:0
         slab_reclaimable:2778 slab_unreclaimable:3070
         mapped:112 shmem:0 pagetables:1822 bounce:0
         free:9469 free_pcp:231 free_cma:0
        Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
        DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
        DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
        2775 total pagecache pages
        2536 pages in swap cache
        Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
        Free swap  = 108744kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12648 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      During the investigation, I found some problems with highatomic, so this
      series aims to solve them; the final goal is to unreserve every
      highatomic free page before the OOM kill.
      
      This patch (of 4):
      
      In the page freeing path, migratetype is racy, so a highorderatomic page
      could be freed into a non-highorderatomic free list.  If that page is
      allocated, the VM can change the pageblock from highorderatomic to
      something else.  In that case, highatomic pageblock accounting is broken
      and stops working (e.g., the VM cannot reserve highorderatomic
      pageblocks any more although it hasn't reached the 1% limit).
      
      So, this patch prohibits changing a pageblock from highatomic to another
      type.  That's no problem because MIGRATE_HIGHATOMIC is not listed in the
      fallback array, so stealing will only happen due to unexpected races,
      which are really rare.  Also, this prohibition keeps highatomic
      pageblocks around longer, which should be better for highorderatomic
      page allocation.
      
      Link: http://lkml.kernel.org/r/1476259429-18279-2-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88ed365e
    • Andreas Platschek's avatar
      kmemleak: fix reference to Documentation · 22901c6c
      Andreas Platschek authored
      Documentation/kmemleak.txt was moved to
      Documentation/dev-tools/kmemleak.rst; this updates the reference to
      point to the new location.
      
      Link: http://lkml.kernel.org/r/1476544946-18804-1-git-send-email-andreas.platschek@opentech.at
      Signed-off-by: Andreas Platschek <andreas.platschek@opentech.at>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22901c6c
    • Aneesh Kumar K.V's avatar
      mm/hugetlb.c: use the right pte val for compare in hugetlb_cow · 3999f52e
      Aneesh Kumar K.V authored
      We cannot use the pte value used in set_pte_at for the pte_same
      comparison, because archs like ppc64 filter/add new pte flags in
      set_pte_at.  Instead, fetch the pte value inside hugetlb_cow.  We are
      comparing the pte value to make sure the pte didn't change since we
      dropped the page table lock.  hugetlb_cow gets called with the page
      table lock held, and we can take a copy of the pte value before we drop
      the page table lock.
      
      With hugetlbfs, we optimize the MAP_PRIVATE write fault path with no
      previous mapping (huge_pte_none entries) by forcing a cow in the fault
      path.  This avoids taking an additional fault to convert a read-only
      mapping to read/write.  Here we were comparing a recently instantiated
      pte (via set_pte_at) to the pte value from the linux page table.  As
      explained above, on ppc64 such a pte_same check returned the wrong
      result, resulting in us taking an additional fault on ppc64.
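
The comparison problem can be reproduced in a toy userspace model. This is a sketch under assumptions: `pte_model_t`, `ARCH_PTE_FLAG_MODEL`, `set_pte_at_model()` and `pte_same_model()` are hypothetical stand-ins, and the flag bit chosen here is arbitrary rather than the real ppc64 layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_model_t;

/* Hypothetical bit that the arch adds when a pte is installed. */
#define ARCH_PTE_FLAG_MODEL (1ULL << 62)

/* Install a pte; like the ppc64 case, the stored value gains a flag. */
static void set_pte_at_model(pte_model_t *slot, pte_model_t val)
{
	assert(slot);
	*slot = val | ARCH_PTE_FLAG_MODEL;
}

/* Plain value comparison, standing in for pte_same(). */
static bool pte_same_model(pte_model_t a, pte_model_t b)
{
	return a == b;
}
```

After `set_pte_at_model(&slot, new_pte)`, comparing `slot` against `new_pte` reports a change that never happened, while comparing against a copy read back from `slot` (the approach the patch describes) reports no change.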
      
      Fixes: 6a119eae ("powerpc/mm: Add a _PAGE_PTE bit")
      Link: http://lkml.kernel.org/r/20161018154245.18023-1-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Scott Wood <scottwood@freescale.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3999f52e
    • Tobias Klauser's avatar
      mm/gup.c: make unnecessarily global vma_permits_fault() static · 771ab430
      Tobias Klauser authored
      Make vma_permits_fault() static as it is only used in mm/gup.c
      
      This fixes a sparse warning.
      
      Link: http://lkml.kernel.org/r/20161017122353.31598-1-tklauser@distanz.ch
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      771ab430
    • Shaohua Li's avatar
      mm/vmscan.c: set correct defer count for shrinker · 5f33a080
      Shaohua Li authored
      Our system uses significantly more slab memory with memcg enabled with
      the latest kernel.  With a 3.10 kernel, slab uses 2G of memory, while
      with a 4.6 kernel, 6G of memory is used.  The shrinker has a problem.
      Say we have two memcgs for one shrinker.  In do_shrink_slab:
      
      1. Check cg1.  nr_deferred = 0, assume total_scan = 700.  batch size
         is 1024, then no memory is freed.  nr_deferred = 700
      
      2. Check cg2.  nr_deferred = 700.  Assume freeable = 20, then
         total_scan = 10 or 40.  Let's assume it's 10.  No memory is freed.
         nr_deferred = 10.
      
      The deferred share of cg1 is lost in this case.  kswapd will free no
      memory even if it runs the above steps again and again.
      
      The fix makes sure one memcg's deferred share isn't lost.
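
The two steps above can be replayed with a simplified model of the deferred accounting. This is a sketch, not do_shrink_slab itself: `shrinker_model` and `shrink_pass()` are hypothetical, the clamp is reduced to `freeable / 2` (the "10" case assumed above), and cg1's freeable count of 2000 and cg2's delta of 0 are assumptions made so the numbers match the description.

```c
#include <assert.h>

/* One deferred counter shared by every memcg using the shrinker. */
struct shrinker_model {
	long nr_deferred;
};

/*
 * One pass for one memcg.  With preserve_deferred == 0 the clamped
 * remainder is written back (old behaviour); with 1 the unclamped
 * amount of work minus what was scanned is written back (fixed).
 * Returns the number of objects scanned.
 */
static long shrink_pass(struct shrinker_model *s, long delta, long freeable,
			long batch, int preserve_deferred)
{
	long nr, total_scan, next_deferred, scanned = 0;

	assert(s && batch > 0);
	nr = s->nr_deferred;		/* take over the deferred work */
	s->nr_deferred = 0;

	total_scan = nr + delta;
	next_deferred = total_scan;	/* work owed before clamping */

	if (total_scan > freeable / 2)	/* simplified clamp */
		total_scan = freeable / 2;

	while (total_scan >= batch) {	/* scan only in whole batches */
		scanned += batch;
		total_scan -= batch;
	}

	if (preserve_deferred)
		next_deferred = next_deferred >= scanned ?
				next_deferred - scanned : 0;
	else
		next_deferred = total_scan;

	s->nr_deferred += next_deferred;
	return scanned;
}
```

Running cg1 (delta 700, freeable 2000) and then cg2 (delta 0, freeable 20) with a batch of 1024 leaves `nr_deferred` at 10 in the old mode and at 700 in the fixed mode.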
      
      Link: http://lkml.kernel.org/r/2414be961b5d25892060315fbb56bb19d81d0c07.1476227351.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f33a080
    • Andi Kleen's avatar
      mm/mprotect.c: don't touch single threaded PTEs which are on the right node · 3e321587
      Andi Kleen authored
      We had some problems with pages getting unmapped in single threaded
      affinitized processes.  It was tracked down to NUMA scanning.
      
      In this case it doesn't make any sense to unmap pages if the process is
      single threaded and the page is already on the node the process is
      running on.
      
      Add a check for this case into the numa protection code, and skip
      unmapping if true.
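
The check can be pictured as a small predicate. This is a sketch with hypothetical names (`numa_scan_ctx_model`, `skip_numa_unmap()`); the assumption that "single threaded" means exactly one user of the address space is an illustration, not a statement of how the patch tests it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of what the scanner knows about the task. */
struct numa_scan_ctx_model {
	int mm_users;	/* tasks sharing the address space (assumed test) */
	int cpu_node;	/* NUMA node the task is running on */
};

/* True when unmapping the page for a NUMA hinting fault is pointless. */
static bool skip_numa_unmap(const struct numa_scan_ctx_model *ctx,
			    int page_node)
{
	assert(ctx);
	return ctx->mm_users == 1 && page_node == ctx->cpu_node;
}
```

A single-threaded task on node 0 skips pages already on node 0 but still has pages on node 1 unmapped; a multi-threaded task never skips.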
      
      In theory the process could be migrated later, but we will eventually
      rescan and unmap and migrate then.
      
      In theory this could be made more fancy: remembering this state per
      process or even whole mm.  However that would need extra tracking and be
      more complicated, and the simple check seems to work fine so far.
      
      [ak@linux.intel.com: v3: Minor updates from Mel. Change code layout]
        Link: http://lkml.kernel.org/r/1476382117-5440-1-git-send-email-andi@firstfloor.org
      Link: http://lkml.kernel.org/r/1476288949-20970-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e321587
    • David Rientjes's avatar
      mm, slab: maintain total slab count instead of active count · bf00bd34
      David Rientjes authored
      Rather than tracking the number of active slabs for each node, track the
      total number of slabs.  This is a minor improvement that avoids active
      slab tracking when a slab goes from free to partial or partial to free.
      
      For slab debugging, this also removes an explicit free count since it
      can easily be inferred by the difference in number of total objects and
      number of active objects.
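
The arithmetic behind this is simple enough to sketch. The struct and helper names below are hypothetical, and keeping a `free_slabs` counter next to `total_slabs` is an assumption for the sake of the example.

```c
#include <assert.h>

/* Hypothetical per-node counters. */
struct node_slab_stats_model {
	unsigned long total_slabs;	/* tracked directly */
	unsigned long free_slabs;	/* assumed to be tracked as well */
	unsigned long objs_per_slab;
	unsigned long active_objs;
};

/* Active slabs are derived rather than counted. */
static unsigned long active_slabs(const struct node_slab_stats_model *n)
{
	assert(n && n->free_slabs <= n->total_slabs);
	return n->total_slabs - n->free_slabs;
}

/* Free objects: total objects minus active objects. */
static unsigned long free_objs(const struct node_slab_stats_model *n)
{
	unsigned long total_objs;

	assert(n);
	total_objs = n->total_slabs * n->objs_per_slab;
	assert(n->active_objs <= total_objs);
	return total_objs - n->active_objs;
}
```

For a node with 10 slabs of 8 objects each, 3 free slabs and 50 active objects, this yields 7 active slabs and 30 free objects.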
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612042020110.115755@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf00bd34
    • Greg Thelen's avatar
      mm, slab: faster active and free stats · f728b0a5
      Greg Thelen authored
      Reading /proc/slabinfo or monitoring slabtop(1) can become very
      expensive if there are many slab caches and if there are very lengthy
      per-node partial and/or free lists.
      
      Commit 07a63c41 ("mm/slab: improve performance of gathering slabinfo
      stats") addressed the per-node full lists which showed a significant
      improvement when no objects were freed.  This patch has the same
      motivation and optimizes the remainder of the usecases where there are
      very lengthy partial and free lists.
      
      This patch maintains per-node active_slabs (full and partial) and
      free_slabs rather than iterating the lists at runtime when reading
      /proc/slabinfo.
      
      When allocating 100GB of slab from a test cache where every slab page is
      on the partial list, reading /proc/slabinfo (includes all other slab
      caches on the system) takes ~247ms on average with 48 samples.
      
      As a result of this patch, the same read takes ~0.856ms on average.
      
      [rientjes@google.com: changelog]
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1611081505240.13403@chino.kir.corp.google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f728b0a5
    • Thomas Garnier's avatar
      mm/slab_common.c: check kmem_create_cache flags are common · e70954fd
      Thomas Garnier authored
      Verify that kmem_cache_create flags are not allocator specific.  This is
      done before removing flags that are not available with the current
      configuration.
      
      The current kmem_cache_create removes incorrect flags but does not
      validate that callers are using them correctly.  This change ensures
      that callers are not trying to create caches with flags that won't be
      used because they are allocator specific.
      
      Link: http://lkml.kernel.org/r/1478553075-120242-2-git-send-email-thgarnie@google.com
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e70954fd
    • Arnd Bergmann's avatar
      slub: avoid false-postive warning · 84582c8a
      Arnd Bergmann authored
      The slub allocator gives us some incorrect warnings when
      CONFIG_PROFILE_ANNOTATED_BRANCHES is set, as the unlikely() macro
      prevents it from seeing that the return code matches what it was before:
      
        mm/slub.c: In function `kmem_cache_free_bulk':
        mm/slub.c:262:23: error: `df.s' may be used uninitialized in this function [-Werror=maybe-uninitialized]
        mm/slub.c:2943:3: error: `df.cnt' may be used uninitialized in this function [-Werror=maybe-uninitialized]
        mm/slub.c:2933:4470: error: `df.freelist' may be used uninitialized in this function [-Werror=maybe-uninitialized]
        mm/slub.c:2943:3: error: `df.tail' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      I have not been able to come up with a perfect way of dealing with
      this; the three options I see are:
      
       - add a bogus initialization, which would increase the runtime overhead
       - replace unlikely() with unlikely_notrace()
       - remove the unlikely() annotation completely
      
      I checked the object code for a typical x86 configuration and the last
      two cases produce the same result, so I went for the last one, which is
      the simplest.
      
      Link: http://lkml.kernel.org/r/20161024155704.3114445-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84582c8a
    • Vladimir Davydov's avatar
      slub: move synchronize_sched out of slab_mutex on shrink · 89e364db
      Vladimir Davydov authored
      synchronize_sched() is a heavy operation and calling it per each cache
      owned by a memory cgroup being destroyed may take quite some time.  What
      is worse, it's currently called under the slab_mutex, stalling all works
      doing cache creation/destruction.
      
      Actually, there isn't much point in calling synchronize_sched() for each
      cache - it's enough to call it just once - after setting cpu_partial for
      all caches and before shrinking them.  This way, we can also move it out
      of the slab_mutex, which we have to hold for iterating over the slab
      cache list.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=172991
      Link: http://lkml.kernel.org/r/0a10d71ecae3db00fb4421bcd3f82bcc911f4be4.1475329751.git.vdavydov.dev@gmail.com
      Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reported-by: Doug Smythies <dsmythies@telus.net>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89e364db
    • Vladimir Davydov's avatar
      mm: memcontrol: use special workqueue for creating per-memcg caches · 13583c3d
      Vladimir Davydov authored
      Creating a lot of cgroups at the same time might stall all worker
      threads with kmem cache creation works, because kmem cache creation is
      done with the slab_mutex held.  The problem was amplified by commits
      801faf0d ("mm/slab: lockless decision to grow cache") in case of
      SLAB and 81ae6d03 ("mm/slub.c: replace kick_all_cpus_sync() with
      synchronize_sched() in kmem_cache_shrink()") in case of SLUB, which
      increased the maximal time the slab_mutex can be held.
      
      To prevent that from happening, let's use a special ordered single
      threaded workqueue for kmem cache creation.  This shouldn't introduce
      any functional changes regarding how kmem caches are created, as the
      work function holds the global slab_mutex during its whole runtime
      anyway, making it impossible to run more than one work at a time.  By
      using a single threaded workqueue, we just avoid creating a thread per
      each work.  Ordering is required to avoid a situation when a cgroup's
      work is put off indefinitely because there are other cgroups to serve,
      in other words to guarantee fairness.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981
      Link: http://lkml.kernel.org/r/20161004131417.GC1862@esperanza
      Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reported-by: Doug Smythies <dsmythies@telus.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      13583c3d
    • Deepa Dinamani's avatar
      ocfs2: replace CURRENT_TIME macro · c62c38f6
      Deepa Dinamani authored
      CURRENT_TIME is not y2038 safe.
      
      Use y2038 safe ktime_get_real_seconds() here for timestamps.  struct
      heartbeat_block's hb_seq and deletion time are already 64 bits wide
      and accommodate times beyond y2038.
      
      Also use y2038 safe ktime_get_real_ts64() for on disk inode timestamps.
      These are also wide enough to accommodate time64_t.
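      
      To illustrate why the field width matters, here is a small userspace
      sketch, not ocfs2 code.  A signed 32-bit seconds counter ends at
      2147483647 (2038-01-19 03:14:07 UTC), while a 64-bit field holds later
      values unchanged.  The time64_t typedef below is a stand-in for the
      kernel type, and fits_in_32bit_seconds() and to_disk_seconds() are
      hypothetical helpers.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's time64_t (assumption for this sketch). */
typedef int64_t time64_t;

/* Return 1 if a seconds value fits in a signed 32-bit field, else 0. */
static int fits_in_32bit_seconds(time64_t secs)
{
    return secs >= INT32_MIN && secs <= INT32_MAX;
}

/* Hypothetical 64-bit on-disk field: stores the seconds without truncation. */
static uint64_t to_disk_seconds(time64_t secs)
{
    return (uint64_t)secs;
}
```

      A value one second past INT32_MAX fails the 32-bit check but survives
      the 64-bit store intact.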
      
      Link: http://lkml.kernel.org/r/1475365298-29236-1-git-send-email-deepa.kernel@gmail.com
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c62c38f6
    • Deepa Dinamani's avatar
      ocfs2: use time64_t to represent orphan scan times · 395627b0
      Deepa Dinamani authored
      struct timespec is not y2038 safe.  Use time64_t, which is y2038 safe,
      to represent orphan scan times.  time64_t is sufficient here because
      only deltas in whole seconds are relevant.
      
      Also use appropriate time functions that return time in time64_t format.
      The time functions now return monotonic time instead of real time, as
      only delta scan times are relevant and these values are not persistent
      across reboots.
      
      The format string for the debug print is still using long as this is
      only the time elapsed since the last scan and long is sufficient to
      represent this value.
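      
      A rough userspace sketch of the delta computation, under stated
      assumptions: the scan time is kept as 64-bit monotonic seconds and only
      the elapsed value is narrowed to long for printing.  struct
      orphan_scan_state, record_scan() and secs_since_last_scan() are
      hypothetical names, and the caller is assumed to supply "now" from a
      monotonic clock (in the kernel, for example, ktime_get_seconds()).

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's time64_t (assumption for this sketch). */
typedef int64_t time64_t;

/* Hypothetical state: monotonic seconds at which the last scan ran. */
struct orphan_scan_state {
    time64_t last_scan;
};

/* Remember when the scan ran; `now` comes from a monotonic clock. */
static void record_scan(struct orphan_scan_state *s, time64_t now)
{
    s->last_scan = now;
}

/* Elapsed seconds since the last scan; small enough to print as long. */
static long secs_since_last_scan(const struct orphan_scan_state *s,
                                 time64_t now)
{
    return (long)(now - s->last_scan);
}
```

      Because both readings come from the same monotonic clock, the
      difference is unaffected by wall-clock adjustments.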
      
      Link: http://lkml.kernel.org/r/1475365138-20567-1-git-send-email-deepa.kernel@gmail.com
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      395627b0
    • Ashish Samant's avatar
      ocfs2: fix double put of refcount tree in ocfs2_lock_refcount_tree() · 4131d538
      Ashish Samant authored
      In ocfs2_lock_refcount_tree(), if ocfs2_read_refcount_block() returns an
      error, we call ocfs2_refcount_tree_put() twice (once in
      ocfs2_unlock_refcount_tree() and once outside it), thereby dropping the
      refcount of the refcount tree twice, but we don't delete the tree in this
      case.  This brings the tree's refcount to 0, and
      ocfs2_refcount_tree_put() eventually calls ocfs2_mark_lockres_freeing(),
      setting OCFS2_LOCK_FREEING for the refcount_tree->rf_lockres.
      
      The error returned by ocfs2_read_refcount_block() is propagated all the
      way back, and on the next iteration of the write,
      ocfs2_lock_refcount_tree() gets the same tree back from
      ocfs2_get_refcount_tree() because we haven't deleted the tree.  Now we
      have the same tree, but OCFS2_LOCK_FREEING is set for rf_lockres and
      eventually, when __ocfs2_lock_refcount_tree() is called in this
      iteration, the BUG_ON (__ocfs2_cluster_lock:1395 ERROR: Cluster lock
      called on freeing lockres T00000000000000000386019775b08d! flags 0x81)
      is triggered.
      
      Call stack:
      
        (loop16,11155,0):ocfs2_lock_refcount_tree:482 ERROR: status = -5
        (loop16,11155,0):ocfs2_refcount_cow_hunk:3497 ERROR: status = -5
        (loop16,11155,0):ocfs2_refcount_cow:3560 ERROR: status = -5
        (loop16,11155,0):ocfs2_prepare_inode_for_refcount:2111 ERROR: status = -5
        (loop16,11155,0):ocfs2_prepare_inode_for_write:2190 ERROR: status = -5
        (loop16,11155,0):ocfs2_file_write_iter:2331 ERROR: status = -5
        (loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: bug expression:
        lockres->l_flags & OCFS2_LOCK_FREEING
      
        (loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: Cluster lock called on
        freeing lockres T00000000000000000386019775b08d! flags 0x81
      
        kernel BUG at fs/ocfs2/dlmglue.c:1395!
      
        invalid opcode: 0000 [#1] SMP  CPU 0
        Modules linked in: tun ocfs2 jbd2 xen_blkback xen_netback xen_gntdev .. sd_mod crc_t10dif ext3 jbd mbcache
        RIP: __ocfs2_cluster_lock+0x31c/0x740 [ocfs2]
        RSP: e02b:ffff88017c0138a0  EFLAGS: 00010086
        Process loop16 (pid: 11155, threadinfo ffff88017c010000, task ffff8801b5374300)
        Call Trace:
           ocfs2_refcount_lock+0xae/0x130 [ocfs2]
           __ocfs2_lock_refcount_tree+0x29/0xe0 [ocfs2]
           ocfs2_lock_refcount_tree+0xdd/0x320 [ocfs2]
           ocfs2_refcount_cow_hunk+0x1cb/0x440 [ocfs2]
           ocfs2_refcount_cow+0xa9/0x1d0 [ocfs2]
           ocfs2_prepare_inode_for_refcount+0x115/0x200 [ocfs2]
           ocfs2_prepare_inode_for_write+0x33b/0x470 [ocfs2]
           ocfs2_file_write_iter+0x220/0x8c0 [ocfs2]
           aio_write_iter+0x2e/0x30
      
      Fix this by avoiding the second call to ocfs2_refcount_tree_put().
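      
      The double put can be modelled in a few lines of userspace C.  This is
      a simplified sketch, not the ocfs2 code: struct tree, tree_get(),
      tree_put(), unlock_tree() and the two lock_tree_*() variants are
      hypothetical stand-ins, and the "freeing" flag plays the role of
      OCFS2_LOCK_FREEING.  The -5 return value mirrors the "status = -5"
      lines in the log above.

```c
/* Hypothetical model of a refcounted tree that is cached elsewhere. */
struct tree {
    int refs;      /* reference count; the cache holds one reference */
    int freeing;   /* set once the count hits zero */
};

static void tree_get(struct tree *t)
{
    t->refs++;
}

static void tree_put(struct tree *t)
{
    if (--t->refs == 0)
        t->freeing = 1;
}

/* Unlocking also drops the reference taken by the lock path. */
static void unlock_tree(struct tree *t)
{
    tree_put(t);
}

/* Buggy error path: the reference is dropped by unlock and then again. */
static int lock_tree_buggy(struct tree *t, int read_fails)
{
    tree_get(t);
    if (read_fails) {
        unlock_tree(t);
        tree_put(t);      /* second put: steals the cache's reference */
        return -5;
    }
    return 0;
}

/* Fixed error path: unlock_tree() alone balances tree_get(). */
static int lock_tree_fixed(struct tree *t, int read_fails)
{
    tree_get(t);
    if (read_fails) {
        unlock_tree(t);
        return -5;
    }
    return 0;
}
```

      Starting from a tree whose only reference is the cache's, the buggy
      variant leaves it marked as freeing while it is still reachable, so
      the next lock attempt finds a tree in the freeing state.  The fixed
      variant leaves the count where it started.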
      
      Link: http://lkml.kernel.org/r/1473984404-32011-1-git-send-email-ashish.samant@oracle.com
      Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
      Reviewed-by: Eric Ren <zren@suse.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4131d538
    • piaojun's avatar
      ocfs2: clean up unused 'page' parameter in ocfs2_write_end_nolock() · 07f38d97
      piaojun authored
      The 'page' parameter in ocfs2_write_end_nolock() is never used, so
      remove it.
      
      Link: http://lkml.kernel.org/r/582FD91A.5000902@huawei.com
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07f38d97
    • piaojun's avatar
      ocfs2/dlm: clean up dead code in dlm_master_request_handler() · 28bb5ef4
      piaojun authored
      When 'dispatch_assert' is set, 'response' must be DLM_MASTER_RESP_YES
      and 'res' can't be NULL, so execution can never reach these two
      branches.
      
      Link: http://lkml.kernel.org/r/58174C91.3040004@huawei.com
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28bb5ef4