10 Oct, 2014 · 40 commits
    • lib/genalloc.c: add power aligned algorithm · 505e3be6
      Laura Abbott authored
      One of the more common algorithms used for allocation is to align the
      start address of the allocation to the order of the requested size.
      Add this as an algorithm option for genalloc.
      Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Olof Johansson <olof@lixom.net>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David Riley <davidriley@chromium.org>
      Cc: Ritesh Harjain <ritesh.harjani@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thierry Reding <thierry.reding@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
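      For reference, a hedged usage sketch in C (the pool and size are
      illustrative; gen_pool_first_fit_order_align is the algorithm this
      commit adds):

        #include <linux/genalloc.h>
        #include <linux/sizes.h>

        /*
         * With the power-aligned algorithm selected, the start address of
         * an allocation is aligned to the order of the requested size, so
         * an 8K request comes back 8K-aligned.
         */
        static unsigned long alloc_power_aligned(struct gen_pool *pool)
        {
                gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);
                return gen_pool_alloc(pool, SZ_8K);
        }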
    • mm: remove misleading ARCH_USES_NUMA_PROT_NONE · 6a33979d
      Mel Gorman authored
      ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented
      _PAGE_NUMA using _PROT_NONE.  This saved using an additional PTE bit and
      relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting
      fault scanner.  This was found to be conceptually confusing with a lot of
      implicit assumptions and it was asked that an alternative be found.
      
      Commit c46a7c81 ("x86: define _PAGE_NUMA by reusing software bits on
      the PMD and PTE levels") redefined _PAGE_NUMA on x86 to be one of the
      swap PTE bits and shrank the maximum possible swap size, but it did
      not go far enough.  There are no architectures that reuse _PROT_NONE
      as _PROT_NUMA, but the relics still exist.
      
      This patch removes ARCH_USES_NUMA_PROT_NONE and removes some
      unnecessary duplication between powerpc and the generic implementation
      by defining the types the core NUMA helpers expect to exist in terms
      of their ppc64 equivalents.  This necessitated creating a PTE bit mask
      that identifies the bits distinguishing present from NUMA pte entries,
      but it is expected this will only differ between arches based on
      _PAGE_PROTNONE.  The naming of the generic helpers was taken from x86
      originally, but ppc64 has types that are equivalent for the purposes
      of the helpers, so they are mapped instead of duplicating code.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug: add sysfs valid_zones attribute · ed2f2400
      Zhang Zhen authored
      Currently memory-hotplug has two limits:
      
      1. If the memory block is in ZONE_NORMAL, you can change it to
         ZONE_MOVABLE, but this memory block must be adjacent to ZONE_MOVABLE.
      
      2. If the memory block is in ZONE_MOVABLE, you can change it to
         ZONE_NORMAL, but this memory block must be adjacent to ZONE_NORMAL.
      
      With this patch, we can easily see which zone a memory block can be
      onlined to, without needing to know the above two limits.
      
      Updated the related Documentation.
      
      [akpm@linux-foundation.org: use conventional comment layout]
      [akpm@linux-foundation.org: fix build with CONFIG_MEMORY_HOTREMOVE=n]
      [akpm@linux-foundation.org: remove unused local zone_prev]
      Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vishnu.ps authored · cc71aba3
    • mm/slab: use percpu allocator for cpu cache · bf0dea23
      Joonsoo Kim authored
      Because of a chicken-and-egg problem, the initialization of SLAB is
      really complicated.  We need to allocate the cpu cache through SLAB to
      make kmem_cache work, but before kmem_cache is initialized, allocation
      through SLAB is impossible.

      On the other hand, SLUB does its initialization in a simpler way.  It
      uses the percpu allocator to allocate the cpu cache, so there is no
      chicken-and-egg problem.

      So, this patch tries to use the percpu allocator in SLAB.  This
      simplifies the initialization step in SLAB so that we can maintain the
      SLAB code more easily.
      
      In my testing there is no performance difference.
      
      This implementation relies on the percpu allocator.  Because the
      percpu allocator uses vmalloc address space, vmalloc address space
      could be exhausted by this change on a many-cpu system with a *32 bit*
      kernel.  This implementation can cover 1024 cpus in the worst case, by
      the following calculation.
      
      Worst: 1024 cpus * 4 bytes for pointer * 300 kmem_caches *
      	120 objects per cpu_cache = 140 MB
      Normal: 1024 cpus * 4 bytes for pointer * 150 kmem_caches(slab merge) *
      	80 objects per cpu_cache = 46 MB
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jeremiah Mahler <jmmahler@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
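      A minimal sketch of the idea, under stated assumptions
      (__alloc_percpu() is the real percpu API; the helper names here are
      hypothetical): the per-cpu array_cache is carved out of percpu space
      instead of being allocated through SLAB itself, which breaks the
      chicken-and-egg dependency during boot:

        static struct array_cache __percpu *alloc_cpu_cache(int entries,
                                                            int batchcount)
        {
                size_t size = sizeof(struct array_cache) +
                              entries * sizeof(void *);
                struct array_cache __percpu *cpu_cache;
                int cpu;

                cpu_cache = __alloc_percpu(size, sizeof(void *));
                if (!cpu_cache)
                        return NULL;

                for_each_possible_cpu(cpu) {
                        /* hypothetical initializer: set limit/batchcount */
                        init_arraycache(per_cpu_ptr(cpu_cache, cpu),
                                        entries, batchcount);
                }

                return cpu_cache;
        }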
    • mm/slab: support slab merge · 12220dea
      Joonsoo Kim authored
      Slab merge is a useful feature for reducing fragmentation.  If a newly
      created slab cache has a similar size and properties to an existing
      one, this feature reuses the existing cache rather than creating a new
      one.  As a result, objects are packed into fewer slabs and
      fragmentation is reduced.
      
      Below is the result of my testing.
      
      * After boot, sleep 20; cat /proc/meminfo | grep Slab
      
      <Before>
      Slab: 25136 kB
      
      <After>
      Slab: 24364 kB
      
      We can save about 3% of the memory used by slab.
      
      To support this feature in SLAB, we need to implement the SLAB
      specific kmem_cache_flag() and __kmem_cache_alias(), because SLUB
      performs some SLUB specific processing related to the debug flag and
      object size changes in these functions.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab_common: commonize slab merge logic · 423c929c
      Joonsoo Kim authored
      Slab merge is a useful feature for reducing fragmentation.  Now, it is
      only applied to SLUB, but it would be good to apply it to SLAB as
      well.  This patch is a preparation step for that: it commonizes the
      slab merge logic.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
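      A sketch of the commonized merge test, under stated assumptions (the
      helper name and the SLAB_MERGE_SAME mask are modeled on
      find_mergeable() in mm/slab_common.c, not quoted from it): a new cache
      may reuse an existing one only if the object fits, little space is
      wasted, and merge-relevant flags and alignment agree:

        static bool caches_mergeable(struct kmem_cache *s, size_t size,
                                     size_t align, unsigned long flags)
        {
                if (s->size < size)                     /* object won't fit */
                        return false;
                if (s->size - size >= sizeof(void *))   /* too much waste */
                        return false;
                if ((flags & SLAB_MERGE_SAME) != (s->flags & SLAB_MERGE_SAME))
                        return false;                   /* flag mismatch */
                if (align && (s->size & (align - 1)))
                        return false;                   /* bad alignment */
                return true;
        }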
    • slab: fix for_each_kmem_cache_node() · 9163582c
      Mikulas Patocka authored
      Fix a bug (discovered with kmemcheck) in for_each_kmem_cache_node().
      The for loop read the "node" array before verifying that the index is
      within range, which resulted in a kmemcheck warning.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
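      A sketch of the bug pattern (the exact macro shape is an assumption):
      the buggy loop evaluated get_node() before checking the index against
      nr_node_ids, reading one element past the end of the array; testing
      the index first fixes it:

        /* buggy: the array read happens before the bounds check */
        #define for_each_kmem_cache_node_buggy(__s, __node, __n)          \
                for (__node = 0;                                          \
                     __n = get_node(__s, __node), __node < nr_node_ids;   \
                     __node++)                                            \
                        if (__n)

        /* fixed: the bounds check guards the array read */
        #define for_each_kmem_cache_node(__s, __node, __n)                \
                for (__node = 0; __node < nr_node_ids; __node++)          \
                        if ((__n = get_node(__s, __node)))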
    • kernel/kthread.c: partial revert of 81c98869 ("kthread: ensure locality of... · 10922838
      Nishanth Aravamudan authored
      kernel/kthread.c: partial revert of 81c98869 ("kthread: ensure locality of task_struct allocations")
      
      After discussions with Tejun, we don't want to spread the use of
      cpu_to_mem() (and thus knowledge of allocators/NUMA topology details) into
      callers, but would rather ensure the callees correctly handle memoryless
      nodes.  With the previous patches ("topology: add support for
      node_to_mem_node() to determine the fallback node" and "slub: fallback to
      node_to_mem_node() node if allocating on memoryless node") adding and
      using node_to_mem_node(), we can safely undo part of the change to the
      kthread logic from 81c98869.
      Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: fall back to node_to_mem_node() node if allocating on memoryless node · a561ce00
      Joonsoo Kim authored
      Update the SLUB code to search for partial slabs on the nearest node with
      memory in the presence of memoryless nodes.  Additionally, do not consider
      it to be an ALLOC_NODE_MISMATCH (and deactivate the slab) when a
      memoryless-node specified allocation goes off-node.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
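      A hedged sketch of the redirect (the helper is hypothetical; the real
      logic sits inline in the partial-slab search in mm/slub.c):

        /* pick the node on which to search for partial slabs */
        static int slab_search_node(int node)
        {
                int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id()
                                                        : node;

                /* memoryless node: use its nearest node with memory */
                if (!node_present_pages(searchnode))
                        searchnode = node_to_mem_node(searchnode);

                return searchnode;
        }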
    • topology: add support for node_to_mem_node() to determine the fallback node · ad2c8144
      Joonsoo Kim authored
      Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that
      on ppc LPARs with memoryless nodes, a large amount of memory was
      consumed by slabs and was marked unreclaimable.  He tracked it down to
      slab deactivations in the SLUB core when we allocate remotely, leading
      to poor efficiency whenever memoryless nodes are present.
      
      After much discussion, Joonsoo provided a few patches that help
      significantly.  They don't resolve the problem altogether:
      
       - memory hotplug still needs testing, that is, when a memoryless node
         becomes memory-ful, we want to do the right thing
       - there are other reasons for going off-node than memoryless nodes,
         e.g., fully exhausted local nodes
      
      Neither case is resolved with this series, but I don't think that should
      block their acceptance, as they can be explored/resolved with follow-on
      patches.
      
      The series consists of:
      
      [1/3] topology: add support for node_to_mem_node() to determine the
            fallback node
      
      [2/3] slub: fallback to node_to_mem_node() node if allocating on
            memoryless node
      
            - Joonsoo's patches to cache the nearest node with memory for each
              NUMA node
      
      [3/3] Partial revert of 81c98869 ("kthread: ensure locality of
            task_struct allocations")
      
       - At Tejun's request, keep the knowledge of memoryless node fallback
         to the allocator core.
      
      This patch (of 3):
      
      We need to determine the fallback node in the slub allocator if the
      allocation target node is memoryless.  Without it, SLUB wrongly
      selects a node which has no memory and can't use a partial slab,
      because of the node mismatch.  The introduced function,
      node_to_mem_node(X), returns a node Y with memory that is nearest to
      X.  If X is a memoryless node, it returns the nearest node with
      memory; if X is a normal node, it returns itself.
      
      We will use this function in the following patch to determine the
      fallback node.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
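      A minimal sketch of the new helper, assuming the shape used under
      CONFIG_HAVE_MEMORYLESS_NODES in include/linux/topology.h (the setter
      name here is hypothetical):

        extern int _node_numa_mem_[MAX_NUMNODES];

        /* record node's nearest node with memory (itself, if it has any) */
        static inline void set_node_numa_mem(int node, int mem_node)
        {
                _node_numa_mem_[node] = mem_node;
        }

        static inline int node_to_mem_node(int node)
        {
                return _node_numa_mem_[node];
        }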
    • slub: disable tracing and failslab for merged slabs · c9e16131
      Christoph Lameter authored
      Tracing of mergeable slabs, as well as uses of failslab, is confusing
      since the objects of multiple slab caches will be affected.  Moreover,
      this creates a situation where a mergeable slab becomes unmergeable.

      If tracing or failslab testing is desired, then it may be best to
      switch merging off for starters.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Tested-by: WANG Chao <chaowang@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: factor out unlikely part of cache_free_alien() · 25c4f304
      Joonsoo Kim authored
      cache_free_alien() is a rarely used function, called only on node
      mismatch.  But it is defined with the inline attribute, so it is
      inlined into __cache_free(), the core free function of the slab
      allocator.  This uselessly makes the kmem_cache_free()/kfree()
      functions large.  What we really need to inline is just the node-match
      check, so this patch factors the other parts of cache_free_alien() out
      to reduce the code size of kmem_cache_free()/kfree().
      
      <Before>
      nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
      00000000000011e0 0000000000000228 T kfree
      0000000000000670 0000000000000216 T kmem_cache_free
      
      <After>
      nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
      0000000000001110 00000000000001b5 T kfree
      0000000000000750 0000000000000181 T kmem_cache_free
      
      You can see the slightly reduced text sizes: 0x228->0x1b5, 0x216->0x181.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
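      A sketch of the split, with assumed function bodies (the real code is
      in mm/slab.c): only the cheap node-match check stays inline so that
      every kmem_cache_free()/kfree() call site stays small, while the rare
      off-node path moves into a noinline helper:

        static noinline int __cache_free_alien(struct kmem_cache *cachep,
                                               void *objp, int node,
                                               int page_node)
        {
                /* rare path: give the object to the remote node's cache */
                return 1;
        }

        static inline int cache_free_alien(struct kmem_cache *cachep,
                                           void *objp)
        {
                int page_node = page_to_nid(virt_to_page(objp));
                int node = numa_mem_id();

                /* common case: the object belongs to this node */
                if (likely(node == page_node))
                        return 0;

                return __cache_free_alien(cachep, objp, node, page_node);
        }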
    • mm/slab: noinline __ac_put_obj() · d3aec344
      Joonsoo Kim authored
      The intention of __ac_put_obj() is that it should not affect anything
      if sk_memalloc_socks() is disabled.  But because __ac_put_obj() is so
      small, the compiler inlines it into ac_put_obj(), which affects the
      code size of the free path.  This patch adds the noinline keyword to
      __ac_put_obj() so that the normal free path is not disrupted at all.
      
      <Before>
      nm -S slab-orig.o |
      	grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"
      
      0000000000001e80 00000000000002f5 t cache_alloc_refill
      0000000000001230 0000000000000258 T kfree
      0000000000000690 000000000000024c T kmem_cache_free
      
      <After>
      nm -S slab-patched.o |
      	grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"
      
      0000000000001e00 00000000000002e5 t cache_alloc_refill
      00000000000011e0 0000000000000228 T kfree
      0000000000000670 0000000000000216 T kmem_cache_free
      
      cache_alloc_refill: 0x2f5->0x2e5
      kfree: 0x258->0x228
      kmem_cache_free: 0x24c->0x216

      The code size of each function is slightly reduced.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: move cache_flusharray() out of unlikely.text section · 3d880194
      Joonsoo Kim authored
      Currently, due to a likely keyword, the compiled code of
      cache_flusharray() is placed in the unlikely.text section.  Although
      it is an uncommon case compared to the free-to-cpu-cache case, it is
      more common than free_block(), yet free_block() is in the normal text
      section.  This patch fixes the odd situation by removing the likely
      keyword.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/sl[ao]b: always track caller in kmalloc_(node_)track_caller() · 61f47105
      Joonsoo Kim authored
      Now, we track the caller only if tracing or slab debugging is enabled.
      If they are disabled, we could save the overhead of passing one extra
      argument by calling __kmalloc(_node)().  But I think that would be
      marginal.  Furthermore, the default slab allocator, SLUB, doesn't use
      this technique, so I think it's okay to change this situation.

      After this change, we can turn CONFIG_DEBUG_SLAB on/off without a full
      kernel build and remove some complicated '#if' definitions.  That
      looks more beneficial to me.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
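      After this change the wrappers can unconditionally forward the
      caller's return address; a sketch of the assumed shape (matching the
      style of include/linux/slab.h):

        #define kmalloc_track_caller(size, flags) \
                __kmalloc_track_caller(size, flags, _RET_IP_)

        #define kmalloc_node_track_caller(size, flags, node) \
                __kmalloc_node_track_caller(size, flags, node, _RET_IP_)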
    • mm/slab_common: move kmem_cache definition to internal header · 07f361b2
      Joonsoo Kim authored
      We don't need to keep the kmem_cache definition in
      include/linux/slab.h if we don't need to inline kmem_cache_size().
      According to my code inspection, this function is only called at
      lc_create() in lib/lru_cache.c, which may be called during some
      initialization phase, so we don't need to inline it.  Therefore, move
      it to slab_common.c and move the kmem_cache definition to the internal
      header.

      After this change, we can change the kmem_cache definition easily
      without a full kernel build.  For instance, we can turn
      CONFIG_SLUB_STATS on/off without a full kernel build.
      
      [akpm@linux-foundation.org: export kmem_cache_size() to modules]
      [rdunlap@infradead.org: add header files to fix kmemcheck.c build errors]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
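      A sketch of the moved function (assumed to match slab_common.c after
      this change): once it is out of line, the struct kmem_cache layout can
      stay private to mm/:

        unsigned int kmem_cache_size(struct kmem_cache *s)
        {
                return s->object_size;  /* layout is now private to mm/ */
        }
        EXPORT_SYMBOL(kmem_cache_size);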
    • mm/slab_common.c: suppress warning · 3aa24f51
      Andrew Morton authored
      False positive:
      
      mm/slab_common.c: In function 'kmem_cache_create':
      mm/slab_common.c:204: warning: 's' may be used uninitialized in this function
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/kcore.c: don't add modules range to kcore if it's equal to vmalloc range · bf3e2692
      Baoquan He authored
      On some ARCHs the modules range is equal to the vmalloc range.  E.g. on i686
      
      	"#define MODULES_VADDR   VMALLOC_START"
      	"#define MODULES_END     VMALLOC_END"
      
      This causes 2 duplicate program segments in /proc/kcore, with no flag
      to indicate that they are different.  This is confusing.  And usually
      people who need to check the elf header or read the content of kcore
      will check the memory ranges.  Two identical program segments are
      unnecessary.

      So check whether the modules range is equal to the vmalloc range.  If
      so, just skip adding the modules range.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
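      A hedged sketch of the check (the wrapper function is hypothetical;
      kclist_add() and KCORE_VMALLOC are the existing /proc/kcore
      primitives):

        static struct kcore_list kcore_modules;

        static void add_modules_range(void)
        {
                /* skip when modules share the vmalloc range (e.g. i686) */
                if (MODULES_VADDR != VMALLOC_START ||
                    MODULES_END != VMALLOC_END)
                        kclist_add(&kcore_modules, (void *)MODULES_VADDR,
                                   MODULES_END - MODULES_VADDR,
                                   KCORE_VMALLOC);
        }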
    • proc/maps: make vm_is_stack() logic namespace-friendly · 58cb6548
      Oleg Nesterov authored
      - Rename vm_is_stack() to task_of_stack() and change it to return
        "struct task_struct *" rather than the global (and thus wrong in
        general) pid_t.
      
      - Add the new pid_of_stack() helper which calls task_of_stack() and
        uses the right namespace to report the correct pid_t.
      
        Unfortunately we need to define this helper twice, in task_mmu.c
        and in task_nommu.c.  Perhaps it makes sense to add fs/proc/util.c
        and move at least pid_of_stack/task_of_stack there to avoid the
        code duplication.
      
      - Change show_map_vma() and show_numa_map() to use the new helper.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
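      A sketch of the namespace-aware helper, assuming the shape described
      above (for procfs, inode->i_sb->s_fs_info is the pid namespace of the
      mount):

        static pid_t pid_of_stack(struct proc_maps_private *priv,
                                  struct vm_area_struct *vma, bool is_pid)
        {
                struct inode *inode = priv->inode;
                struct task_struct *task;
                pid_t ret = 0;

                rcu_read_lock();
                task = pid_task(proc_pid(inode), PIDTYPE_PID);
                if (task) {
                        task = task_of_stack(task, vma, is_pid);
                        /* report the pid in the opener's namespace */
                        if (task)
                                ret = task_pid_nr_ns(task,
                                                     inode->i_sb->s_fs_info);
                }
                rcu_read_unlock();

                return ret;
        }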
    • proc/maps: replace proc_maps_private->pid with "struct inode *inode" · 2c03376d
      Oleg Nesterov authored
      m_start() can use get_proc_task() instead, and "struct inode *"
      provides more potentially useful info; see the next changes.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/task_nommu.c: don't use priv->task->mm · 47fecca1
      Oleg Nesterov authored
      I do not know if CONFIG_PREEMPT/SMP is possible without CONFIG_MMU,
      but either way the usage of task->mm in m_stop() is unsafe.  The task
      can exit/exec before we take mmap_sem; in this case m_stop() can hit
      NULL or unlock the wrong rw_semaphore.

      Also, this code uses priv->task != NULL to decide whether we need
      up_read/mmput.  This is correct, but we will probably kill priv->task.
      Change m_start/m_stop to rely on IS_ERR_OR_NULL() like task_mmu.c does.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Greg Ungerer <gerg@uclinux.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/task_nommu.c: shift mm_access() from m_start() to proc_maps_open() · 27692cd5
      Oleg Nesterov authored
      Copy-and-paste the changes from "fs/proc/task_mmu.c: shift mm_access()
      from m_start() to proc_maps_open()" into task_nommu.c.
      
      Change maps_open() to initialize priv->mm using proc_mem_open(), so
      m_start() can rely on atomic_inc_not_zero(mm_users) like task_mmu.c does.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Greg Ungerer <gerg@uclinux.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/task_nommu.c: change maps_open() to use __seq_open_private() · ce34fddb
      Oleg Nesterov authored
      Cleanup and preparation. maps_open() can use __seq_open_private()
      like proc_maps_open() does.
      
      [akpm@linux-foundation.org: deuglify]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Greg Ungerer <gerg@uclinux.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
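      A sketch of the simplified open path (assumed shape):
      __seq_open_private() allocates and zeroes the private data, attaches
      it to the seq_file, and returns it:

        static int maps_open(struct inode *inode, struct file *file,
                             const struct seq_operations *ops)
        {
                struct proc_maps_private *priv;

                priv = __seq_open_private(file, ops, sizeof(*priv));
                if (!priv)
                        return -ENOMEM;

                priv->pid = proc_pid(inode);
                return 0;
        }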
    • fs/proc/task_mmu.c: update m->version in the main loop in m_start() · 557c2d8a
      Oleg Nesterov authored
      Change the main loop in m_start() to update m->version.  Mostly for
      consistency, but this can help to avoid the same loop if the very
      first ->show() fails due to seq_overflow().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/task_mmu.c: reintroduce m->version logic · b8c20a9b
      Oleg Nesterov authored
      Add the "last_addr" optimization back. Like before, every ->show()
      method checks !seq_overflow() and sets m->version = vma->vm_start.
      
      However, it also checks that m_next_vma(vma) != NULL, otherwise it
      sets m->version = -1 for the lockless "EOF" fast-path in m_start().
      
      m_start() can simply do find_vma() + m_next_vma() if last_addr is
      not zero, the code looks clear and simple and this case is clearly
      separated from "scan vmas" path.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/task_mmu.c: introduce m_next_vma() helper · ad2a00e4
      Oleg Nesterov authored
      Extract the tail_vma/vm_next calculation from m_next() into the new
      trivial helper, m_next_vma().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
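      The helper is small enough to quote in sketch form (assumed to match
      the description above): walk ->vm_next and fall back to the gate vma
      (tail_vma) after the last real mapping:

        static struct vm_area_struct *
        m_next_vma(struct proc_maps_private *priv, struct vm_area_struct *vma)
        {
                if (vma == priv->tail_vma)  /* gate vma already reported */
                        return NULL;
                return vma->vm_next ?: priv->tail_vma;
        }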
    • fs/proc/task_mmu.c: simplify m_start() to make it readable · 0c255321
      Oleg Nesterov authored
      Now that m->version is gone we can clean up m_start().  In particular,
      
        - Remove the "unsigned long" typecast, m->index can't be negative
          or exceed ->map_count.  But let's use "unsigned int pos" to make
          it clear that "pos < map_count" is safe.
      
        - Remove the unnecessary "vma != NULL" check in the main loop. It
          can't be NULL unless we have a vm bug.
      
        - This also means that "pos < map_count" case can simply return the
          valid vma and avoid "goto" and subsequent checks.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/task_mmu.c: kill the suboptimal and confusing m->version logic · ebb6cdde
      Oleg Nesterov authored
      m_start() carefully documents, checks, and sets "m->version = -1" if
      we are going to return NULL.  The only problem is that we will never
      be called again if m_start() returns NULL, so this is simply pointless
      and misleading.
      
      Otoh, the ->show() methods set m->version = 0 if vma == tail_vma and
      this is just wrong, we want -1 in this case.  And in fact we also
      want -1 if ->vm_next == NULL and ->tail_vma == NULL.
      
      And it is not used consistently, the "scan vmas" loop in m_start()
      should update last_addr too.
      
      Finally, imo the whole "last_addr" logic in m_start() looks horrible.
      find_vma(last_addr) is called unconditionally even if we are not going
      to use the result.  But the main problem is that this code participates
      in the tail_vma-or-NULL mess, and this looks simply unfixable.
      
      Remove this optimization. We will add it back after some cleanups.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/task_mmu.c: shift "priv->task = NULL" from m_start() to m_stop() · 0d5f5f45
      Oleg Nesterov authored
      1. There is no reason to reset ->tail_vma in m_start(); if we return
         IS_ERR_OR_NULL() it won't be used.

      2. m_start() also clears priv->task to ensure that m_stop() won't use
         the stale pointer if we fail before get_task_struct().  But this is
         ugly and confusing, so move this initialization into m_stop().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/task_mmu.c: cleanup the "tail_vma" horror in m_next() · 23d54837
      Oleg Nesterov authored
      1. Kill the first "vma != NULL" check.  First, this is not possible:
         m_next() won't be called if ->start() or the previous ->next()
         returns NULL.

         And if it were possible, the 2nd "vma != tail_vma" check would be
         buggy; we should not wrongly return ->tail_vma.

      2. Make this function readable.  The logic is very simple: do the
         "vma != tail_vma" check once and return "vm_next || tail_vma".
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/task_mmu.c: simplify the vma_stop() logic · 59b4bf12
      Oleg Nesterov authored
      m_start() drops ->mmap_sem and does mmput() if it returns the vsyscall
      vma.  This is because in this case m_stop()->vma_stop() obviously
      can't use gate_vma->vm_mm.
      
      Now that we have proc_maps_private->mm we can simplify this logic:
      
        - Change m_start() to return with ->mmap_sem held unless it returns
          IS_ERR_OR_NULL().
      
        - Change vma_stop() to use priv->mm and avoid the ugly vma checks,
          this makes "vm_area_struct *vma" unnecessary.
      
        - This also allows m_start() to use vma_stop().
      
        - Cleanup m_next() to follow the new locking rule.
      
          Note: m_stop() looks very ugly, and this temporary uglifies it
          even more. Fixed by the next change.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open() · 29a40ace
      Oleg Nesterov authored
      A simple test-case from Kirill Shutemov
      
      	cat /proc/self/maps >/dev/null
      	chmod +x /proc/self/net/packet
      	exec /proc/self/net/packet
      
      makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in
      the opposite order.
      
      It's a false positive, and probably we should not allow "chmod +x" on
      proc files.  Still, I think that we should avoid mm_access() and
      cred_guard_mutex in sys_read() paths; security checking should happen
      at open time.  Besides, this doesn't even look right if the task
      changes its ->mm between m_stop() and m_start().
      
      Add the new "mm_struct *mm" member to struct proc_maps_private and
      change proc_maps_open() to initialize it using proc_mem_open().
      Change m_start() to use priv->mm if atomic_inc_not_zero(mm_users)
      succeeds, or return NULL (eof) otherwise.
      
      The only complication is that proc_maps_open() users should additionally do
      mmdrop() in fop->release(), add the new proc_map_release() helper for that.
      
      Note: this is a user-visible change.  If the task execs after
      open("maps"), the new ->mm won't be visible via this file.  I hope
      this is fine, and it matches /proc/pid/mem behaviour.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
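      A sketch of the new open path under the assumptions above
      (PTRACE_MODE_READ is the access mode used by /proc/pid/mem-style
      checks):

        static int proc_maps_open(struct inode *inode, struct file *file,
                                  const struct seq_operations *ops,
                                  int psize)
        {
                struct proc_maps_private *priv;

                priv = __seq_open_private(file, ops, psize);
                if (!priv)
                        return -ENOMEM;

                priv->inode = inode;
                priv->mm = proc_mem_open(inode, PTRACE_MODE_READ);
                if (IS_ERR(priv->mm)) {
                        int err = PTR_ERR(priv->mm);

                        seq_release_private(inode, file);
                        return err;
                }

                return 0;
        }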
    • proc: introduce proc_mem_open() · 5381e169
      Oleg Nesterov authored
      Extract the mm_access() code from __mem_open() into the new helper,
      proc_mem_open(), the next patch will add another caller.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
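      A sketch of the extracted helper (assumed to mirror the mm_access()
      part of __mem_open()): take a counted reference on the mm_struct
      itself, but do not pin its address space:

        struct mm_struct *proc_mem_open(struct inode *inode,
                                        unsigned int mode)
        {
                struct task_struct *task = get_proc_task(inode);
                struct mm_struct *mm = ERR_PTR(-ESRCH);

                if (task) {
                        mm = mm_access(task, mode);
                        put_task_struct(task);

                        if (!IS_ERR_OR_NULL(mm)) {
                                /* ensure this mm_struct can't be freed */
                                atomic_inc(&mm->mm_count);
                                /* but do not pin its memory */
                                mmput(mm);
                        }
                }

                return mm;
        }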
    • fs/proc/task_mmu.c: unify/simplify do_maps_open() and numa_maps_open() · 4db7d0ee
      Oleg Nesterov authored
      do_maps_open() and numa_maps_open() are overcomplicated; they could
      use __seq_open_private().  Plus they do the same thing, differing
      only in sizeof(*priv).

      Change them to use a new simple helper, proc_maps_open(ops, psize).
      This simplifies the code and allows us to make the next changes.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/task_mmu.c: don't use task->mm in m_start() and show_*map() · 46c298cf
      Oleg Nesterov authored
      get_gate_vma(priv->task->mm) looks ugly and wrong; task->mm can be NULL
      or it can be changed by exec right after mm_access().
      
      And in theory this race is not harmless, the task can exec and then later
      exit and free the new mm_struct.  In this case get_task_mm(oldmm) can't
      help, get_gate_vma(task->mm) can read the freed/unmapped memory.
      
      I think that priv->task should simply die and hold_task_mempolicy() logic
      can be simplified.  tail_vma logic asks for cleanups too.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • softlockup: make detector be aware of task switch of processes hogging cpu · b1a8de1f
      chai wen authored
      Currently, the soft lockup detector warns once for each case of
      process softlockup.  But the 'watchdog/n' thread may not always get
      the cpu in the time slot between the task switch of two processes
      hogging that cpu, so it cannot reset soft_watchdog_warn.
      
      An example would be two processes hogging the cpu.  Process A causes the
      softlockup warning and is killed manually by a user.  Process B
      immediately becomes the new process hogging the cpu preventing the
      softlockup code from resetting the soft_watchdog_warn variable.
      
      This case is a false negative of "warn only once for a process", as
      there may be a different process going on to hog the cpu.  Resolve
      this by saving/checking the task pointer of the hogging process and
      using it to reset soft_watchdog_warn too.
      
      [dzickus@redhat.com: update comment]
      Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com>
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
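      A sketch of the check inside the watchdog timer function (variable
      names are taken from the description above; treat them as
      assumptions):

        if (__this_cpu_read(soft_watchdog_warn) == true) {
                /*
                 * A different task is now hogging the cpu: forget the
                 * old warning so the new offender gets reported too.
                 */
                if (__this_cpu_read(softlockup_task_ptr_saved) != current) {
                        __this_cpu_write(soft_watchdog_warn, false);
                        __touch_watchdog();
                }
                return HRTIMER_RESTART;
        }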
    • ocfs2: fix deadlock due to wrong locking order · f775da2f
      Junxiao Bi authored
      When committing the ocfs2 journal, the ocfs2 journal thread acquires
      the mutex osb->journal->j_trans_barrier and wakes up the jbd2 commit
      thread, then waits until the jbd2 commit thread is done.  In ordered
      journal mode, jbd2 needs to flush dirty data pages first, and this
      requires taking the page lock.  So osb->journal->j_trans_barrier
      should be taken before the page lock.

      But ocfs2_write_zero_page() and ocfs2_write_begin_inline() violate
      this locking order, and this can cause a deadlock and hang the whole
      cluster.

      One deadlock caught is the following:
      
      PID: 13449  TASK: ffff8802e2f08180  CPU: 31  COMMAND: "oracle"
       #0 [ffff8802ee3f79b0] __schedule at ffffffff8150a524
       #1 [ffff8802ee3f7a58] schedule at ffffffff8150acbf
       #2 [ffff8802ee3f7a68] rwsem_down_failed_common at ffffffff8150cb85
       #3 [ffff8802ee3f7ad8] rwsem_down_read_failed at ffffffff8150cc55
       #4 [ffff8802ee3f7ae8] call_rwsem_down_read_failed at ffffffff812617a4
       #5 [ffff8802ee3f7b50] ocfs2_start_trans at ffffffffa0498919 [ocfs2]
       #6 [ffff8802ee3f7ba0] ocfs2_zero_start_ordered_transaction at ffffffffa048b2b8 [ocfs2]
       #7 [ffff8802ee3f7bf0] ocfs2_write_zero_page at ffffffffa048e9bd [ocfs2]
       #8 [ffff8802ee3f7c80] ocfs2_zero_extend_range at ffffffffa048ec83 [ocfs2]
       #9 [ffff8802ee3f7ce0] ocfs2_zero_extend at ffffffffa048edfd [ocfs2]
       #10 [ffff8802ee3f7d50] ocfs2_extend_file at ffffffffa049079e [ocfs2]
       #11 [ffff8802ee3f7da0] ocfs2_setattr at ffffffffa04910ed [ocfs2]
       #12 [ffff8802ee3f7e70] notify_change at ffffffff81187d29
       #13 [ffff8802ee3f7ee0] do_truncate at ffffffff8116bbc1
       #14 [ffff8802ee3f7f50] sys_ftruncate at ffffffff8116bcbd
       #15 [ffff8802ee3f7f80] system_call_fastpath at ffffffff81515142
          RIP: 00007f8de750c6f7  RSP: 00007fffe786e478  RFLAGS: 00000206
          RAX: 000000000000004d  RBX: ffffffff81515142  RCX: 0000000000000000
          RDX: 0000000000000200  RSI: 0000000000028400  RDI: 000000000000000d
          RBP: 00007fffe786e040   R8: 0000000000000000   R9: 000000000000000d
          R10: 0000000000000000  R11: 0000000000000206  R12: 000000000000000d
          R13: 00007fffe786e710  R14: 00007f8de70f8340  R15: 0000000000028400
          ORIG_RAX: 000000000000004d  CS: 0033  SS: 002b
      
      crash64> bt
      PID: 7610   TASK: ffff88100fd56140  CPU: 1   COMMAND: "ocfs2cmt"
       #0 [ffff88100f4d1c50] __schedule at ffffffff8150a524
       #1 [ffff88100f4d1cf8] schedule at ffffffff8150acbf
       #2 [ffff88100f4d1d08] jbd2_log_wait_commit at ffffffffa01274fd [jbd2]
       #3 [ffff88100f4d1d98] jbd2_journal_flush at ffffffffa01280b4 [jbd2]
       #4 [ffff88100f4d1dd8] ocfs2_commit_cache at ffffffffa0499b14 [ocfs2]
       #5 [ffff88100f4d1e38] ocfs2_commit_thread at ffffffffa0499d38 [ocfs2]
       #6 [ffff88100f4d1ee8] kthread at ffffffff81090db6
       #7 [ffff88100f4d1f48] kernel_thread_helper at ffffffff81516284
      
      crash64> bt
      PID: 7609   TASK: ffff88100f2d4480  CPU: 0   COMMAND: "jbd2/dm-20-86"
       #0 [ffff88100def3920] __schedule at ffffffff8150a524
       #1 [ffff88100def39c8] schedule at ffffffff8150acbf
       #2 [ffff88100def39d8] io_schedule at ffffffff8150ad6c
       #3 [ffff88100def39f8] sleep_on_page at ffffffff8111069e
       #4 [ffff88100def3a08] __wait_on_bit_lock at ffffffff8150b30a
       #5 [ffff88100def3a58] __lock_page at ffffffff81110687
       #6 [ffff88100def3ab8] write_cache_pages at ffffffff8111b752
       #7 [ffff88100def3be8] generic_writepages at ffffffff8111b901
       #8 [ffff88100def3c48] journal_submit_data_buffers at ffffffffa0120f67 [jbd2]
       #9 [ffff88100def3cf8] jbd2_journal_commit_transaction at ffffffffa0121372[jbd2]
       #10 [ffff88100def3e68] kjournald2 at ffffffffa0127a86 [jbd2]
       #11 [ffff88100def3ee8] kthread at ffffffff81090db6
       #12 [ffff88100def3f48] kernel_thread_helper at ffffffff81516284
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Alex Chen <alex.chen@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix deadlock between o2hb thread and o2net_wq · 70e82a12
      Joseph Qi authored
      The following case may lead to a deadlock between o2net_wq and the
      o2hb thread on o2hb_callback_sem.

      Currently there are 2 nodes, say N1 and N2, in the cluster.  When N2
      goes down and, at the same time, N3 tries to join the cluster, N1
      will handle the node down (N2) and the join (N3) simultaneously.
          o2hb                               o2net_wq
          ->o2hb_do_disk_heartbeat
          ->o2hb_check_slot
          ->o2hb_run_event_list
          ->o2hb_fire_callbacks
          ->down_write(&o2hb_callback_sem)
          ->o2net_hb_node_down_cb
          ->flush_workqueue(o2net_wq)
                                             ->o2net_process_message
                                             ->dlm_query_join_handler
                                             ->o2hb_check_node_heartbeating
                                             ->o2hb_fill_node_map
                                             ->down_read(&o2hb_callback_sem)
      
      There is no need to take o2hb_callback_sem in dlm_query_join_handler;
      o2hb_live_lock is enough to protect the live node map.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: jiangyiwen <jiangyiwen@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: don't fire quorum before connection established · 5046f18d
      Junxiao Bi authored
      Firing quorum before the connection is established can cause an
      unexpected node to reboot.

      Assume there are 3 nodes in the cluster: Node 1, 2, 3.  Node 2 and 3
      have the wrong ip address of Node 1 in cluster.conf, and global
      heartbeat is enabled in the cluster.  After the heartbeats are started
      on these three nodes, Node 1 will reboot due to quorum fencing.  The
      case is similar if Node 1's networking is not ready when the global
      heartbeat starts.

      The reboot is not friendly, as the customer is not fully ready for
      ocfs2 to work.  Fix it by not allowing quorum to fire before the
      connection is established.  In this case, ocfs2 will wait until the
      wrong configuration is fixed or networking comes up, and then
      continue.  Also update the log to guide the user where to check when
      the connection is not built for a long time.
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>