1. 25 Jun, 2015 40 commits
    • Naoya Horiguchi's avatar
      mm: soft-offline: don't free target page in successful page migration · add05cec
      Naoya Horiguchi authored
      Stress testing showed that soft offline events for a process iterating
      "mmap-pagefault-munmap" loop can trigger
      VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page():
      
        Soft offlining page 0x70fe1 at 0x70100008d000
        Soft offlining page 0x705fb at 0x70300008d000
        page:ffffea0001c3f840 count:0 mapcount:0 mapping:          (null) index:0x2
        flags: 0x1fffff80800000(hwpoison)
        page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
        ------------[ cut here ]------------
        kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
        invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
        Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
        CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
        RIP: free_pcppages_bulk+0x52a/0x6f0
        Call Trace:
          drain_pages_zone+0x3d/0x50
          drain_local_pages+0x1d/0x30
          on_each_cpu_mask+0x46/0x80
          drain_all_pages+0x14b/0x1e0
          soft_offline_page+0x432/0x6e0
          SyS_madvise+0x73c/0x780
          system_call_fastpath+0x12/0x17
        Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 <0f> 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
        RIP  [<ffffffff811a806a>] free_pcppages_bulk+0x52a/0x6f0
         RSP <ffff88007a117d28>
        ---[ end trace 53926436e76d1f35 ]---
      
      When soft offline successfully migrates page, the source page is supposed
      to be freed.  But there is a race condition where a source page looks
      isolated (i.e.  the refcount is 0 and the PageHWPoison is set) but
      somewhat linked to pcplist.  Then another soft offline event calls
      drain_all_pages() and tries to free such hwpoisoned page, which is
      forbidden.
      
      This odd page state seems to happen due to the race between put_page() in
      putback_lru_page() and __pagevec_lru_add_fn().  But I don't want to play
      with tweaking drain code as done in commit 9ab3b598 "mm: hwpoison:
      drop lru_add_drain_all() in __soft_offline_page()", or to change page
      freeing code for this soft offline's purpose.
      
      Instead, let's think about the difference between hard offline and soft
      offline.  There is an interesting difference in how to isolate the in-use
      page between these, that is, hard offline marks PageHWPoison of the target
      page at first, and doesn't free it by keeping its refcount 1.  OTOH, soft
      offline tries to free the target page then marks PageHWPoison.  This
      difference might be the source of complexity and result in bugs like the
      above.  So making soft offline isolate with keeping refcount can be a
      solution for this problem.
      
      We can pass to page migration code the "reason" which shows the caller, so
      let's use this more to avoid calling putback_lru_page() when called from
      soft offline, which effectively does the isolation for soft offline.  With
      this change, target pages of soft offline never be reused without changing
      migratetype, so this patch also removes the related code.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      add05cec
    • Naoya Horiguchi's avatar
      mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling · ead07f6a
      Naoya Horiguchi authored
      memory_failure() can run in 2 different mode (specified by
      MF_COUNT_INCREASED) in page refcount perspective.  When
      MF_COUNT_INCREASED is set, memory_failure() assumes that the caller
      takes a refcount of the target page.  And if cleared, memory_failure()
      takes it in it's own.
      
      In current code, however, refcounting is done differently in each caller.
      For example, madvise_hwpoison() uses get_user_pages_fast() and
      hwpoison_inject() uses get_page_unless_zero().  So this inconsistent
      refcounting causes refcount failure especially for thp tail pages.
      Typical user visible effects are like memory leak or
      VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().
      
      To fix this refcounting issue, this patch introduces get_hwpoison_page()
      to handle thp tail pages in the same manner for each caller of hwpoison
      code.
      
      memory_failure() might fail to split thp and in such case it returns
      without completing page isolation.  This is not good because PageHWPoison
      on the thp is still set and there's no easy way to unpoison such thps.  So
      this patch try to roll back any action to the thp in "non anonymous thp"
      case and "thp split failed" case, expecting an MCE(SRAR) generated by
      later access afterward will properly free such thps.
      
      [akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m]
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ead07f6a
    • Naoya Horiguchi's avatar
      mm/memory-failure: split thp earlier in memory error handling · 415c64c1
      Naoya Horiguchi authored
      memory_failure() doesn't handle thp itself at this time and need to split
      it before doing isolation.  Currently thp is split in the middle of
      hwpoison_user_mappings(), but there're corner cases where memory_failure()
      wrongly tries to handle thp without splitting.
      
      1) "non anonymous" thp, which is not a normal operating mode of thp,
         but a memory error could hit a thp before anon_vma is initialized.  In
         such case, split_huge_page() fails and me_huge_page() (intended for
         hugetlb) is called for thp, which triggers BUG_ON in page_hstate().
      
      2) !PageLRU case, where hwpoison_user_mappings() returns with
         SWAP_SUCCESS and the result is the same as case 1.
      
      memory_failure() can't avoid splitting, so let's split it more earlier,
      which also reduces code which are prepared for both of normal page and
      thp.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      415c64c1
    • Zhihui Zhang's avatar
      mm: rename RECLAIM_SWAP to RECLAIM_UNMAP · 95bbc0c7
      Zhihui Zhang authored
      The name SWAP implies that we are dealing with anonymous pages only.  In
      fact, the original patch that introduced the min_unmapped_ratio logic
      was to fix an issue related to file pages.  Rename it to RECLAIM_UNMAP
      to match what does.
      
      Historically, commit a6dc60f8 ("vmscan: rename sc.may_swap to
      may_unmap") renamed .may_swap to .may_unmap, leaving RECLAIM_SWAP
      behind.  commit 2e2e4259 ("vmscan,memcg: reintroduce sc->may_swap")
      reintroduced .may_swap for memory controller.
      Signed-off-by: default avatarZhihui Zhang <zzhsuny@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      95bbc0c7
    • Nishanth Aravamudan's avatar
      mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages · f012a84a
      Nishanth Aravamudan authored
      Based upon 675becce ("mm: vmscan: do not throttle based on pfmemalloc
      reserves if node has no ZONE_NORMAL") from Mel.
      
      We have a system with the following topology:
      
      # numactl -H
      available: 3 nodes (0,2-3)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
      23 24 25 26 27 28 29 30 31
      node 0 size: 28273 MB
      node 0 free: 27323 MB
      node 2 cpus:
      node 2 size: 16384 MB
      node 2 free: 0 MB
      node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
      node 3 size: 30533 MB
      node 3 free: 13273 MB
      node distances:
      node   0   2   3
        0:  10  20  20
        2:  20  10  20
        3:  20  20  10
      
      Node 2 has no free memory, because:
      # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
      1
      
      This leads to the following zoneinfo:
      
      Node 2, zone      DMA
        pages free     0
              min      1840
              low      2300
              high     2760
              scanned  0
              spanned  262144
              present  262144
              managed  262144
      ...
        all_unreclaimable: 1
      
      If one then attempts to allocate some normal 16M hugepages via
      
      echo 37 > /proc/sys/vm/nr_hugepages
      
      The echo never returns and kswapd2 consumes CPU cycles.
      
      This is because throttle_direct_reclaim ends up calling
      wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
      pfmemalloc_watermark_ok() in turn checks all zones on the node if there
      are any reserves, and if so, then indicates the watermarks are ok, by
      seeing if there are sufficient free pages.
      
      675becce added a condition already for memoryless nodes.  In this case,
      though, the node has memory, it is just all consumed (and not
      reclaimable).  Effectively, though, the result is the same on this call to
      pfmemalloc_watermark_ok() and thus seems like a reasonable additional
      condition.
      
      With this change, the afore-mentioned 16M hugepage allocation attempt
      succeeds and correctly round-robins between Nodes 1 and 3.
      Signed-off-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f012a84a
    • Anisse Astier's avatar
      mm/page_alloc.c: cleanup obsolete KM_USER* · f4d2897b
      Anisse Astier authored
      It's been five years now that KM_* kmap flags have been removed and that
      we can call clear_highpage from any context.  So we remove prep_zero_pages
      accordingly.
      Signed-off-by: default avatarAnisse Astier <anisse@astier.eu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f4d2897b
    • Kirill A. Shutemov's avatar
      mm: avoid tail page refcounting on non-THP compound pages · c761471b
      Kirill A. Shutemov authored
      Reintroduce 8d63d99a ("mm: avoid tail page refcounting on non-THP
      compound pages") after removing bogus VM_BUG_ON_PAGE() in
      put_unrefcounted_compound_page().
      
      THP uses tail page refcounting to be able to split huge pages at any time.
       Tail page refcounting is not needed for other users of compound pages and
      it's harmful because of overhead.
      
      We try to exclude non-THP pages from tail page refcounting using
      __compound_tail_refcounted() check.  It excludes most common non-THP
      compound pages: SL*B and hugetlb, but it doesn't catch rest of __GFP_COMP
      users -- drivers.
      
      And it's not only about overhead.
      
      Drivers might want to use compound pages to get refcounting semantics
      suitable for mapping high-order pages to userspace.  But tail page
      refcounting breaks it.
      
      Tail page refcounting uses ->_mapcount in tail pages to store GUP pins on
      them.  It means GUP pins would affect page_mapcount() for tail pages.
      It's not a problem for THP, because it never maps tail pages.  But unlike
      THP, drivers map parts of compound pages with PTEs and it makes
      page_mapcount() be called for tail pages.
      
      In particular, GUP pins would shift PSS up and affect /proc/kpagecount for
      such pages.  But, I'm not aware about anything which can lead to crash or
      other serious misbehaviour.
      
      Since currently all THP pages are anonymous and all drivers pages are not,
      we can fix the __compound_tail_refcounted() check by requiring PageAnon()
      to enable tail page refcounting.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatarBorislav Petkov <bp@alien8.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c761471b
    • Kirill A. Shutemov's avatar
      mm: drop bogus VM_BUG_ON_PAGE assert in put_page() codepath · 73933b33
      Kirill A. Shutemov authored
      My commit 8d63d99a ("mm: avoid tail page refcounting on non-THP
      compound pages") which was merged during 4.1 merge window caused
      regression:
      
        page:ffffea0010a15040 count:0 mapcount:1 mapping:          (null) index:0x0
        flags: 0x8000000000008014(referenced|dirty|tail)
        page dumped because: VM_BUG_ON_PAGE(page_mapcount(page) != 0)
        ------------[ cut here ]------------
        kernel BUG at mm/swap.c:134!
      
      The problem can be reproduced by playing *two* audio files at the same
      time and then stopping one of players.  I used two mplayers to trigger
      this.
      
      The VM_BUG_ON_PAGE() which triggers the bug is bogus:
      
      Sound subsystem uses compound pages for its buffers, but unlike most
      __GFP_COMP sound maps compound pages to userspace with PTEs.
      
      In our case with two players map the buffer twice and therefore elevates
      page_mapcount() on tail pages by two.  When one of players exits it
      unmaps the VMA and drops page_mapcount() to one and try to release
      reference on the page with put_page().
      
      My commit changes which path it takes under put_compound_page().  It hits
      put_unrefcounted_compound_page() where VM_BUG_ON_PAGE() is.  It sees
      page_mapcount() == 1.  The function wrongly assumes that subpages of
      compound page cannot be be mapped by itself with PTEs..
      
      The solution is simply drop the VM_BUG_ON_PAGE().
      
      Note: there's no need to move the check under put_page_testzero().
      Allocator will check the mapcount by itself before putting on free list.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatarBorislav Petkov <bp@alien8.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      73933b33
    • Rasmus Villemoes's avatar
      mm: only define hashdist variable when needed · a9919c79
      Rasmus Villemoes authored
      For !CONFIG_NUMA, hashdist will always be 0, since it's setter is
      otherwise compiled out.  So we can save 4 bytes of data and some .text
      (although mostly in __init functions) by only defining it for
      CONFIG_NUMA.
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9919c79
    • Zhang Zhen's avatar
      mm/hugetlb: reduce arch dependent code about hugetlb_prefault_arch_hook · a67a31fa
      Zhang Zhen authored
      Currently we have many duplicates in definitions of
      hugetlb_prefault_arch_hook.  In all architectures this function is empty.
      Signed-off-by: default avatarZhang Zhen <zhenzhang.zhang@huawei.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a67a31fa
    • Laurent Dufour's avatar
      powerpc/mm: tracking vDSO remap · 83d3f0e9
      Laurent Dufour authored
      Some processes (CRIU) are moving the vDSO area using the mremap system
      call.  As a consequence the kernel reference to the vDSO base address is
      no more valid and the signal return frame built once the vDSO has been
      moved is not pointing to the new sigreturn address.
      
      This patch handles vDSO remapping and unmapping.
      Signed-off-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Reviewed-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      83d3f0e9
    • Laurent Dufour's avatar
      mm: new arch_remap() hook · 4abad2ca
      Laurent Dufour authored
      Some architectures would like to be triggered when a memory area is moved
      through the mremap system call.
      
      This patch introduces a new arch_remap() mm hook which is placed in the
      path of mremap, and is called before the old area is unmapped (and the
      arch_unmap() hook is called).
      Signed-off-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4abad2ca
    • Laurent Dufour's avatar
      mm: new mm hook framework · 2ae416b1
      Laurent Dufour authored
      CRIU is recreating the process memory layout by remapping the checkpointee
      memory area on top of the current process (criu).  This includes remapping
      the vDSO to the place it has at checkpoint time.
      
      However some architectures like powerpc are keeping a reference to the
      vDSO base address to build the signal return stack frame by calling the
      vDSO sigreturn service.  So once the vDSO has been moved, this reference
      is no more valid and the signal frame built later are not usable.
      
      This patch serie is introducing a new mm hook framework, and a new
      arch_remap hook which is called when mremap is done and the mm lock still
      hold.  The next patch is adding the vDSO remap and unmap tracking to the
      powerpc architecture.
      
      This patch (of 3):
      
      This patch introduces a new set of header file to manage mm hooks:
      - per architecture empty header file (arch/x/include/asm/mm-arch-hooks.h)
      - a generic header (include/linux/mm-arch-hooks.h)
      
      The architecture which need to overwrite a hook as to redefine it in its
      header file, while architecture which doesn't need have nothing to do.
      
      The default hooks are defined in the generic header and are used in the
      case the architecture is not defining it.
      
      In a next step, mm hooks defined in include/asm-generic/mm_hooks.h should
      be moved here.
      Signed-off-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Suggested-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2ae416b1
    • Zhang Zhen's avatar
      mm/hugetlb: reduce arch dependent code about huge_pmd_unshare · e81f2d22
      Zhang Zhen authored
      Currently we have many duplicates in definitions of huge_pmd_unshare.  In
      all architectures this function just returns 0 when
      CONFIG_ARCH_WANT_HUGE_PMD_SHARE is N.
      
      This patch puts the default implementation in mm/hugetlb.c and lets these
      architectures use the common code.
      Signed-off-by: default avatarZhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Yang <James.Yang@freescale.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e81f2d22
    • Kirill A. Shutemov's avatar
      mm: fix mprotect() behaviour on VM_LOCKED VMAs · 36f88188
      Kirill A. Shutemov authored
      On mlock(2) we trigger COW on private writable VMA to avoid faults in
      future.
      
      mm/gup.c:
       840 long populate_vma_page_range(struct vm_area_struct *vma,
       841                 unsigned long start, unsigned long end, int *nonblocking)
       842 {
       ...
       855          * We want to touch writable mappings with a write fault in order
       856          * to break COW, except for shared mappings because these don't COW
       857          * and we would not want to dirty them for nothing.
       858          */
       859         if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
       860                 gup_flags |= FOLL_WRITE;
      
      But we miss this case when we make VM_LOCKED VMA writeable via
      mprotect(2). The test case:
      
      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/resource.h>
      	#include <sys/stat.h>
      	#include <sys/time.h>
      	#include <sys/types.h>
      
      	#define PAGE_SIZE 4096
      
      	int main(int argc, char **argv)
      	{
      		struct rusage usage;
      		long before;
      		char *p;
      		int fd;
      
      		/* Create a file and populate first page of page cache */
      		fd = open("/tmp", O_TMPFILE | O_RDWR, S_IRUSR | S_IWUSR);
      		write(fd, "1", 1);
      
      		/* Create a *read-only* *private* mapping of the file */
      		p = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
      
      		/*
      		 * Since the mapping is read-only, mlock() will populate the mapping
      		 * with PTEs pointing to page cache without triggering COW.
      		 */
      		mlock(p, PAGE_SIZE);
      
      		/*
      		 * Mapping became read-write, but it's still populated with PTEs
      		 * pointing to page cache.
      		 */
      		mprotect(p, PAGE_SIZE, PROT_READ | PROT_WRITE);
      
      		getrusage(RUSAGE_SELF, &usage);
      		before = usage.ru_minflt;
      
      		/* Trigger COW: fault in mlock()ed VMA. */
      		*p = 1;
      
      		getrusage(RUSAGE_SELF, &usage);
      		printf("faults: %ld\n", usage.ru_minflt - before);
      
      		return 0;
      	}
      
      	$ ./test
      	faults: 1
      
      Let's fix it by triggering populating of VMA in mprotect_fixup() on this
      condition. We don't care about population error as we don't in other
      similar cases i.e. mremap.
      
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      36f88188
    • Jiri Kosina's avatar
      thp: cleanup how khugepaged enters freezer · cd092411
      Jiri Kosina authored
      khugepaged_do_scan() checks in every iteration whether freezing(current)
      is true, and in such case breaks out of the loop, which causes
      try_to_freeze() to be called immediately afterwards in
      khugepaged_wait_work().
      
      If nothing else, this causes unnecessary freezing(current) test, and also
      makes the way khugepaged enters freezer a bit less obvious than necessary.
      
      Let's just try to freeze directly, instead of splitting it into two
      (directly adjacent) phases.
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cd092411
    • Andi Kleen's avatar
      mm, hwpoison: remove obsolete "Notebook" todo list · ebb09738
      Andi Kleen authored
      All the items mentioned here have been either addressed, or were not
      really needed.  So just remove the comment.
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ebb09738
    • Andi Kleen's avatar
      mm, hwpoison: add comment describing when to add new cases · e0de78df
      Andi Kleen authored
      Here's another comment fix for hwpoison.
      
      It describes the "guiding principle" on when to add new
      memory error recovery code.
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e0de78df
    • Rasmus Villemoes's avatar
      linux/slab.h: fix three off-by-one typos in comment · 1ed58b60
      Rasmus Villemoes authored
      The first is a keyboard-off-by-one, the other two the ordinary mathy kind.
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1ed58b60
    • Daniel Sanders's avatar
      slab: correct size_index table before replacing the bootstrap kmem_cache_node · 34cc6990
      Daniel Sanders authored
      This patch moves the initialization of the size_index table slightly
      earlier so that the first few kmem_cache_node's can be safely allocated
      when KMALLOC_MIN_SIZE is large.
      
      There are currently two ways to generate indices into kmalloc_caches (via
      kmalloc_index() and via the size_index table in slab_common.c) and on some
      arches (possibly only MIPS) they potentially disagree with each other
      until create_kmalloc_caches() has been called.  It seems that the
      intention is that the size_index table is a fast equivalent to
      kmalloc_index() and that create_kmalloc_caches() patches the table to
      return the correct value for the cases where kmalloc_index()'s
      if-statements apply.
      
      The failing sequence was:
      * kmalloc_caches contains NULL elements
      * kmem_cache_init initialises the element that 'struct
        kmem_cache_node' will be allocated to. For 32-bit Mips, this is a
        56-byte struct and kmalloc_index returns KMALLOC_SHIFT_LOW (7).
      * init_list is called which calls kmalloc_node to allocate a 'struct
        kmem_cache_node'.
      * kmalloc_slab selects the kmem_caches element using
        size_index[size_index_elem(size)]. For MIPS, size is 56, and the
        expression returns 6.
      * This element of kmalloc_caches is NULL and allocation fails.
      * If it had not already failed, it would have called
        create_kmalloc_caches() at this point which would have changed
        size_index[size_index_elem(size)] to 7.
      
      I don't believe the bug to be LLVM specific but GCC doesn't normally
      encounter the problem.  I haven't been able to identify exactly what GCC
      is doing better (probably inlining) but it seems that GCC is managing to
      optimize to the point that it eliminates the problematic allocations.
      This theory is supported by the fact that GCC can be made to fail in the
      same way by changing inline, __inline, __inline__, and __always_inline in
      include/linux/compiler-gcc.h such that they don't actually inline things.
      Signed-off-by: default avatarDaniel Sanders <daniel.sanders@imgtec.com>
      Acked-by: default avatarPekka Enberg <penberg@kernel.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      34cc6990
    • Gavin Guo's avatar
      mm/slab_common: support the slub_debug boot option on specific object size · 4066c33d
      Gavin Guo authored
      The slub_debug=PU,kmalloc-xx cannot work because in the
      create_kmalloc_caches() the s->name is created after the
      create_kmalloc_cache() is called.  The name is NULL in the
      create_kmalloc_cache() so the kmem_cache_flags() would not set the
      slub_debug flags to the s->flags.  The fix here set up a kmalloc_names
      string array for the initialization purpose and delete the dynamic name
      creation of kmalloc_caches.
      
      [akpm@linux-foundation.org: s/kmalloc_names/kmalloc_info/, tweak comment text]
      Signed-off-by: default avatarGavin Guo <gavin.guo@canonical.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4066c33d
    • Akinobu Mita's avatar
      xtensa: use for_each_sg() · 3693a84d
      Akinobu Mita authored
      This replaces the plain loop over the sglist array with for_each_sg()
      macro which consists of sg_next() function calls.  Since xtensa doesn't
      select ARCH_HAS_SG_CHAIN, it is not necessary to use for_each_sg() in
      order to loop over each sg element.  But this can help find problems
      with drivers that do not properly initialize their sg tables when
      CONFIG_DEBUG_SG is enabled.
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3693a84d
    • Chris Metcalf's avatar
      procfs: treat parked tasks as sleeping for task state · f51c0eae
      Chris Metcalf authored
      Allowing watchdog threads to be parked means that we now have the
      opportunity of actually seeing persistent parked threads in the output
      of /proc/<pid>/stat and /proc/<pid>/status.  The existing code reported
      such threads as "Running", which is kind-of true if you think of the
      case where we park them as part of taking cpus offline.  But if we allow
      parking them indefinitely, "Running" is pretty misleading, so we report
      them as "Sleeping" instead.
      
      We could simply report them with a new string, "Parked", but it feels
      like it's a bit risky for userspace to see unexpected new values; the
      output is already documented in Documentation/filesystems/proc.txt, and
      it seems like a mistake to change that lightly.
      
      The scheduler does report parked tasks with a "P" in debugging output
      from sched_show_task() or dump_cpu_task(), but that's a different API.
      Similarly, the trace_ctxwake_* routines report a "P" for parked tasks,
      but again, different API.
      
      This change seemed slightly cleaner than updating the task_state_array
      to have additional rows.  TASK_DEAD should be subsumed by the exit_state
      bits; TASK_WAKEKILL is just a modifier; and TASK_WAKING can very
      reasonably be reported as "Running" (as it is now).  Only TASK_PARKED
      shows up with unreasonable output here.
      Signed-off-by: default avatarChris Metcalf <cmetcalf@ezchip.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f51c0eae
    • Chris Metcalf's avatar
      watchdog: add watchdog_cpumask sysctl to assist nohz · fe4ba3c3
      Chris Metcalf authored
      Change the default behavior of watchdog so it only runs on the
      housekeeping cores when nohz_full is enabled at build and boot time.
      Allow modifying the set of cores the watchdog is currently running on
      with a new kernel.watchdog_cpumask sysctl.
      
      In the current system, the watchdog subsystem runs a periodic timer that
      schedules the watchdog kthread to run.  However, nohz_full cores are
      designed to allow userspace application code running on those cores to
      have 100% access to the CPU.  So the watchdog system prevents the
      nohz_full application code from being able to run the way it wants to,
      thus the motivation to suppress the watchdog on nohz_full cores, which
      this patchset provides by default.
      
      However, if we disable the watchdog globally, then the housekeeping
      cores can't benefit from the watchdog functionality.  So we allow
      disabling it only on some cores.  See Documentation/lockup-watchdogs.txt
      for more information.
      
      [jhubbard@nvidia.com: fix a watchdog crash in some configurations]
      Signed-off-by: default avatarChris Metcalf <cmetcalf@ezchip.com>
      Acked-by: default avatarDon Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fe4ba3c3
    • Chris Metcalf's avatar
      smpboot: allow excluding cpus from the smpboot threads · b5242e98
      Chris Metcalf authored
      This patch series allows the watchdog to run by default only on the
      housekeeping cores when nohz_full is in effect; this seems to be a good
      compromise short of turning it off completely (since the nohz_full cores
      can't tolerate a watchdog).
      
      To provide customizability, we add /proc/sys/kernel/watchdog_cpumask so
      that the set of cores running the watchdog can be tuned to different
      values after bootup.
      
      To implement this customizability, we add a new
      smpboot_update_cpumask_percpu_thread() API to the smpboot_thread
      subsystem that lets us park or unpark "unwanted" threads.
      
      And now that threads can be parked for long periods of time, we tweak the
      /proc/<pid>/stat and /proc/<pid>/status code so parked threads aren't
      reported as running, which is otherwise confusing.
      
      This patch (of 3):
      
      This change allows some cores to be excluded from running the
      smp_hotplug_thread tasks.  The following commit to update
      kernel/watchdog.c to use this functionality is the motivating example, and
      more information on the motivation is provided there.
      
      A new smp_hotplug_thread field is introduced, "cpumask", which is cpumask
      field managed by the smpboot subsystem that indicates whether or not the
      given smp_hotplug_thread should run on that core; the cpumask is checked
      when deciding whether to unpark the thread.
      
      To limit the cpumask to less than cpu_possible, you must call
      smpboot_update_cpumask_percpu_thread() after registering.
      Signed-off-by: default avatarChris Metcalf <cmetcalf@ezchip.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b5242e98
    • Akinobu Mita's avatar
      sparc: use for_each_sg() · 8c07a308
      Akinobu Mita authored
      This replaces the plain loop over the sglist array with for_each_sg()
      macro which consists of sg_next() function calls.  Since sparc does select
      ARCH_HAS_SG_CHAIN, it is necessary to use for_each_sg() in order to loop
      over each sg element.  This also help find problems with drivers that do
      not properly initialize their sg tables when CONFIG_DEBUG_SG is enabled.
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c07a308
    • Akinobu Mita's avatar
      parisc: use for_each_sg() · 210bff6d
      Akinobu Mita authored
      This replaces the plain loop over the sglist array with for_each_sg()
      macro which consists of sg_next() function calls.  Since parisc doesn't
      select ARCH_HAS_SG_CHAIN, it is not necessary to use for_each_sg() in
      order to loop over each sg element.  But this can help find problems with
      drivers that do not properly initialize their sg tables when
      CONFIG_DEBUG_SG is enabled.
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      210bff6d
    • Joseph Qi's avatar
      ocfs2: mark local functions as static · b519ea6d
      Joseph Qi authored
      Some functions are only used locally, so mark them as static.
      Signed-off-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b519ea6d
    • Fabian Frederick's avatar
      ocfs2: use swap() in ocfs2_double_lock() · ab1ba021
      Fabian Frederick authored
      Use kernel.h macro definition.
      
      Thanks to Julia Lawall for Coccinelle scripting support.
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Cc: Julia Lawall <julia.lawall@lip6.fr>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ab1ba021
    • Fabian Frederick's avatar
      ocfs2: use swap() in swap_refcount_rec() · a612543f
      Fabian Frederick authored
      Use kernel.h macro definition.
      
      Thanks to Julia Lawall for Coccinelle scripting support.
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Cc: Julia Lawall <julia.lawall@lip6.fr>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a612543f
    • Fabian Frederick's avatar
      ocfs2: use swap() in dx_leaf_sort_swap() · 2a28f98c
      Fabian Frederick authored
      Use kernel.h macro definition.
      
      Thanks to Julia Lawall for Coccinelle scripting support.
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Cc: Julia Lawall <julia.lawall@lip6.fr>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a28f98c
    • Joseph Qi's avatar
      ocfs2: fix wrong check in ocfs2_direct_IO_get_blocks · ae1f0814
      Joseph Qi authored
      contig_blocks gotten from ocfs2_extent_map_get_blocks cannot be compared
      with clusters_to_alloc. So convert it to clusters first.
      Signed-off-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Reviewed-by: default avatarWeiwei Wang <wangww631@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ae1f0814
    • Xue jiufei's avatar
      ocfs2: fix NULL pointer dereference in function ocfs2_abort_trigger() · 74e364ad
      Xue jiufei authored
      ocfs2_abort_trigger() use bh->b_assoc_map to get sb.  But there's no
      function to set bh->b_assoc_map in ocfs2, it will trigger NULL pointer
      dereference while calling this function.  We can get sb from
      bh->b_bdev->bd_super instead of b_assoc_map.
      
      [akpm@linux-foundation.org: update comment, per Joseph]
      Signed-off-by: default avatarjoyce.xue <xuejiufei@huawei.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      74e364ad
    • alex chen's avatar
      ocfs2: o2net: should remove debugfs in o2net_init() out branch · fce56d84
      alex chen authored
      Signed-off-by: default avatarAlex Chen <alex.chen@huawei.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fce56d84
    • WeiWei Wang's avatar
      ocfs2: remove OCFS2_IOCB_SEM lock type in direct io · fa5a0eb3
      WeiWei Wang authored
      In ocfs2 direct read/write, OCFS2_IOCB_SEM lock type is used to protect
      inode->i_alloc_sem rw semaphore lock in the earlier kernel version.
      However, in the latest kernel, inode->i_alloc_sem rw semaphore lock is not
      used at all, so OCFS2_IOCB_SEM lock type needs to be removed.
      Signed-off-by: default avatarWeiwei Wang <wangww631@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fa5a0eb3
    • Joseph Qi's avatar
      ocfs2: do not BUG if jbd2_journal_dirty_metadata fails · e272e7f0
      Joseph Qi authored
      jbd2_journal_dirty_metadata may fail.  Currently it cannot take care of
      non zero return value and just BUG in ocfs2_journal_dirty.  This patch is
      aborting the handle and journal instead of BUG.
      Signed-off-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: joyce.xue <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e272e7f0
    • Xue jiufei's avatar
      ocfs2: remove BUG_ON(!empty_extent) in __ocfs2_rotate_tree_left() · 099768b0
      Xue jiufei authored
      ocfs2_rotate_tree_left() calls __ocfs2_rotate_tree_left() for left
      rotation while non-rightmost path containing an empty extent in the leaf
      block.  __ocfs2_rotate_tree_left() returns -EAGAIN if right subtree having an
      empty extent and pass the empty_extent_path to caller.  The caller
      ocfs2_rotate_tree_left() will restart rotation from the returned path.
      
      It will trigger the BUG_ON(!ocfs2_is_empty_extent) when the et on disk
      is as follows:
      
      eb0 is the leaf block of path(say path_a) passed to
      ocfs2_rotate_tree_left, which has an empty rec[0].
      
      eb1 is the leaf block of path(say path_b) that just right to path_a, which
      has no empty record.
      
      eb2 is the leaf block of path(say path_c) that just right to path_b, which
      has an empty rec[0].  And path_c is also the rightmost path.
      
      Now we want to remove the empty rec[0] in eb0:
      
      ocfs2_rotate_tree_left:
        -> call __ocfs2_rotate_tree_left with path_a as its input *path*
          -> call ocfs2_rotate_subtree_left with path_a as its input
             *left_path* and path_b as its input *right_path*. it will move
             rec[0] in eb1 to eb0, and rec[0] in eb0 is not empty now.
          -> continue to call ocfs2_rotate_subtree_left with path_b as its
             input *left_path* and path_c as its input *right_path*, and
             return -EAGAIN because eb2 has an empty rec[0]
        -> call __ocfs2_rotate_tree_left with path_c as it input, rotate all
           records in eb2 to left and return 0.
        -> call __ocfs2_rotate_tree_left with path_a as its input, and triggers
           the BUG_ON(!ocfs2_is_empty_extent) as the rec[0] in eb0 is not empty.
      
      So the BUG_ON() should be removed and return 0 if rec[0] is no longer an
      empty extent.
      Signed-off-by: default avatarjoyce.xue <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      099768b0
    • Xue jiufei's avatar
      ocfs2: return error when ocfs2_figure_merge_contig_type() fails · 9f99ad08
      Xue jiufei authored
      ocfs2_figure_merge_contig_type() still returns CONTIG_NONE when some error
      occurs which will cause an unpredictable error.  So return a proper errno
      when ocfs2_figure_merge_contig_type() fails.
      Signed-off-by: default avatarjoyce.xue <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9f99ad08
    • Joseph Qi's avatar
      ocfs2/dlm: cleanup unused function __dlm_wait_on_lockres_flags_set · 345dc681
      Joseph Qi authored
      __dlm_wait_on_lockres_flags_set() is declared but not implemented and
      used.  So remove it.
      Signed-off-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      345dc681
    • Daeseok Youn's avatar
      ocfs2: use retval instead of status for checking error · 2e173152
      Daeseok Youn authored
      The use of 'status' in __ocfs2_add_entry() can return wrong value.
      
      Some functions' return value in __ocfs2_add_entry(), i.e
      ocfs2_journal_access_di() is saved to 'status'.  But 'status' is not
      used in 'bail' label for returning result of __ocfs2_add_entry().
      
      So use retval instead of status.
      Signed-off-by: default avatarDaeseok Youn <daeseok.youn@gmail.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2e173152