1. 16 Nov, 2017 40 commits
    • f2fs: use pagevec_lookup_range_tag() · 69c4f35d
      Jan Kara authored
      We want only pages from given range in f2fs_write_cache_pages().  Use
      pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
      unnecessary code.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-6-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ext4: use pagevec_lookup_range_tag() · dc7f3e86
      Jan Kara authored
      We want only pages from given range in ext4_writepages().  Use
      pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
      unnecessary code.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-5-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ceph: use pagevec_lookup_range_tag() · 0ed75fc8
      Jan Kara authored
      We want only pages from given range in ceph_writepages_start().  Use
      pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
      unnecessary code.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-4-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • btrfs: use pagevec_lookup_range_tag() · 4006f437
      Jan Kara authored
      We want only pages from given range in btree_write_cache_pages() and
      extent_write_cache_pages().  Use pagevec_lookup_range_tag() instead of
      pagevec_lookup_tag() and remove unnecessary code.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-3-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: David Sterba <dsterba@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: implement find_get_pages_range_tag() · 72b045ae
      Jan Kara authored
      Patch series "Ranged pagevec tagged lookup", v3.
      
      In this series I provide a ranged variant of pagevec_lookup_tag() and
      use it in places where it makes sense.  This series removes some common
      code and also has the potential to speed up some operations, much as
      pagevec_lookup_range() did (but for now I can think of only artificial
      cases where this happens).
      
      This patch (of 16):
      
      Implement a variant of find_get_pages_tag() that stops iterating at
      given index.  Lots of users of this function (through pagevec_lookup())
      actually want a range lookup and all of them are currently open-coding
      this.
      
      Also create corresponding pagevec_lookup_range_tag() function.
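
To illustrate what "stops iterating at given index" means, here is a minimal userspace sketch. It models the mapping as a plain array of tag flags; the names toy_mapping, toy_lookup_range_tag and toy_demo are hypothetical and the code is not the kernel implementation, which walks the page cache tree instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 16

/* Toy stand-in for a mapping: tagged[i] is true when page i carries the tag. */
struct toy_mapping {
    bool tagged[TOY_NR_PAGES];
};

/*
 * Collect up to max indices of tagged pages in [*start, end], inclusive.
 * On return *start is one past the last index examined, so a caller can
 * resume from there. Hypothetical helper: it only mirrors the idea of a
 * ranged lookup, not the kernel's data structures.
 */
static size_t toy_lookup_range_tag(const struct toy_mapping *m, size_t *start,
                                   size_t end, size_t *out, size_t max)
{
    size_t found = 0;
    size_t i;

    assert(m != NULL && start != NULL && out != NULL);

    /* Clamp the range to the toy mapping so the loop cannot overrun it. */
    if (end >= TOY_NR_PAGES)
        end = TOY_NR_PAGES - 1;

    for (i = *start; i <= end && found < max; i++) {
        if (m->tagged[i])
            out[found++] = i;
    }
    *start = i;
    return found;
}

/* Tag pages 2, 5, 9 and 12, then look up only the range [3, 9]. */
static size_t toy_demo(size_t *out, size_t max)
{
    struct toy_mapping m = { .tagged = { false } };
    size_t start = 3;

    m.tagged[2] = m.tagged[5] = m.tagged[9] = m.tagged[12] = true;
    return toy_lookup_range_tag(&m, &start, 9, out, max);
}
```

With the end index passed in, the caller no longer needs its own "is this page past the end?" check after each lookup, which is the open-coded logic the series removes.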
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-2-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Steve French <sfrench@samba.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_owner.c: reduce page_owner structure size · 6b4c54e3
      Ayush Mittal authored
      Maximum page order can be at most 10, which can be accommodated in a
      short data type (2 bytes).  last_migrate_reason is defined as an enum
      type whose values can also be accommodated in a short data type (2 bytes).
      
      Total structure size is currently 16 bytes; after this change it drops
      to 12 bytes.
      
      Vlastimil said:
       "Looks like it works, so why not.
        Before:
        [    0.001000] allocated 50331648 bytes of page_ext
        After:
        [    0.001000] allocated 41943040 bytes of page_ext"
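
The size change can be reproduced in userspace with a small sketch. The field names and the 32-bit stand-in types below are assumptions made for illustration (the message itself names only the page order and last_migrate_reason), so treat this as a model of the layout rather than the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for kernel types; both are assumed to be 32 bits wide here. */
typedef uint32_t toy_gfp_t;
typedef uint32_t toy_stack_handle_t;

/* Assumed layout before the change: four 4-byte members. */
struct toy_page_owner_old {
    unsigned int order;
    toy_gfp_t gfp_mask;
    int last_migrate_reason;
    toy_stack_handle_t handle;
};

/* Assumed layout after the change: the two small fields share 4 bytes. */
struct toy_page_owner_new {
    unsigned short order;
    short last_migrate_reason;
    toy_gfp_t gfp_mask;
    toy_stack_handle_t handle;
};

/* Bytes saved per tracked page on ABIs with 4-byte int and 2-byte short. */
static size_t toy_saved_bytes(void)
{
    return sizeof(struct toy_page_owner_old) -
           sizeof(struct toy_page_owner_new);
}
```

Placing the two 2-byte members next to each other matters: if they were separated by a 4-byte member, alignment padding would likely bring the structure back to 16 bytes.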
      
      Link: http://lkml.kernel.org/r/1507623917-37991-1-git-send-email-ayush.m@samsung.com
      Signed-off-by: Ayush Mittal <ayush.m@samsung.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Amit Sahrawat <a.sahrawat@samsung.com>
      Cc: Vaneet Narang <v.narang@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/cma.c: change pr_info to pr_err for cma_alloc fail log · 5984af10
      Pintu Agarwal authored
      It was observed that the cma_alloc failure log uses pr_info instead of
      pr_err.  This leads to problems if the printk debug level is set below
      7.  In that case the cma_alloc failure message will not be captured in
      the log and the failure will be difficult to debug.
      
      Simply replace pr_info with pr_err to capture the failure log.
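
The effect of the level change can be sketched with the console filtering rule, under which a message is shown only when its numeric level is lower than the current console log level. The enum and function names below are hypothetical; the numeric values 3 and 6 are the standard severities for error and informational messages.

```c
#include <assert.h>
#include <stdbool.h>

/* Numeric severities: a lower number means a more severe message. */
enum toy_loglevel {
    TOY_LOGLEVEL_ERR = 3,   /* level used by pr_err() */
    TOY_LOGLEVEL_INFO = 6   /* level used by pr_info() */
};

/* A message is shown only if its level is below the console log level. */
static bool toy_reaches_console(int msg_level, int console_loglevel)
{
    assert(msg_level >= 0 && console_loglevel >= 0);
    return msg_level < console_loglevel;
}
```

With the log level set to 6, for example, an informational message is filtered out while an error message still gets through, which is why the failure report is more likely to be seen after the change.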
      
      Link: http://lkml.kernel.org/r/1507650633-4430-1-git-send-email-pintu.ping@gmail.com
      Signed-off-by: Pintu Agarwal <pintu.ping@gmail.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jaewon Kim <jaewon31.kim@samsung.com>
      Cc: Doug Berger <opendmb@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, arch: remove empty_bad_page* · 8745808f
      Michal Hocko authored
      empty_bad_page() and empty_bad_pte_table() seem to be relics from the
      old days that have not been used by any code for a long time.  I have
      tried to find out when exactly they fell out of use, but this is not
      straightforward due to many code movements - traces disappear around
      the 2.4 era.
      
      In any case, no code references either empty_bad_page or
      empty_bad_pte_table.  We only allocate storage that is not used by
      anybody, so remove them.
      
      Link: http://lkml.kernel.org/r/20171004150045.30755-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Ralf Baechle <ralf@linus-mips.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swap_slots.c: fix race conditions in swap_slots cache init · a2e16731
      Tim Chen authored
      Memory allocations can happen before the swap_slots cache initialization
      is completed during cpu bring up.  If we are low on memory, we could
      call get_swap_page() and access swap_slots_cache before it is fully
      initialized.
      
      Add a check in get_swap_page() for initialized swap_slots_cache to
      prevent this condition.  Similar check already exists in free_swap_slot.
      Also annotate the checks to indicate the likely condition.
      
      We also added a memory barrier to make sure that the lock
      initialization is done before the assignment of the cache->slots and
      cache->slots_ret pointers.  This guarantees that it is safe to acquire
      the slots cache locks and use the slots cache whenever the
      corresponding cache->slots or cache->slots_ret pointer is non-null.
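
The ordering requirement described above follows the usual "initialise, then publish the pointer" pattern. The sketch below uses C11 release/acquire atomics as a userspace stand-in for the kernel's barrier; the structure, field and function names are hypothetical and do not come from mm/swap_slots.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy cache: lock_ready stands in for an initialised lock. */
struct toy_cache {
    bool lock_ready;
    _Atomic(int *) slots;   /* NULL until the cache may be used */
};

static int toy_storage[4];

/* Writer: finish initialisation first, then publish the pointer. */
static void toy_cache_init(struct toy_cache *c)
{
    c->lock_ready = true;
    /* Release store: the write above is visible before the pointer is. */
    atomic_store_explicit(&c->slots, &toy_storage[0], memory_order_release);
}

/* Reader: a non-NULL pointer implies the lock has been initialised. */
static bool toy_cache_usable(struct toy_cache *c)
{
    int *slots = atomic_load_explicit(&c->slots, memory_order_acquire);

    if (slots == NULL)
        return false;       /* not initialised yet: take the slow path */
    assert(c->lock_ready);  /* guaranteed by the release/acquire pairing */
    return true;
}
```

A reader that sees NULL simply skips the cache, which corresponds to the new check in get_swap_page(); a reader that sees the pointer can rely on everything written before it was published.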
      
      [akpm@linux-foundation.org: tidy up comment]
      [akpm@linux-foundation.org: fix spello in comment]
      Link: http://lkml.kernel.org/r/65a9d0f133f63e66bba37b53b2fd0464b7cae771.1500677066.git.tim.c.chen@linux.intel.com
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Reported-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
      Acked-by: Ying Huang <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove unused pgdat->inactive_ratio · 3a50d14d
      Andrey Ryabinin authored
      Since commit 59dc76b0 ("mm: vmscan: reduce size of inactive file
      list") 'pgdat->inactive_ratio' is not used, except for printing
      "node_inactive_ratio: 0" in /proc/zoneinfo output.
      
      Remove it.
      
      Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: avoid call to invalidate_range() in range_end() · 4645b9fe
      Jérôme Glisse authored
      This is an optimization patch that only affects mmu_notifier users which
      rely on the invalidate_range() callback.  This patch avoids calling that
      callback twice in a row from inside __mmu_notifier_invalidate_range_end().
      
      Existing pattern (before this patch):
          mmu_notifier_invalidate_range_start()
              pte/pmd/pud_clear_flush_notify()
                  mmu_notifier_invalidate_range()
          mmu_notifier_invalidate_range_end()
              mmu_notifier_invalidate_range()
      
      New pattern (after this patch):
          mmu_notifier_invalidate_range_start()
              pte/pmd/pud_clear_flush_notify()
                  mmu_notifier_invalidate_range()
          mmu_notifier_invalidate_range_only_end()
      
      We call the invalidate_range callback after clearing the page table
      under the page table lock and we skip the call to invalidate_range
      inside the __mmu_notifier_invalidate_range_end() function.
      
      Idea from Andrea Arcangeli
      
      Link: http://lkml.kernel.org/r/20171017031003.7481-3-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: avoid double notification when it is useless · 0f10851e
      Jérôme Glisse authored
      This patch only affects users of mmu_notifier->invalidate_range callback
      which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
      and it is an optimization for those users.  Everyone else is unaffected
      by it.
      
      When clearing a pte/pmd we are given a choice to notify the event under
      the page table lock (notify version of *_clear_flush helpers do call the
      mmu_notifier_invalidate_range).  But that notification is not necessary
      in all cases.
      
      This patch removes almost all cases where it is useless to have a call
      to mmu_notifier_invalidate_range before
      mmu_notifier_invalidate_range_end.  It also adds documentation in all
      those cases explaining why.
      
      Below is a more in-depth analysis of why this is fine to do:
      
      For a secondary TLB (non-CPU TLB) like the IOMMU TLB or a device TLB (when
      a device uses something like ATS/PASID to get the IOMMU to walk the CPU
      page table to access a process virtual address space), there are only 2
      cases when you need to notify those secondary TLBs while holding the page
      table lock when clearing a pte/pmd:
      
        A) page backing address is free before mmu_notifier_invalidate_range_end
        B) a page table entry is updated to point to a new page (COW, write fault
           on zero page, __replace_page(), ...)
      
      Case A is obvious: you do not want to take the risk of the device writing
      to a page that might now be used by something completely different.
      
      Case B is more subtle. For correctness it requires the following sequence
      to happen:
        - take page table lock
        - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
        - set page table entry to point to new page
      
      If clearing the page table entry is not followed by a notify before setting
      the new pte/pmd value, then you can break a memory model like C11 or C++11
      for the device.
      
      Consider the following scenario (the device uses a feature similar to ATS/
      PASID):
      
      Two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE; we
      assume they are write protected for COW (other cases of B apply too).
      
      [Time N] -----------------------------------------------------------------
      CPU-thread-0  {try to write to addrA}
      CPU-thread-1  {try to write to addrB}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA and populate device TLB}
      DEV-thread-2  {read addrB and populate device TLB}
      [Time N+1] ---------------------------------------------------------------
      CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
      CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+2] ---------------------------------------------------------------
      CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
      CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {write to addrA which is a write to new page}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+4] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {}
      CPU-thread-3  {write to addrB which is a write to new page}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+5] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+6] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA from old page}
      DEV-thread-2  {read addrB from new page}
      
      So here, because at time N+2 the clearing of the page table entry was not
      paired with a notification to invalidate the secondary TLB, the device sees
      the new value for addrB before seeing the new value for addrA.  This breaks
      total memory ordering for the device.
      
      When changing a pte to write protect or to point to a new write protected
      page with the same content (KSM) it is ok to delay the invalidate_range
      callback to mmu_notifier_invalidate_range_end() outside the page table
      lock.  This is true even if the thread doing the page table update is
      preempted right after releasing the page table lock and before calling
      mmu_notifier_invalidate_range_end().
      
      Thanks to Andrea for thinking of a problematic scenario for COW.
      
      [jglisse@redhat.com: v2]
        Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: calling zs_map_object() from irq is a bug · 1aedcafb
      Sergey Senozhatsky authored
      Use BUG_ON(in_interrupt()) in zs_map_object().  This is not a new
      BUG_ON(), it's always been there, but was recently changed to
      VM_BUG_ON().  There are several problems there.  First, we use
      per-CPU mappings both in zsmalloc and in zram, and an interrupt may easily
      corrupt those buffers.  Second, and more importantly, we believe it's
      possible to start leaking sensitive information.  Consider the following
      case:
      
      -> process P
      	swap out
      	 zram
      	  per-cpu mapping CPU1
      	   compress page A
      -> IRQ
      
      	swap out
      	 zram
      	  per-cpu mapping CPU1
      	   compress page B
      	    write page from per-cpu mapping CPU1 to zsmalloc pool
      	iret
      
      -> process P
      	    write page from per-cpu mapping CPU1 to zsmalloc pool  [*]
      	return
      
      * so we store overwritten data that actually belongs to another
        page (task) and potentially contains sensitive data. When
        process P later page faults, it will read (swap in) that
        other task's data.
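
The interleaving above can be replayed deterministically in userspace. In this sketch a single static buffer plays the role of the per-CPU mapping and "compression" is just a string copy; every name is hypothetical and nothing here is zram or zsmalloc code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUF_SIZE 8

/* One scratch buffer shared by everything that runs on the same CPU. */
static char toy_percpu_buf[TOY_BUF_SIZE];

/* "Compress" a page into the per-CPU buffer (here: copy a short label). */
static void toy_compress(const char *page)
{
    strncpy(toy_percpu_buf, page, TOY_BUF_SIZE - 1);
    toy_percpu_buf[TOY_BUF_SIZE - 1] = '\0';
}

/* Copy whatever the per-CPU buffer holds into a pool slot. */
static void toy_store(char *slot)
{
    assert(slot != NULL);
    memcpy(slot, toy_percpu_buf, TOY_BUF_SIZE);
}

/*
 * Replay the sequence: process P compresses page A, an interrupt on the
 * same CPU compresses and stores page B, then P resumes and stores what
 * it believes is page A.
 */
static void toy_replay(char *slot_p, char *slot_irq)
{
    toy_compress("pageA");  /* process P */
    toy_compress("pageB");  /* IRQ reuses the same buffer */
    toy_store(slot_irq);    /* IRQ writes page B to the pool */
    toy_store(slot_p);      /* P resumes and writes page B as well */
}
```

After the replay both slots hold page B, so process P's slot contains data that belongs to another task, which is the leak the BUG_ON() is meant to catch early.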
      
      Link: http://lkml.kernel.org/r/20170929045140.4055-1-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: constify hmm_devmem_page_get_drvdata() parameter · 0bea803e
      Ralph Campbell authored
      Constify the pointer parameter to avoid issues when it is used from code
      that only has a const struct page pointer in the first place.
      
      Link: http://lkml.kernel.org/r/1506972774-10191-1-git-send-email-jglisse@redhat.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/hugetlbfs/inode.c: remove redundant -EINVAL return from hugetlbfs_setattr() · 007ab7b4
      Anshuman Khandual authored
      There is no need to have a local return code set with -EINVAL when both
      the conditions following it return error codes appropriately.  Just
      remove the redundant one.
      
      Link: http://lkml.kernel.org/r/20170929145444.17611-1-khandual@linux.vnet.ibm.com
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: remove zlib from the list of recommended algorithms · 0b07ff39
      Sergey Senozhatsky authored
      ZSTD tends to outperform deflate/inflate, thus we remove zlib from the
      list of recommended algorithms and recommend zstd instead.
      
      Link: http://lkml.kernel.org/r/20170912050005.3247-2-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: add zstd to the supported algorithms list · 5ef3a8b1
      Sergey Senozhatsky authored
      Add ZSTD to the list of supported compression algorithms.
      
      ZRAM fio perf test:
      
                            LZO         DEFLATE         ZSTD
      
      #jobs1
      WRITE:              (2180MB/s)   (77.2MB/s)      (1429MB/s)
      WRITE:              (1617MB/s)   (77.7MB/s)      (1202MB/s)
      READ:                (426MB/s)   (595MB/s)       (1181MB/s)
      READ:                (422MB/s)   (572MB/s)       (1020MB/s)
      READ:                (318MB/s)   (67.8MB/s)      (563MB/s)
      WRITE:               (318MB/s)   (67.9MB/s)      (564MB/s)
      READ:                (336MB/s)   (68.3MB/s)      (583MB/s)
      WRITE:               (335MB/s)   (68.2MB/s)      (582MB/s)
      #jobs2
      WRITE:              (3441MB/s)   (152MB/s)       (2141MB/s)
      WRITE:              (2507MB/s)   (147MB/s)       (1888MB/s)
      READ:                (801MB/s)   (1146MB/s)      (1890MB/s)
      READ:                (767MB/s)   (1096MB/s)      (2073MB/s)
      READ:                (621MB/s)   (126MB/s)       (1009MB/s)
      WRITE:               (621MB/s)   (126MB/s)       (1009MB/s)
      READ:                (656MB/s)   (125MB/s)       (1075MB/s)
      WRITE:               (657MB/s)   (126MB/s)       (1077MB/s)
      #jobs3
      WRITE:              (4772MB/s)   (225MB/s)       (3394MB/s)
      WRITE:              (3905MB/s)   (211MB/s)       (2939MB/s)
      READ:               (1216MB/s)   (1608MB/s)      (3218MB/s)
      READ:               (1159MB/s)   (1431MB/s)      (2981MB/s)
      READ:                (906MB/s)   (156MB/s)       (1457MB/s)
      WRITE:               (907MB/s)   (156MB/s)       (1458MB/s)
      READ:                (953MB/s)   (158MB/s)       (1595MB/s)
      WRITE:               (952MB/s)   (157MB/s)       (1593MB/s)
      #jobs4
      WRITE:              (6036MB/s)   (265MB/s)       (4469MB/s)
      WRITE:              (5059MB/s)   (263MB/s)       (3951MB/s)
      READ:               (1618MB/s)   (2066MB/s)      (4276MB/s)
      READ:               (1573MB/s)   (1942MB/s)      (3830MB/s)
      READ:               (1202MB/s)   (227MB/s)       (1971MB/s)
      WRITE:              (1200MB/s)   (227MB/s)       (1968MB/s)
      READ:               (1265MB/s)   (226MB/s)       (2116MB/s)
      WRITE:              (1264MB/s)   (226MB/s)       (2114MB/s)
      #jobs5
      WRITE:              (5339MB/s)   (233MB/s)       (3781MB/s)
      WRITE:              (4298MB/s)   (234MB/s)       (3276MB/s)
      READ:               (1626MB/s)   (2048MB/s)      (4081MB/s)
      READ:               (1567MB/s)   (1929MB/s)      (3758MB/s)
      READ:               (1174MB/s)   (205MB/s)       (1747MB/s)
      WRITE:              (1173MB/s)   (204MB/s)       (1746MB/s)
      READ:               (1214MB/s)   (208MB/s)       (1890MB/s)
      WRITE:              (1215MB/s)   (208MB/s)       (1892MB/s)
      #jobs6
      WRITE:              (5666MB/s)   (270MB/s)       (4338MB/s)
      WRITE:              (4828MB/s)   (267MB/s)       (3772MB/s)
      READ:               (1803MB/s)   (2058MB/s)      (4946MB/s)
      READ:               (1805MB/s)   (2156MB/s)      (4711MB/s)
      READ:               (1334MB/s)   (235MB/s)       (2135MB/s)
      WRITE:              (1335MB/s)   (235MB/s)       (2137MB/s)
      READ:               (1364MB/s)   (236MB/s)       (2268MB/s)
      WRITE:              (1365MB/s)   (237MB/s)       (2270MB/s)
      #jobs7
      WRITE:              (5474MB/s)   (270MB/s)       (4300MB/s)
      WRITE:              (4666MB/s)   (266MB/s)       (3817MB/s)
      READ:               (2022MB/s)   (2319MB/s)      (5472MB/s)
      READ:               (1924MB/s)   (2260MB/s)      (5031MB/s)
      READ:               (1369MB/s)   (242MB/s)       (2153MB/s)
      WRITE:              (1370MB/s)   (242MB/s)       (2155MB/s)
      READ:               (1499MB/s)   (246MB/s)       (2310MB/s)
      WRITE:              (1497MB/s)   (246MB/s)       (2307MB/s)
      #jobs8
      WRITE:              (5558MB/s)   (273MB/s)       (4439MB/s)
      WRITE:              (4763MB/s)   (271MB/s)       (3918MB/s)
      READ:               (2201MB/s)   (2599MB/s)      (6062MB/s)
      READ:               (2105MB/s)   (2463MB/s)      (5413MB/s)
      READ:               (1490MB/s)   (252MB/s)       (2238MB/s)
      WRITE:              (1488MB/s)   (252MB/s)       (2236MB/s)
      READ:               (1566MB/s)   (254MB/s)       (2434MB/s)
      WRITE:              (1568MB/s)   (254MB/s)       (2437MB/s)
      #jobs9
      WRITE:              (5120MB/s)   (264MB/s)       (4035MB/s)
      WRITE:              (4531MB/s)   (267MB/s)       (3740MB/s)
      READ:               (1940MB/s)   (2258MB/s)      (4986MB/s)
      READ:               (2024MB/s)   (2387MB/s)      (4871MB/s)
      READ:               (1343MB/s)   (246MB/s)       (2038MB/s)
      WRITE:              (1342MB/s)   (246MB/s)       (2037MB/s)
      READ:               (1553MB/s)   (238MB/s)       (2243MB/s)
      WRITE:              (1552MB/s)   (238MB/s)       (2242MB/s)
      #jobs10
      WRITE:              (5345MB/s)   (271MB/s)       (3988MB/s)
      WRITE:              (4750MB/s)   (254MB/s)       (3668MB/s)
      READ:               (1876MB/s)   (2363MB/s)      (5150MB/s)
      READ:               (1990MB/s)   (2256MB/s)      (5080MB/s)
      READ:               (1355MB/s)   (250MB/s)       (2019MB/s)
      WRITE:              (1356MB/s)   (251MB/s)       (2020MB/s)
      READ:               (1490MB/s)   (252MB/s)       (2202MB/s)
      WRITE:              (1488MB/s)   (252MB/s)       (2199MB/s)
      
      jobs1                              perfstat
      instructions                 52,065,555,710 (    0.79)    855,731,114,587 (    2.64)       54,280,709,944 (    1.40)
      branches                     14,020,427,116 ( 725.847)    101,733,449,582 (1074.521)       11,170,591,067 ( 992.869)
      branch-misses                    22,626,174 (   0.16%)        274,197,885 (   0.27%)           25,915,805 (   0.23%)
      jobs2                              perfstat
      instructions                103,633,110,402 (    0.75)  1,710,822,100,914 (    2.59)      107,879,874,104 (    1.28)
      branches                     27,931,237,282 ( 679.203)    203,298,267,479 (1037.326)       22,185,350,842 ( 884.427)
      branch-misses                    46,103,811 (   0.17%)        533,747,204 (   0.26%)           49,682,483 (   0.22%)
      jobs3                              perfstat
      instructions                154,857,283,657 (    0.76)  2,565,748,974,197 (    2.57)      161,515,435,813 (    1.31)
      branches                     41,759,490,355 ( 670.529)    304,905,605,277 ( 978.765)       33,215,805,907 ( 888.003)
      branch-misses                    74,263,293 (   0.18%)        759,746,240 (   0.25%)           76,841,196 (   0.23%)
      jobs4                              perfstat
      instructions                206,215,849,076 (    0.75)  3,420,169,460,897 (    2.60)      215,003,061,664 (    1.31)
      branches                     55,632,141,739 ( 666.501)    406,394,977,433 ( 927.241)       44,214,322,251 ( 883.532)
      branch-misses                   102,287,788 (   0.18%)      1,098,617,314 (   0.27%)          103,891,040 (   0.23%)
      jobs5                              perfstat
      instructions                258,711,315,588 (    0.67)  4,275,657,533,244 (    2.23)      269,332,235,685 (    1.08)
      branches                     69,802,821,166 ( 588.823)    507,996,211,252 ( 797.036)       55,450,846,129 ( 735.095)
      branch-misses                   129,217,214 (   0.19%)      1,243,284,991 (   0.24%)          173,512,278 (   0.31%)
      jobs6                              perfstat
      instructions                312,796,166,008 (    0.61)  5,133,896,344,660 (    2.02)      323,658,769,588 (    1.04)
      branches                     84,372,488,583 ( 520.541)    610,310,494,402 ( 697.642)       66,683,292,992 ( 693.939)
      branch-misses                   159,438,978 (   0.19%)      1,396,368,563 (   0.23%)          174,406,934 (   0.26%)
      jobs7                              perfstat
      instructions                363,211,372,930 (    0.56)  5,988,205,600,879 (    1.75)      377,824,674,156 (    0.93)
      branches                     98,057,013,765 ( 463.117)    711,841,255,974 ( 598.762)       77,879,009,954 ( 600.443)
      branch-misses                   199,513,153 (   0.20%)      1,507,651,077 (   0.21%)          248,203,369 (   0.32%)
      jobs8                              perfstat
      instructions                413,960,354,615 (    0.52)  6,842,918,558,378 (    1.45)      431,938,486,581 (    0.83)
      branches                    111,812,574,884 ( 414.224)    813,299,084,518 ( 491.173)       89,062,699,827 ( 517.795)
      branch-misses                   233,584,845 (   0.21%)      1,531,593,921 (   0.19%)          286,818,489 (   0.32%)
      jobs9                              perfstat
      instructions                465,976,220,300 (    0.53)  7,698,467,237,372 (    1.47)      486,352,600,321 (    0.84)
      branches                    125,931,456,162 ( 424.063)    915,207,005,715 ( 498.192)      100,370,404,090 ( 517.439)
      branch-misses                   256,992,445 (   0.20%)      1,782,809,816 (   0.19%)          345,239,380 (   0.34%)
      jobs10                             perfstat
      instructions                517,406,372,715 (    0.53)  8,553,527,312,900 (    1.48)      540,732,653,094 (    0.84)
      branches                    139,839,780,676 ( 427.732)  1,016,737,699,389 ( 503.172)      111,696,557,638 ( 516.750)
      branch-misses                   259,595,561 (   0.19%)      1,952,570,279 (   0.19%)          357,818,661 (   0.32%)
      
      seconds elapsed        20.630411534     96.084546565    12.743373571
      seconds elapsed        22.292627625     100.984155001   14.407413560
      seconds elapsed        22.396016966     110.344880848   14.032201392
      seconds elapsed        22.517330949     113.351459170   14.243074935
      seconds elapsed        28.548305104     156.515193765   19.159286861
      seconds elapsed        30.453538116     164.559937678   19.362492717
      seconds elapsed        33.467108086     188.486827481   21.492612173
      seconds elapsed        35.617727591     209.602677783   23.256422492
      seconds elapsed        42.584239509     243.959902566   28.458540338
      seconds elapsed        47.683632526     269.635248851   31.542404137
      
      Overall, ZSTD has slower WRITE but much faster READ (perhaps the
      static compression buffer used during the test helped ZSTD a lot),
      which results in shorter overall test times.
      
      Memory consumption (zram mm_stat file):
      
      zram LZO mm_stat
      mm_stat (jobs1): 2147483648 23068672 33558528        0 33558528        0        0
      mm_stat (jobs2): 2147483648 23068672 33558528        0 33558528        0        0
      mm_stat (jobs3): 2147483648 23068672 33558528        0 33562624        0        0
      mm_stat (jobs4): 2147483648 23068672 33558528        0 33558528        0        0
      mm_stat (jobs5): 2147483648 23068672 33558528        0 33558528        0        0
      mm_stat (jobs6): 2147483648 23068672 33558528        0 33562624        0        0
      mm_stat (jobs7): 2147483648 23068672 33558528        0 33566720        0        0
      mm_stat (jobs8): 2147483648 23068672 33558528        0 33558528        0        0
      mm_stat (jobs9): 2147483648 23068672 33558528        0 33558528        0        0
      mm_stat (jobs10): 2147483648 23068672 33558528        0 33562624        0        0
      
      zram DEFLATE mm_stat
      mm_stat (jobs1): 2147483648 16252928 25178112        0 25178112        0        0
      mm_stat (jobs2): 2147483648 16252928 25178112        0 25178112        0        0
      mm_stat (jobs3): 2147483648 16252928 25178112        0 25178112        0        0
      mm_stat (jobs4): 2147483648 16252928 25178112        0 25178112        0        0
      mm_stat (jobs5): 2147483648 16252928 25178112        0 25178112        0        0
      mm_stat (jobs6): 2147483648 16252928 25178112        0 25178112        0        0
      mm_stat (jobs7): 2147483648 16252928 25178112        0 25190400        0        0
      mm_stat (jobs8): 2147483648 16252928 25178112        0 25190400        0        0
      mm_stat (jobs9): 2147483648 16252928 25178112        0 25178112        0        0
      mm_stat (jobs10): 2147483648 16252928 25178112        0 25178112        0        0
      
      zram ZSTD mm_stat
      mm_stat (jobs1): 2147483648 11010048 16781312        0 16781312        0        0
      mm_stat (jobs2): 2147483648 11010048 16781312        0 16781312        0        0
      mm_stat (jobs3): 2147483648 11010048 16781312        0 16785408        0        0
      mm_stat (jobs4): 2147483648 11010048 16781312        0 16781312        0        0
      mm_stat (jobs5): 2147483648 11010048 16781312        0 16781312        0        0
      mm_stat (jobs6): 2147483648 11010048 16781312        0 16781312        0        0
      mm_stat (jobs7): 2147483648 11010048 16781312        0 16781312        0        0
      mm_stat (jobs8): 2147483648 11010048 16781312        0 16781312        0        0
      mm_stat (jobs9): 2147483648 11010048 16781312        0 16785408        0        0
      mm_stat (jobs10): 2147483648 11010048 16781312        0 16781312        0        0
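
      The mm_stat rows above can be turned into compression ratios by
      dividing the first column by the second.  The sketch below assumes
      the first three columns are orig_data_size, compr_data_size and
      mem_used_total (the layout described in the zram documentation, not
      stated in this changelog); the struct and function names are
      illustrative only.

```c
#include <assert.h>
#include <stdio.h>

/* First three mm_stat columns, in bytes (column meaning is an assumption
 * taken from the zram documentation, not from the commit message). */
struct mm_stat {
	unsigned long long orig_data_size;
	unsigned long long compr_data_size;
	unsigned long long mem_used_total;
};

/* Parse the leading columns of one mm_stat line; returns 0 on success. */
int parse_mm_stat(const char *line, struct mm_stat *st)
{
	int n = sscanf(line, "%llu %llu %llu", &st->orig_data_size,
		       &st->compr_data_size, &st->mem_used_total);
	return n == 3 ? 0 : -1;
}

/* Ratio of original to compressed payload; larger means better compression. */
double compression_ratio(const struct mm_stat *st)
{
	assert(st->compr_data_size != 0);
	return (double)st->orig_data_size / (double)st->compr_data_size;
}
```

      Applied to the jobs1 rows, this gives ratios of about 93 for LZO,
      132 for DEFLATE and 195 for ZSTD on this (highly compressible) test
      data.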
      
      ==================================================================================
      
      Official benchmarks [1]:
      
      Compressor name         Ratio   Compression     Decompress.
      zstd 1.1.3 -1           2.877   430 MB/s        1110 MB/s
      zlib 1.2.8 -1           2.743   110 MB/s        400 MB/s
      brotli 0.5.2 -0         2.708   400 MB/s        430 MB/s
      quicklz 1.5.0 -1        2.238   550 MB/s        710 MB/s
      lzo1x 2.09 -1           2.108   650 MB/s        830 MB/s
      lz4 1.7.5               2.101   720 MB/s        3600 MB/s
      snappy 1.1.3            2.091   500 MB/s        1650 MB/s
      lzf 3.6 -1              2.077   400 MB/s        860 MB/s
      
      Minchan said:
      
      : I did test with my sample data and compared zstd with deflate.  zstd's
      : compress ratio is lower a little bit but compression speed is much faster
      : 3 times more and decompress speed is too 2 times more.  With different
      : data, it is different but overall, zstd would be better for speed at the
      : cost of a little lower compress ratio(about 5%) so I believe it's worth to
      : replace deflate.
      
      [1] https://github.com/facebook/zstd
      
      Link: http://lkml.kernel.org/r/20170912050005.3247-1-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Tested-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ef3a8b1
    • Yafang Shao's avatar
      mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical · 0f6d24f8
      Yafang Shao authored
      The vm direct limit setting must be greater than the vm background
      limit setting.  Otherwise print a warning to help the operator figure
      out that the vm dirtiness settings are in an illogical state.
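
      A minimal userspace sketch of such a check follows.  The function
      name is hypothetical, and the fallback of halving the direct limit
      is an assumption about how the settings are recovered; the
      changelog itself only promises the warning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical model of the sanity check: the background threshold
 * must stay below the direct (foreground) threshold.
 * Returns true when the settings were illogical and had to be adjusted.
 */
bool fix_dirty_thresholds(unsigned long *bg_thresh, unsigned long thresh)
{
	assert(bg_thresh != NULL);
	if (*bg_thresh < thresh)
		return false;

	fprintf(stderr,
		"vm direct limit must be set greater than background limit\n");
	/* Assumed fallback: start background writeback at half the direct limit. */
	*bg_thresh = thresh / 2;
	return true;
}
```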
      
      Link: http://lkml.kernel.org/r/1506592464-30962-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f6d24f8
    • Gioh Kim's avatar
      mm/memblock.c: make the index explicit argument of for_each_memblock_type · 66e8b438
      Gioh Kim authored
      The for_each_memblock_type macro relies on an idx variable defined in
      the caller's context.  Silent macro arguments are almost always the
      wrong thing to do.  They make code harder to read and easier to get
      wrong.  Let's use an explicit iterator parameter for
      for_each_memblock_type and make the code more obvious.  This patch is
      a mere cleanup and it shouldn't introduce any functional change.
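
      The idea can be illustrated outside the kernel with a simplified
      iterator macro.  The types and the macro name below are stand-ins
      invented for this sketch, not the memblock definitions themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's memblock structures (illustrative). */
struct region { unsigned long base, size; };
struct region_list { size_t cnt; struct region *regions; };

/*
 * The iterator 'i' is an explicit macro argument, so the caller can see
 * which variable the loop modifies instead of relying on a hidden 'idx'.
 */
#define for_each_region(i, list, rgn)                      \
	for ((i) = 0, (rgn) = &(list)->regions[0];         \
	     (i) < (list)->cnt;                            \
	     (i)++, (rgn) = &(list)->regions[(i)])

/* Sum the sizes of all regions using the explicit-index iterator. */
unsigned long total_size(struct region_list *list)
{
	struct region *rgn;
	unsigned long sum = 0;
	size_t i;

	for_each_region(i, list, rgn)
		sum += rgn->size;
	return sum;
}
```

      With the index passed in, a reader of total_size() can tell at a
      glance that the loop owns the local variable i.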
      
      Link: http://lkml.kernel.org/r/20170913133029.28911-1-gi-oh.kim@profitbricks.com
      Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66e8b438
    • Michal Hocko's avatar
      mm, memory_hotplug: remove timeout from __offline_memory · ecde0f3e
      Michal Hocko authored
      Memory offlining has had a hardcoded 120s timeout, after which the
      operation fails, basically since hot remove was introduced.  This is
      essentially a policy implemented in the kernel.  Moreover, there is no
      way to adjust the timeout, so we sometimes see memory offline failures
      when the system is under heavy memory pressure or a very intensive CPU
      workload on large machines.
      
      It is not very clear what purpose the timeout actually serves.  The
      offline operation is interruptible by a signal so if userspace wants
      some timeout based termination this can be done trivially by sending a
      signal.
      
      If there is a strong use case for doing this from the kernel then we
      should do it properly and make it tunable from userspace, with the
      timeout disabled by default, along with an explanation of who uses it
      and for what purpose.
      
      Link: http://lkml.kernel.org/r/20170918070834.13083-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ecde0f3e
    • Michal Hocko's avatar
      mm, memory_hotplug: do not fail offlining too early · 72b39cfc
      Michal Hocko authored
      Patch series "mm, memory_hotplug: redefine memory offline retry logic", v2.
      
      While testing memory hotplug on a large 4TB machine we have noticed that
      memory offlining is just too eager to fail.  The primary reason is that
      the retry logic gives up too easily.  We have 4 ways out of the
      offline:
      
      	- we have a permanent failure (isolation or memory notifiers fail,
      	  or hugetlb pages cannot be dropped)
      	- userspace sends a signal
      	- a hardcoded 120s timeout expires
      	- page migration fails 5 times
      
      This is way too convoluted and it doesn't scale very well.  We have seen
      both temporary migration failures and the 120s timeout being triggered.
      After removing those restrictions we were able to pass stress testing
      during memory hot remove without any other negative side effects
      observed.  Therefore I suggest dropping both hardcoded policies.  I
      couldn't find any specific reason for them in the changelog, and I
      didn't get any response [1] from Kamezawa either.  If we need some
      upper bound, e.g. timeout based, then we should have a proper and
      user-defined policy for that.  In any case there should be a clear use
      case when introducing it.
      
      This patch (of 2):
      
      Memory offlining can fail too eagerly under heavy memory pressure.
      
        page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
        flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
        page dumped because: isolation failed
        page->mem_cgroup:ffff8801cd662000
        memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
      
      Isolation has failed here because the page is not on the LRU, most
      probably because it was on the pcp LRU cache or it has been removed
      from the LRU already but hasn't been freed yet.  In both cases the
      page doesn't look non-migrable so retrying more makes sense.
      
      __offline_pages seems rather cluttered when it comes to the retry logic.
      We have 5 retries at maximum and a timeout.  We could argue whether the
      timeout makes sense, but failing just because of a race when somebody
      isolates a page from the LRU or puts it on a pcp LRU list is just
      wrong.  All it takes is a race with a process which unmaps some pages
      and removes them from the LRU list, and we can fail the whole offline
      because of something that is a temporary condition and actually not
      harmful for the offline.
      
      Please note that unmovable pages should be already excluded during
      start_isolate_page_range.  We could argue that has_unmovable_pages is
      racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
      but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
      pages in most cases and movable zone shouldn't contain unmovable pages
      at all.  Some of those pages might be pinned but not for ever because
      that would be a bug on its own.  In any case the context is still
      interruptible and so the userspace can easily bail out when the
      operation takes too long.  This is certainly better behavior than a
      hardcoded retry loop which is racy.
      
      Fix this by removing the max retry count and relying only on the
      timeout or on interruption by a signal from userspace.  Also retry
      rather than fail when check_pages_isolated sees some !free pages
      because those could be a result of the race as well.
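
      The resulting control flow can be modelled in userspace as below.
      This is only a sketch of the end state of the series (no retry
      counter and, after the second patch, no timeout); the callback
      structure and every name in it are hypothetical stand-ins for
      kernel-internal steps.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical callbacks standing in for kernel-internal steps. */
struct offline_ops {
	bool (*signal_pending)(void *ctx);	/* did userspace interrupt us? */
	bool (*try_migrate)(void *ctx);		/* true once the range is empty */
	void *ctx;
};

/*
 * Retry until the range is migrated or the caller is interrupted.
 * There is deliberately no retry counter: transient failures (for
 * example a page temporarily off the LRU) simply trigger another pass.
 */
int offline_range(const struct offline_ops *ops)
{
	for (;;) {
		if (ops->signal_pending(ops->ctx))
			return -EINTR;
		if (ops->try_migrate(ops->ctx))
			return 0;
	}
}

/* Demo context: migration fails 'fail_left' times; a signal arrives
 * after 'passes_until_signal' checks have come back negative. */
struct demo_ctx { int fail_left; int passes_until_signal; };

bool demo_signal_pending(void *p)
{
	struct demo_ctx *c = p;
	return c->passes_until_signal-- <= 0;
}

bool demo_try_migrate(void *p)
{
	struct demo_ctx *c = p;
	return c->fail_left-- <= 0;
}
```

      In this model an offline that hits seven transient failures still
      succeeds, whereas a fixed limit of five retries would have failed it.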
      
      Link: http://lkml.kernel.org/r/20170918070834.13083-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72b39cfc
    • Michal Hocko's avatar
      mm, page_alloc: fail has_unmovable_pages when seeing reserved pages · d7ab3672
      Michal Hocko authored
      Reserved pages should be completely ignored by the core mm because they
      have a special meaning for their owners.  has_unmovable_pages doesn't
      check those so we rely on other tests (reference count, or PageLRU) to
      fail on such pages.  Although this happens to work, it is safer to
      simply check for those explicitly instead of relying on how the owner
      of the page (ab)uses those fields for special purposes.
      
      Please note that this is more of a further fortification of the code
      rather than a fix of an existing issue.
      
      Link: http://lkml.kernel.org/r/20171013120756.jeopthigbmm3c7bl@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7ab3672
    • Michal Hocko's avatar
      mm: distinguish CMA and MOVABLE isolation in has_unmovable_pages() · 4da2ce25
      Michal Hocko authored
      Joonsoo has noticed that "mm: drop migrate type checks from
      has_unmovable_pages" would break the CMA allocator because it relies
      on has_unmovable_pages returning false even for CMA pageblocks which
      in fact don't have to be movable:
      
       alloc_contig_range
         start_isolate_page_range
           set_migratetype_isolate
             has_unmovable_pages
      
      This is a result of the code sharing between CMA and memory hotplug
      while each one has a different idea of what has_unmovable_pages should
      return.  This is unfortunate but fixing it properly would require a lot
      of code duplication.
      
      Fix the issue by introducing the requested migrate type argument and
      special-casing MIGRATE_CMA so that CMA page blocks are handled
      properly.  This will work for memory hotplug because it requires
      MIGRATE_MOVABLE.
      
      Link: http://lkml.kernel.org/r/20171019122118.y6cndierwl2vnguj@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Tested-by: Stefan Wahren <stefan.wahren@i2se.com>
      Tested-by: Ran Wang <ran.wang_1@nxp.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4da2ce25
    • Michal Hocko's avatar
      mm: drop migrate type checks from has_unmovable_pages · d7b236e1
      Michal Hocko authored
      Michael has noticed that the memory offline tries to migrate kernel code
      pages when doing
      
       echo 0 > /sys/devices/system/memory/memory0/online
      
      The current implementation will fail the operation after several failed
      page migration attempts but we shouldn't even attempt to migrate that
      memory and should fail right away because this memory is clearly not
      migrateable.  This will become a real problem when we drop the retry
      loop counter and the timeout.
      
      The real problem is in fact in has_unmovable_pages.  We should fail if
      there are any non-migrateable pages in the area.  In order to guarantee
      that, remove the migrate type checks because MIGRATE_MOVABLE is not
      guaranteed to contain only migrateable pages.  It is merely a heuristic.
      Similarly, MIGRATE_CMA does guarantee that the page allocator doesn't
      allocate any non-migrateable pages from the block but CMA allocations
      themselves are unlikely to be migrateable.  Therefore remove both
      checks.
      
      [akpm@linux-foundation.org: remove unused local `mt']
      Link: http://lkml.kernel.org/r/20171013120013.698-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Michael Ellerman <mpe@ellerman.id.au>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Tony Lindgren <tony@atomide.com>
      Tested-by: Ran Wang <ran.wang_1@nxp.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7b236e1
    • Tahsin Erdogan's avatar
      mm/page-writeback.c: remove unused parameter from balance_dirty_pages() · 4c578dce
      Tahsin Erdogan authored
      "mapping" parameter to balance_dirty_pages() is not used anymore.
      
      Fixes: dfb8ae56 ("writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback")
      Link: http://lkml.kernel.org/r/20170927221311.23263-1-tahsin@google.com
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c578dce
    • Huang Ying's avatar
      mm, swap: fix false error message in __swp_swapcount() · e9a6effa
      Huang Ying authored
      When a page fault occurs for a swap entry, the physical swap readahead
      (not the VMA-based swap readahead) may read ahead several swap entries
      after the faulting swap entry.  The readahead algorithm calculates some
      of the swap entries to read ahead by increasing the offset of the
      faulting swap entry without checking whether they are beyond the end of
      the swap device, and it relies on __swp_swapcount() and
      swapcache_prepare() to check that.  Although __swp_swapcount() checks
      the swap entry passed in, it complains with the following error message
      for an invalid swap entry that is in fact expected.  This may confuse
      end users.
      
        swap_info_get: Bad swap offset entry 0200f8a7
      
      To fix the false error message, swap entry checking is added in
      swapin_readahead() to avoid passing out-of-bound swap entries and the
      swap entry reserved for the swap header to __swp_swapcount() and
      swapcache_prepare().
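
      The bounds check amounts to clamping the readahead window to valid
      slots.  The sketch below is a standalone model under stated
      assumptions: the window is aligned to a power-of-two size, slot 0
      holds the swap header, and the struct, function and parameter names
      are invented for illustration.

```c
#include <assert.h>

/* Inclusive range of swap offsets to read ahead. */
struct ra_window { unsigned long start, end; };

/*
 * Clamp a readahead window around 'offset' to valid swap slots.
 * 'mask' is (window size - 1) with the window size a power of two, and
 * 'max' is the number of slots on the device; slot 0 holds the swap
 * header and is never read.  All names here are illustrative.
 */
struct ra_window clamp_readahead(unsigned long offset, unsigned long mask,
				 unsigned long max)
{
	struct ra_window w;

	assert(max > 1 && offset > 0 && offset < max);
	w.start = offset & ~mask;
	w.end = offset | mask;
	if (w.start == 0)		/* skip the swap header slot */
		w.start = 1;
	if (w.end >= max)		/* do not run past the device end */
		w.end = max - 1;
	return w;
}
```

      For a 100-slot device and an 8-slot window, a fault at offset 98
      would be clamped to slots 96..99 rather than reaching for 100..103.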
      
      Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com
      Fixes: e8c26ab6 ("mm/swap: skip readahead for unreferenced swap slots")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reported-by: Christian Kujau <lists@nerdbynature.de>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>	[4.11+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9a6effa
    • Minchan Kim's avatar
      mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference · aa8d22a1
      Minchan Kim authored
      When SWP_SYNCHRONOUS_IO swapped-in pages are shared by several
      processes, skipping the swap cache can waste memory unnecessarily.
      With a swap-in fault caused by a read, the processes could share one
      page if it were in the swap cache, which avoids allocating new pages
      with the same content.
      
      This patch makes the swapcache skipping work only if the swap pte is
      non-sharable.
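
      Expressed as a predicate, the rule looks roughly like the sketch
      below.  The flag value and the function name are placeholders chosen
      for this example; "non-sharable" is modelled here as a swap
      reference count of exactly one, which is an interpretation of the
      commit title rather than a quote of the code.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag value; the kernel's actual bit assignment may differ. */
#define SWP_SYNCHRONOUS_IO (1u << 0)

/*
 * Bypass the swap cache only when the device is synchronous AND the
 * swap entry has exactly one reference, i.e. no other process could
 * benefit from finding the page in the swap cache.
 */
bool can_skip_swapcache(unsigned int swap_flags, unsigned int swap_count)
{
	return (swap_flags & SWP_SYNCHRONOUS_IO) && swap_count == 1;
}
```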
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1507620825-5537-1-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa8d22a1
    • Minchan Kim's avatar
      mm, swap: skip swapcache for swapin of synchronous device · 0bcac06f
      Minchan Kim authored
      With fast swap storage, platforms want to use swap more aggressively,
      and swap-in is crucial to application latency.
      
      The rw_page() based synchronous devices like zram, pmem and btt are such
      fast storage.  When I profile swapin performance with a zram lz4
      decompress test, S/W overhead is more than 70%.  It would probably be
      even bigger on nvdimm.
      
      This patch aims to reduce swap-in latency by skipping the swapcache if
      the swap device is a synchronous device such as an rw_page() based
      device.  It speeds up my swapin test (5G sequential swapin, no
      readahead) by 45%, from 2.41sec to 1.64sec.
      
      Link: http://lkml.kernel.org/r/1505886205-9671-5-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0bcac06f
    • Minchan Kim's avatar
      mm, swap: introduce SWP_SYNCHRONOUS_IO · 539a6fea
      Minchan Kim authored
      If rw_page() based fast storage is used for swap devices, we need to
      detect it to enhance swap IO operations.  This patch is preparation for
      optimizing the swap-in operation in the next patch.
      
      Link: http://lkml.kernel.org/r/1505886205-9671-4-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      539a6fea
    • Minchan Kim's avatar
      bdi: introduce BDI_CAP_SYNCHRONOUS_IO · 23c47d2a
      Minchan Kim authored
      As discussed at
      
        https://lkml.kernel.org/r/<20170728165604.10455-1-ross.zwisler@linux.intel.com>
      
      someday we will remove rw_page().  If so, we need something to detect
      such super-fast storage on which synchronous IO operations like the
      current rw_page() are always a win.
      
      Introduce BDI_CAP_SYNCHRONOUS_IO to indicate such devices.  With it, we
      could use various optimization techniques.
      
      Link: http://lkml.kernel.org/r/1505886205-9671-3-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23c47d2a
    • Minchan Kim's avatar
      zram: set BDI_CAP_STABLE_WRITES once · e447a015
      Minchan Kim authored
      With fast swap storage, the platform wants to use swap more aggressively
      and swap-in is crucial to application latency.
      
      The rw_page() based synchronous devices like zram, pmem and btt are such
      fast storage.  When I profile swapin performance with zram lz4
      decompress test, S/W overhead is more than 70%.  Maybe, it would be
      bigger in nvdimm.
      
      This patchset reduces swap-in latency by skipping swapcache if the swap
      device is a synchronous device like a rw_page() based device.
      
      It speeds up my swapin test (5G sequential swapin, no readahead) by
      45%, from 2.41sec to 1.64sec.
      
      This patch (of 4):
      
      Commit 19b7ccf8 ("block: get rid of blk_integrity_revalidate()")
      fixed a weird thing (i.e., resetting the BDI_CAP_STABLE_WRITES flag
      unconditionally whenever revalidate_disk is called) so zram doesn't
      need to reset the flag any more when revalidating the bdev.  Instead,
      set the flag just once when the zram device is created.
      
      It shouldn't change any behavior.
      
      Link: http://lkml.kernel.org/r/1505886205-9671-2-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e447a015
    • Changbin Du's avatar
      mm: update comments for struct page.mapping · 41710443
      Changbin Du authored
      struct page.mapping can be NULL or point to one object of type
      address_space, anon_vma or a KSM private structure.
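
      One way to picture how a single pointer field can refer to several
      object types is low-bit tagging.  The sketch below assumes tag values
      that follow the kernel's PAGE_MAPPING_ANON / PAGE_MAPPING_MOVABLE
      convention; the macro and enum names are made up for this example,
      and the "movable" case is included only for completeness (it is not
      mentioned in the changelog).

```c
#include <assert.h>
#include <stdint.h>

/* Low-bit tags (assumed to mirror the kernel's PAGE_MAPPING_* flags). */
#define MAPPING_ANON     0x1UL
#define MAPPING_MOVABLE  0x2UL
#define MAPPING_KSM      (MAPPING_ANON | MAPPING_MOVABLE)
#define MAPPING_FLAGS    (MAPPING_ANON | MAPPING_MOVABLE)

enum mapping_kind { KIND_NONE, KIND_FILE, KIND_ANON, KIND_KSM, KIND_MOVABLE };

/* Decode what a tagged mapping value refers to. */
enum mapping_kind classify_mapping(uintptr_t mapping)
{
	if (mapping == 0)
		return KIND_NONE;		/* NULL: no mapping */
	switch (mapping & MAPPING_FLAGS) {
	case MAPPING_KSM:
		return KIND_KSM;		/* KSM private structure */
	case MAPPING_ANON:
		return KIND_ANON;		/* anon_vma */
	case MAPPING_MOVABLE:
		return KIND_MOVABLE;		/* non-LRU movable page */
	default:
		return KIND_FILE;		/* untagged: address_space */
	}
}
```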
      
      Link: http://lkml.kernel.org/r/1506485067-15954-1-git-send-email-changbin.du@intel.com
      Signed-off-by: Changbin Du <changbin.du@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41710443
    • Johannes Thumshirn's avatar
      net/rds/ib_fmr.c: use kmalloc_array_node() · c413af87
      Johannes Thumshirn authored
      Now that we have a NUMA-aware version of kmalloc_array() we can use it
      instead of kmalloc_node() without an overflow check in the size
      calculation.
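
      The overflow check that an array-style allocator performs can be
      sketched in userspace as follows.  The function name is hypothetical
      and the NUMA node argument is left out because malloc() has no
      notion of nodes; only the size check is the point here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Userspace analogue of an overflow-checked array allocation.  The
 * name is hypothetical and the NUMA node argument is omitted because
 * malloc() has no notion of nodes.
 */
void *alloc_array(size_t n, size_t size)
{
	/* n * size would wrap around SIZE_MAX: refuse instead of under-allocating. */
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;
	return malloc(n * size);
}
```

      An open-coded n * size passed to a plain allocator would silently
      wrap in the second case and return a buffer that is far too small.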
      
      Link: http://lkml.kernel.org/r/20170927082038.3782-7-jthumshirn@suse.de
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Damien Le Moal <damien.lemoal@wdc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mike Marciniszyn <infinipath@intel.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c413af87
    • Johannes Thumshirn's avatar
      mm/mempool.c: use kmalloc_array_node() · 63762f50
      Johannes Thumshirn authored
      Now that we have a NUMA-aware version of kmalloc_array() we can use it
      instead of kmalloc_node() without an overflow check in the size
      calculation.
      
      Link: http://lkml.kernel.org/r/20170927082038.3782-6-jthumshirn@suse.de
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Damien Le Moal <damien.lemoal@wdc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mike Marciniszyn <infinipath@intel.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63762f50
    • Johannes Thumshirn's avatar
      drivers/infiniband/sw/rdmavt/qp.c: use kmalloc_array_node() · 3c073478
      Johannes Thumshirn authored
      Now that we have a NUMA-aware version of kmalloc_array() we can use it
      instead of kmalloc_node() without an overflow check in the size
      calculation.
      
      Link: http://lkml.kernel.org/r/20170927082038.3782-5-jthumshirn@suse.de
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Damien Le Moal <damien.lemoal@wdc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mike Marciniszyn <infinipath@intel.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c073478
    • Johannes Thumshirn's avatar
      drivers/infiniband/hw/qib/qib_init.c: use kmalloc_array_node() · 7d502071
      Johannes Thumshirn authored
      Now that we have a NUMA-aware version of kmalloc_array() we can use it
      instead of kmalloc_node() without an overflow check in the size
      calculation.
      
      Link: http://lkml.kernel.org/r/20170927082038.3782-4-jthumshirn@suse.de
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Mike Marciniszyn <infinipath@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Damien Le Moal <damien.lemoal@wdc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d502071
    • Johannes Thumshirn's avatar
      block/blk-mq.c: use kmalloc_array_node() · d904bfa7
      Johannes Thumshirn authored
      Now that we have a NUMA-aware version of kmalloc_array() we can use it
      instead of kmalloc_node() without an overflow check in the size
      calculation.
      
      Link: http://lkml.kernel.org/r/20170927082038.3782-3-jthumshirn@suse.de
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Damien Le Moal <damien.lemoal@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: Mike Marciniszyn <infinipath@intel.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d904bfa7
    • Johannes Thumshirn's avatar
      include/linux/slab.h: add kmalloc_array_node() and kcalloc_node() · 5799b255
      Johannes Thumshirn authored
      Patch series "Add kmalloc_array_node() and kcalloc_node()".
      
      Our current memory allocation routines suffer from an API imbalance:
      on the one hand we have kmalloc_array() and kcalloc(), which check for
      overflows in the size multiplication, and on the other we have
      kmalloc_node() and kzalloc_node(), which allow memory allocation on a
      given NUMA node but don't check for possible overflows.
      
      This patch (of 6):
      
      We have kmalloc_array() and kcalloc() wrappers on top of kmalloc() which
      ensure overflow-free multiplication for the size of a memory
      allocation, but these implementations are not NUMA-aware.

      Likewise we have kmalloc_node(), which is a NUMA-aware version of
      kmalloc(), but the implementation is not aware of any possible overflows
      in the size calculation.
      
      Introduce a combination of the two above cases to have a NUMA-node aware
      version of kmalloc_array() and kcalloc().
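
      The combination described above can be illustrated with a user-space
      sketch.  This is not the kernel implementation: the sketch_* names are
      hypothetical, malloc() stands in for kmalloc_node(), the node argument
      is accepted but ignored, and the gfp flags parameter that the kernel
      helpers take is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for kmalloc_node(): plain user space has no
 * NUMA-aware allocator here, so the node argument is ignored. */
static void *sketch_alloc_node(size_t size, int node)
{
	(void)node;
	return malloc(size);
}

/* Overflow-checked array allocation: return NULL rather than allocate a
 * too-small buffer when n * size would wrap around SIZE_MAX. */
static void *sketch_alloc_array_node(size_t n, size_t size, int node)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;
	return sketch_alloc_node(n * size, node);
}

/* Zeroing variant, analogous in spirit to kcalloc_node(). */
static void *sketch_calloc_node(size_t n, size_t size, int node)
{
	void *p = sketch_alloc_array_node(n, size, node);

	if (p)
		memset(p, 0, n * size);
	return p;
}
```

      A caller that previously wrote the multiplication by hand, as in
      sketch_alloc_node(n * size, node), would switch to
      sketch_alloc_array_node(n, size, node) and get the overflow check for
      free.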
      
      Link: http://lkml.kernel.org/r/20170927082038.3782-2-jthumshirn@suse.de
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Damien Le Moal <damien.lemoal@wdc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mike Marciniszyn <infinipath@intel.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5799b255
    • Miles Chen's avatar
      slub: fix sysfs duplicate filename creation when slub_debug=O · 11066386
      Miles Chen authored
      When slub_debug=O is set, it is possible to clear debug flags for an
      "unmergeable" slab cache in kmem_cache_open().  This makes the
      "unmergeable" cache become "mergeable" in sysfs_slab_add().
      
      These caches will generate their "unique IDs" by create_unique_id(), but
      it is possible to create identical unique IDs.  In my experiment,
      sgpool-128, names_cache, biovec-256 generate the same ID ":Ft-0004096" and
      the kernel reports "sysfs: cannot create duplicate filename
      '/kernel/slab/:Ft-0004096'".
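
      Why distinct caches can collide is easier to see in a small model.  The
      sketch below is a simplified illustration, not the code in mm/slub.c: it
      assumes the ID is derived only from a few flag letters plus a
      zero-padded object size, which matches the shape of ":Ft-0004096".  The
      flag constants, struct and function names are hypothetical, and the
      flags assigned to the caches are assumptions for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bits for this sketch; not the kernel's values. */
#define SKETCH_CONSISTENCY_CHECKS 0x1u	/* set: contributes 'F' */
#define SKETCH_NOTRACK            0x2u	/* clear: contributes 't' */

struct sketch_cache {
	const char *name;	/* never used when building the ID */
	unsigned int flags;
	unsigned int size;
};

/* Build an ID from flags and size only, e.g. ":Ft-0004096". */
static void sketch_unique_id(const struct sketch_cache *c,
			     char *buf, size_t len)
{
	char tags[3];
	size_t n = 0;

	if (c->flags & SKETCH_CONSISTENCY_CHECKS)
		tags[n++] = 'F';
	if (!(c->flags & SKETCH_NOTRACK))
		tags[n++] = 't';
	tags[n] = '\0';

	/* A '-' separates the flag letters from the size when any exist. */
	snprintf(buf, len, ":%s%s%07u", tags, n ? "-" : "", c->size);
}
```

      Under these assumptions, any two caches with the same flags and the same
      object size map to the same string regardless of their names, so
      registering both under that ID in sysfs would produce the duplicate
      filename warning quoted above.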
      
      To repeat my experiment, set disable_higher_order_debug=1,
      CONFIG_SLUB_DEBUG_ON=y in kernel-4.14.
      
      Fix this issue by setting unmergeable=1 if slub_debug=O and the
      default slub_debug contains any no-merge flags.
      
      call path:
      kmem_cache_create()
        __kmem_cache_alias()	-> we set SLAB_NEVER_MERGE flags here
        create_cache()
          __kmem_cache_create()
            kmem_cache_open()	-> clear DEBUG_METADATA_FLAGS
            sysfs_slab_add()	-> the slab cache is mergeable now
      
        sysfs: cannot create duplicate filename '/kernel/slab/:Ft-0004096'
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 1 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x60/0x7c
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W       4.14.0-rc7ajb-00131-gd4c2e9fc-dirty #123
        Hardware name: linux,dummy-virt (DT)
        task: ffffffc07d4e0080 task.stack: ffffff8008008000
        PC is at sysfs_warn_dup+0x60/0x7c
        LR is at sysfs_warn_dup+0x60/0x7c
        pc :  lr :  pstate: 60000145
        Call trace:
         sysfs_warn_dup+0x60/0x7c
         sysfs_create_dir_ns+0x98/0xa0
         kobject_add_internal+0xa0/0x294
         kobject_init_and_add+0x90/0xb4
         sysfs_slab_add+0x90/0x200
         __kmem_cache_create+0x26c/0x438
         kmem_cache_create+0x164/0x1f4
         sg_pool_init+0x60/0x100
         do_one_initcall+0x38/0x12c
         kernel_init_freeable+0x138/0x1d4
         kernel_init+0x10/0xfc
         ret_from_fork+0x10/0x18
      
      Link: http://lkml.kernel.org/r/1510365805-5155-1-git-send-email-miles.chen@mediatek.com
      Signed-off-by: Miles Chen <miles.chen@mediatek.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11066386
    • Alexey Dobriyan's avatar
      slab, slub, slob: convert slab_flags_t to 32-bit · 4fd0b46e
      Alexey Dobriyan authored
      struct kmem_cache::flags is "unsigned long" which is unnecessary on
      64-bit as no flags are defined in the higher bits.
      
      Switch the field to 32-bit and save some space on x86_64 until such
      flags appear:
      
      	add/remove: 0/0 grow/shrink: 0/107 up/down: 0/-657 (-657)
      	function                                     old     new   delta
      	sysfs_slab_add                               720     719      -1
      				...
      	check_object                                 699     676     -23
      
      [akpm@linux-foundation.org: fix printk warning]
      Link: http://lkml.kernel.org/r/20171021100635.GA8287@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4fd0b46e