1. 02 Apr, 2020 40 commits
    • Wei Yang's avatar
      mm/sparsemem: get address to page struct instead of address to pfn · 4627d76d
      Wei Yang authored
      memmap should be the address to page struct instead of address to pfn.
      
      As mentioned by David, if system memory and devmem sit within a section,
      the mismatch address would lead kdump to dump unexpected memory.
      
      Since sub-section only works for SPARSEMEM_VMEMMAP, pfn_to_page() is valid
      to get the page struct address at this point.
      
      Fixes: ba72b4c8 ("mm/sparsemem: support sub-section hotplug")
      Signed-off-by: default avatarWei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Link: http://lkml.kernel.org/r/20200210005048.10437-1-richardw.yang@linux.intel.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4627d76d
    • Brian Geffon's avatar
      selftests: add MREMAP_DONTUNMAP selftest · 0c28759e
      Brian Geffon authored
      Add a few simple self tests for the new flag MREMAP_DONTUNMAP, they are
      simple smoke tests which also demonstrate the behavior.
      
      [akpm@linux-foundation.org: convert eight-spaces to hard tabs]
      [bgeffon@google.com: v7]
        Link: http://lkml.kernel.org/r/20200221174248.244748-2-bgeffon@google.com
      [akpm@linux-foundation.org: coding style fixes]
      Signed-off-by: default avatarBrian Geffon <bgeffon@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Michael S . Tsirkin" <mst@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Will Deacon <will@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Jesse Barnes <jsbarnes@google.com>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Link: http://lkml.kernel.org/r/20200218173221.237674-2-bgeffon@google.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0c28759e
    • Brian Geffon's avatar
      mm/mremap: add MREMAP_DONTUNMAP to mremap() · e346b381
      Brian Geffon authored
      When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is set,
      the source mapping will not be removed.  The remap operation will be
      performed as it would have been normally by moving over the page tables to
      the new mapping.  The old vma will have any locked flags cleared, have no
      pagetables, and any userfaultfds that were watching that range will
      continue watching it.
      
      For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause
      the mremap() call to fail.  Because MREMAP_DONTUNMAP always results in
      moving a VMA you MUST use the MREMAP_MAYMOVE flag, it's not possible to
      resize a VMA while also moving with MREMAP_DONTUNMAP so old_len must
      always be equal to the new_len otherwise it will return -EINVAL.
      
      We hope to use this in Chrome OS where with userfaultfd we could write an
      anonymous mapping to disk without having to STOP the process or worry
      about VMA permission changes.
      
      This feature also has a use case in Android, Lokesh Gidra has said that
      "As part of using userfaultfd for GC, We'll have to move the physical
      pages of the java heap to a separate location.  For this purpose mremap
      will be used.  Without the MREMAP_DONTUNMAP flag, when I mremap the java
      heap, its virtual mapping will be removed as well.  Therefore, we'll
      require performing mmap immediately after.  This is not only time
      consuming but also opens a time window where a native thread may call mmap
      and reserve the java heap's address range for its own usage.  This flag
      solves the problem."
      
      [bgeffon@google.com: v6]
        Link: http://lkml.kernel.org/r/20200218173221.237674-1-bgeffon@google.com
      [bgeffon@google.com: v7]
        Link: http://lkml.kernel.org/r/20200221174248.244748-1-bgeffon@google.comSigned-off-by: default avatarBrian Geffon <bgeffon@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarLokesh Gidra <lokeshgidra@google.com>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: "Michael S . Tsirkin" <mst@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Will Deacon <will@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Jesse Barnes <jsbarnes@google.com>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Link: http://lkml.kernel.org/r/20200207201856.46070-1-bgeffon@google.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e346b381
    • Jaewon Kim's avatar
      mm: mmap: add trace point of vm_unmapped_area · df529cab
      Jaewon Kim authored
      Even on 64 bit kernel, the mmap failure can happen for a 32 bit task.
      Virtual memory space shortage of a task on mmap is reported to userspace
      as -ENOMEM.  It can be confused as physical memory shortage of overall
      system.
      
      The vm_unmapped_area can be called to by some drivers or other kernel core
      system like filesystem.  In my platform, GPU driver calls to
      vm_unmapped_area and the driver returns -ENOMEM even in GPU side shortage.
      It can be hard to distinguish which code layer returns the -ENOMEM.
      
      Create mmap trace file and add trace point of vm_unmapped_area.
      
      i.e.)
      277.156599: vm_unmapped_area: addr=77e0d03000 err=0 total_vm=0x17014b flags=0x1 len=0x400000 lo=0x8000 hi=0x7878c27000 mask=0x0 ofs=0x1
      342.838740: vm_unmapped_area: addr=0 err=-12 total_vm=0xffb08 flags=0x0 len=0x100000 lo=0x40000000 hi=0xfffff000 mask=0x0 ofs=0x22
      
      [akpm@linux-foundation.org: prefix address printk with 0x, per Matthew]
      Signed-off-by: default avatarJaewon Kim <jaewon31.kim@samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200320055823.27089-3-jaewon31.kim@samsung.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df529cab
    • Jaewon Kim's avatar
      mmap: remove inline of vm_unmapped_area · baceaf1c
      Jaewon Kim authored
      Patch series "mm: mmap: add mmap trace point", v3.
      
      Create mmap trace file and add trace point of vm_unmapped_area().
      
      This patch (of 2):
      
      In preparation for next patch remove inline of vm_unmapped_area and move
      code to mmap.c.  There is no logical change.
      
      Also remove unmapped_area[_topdown] out of mm.h, there is no code
      calling to them.
      Signed-off-by: default avatarJaewon Kim <jaewon31.kim@samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Link: http://lkml.kernel.org/r/20200320055823.27089-2-jaewon31.kim@samsung.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      baceaf1c
    • Wang Wenhu's avatar
      mm/memory.c: clarify a confusing comment for vm_iomap_memory · abd69b9e
      Wang Wenhu authored
      The param "start" actually referes to the physical memory start, which is
      to be mapped into virtual area vma.  And it is the field vma->vm_start
      which stands for the start of the area.
      
      Most of the time, we do not read through whole implementation of a
      function but only the definition and essential comments.  Accurate
      comments are definitely the base stone.
      Signed-off-by: default avatarWang Wenhu <wenhu.wang@vivo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200318052206.105104-1-wenhu.wang@vivo.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      abd69b9e
    • WANG Wenhu's avatar
      mm: clarify a confusing comment for remap_pfn_range() · 86a76331
      WANG Wenhu authored
      It really made me scratch my head.  Replace the comment with an accurate
      and consistent description.
      
      The parameter pfn actually refers to the page frame number which is
      right-shifted by PAGE_SHIFT from the physical address.
      Signed-off-by: default avatarWANG Wenhu <wenhu.wang@vivo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200310073955.43415-1-wenhu.wang@vivo.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      86a76331
    • Peter Xu's avatar
      mm/userfaultfd: honor FAULT_FLAG_KILLABLE in fault path · 3e69ad08
      Peter Xu authored
      Userfaultfd fault path was by default killable even if the caller does not
      have FAULT_FLAG_KILLABLE.  That makes sense before in that when with gup
      we don't have FAULT_FLAG_KILLABLE properly set before.  Now after previous
      patch we've got FAULT_FLAG_KILLABLE applied even for gup code so it should
      also make sense to let userfaultfd to honor the FAULT_FLAG_KILLABLE.
      
      Because we're unconditionally setting FAULT_FLAG_KILLABLE in gup code
      right now, this patch should have no functional change.  It also cleaned
      the code a little bit by introducing some helpers.
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160300.9941-1-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3e69ad08
    • Peter Xu's avatar
      mm/gup: allow to react to fatal signals · 71335f37
      Peter Xu authored
      The existing gup code does not react to the fatal signals in many code
      paths.  For example, in one retry path of gup we're still using
      down_read() rather than down_read_killable().  Also, when doing page
      faults we don't pass in FAULT_FLAG_KILLABLE as well, which means that
      within the faulting process we'll wait in non-killable way as well.  These
      were spotted by Linus during the code review of some other patches.
      
      Let's allow the gup code to react to fatal signals to improve the
      responsiveness of threads when during gup and being killed.
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160256.9887-1-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      71335f37
    • Peter Xu's avatar
      mm/gup: allow VM_FAULT_RETRY for multiple times · 4426e945
      Peter Xu authored
      This is the gup counterpart of the change that allows the VM_FAULT_RETRY
      to happen for more than once.  One thing to mention is that we must check
      the fatal signal here before retry because the GUP can be interrupted by
      that, otherwise we can loop forever.
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220195357.16371-1-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4426e945
    • Peter Xu's avatar
      mm: allow VM_FAULT_RETRY for multiple times · 4064b982
      Peter Xu authored
      The idea comes from a discussion between Linus and Andrea [1].
      
      Before this patch we only allow a page fault to retry once.  We achieved
      this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
      handle_mm_fault() the second time.  This was majorly used to avoid
      unexpected starvation of the system by looping over forever to handle the
      page fault on a single page.  However that should hardly happen, and after
      all for each code path to return a VM_FAULT_RETRY we'll first wait for a
      condition (during which time we should possibly yield the cpu) to happen
      before VM_FAULT_RETRY is really returned.
      
      This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
      flag when we receive VM_FAULT_RETRY.  It means that the page fault handler
      now can retry the page fault for multiple times if necessary without the
      need to generate another page fault event.  Meanwhile we still keep the
      FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
      page fault is the first attempt or not.
      
      Then we'll have these combinations of fault flags (only considering
      ALLOW_RETRY flag and TRIED flag):
      
        - ALLOW_RETRY and !TRIED:  this means the page fault allows to
                                   retry, and this is the first try
      
        - ALLOW_RETRY and TRIED:   this means the page fault allows to
                                   retry, and this is not the first try
      
        - !ALLOW_RETRY and !TRIED: this means the page fault does not allow
                                   to retry at all
      
        - !ALLOW_RETRY and TRIED:  this is forbidden and should never be used
      
      In existing code we have multiple places that has taken special care of
      the first condition above by checking against (fault_flags &
      FAULT_FLAG_ALLOW_RETRY).  This patch introduces a simple helper to detect
      the first retry of a page fault by checking against both (fault_flags &
      FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED) because now
      even the 2nd try will have the ALLOW_RETRY set, then use that helper in
      all existing special paths.  One example is in __lock_page_or_retry(), now
      we'll drop the mmap_sem only in the first attempt of page fault and we'll
      keep it in follow up retries, so old locking behavior will be retained.
      
      This will be a nice enhancement for current code [2] at the same time a
      supporting material for the future userfaultfd-writeprotect work, since in
      that work there will always be an explicit userfault writeprotect retry
      for protected pages, and if that cannot resolve the page fault (e.g., when
      userfaultfd-writeprotect is used in conjunction with swapped pages) then
      we'll possibly need a 3rd retry of the page fault.  It might also benefit
      other potential users who will have similar requirement like userfault
      write-protection.
      
      GUP code is not touched yet and will be covered in follow up patch.
      
      Please read the thread below for more information.
      
      [1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
      [2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Suggested-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4064b982
    • Peter Xu's avatar
      mm: introduce FAULT_FLAG_INTERRUPTIBLE · c270a7ee
      Peter Xu authored
      handle_userfaultfd() is currently the only one place in the kernel page
      fault procedures that can respond to non-fatal userspace signals.  It was
      trying to detect such an allowance by checking against USER & KILLABLE
      flags, which was "un-official".
      
      In this patch, we introduced a new flag (FAULT_FLAG_INTERRUPTIBLE) to show
      that the fault handler allows the fault procedure to respond even to
      non-fatal signals.  Meanwhile, add this new flag to the default fault
      flags so that all the page fault handlers can benefit from the new flag.
      With that, replacing the userfault check to this one.
      
      Since the line is getting even longer, clean up the fault flags a bit too
      to ease TTY users.
      
      Although we've got a new flag and applied it, we shouldn't have any
      functional change with this patch so far.
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220195348.16302-1-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c270a7ee
    • Peter Xu's avatar
      mm: introduce FAULT_FLAG_DEFAULT · dde16072
      Peter Xu authored
      Although there're tons of arch-specific page fault handlers, most of them
      are still sharing the same initial value of the page fault flags.  Say,
      merely all of the page fault handlers would allow the fault to be retried,
      and they also allow the fault to respond to SIGKILL.
      
      Let's define a default value for the fault flags to replace those initial
      page fault flags that were copied over.  With this, it'll be far easier to
      introduce new fault flag that can be used by all the architectures instead
      of touching all the archs.
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160238.9694-1-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dde16072
    • Peter Xu's avatar
      userfaultfd: don't retake mmap_sem to emulate NOPAGE · ef429ee7
      Peter Xu authored
      This patch removes the risk path in handle_userfault() then we will be
      sure that the callers of handle_mm_fault() will know that the VMAs might
      have changed.  Meanwhile with previous patch we don't lose responsiveness
      as well since the core mm code now can handle the nonfatal userspace
      signals even if we return VM_FAULT_RETRY.
      Suggested-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Reviewed-by: default avatarJerome Glisse <jglisse@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160234.9646-1-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ef429ee7
    • Peter Xu's avatar
      mm: return faster for non-fatal signals in user mode faults · 8b9a65fd
      Peter Xu authored
      The idea comes from the upstream discussion between Linus and Andrea:
      
        https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
      
      A summary to the issue: there was a special path in handle_userfault() in
      the past that we'll return a VM_FAULT_NOPAGE when we detected non-fatal
      signals when waiting for userfault handling.  We did that by reacquiring
      the mmap_sem before returning.  However that brings a risk in that the
      vmas might have changed when we retake the mmap_sem and even we could be
      holding an invalid vma structure.
      
      This patch is a preparation of removing that special path by allowing the
      page fault to return even faster if we were interrupted by a non-fatal
      signal during a user-mode page fault handling routine.
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Suggested-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160230.9598-1-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8b9a65fd
    • Peter Xu's avatar
      sh/mm: use helper fault_signal_pending() · fb027ada
      Peter Xu authored
      Let SH to use the new fault_signal_pending() helper.  Here we'll need to
      move the up_read() out because that's actually needed as long as !RETRY
      cases.  At the meantime we can drop all the rest of up_read()s now (which
      seems to be cleaner).
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160226.9550-1-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fb027ada
    • Peter Xu's avatar
      powerpc/mm: use helper fault_signal_pending() · c9a0dad1
      Peter Xu authored
      Let powerpc code to use the new helper, by moving the signal handling
      earlier before the retry logic.
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160222.9422-1-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c9a0dad1
    • Peter Xu's avatar
      arm64/mm: use helper fault_signal_pending() · b502f038
      Peter Xu authored
      Let the arm64 fault handling to use the new fault_signal_pending() helper,
      by moving the signal handling out of the retry logic.
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220155927.9264-1-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b502f038
    • Peter Xu's avatar
      arc/mm: use helper fault_signal_pending() · 24a62cf4
      Peter Xu authored
      Let ARC to use the new helper fault_signal_pending() by moving the signal
      check out of the retry logic as standalone.  This should also helps to
      simplify the code a bit.
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220155843.9172-1-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      24a62cf4
    • Peter Xu's avatar
      x86/mm: use helper fault_signal_pending() · 39678191
      Peter Xu authored
      Let's move the fatal signal check even earlier so that we can directly use
      the new fault_signal_pending() in x86 mm code.
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220155353.8676-5-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      39678191
    • Peter Xu's avatar
      mm: introduce fault_signal_pending() · 4ef87322
      Peter Xu authored
      For most architectures, we've got a quick path to detect fatal signal
      after a handle_mm_fault().  Introduce a helper for that quick path.
      
      It cleans the current codes a bit so we don't need to duplicate the same
      check across archs.  More importantly, this will be an unified place that
      we handle the signal immediately right after an interrupted page fault, so
      it'll be much easier for us if we want to change the behavior of handling
      signals later on for all the archs.
      
      Note that currently only part of the archs are using this new helper,
      because some archs have their own way to handle signals.  In the follow up
      patches, we'll try to apply this helper to all the rest of archs.
      
      Another note is that the "regs" parameter in the new helper is not used
      yet.  It'll be used very soon.  Now we kept it in this patch only to avoid
      touching all the archs again in the follow up patches.
      
      [peterx@redhat.com: fix sparse warnings]
        Link: http://lkml.kernel.org/r/20200311145921.GD479302@xz-x1Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220155353.8676-4-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ef87322
    • Peter Xu's avatar
      mm/gup: fix __get_user_pages() on fault retry of hugetlb · ad415db8
      Peter Xu authored
      When follow_hugetlb_page() returns with *locked==0, it means we've got a
      VM_FAULT_RETRY within the fauling process and we've released the mmap_sem.
      When that happens, we should stop and bail out.
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220155353.8676-3-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ad415db8
    • Peter Xu's avatar
      mm/gup: rename "nonblocking" to "locked" where proper · 4f6da934
      Peter Xu authored
      Patch series "mm: Page fault enhancements", v6.
      
      This series contains cleanups and enhancements to current page fault
      logic.  The whole idea comes from the discussion between Andrea and Linus
      on the bug reported by syzbot here:
      
        https://lkml.org/lkml/2017/11/2/833
      
      Basically it does two things:
      
        (a) Allows the page fault logic to be more interactive on not only
            SIGKILL, but also the rest of userspace signals, and,
      
        (b) Allows the page fault retry (VM_FAULT_RETRY) to happen for more
            than once.
      
      For (a): with the changes we should be able to react faster when page
      faults are working in parallel with userspace signals like SIGSTOP and
      SIGCONT (and more), and with that we can remove the buggy part in
      userfaultfd and benefit the whole page fault mechanism on faster signal
      processing to reach the userspace.
      
      For (b), we should be able to allow the page fault handler to loop for
      even more than twice.  Some context: for now since we have
      FAULT_FLAG_ALLOW_RETRY we can allow to retry the page fault once with the
      same interrupt context, however never more than twice.  This can be not
      only a potential cleanup to remove this assumption since AFAIU the code
      itself doesn't really have this twice-only limitation (though that should
      be a protective approach in the past), at the same time it'll greatly
      simplify future works like userfaultfd write-protect where it's possible
      to retry for more than twice (please have a look at [1] below for a
      possible user that might require the page fault to be handled for a third
      time; if we can remove the retry limitation we can simply drop that patch
      and those complexity).
      
      This patch (of 16):
      
      There's plenty of places around __get_user_pages() that has a parameter
      "nonblocking" which does not really mean that "it won't block" (because it
      can really block) but instead it shows whether the mmap_sem is released by
      up_read() during the page fault handling mostly when VM_FAULT_RETRY is
      returned.
      
      We have the correct naming in e.g.  get_user_pages_locked() or
      get_user_pages_remote() as "locked", however there're still many places
      that are using the "nonblocking" as name.
      
      Renaming the places to "locked" where proper to better suite the
      functionality of the variable.  While at it, fixing up some of the
      comments accordingly.
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarBrian Geffon <bgeffon@google.com>
      Reviewed-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: default avatarJerome Glisse <jglisse@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220155353.8676-2-peterx@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f6da934
    • Matthew Wilcox (Oracle)'s avatar
      mm: add pagemap.h to the fine documentation · 767e5ee5
      Matthew Wilcox (Oracle) authored
      The documentation currently does not include the deathless prose written
      to describe functions in pagemap.h because it's not included in any rst
      file.  Fix up the mismatches between parameter names and the documentation
      and add the file to mm-api.
      Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarZi Yan <ziy@nvidia.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Link: http://lkml.kernel.org/r/20200221220045.24989-1-willy@infradead.orgSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      767e5ee5
    • Anshuman Khandual's avatar
      mm/vma: make is_vma_temporary_stack() available for general use · 222100ee
      Anshuman Khandual authored
      Currently the declaration and definition for is_vma_temporary_stack() are
      scattered.  Lets make is_vma_temporary_stack() helper available for
      general use and also drop the declaration from (include/linux/huge_mm.h)
      which is no longer required.  While at this, rename this as
      vma_is_temporary_stack() in line with existing helpers.  This should not
      cause any functional change.
      Signed-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1582782965-3274-4-git-send-email-anshuman.khandual@arm.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      222100ee
    • Anshuman Khandual's avatar
      mm/vma: make vma_is_foreign() available for general use · 7969f226
      Anshuman Khandual authored
      Idea of a foreign VMA with respect to the present context is very generic.
      But currently there are two identical definitions for this in powerpc and
      x86 platforms.  Lets consolidate those redundant definitions while making
      vma_is_foreign() available for general use later.  This should not cause
      any functional change.
      Signed-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Link: http://lkml.kernel.org/r/1582782965-3274-3-git-send-email-anshuman.khandual@arm.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7969f226
    • Anshuman Khandual's avatar
      mm/vma: move VM_NO_KHUGEPAGED into generic header · b4443772
      Anshuman Khandual authored
      Patch series "mm/vma: some more minor changes", v2.
      
      The motivation here is to consolidate VMA flags and helpers in generic
      memory header and reduce code duplication when ever applicable.  If there
      are other possible similar instances which might be missing here, please
      do let me me know.  I will be happy to incorporate them.
      
      This patch (of 3):
      
      Move VM_NO_KHUGEPAGED into generic header (include/linux/mm.h).  This just
      makes sure that no VMA flag is scattered in individual function files any
      longer.  While at this, fix an old comment which is no longer valid.  This
      should not cause any functional change.
      Signed-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1582782965-3274-2-git-send-email-anshuman.khandual@arm.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b4443772
    • Thomas Hellstrom's avatar
      mm/mapping_dirty_helpers: update huge page-table entry callbacks · b2a403fd
      Thomas Hellstrom authored
      Following the update of pagewalk code commit a07984d48146 ("mm: pagewalk:
      add p4d_entry() and pgd_entry()") we can modify the mapping_dirty_helpers'
      huge page-table entry callbacks to avoid splitting when a huge pud or -pmd
      is encountered.
      Signed-off-by: default avatarThomas Hellstrom <thellstrom@vmware.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarSteven Price <steven.price@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200203154305.15045-1-thomas_os@shipmail.orgSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b2a403fd
    • Roman Gushchin's avatar
      mm: memcg: make memory.oom.group tolerable to task migration · 48fe267c
      Roman Gushchin authored
      If a task is getting moved out of the OOMing cgroup, it might result in
      unexpected OOM killings if memory.oom.group is used anywhere in the cgroup
      tree.
      
      Imagine the following example:
      
                A (oom.group = 1)
               / \
        (OOM) B   C
      
      Let's say B's memory.max is exceeded and it's OOMing.  The OOM killer
      selects a task in B as a victim, but someone asynchronously moves the task
      into C.  mem_cgroup_get_oom_group() will iterate over all ancestors of C
      up to the root cgroup.  In theory it had to stop at the oom_domain level -
      the memory cgroup which is OOMing.  But because B is not an ancestor of C,
      it's not happening.  Instead it chooses A (because it's oom.group is set),
      and kills all tasks in A.  This behavior is wrong because the OOM happened
      in B, so there is no reason to kill anything outside.
      
      Fix this by checking it the memory cgroup to which the task belongs is a
      descendant of the oom_domain.  If not, memory.oom.group should be ignored,
      and the OOM killer should kill only the victim task.
      Reported-by: default avatarDan Schatzberg <dschatzberg@fb.com>
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/20200316223510.3176148-1-guro@fb.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      48fe267c
    • Chris Down's avatar
      mm, memcg: prevent mem_cgroup_protected store tearing · b3a7822e
      Chris Down authored
      The read side of this is all protected, but we can still tear if multiple
      iterations of mem_cgroup_protected are going at the same time.
      
      There's some intentional racing in mem_cgroup_protected which is ok, but
      load/store tearing should be avoided.
      Signed-off-by: default avatarChris Down <chris@chrisdown.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/d1e9fbc0379fe8db475d82c8b6fbe048876e12ae.1584034301.git.chris@chrisdown.nameSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3a7822e
    • Chris Down's avatar
      mm, memcg: prevent memory.swap.max load tearing · 32d087cd
      Chris Down authored
      The write side of this is xchg()/smp_mb(), so that's all good.  Just a few
      sites missing a READ_ONCE.
      Signed-off-by: default avatarChris Down <chris@chrisdown.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/bbec2c3d822217334855c8877a9d28b2a6d395fb.1584034301.git.chris@chrisdown.nameSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      32d087cd
    • Chris Down's avatar
      mm, memcg: prevent memory.min load/store tearing · c3d53200
      Chris Down authored
      This can be set concurrently with reads, which may cause the wrong value
      to be propagated.
      Signed-off-by: default avatarChris Down <chris@chrisdown.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/e809b4e6b0c1626dac6945970de06409a180ee65.1584034301.git.chris@chrisdown.nameSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c3d53200
    • Chris Down's avatar
      mm, memcg: prevent memory.low load/store tearing · f86b810c
      Chris Down authored
      This can be set concurrently with reads, which may cause the wrong value
      to be propagated.
      Signed-off-by: default avatarChris Down <chris@chrisdown.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/448206f44b0fa7be9dad2ca2601d2bcb2c0b7844.1584034301.git.chris@chrisdown.nameSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f86b810c
    • Chris Down's avatar
      mm, memcg: prevent memory.max load tearing · 15b42562
      Chris Down authored
      This one is a bit more nuanced because we have memcg_max_mutex, which is
      mostly just used for enforcing invariants, but we still need to READ_ONCE
      since (despite its name) it doesn't really protect memory.max access.
      
      On write (page_counter_set_max() and memory_max_write()) we use xchg(),
      which uses smp_mb(), so that's already fine.
      Signed-off-by: default avatarChris Down <chris@chrisdown.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/50a31e5f39f8ae6c8fb73966ba1455f0924e8f44.1584034301.git.chris@chrisdown.nameSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      15b42562
    • Chris Down's avatar
      mm, memcg: prevent memory.high load/store tearing · f6f989c5
      Chris Down authored
      A mem_cgroup's high attribute can be concurrently set at the same time as
      we are trying to read it -- for example, if we are in memory_high_write at
      the same time as we are trying to do high reclaim.
      Signed-off-by: default avatarChris Down <chris@chrisdown.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/2f66f7038ed1d4688e59de72b627ae0ea52efa83.1584034301.git.chris@chrisdown.nameSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f6f989c5
    • Vincenzo Frascino's avatar
      mm/memcontrol.c: make mem_cgroup_id_get_many() __maybe_unused · c1514c0a
      Vincenzo Frascino authored
      mem_cgroup_id_get_many() is currently used only when MMU or MEMCG_SWAP
      configuration options are enabled.  Having them disabled triggers the
      following warning at compile time:
      
        linux/mm/memcontrol.c:4797:13: warning: `mem_cgroup_id_get_many' defined but not used [-Wunused-function]
         static void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n)
      
      Make mem_cgroup_id_get_many() __maybe_unused to address the issue.
      Signed-off-by: default avatarVincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarChris Down <chris@chrisdown.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/20200305164354.48147-1-vincenzo.frascino@arm.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c1514c0a
    • Shakeel Butt's avatar
      memcg: css_tryget_online cleanups · 8965aa28
      Shakeel Butt authored
      Currently multiple locations in memcg code, css_tryget_online() is being
      used. However it doesn't matter whether the cgroup is online for the
      callers. Online used to matter when we had reparenting on offlining and
      we needed a way to prevent new ones from showing up.
      
      The failure case for couple of these css_tryget_online usage is to
      fallback to root_mem_cgroup which kind of make bypassing the memcg
      limits possible for some workloads. For example creating an inotify
      group in a subcontainer and then deleting that container after moving the
      process to a different container will make all the event objects
      allocated for that group to the root_mem_cgroup. So, using
      css_tryget_online() is dangerous for such cases.
      
      Two locations still use the online version. The swapin of offlined
      memcg's pages and the memcg kmem cache creation. The kmem cache indeed
      needs the online version as the kernel does the reparenting of memcg
      kmem caches. For the swapin case, it has been left for later as the
      fallback is not really that concerning.
      
      With swap accounting enabled, if the memcg of the swapped out page is
      not online then the memcg extracted from the given 'mm' will be charged
      and if 'mm' is NULL then root memcg will be charged.  However I could
      not find a code path where the given 'mm' will be NULL for swap-in
      case.
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/20200302203109.179417-1-shakeelb@google.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8965aa28
    • Johannes Weiner's avatar
      mm: memcontrol: recursive memory.low protection · 8a931f80
      Johannes Weiner authored
      Right now, the effective protection of any given cgroup is capped by its
      own explicit memory.low setting, regardless of what the parent says.  The
      reasons for this are mostly historical and ease of implementation: to make
      delegation of memory.low safe, effective protection is the min() of all
      memory.low up the tree.
      
      Unfortunately, this limitation makes it impossible to protect an entire
      subtree from another without forcing the user to make explicit protection
      allocations all the way to the leaf cgroups - something that is highly
      undesirable in real life scenarios.
      
      Consider memory in a data center host.  At the cgroup top level, we have a
      distinction between system management software and the actual workload the
      system is executing.  Both branches are further subdivided into individual
      services, job components etc.
      
      We want to protect the workload as a whole from the system management
      software, but that doesn't mean we want to protect and prioritize
      individual workload wrt each other.  Their memory demand can vary over
      time, and we'd want the VM to simply cache the hottest data within the
      workload subtree.  Yet, the current memory.low limitations force us to
      allocate a fixed amount of protection to each workload component in order
      to get protection from system management software in general.  This
      results in very inefficient resource distribution.
      
      Another concern with mandating downward allocation is that, as the
      complexity of the cgroup tree grows, it gets harder for the lower levels
      to be informed about decisions made at the host-level.  Consider a
      container inside a namespace that in turn creates its own nested tree of
      cgroups to run multiple workloads.  It'd be extremely difficult to
      configure memory.low parameters in those leaf cgroups that on one hand
      balance pressure among siblings as the container desires, while also
      reflecting the host-level protection from e.g.  rpm upgrades, that lie
      beyond one or more delegation and namespacing points in the tree.
      
      It's highly unusual from a cgroup interface POV that nested levels have to
      be aware of and reflect decisions made at higher levels for them to be
      effective.
      
      To enable such use cases and scale configurability for complex trees, this
      patch implements a resource inheritance model for memory that is similar
      to how the CPU and the IO controller implement work-conserving resource
      allocations: a share of a resource allocated to a subree always applies to
      the entire subtree recursively, while allowing, but not mandating,
      children to further specify distribution rules.
      
      That means that if protection is explicitly allocated among siblings,
      those configured shares are being followed during page reclaim just like
      they are now.  However, if the memory.low set at a higher level is not
      fully claimed by the children in that subtree, the "floating" remainder is
      applied to each cgroup in the tree in proportion to its size.  Since
      reclaim pressure is applied in proportion to size as well, each child in
      that tree gets the same boost, and the effect is neutral among siblings -
      with respect to each other, they behave as if no memory control was
      enabled at all, and the VM simply balances the memory demands optimally
      within the subtree.  But collectively those cgroups enjoy a boost over the
      cgroups in neighboring trees.
      
      E.g.  a leaf cgroup with a memory.low setting of 0 no longer means that
      it's not getting a share of the hierarchically assigned resource, just
      that it doesn't claim a fixed amount of it to protect from its siblings.
      
      This allows us to recursively protect one subtree (workload) from another
      (system management), while letting subgroups compete freely among each
      other - without having to assign fixed shares to each leaf, and without
      nested groups having to echo higher-level settings.
      
      The floating protection composes naturally with fixed protection.
      Consider the following example tree:
      
      		A            A: low = 2G
                     / \          A1: low = 1G
                    A1 A2         A2: low = 0G
      
      As outside pressure is applied to this tree, A1 will enjoy a fixed
      protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
      evenly among A1 and A2, coming out to 1.5G and 0.5G.
      
      There is a slight risk of regressing theoretical setups where the
      top-level cgroups don't know about the true budgeting and set bogusly high
      "bypass" values that are meaningfully allocated down the tree.  Such
      setups would rely on unclaimed protection to be discarded, and
      distributing it would change the intended behavior.  Be safe and hide the
      new behavior behind a mount option, 'memory_recursiveprot'.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarChris Down <chris@chrisdown.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.orgSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8a931f80
    • Johannes Weiner's avatar
      mm: memcontrol: clean up and document effective low/min calculations · bc50bcc6
      Johannes Weiner authored
      The effective protection of any given cgroup is a somewhat complicated
      construct that depends on the ancestor's configuration, siblings'
      configurations, as well as current memory utilization in all these groups.
      It's done this way to satisfy hierarchical delegation requirements while
      also making the configuration semantics flexible and expressive in complex
      real life scenarios.
      
      Unfortunately, all the rules and requirements are sparsely documented, and
      the code is a little too clever in merging different scenarios into a
      single min() expression.  This makes it hard to reason about the
      implementation and avoid breaking semantics when making changes to it.
      
      This patch documents each semantic rule individually and splits out the
      handling of the overcommit case from the regular case.
      
      Michal Koutný also points out that the points of equilibrium as described
      in the existing example scenarios aren't actually accurate.  Delete these
      examples for now to avoid confusion.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarChris Down <chris@chrisdown.name>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Link: http://lkml.kernel.org/r/20200227195606.46212-3-hannes@cmpxchg.orgSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bc50bcc6
    • Johannes Weiner's avatar
      mm: memcontrol: fix memory.low proportional distribution · 503970e4
      Johannes Weiner authored
      Patch series "mm: memcontrol: recursive memory.low protection", v3.
      
      The current memory.low (and memory.min) semantics require protection to be
      assigned to a cgroup in an untinterrupted chain from the top-level cgroup
      all the way to the leaf.
      
      In practice, we want to protect entire cgroup subtrees from each other
      (system management software vs.  workload), but we would like the VM to
      balance memory optimally *within* each subtree, without having to make
      explicit weight allocations among individual components.  The current
      semantics make that impossible.
      
      They also introduce unmanageable complexity into more advanced resource
      trees.  For example:
      
                host root
                `- system.slice
                   `- rpm upgrades
                   `- logging
                `- workload.slice
                   `- a container
                      `- system.slice
                      `- workload.slice
                         `- job A
                            `- component 1
                            `- component 2
                         `- job B
      
      At a host-level perspective, we would like to protect the outer
      workload.slice subtree as a whole from rpm upgrades, logging etc.  But for
      that to be effective, right now we'd have to propagate it down through the
      container, the inner workload.slice, into the job cgroup and ultimately
      the component cgroups where memory is actually, physically allocated.
      This may cross several tree delegation points and namespace boundaries,
      which make such a setup near impossible.
      
      CPU and IO on the other hand are already distributed recursively.  The
      user would simply configure allowances at the host level, and they would
      apply to the entire subtree without any downward propagation.
      
      To enable the above-mentioned usecases and bring memory in line with other
      resource controllers, this patch series extends memory.low/min such that
      settings apply recursively to the entire subtree.  Users can still assign
      explicit shares in subgroups, but if they don't, any ancestral protection
      will be distributed such that children compete freely amongst each other -
      as if no memory control were enabled inside the subtree - but enjoy
      protection from neighboring trees.
      
      In the above example, the user would then be able to configure shares of
      CPU, IO and memory at the host level to comprehensively protect and
      isolate the workload.slice as a whole from system.slice activity.
      
      Patch #1 fixes an existing bug that can give a cgroup tree more protection
      than it should receive as per ancestor configuration.
      
      Patch #2 simplifies and documents the existing code to make it easier to
      reason about the changes in the next patch.
      
      Patch #3 finally implements recursive memory protection semantics.
      
      Because of a risk of regressing legacy setups, the new semantics are
      hidden behind a cgroup2 mount option, 'memory_recursiveprot'.
      
      More details in patch #3.
      
      This patch (of 3):
      
      When memory.low is overcommitted - i.e.  the children claim more
      protection than their shared ancestor grants them - the allowance is
      distributed in proportion to how much each sibling uses their own declared
      protection:
      
      	low_usage = min(memory.low, memory.current)
      	elow = parent_elow * (low_usage / siblings_low_usage)
      
      However, siblings_low_usage is not the sum of all low_usages. It sums
      up the usages of *only those cgroups that are within their memory.low*
      That means that low_usage can be *bigger* than siblings_low_usage, and
      consequently the total protection afforded to the children can be
      bigger than what the ancestor grants the subtree.
      
      Consider three groups where two are in excess of their protection:
      
        A/memory.low = 10G
        A/A1/memory.low = 10G, memory.current = 20G
        A/A2/memory.low = 10G, memory.current = 20G
        A/A3/memory.low = 10G, memory.current =  8G
        siblings_low_usage = 8G (only A3 contributes)
      
        A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
        A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
        A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G
      
        (the 12.5G are capped to the explicit memory.low setting of 10G)
      
      With that, the sum of all awarded protection below A is 30G, when A
      only grants 10G for the entire subtree.
      
      What does this mean in practice? A1 and A2 would still be in excess of
      their 10G allowance and would be reclaimed, whereas A3 would not. As
      they eventually drop below their protection setting, they would be
      counted in siblings_low_usage again and the error would right itself.
      
      When reclaim was applied in a binary fashion (cgroup is reclaimed when
      it's above its protection, otherwise it's skipped) this would actually
      work out just fine. However, since 1bc63fb1 ("mm, memcg: make scan
      aggression always exclude protection"), reclaim pressure is scaled to
      how much a cgroup is above its protection. As a result this
      calculation error unduly skews pressure away from A1 and A2 toward the
      rest of the system.
      
      But why did we do it like this in the first place?
      
      The reasoning behind exempting groups in excess from
      siblings_low_usage was to go after them first during reclaim in an
      overcommitted subtree:
      
        A/memory.low = 2G, memory.current = 4G
        A/A1/memory.low = 3G, memory.current = 2G
        A/A2/memory.low = 1G, memory.current = 2G
      
        siblings_low_usage = 2G (only A1 contributes)
        A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G
        A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G
      
      While the children combined are overcomitting A and are technically
      both at fault, A2 is actively declaring unprotected memory and we
      would like to reclaim that first.
      
      However, while this sounds like a noble goal on the face of it, it
      doesn't make much difference in actual memory distribution: Because A
      is overcommitted, reclaim will not stop once A2 gets pushed back to
      within its allowance; we'll have to reclaim A1 either way. The end
      result is still that protection is distributed proportionally, with A1
      getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance.
      
      [ If A weren't overcommitted, it wouldn't make a difference since each
        cgroup would just get the protection it declares:
      
        A/memory.low = 2G, memory.current = 3G
        A/A1/memory.low = 1G, memory.current = 1G
        A/A2/memory.low = 1G, memory.current = 2G
      
        With the current calculation:
      
        siblings_low_usage = 1G (only A1 contributes)
        A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
        A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
      
        Including excess groups in siblings_low_usage:
      
        siblings_low_usage = 2G
        A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G
        A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G ]
      
      Simplify the calculation and fix the proportional reclaim bug by
      including excess cgroups in siblings_low_usage.
      
      After this patch, the effective memory.low distribution from the
      example above would be as follows:
      
        A/memory.low = 10G
        A/A1/memory.low = 10G, memory.current = 20G
        A/A2/memory.low = 10G, memory.current = 20G
        A/A3/memory.low = 10G, memory.current =  8G
        siblings_low_usage = 28G
      
        A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
        A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
        A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(28G) = 2.8G
      
      Fixes: 1bc63fb1 ("mm, memcg: make scan aggression always exclude protection")
      Fixes: 23067153 ("mm: memory.low hierarchical behavior")
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarChris Down <chris@chrisdown.name>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Link: http://lkml.kernel.org/r/20200227195606.46212-2-hannes@cmpxchg.orgSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      503970e4