02 Jun, 2020 (40 commits)
• mm: enforce that vmap can't map pages executable · cca98e9f
      Christoph Hellwig authored
To help enforce the W^X protection, don't allow remapping existing pages
as executable.
      
      x86 bits from Peter Zijlstra, arm64 bits from Mark Rutland.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-20-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: remove the prot argument from vm_map_ram · d4efd79a
      Christoph Hellwig authored
The prot argument is always PAGE_KERNEL; for long-term mappings with other
properties, vmap should be used instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-19-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: remove unmap_vmap_area · 855e57a1
      Christoph Hellwig authored
This function has only a single caller; open code it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-18-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: remove map_vm_range · ed1f324c
      Christoph Hellwig authored
Switch all callers to map_kernel_range, which is symmetric to the unmap side
(as well as to the _noflush versions).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-17-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: don't return the number of pages from map_kernel_range{,_noflush} · 60bb4465
      Christoph Hellwig authored
      None of the callers needs the number of pages, and a 0 / -errno return
      value is a lot more intuitive.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-16-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: rename vmap_page_range to map_kernel_range · a29adb62
      Christoph Hellwig authored
      This matches the map_kernel_range_noflush API.  Also change to pass a size
      instead of the end, similar to the noflush version.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-15-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: remove vmap_page_range_noflush and vunmap_page_range · b521c43f
      Christoph Hellwig authored
These have non-static aliases called map_kernel_range_noflush and
unmap_kernel_range_noflush that differ only slightly in calling
convention, passing addr + size instead of an end address.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-14-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: pass addr as unsigned long to vb_free · 78a0e8c4
      Christoph Hellwig authored
Every use of addr in vb_free first casts it to unsigned long, and the caller
has an unsigned long version of the address available anyway.  Just pass
that and avoid all the casts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-13-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: only allow page table mappings for built-in zsmalloc · b607e6d1
      Christoph Hellwig authored
This allows unexporting map_vm_area and unmap_kernel_range, which are
rather deep internals and should not be available to modules: they allow,
for example, fine-grained control of mapping permissions, and also allow
splitting the setup of a vmalloc area from the actual mapping, and thus
expose vmalloc internals.
      
zsmalloc is typically built-in and continues to work (just like the
percpu-vm code, which uses a similar pattern), while modular zsmalloc also
continues to work, but must use copies.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-12-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: rename CONFIG_PGTABLE_MAPPING to CONFIG_ZSMALLOC_PGTABLE_MAPPING · 8b136018
      Christoph Hellwig authored
      Rename the Kconfig variable to clarify the scope.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-11-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: unexport unmap_kernel_range_noflush · 8f87cc93
      Christoph Hellwig authored
      There are no modular users of this function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-10-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: remove __get_vm_area · 49266277
      Christoph Hellwig authored
      Switch the two remaining callers to use __get_vm_area_caller instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-9-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• powerpc: remove __ioremap_at and __iounmap_at · 91f03f29
      Christoph Hellwig authored
      These helpers are only used for remapping the ISA I/O base.  Replace the
      mapping side with a remap_isa_range helper in isa-bridge.c that hard codes
      all the known arguments, and just remove __iounmap_at in favour of open
      coding it in the only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-8-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• powerpc: add an ioremap_phb helper · b274014c
      Christoph Hellwig authored
Factor the code shared between pci_64 and electra_cf into an ioremap_phb
helper that follows the normal ioremap semantics and returns a useful
__iomem pointer.  Note that it open codes __ioremap_at, as we know from the
callers that the slab allocator is available.  Switch pci_64 to also store
the result as an __iomem pointer, and unmap the result using iounmap
instead of force-casting and using vmalloc APIs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-7-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• dma-mapping: use vmap instead of reimplementing it · 515e5b6d
      Christoph Hellwig authored
      Replace the open coded instance of vmap with the actual function.  In
      the non-contiguous (IOMMU) case this requires an extra find_vm_area,
      but given that this isn't a fast path function that is a small price
      to pay.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-6-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• staging: media: ipu3: use vmap instead of reimplementing it · f8092aa1
      Christoph Hellwig authored
      Just use vmap instead of messing with vmalloc internals.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-5-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• staging: android: ion: use vmap instead of vm_map_ram · 5bf99174
      Christoph Hellwig authored
vm_map_ram can keep mappings around after vm_unmap_ram is called.  Using
that with non-PAGE_KERNEL mappings can lead to all kinds of aliasing issues.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200414131348.444715-4-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5bf99174
    • Christoph Hellwig's avatar
      x86: fix vmap arguments in map_irq_stack · 03488011
      Christoph Hellwig authored
      vmap does not take a gfp_t; the flags argument is for VM_* flags.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200414131348.444715-3-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03488011
    • Christoph Hellwig's avatar
      x86/hyperv: use vmalloc_exec for the hypercall page · 78bb17f7
      Christoph Hellwig authored
      Patch series "decruft the vmalloc API", v2.
      
      Peter noticed that with some dumb luck you can toast the kernel address
      space with exported vmalloc symbols.
      
      I used this as an opportunity to decruft the vmalloc.c API and make it
      much more systematic.  This also removes any chance to create vmalloc
      mappings outside the designated areas or using executable permissions
      from modules.  Besides that it removes more than 300 lines of code.
      
      This patch (of 29):
      
      Use the designated helper for allocating executable kernel memory, and
      remove the now unused PAGE_KERNEL_RX define.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Michael Kelley <mikelley@microsoft.com>
      Acked-by: Wei Liu <wei.liu@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Link: http://lkml.kernel.org/r/20200414131348.444715-1-hch@lst.de
      Link: http://lkml.kernel.org/r/20200414131348.444715-2-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      78bb17f7
    • Wetp Zhang's avatar
      mm, memory_failure: don't send BUS_MCEERR_AO for action required error · 872e9a20
      Wetp Zhang authored
      Some processes don't want to be killed early, but in the "Action
      Required" case, they may also be killed by BUS_MCEERR_AO when sharing
      memory with another process that is accessing the failed memory.  And
      sending SIGBUS with BUS_MCEERR_AO for an action required error is
      strange, so ignore the non-current processes here.
      Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/1590817116-21281-1-git-send-email-wetp.zy@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      872e9a20
    • chenqiwu's avatar
      mm/memory: remove unnecessary pte_devmap case in copy_one_pte() · 6972f55c
      chenqiwu authored
      Since commit 25b2995a ("mm: remove MEMORY_DEVICE_PUBLIC support"),
      the assignment to 'page' for pte_devmap case has been unnecessary.
      Let's remove it.
      
      [willy@infradead.org: changelog]
      Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/1587349685-31712-1-git-send-email-qiwuchen55@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6972f55c
    • Huang Ying's avatar
      /proc/PID/smaps: Add PMD migration entry parsing · c94b6923
      Huang Ying authored
      Now, when reading /proc/PID/smaps, a PMD migration entry in the page
      table is simply ignored.  To improve the accuracy of /proc/PID/smaps,
      parsing and processing of such entries is added.
      
      To test the patch, we run pmbench to eat 400 MB memory in background,
      then run /usr/bin/migratepages and `cat /proc/PID/smaps` every second.
      The issue as follows can be reproduced within 60 seconds.
      
      Before the patch, for the fully populated 400 MB anonymous VMA, some THP
      pages under migration may be lost as below.
      
        7f3f6a7e5000-7f3f837e5000 rw-p 00000000 00:00 0
        Size:             409600 kB
        KernelPageSize:        4 kB
        MMUPageSize:           4 kB
        Rss:              407552 kB
        Pss:              407552 kB
        Shared_Clean:          0 kB
        Shared_Dirty:          0 kB
        Private_Clean:         0 kB
        Private_Dirty:    407552 kB
        Referenced:       301056 kB
        Anonymous:        407552 kB
        LazyFree:              0 kB
        AnonHugePages:    405504 kB
        ShmemPmdMapped:        0 kB
        FilePmdMapped:        0 kB
        Shared_Hugetlb:        0 kB
        Private_Hugetlb:       0 kB
        Swap:                  0 kB
        SwapPss:               0 kB
        Locked:                0 kB
        THPeligible:		1
        VmFlags: rd wr mr mw me ac
      
      After the patch, it will be always,
      
        7f3f6a7e5000-7f3f837e5000 rw-p 00000000 00:00 0
        Size:             409600 kB
        KernelPageSize:        4 kB
        MMUPageSize:           4 kB
        Rss:              409600 kB
        Pss:              409600 kB
        Shared_Clean:          0 kB
        Shared_Dirty:          0 kB
        Private_Clean:         0 kB
        Private_Dirty:    409600 kB
        Referenced:       294912 kB
        Anonymous:        409600 kB
        LazyFree:              0 kB
        AnonHugePages:    407552 kB
        ShmemPmdMapped:        0 kB
        FilePmdMapped:        0 kB
        Shared_Hugetlb:        0 kB
        Private_Hugetlb:       0 kB
        Swap:                  0 kB
        SwapPss:               0 kB
        Locked:                0 kB
        THPeligible:		1
        VmFlags: rd wr mr mw me ac
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Link: http://lkml.kernel.org/r/20200403123059.1846960-1-ying.huang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c94b6923
    • Steven Price's avatar
      mm: ptdump: expand type of 'val' in note_page() · 99395ee3
      Steven Price authored
      The page table entry is passed in the 'val' argument to note_page(),
      however this was previously an "unsigned long" which is fine on 64-bit
      platforms.  But for 32 bit x86 it is not always big enough to contain a
      page table entry which may be 64 bits.
      
      Change the type to u64 to ensure that it is always big enough.
      
      [akpm@linux-foundation.org: fix riscv]
      Reported-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200521152308.33096-3-steven.price@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99395ee3
    • Steven Price's avatar
      x86: mm: ptdump: calculate effective permissions correctly · 1494e0c3
      Steven Price authored
      Patch series "Fix W+X debug feature on x86"
      
      Jan alerted me[1] that the W+X detection debug feature was broken in x86
      by my change[2] to switch x86 to use the generic ptdump infrastructure.
      
      Fundamentally the approach of trying to move the calculation of
      effective permissions into note_page() was broken because note_page() is
      only called for 'leaf' entries and the effective permissions are passed
      down via the internal nodes of the page tree.  The solution I've taken
      here is to create a new (optional) callback which is called for all
      nodes of the page tree and therefore can calculate the effective
      permissions.
      
      Secondly on some configurations (32 bit with PAE) "unsigned long" is not
      large enough to store the table entries.  The fix here is simple - let's
      just use a u64.
      
      [1] https://lore.kernel.org/lkml/d573dc7e-e742-84de-473d-f971142fa319@suse.com/
      [2] 2ae27137 ("x86: mm: convert dump_pagetables to use walk_page_range")
      
      This patch (of 2):
      
      By switching the x86 page table dump code to use the generic code the
      effective permissions are no longer calculated correctly because the
      note_page() function is only called for *leaf* entries.  To calculate
      the actual effective permissions it is necessary to observe the full
      hierarchy of the page tree.
      
      Introduce a new callback for ptdump which is called for every entry and
      can therefore update the prot_levels array correctly.  note_page() can
      then simply access the appropriate element in the array.
      
      [steven.price@arm.com: make the assignment conditional on val != 0]
        Link: http://lkml.kernel.org/r/430c8ab4-e7cd-6933-dde6-087fac6db872@arm.com
      Fixes: 2ae27137 ("x86: mm: convert dump_pagetables to use walk_page_range")
      Reported-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200521152308.33096-1-steven.price@arm.com
      Link: http://lkml.kernel.org/r/20200521152308.33096-2-steven.price@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1494e0c3
    • Zefan Li's avatar
      memcg: fix memcg_kmem_bypass() for remote memcg charging · 50d53d7c
      Zefan Li authored
      While trying to use remote memcg charging in an out-of-tree kernel
      module I found it's not working, because the current thread is a
      workqueue thread.
      
      As we will probably encounter this issue again as the users of
      memalloc_use_memcg() grow, and there's nothing wrong with this usage,
      it's better to fix it now.
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/1d202a12-26fe-0012-ea14-f025ddcd044a@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50d53d7c
    • Jakub Kicinski's avatar
      mm/memcg: automatically penalize tasks with high swap use · 4b82ab4f
      Jakub Kicinski authored
      Add a memory.swap.high knob, which can be used to protect the system
      from SWAP exhaustion.  The mechanism used for penalizing is similar to
      memory.high penalty (sleep on return to user space).
      
      That is not to say that the knob itself is equivalent to memory.high.
      The objective is more to protect the system from potentially buggy tasks
      consuming a lot of swap and impacting other tasks, or even bringing the
      whole system to a standstill with complete swap exhaustion.  Hopefully
      without the need to find per-task hard limits.
      
      Slowing misbehaving tasks down gradually allows user space oom killers
      or other protection mechanisms to react.  oomd and earlyoom already do
      killing based on swap exhaustion, and memory.swap.high protection will
      help implement such userspace oom policies more reliably.
      
      We can use one counter for number of pages allocated under pressure to
      save struct task space and avoid two separate hierarchy walks on the hot
      path.  The exact overage is calculated on return to user space, anyway.
      
      Take the new high limit into account when determining if swap is "full".
      Borrowing the explanation from Johannes:
      
        The idea behind "swap full" is that as long as the workload has plenty
        of swap space available and it's not changing its memory contents, it
        makes sense to generously hold on to copies of data in the swap device,
        even after the swapin.  A later reclaim cycle can drop the page without
        any IO.  Trading disk space for IO.
      
        But the only two ways to reclaim a swap slot is when they're faulted
        in and the references go away, or by scanning the virtual address space
        like swapoff does - which is very expensive (one could argue it's too
        expensive even for swapoff, it's often more practical to just reboot).
      
        So at some point in the fill level, we have to start freeing up swap
        slots on fault/swapin.  Otherwise we could eventually run out of swap
        slots while they're filled with copies of data that is also in RAM.
      
        We don't want to OOM a workload because its available swap space is
        filled with redundant cache.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200527195846.102707-5-kuba@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b82ab4f
    • Jakub Kicinski's avatar
      mm/memcg: move cgroup high memory limit setting into struct page_counter · d1663a90
      Jakub Kicinski authored
      High memory limit is currently recorded directly in struct mem_cgroup.
      We are about to add a high limit for swap, move the field to struct
      page_counter and add some helpers.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200527195846.102707-4-kuba@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d1663a90
    • Jakub Kicinski's avatar
      mm/memcg: move penalty delay clamping out of calculate_high_delay() · ff144e69
      Jakub Kicinski authored
      We will want to call calculate_high_delay() twice - once for memory and
      once for swap, and we should apply the clamp value to sum of the
      penalties.  Clamping has to be applied outside of calculate_high_delay().
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200527195846.102707-3-kuba@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff144e69
    • Jakub Kicinski's avatar
      mm/memcg: prepare for swap over-high accounting and penalty calculation · 8a5dbc65
      Jakub Kicinski authored
      Patch series "memcg: Slow down swap allocation as the available space
      gets depleted", v6.
      
      Tejun describes the problem as follows:
      
      When swap runs out, there's an abrupt change in system behavior - the
      anonymous memory suddenly becomes unmanageable which readily breaks any
      sort of memory isolation and can bring down the whole system.  To avoid
      that, oomd [1] monitors free swap space and triggers kills when it drops
      below the specific threshold (e.g.  15%).
      
      While this works, it's far from ideal:
      
       - Depending on IO performance and total swap size, a given
         headroom might not be enough or too much.
      
       - oomd has to monitor swap depletion in addition to the usual
         pressure metrics and it currently doesn't consider memory.swap.max.
      
      Solve this by adapting parts of the approach that memory.high uses:
      slow down allocation as the resource gets depleted, turning the
      depletion behavior from an abrupt cliff into gradual degradation
      observable through the memory pressure metric.
      
      [1] https://github.com/facebookincubator/oomd
      
      This patch (of 4):
      
      Slice the memory overage calculation logic a little bit so we can reuse
      it to apply a similar penalty to the swap.  The logic which accesses the
      memory-specific fields (use and high values) has to be taken out of
      calculate_high_delay().
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200527195846.102707-1-kuba@kernel.org
      Link: http://lkml.kernel.org/r/20200527195846.102707-2-kuba@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a5dbc65
    • Shakeel Butt's avatar
      memcg: expose root cgroup's memory.stat · 54b512e9
      Shakeel Butt authored
      One way to measure the efficiency of memory reclaim is to look at the
      ratio (pgscan+pgrefill)/pgsteal.  However at the moment these stats are
      not updated consistently at the system level and the ratio of these are
      not very meaningful.  The pgsteal and pgscan are updated for only global
      reclaim while pgrefill gets updated for global as well as cgroup
      reclaim.
      
      Please note that this difference is only for system level vmstats.  The
      cgroup stats returned by memory.stat are actually consistent.  The
      cgroup's pgsteal contains number of reclaimed pages for global as well
      as cgroup reclaim.  So, one way to get the system level stats is to get
      these stats from root's memory.stat, so, expose memory.stat for the root
      cgroup.
      
      From Johannes Weiner:
      	There are subtle differences between /proc/vmstat and
      	memory.stat, and cgroup-aware code that wants to watch the full
      	hierarchy currently has to know about these intricacies and
      	translate semantics back and forth.
      
      	Generally having the fully recursive memory.stat at the root
      	level could help a broader range of usecases.
      
      Why not fix the stats by including both the global and cgroup reclaim
      activity instead of exposing root cgroup's memory.stat? The reason is
      the benefit of having metrics exposing the activity that happens purely
      due to machine capacity rather than localized activity that happens due
      to the limits throughout the cgroup tree.  Additionally there are
      userspace tools like sysstat(sar) which reads these stats to inform
      about the system level reclaim activity.  So, we should not break such
      use-cases.
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yafang Shao <laoar.shao@gmail.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/20200508170630.94406-1-shakeelb@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      54b512e9
    • Kaixu Xia's avatar
      mm: memcontrol: simplify value comparison between count and limit · 1c4448ed
      Kaixu Xia authored
      When the variables count and limit have the same value (count == limit),
      the result of the min(margin, limit - count) statement is 0 and the
      variable margin is set to 0.  So in this case the min() statement is
      not necessary and we can directly set the variable margin to 0.
      Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/1587479661-27237-1-git-send-email-kaixuxia@tencent.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c4448ed
    • Yafang Shao's avatar
      mm, memcg: add workingset_restore in memory.stat · a6f5576b
      Yafang Shao authored
      There's a new workingset counter introduced in commit 1899ad18 ("mm:
      workingset: tell cache transitions from workingset thrashing").  With
      the help of this counter we can know whether the workingset is
      transitioning or thrashing.  To leverage the benefit of this counter
      for memcg, introduce it into memory.stat.  Then we can understand the
      workingset of the workload inside a memcg better.
      
      Below is the verification of this new counter in memory.stat.  Read a
      file into memory and then read it again to make these pages active.
      The size of this file is 1G (memory.max is greater than the file
      size).  The counters in memory.stat will be
      
      	inactive_file 0
      	active_file 1073639424
      
      	workingset_refault 0
      	workingset_activate 0
      	workingset_restore 0
      	workingset_nodereclaim 0
      
      Trigger the memcg reclaim by setting a lower value to memory.high, and
      then some pages will be demoted into inactive list, and then some pages
      in the inactive list will be evicted into the storage.
      
      	inactive_file 498094080
      	active_file 310063104
      
      	workingset_refault 0
      	workingset_activate 0
      	workingset_restore 0
      	workingset_nodereclaim 0
      
      Then restore memory.high and read the file into memory again.  As a
      result, the transition will occur.  Below is the result of this
      transition:
      
      	inactive_file 498094080
      	active_file 575397888
      
      	workingset_refault 64746
      	workingset_activate 64746
      	workingset_restore 64746
      	workingset_nodereclaim 0
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Link: http://lkml.kernel.org/r/20200504153522.11553-1-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6f5576b
    • Miaohe Lin's avatar
      include/linux/swap.h: delete meaningless __add_to_swap_cache() declaration · 251af0cd
      Miaohe Lin authored
      Since commit 8d93b41c ("mm: Convert add_to_swap_cache to XArray"),
      __add_to_swap_cache and add_to_swap_cache are combined into one
      function.  There is no __add_to_swap_cache() anymore.
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Link: http://lkml.kernel.org/r/1590810326-2493-1-git-send-email-linmiaohe@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      251af0cd
    • Randy Dunlap's avatar
      mm: swapfile: fix /proc/swaps heading and Size/Used/Priority alignment · 6f793940
      Randy Dunlap authored
      Fix the heading and Size/Used/Priority field alignments in /proc/swaps.
      If the Size and/or Used value is >= 10000000 (8 bytes), then the
      alignment by using tab characters is broken.
      
      This patch maintains the use of tabs for alignment.  If spaces are
      preferred, we can just use a Field Width specifier for the bytes and
      inuse fields.  That way those fields don't have to be a multiple of 8
      bytes in width.  E.g., with a field width of 12, both Size and Used
      would always fit on the first line of an 80-column wide terminal (only
      Priority would be on the second line).
      
      There are actually 2 problems: heading alignment and field width.  On an
      xterm, if Used is 7 bytes in length, the tab does nothing, and the
      display is like this, with no space/tab between the Used and Priority
      fields.  (ugh)
      
      Filename				Type		Size	Used	Priority
      /dev/sda8                               partition	16779260	2023012-1
      
      To be clear, if one does 'cat /proc/swaps >/tmp/proc.swaps', it does look
      different, like so:
      
      Filename				Type		Size	Used	Priority
      /dev/sda8                               partition	16779260	2086988	-1
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Link: http://lkml.kernel.org/r/c0ffb41a-81ac-ddfa-d452-a9229ecc0387@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f793940
    • Huang Ying's avatar
      swap: reduce lock contention on swap cache from swap slots allocation · 49070588
      Huang Ying authored
      In some swap scalability tests, heavy lock contention is found on the
      swap cache even though commit 4b3ef9da ("mm/swap: split swap cache
      into 64MB trunks") split the single swap cache radix tree per swap
      device into one radix tree per 64 MB trunk.
      
      The reason is as follows.  After the swap device becomes fragmented so
      that there's no free swap cluster, the swap device will be scanned
      linearly to find the free swap slots.  swap_info_struct->cluster_next is
      the next scanning base that is shared by all CPUs.  So nearby free swap
      slots will be allocated for different CPUs.  The probability for
      multiple CPUs to operate on the same 64 MB trunk is high.  This causes
      the lock contention on the swap cache.
      
      To solve the issue, this patch adds a per-CPU next scanning base
      (cluster_next_cpu) for SSD swap devices.  Each CPU uses its own
      scanning base, and after it finishes scanning a 64 MB trunk, its base
      is moved to the beginning of another randomly selected 64 MB trunk.
      This greatly reduces the probability that multiple CPUs operate on the
      same 64 MB trunk, and with it the lock contention.  For HDDs, where
      sequential access is more important for IO performance, the original
      shared next scanning base is kept.
      
      To test the patch, we have run 16-process pmbench memory benchmark on a
      2-socket server machine with 48 cores.  One ram disk is configured as the
      swap device per socket.  The pmbench working-set size is much larger than
      the available memory so that swapping is triggered.  The memory read/write
      ratio is 80/20 and the accessing pattern is random.  In the original
      implementation, the lock contention on the swap cache is heavy.  The perf
      profiling data of the lock contention code path is as follows:
      
       _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:      7.91
       _raw_spin_lock_irqsave.__remove_mapping.shrink_page_list:               7.11
       _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
       _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap:     1.66
       _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      1.29
       _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:         1.03
       _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:        0.93
      
      After applying this patch, it becomes,
      
       _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
       _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      2.3
       _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap:     2.26
       _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:        1.8
       _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:         1.19
      
      The lock contention on the swap cache is almost eliminated.
      
      The pmbench score increases by 18.5%.  The swapin throughput increases
      by 18.7%, from 2.96 GB/s to 3.51 GB/s, and the swapout throughput
      increases by 18.5%, from 2.99 GB/s to 3.54 GB/s.
      
      A really fast disk is needed to show the benefit.  On 2 Intel P3600
      NVMe disks the performance improvement is only about 1%; it should be
      larger on faster disks, such as Intel Optane.
      
      [ying.huang@intel.com: fix cluster_next_cpu allocation and freeing, per Daniel]
        Link: http://lkml.kernel.org/r/20200525002648.336325-1-ying.huang@intel.com
      [ying.huang@intel.com: v4]
        Link: http://lkml.kernel.org/r/20200529010840.928819-1-ying.huang@intel.com
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200520031502.175659-1-ying.huang@intel.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      49070588
    • Huang Ying's avatar
      mm/swapfile.c: use prandom_u32_max() · 09fe06ce
      Huang Ying authored
      Use prandom_u32_max() to improve code readability and to take advantage
      of the common implementation.
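
      For reference, the bounding trick that prandom_u32_max() relies on can
      be sketched in userspace C (bounded_u32 is a made-up name; the kernel
      helper applies the same multiply-shift to a prandom_u32() value):

```c
#include <assert.h>
#include <stdint.h>

/* Multiply-shift bounding: map a full-range 32-bit random value r into
 * [0, ep_ro) without a division.  The top 32 bits of the 64-bit product
 * r * ep_ro scale r proportionally into the target range. */
uint32_t bounded_u32(uint32_t r, uint32_t ep_ro)
{
    return (uint32_t)(((uint64_t)r * ep_ro) >> 32);
}
```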
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200512081013.520201-1-ying.huang@intel.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09fe06ce
    • Wei Yang's avatar
      33e16272
    • Wei Yang's avatar
      mm/swapfile.c: classify SWAP_MAP_XXX to make it more readable · 4b4bb6bb
      Wei Yang authored
      swap_info_struct->swap_map[] encodes a flag and a count, and some
      special values are also introduced for condition checks.
      
      Currently those macros are defined in a somewhat magic order, which
      makes it hard for readers to understand their exact meaning.
      
      This patch splits those macros into three categories:
      
          flag
          special values for the first swap_map
          special values for the continued swap_map
      
      May this help readers a little.
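
      As an illustration of the grouping, the macros look roughly like the
      following (the exact numeric values are quoted from memory of the
      era's include/linux/swap.h and should be treated as an assumption;
      the three-way grouping is the point of the patch):

```c
#include <assert.h>

/* Flag bits (combined with a count in swap_map[]). */
#define SWAP_HAS_CACHE  0x40    /* page is also in the swap cache */
#define COUNT_CONTINUED 0x80    /* count continued in a continuation page */

/* Special values for the first swap_map. */
#define SWAP_MAP_MAX    0x3e    /* maximum in-place count */
#define SWAP_MAP_BAD    0x3f    /* unusable (bad) swap slot */
#define SWAP_MAP_SHMEM  0xbf    /* owned by shmem/tmpfs */

/* Special value for a continued swap_map. */
#define SWAP_CONT_MAX   0x7f    /* max count in a continuation page */
```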
      
      [akpm@linux-foundation.org: tweak capitalization in comments]
      Signed-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200501015259.32237-1-richard.weiyang@gmail.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b4bb6bb
    • Huang Ying's avatar
      swap: try to scan more free slots even when fragmented · ed43af10
      Huang Ying authored
      Currently, the scalability of the swap code drops significantly when
      the swap device becomes fragmented, because swap slot allocation
      batching stops working.  To solve the problem, this patch tries to scan
      a few more swap slots, with strictly restricted effort, so that swap
      slot allocation can still be batched even when the swap device is
      fragmented.  Tests show that the benchmark score can increase by up to
      37.1% with the patch.  Details are as follows.
      
      The swap code has a per-cpu cache of swap slots.  These batch swap space
      allocations to improve swap subsystem scaling.  In the following code
      path,
      
        add_to_swap()
          get_swap_page()
            refill_swap_slots_cache()
              get_swap_pages()
              scan_swap_map_slots()
      
      scan_swap_map_slots() and get_swap_pages() can return multiple swap
      slots for each call.  These slots will be cached in the per-CPU swap
      slots cache, so that several following swap slot requests will be
      fulfilled there to avoid the lock contention in the lower level swap
      space allocation/freeing code path.
      
      But this only works when there are free swap clusters.  If a swap
      device becomes so fragmented that there are no free swap clusters,
      scan_swap_map_slots() and get_swap_pages() return only one swap slot
      per call in the above code path.  Effectively, this falls back to the
      situation before the swap slots cache was introduced, and the heavy
      contention on the swap related locks kills the scalability.
      
      Why does it work this way?  Because the swap device can be large and
      scanning for free swap slots can be quite time consuming, the
      conservative method was chosen to avoid spending too much time on the
      scan.
      
      In fact, this can be improved by scanning a few more free slots with
      strictly restricted effort, which is what this patch implements.  In
      scan_swap_map_slots(), after the first free swap slot is found, we try
      to scan a little further, but only while fewer than LATENCY_LIMIT
      slots have been scanned.  That is, the added scanning latency is
      strictly bounded.
      
      To test the patch, we have run 16-process pmbench memory benchmark on a
      2-socket server machine with 48 cores.  Multiple ram disks are
      configured as the swap devices.  The pmbench working-set size is much
      larger than the available memory so that swapping is triggered.  The
      memory read/write ratio is 80/20 and the accessing pattern is random, so
      the swap space becomes highly fragmented during the test.  In the
      original implementation, the lock contention on swap related locks is
      very heavy.  The perf profiling data of the lock contention code path is
      as follows:
      
       _raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap:             21.03
       _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:    1.92
       _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:      1.72
       _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:       0.69
      
      After applying this patch, it becomes:
      
       _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:    4.89
       _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:      3.85
       _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:       1.1
       _raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.do_swap_page: 0.88
      
      That is, the lock contention on the swap locks is essentially
      eliminated.
      
      The pmbench score increases by 37.1%.  The swapin throughput increases
      by 45.7%, from 2.02 GB/s to 2.94 GB/s, and the swapout throughput
      increases by 45.3%, from 2.04 GB/s to 2.97 GB/s.
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200427030023.264780-1-ying.huang@intel.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed43af10
    • Wei Yang's avatar
      mm/swapfile.c: omit a duplicate code by compare tmp and max first · 7b9e2de1
      Wei Yang authored
      There are two duplicated code blocks that handle the case when there is
      no available swap entry.  To avoid the duplication, compare tmp and max
      first and let the second guard do its job.
      
      No functional change is expected.
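
      The shape of the refactoring can be sketched in plain C (find_free,
      demo_map, and demo_first_free are hypothetical names; the real code is
      the tmp/max scan loop in mm/swapfile.c):

```c
#include <assert.h>
#include <stdbool.h>

/* Comparing tmp against max up front lets one guard cover both the
 * "range already exhausted" case and the "scan found nothing" case,
 * instead of duplicating the no-free-entry handling. */
bool find_free(const unsigned char *map, unsigned long *tmp, unsigned long max)
{
    while (*tmp < max && map[*tmp])
        (*tmp)++;
    return *tmp < max;          /* single guard: false => no available entry */
}

/* Demo: slot 2 is the first free (zero) entry. */
static const unsigned char demo_map[4] = { 1, 1, 0, 1 };

unsigned long demo_first_free(void)
{
    unsigned long tmp = 0;
    return find_free(demo_map, &tmp, 4) ? tmp : 4;
}
```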
      Signed-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200421213824.8099-3-richard.weiyang@gmail.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b9e2de1