1. 05 Jun, 2020 2 commits
    • ata/libata: Fix usage of page address by page_address in ata_scsi_mode_select_xlat function · f650ef61
      Ye Bin authored
      BUG: KASAN: use-after-free in ata_scsi_mode_select_xlat+0x10bd/0x10f0
      drivers/ata/libata-scsi.c:4045
      Read of size 1 at addr ffff88803b8cd003 by task syz-executor.6/12621
      
      CPU: 1 PID: 12621 Comm: syz-executor.6 Not tainted 4.19.95 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      1.10.2-1ubuntu1 04/01/2014
      Call Trace:
      __dump_stack lib/dump_stack.c:77 [inline]
      dump_stack+0xac/0xee lib/dump_stack.c:118
      print_address_description+0x60/0x223 mm/kasan/report.c:253
      kasan_report_error mm/kasan/report.c:351 [inline]
      kasan_report mm/kasan/report.c:409 [inline]
      kasan_report.cold+0xae/0x2d8 mm/kasan/report.c:393
      ata_scsi_mode_select_xlat+0x10bd/0x10f0 drivers/ata/libata-scsi.c:4045
      ata_scsi_translate+0x2da/0x680 drivers/ata/libata-scsi.c:2035
      __ata_scsi_queuecmd drivers/ata/libata-scsi.c:4360 [inline]
      ata_scsi_queuecmd+0x2e4/0x790 drivers/ata/libata-scsi.c:4409
      scsi_dispatch_cmd+0x2ee/0x6c0 drivers/scsi/scsi_lib.c:1867
      scsi_queue_rq+0xfd7/0x1990 drivers/scsi/scsi_lib.c:2170
      blk_mq_dispatch_rq_list+0x1e1/0x19a0 block/blk-mq.c:1186
      blk_mq_do_dispatch_sched+0x147/0x3d0 block/blk-mq-sched.c:108
      blk_mq_sched_dispatch_requests+0x427/0x680 block/blk-mq-sched.c:204
      __blk_mq_run_hw_queue+0xbc/0x200 block/blk-mq.c:1308
      __blk_mq_delay_run_hw_queue+0x3c0/0x460 block/blk-mq.c:1376
      blk_mq_run_hw_queue+0x152/0x310 block/blk-mq.c:1413
      blk_mq_sched_insert_request+0x337/0x6c0 block/blk-mq-sched.c:397
      blk_execute_rq_nowait+0x124/0x320 block/blk-exec.c:64
      blk_execute_rq+0xc5/0x112 block/blk-exec.c:101
      sg_scsi_ioctl+0x3b0/0x6a0 block/scsi_ioctl.c:507
      sg_ioctl+0xd37/0x23f0 drivers/scsi/sg.c:1106
      vfs_ioctl fs/ioctl.c:46 [inline]
      file_ioctl fs/ioctl.c:501 [inline]
      do_vfs_ioctl+0xae6/0x1030 fs/ioctl.c:688
      ksys_ioctl+0x76/0xa0 fs/ioctl.c:705
      __do_sys_ioctl fs/ioctl.c:712 [inline]
      __se_sys_ioctl fs/ioctl.c:710 [inline]
      __x64_sys_ioctl+0x6f/0xb0 fs/ioctl.c:710
      do_syscall_64+0xa0/0x2e0 arch/x86/entry/common.c:293
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x45c479
      Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89
      f7 48
      89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
      ff 0f
      83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fb0e9602c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 00007fb0e96036d4 RCX: 000000000045c479
      RDX: 0000000020000040 RSI: 0000000000000001 RDI: 0000000000000003
      RBP: 000000000076bfc0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 000000000000046d R14: 00000000004c6e1a R15: 000000000076bfcc
      
      Allocated by task 12577:
      set_track mm/kasan/kasan.c:460 [inline]
      kasan_kmalloc mm/kasan/kasan.c:553 [inline]
      kasan_kmalloc+0xbf/0xe0 mm/kasan/kasan.c:531
      __kmalloc+0xf3/0x1e0 mm/slub.c:3749
      kmalloc include/linux/slab.h:520 [inline]
      load_elf_phdrs+0x118/0x1b0 fs/binfmt_elf.c:441
      load_elf_binary+0x2de/0x4610 fs/binfmt_elf.c:737
      search_binary_handler fs/exec.c:1654 [inline]
      search_binary_handler+0x15c/0x4e0 fs/exec.c:1632
      exec_binprm fs/exec.c:1696 [inline]
      __do_execve_file.isra.0+0xf52/0x1a90 fs/exec.c:1820
      do_execveat_common fs/exec.c:1866 [inline]
      do_execve fs/exec.c:1883 [inline]
      __do_sys_execve fs/exec.c:1964 [inline]
      __se_sys_execve fs/exec.c:1959 [inline]
      __x64_sys_execve+0x8a/0xb0 fs/exec.c:1959
      do_syscall_64+0xa0/0x2e0 arch/x86/entry/common.c:293
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Freed by task 12577:
      set_track mm/kasan/kasan.c:460 [inline]
      __kasan_slab_free+0x129/0x170 mm/kasan/kasan.c:521
      slab_free_hook mm/slub.c:1370 [inline]
      slab_free_freelist_hook mm/slub.c:1397 [inline]
      slab_free mm/slub.c:2952 [inline]
      kfree+0x8b/0x1a0 mm/slub.c:3904
      load_elf_binary+0x1be7/0x4610 fs/binfmt_elf.c:1118
      search_binary_handler fs/exec.c:1654 [inline]
      search_binary_handler+0x15c/0x4e0 fs/exec.c:1632
      exec_binprm fs/exec.c:1696 [inline]
      __do_execve_file.isra.0+0xf52/0x1a90 fs/exec.c:1820
      do_execveat_common fs/exec.c:1866 [inline]
      do_execve fs/exec.c:1883 [inline]
      __do_sys_execve fs/exec.c:1964 [inline]
      __se_sys_execve fs/exec.c:1959 [inline]
      __x64_sys_execve+0x8a/0xb0 fs/exec.c:1959
      do_syscall_64+0xa0/0x2e0 arch/x86/entry/common.c:293
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The buggy address belongs to the object at ffff88803b8ccf00
      which belongs to the cache kmalloc-512 of size 512
      The buggy address is located 259 bytes inside of
      512-byte region [ffff88803b8ccf00, ffff88803b8cd100)
      The buggy address belongs to the page:
      page:ffffea0000ee3300 count:1 mapcount:0 mapping:ffff88806cc03080
      index:0xffff88803b8cc780 compound_mapcount: 0
      flags: 0x100000000008100(slab|head)
      raw: 0100000000008100 ffffea0001104080 0000000200000002 ffff88806cc03080
      raw: ffff88803b8cc780 00000000800c000b 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
      ffff88803b8ccf00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      ffff88803b8ccf80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff88803b8cd000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      ^
      ffff88803b8cd080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      ffff88803b8cd100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      
      See "https://www.lkml.org/lkml/2019/1/17/474" for a way to reproduce
      this error.
      
      The faulting statement is "bd_len = p[3];", where "p" is ffff88803b8cd000,
      an address that belongs to the cache kmalloc-512 of size 512. The value of
      "page_address(sg_page(scsi_sglist(scmd)))" may come from the "buffer" that
      the sg_scsi_ioctl function allocates with kzalloc, so "buffer" may not be
      page aligned.
      This also looks completely buggy on highmem systems and really needs to use a
      kmap_atomic.      --Christoph Hellwig
      To address the above bugs, Paolo Bonzini advised that it is simpler to just
      make a char array of size CACHE_MPAGE_LEN+8+8+4-2 (or just 64 to make it
      easy), use sg_copy_to_buffer to copy from the sglist into the buffer, and
      work there.
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
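The approach described in the commit above (copy the scattered data into a small local array, then parse the copy) can be modelled in userspace. This is only a sketch under stated assumptions: `struct seg`, `copy_segs_to_buffer()` and `read_bd_len()` are hypothetical stand-ins, not kernel code, and the real fix calls the kernel's `sg_copy_to_buffer()` on the command's scatterlist.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one scatterlist entry. */
struct seg {
    const unsigned char *data;
    size_t len;
};

/*
 * Copy at most buflen bytes from the segments into buf and return the
 * number of bytes copied (the same kind of result sg_copy_to_buffer()
 * reports in the kernel).
 */
static size_t copy_segs_to_buffer(const struct seg *segs, size_t nsegs,
                                  unsigned char *buf, size_t buflen)
{
    size_t copied = 0;

    for (size_t i = 0; i < nsegs && copied < buflen; i++) {
        size_t n = segs[i].len;

        if (n > buflen - copied)
            n = buflen - copied;
        memcpy(buf + copied, segs[i].data, n);
        copied += n;
    }
    return copied;
}

/*
 * Read byte 3 (the "bd_len = p[3]" access from the report) out of a
 * bounded local copy instead of dereferencing the original memory.
 * Returns -1 when fewer than four bytes are available.
 */
static int read_bd_len(const struct seg *segs, size_t nsegs)
{
    unsigned char buf[64] = {0};   /* 64 bytes, as suggested above */
    size_t len = copy_segs_to_buffer(segs, nsegs, buf, sizeof(buf));

    if (len < 4)
        return -1;
    return buf[3];
}
```

Because parsing only ever touches `buf`, neither the alignment of the source memory nor where it lives matters to the parser, which is the point of the suggested fix.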
    • sata_rcar: handle pm_runtime_get_sync failure cases · eea12388
      Navid Emamdoost authored
      Calling pm_runtime_get_sync increments the counter even in case of
      failure, causing incorrect ref count. Call pm_runtime_put if
      pm_runtime_get_sync fails.
      Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
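The reference-count imbalance described in the commit above can be illustrated with a small userspace mock. Everything here (`struct mock_dev`, `mock_get_sync()`, `mock_put()`, `mock_probe()`) is hypothetical and only imitates the behaviour the commit message describes for pm_runtime_get_sync; it is not the driver's code.

```c
#include <errno.h>

/* Hypothetical mock of a device with a runtime-PM usage counter. */
struct mock_dev {
    int usage_count;
    int resume_fails;   /* non-zero: simulate a failed resume */
};

/* As described above, the counter is incremented even on failure. */
static int mock_get_sync(struct mock_dev *dev)
{
    dev->usage_count++;
    return dev->resume_fails ? -EIO : 0;
}

static void mock_put(struct mock_dev *dev)
{
    dev->usage_count--;
}

/* Probe-style caller: drop the reference again on the error path. */
static int mock_probe(struct mock_dev *dev)
{
    int ret = mock_get_sync(dev);

    if (ret < 0) {
        mock_put(dev);  /* without this, usage_count would stay at 1 */
        return ret;
    }
    return 0;
}
```

After a failed `mock_probe()` the counter is back at zero; after a successful one the caller still holds its reference.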
  2. 04 Jun, 2020 38 commits
    • atomisp: avoid warning about unused function · 6929f71e
      Linus Torvalds authored
      The atomisp_mrfld_power() function isn't actually ever called, because
      the two call-sites have commented out the use because it breaks on some
      platforms.  That results in:
      
        drivers/staging/media/atomisp/pci/atomisp_v4l2.c:764:12: warning: ‘atomisp_mrfld_power’ defined but not used [-Wunused-function]
          764 | static int atomisp_mrfld_power(struct atomisp_device *isp, bool enable)
              |            ^~~~~~~~~~~~~~~~~~~
      
      during the build.
      
      Rather than commenting out the use entirely, just disable it
      semantically instead (using a "0 &&" construct), leaving the call in
      place from a syntax standpoint, and avoiding the warning.
      
      I really don't want my builds to have any warnings that can then hide
      real issues.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
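The "0 &&" construct from the commit above can be shown in isolation. This is a minimal sketch: `set_power()` and `resume_device()` are hypothetical names standing in for atomisp_mrfld_power() and one of its call sites.

```c
#include <stdbool.h>

static int power_calls;

/* Hypothetical helper standing in for atomisp_mrfld_power(). */
static int set_power(bool enable)
{
    power_calls++;
    return enable ? 0 : -1;
}

static int resume_device(void)
{
    /*
     * The call stays in the source, so the compiler still sees a use
     * of set_power() and type-checks it, but "0 &&" short-circuits
     * and the call is never executed.
     */
    if (0 && set_power(true))
        return -1;
    return 0;
}
```

Unlike commenting the call out, this keeps the static function referenced, so -Wunused-function has nothing to report.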
    • Merge tag 'media/v5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · a98f670e
      Linus Torvalds authored
      Pull media updates from Mauro Carvalho Chehab:
      
       - Media documentation is now split into admin-guide, driver-api and
         userspace-api books (a longstanding request from Jon);
      
       - The media Kconfig was reorganized, in order to make it easier to select
         drivers and their dependencies;
      
       - The testing drivers now have a separate directory;
      
       - added a new driver for Rockchip Video Decoder IP;
      
       - The atomisp staging driver was resurrected. It is meant to work with
         4 generations of cameras on Atom-based laptops, tablets and cell
         phones. So, it seems worth investing time to clean up this driver and
         get it into good shape.
      
       - Added some V4L2 core ancillary routines to help with h264 codecs;
      
       - Added an ov2740 image sensor driver;
      
       - The si2157 gained support for Analog TV, which, in turn, added
         support for some cx231xx and cx23885 boards to also support analog
         standards;
      
       - Added some V4L2 controls (V4L2_CID_CAMERA_ORIENTATION and
         V4L2_CID_CAMERA_SENSOR_ROTATION) to help identify where the camera
         is located on the device;
      
       - VIDIOC_ENUM_FMT was extended to support MC-centric devices;
      
       - Lots of driver improvements and cleanups.
      
      * tag 'media/v5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (503 commits)
        media: Documentation: media: Refer to mbus format documentation from CSI-2 docs
        media: s5k5baf: Replace zero-length array with flexible-array
        media: i2c: imx219: Drop <linux/clk-provider.h> and <linux/clkdev.h>
        media: i2c: Add ov2740 image sensor driver
        media: ov8856: Implement sensor module revision identification
        media: ov8856: Add devicetree support
        media: dt-bindings: ov8856: Document YAML bindings
        media: dvb-usb: Add Cinergy S2 PCIe Dual Port support
        media: dvbdev: Fix tuner->demod media controller link
        media: dt-bindings: phy: phy-rockchip-dphy-rx0: move rockchip dphy rx0 bindings out of staging
        media: staging: dt-bindings: phy-rockchip-dphy-rx0: remove non-used reg property
        media: atomisp: unify the version for isp2401 a0 and b0 versions
        media: atomisp: update TODO with the current data
        media: atomisp: adjust some code at sh_css that could be broken
        media: atomisp: don't produce errs for ignored IRQs
        media: atomisp: print IRQ when debugging
        media: atomisp: isp_mmu: don't use kmem_cache
        media: atomisp: add a notice about possible leak resources
        media: atomisp: disable the dynamic and reserved pools
        media: atomisp: turn on camera before setting it
        ...
    • Merge branch 'akpm' (patches from Andrew) · ee01c4d7
      Linus Torvalds authored
      Merge more updates from Andrew Morton:
       "More mm/ work, plenty more to come
      
        Subsystems affected by this patch series: slub, memcg, gup, kasan,
        pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
        thp, mmap, kconfig"
      
      * akpm: (131 commits)
        arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
        x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
        riscv: support DEBUG_WX
        mm: add DEBUG_WX support
        drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
        mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
        powerpc/mm: drop platform defined pmd_mknotpresent()
        mm: thp: don't need to drain lru cache when splitting and mlocking THP
        hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
        sparc32: register memory occupied by kernel as memblock.memory
        include/linux/memblock.h: fix minor typo and unclear comment
        mm, mempolicy: fix up gup usage in lookup_node
        tools/vm/page_owner_sort.c: filter out unneeded line
        mm: swap: memcg: fix memcg stats for huge pages
        mm: swap: fix vmstats for huge pages
        mm: vmscan: limit the range of LRU type balancing
        mm: vmscan: reclaim writepage is IO cost
        mm: vmscan: determine anon/file pressure balance at the reclaim root
        mm: balance LRU lists based on relative thrashing
        mm: only count actual rotations as LRU reclaim cost
        ...
    • arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined · 09587a09
      Zong Li authored
      Extract DEBUG_WX to mm/Kconfig.debug for shared use.  Change to use
      ARCH_HAS_DEBUG_WX instead of DEBUG_WX defined by arch port.
      Signed-off-by: Zong Li <zong.li@sifive.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/e19709e7576f65e303245fe520cad5f7bae72763.1587455584.git.zong.li@sifive.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined · 7e01ccb4
      Zong Li authored
      Extract DEBUG_WX to mm/Kconfig.debug for shared use.  Change to use
      ARCH_HAS_DEBUG_WX instead of DEBUG_WX defined by arch port.
      Signed-off-by: Zong Li <zong.li@sifive.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/430736828d149df3f5b462d291e845ec690e0141.1587455584.git.zong.li@sifive.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • riscv: support DEBUG_WX · b422d28b
      Zong Li authored
      Support DEBUG_WX to check whether there are mappings with write and execute
      permission at the same time.
      
      [akpm@linux-foundation.org: replace macros with C]
      Signed-off-by: Zong Li <zong.li@sifive.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/282e266311bced080bc6f7c255b92f87c1eb65d6.1587455584.git.zong.li@sifive.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
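The check that DEBUG_WX performs, as described in the commit above, amounts to looking for mappings that are writable and executable at the same time. A minimal userspace sketch of that test follows; `struct mapping`, the `MAPF_*` flags and `count_wx_mappings()` are hypothetical and do not reflect the kernel's page table dumper.

```c
#include <stddef.h>

/* Hypothetical permission flags for this sketch. */
#define MAPF_READ  0x1u
#define MAPF_WRITE 0x2u
#define MAPF_EXEC  0x4u

/* Hypothetical description of one mapped range. */
struct mapping {
    unsigned long start;
    unsigned long end;
    unsigned int prot;
};

/* Count the mappings that are both writable and executable. */
static size_t count_wx_mappings(const struct mapping *maps, size_t n)
{
    size_t wx = 0;

    for (size_t i = 0; i < n; i++) {
        if ((maps[i].prot & MAPF_WRITE) && (maps[i].prot & MAPF_EXEC))
            wx++;
    }
    return wx;
}
```

A non-zero count is the condition such a check would flag.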
    • mm: add DEBUG_WX support · 375d315c
      Zong Li authored
      Patch series "Extract DEBUG_WX to shared use".
      
      Some architectures support the DEBUG_WX function, and the code is
      verbatim the same across them, so extract it to mm/Kconfig.debug for
      shared use.
      
      The PPC and ARM ports don't support the generic page dumper yet, so only
      the x86 and arm64 ports are refined in this patch series.
      
      For the RISC-V port, DEBUG_WX support depends on other patches which
      have been merged already:
        - RISC-V page table dumper
        - Support strict kernel memory permissions for security
      
      This patch (of 4):
      
      Some architectures support the DEBUG_WX function, and the code is
      verbatim the same across them.  Extract it to mm/Kconfig.debug for
      shared use.
      
      [akpm@linux-foundation.org: reword text, per Will Deacon & Zong Li]
        Link: http://lkml.kernel.org/r/20200427194245.oxRJKj3fn%25akpm@linux-foundation.org
      [zong.li@sifive.com: remove the specific name of arm64]
        Link: http://lkml.kernel.org/r/3a6a92ecedc54e1d0fc941398e63d504c2cd5611.1589178399.git.zong.li@sifive.com
      [zong.li@sifive.com: add MMU dependency for DEBUG_WX]
        Link: http://lkml.kernel.org/r/4a674ac7863ff39ca91847b10e51209771f99416.1589178399.git.zong.li@sifive.com
      Suggested-by: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Zong Li <zong.li@sifive.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/cover.1587455584.git.zong.li@sifive.com
      Link: http://lkml.kernel.org/r/23980cd0f0e5d79e24a92169116407c75bcc650d.1587455584.git.zong.li@sifive.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup · 4fb6eabf
      Scott Cheloha authored
      Searching for a particular memory block by id is an O(n) operation because
      each memory block's underlying device is kept in an unsorted linked list
      on the subsystem bus.
      
      We can cut the lookup cost to O(log n) if we cache each memory block
      in an xarray.  This time complexity improvement is significant on
      systems with many memory blocks.  For example:
      
      1. A 128GB POWER9 VM with 256MB memblocks has 512 blocks.  With this
         change memory_dev_init() completes ~12ms faster and walk_memory_blocks()
         completes ~12ms faster.
      
      Before:
      [    0.005042] memory_dev_init: adding memory blocks
      [    0.021591] memory_dev_init: added memory blocks
      [    0.022699] walk_memory_blocks: walking memory blocks
      [    0.038730] walk_memory_blocks: walked memory blocks 0-511
      
      After:
      [    0.005057] memory_dev_init: adding memory blocks
      [    0.009415] memory_dev_init: added memory blocks
      [    0.010519] walk_memory_blocks: walking memory blocks
      [    0.014135] walk_memory_blocks: walked memory blocks 0-511
      
      2. A 256GB POWER9 LPAR with 256MB memblocks has 1024 blocks.  With
         this change memory_dev_init() completes ~88ms faster and
         walk_memory_blocks() completes ~87ms faster.
      
      Before:
      [    0.252246] memory_dev_init: adding memory blocks
      [    0.395469] memory_dev_init: added memory blocks
      [    0.409413] walk_memory_blocks: walking memory blocks
      [    0.433028] walk_memory_blocks: walked memory blocks 0-511
      [    0.433094] walk_memory_blocks: walking memory blocks
      [    0.500244] walk_memory_blocks: walked memory blocks 131072-131583
      
      After:
      [    0.245063] memory_dev_init: adding memory blocks
      [    0.299539] memory_dev_init: added memory blocks
      [    0.313609] walk_memory_blocks: walking memory blocks
      [    0.315287] walk_memory_blocks: walked memory blocks 0-511
      [    0.315349] walk_memory_blocks: walking memory blocks
      [    0.316988] walk_memory_blocks: walked memory blocks 131072-131583
      
      3. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks.  With
         this change we complete memory_dev_init() ~37 minutes faster and
         walk_memory_blocks() at least ~30 minutes faster.  The exact timing
         for walk_memory_blocks() is missing, though I observed that the
         soft lockups in walk_memory_blocks() disappeared with the change,
         suggesting that lower bound.
      
      Before:
      [   13.703907] memory_dev_init: adding blocks
      [ 2287.406099] memory_dev_init: added all blocks
      [ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      
      After:
      [   13.696898] memory_dev_init: adding blocks
      [   15.660035] memory_dev_init: added all blocks
      (the walk_memory_blocks traces disappear)
      
      There should be no significant negative impact for machines with few
      memory blocks.  A sparse xarray has a small footprint and an O(log n)
      lookup is negligibly slower than an O(n) lookup for only the smallest
      number of memory blocks.
      
      1. A 16GB x86 machine with 128MB memblocks has 132 blocks.  With this
         change memory_dev_init() completes ~300us faster and walk_memory_blocks()
         completes no faster or slower.  The improvement is pretty close to noise.
      
      Before:
      [    0.224752] memory_dev_init: adding memory blocks
      [    0.227116] memory_dev_init: added memory blocks
      [    0.227183] walk_memory_blocks: walking memory blocks
      [    0.227183] walk_memory_blocks: walked memory blocks 0-131
      
      After:
      [    0.224911] memory_dev_init: adding memory blocks
      [    0.226935] memory_dev_init: added memory blocks
      [    0.227089] walk_memory_blocks: walking memory blocks
      [    0.227089] walk_memory_blocks: walked memory blocks 0-131
      
      [david@redhat.com: document the locking]
        Link: http://lkml.kernel.org/r/bc21eec6-7251-4c91-2f57-9a0671f8d414@redhat.com
      Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rick Lindsley <ricklind@linux.vnet.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Link: http://lkml.kernel.org/r/20200121231028.13699-1-cheloha@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
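The difference between the two lookup strategies in the commit above can be sketched in userspace. Assumptions: `struct mem_block`, `find_in_list()` and the two-level `struct blk_index` are hypothetical. The index is a toy stand-in with a fixed depth and a 4096-id range, not the kernel's xarray.

```c
#include <stddef.h>
#include <stdlib.h>

struct mem_block {
    unsigned long id;
    struct mem_block *next;     /* linkage in the unsorted list */
};

/* O(n): walk the list until the id matches. */
static struct mem_block *find_in_list(struct mem_block *head, unsigned long id)
{
    for (; head; head = head->next)
        if (head->id == id)
            return head;
    return NULL;
}

/* Hypothetical two-level index covering ids 0..4095. */
#define SLOT_BITS 6
#define SLOTS (1UL << SLOT_BITS)

struct blk_leaf { struct mem_block *slot[SLOTS]; };
struct blk_index { struct blk_leaf *leaf[SLOTS]; };

/* Store a block under its id; leaves are allocated on first use. */
static int index_store(struct blk_index *idx, struct mem_block *blk)
{
    unsigned long hi = blk->id >> SLOT_BITS, lo = blk->id & (SLOTS - 1);

    if (hi >= SLOTS)
        return -1;
    if (!idx->leaf[hi]) {
        idx->leaf[hi] = calloc(1, sizeof(struct blk_leaf));
        if (!idx->leaf[hi])
            return -1;
    }
    idx->leaf[hi]->slot[lo] = blk;
    return 0;
}

/* Lookup cost depends on the tree depth, not on the number of blocks. */
static struct mem_block *index_load(const struct blk_index *idx, unsigned long id)
{
    unsigned long hi = id >> SLOT_BITS, lo = id & (SLOTS - 1);

    if (hi >= SLOTS || !idx->leaf[hi])
        return NULL;
    return idx->leaf[hi]->slot[lo];
}
```

Leaves that are never populated are never allocated, which is why a sparse index of this kind stays small on machines with few memory blocks.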
    • mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid() · 86ec2da0
      Anshuman Khandual authored
      pmd_present() is expected to test positive after pmd_mknotpresent(), as
      the PMD entry still points to a valid huge page in memory.  The entry is
      only invalidated from the MMU's perspective and still holds on to the
      valid huge page referred to by pmd_page(), yet the name
      pmd_mknotpresent() implies that the pmd_present() test is cleared as
      well.  This creates the following situation, which is counterintuitive.
      
      [pmd_present(pmd_mknotpresent(pmd)) = true]
      
      This renames pmd_mknotpresent() as pmd_mkinvalid() reflecting the helper's
      functionality more accurately while changing the above mentioned situation
      as follows.  This does not create any functional change.
      
      [pmd_present(pmd_mkinvalid(pmd)) = true]
      
      This is not applicable for platforms that define their own pmdp_invalidate() via
      __HAVE_ARCH_PMDP_INVALIDATE.  Suggestion for renaming came during a
      previous discussion here.
      
      https://patchwork.kernel.org/patch/11019637/
      
      [anshuman.khandual@arm.com: change pmd_mknotvalid() to pmd_mkinvalid() per Will]
        Link: http://lkml.kernel.org/r/1587520326-10099-3-git-send-email-anshuman.khandual@arm.com
      Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Link: http://lkml.kernel.org/r/1584680057-13753-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • powerpc/mm: drop platform defined pmd_mknotpresent() · 124cb3a6
      Anshuman Khandual authored
      Patch series "mm/thp: Rename pmd_mknotpresent() as pmd_mknotvalid()", v2.
      
      This series renames pmd_mknotpresent() as pmd_mknotvalid().  Before that
      it drops an existing pmd_mknotpresent() definition from powerpc platform
      which was never required as it defines its pmdp_invalidate() through
      subscribing __HAVE_ARCH_PMDP_INVALIDATE.  This does not create any
      functional change.
      
      This rename was suggested by Catalin during a previous discussion while we
      were trying to change the THP helpers on arm64 platform for migration.
      
      https://patchwork.kernel.org/patch/11019637/
      
      This patch (of 2):
      
      Platform needs to define pmd_mknotpresent() for generic pmdp_invalidate()
      only when __HAVE_ARCH_PMDP_INVALIDATE is not subscribed.  Otherwise
      platform specific pmd_mknotpresent() is not required.  Hence just drop it.
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1587520326-10099-1-git-send-email-anshuman.khandual@arm.com
      Link: http://lkml.kernel.org/r/1584680057-13753-1-git-send-email-anshuman.khandual@arm.com
      Link: http://lkml.kernel.org/r/1584680057-13753-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: don't need to drain lru cache when splitting and mlocking THP · 67e4eb07
      Yang Shi authored
      Since commit 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival") THP would not stay in pagevec anymore.  So the optimization made
      by commit d9654322 ("thp: increase split_huge_page() success rate")
      doesn't make sense anymore, which tries to unpin munlocked THPs from
      pagevec by draining pagevec.
      
      Draining lru cache before isolating THP in mlock path is also unnecessary.
      b676b293 ("mm, thp: fix mapped pages avoiding unevictable list on
      mlock") added it and 9a73f61b ("thp, mlock: do not mlock PTE-mapped
      file huge pages") accidentally carried it over after the above
      optimization went in.
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Link: http://lkml.kernel.org/r/1585946493-7531-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs · 88590253
      Shijie Hu authored
      For a 32-bit program running on the arm64 architecture, when the address
      space below mmap base is completely exhausted, shmat() for huge pages will
      return ENOMEM, but shmat() for normal pages can still succeed in no-legacy
      mode.  This seems unfair.
      
      For normal pages, the calling trace of get_unmapped_area() is:
      
      	=> mm->get_unmapped_area()
      	if on legacy mode,
      		=> arch_get_unmapped_area()
      			=> vm_unmapped_area()
      	if on no-legacy mode,
      		=> arch_get_unmapped_area_topdown()
      			=> vm_unmapped_area()
      
      For huge pages, the calling trace of get_unmapped_area() is:
      
      	=> file->f_op->get_unmapped_area()
      		=> hugetlb_get_unmapped_area()
      			=> vm_unmapped_area()
      
      To solve this issue, we only need to make hugetlb_get_unmapped_area() take
      the same way as mm->get_unmapped_area().  Add *bottomup() and *topdown()
      for hugetlbfs, and check current mm->get_unmapped_area() to decide which
      one to use.  If mm->get_unmapped_area is equal to
      arch_get_unmapped_area_topdown(), hugetlb_get_unmapped_area() calls
      topdown routine, otherwise calls bottomup routine.
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Shijie Hu <hushijie3@huawei.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Will Deacon <will@kernel.org>
      Cc: Xiaoming Ni <nixiaoming@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: yangerkun <yangerkun@huawei.com>
      Cc: ChenGang <cg.chen@huawei.com>
      Cc: Chen Jie <chenjie6@huawei.com>
      Link: http://lkml.kernel.org/r/20200518065338.113664-1-hushijie3@huawei.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      88590253
    • Mike Rapoport's avatar
      sparc32: register memory occupied by kernel as memblock.memory · 4360dfa9
      Mike Rapoport authored
      sparc32 never registered the memory occupied by the kernel image with
      memblock_add() and it only reserved this memory with memblock_reserve().
      
      With openbios as system firmware, the memory occupied by the kernel is
      reserved in openbios and removed from mem.available.  The prom setup code
      in the kernel uses mem.available to set up the memory banks and
      essentially there is a hole for the memory occupied by the kernel image.
      
      Later in bootmem_init() this memory is memblock_reserve()d.
      
      Up until recently, memmap initialization would call __init_single_page()
      for the pages in that hole, then free_low_memory_core_early() would mark
      them as reserved and everything would be OK.
      
      After the change in memmap initialization introduced by the commit "mm:
      memmap_init: iterate over memblock regions rather that check each PFN",
      the hole is skipped and the page structs for it are not initialized.  And
      when they are passed from memblock to page allocator as reserved, the
      latter gets confused.
      
      Simply registering the memory occupied by the kernel with memblock_add()
      resolves this issue.
      
      Tested on qemu-system-sparc with Debian Etch [1] userspace.
      
      [1] https://people.debian.org/~aurel32/qemu/sparc/debian_etch_sparc_small.qcow2
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Link: https://lkml.kernel.org/r/20200517000050.GA87467@roeck-us.net/
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4360dfa9
    • chenqiwu's avatar
      include/linux/memblock.h: fix minor typo and unclear comment · 8cbd54f5
      chenqiwu authored
      Fix a minor typo "usabe->usable" in the current description of the member
      variable "memory" in struct memblock.
      
      BTW, I think the description of the member variable "base" in struct
      memblock_type as the physical address of the memory region is unclear.
      Changing it to the base address of the region is clearer, since the
      variable is already declared as phys_addr_t.
      Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Link: http://lkml.kernel.org/r/1588846952-32166-1-git-send-email-qiwuchen55@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cbd54f5
    • Michal Hocko's avatar
      mm, mempolicy: fix up gup usage in lookup_node · 2d3a36a4
      Michal Hocko authored
      ba841078 ("mm/mempolicy: Allow lookup_node() to handle fatal signal")
      added a special case for a 0 return value because that was a possible
      gup return value when interrupted by a fatal signal.  This has been fixed
      by ae46d2aa ("mm/gup: Let __get_user_pages_locked() return -EINTR for
      fatal signal") in the meantime, so ba841078 can be reverted.
      
      This patch however doesn't go all the way to revert it because the check
      for 0 is wrong and confusing here.  Firstly it is inherently unsafe to
      access the page when get_user_pages_locked returns 0 (aka no page
      returned).
      
      Fortunately this will not happen because get_user_pages_locked will not
      return 0 when nr_pages > 0 unless FOLL_NOWAIT is specified which is not
      the case here.  Document this potential error code in gup code while we
      are at it.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Xu <peterx@redhat.com>
      Link: http://lkml.kernel.org/r/20200421071026.18394-1-mhocko@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d3a36a4
    • Changhee Han's avatar
      tools/vm/page_owner_sort.c: filter out unneeded line · 5b94ce2f
      Changhee Han authored
      To see a sorted result from page_owner, there must be a tiresome
      preprocessing step before running page_owner_sort.  This patch simply
      filters out lines which start with "PFN" while reading the page owner
      report.
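      
      As a rough illustration of the filtering step described above (a sketch,
      not the code from this patch), a reader loop can skip any line that
      starts with "PFN" and pass the rest on.  The names is_pfn_line() and
      filter_pfn_lines() and the 4096-byte line buffer are assumptions made
      for this example:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the line starts with "PFN" and should be dropped, else 0. */
static int is_pfn_line(const char *line)
{
	assert(line != NULL);
	return strncmp(line, "PFN", 3) == 0;
}

/*
 * Copy lines from 'in' to 'out', dropping those that start with "PFN".
 * Returns the number of lines kept.
 *
 * Assumption for this sketch: no input line is longer than the buffer.
 * A longer line would be read in several pieces, and each piece would be
 * tested separately.
 */
static size_t filter_pfn_lines(FILE *in, FILE *out)
{
	char buf[4096];
	size_t kept = 0;

	while (fgets(buf, sizeof(buf), in)) {
		if (is_pfn_line(buf))
			continue;
		fputs(buf, out);
		kept++;
	}
	return kept;
}
```

      Calling filter_pfn_lines(stdin, stdout) would turn this into a
      standalone preprocessing filter, which is the manual step the patch
      folds into page_owner_sort itself.
      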
      Signed-off-by: Changhee Han <ch0.han@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Link: http://lkml.kernel.org/r/20200429052940.16968-1-ch0.han@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b94ce2f
    • Shakeel Butt's avatar
      mm: swap: memcg: fix memcg stats for huge pages · 21e330fc
      Shakeel Butt authored
      The commit 2262185c ("mm: per-cgroup memory reclaim stats") added
      PGLAZYFREE, PGACTIVATE & PGDEACTIVATE stats for cgroups but missed a
      couple of places, and PGLAZYFREE missed huge page handling.  Fix that.
      Also, for PGLAZYFREE use the irq-unsafe function to update the stat, as
      irqs are already disabled.
      
      Fixes: 2262185c ("mm: per-cgroup memory reclaim stats")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/20200527182947.251343-1-shakeelb@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21e330fc
    • Shakeel Butt's avatar
      mm: swap: fix vmstats for huge pages · 5d91f31f
      Shakeel Butt authored
      Many of the callbacks called by pagevec_lru_move_fn() do not correctly
      update the vmstats for huge pages.  Fix that.  Also make
      __pagevec_lru_add_fn() use the irq-unsafe alternative to update the
      stat, as irqs are already disabled.
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/20200527182916.249910-1-shakeelb@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d91f31f
    • Johannes Weiner's avatar
      mm: vmscan: limit the range of LRU type balancing · d483a5dd
      Johannes Weiner authored
      When LRU cost only shows up on one list, we abruptly stop scanning that
      list altogether.  That's an extreme reaction: by the time the other list
      starts thrashing and the pendulum swings back, we may have no recent age
      information on the first list anymore, and we could have significant
      latencies until the scanner has caught up.
      
      Soften this change in the feedback system by ensuring that no list
      receives less than a third of overall pressure, and only distribute the
      other 66% according to LRU cost.  This ensures that we maintain a minimum
      rate of aging on the entire workingset while it's being pressured, while
      still allowing a generous rate of convergence when the relative sizes of
      the lists need to adjust.
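      
      One way to obtain such a floor, shown here only as a sketch and not as
      the code from this patch, is to add the combined cost to each list's own
      cost before taking the inverse ratio.  Each list's share of the pressure
      then stays between one third and two thirds.  The helper
      balance_pressure(), the struct scan_split type and the integer-percent
      output are assumptions for illustration, and swappiness is left out for
      brevity:

```c
#include <assert.h>
#include <stdint.h>

/* Share of scan pressure for each LRU type, in whole percent. */
struct scan_split {
	uint64_t anon_pct;
	uint64_t file_pct;
};

/*
 * Split scan pressure between anon and file so that pressure on a list is
 * inversely proportional to its observed cost, while no list drops below
 * one third of the total.
 */
static struct scan_split balance_pressure(uint64_t anon_cost,
					  uint64_t file_cost)
{
	struct scan_split s;
	uint64_t total, anon_adj, file_adj;

	/* Keep the inputs small enough that the arithmetic cannot overflow. */
	assert(anon_cost <= UINT32_MAX && file_cost <= UINT32_MAX);

	total = anon_cost + file_cost;
	if (total == 0) {
		/* No cost observed on either list: split evenly. */
		s.anon_pct = 50;
		s.file_pct = 50;
		return s;
	}

	/* Adding the total to both sides bounds the ratio to 1:2 .. 2:1. */
	anon_adj = total + anon_cost;
	file_adj = total + file_cost;

	/* Inverse proportionality: anon's share grows with file's cost. */
	s.anon_pct = 100 * file_adj / (anon_adj + file_adj);
	s.file_pct = 100 - s.anon_pct;
	return s;
}
```

      With all observed cost on the anon list, balance_pressure(1000, 0)
      still leaves anon with 33% of the pressure, and with equal cost on both
      lists the split is 50/50.
      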
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-15-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d483a5dd
    • Johannes Weiner's avatar
      mm: vmscan: reclaim writepage is IO cost · 96f8bf4f
      Johannes Weiner authored
      The VM tries to balance reclaim pressure between anon and file so as to
      reduce the amount of IO incurred due to the memory shortage.  It already
      counts refaults and swapins, but in addition it should also count
      writepage calls during reclaim.
      
      For swap, this is obvious: it's IO that wouldn't have occurred if the
      anonymous memory hadn't been under memory pressure.  From a relative
      balancing point of view this makes sense as well: even if anon is cold and
      reclaimable, a cache that isn't thrashing may have equally cold pages that
      don't require IO to reclaim.
      
      For file writeback, it's trickier: some of the reclaim writepage IO would
      have likely occurred anyway due to dirty expiration.  But not all of it -
      premature writeback reduces batching and generates additional writes.
      Since the flushers are already woken up by the time the VM starts writing
      cache pages one by one, let's assume that we're likely causing writes that
      wouldn't have happened without memory pressure.  In addition, the per-page
      cost of IO would have probably been much cheaper if written in larger
      batches from the flusher thread rather than the single-page-writes from
      kswapd.
      
      For our purposes - getting the trend right to accelerate convergence on a
      stable state that doesn't require paging at all - this is sufficiently
      accurate.  If we later wanted to optimize for sustained thrashing, we can
      still refine the measurements.
      
      Count all writepage calls from kswapd as IO cost toward the LRU that the
      page belongs to.
      
      Why do this dynamically?  Don't we know in advance that anon pages require
      IO to reclaim, and so could build in a static bias?
      
      First, scanning is not the same as reclaiming.  If all the anon pages are
      referenced, we may not swap for a while just because we're scanning the
      anon list.  During this time, however, it's important that we age
      anonymous memory and the page cache at the same rate so that their
      hot-cold gradients are comparable.  Everything else being equal, we still
      want to reclaim the coldest memory overall.
      
      Second, we keep copies in swap unless the page changes.  If there is
      swap-backed data that's mostly read (tmpfs file) and has been swapped out
      before, we can reclaim it without incurring additional IO.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96f8bf4f
    • Johannes Weiner's avatar
      mm: vmscan: determine anon/file pressure balance at the reclaim root · 7cf111bc
      Johannes Weiner authored
      We split the LRU lists into anon and file, and we rebalance the scan
      pressure between them when one of them begins thrashing: if the file cache
      experiences workingset refaults, we increase the pressure on anonymous
      pages; if the workload is stalled on swapins, we increase the pressure on
      the file cache instead.
      
      With cgroups and their nested LRU lists, we currently don't do this
      correctly.  While recursive cgroup reclaim establishes a relative LRU
      order among the pages of all involved cgroups, LRU pressure balancing is
      done on an individual cgroup LRU level.  As a result, when one cgroup is
      thrashing on the filesystem cache while a sibling may have cold anonymous
      pages, pressure doesn't get equalized between them.
      
      This patch moves the LRU balancing decision to the root of reclaim - the
      same level where the LRU order is established.
      
      It does this by tracking LRU cost recursively, so that every level of the
      cgroup tree knows the aggregate LRU cost of all memory within its domain.
      When the page scanner calculates the scan balance for any given individual
      cgroup's LRU list, it uses the values from the ancestor cgroup that
      initiated the reclaim cycle.
      
      If one sibling is then thrashing on the cache, it will tip the pressure
      balance inside its ancestors, and the next hierarchical reclaim iteration
      will go more after the anon pages in the tree.
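      
      The recursive tracking can be pictured with a small userspace model.
      This is a sketch under stated assumptions, not the kernel code: struct
      cgroup_model, note_cost() and file_is_costlier() are hypothetical names,
      and cost is reduced to two plain counters per cgroup:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a cgroup with per-type LRU cost counters. */
struct cgroup_model {
	struct cgroup_model *parent;	/* NULL for the root of the tree */
	unsigned long anon_cost;
	unsigned long file_cost;
};

/*
 * Charge a cost to 'cg' and to every ancestor, so that each level holds
 * the aggregate cost of all memory within its subtree.
 */
static void note_cost(struct cgroup_model *cg, int file,
		      unsigned long nr_pages)
{
	for (; cg; cg = cg->parent) {
		if (file)
			cg->file_cost += nr_pages;
		else
			cg->anon_cost += nr_pages;
	}
}

/*
 * When balancing any cgroup's lists, consult the cgroup that initiated
 * the reclaim cycle rather than the cgroup being scanned.
 */
static int file_is_costlier(const struct cgroup_model *reclaim_root)
{
	assert(reclaim_root != NULL);
	return reclaim_root->file_cost > reclaim_root->anon_cost;
}
```

      If one child is charged file cost and its sibling is charged nothing,
      the parent's counters still show file as the costlier type, so a
      reclaim cycle started at the parent would lean toward anon pages in
      both children.
      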
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-13-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cf111bc
    • Johannes Weiner's avatar
      mm: balance LRU lists based on relative thrashing · 314b57fb
      Johannes Weiner authored
      Since the LRUs were split into anon and file lists, the VM has been
      balancing between page cache and anonymous pages based on per-list ratios
      of scanned vs.  rotated pages.  In most cases that tips page reclaim
      towards the list that is easier to reclaim and has the fewest actively
      used pages, but there are a few problems with it:
      
      1. Refaults and LRU rotations are weighted the same way, even though
         one costs IO and the other costs a bit of CPU.
      
      2. The less we scan an LRU list based on already observed rotations,
         the more we increase the sampling interval for new references, and
         rotations become even more likely on that list. This can enter a
         death spiral in which we stop looking at one list completely until
         the other one is all but annihilated by page reclaim.
      
      Since commit a528910e ("mm: thrash detection-based file cache sizing")
      we have refault detection for the page cache.  Along with swapin events,
      they are good indicators of when the file or anon list, respectively, is
      too small for its workingset and needs to grow.
      
      For example, if the page cache is thrashing, the cache pages need more
      time in memory, while there may be colder pages on the anonymous list.
      Likewise, if swapped pages are faulting back in, it indicates that we
      reclaim anonymous pages too aggressively and should back off.
      
      Replace LRU rotations with refaults and swapins as the basis for relative
      reclaim cost of the two LRUs.  This will have the VM target list balances
      that incur the least amount of IO on aggregate.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      314b57fb
    • Johannes Weiner's avatar
      mm: only count actual rotations as LRU reclaim cost · 264e90cc
      Johannes Weiner authored
      When shrinking the active file list we rotate referenced pages only when
      they're in an executable mapping.  The others get deactivated.  When it
      comes to balancing scan pressure, though, we count all referenced pages as
      rotated, even the deactivated ones.  Yet they do not carry the same cost
      to the system: the deactivated page *might* refault later on, but the
      deactivation is tangible progress toward freeing pages; rotations on the
      other hand cost time and effort without getting any closer to freeing
      memory.
      
      Don't treat both events as equal.  The following patch will hook up LRU
      balancing to cache and anon refaults, which are a much more concrete cost
      signal for reclaiming one list over the other.  Thus, remove the maybe-IO
      cost bias from page references, and only note the CPU cost for actual
      rotations that prevent the pages from getting reclaimed.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-11-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      264e90cc
    • Johannes Weiner's avatar
      mm: deactivations shouldn't bias the LRU balance · fbbb602e
      Johannes Weiner authored
      Operations like MADV_FREE, FADV_DONTNEED etc.  currently move any affected
      active pages to the inactive list to accelerate their reclaim (good) but
      also steer page reclaim toward that LRU type, or away from the other
      (bad).
      
      The reason why this is undesirable is that such operations are not part of
      the regular page aging cycle, and rather a fluke that doesn't say much
      about the remaining pages on that list; they might all be in heavy use,
      and once the chunk of easy victims has been purged, the VM continues to
      apply elevated pressure on those remaining hot pages.  The other LRU,
      meanwhile, might have easily reclaimable pages, and there was never a need
      to steer away from it in the first place.
      
      As the previous patch outlined, we should focus on recording actually
      observed cost to steer the balance rather than speculating about the
      potential value of one LRU list over the other.  In that spirit, leave
      explicitly deactivated pages to the LRU algorithm to pick up, and let
      rotations decide which list is the easiest to reclaim.
      
      [cai@lca.pw: fix set-but-not-used warning]
        Link: http://lkml.kernel.org/r/20200522133335.GA624@Qians-MacBook-Air.local
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200520232525.798933-10-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fbbb602e
    • Johannes Weiner's avatar
      mm: base LRU balancing on an explicit cost model · 1431d4d1
      Johannes Weiner authored
      Currently, scan pressure between the anon and file LRU lists is balanced
      based on a mixture of reclaim efficiency and a somewhat vague notion of
      "value" of having certain pages in memory over others.  That concept of
      value is problematic, because it has caused us to count any event that
      remotely makes one LRU list more or less preferable for reclaim, even
      when these events are not directly comparable and impose very different
      costs on the system.  One example is referenced file pages that we still
      deactivate and referenced anonymous pages that we actually rotate back to
      the head of the list.
      
      There is also conceptual overlap with the LRU algorithm itself.  By
      rotating recently used pages instead of reclaiming them, the algorithm
      already biases the applied scan pressure based on page value.  Thus, when
      rebalancing scan pressure due to rotations, we should think of reclaim
      cost, and leave assessing the page value to the LRU algorithm.
      
      Lastly, considering both value-increasing as well as value-decreasing
      events can sometimes cause the same type of event to be counted twice,
      i.e.  how rotating a page increases the LRU value, while reclaiming it
      successfully decreases the value.  In itself this will balance out fine,
      but it quietly skews the impact of events that are only recorded once.
      
      The abstract metric of "value", the murky relationship with the LRU
      algorithm, and accounting both negative and positive events make the
      current pressure balancing model hard to reason about and modify.
      
      This patch switches to a balancing model of accounting the concrete,
      actually observed cost of reclaiming one LRU over another.  For now, that
      cost includes pages that are scanned but rotated back to the list head.
      Subsequent patches will add consideration for IO caused by refaulting of
      recently evicted pages.
      
      Replace struct zone_reclaim_stat with two cost counters in the lruvec, and
      make everything that affects cost go through a new lru_note_cost()
      function.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-9-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1431d4d1
    • Johannes Weiner's avatar
      mm: vmscan: drop unnecessary div0 avoidance rounding in get_scan_count() · a4fe1631
      Johannes Weiner authored
      When we calculate the relative scan pressure between the anon and file LRU
      lists, we have to assume that reclaim_stat can contain zeroes.  To avoid
      div0 crashes, we add 1 to all denominators like so:
      
              anon_prio = swappiness;
              file_prio = 200 - anon_prio;
      
              [...]
      
              /*
               * The amount of pressure on anon vs file pages is inversely
               * proportional to the fraction of recently scanned pages on
               * each list that were recently referenced and in active use.
               */
              ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
              ap /= reclaim_stat->recent_rotated[0] + 1;
      
              fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
              fp /= reclaim_stat->recent_rotated[1] + 1;
              spin_unlock_irq(&pgdat->lru_lock);
      
              fraction[0] = ap;
              fraction[1] = fp;
              denominator = ap + fp + 1;
      
      While reclaim_stat can contain 0, it's not actually possible for ap + fp
      to be 0.  One of anon_prio or file_prio could be zero, but they must still
      add up to 200.  And the reclaim_stat fraction, due to the +1 in there, is
      always at least 1.  So if one of the two numerators is 0, the other one
      can't be.  ap + fp is always at least 1.  Drop the + 1.
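      
      The argument can be checked with a small userspace model of the
      calculation quoted above.  This is only a sketch: struct scan_fraction
      and scan_fractions() are hypothetical names, the recent_scanned and
      recent_rotated counters are passed in as plain arrays, and the model
      assumes rotated <= scanned for each list, which is what keeps each
      scanned/rotated fraction at 1 or above:

```c
#include <assert.h>

/* Result of the pressure calculation, without the extra + 1. */
struct scan_fraction {
	unsigned long ap;
	unsigned long fp;
	unsigned long denominator;
};

/*
 * Index 0 is the anon list and index 1 is the file list, as in the
 * snippet quoted in the commit message.
 */
static struct scan_fraction scan_fractions(unsigned long swappiness,
					   const unsigned long scanned[2],
					   const unsigned long rotated[2])
{
	struct scan_fraction f;
	unsigned long anon_prio, file_prio;

	/* Assumptions of this model, stated as preconditions. */
	assert(swappiness <= 200);
	assert(rotated[0] <= scanned[0] && rotated[1] <= scanned[1]);

	anon_prio = swappiness;
	file_prio = 200 - anon_prio;

	/* The + 1 on each counter keeps the per-list divisions safe. */
	f.ap = anon_prio * (scanned[0] + 1);
	f.ap /= rotated[0] + 1;

	f.fp = file_prio * (scanned[1] + 1);
	f.fp /= rotated[1] + 1;

	/* No + 1 here: the two priorities add up to 200, so ap + fp > 0. */
	f.denominator = f.ap + f.fp;
	assert(f.denominator > 0);
	return f;
}
```

      Even in the extreme cases, swappiness 0 or 200 with all counters at
      zero, one of ap and fp is 200 and the denominator stays nonzero.
      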
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-8-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4fe1631
    • Johannes Weiner's avatar
      mm: remove use-once cache bias from LRU balancing · 96824687
      Johannes Weiner authored
      When the splitlru patches divided page cache and swap-backed pages into
      separate LRU lists, the pressure balance between the lists was biased to
      account for the fact that streaming IO can cause memory pressure with a
      flood of pages that are used only once.  New page cache additions would
      tip the balance toward the file LRU, and repeat access would neutralize
      that bias again.  This ensured that page reclaim would always go for
      used-once cache first.
      
      Since e9868505 ("mm,vmscan: only evict file pages when we have
      plenty"), page reclaim generally skips over swap-backed memory entirely as
      long as there is used-once cache present, and will apply the LRU balancing
      when only repeatedly accessed cache pages are left - at which point the
      previous use-once bias will have been neutralized.  This makes the
      use-once cache balancing bias unnecessary.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-7-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96824687
    • Johannes Weiner's avatar
      mm: workingset: let cache workingset challenge anon · 34e58cac
      Johannes Weiner authored
      We activate cache refaults with reuse distances in pages smaller than the
      size of the total cache.  This allows new pages with competitive access
      frequencies to establish themselves, as well as challenge and potentially
      displace pages on the active list that have gone cold.
      
      However, that assumes that active cache can only replace other active
      cache in a competition for the hottest memory.  This is not a great
      default assumption.  The page cache might be thrashing while there are
      enough completely cold and unused anonymous pages sitting around that we'd
      only have to write to swap once to stop all IO from the cache.
      
      Activate cache refaults when their reuse distance in pages is smaller than
      the total userspace workingset, including anonymous pages.
      
      Reclaim can still decide how to balance pressure among the two LRUs
      depending on the IO situation.  Rotational drives will prefer avoiding
      random IO from swap and go harder after cache.  But fundamentally, hot
      cache should be able to compete with anon pages for a place in RAM.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-6-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34e58cac
    • Johannes Weiner's avatar
      mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() · 6058eaec
      Johannes Weiner authored
      They're the same function, and for the purpose of all callers they are
      equivalent to lru_cache_add().
      
      [akpm@linux-foundation.org: fix it for local_lock changes]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6058eaec
    • Johannes Weiner's avatar
      mm: allow swappiness that prefers reclaiming anon over the file workingset · c843966c
      Johannes Weiner authored
      With the advent of fast random IO devices (SSDs, PMEM) and in-memory swap
      devices such as zswap, it's possible for swap to be much faster than
      filesystems, and for swapping to be preferable over thrashing filesystem
      caches.
      
      Allow setting swappiness - which defines the rough relative IO cost of
      cache misses between page cache and swap-backed pages - to reflect such
      situations by making the swap-preferred range configurable.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-4-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c843966c
    • Johannes Weiner's avatar
      mm: keep separate anon and file statistics on page reclaim activity · 497a6c1b
      Johannes Weiner authored
      Having statistics on pages scanned and pages reclaimed for both anon and
      file pages makes it easier to evaluate changes to LRU balancing.
      
      While at it, clean up the stat-keeping mess for isolation, putback,
      reclaim stats etc.  a bit: first the physical LRU operation (isolation and
      putback), followed by vmstats, reclaim_stats, and then vm events.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-3-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      497a6c1b
    • Johannes Weiner's avatar
      mm: fix LRU balancing effect of new transparent huge pages · 5df74196
      Johannes Weiner authored
      The reclaim code that balances between swapping and cache reclaim tries to
      predict likely reuse based on in-memory reference patterns alone.  This
      works in many cases, but when it fails it cannot detect when the cache is
      thrashing pathologically, or when we're in the middle of a swap storm.
      
      The high seek cost of rotational drives under which the algorithm evolved
      also meant that mistakes could quickly result in lockups from too
      aggressive swapping (which is predominantly random IO).  As a result, the
      balancing code has been tuned over time to a point where it mostly goes
      for page cache and defers swapping until the VM is under significant
      memory pressure.
      
      The resulting strategy doesn't make optimal caching decisions - where
      optimal is the least amount of IO required to execute the workload.
      
      The proliferation of fast random IO devices such as SSDs, in-memory
      compression such as zswap, and persistent memory technologies on the
      horizon, has made this undesirable behavior very noticeable: Even in the
      presence of large amounts of cold anonymous memory and a capable swap
      device, the VM refuses to even seriously scan these pages, and can leave
      the page cache thrashing needlessly.
      
      This series sets out to address this.  Since commit a528910e ("mm:
      thrash detection-based file cache sizing") we have exact tracking of
      refault IO - the ultimate cost of reclaiming the wrong pages.  This allows
      us to use an IO cost based balancing model that is more aggressive about
      scanning anonymous memory when the cache is thrashing, while being able to
      avoid unnecessary swap storms.
      
      These patches base the LRU balance on the rate of refaults on each list,
      times the relative IO cost between swap device and filesystem
      (swappiness), in order to optimize reclaim for least IO cost incurred.
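
      As a rough illustration of the balancing model described above, the
      sketch below splits scan pressure between the anon and file LRUs in
      inverse proportion to the IO cost each list has recently incurred,
      weighted by swappiness.  It is a simplified model, not the kernel's
      code: the function name, the `200 - swappiness` file weight, and the
      `+ 1` smoothing terms are assumptions made for this example.

```python
def scan_fractions(anon_cost, file_cost, swappiness=60):
    """Return (anon_fraction, file_fraction) of total scan pressure.

    anon_cost: recent IO attributed to reclaiming anon pages (swap IO)
    file_cost: recent IO attributed to reclaiming file pages (refaults)
    swappiness: relative IO cost knob; 100 is treated as "equal cost"
    """
    if not 0 <= swappiness <= 200:
        raise ValueError("swappiness must be within 0..200 in this sketch")

    # Assumption: the anon weight is swappiness, the file weight its
    # complement, so that 100 weighs both lists equally.
    anon_weight = swappiness
    file_weight = 200 - swappiness

    total_cost = anon_cost + file_cost

    # Pressure is inversely proportional to the cost a list has incurred.
    # The "+ 1" terms only avoid division by zero when a cost is 0.
    anon_pressure = anon_weight * (total_cost + 1) / (anon_cost + 1)
    file_pressure = file_weight * (total_cost + 1) / (file_cost + 1)

    total = anon_pressure + file_pressure
    return anon_pressure / total, file_pressure / total


# Equal cost, equal weight: pressure is split evenly.
print(scan_fractions(100, 100, swappiness=100))

# A thrashing cache (many refaults, no swapins) shifts pressure to anon.
print(scan_fractions(0, 1000, swappiness=100))
```

      With swappiness at 100 and equal costs the split is even; a list that
      has incurred more refault or swapin IO receives proportionally less
      scan pressure.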
      
      	History
      
      I floated these changes in 2016.  At the time they were incomplete and
      full of workarounds due to a lack of infrastructure in the reclaim code:
      We didn't have PageWorkingset, we didn't have hierarchical cgroup
      statistics, and there were problems with the cgroup swap controller.  As swapping
      wasn't too high a priority then, the patches stalled out.  With all
      dependencies in place now, here we are again with much cleaner,
      feature-complete patches.
      
      I kept the acks for patches that stayed materially the same :-)
      
      Below is a series of test results that demonstrate certain problematic
      behavior of the current code, as well as showcase the new code's more
      predictable and appropriate balancing decisions.
      
      	Test #1: No convergence
      
      This test shows an edge case where the VM currently doesn't converge at
      all on a new file workingset with a stale anon/tmpfs set.
      
      The test sets up a cold anon set the size of 3/4 RAM, then tries to
      establish a new file set half the size of RAM (flat access pattern).
      
      The vanilla kernel refuses to even scan anon pages and never converges.
      The file set is perpetually served from the filesystem.
      
      The first test kernel is with the series up to the workingset patch
      applied.  This allows thrashing page cache to challenge the anonymous
      workingset.  The VM then scans the lists based on the current
      scanned/rotated balancing algorithm.  It converges on a stable state where
      all cold anon pages are pushed out and the fileset is served entirely from
      cache:
      
      			    noconverge/5.7-rc5-mm	noconverge/5.7-rc5-mm-workingset
      Scanned			417719308.00 (    +0.00%)		64091155.00 (   -84.66%)
      Reclaimed		417711094.00 (    +0.00%)		61640308.00 (   -85.24%)
      Reclaim efficiency %	      100.00 (    +0.00%)		      96.18 (    -3.78%)
      Scanned file		417719308.00 (    +0.00%)		59211118.00 (   -85.83%)
      Scanned anon			0.00 (    +0.00%)	         4880037.00 (          )
      Swapouts			0.00 (    +0.00%)	         2439957.00 (          )
      Swapins				0.00 (    +0.00%)		     257.00 (          )
      Refaults		415246605.00 (    +0.00%)		59183722.00 (   -85.75%)
      Restore refaults		0.00 (    +0.00%)	        54988252.00 (          )
      
      The second test kernel is with the full patch series applied, which
      replaces the scanned/rotated ratios with refault/swapin rate-based
      balancing.  It evicts the cold anon pages more aggressively in the
      presence of a thrashing cache and the absence of swapins, and so converges
      with about 60% of the IO and reclaim activity:
      
      			noconverge/5.7-rc5-mm-workingset	noconverge/5.7-rc5-mm-lrubalance
      Scanned				64091155.00 (    +0.00%)		37579741.00 (   -41.37%)
      Reclaimed			61640308.00 (    +0.00%)		35129293.00 (   -43.01%)
      Reclaim efficiency %		      96.18 (    +0.00%)		      93.48 (    -2.78%)
      Scanned file			59211118.00 (    +0.00%)		32708385.00 (   -44.76%)
      Scanned anon			 4880037.00 (    +0.00%)		 4871356.00 (    -0.18%)
      Swapouts			 2439957.00 (    +0.00%)		 2435565.00 (    -0.18%)
      Swapins				     257.00 (    +0.00%)		     262.00 (    +1.94%)
      Refaults			59183722.00 (    +0.00%)		32675667.00 (   -44.79%)
      Restore refaults		54988252.00 (    +0.00%)		28480430.00 (   -48.21%)
      
      We're triggering this case in host sideloading scenarios: When a host's
      primary workload is not saturating the machine (primary load is usually
      driven by user activity), we can optimistically sideload a batch job; if
      user activity picks up and the primary workload needs the whole host
      during this time, we freeze the sideload and rely on it getting pushed to
      swap.  Frequently that swapping doesn't happen and the completely inactive
      sideload simply stays resident while the expanding primary workload is
      struggling to gain ground.
      
      	Test #2: Kernel build
      
      This test is a kernel build that is slightly memory-restricted (make -j4
      inside a 400M cgroup).
      
      Despite the very aggressive swapping of cold anon pages in test #1, this
      test shows that the new kernel carefully balances swap against cache
      refaults when both the anon and the cache set are pressured.
      
      It shows the patched kernel to be slightly better at finding the coldest
      memory from the combined anon and file set to evict under pressure.  The
      result is lower aggregate reclaim and paging activity:
      
      				    5.7-rc5-mm	5.7-rc5-mm-lrubalance
      Real time		   210.60 (    +0.00%)	   210.97 (    +0.18%)
      User time		   745.42 (    +0.00%)	   746.48 (    +0.14%)
      System time		    69.78 (    +0.00%)	    69.79 (    +0.02%)
      Scanned file		354682.00 (    +0.00%)	293661.00 (   -17.20%)
      Scanned anon		465381.00 (    +0.00%)	378144.00 (   -18.75%)
      Swapouts		185920.00 (    +0.00%)	147801.00 (   -20.50%)
      Swapins			 34583.00 (    +0.00%)	 32491.00 (    -6.05%)
      Refaults		212664.00 (    +0.00%)	172409.00 (   -18.93%)
      Restore refaults	 48861.00 (    +0.00%)	 80091.00 (   +63.91%)
      Total paging IO		433167.00 (    +0.00%)	352701.00 (   -18.58%)
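
      The Total paging IO row appears to be the sum of the Swapouts, Swapins,
      and Refaults rows.  That relationship is inferred from the numbers
      rather than stated in the text.  A short check, with the values copied
      from the table above (the helper names are made up for this example):

```python
# Values copied from the kernel build table above.
baseline = {"swapouts": 185920, "swapins": 34583, "refaults": 212664}
patched = {"swapouts": 147801, "swapins": 32491, "refaults": 172409}


def total_paging_io(stats):
    """Sum of swap writes, swap reads, and cache refaults, in pages."""
    return stats["swapouts"] + stats["swapins"] + stats["refaults"]


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


old_total = total_paging_io(baseline)  # 433167
new_total = total_paging_io(patched)   # 352701
print(old_total, new_total, round(percent_change(old_total, new_total), 2))
```

      Both sums match the Total paging IO row, and the relative change rounds
      to the -18.58% shown there.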
      
      	Test #3: Overload
      
      This next test is not about performance, but rather about the
      predictability of the algorithm.  The current balancing behavior doesn't
      always lead to comprehensible results, which makes performance analysis
      and parameter tuning (e.g. swappiness) very difficult.
      
      The test shows the balancing behavior under equivalent anon and file
      input.  Anon and file sets are created of equal size (3/4 RAM), have the
      same access patterns (a hot-cold gradient), and synchronized access rates.
      Swappiness is raised from the default of 60 to 100 to indicate equal IO
      cost between swap and cache.
      
      With the vanilla balancing code, anon scans make up around 9% of the total
      pages scanned, or a ~1:10 ratio.  This is a surprisingly skewed ratio, and
      it's an outcome that is hard to explain given the input parameters to the
      VM.
      
      The new balancing model targets a 1:2 balance: All else being equal,
      reclaiming a file page costs one page IO - the refault; reclaiming an anon
      page costs two IOs - the swapout and the swapin.  In the test we observe a
      ~1:3 balance.
      
      The scanned and paging IO numbers indicate that the anon LRU algorithm we
      have in place right now does a slightly worse job at picking the coldest
      pages compared to the file algorithm.  There is ongoing work to improve
      this, like Joonsoo's anon workingset patches; however, it's difficult to
      compare the two aging strategies when the balancing between them is
      behaving unintuitively.
      
      The slightly less efficient anon reclaim results in a deviation from the
      optimal 1:2 scan ratio we would like to see here - however, 1:3 is much
      closer to what we'd want to see in this test than the vanilla kernel's
      aging of 10+ cache pages for every anonymous one:
      
      			overload-100/5.7-rc5-mm-workingset	overload-100/5.7-rc5-mm-lrubalance-realfile
      Scanned				 533633725.00 (    +0.00%)			  595687785.00 (   +11.63%)
      Reclaimed			 494325440.00 (    +0.00%)			  518154380.00 (    +4.82%)
      Reclaim efficiency %			92.63 (    +0.00%)				 86.98 (    -6.03%)
      Scanned file			 484532894.00 (    +0.00%)			  456937722.00 (    -5.70%)
      Scanned anon			  49100831.00 (    +0.00%)			  138750063.00 (  +182.58%)
      Swapouts			   8096423.00 (    +0.00%)			   48982142.00 (  +504.98%)
      Swapins				  10027384.00 (    +0.00%)			   62325044.00 (  +521.55%)
      Refaults			 479819973.00 (    +0.00%)			  451309483.00 (    -5.94%)
      Restore refaults		 426422087.00 (    +0.00%)			  399914067.00 (    -6.22%)
      Total paging IO			 497943780.00 (    +0.00%)			  562616669.00 (   +12.99%)
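
      The scan ratios quoted in the discussion of this test can be recomputed
      from the Scanned anon and Scanned file rows.  The sketch below does
      that; the helper names are made up for this example, and the figures
      are copied from the table above:

```python
def anon_share(scanned_anon, scanned_file):
    """Fraction of all scanned pages that were anonymous pages."""
    return scanned_anon / (scanned_anon + scanned_file)


def file_per_anon(scanned_anon, scanned_file):
    """Number of file pages scanned for every anon page scanned."""
    return scanned_file / scanned_anon


# (scanned anon, scanned file) for each column of the table.
workingset = (49100831, 484532894)
lrubalance = (138750063, 456937722)

for name, (anon, file) in (("workingset", workingset),
                           ("lrubalance", lrubalance)):
    print(name,
          f"anon share {anon_share(anon, file) * 100:.1f}%",
          f"ratio 1:{file_per_anon(anon, file):.1f}")
```

      For the baseline column this gives an anon share of about 9.2% and
      roughly ten file pages per anon page; for the patched column the ratio
      is about 1:3.3.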
      
      	Test #4: Parallel IO
      
      It's important to note that these patches only affect the situation where
      the kernel has to reclaim workingset memory, which is usually a
      transitionary period.  The vast majority of page reclaim occurring in a
      system is from trimming the ever-expanding page cache.
      
      These patches don't affect cache trimming behavior.  We never swap as long
      as we only have use-once cache moving through the file LRU; we only
      consider swapping when the cache is actively thrashing.
      
      The following test demonstrates this.  It has an anon workingset that
      takes up half of RAM and then writes a file that is twice the size of RAM
      out to disk.
      
      As the cache is funneled through the inactive file list, no anon pages are
      scanned (aside from apparently some background noise of 10 pages):
      
      					  5.7-rc5-mm		          5.7-rc5-mm-lrubalance
      Scanned			    10714722.00 (    +0.00%)		       10723445.00 (    +0.08%)
      Reclaimed		    10703596.00 (    +0.00%)		       10712166.00 (    +0.08%)
      Reclaim efficiency %		  99.90 (    +0.00%)			     99.89 (    -0.00%)
      Scanned file		    10714722.00 (    +0.00%)		       10723435.00 (    +0.08%)
      Scanned anon			   0.00 (    +0.00%)			     10.00 (          )
      Swapouts			   0.00 (    +0.00%)			      7.00 (          )
      Swapins				   0.00 (    +0.00%)			      0.00 (    +0.00%)
      Refaults			  92.00 (    +0.00%)			     41.00 (   -54.84%)
      Restore refaults		   0.00 (    +0.00%)			      0.00 (    +0.00%)
      Total paging IO			  92.00 (    +0.00%)			     48.00 (   -47.31%)
      
      This patch (of 14):
      
      Currently, THP are counted as single pages until they are split right
      before being swapped out.  However, at that point the VM is already in the
      middle of reclaim, and adjusting the LRU balance then is useless.
      
      Always account THP by the number of basepages, and remove the fixup from
      the splitting path.
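
      To illustrate the difference in accounting, the sketch below counts a
      batch of pages both ways: one unit per page regardless of size, and one
      unit per base page.  It is only a model of the idea, not kernel code;
      the names and the order-9 huge page (2 MiB over 4 KiB base pages) are
      assumptions for this example.

```python
HPAGE_PMD_ORDER = 9  # assumption: 2 MiB THP built from 4 KiB base pages


def nr_base_pages(order):
    """Number of base pages in a compound page of the given order."""
    return 1 << order


def accounted_pages(page_orders, by_base_pages=True):
    """Units charged to the balance counters for a batch of pages.

    page_orders: allocation order of each page (0 for a base page)
    by_base_pages: False models counting each THP as a single page
    """
    if by_base_pages:
        return sum(nr_base_pages(order) for order in page_orders)
    return len(page_orders)


batch = [0, HPAGE_PMD_ORDER, 0]  # two base pages and one THP
print(accounted_pages(batch, by_base_pages=False))  # 3
print(accounted_pages(batch, by_base_pages=True))   # 514
```
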
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-1-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20200520232525.798933-2-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5df74196
    • Johannes Weiner's avatar
      mm: memcontrol: update page->mem_cgroup stability rules · a0b5b414
      Johannes Weiner authored
      The previous patches have simplified the access rules around
      page->mem_cgroup somewhat:
      
      1. We never change page->mem_cgroup while the page is isolated by
         somebody else.  This was by far the biggest exception to our rules and
         it didn't stop at lock_page() or lock_page_memcg().
      
      2. We charge pages before they get put into page tables now, so the
         somewhat fishy rule about "can be in page table as long as it's still
         locked" is now gone and boiled down to having an exclusive reference to
         the page.
      
      Document the new rules.  Any of the following will stabilize the
      page->mem_cgroup association:
      
      - the page lock
      - LRU isolation
      - lock_page_memcg()
      - exclusive access to the page
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-20-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0b5b414
    • Johannes Weiner's avatar
      mm: memcontrol: delete unused lrucare handling · d9eb1ea2
      Johannes Weiner authored
      Swapin faults were the last event to charge pages after they had already
      been put on the LRU list.  Now that we charge directly on swapin, the
      lrucare portion of the charge code is unused.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9eb1ea2
    • Alex Shi's avatar
      mm: memcontrol: document the new swap control behavior · 0a27cae1
      Alex Shi authored
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-18-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a27cae1
    • Johannes Weiner's avatar
      mm: memcontrol: charge swapin pages on instantiation · 4c6355b2
      Johannes Weiner authored
      Right now, users that are otherwise memory controlled can easily escape
      their containment and allocate significant amounts of memory that they're
      not being charged for.  That's because swap readahead pages are not being
      charged until somebody actually faults them into their page table.  This
      can be exploited with MADV_WILLNEED, which triggers arbitrary readahead
      allocations without charging the pages.
      
      There are additional problems with the delayed charging of swap pages:
      
      1. To implement refault/workingset detection for anonymous pages, we
         need to have a target LRU available at swapin time, but the LRU is not
         determinable until the page has been charged.
      
      2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be
         stable when the page is isolated from the LRU; otherwise, the locks
         change under us.  But swapcache gets charged after it's already on the
         LRU, and it gets charged even if we cannot isolate it ourselves (since
         charging is not exactly optional).
      
      The previous patch ensured we always maintain cgroup ownership records for
      swap pages.  This patch moves the swapcache charging point from the fault
      handler to swapin time to fix all of the above problems.
      
      v2: simplify swapin error checking (Joonsoo)
      
      [hughd@google.com: fix livelock in __read_swap_cache_async()]
        Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005212246080.8458@eggly.anvils
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-17-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c6355b2
    • Johannes Weiner's avatar
      mm: memcontrol: make swap tracking an integral part of memory control · 2d1c4980
      Johannes Weiner authored
      Without swap page tracking, users that are otherwise memory controlled can
      easily escape their containment and allocate significant amounts of memory
      that they're not being charged for.  That's because swap does readahead,
      but without the cgroup records of who owned the page at swapout, readahead
      pages don't get charged until somebody actually faults them into their
      page table and we can identify an owner task.  This can be maliciously
      exploited with MADV_WILLNEED, which triggers arbitrary readahead
      allocations without charging the pages.
      
      Make swap page tracking an integral part of memcg and remove the
      Kconfig options.  In the first place, it was only made configurable to
      allow users to save some memory.  But the overhead of tracking cgroup
      ownership per swap page is minimal - 2 bytes per page, or 512k per 1G of
      swap, or 0.04%.  Saving that at the expense of broken containment
      semantics is not something we should present as a coequal option.
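
      The overhead figure can be reproduced with a few lines of arithmetic.
      The sketch assumes 4 KiB pages, which the text does not state; the
      2-byte record size is taken from the paragraph above, and the names are
      made up for this example.

```python
PAGE_SIZE = 4096        # assumption: 4 KiB pages
RECORD_BYTES = 2        # per-swap-page ownership record, from the text
GIB = 1 << 30


def swap_tracking_overhead(swap_bytes):
    """Bytes of ownership records needed to track swap_bytes of swap."""
    swap_pages = swap_bytes // PAGE_SIZE
    return swap_pages * RECORD_BYTES


overhead = swap_tracking_overhead(GIB)
print(overhead // 1024, "KiB per GiB of swap")        # 512 KiB
print(f"{overhead / GIB * 100:.3f}% of the swap size")
```

      Under that page-size assumption, 1G of swap needs 512k of records,
      which works out to a little under 0.05% of the swap size.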
      
      The swapaccount=0 boot option will continue to exist, and it will
      eliminate the page_counter overhead and hide the swap control files, but
      it won't disable swap slot ownership tracking.
      
      This patch makes sure we always have the cgroup records at swapin time;
      the next patch will fix the actual bug by charging readahead swap pages at
      swapin time rather than at fault time.
      
      v2: fix double swap charge bug in cgroup1/cgroup2 code gating
      
      [hannes@cmpxchg.org: fix crash with cgroup_disable=memory]
        Link: http://lkml.kernel.org/r/20200521215855.GB815153@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
      Link: http://lkml.kernel.org/r/20200508183105.225460-16-hannes@cmpxchg.org
      Debugged-by: Hugh Dickins <hughd@google.com>
      Debugged-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d1c4980
    • Johannes Weiner's avatar
      mm: memcontrol: prepare swap controller setup for integration · eccb52e7
      Johannes Weiner authored
      A few cleanups to streamline the swap controller setup:
      
      - Replace the do_swap_account flag with cgroup_memory_noswap. This
        brings it in line with other functionality that is usually available
        unless explicitly opted out of - nosocket, nokmem.
      
      - Remove the really_do_swap_account flag that stores the boot option
        and is later used to switch the do_swap_account. It's not clear why
        this indirection is/was necessary. Use do_swap_account directly.
      
      - Minor coding style polishing
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-15-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eccb52e7