1. 14 Aug, 2023 16 commits
    • f2fs: fix to account gc stats correctly · 9bf1dcbd
      Chao Yu authored
      As reported, status debugfs entry shows inconsistent GC stats as below:
      
      GC calls: 6008 (BG: 6161)
        - data segments : 3053 (BG: 3053)
        - node segments : 2955 (BG: 2955)
      
      Total GC calls is smaller than BGGC calls, the reason is:
      - f2fs_stat_info.call_count accounts total migrated section count
      by f2fs_gc()
      - f2fs_stat_info.bg_gc accounts total call times of f2fs_gc() from
      background gc_thread
      
      Another issue is gc_foreground_calls sysfs entry shows total GC call
      count rather than FGGC call count.
      
      This patch makes the following changes to fix this:
      - account GC calls and migrated segment count separately
      - also account migrated section count when large section mode is
      enabled
      - show the correct value in the gc_foreground_calls sysfs entry
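The accounting split listed above can be sketched in user space as follows. This is only an illustration under assumptions: `struct gc_stats`, `gc_account_call()`, `gc_account_migration()` and `gc_total_calls()` are hypothetical names, not the actual fields or helpers of f2fs_stat_info. The point is that call counts and migrated-segment counts live in separate counters, so the total number of calls can no longer drop below the background count.

```c
#include <assert.h>

/* Hypothetical names for illustration; not the fields of f2fs_stat_info. */
enum gc_kind { GC_FG, GC_BG, GC_KINDS };

struct gc_stats {
    unsigned int calls[GC_KINDS]; /* how many times GC ran */
    unsigned int segs[GC_KINDS];  /* how many segments GC migrated */
};

/* Count one GC invocation, independent of how much it migrated. */
static void gc_account_call(struct gc_stats *st, enum gc_kind kind)
{
    assert(kind == GC_FG || kind == GC_BG);
    st->calls[kind]++;
}

/* Count migrated segments in their own counter. */
static void gc_account_migration(struct gc_stats *st, enum gc_kind kind,
                                 unsigned int nr_segs)
{
    assert(kind == GC_FG || kind == GC_BG);
    st->segs[kind] += nr_segs;
}

/* The total is derived from FG + BG, so it can never be below BG. */
static unsigned int gc_total_calls(const struct gc_stats *st)
{
    return st->calls[GC_FG] + st->calls[GC_BG];
}
```

With this layout, a foreground-only counter such as the one behind gc_foreground_calls would read `calls[GC_FG]` rather than the total.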
      
      Fixes: fc7100ea ("f2fs: Add f2fs stats to sysfs")
      Signed-off-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: remove unneeded check condition in __f2fs_setxattr() · bc3994ff
      Chao Yu authored
      The return value of write_all_xattrs() has already been checked, so
      remove the unneeded check condition that follows it.
      Signed-off-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: fix to update i_ctime in __f2fs_setxattr() · 8874ad7d
      Chao Yu authored
      generic/728       - output mismatch (see /media/fstests/results//generic/728.out.bad)
          --- tests/generic/728.out	2023-07-19 07:10:48.362711407 +0000
          +++ /media/fstests/results//generic/728.out.bad	2023-07-19 08:39:57.000000000 +0000
           QA output created by 728
          +Expected ctime to change after setxattr.
          +Expected ctime to change after removexattr.
           Silence is golden
          ...
          (Run 'diff -u /media/fstests/tests/generic/728.out /media/fstests/results//generic/728.out.bad'  to see the entire diff)
      generic/729        1s
      
      i_ctime needs to be updated after {set,remove}xattr; fix it.
      Signed-off-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • Revert "f2fs: fix to do sanity check on extent cache correctly" · 958ccbbf
      Chao Yu authored
      syzbot reports an f2fs bug as below:
      
      UBSAN: array-index-out-of-bounds in fs/f2fs/f2fs.h:3275:19
      index 1409 is out of range for type '__le32[923]' (aka 'unsigned int[923]')
      Call Trace:
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
       ubsan_epilogue lib/ubsan.c:217 [inline]
       __ubsan_handle_out_of_bounds+0x11c/0x150 lib/ubsan.c:348
       inline_data_addr fs/f2fs/f2fs.h:3275 [inline]
       __recover_inline_status fs/f2fs/inode.c:113 [inline]
       do_read_inode fs/f2fs/inode.c:480 [inline]
       f2fs_iget+0x4730/0x48b0 fs/f2fs/inode.c:604
       f2fs_fill_super+0x640e/0x80c0 fs/f2fs/super.c:4601
       mount_bdev+0x276/0x3b0 fs/super.c:1391
       legacy_get_tree+0xef/0x190 fs/fs_context.c:611
       vfs_get_tree+0x8c/0x270 fs/super.c:1519
       do_new_mount+0x28f/0xae0 fs/namespace.c:3335
       do_mount fs/namespace.c:3675 [inline]
       __do_sys_mount fs/namespace.c:3884 [inline]
       __se_sys_mount+0x2d9/0x3c0 fs/namespace.c:3861
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      The issue was bisected to:
      
      commit d48a7b3a
      Author: Chao Yu <chao@kernel.org>
      Date:   Mon Jan 9 03:49:20 2023 +0000
      
          f2fs: fix to do sanity check on extent cache correctly
      
      The root cause is that both v1 and v2 of the patch were applied. v2 is the
      right fix, so v1 needs to be reverted in order to fix the reported issue.
      
      v1:
      commit d48a7b3a ("f2fs: fix to do sanity check on extent cache correctly")
      https://lore.kernel.org/lkml/20230109034920.492914-1-chao@kernel.org/
      
      v2:
      commit 269d1194 ("f2fs: fix to do sanity check on extent cache correctly")
      https://lore.kernel.org/lkml/20230207134808.1827869-1-chao@kernel.org/
      
      Reported-by: syzbot+601018296973a481f302@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/linux-f2fs-devel/000000000000fcf0690600e4d04d@google.com/
      Fixes: d48a7b3a ("f2fs: fix to do sanity check on extent cache correctly")
      Signed-off-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: increase usage of folio_next_index() helper · a842a909
      Minjie Du authored
      Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using
      the existing helper folio_next_index().
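As a rough user-space illustration of the pattern being replaced, the sketch below mocks `struct folio` with just the two values involved. The struct layout and the `folio_nr_pages()` stand-in are assumptions made for this example; only the expression `folio->index + folio_nr_pages(folio)` comes from the commit message.

```c
#include <assert.h>

/* User-space stand-ins; the real struct folio lives in the kernel. */
typedef unsigned long pgoff_t;

struct folio {
    pgoff_t index;          /* page-cache offset of the first page */
    unsigned long nr_pages; /* number of pages covered by the folio */
};

static unsigned long folio_nr_pages(const struct folio *folio)
{
    return folio->nr_pages;
}

/* Index of the first page after this folio. */
static pgoff_t folio_next_index(const struct folio *folio)
{
    assert(folio->nr_pages > 0); /* a folio covers at least one page */
    return folio->index + folio_nr_pages(folio);
}
```

A caller that previously spelled out the sum can then write `next = folio_next_index(folio);` instead.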
      Signed-off-by: Minjie Du <duminjie@vivo.com>
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: Only lfs mode is allowed with zoned block device feature · 2bd4df8f
      Chunhai Guo authored
      f2fs now supports four block allocation modes: lfs, adaptive,
      fragment:segment, fragment:block. Only lfs mode is allowed with the zoned
      block device feature.
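The rule stated above can be sketched as a small validation function. The enum values and `check_alloc_mode()` are hypothetical names chosen for this example and do not reflect the real mount-option parsing code in f2fs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical identifiers for the four allocation modes. */
enum alloc_mode {
    MODE_ADAPTIVE,
    MODE_LFS,
    MODE_FRAGMENT_SEG,
    MODE_FRAGMENT_BLK,
    MODE_COUNT
};

/* Returns 0 if the combination is allowed, -EINVAL otherwise. */
static int check_alloc_mode(bool zoned_feature, enum alloc_mode mode)
{
    assert(mode < MODE_COUNT);
    if (zoned_feature && mode != MODE_LFS)
        return -EINVAL;
    return 0;
}
```

Without the zoned block device feature, every mode passes the check; with it, only lfs does.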
      
      Fixes: 6691d940 ("f2fs: introduce fragment allocation mode mount option")
      Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: check zone type before sending async reset zone command · 3cb88bc1
      Shin'ichiro Kawasaki authored
      The commit 25f90805 ("f2fs: add async reset zone command support")
      introduced "async reset zone commands" by calling
      __submit_zone_reset_cmd() in async discard operations. However,
      __submit_zone_reset_cmd() is called regardless of zone type of discard
      target zone. When devices have conventional zones, zone reset commands
      are sent to the conventional zones and cause I/O errors.
      
      Avoid the I/O errors by checking that the discard target zone type is
      sequential write required. If not, handle the discard operation in the
      same manner as for non-zoned, regular block devices. For that purpose,
      add a new helper function f2fs_bdev_index() which gets the index of the
      zone reset target device.
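The decision described above can be illustrated with a minimal sketch. `enum zone_type`, `enum discard_op` and `pick_discard_op()` are hypothetical names for this example; the actual patch works on block-layer zone information and the f2fs discard command structures, which are not modeled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical zone classification for illustration. */
enum zone_type { ZONE_CONVENTIONAL, ZONE_SEQ_WRITE_REQUIRED };

/* Which operation to issue for a discard request. */
enum discard_op { OP_DISCARD, OP_ZONE_RESET };

/*
 * Only sequential-write-required zones on a zoned device get a zone
 * reset; everything else is handled like a regular block device.
 */
static enum discard_op pick_discard_op(bool dev_is_zoned, enum zone_type type)
{
    assert(type == ZONE_CONVENTIONAL || type == ZONE_SEQ_WRITE_REQUIRED);
    if (dev_is_zoned && type == ZONE_SEQ_WRITE_REQUIRED)
        return OP_ZONE_RESET;
    return OP_DISCARD;
}
```

The case that previously caused I/O errors is the conventional zone on a zoned device, which now takes the plain discard path.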
      
      Fixes: 25f90805 ("f2fs: add async reset zone command support")
      Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: compress: don't {,de}compress non-full cluster · 025b3602
      Chao Yu authored
      f2fs won't compress a non-full cluster at the tail of a file, so skip
      dirtying and rewriting such a cluster during f2fs_ioc_{,de}compress_file.
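One way to picture which pages are affected is the arithmetic below. It is a sketch under assumptions, not the kernel's loop: `full_cluster_pages()` is a hypothetical helper that, given a file length in pages and a cluster size in pages, returns how many leading pages belong to full clusters. Pages beyond that count form the non-full tail cluster that would be skipped.

```c
#include <assert.h>

/*
 * Number of leading pages that belong to full clusters.
 * nr_pages:     file length in pages
 * cluster_size: pages per compression cluster (must be non-zero)
 */
static unsigned long full_cluster_pages(unsigned long nr_pages,
                                        unsigned int cluster_size)
{
    assert(cluster_size > 0);
    return nr_pages - nr_pages % cluster_size;
}
```

For a 10-page file with 4-page clusters, the first 8 pages form two full clusters and the last 2 pages are the tail that is left alone.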
      Signed-off-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: allow f2fs_ioc_{,de}compress_file to be interrupted · 3a2c0e55
      Chao Yu authored
      This patch allows f2fs_ioc_{,de}compress_file() to be interrupted, so that
      userspace won't be blocked when manual {,de}compression of a large file is
      interrupted by a signal.
      Signed-off-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: don't reopen the main block device in f2fs_scan_devices · 51bf8d3c
      Christoph Hellwig authored
      f2fs_scan_devices reopens the main device since the very beginning, which
      has always been useless, and also means that we don't pass the right
      holder for the reopen, which now leads to a warning as the core super.c
      holder ops aren't passed in for the reopen.
      
      Fixes: 3c62be17 ("f2fs: support multiple devices")
      Fixes: 0718afd4 ("block: introduce holder ops")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: fix to avoid mmap vs set_compress_option case · b5ab3276
      Chao Yu authored
      The compression options of an inode should not be changed after they have
      been used. However, that can happen in the race case below:
      
      Thread A				Thread B
      - f2fs_ioc_set_compress_option
       - check f2fs_is_mmap_file()
       - check get_dirty_pages()
       - check F2FS_HAS_BLOCKS()
      					- f2fs_file_mmap
      					 - set_inode_flag(FI_MMAP_FILE)
      					- fault
      					 - do_page_mkwrite
      					  - f2fs_vm_page_mkwrite
      					  - f2fs_get_block_locked
      					 - fault_dirty_shared_page
      					  - set_page_dirty
       - update i_compress_algorithm
       - update i_log_cluster_size
       - update i_cluster_size
      
      Avoid this race condition by covering f2fs_file_mmap() with the i_sem lock,
      and also add an mmap file check condition in f2fs_may_compress().
      
      Fixes: e1e8debe ("f2fs: add F2FS_IOC_SET_COMPRESS_OPTION ioctl")
      Signed-off-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: fix spelling in ABI documentation · c709d099
      Randy Dunlap authored
      Correct spelling problems as identified by codespell.
      
      Fixes: 9e615dbb ("f2fs: add missing description for ipu_policy node")
      Fixes: b2e4a2b3 ("f2fs: expose discard related parameters in sysfs")
      Fixes: 846ae671 ("f2fs: expose extension_list sysfs entry")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Chao Yu <chao@kernel.org>
      Cc: linux-f2fs-devel@lists.sourceforge.net
      Cc: Yangtao Li <frank.li@vivo.com>
      Cc: Konstantin Vyshetsky <vkon@google.com>
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: get out of a repeat loop when getting a locked data page · d2d9bb3b
      Jaegeuk Kim authored
      https://bugzilla.kernel.org/show_bug.cgi?id=216050
      
      Somehow we're getting a page which has a different mapping.
      Let's avoid the infinite loop.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: flush inode if atomic file is aborted · a3ab5574
      Jaegeuk Kim authored
      Flush the inode whose atomic operation is being aborted, to avoid a stale
      dirty inode during eviction in this call stack:
      
        f2fs_mark_inode_dirty_sync+0x22/0x40 [f2fs]
        f2fs_abort_atomic_write+0xc4/0xf0 [f2fs]
        f2fs_evict_inode+0x3f/0x690 [f2fs]
        ? sugov_start+0x140/0x140
        evict+0xc3/0x1c0
        evict_inodes+0x17b/0x210
        generic_shutdown_super+0x32/0x120
        kill_block_super+0x21/0x50
        deactivate_locked_super+0x31/0x90
        cleanup_mnt+0x100/0x160
        task_work_run+0x59/0x90
        do_exit+0x33b/0xa50
        do_group_exit+0x2d/0x80
        __x64_sys_exit_group+0x14/0x20
        do_syscall_64+0x3b/0x90
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      This triggers f2fs_bug_on() in f2fs_evict_inode:
       f2fs_bug_on(sbi, is_inode_flag_set(inode, FI_DIRTY_INODE));
      
      This fixes the syzbot report:
      
      loop0: detected capacity change from 0 to 131072
      F2FS-fs (loop0): invalid crc value
      F2FS-fs (loop0): Found nat_bits in checkpoint
      F2FS-fs (loop0): Mounted with checkpoint version = 48b305e4
      ------------[ cut here ]------------
      kernel BUG at fs/f2fs/inode.c:869!
      invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 5014 Comm: syz-executor220 Not tainted 6.4.0-syzkaller-11479-g6cd06ab1 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
      RIP: 0010:f2fs_evict_inode+0x172d/0x1e00 fs/f2fs/inode.c:869
      Code: ff df 48 c1 ea 03 80 3c 02 00 0f 85 6a 06 00 00 8b 75 40 ba 01 00 00 00 4c 89 e7 e8 6d ce 06 00 e9 aa fc ff ff e8 63 22 e2 fd <0f> 0b e8 5c 22 e2 fd 48 c7 c0 a8 3a 18 8d 48 ba 00 00 00 00 00 fc
      RSP: 0018:ffffc90003a6fa00 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
      RDX: ffff8880273b8000 RSI: ffffffff83a2bd0d RDI: 0000000000000007
      RBP: ffff888077db91b0 R08: 0000000000000007 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: ffff888029a3c000
      R13: ffff888077db9660 R14: ffff888029a3c0b8 R15: ffff888077db9c50
      FS:  0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f1909bb9000 CR3: 00000000276a9000 CR4: 0000000000350ef0
      Call Trace:
       <TASK>
       evict+0x2ed/0x6b0 fs/inode.c:665
       dispose_list+0x117/0x1e0 fs/inode.c:698
       evict_inodes+0x345/0x440 fs/inode.c:748
       generic_shutdown_super+0xaf/0x480 fs/super.c:478
       kill_block_super+0x64/0xb0 fs/super.c:1417
       kill_f2fs_super+0x2af/0x3c0 fs/f2fs/super.c:4704
       deactivate_locked_super+0x98/0x160 fs/super.c:330
       deactivate_super+0xb1/0xd0 fs/super.c:361
       cleanup_mnt+0x2ae/0x3d0 fs/namespace.c:1254
       task_work_run+0x16f/0x270 kernel/task_work.c:179
       exit_task_work include/linux/task_work.h:38 [inline]
       do_exit+0xa9a/0x29a0 kernel/exit.c:874
       do_group_exit+0xd4/0x2a0 kernel/exit.c:1024
       __do_sys_exit_group kernel/exit.c:1035 [inline]
       __se_sys_exit_group kernel/exit.c:1033 [inline]
       __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:1033
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f309be71a09
      Code: Unable to access opcode bytes at 0x7f309be719df.
      RSP: 002b:00007fff171df518 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
      RAX: ffffffffffffffda RBX: 00007f309bef7330 RCX: 00007f309be71a09
      RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001
      RBP: 0000000000000001 R08: ffffffffffffffc0 R09: 00007f309bef1e40
      R10: 0000000000010600 R11: 0000000000000246 R12: 00007f309bef7330
      R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
       </TASK>
      Modules linked in:
      ---[ end trace 0000000000000000 ]---
      RIP: 0010:f2fs_evict_inode+0x172d/0x1e00 fs/f2fs/inode.c:869
      Code: ff df 48 c1 ea 03 80 3c 02 00 0f 85 6a 06 00 00 8b 75 40 ba 01 00 00 00 4c 89 e7 e8 6d ce 06 00 e9 aa fc ff ff e8 63 22 e2 fd <0f> 0b e8 5c 22 e2 fd 48 c7 c0 a8 3a 18 8d 48 ba 00 00 00 00 00 fc
      RSP: 0018:ffffc90003a6fa00 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
      RDX: ffff8880273b8000 RSI: ffffffff83a2bd0d RDI: 0000000000000007
      RBP: ffff888077db91b0 R08: 0000000000000007 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: ffff888029a3c000
      R13: ffff888077db9660 R14: ffff888029a3c0b8 R15: ffff888077db9c50
      FS:  0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f1909bb9000 CR3: 00000000276a9000 CR4: 0000000000350ef0
      
      Cc: <stable@vger.kernel.org>
      Reported-and-tested-by: syzbot+e1246909d526a9d470fa@syzkaller.appspotmail.com
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: don't handle error case of f2fs_compress_alloc_page() · 863907a4
      Chao Yu authored
      f2fs_compress_alloc_page() uses a mempool to allocate memory, so it never
      fails. Don't handle the error case in its callers.
      Signed-off-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • Revert "f2fs: clean up w/ sbi->log_sectors_per_block" · 579c7e41
      Jaegeuk Kim authored
      This reverts commit bfd47662.
      
      Shinichiro Kawasaki reported:
      
      When I ran workloads on f2fs using v6.5-rcX with fixes [1][2] and a zoned block
      device with 4kb logical block size, I observed a mount failure as follows. When
      I reverted this commit, the failure went away.
      
      [  167.781975][ T1555] F2FS-fs (dm-0): IO Block Size:        4 KB
      [  167.890728][ T1555] F2FS-fs (dm-0): Found nat_bits in checkpoint
      [  171.482588][ T1555] F2FS-fs (dm-0): Zone without valid block has non-zero write pointer. Reset the write pointer: wp[0x1300,0x8]
      [  171.496000][ T1555] F2FS-fs (dm-0): (0) : Unaligned zone reset attempted (block 280000 + 80000)
      [  171.505037][ T1555] F2FS-fs (dm-0): Discard zone failed:  (errno=-5)
      
      The patch replaced "sbi->log_blocksize - SECTOR_SHIFT" with
      "sbi->log_sectors_per_block". However, I think these two are not equal when the
      device has 4k logical block size. The former uses the Linux kernel sector size
      of 512 bytes. The latter uses a 512b or 4kb sector size depending on the
      device. mkfs.f2fs obtains the logical block size via the BLKSSZGET ioctl from
      the device and reflects it in the value sbi->log_sectors_per_block. This causes
      unexpected write pointer calculations in check_zone_write_pointer(), which
      resulted in the unexpected zone reset and the mount failure.
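The mismatch described in the report can be checked with a few lines of arithmetic. The sketch below is a user-space illustration under the report's description: `shift_kernel_sectors()` models "log_blocksize - SECTOR_SHIFT", and `shift_device_sectors()` models the assumed meaning of log_sectors_per_block, a shift relative to the device's logical sector size. Both helper names are hypothetical.

```c
#include <assert.h>

#define SECTOR_SHIFT 9 /* the kernel's fixed 512-byte sector */

/* Integer log2 of a power-of-two value. */
static unsigned int log2_pow2(unsigned int v)
{
    unsigned int r = 0;

    assert(v != 0 && (v & (v - 1)) == 0);
    while (v > 1) {
        v >>= 1;
        r++;
    }
    return r;
}

/* Block-to-sector shift in units of 512-byte kernel sectors. */
static unsigned int shift_kernel_sectors(unsigned int block_size)
{
    return log2_pow2(block_size) - SECTOR_SHIFT;
}

/* Assumed block-to-sector shift relative to the device's sector size. */
static unsigned int shift_device_sectors(unsigned int block_size,
                                         unsigned int dev_sector_size)
{
    return log2_pow2(block_size / dev_sector_size);
}
```

With 4 KB blocks, both shifts are 3 on a device with 512-byte logical sectors, but on a device with 4 KB logical sectors the device-relative shift is 0 while the kernel-sector shift is still 3, so a block address converted with the wrong shift lands on the wrong sector.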
      
      [1] https://lkml.kernel.org/linux-f2fs-devel/20230711050101.GA19128@lst.de/
      [2] https://lore.kernel.org/linux-f2fs-devel/20230804091556.2372567-1-shinichiro.kawasaki@wdc.com/
      
      Cc: stable@vger.kernel.org
      Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Fixes: bfd47662 ("f2fs: clean up w/ sbi->log_sectors_per_block")
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  2. 09 Jul, 2023 10 commits
  3. 08 Jul, 2023 14 commits
    • mm: lock newly mapped VMA with corrected ordering · 1c7873e3
      Hugh Dickins authored
      Lockdep is certainly right to complain about
      
        (&vma->vm_lock->lock){++++}-{3:3}, at: vma_start_write+0x2d/0x3f
                       but task is already holding lock:
        (&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: mmap_region+0x4dc/0x6db
      
      Invert those to the usual ordering.
      
      Fixes: 33313a74 ("mm: lock newly mapped VMA which can be modified after it becomes visible")
      Cc: stable@vger.kernel.org
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'mm-hotfixes-stable-2023-07-08-10-43' of... · 946c6b59
      Linus Torvalds authored
      Merge tag 'mm-hotfixes-stable-2023-07-08-10-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
      
      Pull hotfixes from Andrew Morton:
       "16 hotfixes. Six are cc:stable and the remainder address post-6.4
        issues"
      
      The merge undoes the disabling of the CONFIG_PER_VMA_LOCK feature, since
      it was all hopefully fixed in mainline.
      
      * tag 'mm-hotfixes-stable-2023-07-08-10-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
        lib: dhry: fix sleeping allocations inside non-preemptable section
        kasan, slub: fix HW_TAGS zeroing with slub_debug
        kasan: fix type cast in memory_is_poisoned_n
        mailmap: add entries for Heiko Stuebner
        mailmap: update manpage link
        bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page
        MAINTAINERS: add linux-next info
        mailmap: add Markus Schneider-Pargmann
        writeback: account the number of pages written back
        mm: call arch_swap_restore() from do_swap_page()
        squashfs: fix cache race with migration
        mm/hugetlb.c: fix a bug within a BUG(): inconsistent pte comparison
        docs: update ocfs2-devel mailing list address
        MAINTAINERS: update ocfs2-devel mailing list address
        mm: disable CONFIG_PER_VMA_LOCK until its fixed
        fork: lock VMAs of the parent process when forking
    • fork: lock VMAs of the parent process when forking · fb49c455
      Suren Baghdasaryan authored
      When forking a child process, the parent write-protects anonymous pages
      and COW-shares them with the child being forked using copy_present_pte().
      
      We must not take any concurrent page faults on the source vma's as they
      are being processed, as we expect both the vma and the pte's behind it
      to be stable.  For example, the anon_vma_fork() expects the parent's
      vma->anon_vma to not change during the vma copy.
      
      A concurrent page fault on a page newly marked read-only by the page
      copy might trigger wp_page_copy() and an anon_vma_prepare(vma) on the
      source vma, defeating the anon_vma_clone() that wasn't done because the
      parent vma originally didn't have an anon_vma, but we now might end up
      copying a pte entry for a page that has one.
      
      Before the per-vma lock based changes, the mmap_lock guaranteed
      exclusion with concurrent page faults.  But now we need to do a
      vma_start_write() to make sure no concurrent faults happen on this vma
      while it is being processed.
      
      This fix can potentially regress some fork-heavy workloads.  Kernel
      build time did not show noticeable regression on a 56-core machine while
      a stress test mapping 10000 VMAs and forking 5000 times in a tight loop
      shows ~5% regression.  If such fork time regression is unacceptable,
      disabling CONFIG_PER_VMA_LOCK should restore its performance.  Further
      optimizations are possible if this regression proves to be problematic.
      Suggested-by: David Hildenbrand <david@redhat.com>
      Reported-by: Jiri Slaby <jirislaby@kernel.org>
      Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
      Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
      Reported-by: Jacob Young <jacobly.alt@gmail.com>
      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
      Fixes: 0bff0aae ("x86/mm: try VMA lock-based page fault handling first")
      Cc: stable@vger.kernel.org
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: lock newly mapped VMA which can be modified after it becomes visible · 33313a74
      Suren Baghdasaryan authored
      mmap_region adds a newly created VMA into VMA tree and might modify it
      afterwards before dropping the mmap_lock.  This poses a problem for page
      faults handled under per-VMA locks because they don't take the mmap_lock
      and can stumble on this VMA while it's still being modified.  Currently
      this does not pose a problem since post-addition modifications are done
      only for file-backed VMAs, which are not handled under per-VMA lock.
      However, once support for handling file-backed page faults with per-VMA
      locks is added, this will become a race.
      
      Fix this by write-locking the VMA before inserting it into the VMA tree.
      Other places where a new VMA is added into VMA tree do not modify it
      after the insertion, so do not need the same locking.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: lock a vma before stack expansion · c137381f
      Suren Baghdasaryan authored
      With recent changes necessitating mmap_lock to be held for write while
      expanding a stack, per-VMA locks should follow the same rules and be
      write-locked to prevent page faults into the VMA being expanded. Add
      the necessary locking.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 7fcd473a
      Linus Torvalds authored
      Pull more SCSI updates from James Bottomley:
       "A few late arriving patches that missed the initial pull request. It's
        mostly bug fixes (the dt-bindings is a fix for the initial pull)"
      
      * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: ufs: core: Remove unused function declaration
        scsi: target: docs: Remove tcm_mod_builder.py
        scsi: target: iblock: Quiet bool conversion warning with pr_preempt use
        scsi: dt-bindings: ufs: qcom: Fix ICE phandle
        scsi: core: Simplify scsi_cdl_check_cmd()
        scsi: isci: Fix comment typo
        scsi: smartpqi: Replace one-element arrays with flexible-array members
        scsi: target: tcmu: Replace strlcpy() with strscpy()
        scsi: ncr53c8xx: Replace strlcpy() with strscpy()
        scsi: lpfc: Fix lpfc_name struct packing
    • Merge tag 'i2c-for-6.5-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 84dc5aa3
      Linus Torvalds authored
      Pull more i2c updates from Wolfram Sang:
      
       - xiic patch should have been in the original pull but slipped through
      
       - mpc patch fixes a build regression
      
       - nomadik cleanup
      
      * tag 'i2c-for-6.5-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: mpc: Drop unused variable
        i2c: nomadik: Remove a useless call in the remove function
        i2c: xiic: Don't try to handle more interrupt events after error
    • Merge tag 'hardening-v6.5-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 8fc3b8f0
      Linus Torvalds authored
      Pull hardening fixes from Kees Cook:
      
       - Check for NULL bdev in LoadPin (Matthias Kaehlcke)
      
       - Revert unwanted KUnit FORTIFY build default
      
       - Fix 1-element array causing boot warnings with xhci-hub
      
      * tag 'hardening-v6.5-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        usb: ch9: Replace bmSublinkSpeedAttr 1-element array with flexible array
        Revert "fortify: Allow KUnit test to build without FORTIFY"
        dm: verity-loadpin: Add NULL pointer check for 'bdev' parameter
    • ntb: hw: amd: Fix debugfs_create_dir error checking · bff6efc5
      Anup Sharma authored
      The debugfs_create_dir function returns ERR_PTR in case of error, and the
      only correct way to check whether an error occurred is the 'IS_ERR' inline
      function. This patch replaces the null-comparison with IS_ERR.
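Why a NULL comparison misses the error can be shown with a user-space re-creation of the error-pointer convention. The helpers below imitate the kernel's ERR_PTR/PTR_ERR/IS_ERR idea but are written for this example, and `create_dir_mock()` is a hypothetical stand-in for a function that reports failure through an error pointer; it is not debugfs_create_dir itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *ERR_PTR(long error)
{
    assert(error < 0 && error >= -MAX_ERRNO);
    return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO addresses. */
static bool IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for a function that fails with ERR_PTR. */
static void *create_dir_mock(bool fail)
{
    static int dir;

    return fail ? ERR_PTR(-ENOMEM) : (void *)&dir;
}
```

A failing call returns a non-NULL pointer, so `if (!dir)` never fires, while `if (IS_ERR(dir))` catches it and `PTR_ERR(dir)` recovers the error code.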
      Signed-off-by: Anup Sharma <anupnewsmail@gmail.com>
      Suggested-by: Ivan Orlov <ivan.orlov0322@gmail.com>
      Signed-off-by: Jon Mason <jdmason@kudzu.us>
    • Merge tag 'perf-tools-for-v6.5-2-2023-07-06' of... · c206353d
      Linus Torvalds authored
      Merge tag 'perf-tools-for-v6.5-2-2023-07-06' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next
      
      Pull more perf tools updates from Namhyung Kim:
       "These are remaining changes and fixes for this cycle.
      
        Build:
      
         - Allow generating vmlinux.h from BTF using `make GEN_VMLINUX_H=1`
           and skip if the vmlinux has no BTF.
      
         - Replace deprecated clang -target xxx option by --target=xxx.
      
        perf record:
      
         - Print event attributes with well known type and config symbols in
           the debug output like below:
      
             # perf record -e cycles,cpu-clock -C0 -vv true
             <SNIP>
             ------------------------------------------------------------
             perf_event_attr:
               type                             0 (PERF_TYPE_HARDWARE)
               size                             136
               config                           0 (PERF_COUNT_HW_CPU_CYCLES)
               { sample_period, sample_freq }   4000
               sample_type                      IP|TID|TIME|CPU|PERIOD|IDENTIFIER
               read_format                      ID
               disabled                         1
               inherit                          1
               freq                             1
               sample_id_all                    1
               exclude_guest                    1
             ------------------------------------------------------------
             sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8 = 5
             ------------------------------------------------------------
             perf_event_attr:
               type                             1 (PERF_TYPE_SOFTWARE)
               size                             136
               config                           0 (PERF_COUNT_SW_CPU_CLOCK)
               { sample_period, sample_freq }   4000
               sample_type                      IP|TID|TIME|CPU|PERIOD|IDENTIFIER
               read_format                      ID
               disabled                         1
               inherit                          1
               freq                             1
               sample_id_all                    1
               exclude_guest                    1
      
         - Update AMD IBS event error message since it now supports per-process
           profiling but no privilege filters.
      
             $ sudo perf record -e ibs_op//k -C 0
             Error:
             AMD IBS doesn't support privilege filtering. Try again without
             the privilege modifiers (like 'k') at the end.
      
        perf lock contention:
      
         - Support CSV style output using -x option
      
             $ sudo perf lock con -ab -x, sleep 1
             # output: contended, total wait, max wait, avg wait, type, caller
             19, 194232, 21415, 10222, spinlock, process_one_work+0x1f0
             15, 162748, 23843, 10849, rwsem:R, do_user_addr_fault+0x40e
             4, 86740, 23415, 21685, rwlock:R, ep_poll_callback+0x2d
             1, 84281, 84281, 84281, mutex, iwl_mvm_async_handlers_wk+0x135
             8, 67608, 27404, 8451, spinlock, __queue_work+0x174
             3, 58616, 31125, 19538, rwsem:W, do_mprotect_pkey+0xff
             3, 52953, 21172, 17651, rwlock:W, do_epoll_wait+0x248
             2, 30324, 19704, 15162, rwsem:R, do_madvise+0x3ad
             1, 24619, 24619, 24619, spinlock, rcu_core+0xd4
      
         - Add --output option to save the data to a file so that it is not
           interleaved with other debug messages.
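
The CSV rows shown above can be consumed by a small parser. The following is a minimal sketch, assuming the six-column layout printed in the `# output:` header line; the struct, its field names, and `parse_lock_row()` are hypothetical names chosen for illustration and are not part of perf.

```c
#include <stdio.h>

/* One data row of the CSV output; field names mirror the "# output:" header. */
struct lock_row {
    unsigned long contended;
    unsigned long total_wait;
    unsigned long max_wait;
    unsigned long avg_wait;
    char type[32];    /* e.g. "spinlock", "rwsem:R" */
    char caller[128]; /* e.g. "process_one_work+0x1f0" */
};

/*
 * Parse a single line of comma-separated output.
 * Returns 0 on success, -1 for comment lines, blank lines, or malformed rows.
 */
int parse_lock_row(const char *line, struct lock_row *row)
{
    /* Skip leading blanks so indented copies of the output also parse. */
    while (*line == ' ' || *line == '\t')
        line++;

    if (*line == '#' || *line == '\0' || *line == '\n')
        return -1;

    /* A space in the format matches any amount of whitespace after a comma. */
    int n = sscanf(line, "%lu, %lu, %lu, %lu, %31[^,], %127[^\n]",
                   &row->contended, &row->total_wait,
                   &row->max_wait, &row->avg_wait,
                   row->type, row->caller);

    return n == 6 ? 0 : -1;
}
```

A caller would typically read the saved file line by line with `fgets()` and pass each line to `parse_lock_row()`, ignoring lines for which it returns -1.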
      
        Test:
      
         - Fix event parsing test on ARM, where there is no raw PMU and
           PERF_PMU_CAP_EXTENDED_HW_TYPE is not supported.
      
         - Update the lock contention test case for CSV output.
      
         - Fix a segfault in the daemon command test.
      
        Vendor events (JSON):
      
         - Add has_event() to check if the given event is available on system
           at runtime. On Intel machines, some transaction events may not be
           present when TSC extensions are disabled.
      
         - Update Intel event metrics.
      
        Misc:
      
         - Sort symbols by name using an external array of pointers instead of
           an rbtree node in the symbol. This will save 16 bytes or 24 bytes
           per symbol whether the sorting is actually requested or not.
      
         - Fix unwinding DWARF callstacks using libdw when --symfs option is
           used"
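
The idea of sorting symbols through an external array of pointers, rather than embedding a tree node in every symbol, can be sketched in plain C. This is an illustration only: `struct symbol` here is a simplified stand-in, and `sort_symbols_by_name()` and `find_symbol_by_name()` are hypothetical helpers, not the functions used in perf.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a symbol; note there is no embedded tree node. */
struct symbol {
    unsigned long start;
    const char *name;
};

/* qsort() comparator: elements are pointers to symbols. */
static int cmp_by_name(const void *a, const void *b)
{
    const struct symbol *const *sa = a;
    const struct symbol *const *sb = b;

    return strcmp((*sa)->name, (*sb)->name);
}

/*
 * Build a name-sorted array of pointers into syms[].
 * The array is only allocated when sorting is requested.
 * Returns NULL on allocation failure; the caller frees the result.
 */
struct symbol **sort_symbols_by_name(struct symbol *syms, size_t nr)
{
    struct symbol **idx = malloc(nr * sizeof(*idx));

    if (!idx)
        return NULL;

    for (size_t i = 0; i < nr; i++)
        idx[i] = &syms[i];

    qsort(idx, nr, sizeof(*idx), cmp_by_name);
    return idx;
}

/* bsearch() comparator: the key is a name, the element a symbol pointer. */
static int cmp_key(const void *key, const void *elem)
{
    const struct symbol *const *s = elem;

    return strcmp(key, (*s)->name);
}

/* Binary search in the sorted pointer array; returns NULL when not found. */
struct symbol *find_symbol_by_name(struct symbol **idx, size_t nr,
                                   const char *name)
{
    struct symbol **hit = bsearch(name, idx, nr, sizeof(*idx), cmp_key);

    return hit ? *hit : NULL;
}
```

With this layout the per-symbol structure carries no sorting overhead; the cost of one pointer per symbol is paid only by callers that actually build the index.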
      
      * tag 'perf-tools-for-v6.5-2-2023-07-06' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next: (38 commits)
        perf test: Fix event parsing test when PERF_PMU_CAP_EXTENDED_HW_TYPE isn't supported.
        perf test: Fix event parsing test on Arm
        perf evsel amd: Fix IBS error message
        perf: unwind: Fix symfs with libdw
        perf symbol: Fix uninitialized return value in symbols__find_by_name()
        perf test: Test perf lock contention CSV output
        perf lock contention: Add --output option
        perf lock contention: Add -x option for CSV style output
        perf lock: Remove stale comments
        perf vendor events intel: Update tigerlake to 1.13
        perf vendor events intel: Update skylakex to 1.31
        perf vendor events intel: Update skylake to 57
        perf vendor events intel: Update sapphirerapids to 1.14
        perf vendor events intel: Update icelakex to 1.21
        perf vendor events intel: Update icelake to 1.19
        perf vendor events intel: Update cascadelakex to 1.19
        perf vendor events intel: Update meteorlake to 1.03
        perf vendor events intel: Add rocketlake events/metrics
        perf vendor metrics intel: Make transaction metrics conditional
        perf jevents: Support for has_event function
        ...
      c206353d
    • Linus Torvalds's avatar
      Merge tag 'bitmap-6.5-rc1' of https://github.com/norov/linux · ad8258e8
      Linus Torvalds authored
      Pull bitmap updates from Yury Norov:
       "Fixes for different bitmap pieces:
      
         - lib/test_bitmap: increment failure counter properly
      
           The tests that don't use the expect_eq() macro to determine that a
           test has failed must increment failed_tests explicitly.
      
         - lib/bitmap: drop optimization of bitmap_{from,to}_arr64
      
           bitmap_{from,to}_arr64() optimization is overly optimistic
           on 32-bit LE architectures when it's wired to
           bitmap_copy_clear_tail().
      
         - nodemask: Drop duplicate check in for_each_node_mask()
      
           As the return type of first_node() became unsigned, the node >= 0
           check became unnecessary.
      
         - cpumask: fix function description kernel-doc notation
      
         - MAINTAINERS: Add bits.h and bitfield.h to the BITMAP API record
      
           Add linux/bits.h and linux/bitfield.h for visibility"
      
      * tag 'bitmap-6.5-rc1' of https://github.com/norov/linux:
        MAINTAINERS: Add bitfield.h to the BITMAP API record
        MAINTAINERS: Add bits.h to the BITMAP API record
        cpumask: fix function description kernel-doc notation
        nodemask: Drop duplicate check in for_each_node_mask()
        lib/bitmap: drop optimization of bitmap_{from,to}_arr64
        lib/test_bitmap: increment failure counter properly
      ad8258e8
    • Geert Uytterhoeven's avatar
      lib: dhry: fix sleeping allocations inside non-preemptable section · 8ba388c0
      Geert Uytterhoeven authored
      The Smatch static checker reports the following warnings:
      
          lib/dhry_run.c:38 dhry_benchmark() warn: sleeping in atomic context
          lib/dhry_run.c:43 dhry_benchmark() warn: sleeping in atomic context
      
      Indeed, dhry() does sleeping allocations inside the non-preemptable
      section delimited by get_cpu()/put_cpu().
      
      Fix this by using atomic allocations instead.
      Add error handling, as these atomic allocations may fail.
      
      Link: https://lkml.kernel.org/r/bac6d517818a7cd8efe217c1ad649fffab9cc371.1688568764.git.geert+renesas@glider.be
      Fixes: 13684e96 ("lib: dhry: fix unstable smp_processor_id(_) usage")
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Closes: https://lore.kernel.org/r/0469eb3a-02eb-4b41-b189-de20b931fa56@moroto.mountain
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8ba388c0
    • Andrey Konovalov's avatar
      kasan, slub: fix HW_TAGS zeroing with slub_debug · fdb54d96
      Andrey Konovalov authored
      Commit 946fa0db ("mm/slub: extend redzone check to extra allocated
      kmalloc space than requested") added precise kmalloc redzone poisoning to
      the slub_debug functionality.
      
      However, this commit didn't account for HW_TAGS KASAN fully initializing
      the object via its built-in memory initialization feature.  Even though
      HW_TAGS KASAN memory initialization contains special memory initialization
      handling for when slub_debug is enabled, it does not account for in-object
      slub_debug redzones.  As a result, HW_TAGS KASAN can overwrite these
      redzones and cause false-positive slub_debug reports.
      
      To fix the issue, avoid HW_TAGS KASAN memory initialization when
      slub_debug is enabled altogether.  Implement this by moving the
      __slub_debug_enabled check to slab_post_alloc_hook.  Common slab code
      seems like a more appropriate place for a slub_debug check anyway.
      
      Link: https://lkml.kernel.org/r/678ac92ab790dba9198f9ca14f405651b97c8502.1688561016.git.andreyknvl@google.com
      Fixes: 946fa0db ("mm/slub: extend redzone check to extra allocated kmalloc space than requested")
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reported-by: Will Deacon <will@kernel.org>
      Acked-by: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: kasan-dev@googlegroups.com
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fdb54d96
    • Andrey Konovalov's avatar
      kasan: fix type cast in memory_is_poisoned_n · 05c56e7b
      Andrey Konovalov authored
      Commit bb6e04a1 ("kasan: use internal prototypes matching gcc-13
      builtins") introduced a bug into the memory_is_poisoned_n implementation:
      it effectively removed the cast to a signed integer type after applying
      KASAN_GRANULE_MASK.
      
      As a result, KASAN started failing to properly check memset, memcpy, and
      other similar functions.
      
      Fix the bug by adding the cast back (through an additional signed integer
      variable to make the code more readable).
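
Why the signed cast matters can be shown with a user-space sketch. This is not the kernel code: the function names are hypothetical, the mask value of 7 assumes an 8-byte granule, and the sketch assumes that poisoned memory is marked by a negative shadow byte. Without the cast, the shadow byte is converted to an unsigned type for the comparison, so a negative value becomes a very large number and the check never fires.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 8-byte granules, so the mask is 7. */
#define GRANULE_MASK 7UL

/*
 * Comparison done in an unsigned type. The explicit cast below spells out
 * what the usual arithmetic conversions do implicitly when an unsigned long
 * is compared with a signed 8-bit value: a negative shadow byte turns into
 * a huge unsigned number, so the check returns false for poisoned memory.
 */
bool last_byte_poisoned_unsigned(uintptr_t last_byte, int8_t shadow)
{
    return (last_byte & GRANULE_MASK) >= (unsigned long)shadow;
}

/*
 * Comparison done in a signed type. The masked offset (0..7) is first
 * stored in a signed 8-bit variable, so it compares correctly against a
 * negative shadow byte.
 */
bool last_byte_poisoned_signed(uintptr_t last_byte, int8_t shadow)
{
    int8_t last_accessible_byte = (int8_t)(last_byte & GRANULE_MASK);

    return last_accessible_byte >= shadow;
}
```

For a non-negative shadow byte both variants agree; they differ only when the shadow byte is negative, which is the case the unsigned comparison misses.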
      
      Link: https://lkml.kernel.org/r/8c9e0251c2b8b81016255709d4ec42942dcaf018.1688431866.git.andreyknvl@google.com
      Fixes: bb6e04a1 ("kasan: use internal prototypes matching gcc-13 builtins")
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      05c56e7b