1. 10 May, 2024 2 commits
  2. 06 May, 2024 16 commits
  3. 25 Apr, 2024 11 commits
  4. 16 Apr, 2024 11 commits
    • nilfs2: fix OOB in nilfs_set_de_type · c4a7dc95
      Jeongjun Park authored
      The size of the nilfs_type_by_mode array in the fs/nilfs2/dir.c file
      is defined as "S_IFMT >> S_SHIFT", but the nilfs_set_de_type()
      function, which uses this array, computes the index to read from the
      array as "(mode & S_IFMT) >> S_SHIFT".
      
      static void nilfs_set_de_type(struct nilfs_dir_entry *de, struct inode *inode)
      {
      	umode_t mode = inode->i_mode;
      
      	de->file_type = nilfs_type_by_mode[(mode & S_IFMT)>>S_SHIFT]; // oob
      }
      
      However, when the index is computed this way and the condition
      "(mode & S_IFMT) == S_IFMT" is satisfied, the resulting index is one
      past the last valid element of the array, causing an out-of-bounds
      (OOB) read.  Therefore, the nilfs_type_by_mode array must be resized
      to prevent OOB accesses.
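
      A self-contained userspace sketch (not the kernel patch itself; array
      contents are omitted) showing why the computed index can land exactly
      one past the end of the old array:

      #include <stdio.h>
      #include <sys/stat.h>

      #define S_SHIFT 12

      /* Old size: S_IFMT >> S_SHIFT (15 on Linux).  The fix adds one slot,
       * because the index itself can be 15 when (mode & S_IFMT) == S_IFMT. */
      static unsigned char nilfs_type_by_mode[(S_IFMT >> S_SHIFT) + 1];

      int main(void)
      {
      	mode_t mode = S_IFMT;	/* invalid mode as crafted on disk */
      	size_t index = (mode & S_IFMT) >> S_SHIFT;

      	/* index == 15 == the old array size, i.e. one past the end */
      	printf("index %zu, old size %d, fixed size %zu\n",
      	       index, S_IFMT >> S_SHIFT, sizeof(nilfs_type_by_mode));
      	return 0;
      }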
      
      Link: https://lkml.kernel.org/r/20240415182048.7144-1-konishi.ryusuke@gmail.com
      Reported-by: syzbot+2e22057de05b9f3b30d8@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=2e22057de05b9f3b30d8
      Fixes: 2ba466d7 ("nilfs2: directory entry operations")
      Signed-off-by: Jeongjun Park <aha310510@gmail.com>
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c4a7dc95
    • MAINTAINERS: update Naoya Horiguchi's email address · 8247bf1d
      Naoya Horiguchi authored
      My old NEC address has been removed, so update MAINTAINERS and .mailmap to
      map it to my gmail address.
      
      Link: https://lkml.kernel.org/r/20240412181720.18452-1-nao.horiguchi@gmail.com
      Signed-off-by: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8247bf1d
    • fork: defer linking file vma until vma is fully initialized · 35e35178
      Miaohe Lin authored
      Thorvald reported a WARNING [1], and the root cause is the race below:
      
       CPU 1					CPU 2
       fork					hugetlbfs_fallocate
        dup_mmap				 hugetlbfs_punch_hole
         i_mmap_lock_write(mapping);
         vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
         i_mmap_unlock_write(mapping);
         hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
      					 i_mmap_lock_write(mapping);
         					 hugetlb_vmdelete_list
      					  vma_interval_tree_foreach
      					   hugetlb_vma_trylock_write -- Vma_lock is cleared.
         tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
      					   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
      					 i_mmap_unlock_write(mapping);
      
      hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside
      the i_mmap_rwsem lock while the vma lock can be used at the same time.
      Fix this by deferring the linking of the file vma until the vma is
      fully initialized; vmas must be fully initialized before they can be
      used.
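
      A minimal sketch of the reordered flow in dup_mmap() (abridged and
      illustrative, not the literal diff; tmp is the child vma, mpnt the
      parent vma):

      	/* Finish initializing the child vma first ... */
      	if (is_vm_hugetlb_page(tmp))
      		hugetlb_dup_vma_private(tmp);	/* fresh vma_lock */
      	if (tmp->vm_ops && tmp->vm_ops->open)
      		tmp->vm_ops->open(tmp);
      	/* ... and only then make it visible through the i_mmap tree,
      	 * so hugetlb_vmdelete_list() can never see a half-built vma. */
      	if (file) {
      		i_mmap_lock_write(mapping);
      		vma_interval_tree_insert_after(tmp, mpnt, &mapping->i_mmap);
      		i_mmap_unlock_write(mapping);
      	}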
      
      Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com
      Fixes: 8d9bfb26 ("hugetlb: add vma based lock for pmd sharing")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reported-by: Thorvald Natvig <thorvald@google.com>
      Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peng Zhang <zhangpeng.00@bytedance.com>
      Cc: Tycho Andersen <tandersen@netflix.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      35e35178
    • mm/shmem: inline shmem_is_huge() for disabled transparent hugepages · 1f737846
      Sumanth Korikkar authored
      In order to minimize code size (CONFIG_CC_OPTIMIZE_FOR_SIZE=y), the
      compiler might choose to make a regular (out-of-line) function call
      for shmem_is_huge() instead of inlining it.  When transparent
      hugepages are disabled (CONFIG_TRANSPARENT_HUGEPAGE=n), this can
      cause a compilation error:
      
      mm/shmem.c: In function `shmem_getattr':
      ./include/linux/huge_mm.h:383:27: note: in expansion of macro `BUILD_BUG'
        383 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
            |                           ^~~~~~~~~
      mm/shmem.c:1148:33: note: in expansion of macro `HPAGE_PMD_SIZE'
       1148 |                 stat->blksize = HPAGE_PMD_SIZE;
      
      To prevent the possible error, always inline shmem_is_huge() when
      transparent hugepages are disabled.
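
      A sketch of the idea, assuming the usual config-stub pattern; the
      parameter list here is an assumption, not a quote from the tree:

      #ifdef CONFIG_TRANSPARENT_HUGEPAGE
      bool shmem_is_huge(struct inode *inode, pgoff_t index,
      		   bool shmem_huge_force, struct mm_struct *mm,
      		   unsigned long vm_flags);
      #else
      static __always_inline bool shmem_is_huge(struct inode *inode,
      					  pgoff_t index, bool shmem_huge_force,
      					  struct mm_struct *mm,
      					  unsigned long vm_flags)
      {
      	/* Constant false lets the compiler drop the HPAGE_PMD_SIZE
      	 * branch in shmem_getattr(), so BUILD_BUG() is never expanded. */
      	return false;
      }
      #endif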
      
      Link: https://lkml.kernel.org/r/20240409155407.2322714-1-sumanthk@linux.ibm.com
      Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1f737846
    • mm,page_owner: defer enablement of static branch · 0b2cf0a4
      Oscar Salvador authored
      Kefeng Wang reported that he was seeing some memory leaks with kmemleak
      with page_owner enabled.
      
      The reason is that the page_owner_inited static branch is enabled
      before the stack_list pointer is linked to dummy_stack, leaving a race
      window between the two steps: pages allocated in that window call
      add_stack_record_to_list(), allocating objects and linking them to
      stack_list, but init_page_owner then sets stack_list to point at
      dummy_stack.  Any objects allocated during that window therefore
      become unreferenced and are lost.
      
      Fix this by deferring the enablement of the branch until we have properly
      set up the list.
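
      A sketch of the corrected ordering in init_page_owner() (abridged;
      the register_*_stack() helper names are assumed from the surrounding
      series):

      static int __init init_page_owner(void)
      {
      	register_dummy_stack();
      	register_failure_stack();
      	register_early_stack();
      	/* Link the list head before anyone can walk it ... */
      	stack_list = &dummy_stack;
      	/* ... and only then let add_stack_record_to_list() run. */
      	static_branch_enable(&page_owner_inited);
      	init_early_allocated_pages();
      	return 0;
      }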
      
      Link: https://lkml.kernel.org/r/20240409131715.13632-1-osalvador@suse.de
      Fixes: 4bedfb31 ("mm,page_owner: maintain own list of stack_records structs")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Closes: https://lore.kernel.org/linux-mm/74b147b0-718d-4d50-be75-d6afc801cd24@huawei.com/
      Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0b2cf0a4
    • Squashfs: check the inode number is not the invalid value of zero · 9253c54e
      Phillip Lougher authored
      Syskiller has produced an out-of-bounds access in fill_meta_index().

      That out-of-bounds access is ultimately caused by the inode having an
      inode number with the invalid value of zero, which was not checked.

      The out-of-bounds access arises from the following sequence of
      events:
      
      1. fill_meta_index() is called to allocate (via empty_meta_index())
         and fill a metadata index.  It however suffers a data read error
         and aborts, invalidating the newly returned empty metadata index.
         It does this by setting the inode number of the index to zero,
         which means unused (zero is not a valid inode number).

      2. When fill_meta_index() is subsequently called again on another
         read operation, locate_meta_index() returns the previous index
         because it matches the inode number of 0.  Because this index has
         been returned, it is expected to have been filled, and because it
         hasn't been, an out-of-bounds access is performed.
      
      This patch adds a sanity check that the inode number is not zero when
      the inode is created, returning -EINVAL if it is.
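
      A sketch of the added check at inode creation time (illustrative; the
      variable name ino_number is an assumption):

      	/* Zero means "unused slot" in the meta index cache, so an
      	 * on-disk inode number of zero must never reach
      	 * locate_meta_index(). */
      	if (ino_number == 0)
      		return -EINVAL;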
      
      [phillip@squashfs.org.uk: whitespace fix]
        Link: https://lkml.kernel.org/r/20240409204723.446925-1-phillip@squashfs.org.uk
      Link: https://lkml.kernel.org/r/20240408220206.435788-1-phillip@squashfs.org.uk
      Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
      Reported-by: "Ubisectech Sirius" <bugreport@ubisectech.com>
      Closes: https://lore.kernel.org/lkml/87f5c007-b8a5-41ae-8b57-431e924c5915.bugreport@ubisectech.com/
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9253c54e
    • mm,swapops: update check in is_pfn_swap_entry for hwpoison entries · 07a57a33
      Oscar Salvador authored
      Tony reported that machine check recovery was broken in v6.9-rc1, as
      he was hitting a VM_BUG_ON when injecting uncorrectable memory errors
      into DRAM.

      After some more digging and debugging on his side, he realized that
      this went back to v6.1, with the introduction of commit 0d206b5d
      ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry").  That
      commit, among other things, introduced swp_offset_pfn(), which
      replaced hwpoison_entry_to_pfn().
      
      The patch also introduced a VM_BUG_ON() check for is_pfn_swap_entry(),
      but is_pfn_swap_entry() never got updated to cover hwpoison entries,
      which means that we would hit the VM_BUG_ON whenever swp_offset_pfn()
      was called for such entries in environments with CONFIG_DEBUG_VM set.
      Fix this by updating the check to cover hwpoison entries as well, and
      update the comment while we are at it.
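
      A sketch of the updated helper (the pre-existing checks are
      reproduced from memory and may differ slightly from the tree):

      static inline bool is_pfn_swap_entry(swp_entry_t entry)
      {
      	/* Make sure the swp offset can always store the needed fields */
      	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);

      	/* hwpoison entries also carry a PFN, so cover them too */
      	return is_migration_entry(entry) || is_device_private_entry(entry) ||
      	       is_device_exclusive_entry(entry) || is_hwpoison_entry(entry);
      }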
      
      Link: https://lkml.kernel.org/r/20240407130537.16977-1-osalvador@suse.de
      Fixes: 0d206b5d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reported-by: Tony Luck <tony.luck@intel.com>
      Closes: https://lore.kernel.org/all/Zg8kLSl2yAlA3o5D@agluck-desk3/
      Tested-by: Tony Luck <tony.luck@intel.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: <stable@vger.kernel.org>	[6.1.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      07a57a33
    • mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled · 1983184c
      Miaohe Lin authored
      When I did a hard offline test with hugetlb pages, the deadlock below
      occurred:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      6.8.0-11409-gf6cef5f8 #1 Not tainted
      ------------------------------------------------------
      bash/46904 is trying to acquire lock:
      ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60
      
      but task is already holding lock:
      ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (pcp_batch_high_lock){+.+.}-{3:3}:
             __mutex_lock+0x6c/0x770
             page_alloc_cpu_online+0x3c/0x70
             cpuhp_invoke_callback+0x397/0x5f0
             __cpuhp_invoke_callback_range+0x71/0xe0
             _cpu_up+0xeb/0x210
             cpu_up+0x91/0xe0
             cpuhp_bringup_mask+0x49/0xb0
             bringup_nonboot_cpus+0xb7/0xe0
             smp_init+0x25/0xa0
             kernel_init_freeable+0x15f/0x3e0
             kernel_init+0x15/0x1b0
             ret_from_fork+0x2f/0x50
             ret_from_fork_asm+0x1a/0x30
      
      -> #0 (cpu_hotplug_lock){++++}-{0:0}:
             __lock_acquire+0x1298/0x1cd0
             lock_acquire+0xc0/0x2b0
             cpus_read_lock+0x2a/0xc0
             static_key_slow_dec+0x16/0x60
             __hugetlb_vmemmap_restore_folio+0x1b9/0x200
             dissolve_free_huge_page+0x211/0x260
             __page_handle_poison+0x45/0xc0
             memory_failure+0x65e/0xc70
             hard_offline_page_store+0x55/0xa0
             kernfs_fop_write_iter+0x12c/0x1d0
             vfs_write+0x387/0x550
             ksys_write+0x64/0xe0
             do_syscall_64+0xca/0x1e0
             entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(pcp_batch_high_lock);
                                     lock(cpu_hotplug_lock);
                                     lock(pcp_batch_high_lock);
        rlock(cpu_hotplug_lock);
      
       *** DEADLOCK ***
      
      5 locks held by bash/46904:
       #0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
       #1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
       #2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
       #3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70
       #4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40
      
      stack backtrace:
      CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x68/0xa0
       check_noncircular+0x129/0x140
       __lock_acquire+0x1298/0x1cd0
       lock_acquire+0xc0/0x2b0
       cpus_read_lock+0x2a/0xc0
       static_key_slow_dec+0x16/0x60
       __hugetlb_vmemmap_restore_folio+0x1b9/0x200
       dissolve_free_huge_page+0x211/0x260
       __page_handle_poison+0x45/0xc0
       memory_failure+0x65e/0xc70
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x387/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xca/0x1e0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      RIP: 0033:0x7fc862314887
      Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
      RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887
      RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001
      RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
      R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00
      
      In short, the scenario below breaks the lock dependency chain:
      
       memory_failure
        __page_handle_poison
         zone_pcp_disable -- lock(pcp_batch_high_lock)
         dissolve_free_huge_page
          __hugetlb_vmemmap_restore_folio
           static_key_slow_dec
            cpus_read_lock -- rlock(cpu_hotplug_lock)
      
      Fix this by calling drain_all_pages() instead.
      
      This issue could not occur before commit a6b40850 ("mm: hugetlb:
      replace hugetlb_free_vmemmap_enabled with a static_key"), which
      introduced rlock(cpu_hotplug_lock) in the dissolve_free_huge_page()
      code path while lock(pcp_batch_high_lock) is already held in
      __page_handle_poison().
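
      A sketch of the reworked helper (abridged; the return-value handling
      follows the upstream shape but is not a literal quote):

      static bool __page_handle_poison(struct page *page)
      {
      	int ret;

      	/*
      	 * zone_pcp_disable() can't be used here: it holds
      	 * pcp_batch_high_lock while dissolve_free_huge_page() may take
      	 * cpu_hotplug_lock via static_key_slow_dec().  Drain the pcp
      	 * lists instead before taking the page off the buddy.
      	 */
      	ret = dissolve_free_huge_page(page);
      	if (!ret) {
      		drain_all_pages(page_zone(page));
      		ret = take_page_off_buddy(page);
      	}

      	return ret > 0;
      }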
      
      [linmiaohe@huawei.com: extend comment per Oscar]
      [akpm@linux-foundation.org: reflow block comment]
      Link: https://lkml.kernel.org/r/20240407085456.2798193-1-linmiaohe@huawei.com
      Fixes: a6b40850 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1983184c
    • mm/userfaultfd: allow hugetlb change protection upon poison entry · c5977c95
      Peter Xu authored
      After UFFDIO_POISON, there can be two kinds of hugetlb pte markers, either
      the POISON one or UFFD_WP one.
      
      Allow change protection to run on a poisoned marker just like the
      !hugetlb cases, ignoring the marker regardless of the permission.
      
      Here the two bits are mutually exclusive.  For example, when
      installing a poisoned entry it must not already be UFFD_WP (pte_none()
      is checked before such an install).  It also means that if UFFD_WP is
      set, the POISON bit must not be set.  This makes sense because UFFD_WP
      is a bit that reflects permission, and permissions do not apply if the
      pte is poisoned and destined to sigbus.
      
      So here we simply check whether the uffd_wp bit is set first, and do
      nothing otherwise.
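
      An illustrative fragment of the marker handling in
      hugetlb_change_protection() (not the literal diff):

      	} else if (unlikely(is_pte_marker(pte))) {
      		/*
      		 * uffd_wp and poison markers are mutually exclusive,
      		 * so checking uffd_wp first naturally ignores poison
      		 * markers: permissions do not apply to a pte destined
      		 * to sigbus.
      		 */
      		if (pte_marker_uffd_wp(pte) && uffd_wp_resolve)
      			/* Safe to modify directly (non-present -> none). */
      			huge_pte_clear(mm, address, ptep, psize);
      	}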
      
      Attach the Fixes tag to the UFFDIO_POISON work, as before that it
      should not be possible to have a poison entry for hugetlb (e.g.,
      hugetlb doesn't do swap, so there is no chance of swapin errors).
      
      Link: https://lkml.kernel.org/r/20240405231920.1772199-1-peterx@redhat.com
      Link: https://lore.kernel.org/r/000000000000920d5e0615602dd1@google.com
      Fixes: fc71884a ("mm: userfaultfd: add new UFFDIO_POISON ioctl")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reported-by: syzbot+b07c8ac8eee3d4d8440f@syzkaller.appspotmail.com
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: <stable@vger.kernel.org>	[6.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c5977c95
    • mm,page_owner: fix printing of stack records · 74017458
      Oscar Salvador authored
      When the seq_* code sees that its buffer overflowed, it re-allocates a
      bigger one and calls the seq_operations->start() callback again.
      stack_start() naively thought that if it got called again, the old
      record had already been printed, so it returned the next object, but
      that is not true.

      The consequence is that every time stack_stop() -> stack_start() gets
      called because a bigger buffer was needed, stack_start() skips
      entries, and those are never printed.
      
      Fix it by not advancing to the next object in stack_start().
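
      A sketch of the fixed callback (abridged; the smp_load_acquire()
      pairing is assumed from the surrounding series):

      static void *stack_start(struct seq_file *m, loff_t *ppos)
      {
      	struct stack *stack;

      	if (*ppos == -1UL)
      		return NULL;

      	if (!*ppos) {
      		stack = smp_load_acquire(&stack_list);
      		m->private = stack;
      	} else {
      		/* Called again for a bigger buffer: re-return the
      		 * current record instead of skipping to the next. */
      		stack = m->private;
      	}

      	return stack;
      }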
      
      Link: https://lkml.kernel.org/r/20240404070702.2744-5-osalvador@suse.de
      Fixes: 765973a0 ("mm,page_owner: display all stacks and their count")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      74017458
    • mm,page_owner: fix accounting of pages when migrating · 718b1f33
      Oscar Salvador authored
      Upon migration, newly allocated pages are given the handle of the old
      pages.  This is problematic because it means that for the stack which
      allocated the old page, we will be subtracting both the old page and
      the new one when those pages are freed, creating an accounting
      imbalance.

      There is an interest in keeping it that way: otherwise the output
      would be biased towards migration stacks should those operations
      occur often, which is not really helpful.
      
      The link from the new page to the old stack is performed by calling
      __update_page_owner_handle() in __folio_copy_owner().  The only thing
      left is to link the migrate stack to the old page, so the old page is
      subtracted from the migrate stack, thereby avoiding any possible
      imbalance.
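
      A conceptual sketch of the balancing step in __folio_copy_owner()
      (field names are assumptions drawn from the series, not a quote):

      	depot_stack_handle_t migrate_handle;

      	/* Remember which stack performed the migration allocation. */
      	migrate_handle = new_page_owner->handle;
      	/* The new page keeps reporting the original allocating stack,
      	 * so its eventual free decrements that stack's count ... */
      	new_page_owner->handle = old_page_owner->handle;
      	/* ... while the old page is linked to the migrate stack, so
      	 * its free decrements the migrate stack and the per-stack
      	 * counts stay balanced. */
      	old_page_owner->handle = migrate_handle;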
      
      Link: https://lkml.kernel.org/r/20240404070702.2744-4-osalvador@suse.de
      Fixes: 217b2119 ("mm,page_owner: implement the tracking of the stacks count")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      718b1f33