1. 22 Oct, 2021 1 commit
  2. 20 Oct, 2021 9 commits
    • Merge tag 'ceph-for-5.15-rc7' of git://github.com/ceph/ceph-client · 2f111a6f
      Linus Torvalds authored
      Pull ceph fixes from Ilya Dryomov:
       "Two important filesystem fixes, marked for stable.
      
        The blocklisted superblocks issue was particularly annoying because
        for inexperienced users it essentially required a reboot to establish a
        new functional mount in that scenario"
      
      * tag 'ceph-for-5.15-rc7' of git://github.com/ceph/ceph-client:
        ceph: fix handling of "meta" errors
        ceph: skip existing superblocks that are blocklisted or shut down when mounting
    • Merge tag 'dma-mapping-5.15-2' of git://git.infradead.org/users/hch/dma-mapping · 515dcc2e
      Linus Torvalds authored
      Pull dma-mapping fixes from Christoph Hellwig:
      
       - fix more dma-debug fallout (Gerald Schaefer, Hamza Mahfooz)
      
       - fix a kerneldoc warning (Logan Gunthorpe)
      
      * tag 'dma-mapping-5.15-2' of git://git.infradead.org/users/hch/dma-mapping:
        dma-debug: teach add_dma_entry() about DMA_ATTR_SKIP_CPU_SYNC
        dma-debug: fix sg checks in debug_dma_map_sg()
        dma-mapping: fix the kerneldoc for dma_map_sgtable()
    • Merge tag 'sound-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 8e37395c
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "Again it became bigger than wished, unfortunately, as this contains
        quite a few ASoC fixes that came up a bit late. It also includes yet
        more HD- and USB-audio quirks: I decided to merge them now, as those
        are for stable, and we'll need them sooner or later.
      
        Although the volumes are a bit high, all changes are device-specific
        (and reasonably small) fixes, so it should be safe for the late rc"
      
      * tag 'sound-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: usb-audio: Fix microphone sound on Jieli webcam.
        ALSA: hda/realtek: Fixes HP Spectre x360 15-eb1xxx speakers
        ALSA: usb-audio: Provide quirk for Sennheiser GSP670 Headset
        ALSA: hda/realtek: Add quirk for Clevo PC50HS
        ALSA: usb-audio: add Schiit Hel device to quirk table
        ASoC: wm8960: Fix clock configuration on slave mode
        ASoC: cs42l42: Ensure 0dB full scale volume is used for headsets
        ASoC: soc-core: fix null-ptr-deref in snd_soc_del_component_unlocked()
        ASoC: codec: wcd938x: Add irq config support
        ASoC: DAPM: Fix missing kctl change notifications
        ASoC: Intel: bytcht_es8316: Utilize dev_err_probe() to avoid log saturation
        ASoC: Intel: bytcht_es8316: Switch to use gpiod_get_optional()
        ASoC: Intel: bytcht_es8316: Use temporary variable for struct device
        ASoC: Intel: bytcht_es8316: Get platform data via dev_get_platdata()
        ASoC: wcd938x: Fix jack detection issue
        ASoC: nau8824: Fix headphone vs headset, button-press detection no longer working
        ASoC: cs4341: Add SPI device ID table
        ASoC: pcm179x: Add missing entries SPI to device ID table
        ASoC: fsl_xcvr: Fix channel swap issue with ARC
        ASoC: pcm512x: Mend accesses to the I2S_1 and I2S_2 registers
    • Merge tag 'audit-pr-20211019' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit · 6da52dea
      Linus Torvalds authored
      Pull audit fix from Paul Moore:
       "One small audit patch to add a pointer NULL check"
      
      * tag 'audit-pr-20211019' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
        audit: fix possible null-pointer dereference in audit_filter_rules
    • Merge tag 'trace-v5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · fc9b2893
      Linus Torvalds authored
      Pull tracing fix from Steven Rostedt:
       "Recursion fix for tracing.
      
        While cleaning up some of the tracing recursion protection logic, I
        discovered a scenario that the current design would miss, and would
        allow an infinite recursion. Removing an optimization trick that
        opened the hole fixes the issue and cleans up the code as well"
      
      * tag 'trace-v5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        tracing: Have all levels of checks prevent recursion
    • Merge tag 'nios2_fixes_for_v5.15_part2' of... · 1e599774
      Linus Torvalds authored
      Merge tag 'nios2_fixes_for_v5.15_part2' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux
      
      Pull nios2 fix from Dinh Nguyen:
      
       - Renamed CTL_STATUS to CTL_FSTATUS to fix a redefined warning
      
      * tag 'nios2_fixes_for_v5.15_part2' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux:
        NIOS2: irqflags: rename a redefined register name
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 0afe64be
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "Tools:
         - kvm_stat: do not show halt_wait_ns since it is not a cumulative statistic
      
        x86:
         - clean ups and fixes for bus lock vmexit and lazy allocation of rmaps
         - two fixes for SEV-ES (one more coming as soon as I get reviews)
         - fix for static_key underflow
      
        ARM:
         - Properly refcount pages used as a concatenated stage-2 PGD
         - Fix missing unlock when detecting the use of MTE+VM_SHARED"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: SEV-ES: reduce ghcb_sa_len to 32 bits
        KVM: VMX: Remove redundant handling of bus lock vmexit
        KVM: kvm_stat: do not show halt_wait_ns
        KVM: x86: WARN if APIC HW/SW disable static keys are non-zero on unload
        Revert "KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU RESET"
        KVM: SEV-ES: Set guest_state_protected after VMSA update
        KVM: X86: fix lazy allocation of rmaps
        KVM: SEV-ES: fix length of string I/O
        KVM: arm64: Release mmap_lock when using VM_SHARED with MTE
        KVM: arm64: Report corrupted refcount at EL2
        KVM: arm64: Fix host stage-2 PGD refcount
        KVM: s390: Function documentation fixes
    • powerpc/smp: do not decrement idle task preempt count in CPU offline · 787252a1
      Nathan Lynch authored
      With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
      get:
      
      BUG: scheduling while atomic: swapper/1/0/0x00000000
      no locks held by swapper/1/0.
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
      Call Trace:
       dump_stack_lvl+0xac/0x108
       __schedule_bug+0xac/0xe0
       __schedule+0xcf8/0x10d0
       schedule_idle+0x3c/0x70
       do_idle+0x2d8/0x4a0
       cpu_startup_entry+0x38/0x40
       start_secondary+0x2ec/0x3a0
       start_secondary_prolog+0x10/0x14
      
      This is because powerpc's arch_cpu_idle_dead() decrements the idle task's
      preempt count, for reasons explained in commit a7c2bb82 ("powerpc:
      Re-enable preemption before cpu_die()"), specifically "start_secondary()
      expects a preempt_count() of 0."
      
      However, since commit 2c669ef6 ("powerpc/preempt: Don't touch the idle
      task's preempt_count during hotplug") and commit f1a0a376 ("sched/core:
      Initialize the idle task with preemption disabled"), that justification no
      longer holds.
      
      The idle task isn't supposed to re-enable preemption, so remove the
      vestigial preempt_enable() from the CPU offline path.
      
      Tested with pseries and powernv in qemu, and pseries on PowerVM.
      
      Fixes: 2c669ef6 ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug")
      Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211015173902.2278118-1-nathanl@linux.ibm.com
    • powerpc/idle: Don't corrupt back chain when going idle · 496c5fe2
      Michael Ellerman authored
      In isa206_idle_insn_mayloss() we store various registers into the stack
      red zone, which is allowed.
      
      However inside the IDLE_STATE_ENTER_SEQ_NORET macro we save r2 again,
      to 0(r1), which corrupts the stack back chain.
      
      We used to do the same in isa206_idle_insn_mayloss() itself, but we
      fixed that in 73287caa ("powerpc64/idle: Fix SP offsets when saving
      GPRs"), however we missed that the macro also corrupts the back chain.
      
      Corrupting the back chain is bad for debuggability but doesn't
      necessarily cause a bug.
      
      However we recently changed the stack handling in some KVM code, and it
      now relies on the stack back chain being valid when it returns. The
      corruption causes that code to return with r1 pointing somewhere in
      kernel data, at some point LR is restored from the stack and we branch
      to NULL or somewhere else invalid.
      
      Only affects Power8 hosts running KVM guests, with dynamic_mt_modes
      enabled (which it is by default).
      
      The fixes tag below points to the commit that changed the KVM stack
      handling, exposing this bug. The actual corruption of the back chain has
      always existed since 948cf67c ("powerpc: Add NAP mode support on
      Power7 in HV mode").
      
      Fixes: 9b4416c5 ("KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211020094826.3222052-1-mpe@ellerman.id.au
  3. 19 Oct, 2021 23 commits
    • Merge branch 'akpm' (patches from Andrew) · d9abdee5
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "19 patches.
      
        Subsystems affected by this patch series: mm (userfaultfd, migration,
        memblock, mempolicy, slub, secretmem, and thp), ocfs2, binfmt, vfs,
        and misc"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mailmap: add Andrej Shadura
        mm/thp: decrease nr_thps in file's mapping on THP split
        mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem()
        vfs: check fd has read access in kernel_read_file_from_fd()
        elfcore: correct reference to CONFIG_UML
        mm, slub: fix incorrect memcg slab count for bulk free
        mm, slub: fix potential use-after-free in slab_debugfs_fops
        mm, slub: fix potential memoryleak in kmem_cache_open()
        mm, slub: fix mismatch between reconstructed freelist depth and cnt
        mm, slub: fix two bugs in slab_debug_trace_open()
        mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind()
        memblock: check memory total_size
        ocfs2: mount fails with buffer overflow in strlen
        ocfs2: fix data corruption after conversion from inline format
        mm/migrate: fix CPUHP state to update node demotion order
        mm/migrate: add CPU hotplug to demotion #ifdef
        mm/migrate: optimize hotplug-time demotion order updates
        userfaultfd: fix a race between writeprotect and exit_mmap()
        mm/userfaultfd: selftests: fix memory corruption with thp enabled
    • ceph: fix handling of "meta" errors · 1bd85aa6
      Jeff Layton authored
      Currently, we check the wb_err too early for directories, before all of
      the unsafe child requests have been waited on. In order to fix that we
      need to check the mapping->wb_err later nearer to the end of ceph_fsync.
      
      We also have an overly-complex method for tracking errors after
      blocklisting. The errors recorded in cleanup_session_requests go to a
      completely separate field in the inode, but we end up reporting them the
      same way we would for any other error (in fsync).
      
      There's no real benefit to tracking these errors in two different
      places, since the only reporting mechanism for them is in fsync, and
      we'd need to advance them both every time.
      
      Given that, we can just remove i_meta_err, and convert the places that
      used it to just use mapping->wb_err instead. That also fixes
      the original problem by ensuring that we do a check_and_advance of the
      wb_err at the end of the fsync op.
      
      Cc: stable@vger.kernel.org
      URL: https://tracker.ceph.com/issues/52864
      Reported-by: Patrick Donnelly <pdonnell@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Xiubo Li <xiubli@redhat.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • ceph: skip existing superblocks that are blocklisted or shut down when mounting · 98d0a6fb
      Jeff Layton authored
      Currently when mounting, we may end up finding an existing superblock
      that corresponds to a blocklisted MDS client. This means that the new
      mount ends up being unusable.
      
      If we've found an existing superblock with a client that is already
      blocklisted, and the client is not configured to recover on its own,
      fail the match. Ditto if the superblock has been forcibly unmounted.
      
      While we're in here, also rename "other" to the more conventional "fsc".
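
The matching rule described above can be sketched as a small userspace C model. The struct, its fields, and `demo_sb_matches()` are hypothetical stand-ins chosen for illustration; they are not the ceph client's actual data structures or compare function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-superblock client state. */
struct demo_client {
    bool blocklisted;   /* the MDS has blocklisted this client */
    bool auto_recover;  /* client is configured to recover on its own */
    bool shut_down;     /* superblock was forcibly unmounted */
    int  fsid;          /* stand-in for "same cluster and mount options" */
};

/*
 * Return true if an existing superblock may be reused for a new mount.
 * A blocklisted client that will not recover by itself, or a superblock
 * that has been shut down, never matches, so the caller falls through
 * to creating a fresh superblock.
 */
static bool demo_sb_matches(const struct demo_client *existing, int wanted_fsid)
{
    assert(existing != NULL);

    if (existing->blocklisted && !existing->auto_recover)
        return false;
    if (existing->shut_down)
        return false;
    return existing->fsid == wanted_fsid;
}
```

In this model a blocklisted client with `auto_recover` set still matches, which mirrors the "not configured to recover on its own" condition in the description.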
      
      Cc: stable@vger.kernel.org
      URL: https://bugzilla.redhat.com/show_bug.cgi?id=1901499
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Xiubo Li <xiubli@redhat.com>
      Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • mailmap: add Andrej Shadura · 362d5dfc
      Andrej Shadura authored
      Add a mapping for my old work email for BelDisplayTech to the personal
      email, and make sure the Collabora email has the correct spelling of the
      first name.
      
      Link: https://lkml.kernel.org/r/20210917091016.30232-1-andrew.shadura@collabora.co.uk
      Signed-off-by: Andrej Shadura <andrew.shadura@collabora.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: decrease nr_thps in file's mapping on THP split · 1ca7554d
      Marek Szyprowski authored
      Decrease nr_thps counter in file's mapping to ensure that the page cache
      won't be dropped excessively on file write access if page has been
      already split.
      
      I've tried a test scenario: run a big binary, let the kernel remap it
      with THPs, then force a THP split with /sys/kernel/debug/split_huge_pages.
      During any further open of that binary with O_RDWR or O_WRONLY, the kernel
      drops the page cache for it because of the non-zero nr_thps counter.
      
      Link: https://lkml.kernel.org/r/20211012120237.2600-1-m.szyprowski@samsung.com
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Fixes: 09d91cda ("mm,thp: avoid writes to file with THP in pagecache")
      Fixes: 06d3eff6 ("mm/thp: fix node page state in split_huge_page_to_list()")
      Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: <sfoon.kim@samsung.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem() · 79f9bc58
      Sean Christopherson authored
      Check for a NULL page->mapping before dereferencing the mapping in
      page_is_secretmem(), as the page's mapping can be nullified while gup()
      is running, e.g.  by reclaim or truncation.
      
        BUG: kernel NULL pointer dereference, address: 0000000000000068
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 6 PID: 4173897 Comm: CPU 3/KVM Tainted: G        W
        RIP: 0010:internal_get_user_pages_fast+0x621/0x9d0
        Code: <48> 81 7a 68 80 08 04 bc 0f 85 21 ff ff 8 89 c7 be
        RSP: 0018:ffffaa90087679b0 EFLAGS: 00010046
        RAX: ffffe3f37905b900 RBX: 00007f2dd561e000 RCX: ffffe3f37905b934
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffe3f37905b900
        ...
        CR2: 0000000000000068 CR3: 00000004c5898003 CR4: 00000000001726e0
        Call Trace:
         get_user_pages_fast_only+0x13/0x20
         hva_to_pfn+0xa9/0x3e0
         try_async_pf+0xa1/0x270
         direct_page_fault+0x113/0xad0
         kvm_mmu_page_fault+0x69/0x680
         vmx_handle_exit+0xe1/0x5d0
         kvm_arch_vcpu_ioctl_run+0xd81/0x1c70
         kvm_vcpu_ioctl+0x267/0x670
         __x64_sys_ioctl+0x83/0xa0
         do_syscall_64+0x56/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
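
The order of checks described in this commit (read the pointer once, test it for NULL, only then dereference) can be illustrated with a minimal userspace sketch. All `demo_` types and names below are hypothetical simplifications, not the kernel's `struct page` or `struct address_space`, and the model is single-threaded, so it shows only the check order and not the concurrent clearing of the field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct demo_aops { int unused; };
struct demo_mapping { const struct demo_aops *a_ops; };
struct demo_page { struct demo_mapping *mapping; };

static const struct demo_aops demo_secretmem_aops = { 0 };

/*
 * Read page->mapping once into a local and test it for NULL before
 * dereferencing it. Using the local copy for both the test and the
 * dereference avoids checking one value and then using another.
 */
static bool demo_page_is_secretmem(const struct demo_page *page)
{
    const struct demo_mapping *mapping;

    assert(page != NULL);
    mapping = page->mapping;

    if (mapping == NULL)
        return false;
    return mapping->a_ops == &demo_secretmem_aops;
}
```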
      
      Link: https://lkml.kernel.org/r/20211007231502.3552715-1-seanjc@google.com
      Fixes: 1507f512 ("mm: introduce memfd_secret system call to create "secret" memory areas")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reported-by: Darrick J. Wong <djwong@kernel.org>
      Reported-by: Stephen <stephenackerman16@gmail.com>
      Tested-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: check fd has read access in kernel_read_file_from_fd() · 032146cd
      Matthew Wilcox (Oracle) authored
      If we open a file without read access and then pass the fd to a syscall
      whose implementation calls kernel_read_file_from_fd(), we get a warning
      from __kernel_read():
      
              if (WARN_ON_ONCE(!(file->f_mode & FMODE_READ)))
      
      This currently affects both finit_module() and kexec_file_load(), but it
      could affect other syscalls in the future.
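
A userspace analogue of this kind of check is sketched below, assuming a POSIX system. It is not the kernel fix itself: `demo_fd_is_readable()` and `demo_read_from_fd()` are hypothetical helpers that use `fcntl(F_GETFL)` and `O_ACCMODE` to reject a descriptor opened without read access before any read is attempted.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Report whether fd was opened with read access. fcntl(F_GETFL)
 * returns the file status flags; the access mode is extracted with
 * the O_ACCMODE mask.
 */
static bool demo_fd_is_readable(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    int acc;

    if (flags == -1)
        return false;           /* invalid descriptor */
    acc = flags & O_ACCMODE;
    return acc == O_RDONLY || acc == O_RDWR;
}

/* Hypothetical reader that rejects unreadable descriptors up front. */
static ssize_t demo_read_from_fd(int fd, void *buf, size_t len)
{
    assert(buf != NULL);

    if (!demo_fd_is_readable(fd)) {
        errno = EBADF;
        return -1;
    }
    return read(fd, buf, len);
}
```

Doing the check in the shared helper, rather than in each caller, means every user of the helper gets the same behaviour.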
      
      Link: https://lkml.kernel.org/r/20211007220110.600005-1-willy@infradead.org
      Fixes: b844f0ec ("vfs: define kernel_copy_file_from_fd()")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: Hao Sun <sunhao.th@gmail.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Mimi Zohar <zohar@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • elfcore: correct reference to CONFIG_UML · b0e90128
      Lukas Bulwahn authored
      Commit 6e7b64b9 ("elfcore: fix building with clang") introduces
      special handling for two architectures, ia64 and User Mode Linux.
      However, the wrong name, i.e., CONFIG_UM, for the intended Kconfig
      symbol for User-Mode Linux was used.
      
      Although the directory for User Mode Linux is ./arch/um, the Kconfig
      symbol for this architecture is called CONFIG_UML.
      
      Luckily, ./scripts/checkkconfigsymbols.py warns on non-existing configs:
      
        UM
        Referencing files: include/linux/elfcore.h
        Similar symbols: UML, NUMA
      
      Correct the name of the config to the intended one.
      
      [akpm@linux-foundation.org: fix um/x86_64, per Catalin]
        Link: https://lkml.kernel.org/r/20211006181119.2851441-1-catalin.marinas@arm.com
        Link: https://lkml.kernel.org/r/YV6pejGzLy5ppEpt@arm.com
      
      Link: https://lkml.kernel.org/r/20211006082209.417-1-lukas.bulwahn@gmail.com
      Fixes: 6e7b64b9 ("elfcore: fix building with clang")
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Barret Rhoden <brho@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, slub: fix incorrect memcg slab count for bulk free · 3ddd6026
      Miaohe Lin authored
      kmem_cache_free_bulk() will call memcg_slab_free_hook() for all objects
      when doing bulk free.  So we shouldn't call memcg_slab_free_hook() again
      for bulk free to avoid incorrect memcg slab count.
      
      Link: https://lkml.kernel.org/r/20210916123920.48704-6-linmiaohe@huawei.com
      Fixes: d1b2cf6c ("mm: memcg/slab: uncharge during kmem_cache_free_bulk()")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, slub: fix potential use-after-free in slab_debugfs_fops · 67823a54
      Miaohe Lin authored
      When sysfs_slab_add() fails, we shouldn't call debugfs_slab_add() for s
      because s will be freed soon, and slab_debugfs_fops would use s later,
      leading to a use-after-free.
      
      Link: https://lkml.kernel.org/r/20210916123920.48704-5-linmiaohe@huawei.com
      Fixes: 64dd6849 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, slub: fix potential memoryleak in kmem_cache_open() · 9037c576
      Miaohe Lin authored
      In the error path, the random_seq of the slub cache might be leaked.  Fix
      this by using __kmem_cache_release() to release all the relevant resources.
      
      Link: https://lkml.kernel.org/r/20210916123920.48704-4-linmiaohe@huawei.com
      Fixes: 210e7a43 ("mm: SLUB freelist randomization")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, slub: fix mismatch between reconstructed freelist depth and cnt · 899447f6
      Miaohe Lin authored
      If an object's reuse is delayed, it will be excluded from the
      reconstructed freelist.  But we forgot to adjust the cnt accordingly, so
      there will be a mismatch between the reconstructed freelist depth and cnt.
      This will lead to free_debug_processing() complaining about the freelist
      count or an incorrect slub inuse count.
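
The bookkeeping problem described here can be shown with a small self-contained model. This is a sketch under simplifying assumptions, not the slub code: `struct demo_obj`, its `delayed` flag, and both functions are hypothetical, and the sketch keeps the original list order for readability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object header: a freelist link plus a "reuse delayed" mark. */
struct demo_obj {
    struct demo_obj *next;
    bool delayed;       /* e.g. held back instead of being freed now */
};

/*
 * Rebuild the list in *head without the delayed objects and lower *cnt
 * for each object dropped, so the count always equals the list depth.
 * Returns the number of objects that were excluded.
 */
static int demo_reconstruct_freelist(struct demo_obj **head, int *cnt)
{
    struct demo_obj *new_head = NULL, **tail = &new_head;
    struct demo_obj *obj;
    int dropped = 0;

    assert(head != NULL && cnt != NULL);
    obj = *head;

    while (obj != NULL) {
        struct demo_obj *next = obj->next;

        obj->next = NULL;
        if (obj->delayed) {
            dropped++;
            (*cnt)--;           /* the adjustment that was missing */
        } else {
            *tail = obj;        /* append to the rebuilt list */
            tail = &obj->next;
        }
        obj = next;
    }
    *head = new_head;
    return dropped;
}

/* Count the objects reachable from head. */
static int demo_depth(const struct demo_obj *head)
{
    int n = 0;

    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

Without the `(*cnt)--` line, a list of three objects with one delayed would be rebuilt with depth 2 while the count stayed at 3, which is the mismatch the commit describes.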
      
      Link: https://lkml.kernel.org/r/20210916123920.48704-3-linmiaohe@huawei.com
      Fixes: c3895391 ("kasan, slub: fix handling of kasan_slab_free hook")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, slub: fix two bugs in slab_debug_trace_open() · 2127d225
      Miaohe Lin authored
      Patch series "Fixups for slub".
      
      This series contains various bug fixes for slub.  We fix a memory leak,
      a use-after-free, a NULL pointer dereference and so on in slub.  More
      details can be found in the respective changelogs.
      
      This patch (of 5):
      
      It's possible that __seq_open_private() will return NULL, so we should
      check the result before using it to avoid dereferencing a NULL pointer.
      Also, in the error paths we forgot to release the private buffer via
      seq_release_private(), so memory will leak in these paths.
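
Both fixes follow a common open-path pattern, sketched here in userspace C with `calloc()`/`free()` standing in for the seq_file helpers. The names `demo_private`, `demo_trace_open()`, `demo_trace_release()` and the `fail_table` test hook are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical private state attached to an open file. */
struct demo_private {
    int *table;         /* second allocation that may fail */
};

/*
 * Open-style helper: check the first allocation for NULL before use,
 * and release it on the later error path instead of leaking it.
 * fail_table is a test hook that simulates the second allocation failing.
 * Returns 0 on success or -ENOMEM on failure.
 */
static int demo_trace_open(struct demo_private **out, int fail_table)
{
    struct demo_private *priv;

    assert(out != NULL);

    priv = calloc(1, sizeof(*priv));
    if (priv == NULL)
        return -ENOMEM;         /* never dereference a failed allocation */

    priv->table = fail_table ? NULL : calloc(16, sizeof(*priv->table));
    if (priv->table == NULL) {
        free(priv);             /* error path: release the private buffer */
        return -ENOMEM;
    }

    *out = priv;
    return 0;
}

/* Matching release helper for the success path. */
static void demo_trace_release(struct demo_private *priv)
{
    free(priv->table);
    free(priv);
}
```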
      
      Link: https://lkml.kernel.org/r/20210916123920.48704-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210916123920.48704-2-linmiaohe@huawei.com
      Fixes: 64dd6849 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind() · 6d2aec9e
      Eric Dumazet authored
      syzbot reported access to uninitialized memory in mbind() [1].

      The issue came with commit bda420b9 ("numa balancing: migrate on fault
      among multiple bound nodes").

      This commit added a new bit in MPOL_MODE_FLAGS, but only checked for the
      valid combination (MPOL_F_NUMA_BALANCING can only be used with MPOL_BIND)
      in do_set_mempolicy().

      This patch moves the check into sanitize_mpol_flags() so that it is also
      used by mbind().
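
The idea of moving a validity check into the shared sanitizing helper, so that every entry point gets it, can be sketched as follows. The constants and functions are illustrative `demo_` stand-ins with made-up values, not the kernel's MPOL_* definitions or its actual sanitize_mpol_flags() implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative values only; the real constants live in the kernel headers. */
enum demo_mode { DEMO_MPOL_DEFAULT, DEMO_MPOL_BIND, DEMO_MPOL_LOCAL };
#define DEMO_F_NUMA_BALANCING (1 << 13)
#define DEMO_MODE_FLAGS       DEMO_F_NUMA_BALANCING

/*
 * Split a combined mode word into mode and flags and reject invalid
 * combinations in one shared place, so every caller gets the same check.
 * Returns 0 on success or -EINVAL.
 */
static int demo_sanitize_flags(int *mode, unsigned short *flags)
{
    assert(mode != NULL && flags != NULL);

    *flags = (unsigned short)(*mode & DEMO_MODE_FLAGS);
    *mode &= ~DEMO_MODE_FLAGS;

    if ((unsigned int)*mode > DEMO_MPOL_LOCAL)
        return -EINVAL;
    /* The balancing flag is only meaningful together with the bind mode. */
    if ((*flags & DEMO_F_NUMA_BALANCING) && *mode != DEMO_MPOL_BIND)
        return -EINVAL;
    return 0;
}

/* Two hypothetical entry points that both go through the shared check. */
static int demo_set_mempolicy(int mode)
{
    unsigned short flags;

    return demo_sanitize_flags(&mode, &flags);
}

static int demo_mbind(int mode)
{
    unsigned short flags;

    return demo_sanitize_flags(&mode, &flags);
}
```

With the check living in only one of the entry points, the other would accept the invalid combination and carry on with a policy object that was never fully set up.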
      
        [1]
        BUG: KMSAN: uninit-value in __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
         __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
         mpol_equal include/linux/mempolicy.h:105 [inline]
         vma_merge+0x4a1/0x1e60 mm/mmap.c:1190
         mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811
         do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333
         kernel_mbind mm/mempolicy.c:1483 [inline]
         __do_sys_mbind mm/mempolicy.c:1490 [inline]
         __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
         __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
         do_syscall_x64 arch/x86/entry/common.c:51 [inline]
         do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        Uninit was created at:
         slab_alloc_node mm/slub.c:3221 [inline]
         slab_alloc mm/slub.c:3230 [inline]
         kmem_cache_alloc+0x751/0xff0 mm/slub.c:3235
         mpol_new mm/mempolicy.c:293 [inline]
         do_mbind+0x912/0x15f0 mm/mempolicy.c:1289
         kernel_mbind mm/mempolicy.c:1483 [inline]
         __do_sys_mbind mm/mempolicy.c:1490 [inline]
         __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
         __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
         do_syscall_x64 arch/x86/entry/common.c:51 [inline]
         do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        =====================================================
        Kernel panic - not syncing: panic_on_kmsan set ...
        CPU: 0 PID: 15049 Comm: syz-executor.0 Tainted: G    B             5.15.0-rc2-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0x1ff/0x28e lib/dump_stack.c:106
         dump_stack+0x25/0x28 lib/dump_stack.c:113
         panic+0x44f/0xdeb kernel/panic.c:232
         kmsan_report+0x2ee/0x300 mm/kmsan/report.c:186
         __msan_warning+0xd7/0x150 mm/kmsan/instrumentation.c:208
         __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
         mpol_equal include/linux/mempolicy.h:105 [inline]
         vma_merge+0x4a1/0x1e60 mm/mmap.c:1190
         mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811
         do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333
         kernel_mbind mm/mempolicy.c:1483 [inline]
         __do_sys_mbind mm/mempolicy.c:1490 [inline]
         __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
         __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
         do_syscall_x64 arch/x86/entry/common.c:51 [inline]
         do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Link: https://lkml.kernel.org/r/20211001215630.810592-1-eric.dumazet@gmail.com
      Fixes: bda420b9 ("numa balancing: migrate on fault among multiple bound nodes")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d2aec9e
    • Peng Fan's avatar
      memblock: check memory total_size · 5173ed72
      Peng Fan authored
      mem=[X][G|M] is broken on the ARM64 platform: there are cases where
      type.cnt is 1 but total_size is not 0, because the regions have been
      merged into one.  Checking only 'cnt' is therefore not enough;
      total_size should be used, otherwise the boot argument 'mem=[X][G|M]'
      no longer works.
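
      The difference between the two checks can be sketched in user space
      as follows.  This is only an illustration: the struct and both helper
      names are hypothetical stand-ins, not the kernel's memblock code, and
      the "cnt stays 1 for an empty list" behaviour is modelled as an
      assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a memblock region list. */
struct region_list {
    size_t cnt;          /* number of regions; assumed to be 1 even when empty */
    uint64_t total_size; /* sum of the sizes of all regions */
};

/* Count-based check: any list with a single region looks empty. */
static bool looks_empty_by_cnt(const struct region_list *t)
{
    assert(t != NULL);
    return t->cnt <= 1;
}

/* Size-based check: the list is empty only if it holds no memory. */
static bool is_empty_by_size(const struct region_list *t)
{
    assert(t != NULL);
    return t->total_size == 0;
}
```

      With one merged 1 GiB region (cnt == 1, total_size != 0), the
      count-based helper reports "empty" while the size-based helper does
      not, which is the case the commit message describes.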
      
      Link: https://lkml.kernel.org/r/20210930024437.32598-1-peng.fan@oss.nxp.com
      Fixes: e888fa7b ("memblock: Check memory add/cap ordering")
      Signed-off-by: Peng Fan <peng.fan@nxp.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5173ed72
    • Valentin Vidic's avatar
      ocfs2: mount fails with buffer overflow in strlen · b15fa922
      Valentin Vidic authored
      Starting with kernel 5.11 built with CONFIG_FORTIFY_SOURCE, mounting an
      ocfs2 filesystem with either o2cb or pcmk cluster stack fails with the
      trace below.  The problem seems to be that strings for cluster stack and
      cluster name are not guaranteed to be null terminated in the disk
      representation, while strlcpy assumes that the source string is always
      null terminated.  This causes a read outside of the source string
      triggering the buffer overflow detection.
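
      One way to copy such a field without reading past its end is
      sketched below in user-space C.  The helper name and its signature
      are hypothetical, and this is not necessarily how the patch itself
      fixes the problem; it only shows a bounded scan of the source
      followed by explicit termination of the destination.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a fixed-width field that may lack a terminating NUL into dst and
 * always NUL-terminate dst.  The source is never read beyond
 * src[src_size - 1].  Returns the number of bytes copied, excluding the NUL.
 */
static size_t copy_fixed_field(char *dst, size_t dst_size,
                               const char *src, size_t src_size)
{
    const char *nul;
    size_t len;

    assert(dst != NULL && src != NULL && dst_size > 0);

    /* Bounded search for a terminator inside the field. */
    nul = memchr(src, '\0', src_size);
    len = nul ? (size_t)(nul - src) : src_size;

    /* Truncate so that the terminator always fits in dst. */
    if (len > dst_size - 1)
        len = dst_size - 1;

    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

      Called on a 4-byte field holding 'o', '2', 'c', 'b' with no
      terminator, the helper yields the string "o2cb" in a large enough
      destination and a truncated, still terminated string in a smaller
      one.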
      
        detected buffer overflow in strlen
        ------------[ cut here ]------------
        kernel BUG at lib/string.c:1149!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 1 PID: 910 Comm: mount.ocfs2 Not tainted 5.14.0-1-amd64 #1
          Debian 5.14.6-2
        RIP: 0010:fortify_panic+0xf/0x11
        ...
        Call Trace:
         ocfs2_initialize_super.isra.0.cold+0xc/0x18 [ocfs2]
         ocfs2_fill_super+0x359/0x19b0 [ocfs2]
         mount_bdev+0x185/0x1b0
         legacy_get_tree+0x27/0x40
         vfs_get_tree+0x25/0xb0
         path_mount+0x454/0xa20
         __x64_sys_mount+0x103/0x140
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Link: https://lkml.kernel.org/r/20210929180654.32460-1-vvidic@valentin-vidic.from.hr
      Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b15fa922
    • Jan Kara's avatar
      ocfs2: fix data corruption after conversion from inline format · 5314454e
      Jan Kara authored
      Commit 6dbf7bb5 ("fs: Don't invalidate page buffers in
      block_write_full_page()") uncovered a latent bug in ocfs2 conversion
      from inline inode format to a normal inode format.
      
      The code in ocfs2_convert_inline_data_to_extents() attempts to zero out
      the whole cluster allocated for file data by grabbing, zeroing, and
      dirtying all pages covering this cluster.  However these pages are
      beyond i_size, thus writeback code generally ignores these dirty pages
      and no blocks were ever actually zeroed on the disk.
      
      This oversight was fixed by commit 693c241a ("ocfs2: No need to zero
      pages past i_size.") for the standard ocfs2 write path, but the inline
      conversion path was apparently forgotten; the commit log also explains
      why the zeroing is not actually needed.
      
      After commit 6dbf7bb5, things became worse as writeback code stopped
      invalidating buffers on pages beyond i_size and thus these pages end up
      with clean PageDirty bit but with buffers attached to these pages being
      still dirty.  So when a file is converted from inline format, then
      writeback triggers, and then the file is grown so that these pages
      become valid, the invalid dirtiness state is preserved,
      mark_buffer_dirty() does nothing on these pages (buffers are already
      dirty) but page is never written back because it is clean.  So data
      written to these pages is lost once pages are reclaimed.
      
      Simple reproducer for the problem is:
      
        xfs_io -f -c "pwrite 0 2000" -c "pwrite 2000 2000" -c "fsync" \
          -c "pwrite 4000 2000" ocfs2_file
      
      After unmounting and mounting the fs again, you can observe that end of
      'ocfs2_file' has lost its contents.
      
      Fix the problem by not doing the pointless zeroing during conversion
      from inline format similarly as in the standard write path.
      
      [akpm@linux-foundation.org: fix whitespace, per Joseph]
      
      Link: https://lkml.kernel.org/r/20210930095405.21433-1-jack@suse.cz
      Fixes: 6dbf7bb5 ("fs: Don't invalidate page buffers in block_write_full_page()")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Gang He <ghe@suse.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: "Markov, Andrey" <Markov.Andrey@Dell.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5314454e
    • Huang Ying's avatar
      mm/migrate: fix CPUHP state to update node demotion order · a6a0251c
      Huang Ying authored
      The node demotion order needs to be updated during CPU hotplug, because
      whether a NUMA node has CPUs may influence the demotion order.  The
      update function should be called during CPU online/offline after the
      node_states[N_CPU] has been updated.  That is done in
      CPUHP_AP_ONLINE_DYN during CPU online and in CPUHP_MM_VMSTAT_DEAD during
      CPU offline.  But in commit 884a6e5d ("mm/migrate: update node
      demotion order on hotplug events"), the function to update node demotion
      order is called in CPUHP_AP_ONLINE_DYN during CPU online/offline.  This
      doesn't satisfy the order requirement.
      
      For example, there are 4 CPUs (P0, P1, P2, P3) in 2 sockets (P0, P1 in S0
      and P2, P3 in S1), the demotion order is
      
       - S0 -> NUMA_NO_NODE
       - S1 -> NUMA_NO_NODE
      
      After P2 and P3 are offlined, because S1 has no CPU now, the demotion
      order should have been changed to
      
       - S0 -> S1
       - S1 -> NUMA_NO_NODE
      
      but it isn't changed, because the order updating callback for CPU
      hotplug doesn't see the new nodemask.  After that, if P1 is offlined,
      the demotion order is changed to the expected order as above.
      
      So in this patch, we added CPUHP_AP_MM_DEMOTION_ONLINE and
      CPUHP_MM_DEMOTION_DEAD to be called after CPUHP_AP_ONLINE_DYN and
      CPUHP_MM_VMSTAT_DEAD during CPU online and offline, and register the
      update function on them.
      
      Link: https://lkml.kernel.org/r/20210929060351.7293-1-ying.huang@intel.com
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6a0251c
    • Dave Hansen's avatar
      mm/migrate: add CPU hotplug to demotion #ifdef · 76af6a05
      Dave Hansen authored
      Once upon a time, the node demotion updates were driven solely by memory
      hotplug events.  But now, there are handlers for both CPU and memory
      hotplug.
      
      However, the #ifdef around the code checks only memory hotplug.  A
      system that has HOTPLUG_CPU=y but MEMORY_HOTPLUG=n would miss CPU
      hotplug events.
      
      Update the #ifdef around the common code.  Add memory and CPU-specific
      #ifdefs for their handlers.  These memory/CPU #ifdefs avoid unused
      function warnings when their Kconfig option is off.
      
      [arnd@arndb.de: rework hotplug_memory_notifier() stub]
        Link: https://lkml.kernel.org/r/20211013144029.2154629-1-arnd@kernel.org
      
      Link: https://lkml.kernel.org/r/20210924161255.E5FE8F7E@davehans-spike.ostc.intel.com
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76af6a05
    • Dave Hansen's avatar
      mm/migrate: optimize hotplug-time demotion order updates · 295be91f
      Dave Hansen authored
      Patch series "mm/migrate: 5.15 fixes for automatic demotion", v2.
      
      This contains two fixes for the "automatic demotion" code which was
      merged into 5.15:
      
       * Fix memory hotplug performance regression by
         suppressing any real action on irrelevant hotplug events.
      
       * Ensure CPU hotplug handler is registered when memory hotplug
         is disabled.
      
      This patch (of 2):
      
      == tl;dr ==
      
      Automatic demotion opted for a simple, lazy approach to handling hotplug
      events.  This noticeably slows down memory hotplug[1].  Optimize away
      updates to the demotion order when memory hotplug events should have no
      effect.
      
      This has no effect on CPU hotplug.  There is no known problem on the CPU
      side and any work there will be in a separate series.
      
      == Background ==
      
      Automatic demotion is a memory migration strategy to ensure that new
      allocations have room in faster memory tiers on tiered memory systems.
      The kernel maintains an array (node_demotion[]) to drive these
      migrations.
      
      The node_demotion[] path is calculated by starting at nodes with CPUs
      and then "walking" to nodes with memory.  Only hotplug events which
      online or offline a node with memory (N_ONLINE) or CPUs (N_CPU) will
      actually affect the migration order.
      
      == Problem ==
      
      However, the current code is lazy.  It completely regenerates the
      migration order on *any* CPU or memory hotplug event.  The logic was
      that these events are extremely rare and that the overhead from
      indiscriminate order regeneration is minimal.
      
      Part of the update logic involves a synchronize_rcu(), which is a pretty
      big hammer.  Its overhead was large enough to be detected by some 0day
      tests that watch memory hotplug performance[1].
      
      == Solution ==
      
      Add a new helper (node_demotion_topo_changed()) which can differentiate
      between superfluous and impactful hotplug events.  Skip the expensive
      update operation for superfluous events.
      
      == Aside: Locking ==
      
      It took me a few moments to declare the locking to be safe enough for
      node_demotion_topo_changed() to work.  It all hinges on the memory
      hotplug lock:
      
      During memory hotplug events, 'mem_hotplug_lock' is held for write.
      This ensures that two memory hotplug events can not be called
      simultaneously.
      
      CPU hotplug has a similar lock (cpuhp_state_mutex) which also provides
      mutual exclusion between CPU hotplug events.  In addition, the demotion
      code acquires and holds the mem_hotplug_lock for read during its CPU
      hotplug handlers.  This provides mutual exclusion between the demotion
      memory hotplug callbacks and the CPU hotplug callbacks.
      
      This effectively allows the migration target generation code to be
      treated as if it is single-threaded.
      
      1. https://lore.kernel.org/all/20210905135932.GE15026@xsang-OptiPlex-9020/
      
      Link: https://lkml.kernel.org/r/20210924161251.093CCD06@davehans-spike.ostc.intel.com
      Link: https://lkml.kernel.org/r/20210924161253.D7673E31@davehans-spike.ostc.intel.com
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      295be91f
    • Nadav Amit's avatar
      userfaultfd: fix a race between writeprotect and exit_mmap() · cb185d5f
      Nadav Amit authored
      A race is possible when a process exits, its VMAs are removed by
      exit_mmap() and at the same time userfaultfd_writeprotect() is called.
      
      The race was detected by KASAN on a development kernel, but it appears
      to be possible on vanilla kernels as well.
      
      Use mmget_not_zero() to prevent the race as done in other userfaultfd
      operations.
      
      Link: https://lkml.kernel.org/r/20210921200247.25749-1-namit@vmware.com
      Fixes: 63b2d417 ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Tested-by: Li Wang <liwang@redhat.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb185d5f
    • Peter Xu's avatar
      mm/userfaultfd: selftests: fix memory corruption with thp enabled · 8913970c
      Peter Xu authored
      In RHEL's gating selftests we've encountered memory corruption in the
      uffd event test even with upstream kernel:
      
              # ./userfaultfd anon 128 4
              nr_pages: 32768, nr_pages_per_cpu: 32768
              bounces: 3, mode: rnd racing read, userfaults: 6240 missing (6240) 14729 wp (14729)
              bounces: 2, mode: racing read, userfaults: 1444 missing (1444) 28877 wp (28877)
              bounces: 1, mode: rnd read, userfaults: 6055 missing (6055) 14699 wp (14699)
              bounces: 0, mode: read, userfaults: 82 missing (82) 25196 wp (25196)
              testing uffd-wp with pagemap (pgsize=4096): done
              testing uffd-wp with pagemap (pgsize=2097152): done
              testing events (fork, remap, remove): ERROR: nr 32427 memory corruption 0 1 (errno=0, line=963)
              ERROR: faulting process failed (errno=0, line=1117)
      
      It can be easily reproduced when global thp is enabled, which is the
      default for RHEL.
      
      It's also known as a side effect of commit 0db282ba ("selftest: use
      mmap instead of posix_memalign to allocate memory", 2021-07-23), which
      is imho right in itself to use mmap() to make sure the addresses will be
      untagged even on arm.
      
      The problem is, for each test we allocate buffers using two
      allocate_area() calls.  We assumed these two buffers won't affect each
      other, however they could, because mmap() could have found that the two
      buffers are near each other and having the same VMA flags, so they got
      merged into one VMA.
      
      It won't be a big problem if thp is not enabled, but when thp is
      aggressively enabled it means that initializing the src buffer could
      accidentally set up part of the dest buffer too when there's a shared THP
      that overlaps the two regions.  Then some of the dest buffer can't be
      trapped by userfaultfd missing mode, which causes the memory
      corruption described above.
      
      To fix it, do release_pages() after initializing the src buffer.
      
      Since the previous two release_pages() calls are after
      uffd_test_ctx_clear() which will unmap all the buffers anyway (which is
      stronger than release pages, as unmap also tears down pgtables), drop
      them as they shouldn't really do anything useful.
      
      We can mark the Fixes tag upon 0db282ba as it's reported to only
      happen there, however the real "Fixes" IMHO should be 8ba6e864, as
      before that commit we'll always do explicit release_pages() before
      registration of uffd, and 8ba6e864 changed that logic by adding
      extra unmap/map and we didn't release the pages at the right place.
      Meanwhile I don't have a solid clue anyway on whether posix_memalign()
      could always avoid triggering this bug, hence it's safer to attach this
      fix to commit 8ba6e864.
      
      Link: https://lkml.kernel.org/r/20210923232512.210092-1-peterx@redhat.com
      Fixes: 8ba6e864 ("userfaultfd/selftests: reinitialize test context in each test")
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1994931
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reported-by: Li Wang <liwan@redhat.com>
      Tested-by: Li Wang <liwang@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8913970c
    • Marco Giunta's avatar
      ALSA: usb-audio: Fix microphone sound on Jieli webcam. · 29664923
      Marco Giunta authored
      When a Jieli Technology USB Webcam is connected, the video part works
      well, but the mic sound is sped up. In dmesg there are messages
      about rates that differ from the runtime rates, warnings about volume
      resolution and lastly, the log is filled, every 5 seconds, with
      retire_capture_urb error messages.
      
      The mic works only when ep packet size is set to wMaxPacketSize (normal
      sound and no more retire_capture_urb error messages). Skipping reading
      the sample rate fixes the messages about different rates, and forcing a
      volume resolution fixes the warnings about volume range. I have
      arbitrarily chosen the value (16): I read in a comment that there should
      be no more than 255 levels, so 4096 (max volume) / 16 = 0-255.
      Signed-off-by: Marco Giunta <giun7a@gmail.com>
      Link: https://lore.kernel.org/r/20211018162552.12082-1-giun7a@gmail.com
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      29664923
  4. 18 Oct, 2021 7 commits
    • Gaosheng Cui's avatar
      audit: fix possible null-pointer dereference in audit_filter_rules · 6e3ee990
      Gaosheng Cui authored
      Fix possible null-pointer dereference in audit_filter_rules.
      
      audit_filter_rules() error: we previously assumed 'ctx' could be null
      
      Cc: stable@vger.kernel.org
      Fixes: bf361231 ("audit: add saddr_fam filter field")
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      6e3ee990
    • Steven Rostedt (VMware)'s avatar
      tracing: Have all levels of checks prevent recursion · ed65df63
      Steven Rostedt (VMware) authored
      While writing an email explaining the "bit = 0" logic for a discussion on
      making ftrace_test_recursion_trylock() disable preemption, I discovered a
      path that makes the "not do the logic if bit is zero" unsafe.
      
      The recursion logic is done in hot paths like the function tracer. Thus,
      any code executed causes noticeable overhead. Thus, tricks are done to try
      to limit the amount of code executed. This included the recursion testing
      logic.
      
      Having recursion testing is important, as there are many paths that can
      end up in an infinite recursion cycle when tracing every function in the
      kernel. Thus protection is needed to prevent that from happening.
      
      Because it is OK to recurse due to different running context levels (e.g.
      an interrupt preempts a trace, and then a trace occurs in the interrupt
      handler), a set of bits are used to know which context one is in (normal,
      softirq, irq and NMI). If a recursion occurs in the same level, it is
      prevented*.
      
      Then there are infrastructure levels of recursion as well. When more than
      one callback is attached to the same function to trace, it calls a loop
      function to iterate over all the callbacks. Both the callbacks and the
      loop function have recursion protection. The callbacks use the
      "ftrace_test_recursion_trylock()" which has a "function" set of context
      bits to test, and the loop function calls the internal
      trace_test_and_set_recursion() directly, with an "internal" set of bits.
      
      If an architecture does not implement all the features supported by ftrace
      then the callbacks are never called directly, and the loop function is
      called instead, which will implement the features of ftrace.
      
      Since both the loop function and the callbacks do recursion protection, it
      seemed unnecessary to do it in both locations. Thus, a trick was made
      to have the internal set of recursion bits at a more significant bit
      location than the function bits. Then, if any of the higher bits were set,
      the logic of the function bits could be skipped, as any new recursion
      would first have to go through the loop function.
      
      This is true for architectures that do not support all the ftrace
      features, because all functions being traced must first go through the
      loop function before going to the callbacks. But this is not true for
      architectures that support all the ftrace features. That's because the
      loop function could be called due to two callbacks attached to the same
      function, but then a recursion function inside the callback could be
      called that does not share any other callback, and it will be called
      directly.
      
      i.e.
      
       traced_function_1: [ more than one callback tracing it ]
         call loop_func
      
       loop_func:
         trace_recursion set internal bit
         call callback
      
       callback:
         trace_recursion [ skipped because internal bit is set, return 0 ]
         call traced_function_2
      
       traced_function_2: [ only traced by above callback ]
         call callback
      
       callback:
         trace_recursion [ skipped because internal bit is set, return 0 ]
         call traced_function_2
      
       [ wash, rinse, repeat, BOOM! out of shampoo! ]
      
      Thus, the "bit == 0 skip" trick is not safe, unless the loop function is
      called for all functions.
      
      Since we want to encourage architectures to implement all ftrace features,
      having them slow down due to this extra logic may encourage the
      maintainers to update to the latest ftrace features. And because this
      logic is only safe for them, remove it completely.
      
       [*] There is one layer of recursion that is allowed, and that is to allow
           for the transition between interrupt context (normal -> softirq ->
           irq -> NMI), because a trace may occur before the context update is
           visible to the trace recursion logic.
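
      The "always test, never skip" rule can be sketched in user-space C
      as below.  This is a deliberately simplified model, not the kernel's
      trace_test_and_set_recursion(): the bit layout, the names, the
      global mask (per-task in the kernel) and the return convention
      (0 on recursion) are all assumptions, and the one-layer transition
      allowance from the footnote is left out.

```c
#include <assert.h>

/* Context levels named in the commit message. */
enum ctx { CTX_NORMAL, CTX_SOFTIRQ, CTX_IRQ, CTX_NMI, CTX_MAX };

/* Hypothetical layout: "function" bits low, "internal" bits above them. */
#define FUNC_BIT(c)     (1u << (c))
#define INTERNAL_BIT(c) (1u << (CTX_MAX + (c)))

/* Recursion state; global here to keep the sketch small. */
static unsigned int recursion_mask;

/*
 * Try to take a recursion bit.  Returns the bit on success and 0 if the
 * bit is already held, i.e. a recursion at the same level was detected.
 * The test is done unconditionally, whatever other bits are set.
 */
static unsigned int recursion_trylock(unsigned int bit)
{
    if (recursion_mask & bit)
        return 0;
    recursion_mask |= bit;
    return bit;
}

/* Release a bit previously returned by recursion_trylock(). */
static void recursion_unlock(unsigned int bit)
{
    assert(recursion_mask & bit);
    recursion_mask &= ~bit;
}
```

      In the scenario from the message, the loop function takes
      INTERNAL_BIT(CTX_NORMAL) and the callback then takes
      FUNC_BIT(CTX_NORMAL); when the callback is re-entered directly
      through traced_function_2, the second attempt on FUNC_BIT(CTX_NORMAL)
      fails and the cycle stops.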
      
      Link: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/
      Link: https://lkml.kernel.org/r/20211018154412.09fcad3c@gandalf.local.home
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Jisheng Zhang <jszhang@kernel.org>
      Cc: 王贇 <yun.wang@linux.alibaba.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: edc15caf ("tracing: Avoid unnecessary multiple recursion checks")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      ed65df63
    • Paolo Bonzini's avatar
      KVM: SEV-ES: reduce ghcb_sa_len to 32 bits · 9f1ee7b1
      Paolo Bonzini authored
      The size of the GHCB scratch area is limited to 16 KiB (GHCB_SCRATCH_AREA_LIMIT),
      so there is no need for it to be a u64.  This fixes a build error on 32-bit
      systems:
      
      i686-linux-gnu-ld: arch/x86/kvm/svm/sev.o: in function `sev_es_string_io':
      sev.c:(.text+0x110f): undefined reference to `__udivdi3'
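
      The link error arises because a 64-bit division on a 32-bit target
      is compiled into a call to a libgcc helper (__udivdi3) that the
      kernel does not provide.  A minimal sketch of the alternative, using
      a 32-bit length, follows; the function and macro names are
      hypothetical and only mirror the 16 KiB limit mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the 16 KiB scratch area limit from the commit message. */
#define SCRATCH_LIMIT (16u * 1024u)

/*
 * Number of items of item_size bytes that fit in scratch_len bytes.
 * Both operands are 32 bits wide, so the division is a plain 32-bit
 * divide and needs no 64-bit helper routine on 32-bit targets.
 */
static uint32_t items_per_scratch(uint32_t scratch_len, uint32_t item_size)
{
    assert(item_size != 0 && scratch_len <= SCRATCH_LIMIT);
    return scratch_len / item_size;
}
```

      Since the length can never exceed 16 KiB, narrowing it to 32 bits
      loses no range.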
      
      Cc: stable@vger.kernel.org
      Fixes: 019057bd ("KVM: SEV-ES: fix length of string I/O")
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9f1ee7b1
    • Hao Xiang's avatar
      KVM: VMX: Remove redundant handling of bus lock vmexit · d61863c6
      Hao Xiang authored
      Hardware may or may not set exit_reason.bus_lock_detected on BUS_LOCK
      VM-Exits. Dealing with KVM_RUN_X86_BUS_LOCK in handle_bus_lock_vmexit
      could be redundant when exit_reason.basic is EXIT_REASON_BUS_LOCK.
      
      We can remove redundant handling of bus lock vmexit. Unconditionally set
      exit_reason.bus_lock_detected in handle_bus_lock_vmexit(), and deal with
      KVM_RUN_X86_BUS_LOCK only in vmx_handle_exit().
      Signed-off-by: Hao Xiang <hao.xiang@linux.alibaba.com>
      Message-Id: <1634299161-30101-1-git-send-email-hao.xiang@linux.alibaba.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d61863c6
    • Christian Borntraeger's avatar
      KVM: kvm_stat: do not show halt_wait_ns · 01c7d267
      Christian Borntraeger authored
      Similar to commit 111d0bda ("tools/kvm_stat: Exempt time-based
      counters"), we should not show timer values in kvm_stat. Remove the new
      halt_wait_ns.
      
      Fixes: 87bcc5fa ("KVM: stats: Add halt_wait_ns stats for all architectures")
      Cc: Jing Zhang <jingzhangos@google.com>
      Cc: Stefan Raspl <raspl@de.ibm.com>
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: Stefan Raspl <raspl@linux.ibm.com>
      Message-Id: <20211006121724.4154-1-borntraeger@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      01c7d267
    • Sean Christopherson's avatar
      KVM: x86: WARN if APIC HW/SW disable static keys are non-zero on unload · 9139a7a6
      Sean Christopherson authored
      WARN if the static keys used to track if any vCPU has disabled its APIC
      are left elevated at module exit.  Unlike the underflow case, nothing in
      the static key infrastructure will complain if a key is left elevated,
      and because an elevated key only affects performance, nothing in KVM will
      fail if either key is improperly incremented.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211013003554.47705-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9139a7a6
    • Sean Christopherson's avatar
      Revert "KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU RESET" · f7d8a19f
      Sean Christopherson authored
      Revert a change to open code bits of kvm_lapic_set_base() when emulating
      APIC RESET to fix an apic_hw_disabled underflow bug due to arch.apic_base
      and apic_hw_disabled being unsynchronized when the APIC is created.  If
      kvm_arch_vcpu_create() fails after creating the APIC, kvm_free_lapic()
      will see the initialized-to-zero vcpu->arch.apic_base and decrement
      apic_hw_disabled without KVM ever having incremented apic_hw_disabled.
      
      Using kvm_lapic_set_base() in kvm_lapic_reset() is also desirable for a
      potential future where KVM supports RESET outside of vCPU creation, in
      which case all the side effects of kvm_lapic_set_base() are needed, e.g.
      to handle the transition from x2APIC => xAPIC.
      
      Alternatively, KVM could temporarily increment apic_hw_disabled (and call
      kvm_lapic_set_base() at RESET), but that's a waste of cycles and would
      impact the performance of other vCPUs and VMs.  The other subtle side
      effect is that updating the xAPIC ID needs to be done at RESET regardless
      of whether the APIC was previously enabled, i.e. kvm_lapic_reset() needs
      an explicit call to kvm_apic_set_xapic_id() regardless of whether or not
      kvm_lapic_set_base() also performs the update.  That makes stuffing the
      enable bit at vCPU creation slightly more palatable, as doing so affects
      only the apic_hw_disabled key.
      
      Opportunistically tweak the comment to explicitly call out the connection
      between vcpu->arch.apic_base and apic_hw_disabled, and add a comment to
      call out the need to always do kvm_apic_set_xapic_id() at RESET.
      
      Underflow scenario:
      
        kvm_vm_ioctl() {
          kvm_vm_ioctl_create_vcpu() {
            kvm_arch_vcpu_create() {
              if (something_went_wrong)
                goto fail_free_lapic;
              /* vcpu->arch.apic_base is initialized when something_went_wrong is false. */
              kvm_vcpu_reset() {
                kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) {
                  vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
                }
              }
              return 0;
            fail_free_lapic:
              kvm_free_lapic() {
                /* vcpu->arch.apic_base is not yet initialized when something_went_wrong is true. */
                if (!(vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE))
                  static_branch_slow_dec_deferred(&apic_hw_disabled); // <= underflow bug.
              }
              return r;
            }
          }
        }
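
The underflow scenario above can be reproduced with a small userspace model. This is a sketch under stated assumptions, not KVM code: hw_disabled_count stands in for the apic_hw_disabled key, APICBASE_ENABLE stands in for MSR_IA32_APICBASE_ENABLE, and struct model_vcpu and the model_* and simulate_* functions are hypothetical names. The stuff_enable_bit flag models setting the enable bit when the APIC is created, as the message describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for MSR_IA32_APICBASE_ENABLE (bit 11 of the APIC base MSR). */
#define APICBASE_ENABLE (1ULL << 11)

/* Hypothetical model of the apic_hw_disabled key as a plain counter. */
static int hw_disabled_count;

struct model_vcpu {
    uint64_t apic_base;
};

/* Model of APIC creation; optionally stuff the enable bit up front. */
static void model_create_lapic(struct model_vcpu *vcpu, bool stuff_enable_bit)
{
    assert(vcpu != NULL);
    vcpu->apic_base = stuff_enable_bit ? APICBASE_ENABLE : 0;
}

/* Model of APIC teardown: decrement if the APIC looks hardware-disabled. */
static void model_free_lapic(struct model_vcpu *vcpu)
{
    assert(vcpu != NULL);
    if (!(vcpu->apic_base & APICBASE_ENABLE))
        hw_disabled_count--;
}

/*
 * vCPU creation that fails after the APIC is created but before RESET
 * runs, so nothing ever incremented the counter.
 * Returns the final counter value.
 */
static int simulate_failed_create(bool stuff_enable_bit)
{
    struct model_vcpu vcpu;

    hw_disabled_count = 0;
    model_create_lapic(&vcpu, stuff_enable_bit);
    /* something_went_wrong: skip RESET and jump to the failure path. */
    model_free_lapic(&vcpu);
    return hw_disabled_count;
}
```

With stuff_enable_bit false, the free path sees a zero apic_base and the counter drops to -1 with no prior increment; with it true, the counter stays at 0.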
      
      This (mostly) reverts commit 42122123.
      
      Fixes: 42122123 ("KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU RESET")
      Reported-by: syzbot+9fc046ab2b0cf295a063@syzkaller.appspotmail.com
      Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211013003554.47705-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f7d8a19f