1. 08 Apr, 2023 4 commits
    • Merge tag 'io_uring-6.3-2023-04-06' of git://git.kernel.dk/linux · d3f05a4c
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
       "Just two minor fixes for provided buffers - one where we could
        potentially leak a buffer, and one where the returned value was
        off-by-one in some cases"
      
      * tag 'io_uring-6.3-2023-04-06' of git://git.kernel.dk/linux:
        io_uring: fix memory leak when removing provided buffers
        io_uring: fix return value when removing provided buffers
    • Merge tag 'dma-mapping-6.3-2023-04-08' of git://git.infradead.org/users/hch/dma-mapping · 973ad544
      Linus Torvalds authored
      Pull dma-mapping fix from Christoph Hellwig:
      
       - fix a braino in the swiotlb alignment check fix (Petr Tesarik)
      
      * tag 'dma-mapping-6.3-2023-04-08' of git://git.infradead.org/users/hch/dma-mapping:
        swiotlb: fix a braino in the alignment check fix
    • Merge tag 'trace-v6.3-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 1a8a804a
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
       "A couple more minor fixes:
      
         - Reset direct->addr back to its original value on error in updating
           the direct trampoline code
      
         - Make lastcmd_mutex static"
      
      * tag 'trace-v6.3-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        tracing/synthetic: Make lastcmd_mutex static
        ftrace: Fix issue that 'direct->addr' not restored in modify_ftrace_direct()
    • Merge tag 'mm-hotfixes-stable-2023-04-07-16-23' of... · 6fda0bb8
      Linus Torvalds authored
      Merge tag 'mm-hotfixes-stable-2023-04-07-16-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
      
      Pull MM fixes from Andrew Morton:
       "28 hotfixes.
      
        23 are cc:stable and the other five address issues which were
        introduced during this merge cycle.
      
        20 are for MM and the remainder are for other subsystems"
      
      * tag 'mm-hotfixes-stable-2023-04-07-16-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (28 commits)
        maple_tree: fix a potential concurrency bug in RCU mode
        maple_tree: fix get wrong data_end in mtree_lookup_walk()
        mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()
        nilfs2: fix sysfs interface lifetime
        mm: take a page reference when removing device exclusive entries
        mm: vmalloc: avoid warn_alloc noise caused by fatal signal
        nilfs2: initialize "struct nilfs_binfo_dat"->bi_pad field
        nilfs2: fix potential UAF of struct nilfs_sc_info in nilfs_segctor_thread()
        zsmalloc: document freeable stats
        zsmalloc: document new fullness grouping
        fsdax: force clear dirty mark if CoW
        mm/hugetlb: fix uffd wr-protection for CoW optimization path
        mm: enable maple tree RCU mode by default
        maple_tree: add RCU lock checking to rcu callback functions
        maple_tree: add smp_rmb() to dead node detection
        maple_tree: fix write memory barrier of nodes once dead for RCU mode
        maple_tree: remove extra smp_wmb() from mas_dead_leaves()
        maple_tree: fix freeing of nodes in rcu mode
        maple_tree: detect dead nodes in mas_start()
        maple_tree: be more cautious about dead nodes
        ...
  2. 07 Apr, 2023 4 commits
  3. 06 Apr, 2023 32 commits
    • tracing/synthetic: Make lastcmd_mutex static · 31c68396
      Steven Rostedt (Google) authored
      The lastcmd_mutex is only used in trace_events_synth.c and should be
      static.
      
      Link: https://lore.kernel.org/linux-trace-kernel/202304062033.cRStgOuP-lkp@intel.com/
      Link: https://lore.kernel.org/linux-trace-kernel/20230406111033.6e26de93@gandalf.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
      Fixes: 4ccf11c4 ("tracing/synthetic: Fix races on freeing last_cmd")
      Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • Merge tag 'net-6.3-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · f2afccfe
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from wireless and can.
      
        Current release - regressions:
      
         - wifi: mac80211:
            - fix potential null pointer dereference
            - fix receiving mesh packets in forwarding=0 networks
            - fix mesh forwarding
      
        Current release - new code bugs:
      
         - virtio/vsock: fix leaks due to missing skb owner
      
        Previous releases - regressions:
      
         - raw: fix NULL deref in raw_get_next().
      
         - sctp: check send stream number after wait_for_sndbuf
      
         - qrtr:
            - fix a refcount bug in qrtr_recvmsg()
            - do not do DEL_SERVER broadcast after DEL_CLIENT
      
         - wifi: brcmfmac: fix SDIO suspend/resume regression
      
         - wifi: mt76: fix use-after-free in fw features query.
      
         - can: fix race between isotp_sendsmg() and isotp_release()
      
         - eth: mtk_eth_soc: fix remaining throughput regression
      
         - eth: ice: reset FDIR counter in FDIR init stage
      
        Previous releases - always broken:
      
         - core: don't let netpoll invoke NAPI if in xmit context
      
         - icmp: guard against too small mtu
      
         - ipv6: fix an uninit variable access bug in __ip6_make_skb()
      
         - wifi: mac80211: fix the size calculation of
           ieee80211_ie_len_eht_cap()
      
         - can: fix poll() to not report false EPOLLOUT events
      
         - eth: gve: secure enough bytes in the first TX desc for all TCP
           pkts"
      
      * tag 'net-6.3-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
        net: stmmac: check fwnode for phy device before scanning for phy
        net: stmmac: Add queue reset into stmmac_xdp_open() function
        selftests: net: rps_default_mask.sh: delete veth link specifically
        net: fec: make use of MDIO C45 quirk
        can: isotp: fix race between isotp_sendsmg() and isotp_release()
        can: isotp: isotp_ops: fix poll() to not report false EPOLLOUT events
        can: isotp: isotp_recvmsg(): use sock_recv_cmsgs() to get SOCK_RXQ_OVFL infos
        can: j1939: j1939_tp_tx_dat_new(): fix out-of-bounds memory access
        gve: Secure enough bytes in the first TX desc for all TCP pkts
        netlink: annotate lockless accesses to nlk->max_recvmsg_len
        ethtool: reset #lanes when lanes is omitted
        ping: Fix potentail NULL deref for /proc/net/icmp.
        raw: Fix NULL deref in raw_get_next().
        ice: Reset FDIR counter in FDIR init stage
        ice: fix wrong fallback logic for FDIR
        net: stmmac: fix up RX flow hash indirection table when setting channels
        net: ethernet: ti: am65-cpsw: Fix mdio cleanup in probe
        wifi: mt76: ignore key disable commands
        wifi: ath11k: reduce the MHI timeout to 20s
        ipv6: Fix an uninit variable access bug in __ip6_make_skb()
        ...
    • Merge tag 'linux-kselftest-fixes-6.3-rc6' of... · 8f2e1a85
      Linus Torvalds authored
      Merge tag 'linux-kselftest-fixes-6.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull Kselftest fixes from Shuah Khan:
       "One single fix to mount_setattr_test build failure"
      
      * tag 'linux-kselftest-fixes-6.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        selftests mount: Fix mount_setattr_test builds failed
    • Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd · 105b64c8
      Linus Torvalds authored
      Pull iommufd fixes from Jason Gunthorpe:
      
       - An invalid VA range can be put in a pages and eventually trigger
         WARN_ON, reject it early
      
       - Use of the wrong start index value when doing the complex batch carry
         scheme
      
       - Wrong store ordering resulting in corrupting data used in a later
         calculation that corrupted the batch structure during carry
      
      * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
        iommufd: Do not corrupt the pfn list when doing batch carry
        iommufd: Fix unpinning of pages when an access is present
        iommufd: Check for uptr overflow
    • Merge tag 'pwm/for-6.3-rc6' of... · ae52f797
      Linus Torvalds authored
      Merge tag 'pwm/for-6.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
      
      Pull pwm fixes from Thierry Reding:
       "These are some fixes to make sure the PWM state structure is always
        initialized to a known state.
      
        Prior to this it could happen in some situations that random data from
        the stack would leak into the data structure and cause subtle bugs"
      
      * tag 'pwm/for-6.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm:
        pwm: Zero-initialize the pwm_state passed to driver's .get_state()
        pwm: meson: Explicitly set .polarity in .get_state()
        pwm: sprd: Explicitly set .polarity in .get_state()
        pwm: iqs620a: Explicitly set .polarity in .get_state()
        pwm: cros-ec: Explicitly set .polarity in .get_state()
        pwm: hibvt: Explicitly set .polarity in .get_state()
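The zero-initialization described in this pull request can be sketched in plain C. This is a minimal userspace illustration, not the kernel's code: `struct demo_pwm_state`, `demo_driver_get_state()` and `demo_core_get_state()` are hypothetical stand-ins invented for this example. It shows why clearing the structure before the driver callback runs gives every field a known value, even when the callback forgets one.

```c
#include <string.h>

/* Hypothetical stand-ins for the kernel's PWM types (illustration only). */
enum demo_polarity { DEMO_POLARITY_NORMAL = 0, DEMO_POLARITY_INVERSED = 1 };

struct demo_pwm_state {
	unsigned long long period;
	unsigned long long duty_cycle;
	enum demo_polarity polarity;
	int enabled;
};

/* A driver callback that never sets .polarity, like the drivers fixed here. */
void demo_driver_get_state(struct demo_pwm_state *state)
{
	state->period = 1000;
	state->duty_cycle = 250;
	state->enabled = 1;
}

/* Core-side wrapper: clear the whole structure before the driver fills it. */
struct demo_pwm_state demo_core_get_state(void)
{
	struct demo_pwm_state state;

	memset(&state, 0, sizeof(state));
	demo_driver_get_state(&state);
	return state;
}
```

Without the `memset()`, `.polarity` would hold whatever happened to be on the stack. With it, the field reads back as `DEMO_POLARITY_NORMAL`.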
    • Merge tag 'drm-fixes-2023-04-06' of git://anongit.freedesktop.org/drm/drm · ac6c0433
      Linus Torvalds authored
      Pull drm fixes from Daniel Vetter:
       "Mostly i915 fixes: dp mst for compression/dsc, perf ioctl uaf, ctx rpm
        accounting, gt reset vs huc loading.
      
        And a few individual driver fixes: ivpu dma fence&suspend, panfrost
        mmap, nouveau color depth"
      
      * tag 'drm-fixes-2023-04-06' of git://anongit.freedesktop.org/drm/drm:
        accel/ivpu: Fix S3 system suspend when not idle
        accel/ivpu: Add dma fence to command buffers only
        drm/i915: Fix context runtime accounting
        drm/i915: fix race condition UAF in i915_perf_add_config_ioctl
        drm/i915: Use compressed bpp when calculating m/n value for DP MST DSC
        drm/i915/huc: Cancel HuC delayed load timer on reset.
        drm/i915/ttm: fix sparse warning
        drm/panfrost: Fix the panfrost_mmu_map_fault_addr() error path
        drm/nouveau/disp: Support more modes by checking with lower bpc
    • Merge tag 'sound-6.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 2a28a8b3
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "The majority of changes here are various fixes for Intel drivers,
        and there is a change in ASoC PCM core for the format constraints.
      
        In addition, a workaround for HD-audio HDMI regressions and usual
        HD-audio quirks are found"
      
      * tag 'sound-6.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: hda/hdmi: Preserve the previous PCM device upon re-enablement
        ALSA: hda/realtek: Add quirk for Clevo X370SNW
        ALSA: hda/realtek: fix mute/micmute LEDs for a HP ProBook
        ASoC: SOF: avoid a NULL dereference with unsupported widgets
        ASoC: da7213.c: add missing pm_runtime_disable()
        ASoC: hdac_hdmi: use set_stream() instead of set_tdm_slots()
        ASoC: codecs: lpass: fix the order or clks turn off during suspend
        ASoC: Intel: bytcr_rt5640: Add quirk for the Acer Iconia One 7 B1-750
        ASoC: SOF: ipc4: Ensure DSP is in D0I0 during sof_ipc4_set_get_data()
        ASoC: amd: yc: Add DMI entries to support Victus by HP Laptop 16-e1xxx (8A22)
        ASoC: soc-pcm: fix hw->formats cleared by soc_pcm_hw_init() for dpcm
        ASoC: Intel: soc-acpi: add table for Intel 'Rooks County' NUC M15
        ASOC: Intel: sof_sdw: add quirk for Intel 'Rooks County' NUC M15
    • Merge tag 'platform-drivers-x86-v6.3-5' of... · 8dfab523
      Linus Torvalds authored
      Merge tag 'platform-drivers-x86-v6.3-5' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
      
      Pull x86 platform driver fixes from Hans de Goede:
      
       -  more think-lmi fixes
      
       -  one DMI quirk addition
      
      * tag 'platform-drivers-x86-v6.3-5' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
        platform/x86: thinkpad_acpi: Add missing T14s Gen1 type to s2idle quirk list
        platform/x86: think-lmi: Clean up display of current_value on Thinkstation
        platform/x86: think-lmi: Fix memory leaks when parsing ThinkStation WMI strings
        platform/x86: think-lmi: Fix memory leak when showing current settings
    • Merge tag 'asm-generic-fixes-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic · fcff5f99
      Linus Torvalds authored
      Pull asm-generic fixes from Arnd Bergmann:
       "These are minor fixes to address false-positive build warnings:
      
        Some of the less common I/O accessors are missing __force casts and
        cause sparse warnings for their implied byteswap, and a recent change
        to __generic_cmpxchg_local() causes a warning about constant integer
        truncation"
      
      * tag 'asm-generic-fixes-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
        asm-generic: avoid __generic_cmpxchg_local warnings
        asm-generic/io.h: suppress endianness warnings for relaxed accessors
        asm-generic/io.h: suppress endianness warnings for readq() and writeq()
    • net: stmmac: check fwnode for phy device before scanning for phy · 8fbc10b9
      Michael Sit Wei Hong authored
      Some DT devices already have a phy device configured in DT/ACPI.
      The current implementation scans for a phy unconditionally, even
      when a phy is listed in DT/ACPI and already attached.
      
      We should check whether the fwnode lists any phy device and use that
      to decide whether to scan for a phy to attach to.
      
      Fixes: fe2cfbc9 ("net: stmmac: check if MAC needs to attach to a PHY")
      Reported-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Link: https://lore.kernel.org/lkml/20230403212434.296975-1-martin.blumenstingl@googlemail.com/
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Shahab Vahedi <shahab@synopsys.com>
      Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Tested-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Suggested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Michael Sit Wei Hong <michael.wei.hong.sit@intel.com>
      Link: https://lore.kernel.org/r/20230406024541.3556305-1-michael.wei.hong.sit@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ftrace: Fix issue that 'direct->addr' not restored in modify_ftrace_direct() · 2a2d8c51
      Zheng Yejian authored
      Syzkaller reported a WARNING: "WARN_ON(!direct)" in modify_ftrace_direct().
      
      The root cause is that 'direct->addr' was changed from 'old_addr' to
      'new_addr' but not restored when ftrace_modify_direct_caller() failed.
      After that, 'direct' can no longer be found by 'old_addr'.
      
      To fix it, restore 'direct->addr' to 'old_addr' explicitly in the error path.
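The error-path restore described above follows a common rollback pattern, sketched here in userspace C. All names (`struct demo_direct`, `demo_modify_caller()`, `demo_modify_direct()`) are hypothetical and only model the idea. The actual kernel function is not reproduced.

```c
#include <errno.h>

/* Hypothetical stand-in for an entry that is looked up by its address. */
struct demo_direct {
	unsigned long addr;
};

/* Hypothetical update step that can fail; returns 0 or a negative errno. */
int demo_modify_caller(int should_fail)
{
	return should_fail ? -ENOMEM : 0;
}

int demo_modify_direct(struct demo_direct *direct, unsigned long new_addr,
		       int should_fail)
{
	unsigned long old_addr = direct->addr;
	int err;

	direct->addr = new_addr;
	err = demo_modify_caller(should_fail);
	if (err)
		direct->addr = old_addr;	/* roll back so lookups by old_addr still work */
	return err;
}
```

On failure the entry keeps its original address, so a later lookup by the old address still finds it.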
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230330025223.1046087-1-zhengyejian1@huawei.com
      
      Cc: stable@vger.kernel.org
      Cc: <mhiramat@kernel.org>
      Cc: <mark.rutland@arm.com>
      Cc: <ast@kernel.org>
      Cc: <daniel@iogearbox.net>
      Fixes: 8a141dd7 ("ftrace: Fix modify_ftrace_direct.")
      Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • swiotlb: fix a braino in the alignment check fix · bbb73a10
      Petr Tesarik authored
      The alignment mask in swiotlb_do_find_slots() masks off the high
      bits which are not relevant for the alignment, so multiple
      requirements are combined with a bitwise OR rather than AND.
      In plain English, the stricter the alignment, the more bits must
      be set in iotlb_align_mask.
      
      Confusion may arise from the fact that the same variable is also
      used to mask off the offset within a swiotlb slot, which is
      achieved with a bitwise AND.
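The OR-versus-AND point can be checked with a small standalone sketch. The helper names below are hypothetical and not taken from swiotlb, and the alignments are assumed to be non-zero powers of two. A mask here holds the low bits that must be zero in an aligned address, so a stricter alignment means more set bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask of the low bits that must be zero for a power-of-two alignment. */
uint64_t demo_align_mask(uint64_t alignment)
{
	return alignment - 1;
}

/* Combine two requirements: a bit required by either one stays set. */
uint64_t demo_combine_masks(uint64_t mask_a, uint64_t mask_b)
{
	return mask_a | mask_b;
}

/* An address satisfies a mask when none of the masked bits are set. */
bool demo_is_aligned(uint64_t addr, uint64_t mask)
{
	return (addr & mask) == 0;
}
```

Combining a 4096-byte and a 2048-byte requirement with OR yields `0xfff`, the stricter of the two. AND would yield `0x7ff` and silently accept an address such as `0x1800`, which is only 2048-byte aligned.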
      
      Fixes: 0eee5ae1 ("swiotlb: fix slot alignment checks")
      Reported-by: Dexuan Cui <decui@microsoft.com>
      Link: https://lore.kernel.org/all/CAA42JLa1y9jJ7BgQvXeUYQh-K2mDNHd2BYZ4iZUz33r5zY7oAQ@mail.gmail.com/
      Reported-by: Kelsey Steele <kelseysteele@linux.microsoft.com>
      Link: https://lore.kernel.org/all/20230405003549.GA21326@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
      Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
      Tested-by: Dexuan Cui <decui@microsoft.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • net: stmmac: Add queue reset into stmmac_xdp_open() function · 24e3fce0
      Song Yoong Siang authored
      Queue reset was moved out of the __init_dma_rx_desc_rings() and
      __init_dma_tx_desc_rings() functions. As a result, the driver fails to
      transmit and receive packets after XDP prog setup.
      
      This commit adds the missing queue reset into stmmac_xdp_open() function.
      
      Fixes: f9ec5723 ("net: ethernet: stmicro: stmmac: move queue reset to dedicated functions")
      Cc: <stable@vger.kernel.org> # 6.0+
      Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Link: https://lore.kernel.org/r/20230404044823.3226144-1-yoong.siang.song@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: net: rps_default_mask.sh: delete veth link specifically · 38e058cc
      Hangbin Liu authored
      When deleting the netns and recreating a new one while re-adding the
      veth interface, there is a small window of time during which the old
      veth interface has not yet been removed. This can cause the new addition
      to fail. To resolve this issue, we can either wait for a short while to
      ensure that the old veth interface is deleted, or we can specifically
      remove the veth interface.
      
      Before this patch:
        # ./rps_default_mask.sh
        empty rps_default_mask                                      [ ok ]
        changing rps_default_mask dont affect existing devices      [ ok ]
        changing rps_default_mask dont affect existing netns        [ ok ]
        changing rps_default_mask affect newly created devices      [ ok ]
        changing rps_default_mask don't affect newly child netns[II][ ok ]
        rps_default_mask is 0 by default in child netns             [ ok ]
        RTNETLINK answers: File exists
        changing rps_default_mask in child ns don't affect the main one[ ok ]
        cat: /sys/class/net/vethC11an1/queues/rx-0/rps_cpus: No such file or directory
        changing rps_default_mask in child ns affects new childns devices./rps_default_mask.sh: line 36: [: -eq: unary operator expected
        [fail] expected 1 found
        changing rps_default_mask in child ns don't affect existing devices[ ok ]
      
      After this patch:
        # ./rps_default_mask.sh
        empty rps_default_mask                                      [ ok ]
        changing rps_default_mask dont affect existing devices      [ ok ]
        changing rps_default_mask dont affect existing netns        [ ok ]
        changing rps_default_mask affect newly created devices      [ ok ]
        changing rps_default_mask don't affect newly child netns[II][ ok ]
        rps_default_mask is 0 by default in child netns             [ ok ]
        changing rps_default_mask in child ns don't affect the main one[ ok ]
        changing rps_default_mask in child ns affects new childns devices[ ok ]
        changing rps_default_mask in child ns don't affect existing devices[ ok ]
      
      Fixes: 3a7d84ea ("self-tests: more rps self tests")
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20230404072411.879476-1-liuhangbin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: fec: make use of MDIO C45 quirk · abc33494
      Greg Ungerer authored
      Not all fec MDIO bus drivers support C45 mode transactions. The older fec
      hardware block in many ColdFire SoCs does not appear to support them, at
      least according to most of the different ColdFire SoC reference manuals.
      The bits used to generate C45 access on the iMX parts, in the OP field
      of the MMFR register, are documented as generating non-compliant MII
      frames (it is not documented as to exactly how they are non-compliant).
      
      Commit 8d03ad1a ("net: fec: Separate C22 and C45 transactions")
      means the fec driver will always register c45 MDIO read and write
      methods. During probe these will now always be accessed, generating
      non-compliant MII accesses on ColdFire based devices.
      
      Add a quirk define, FEC_QUIRK_HAS_MDIO_C45, that can be used to
      distinguish silicon that supports MDIO C45 framing from silicon that
      does not. Add this to all the existing iMX quirks, so they will behave
      as they do now (*).
      
      (*) it seems that some iMX parts may not support C45 transactions either.
          The iMX25 and iMX50 Reference Manuals contain similar wording to
          the ColdFire Reference Manuals on this.
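A quirk flag of this kind is typically used to decide which bus operations get registered. The sketch below is a simplified userspace model under that assumption: `DEMO_QUIRK_HAS_MDIO_C45`, `struct demo_mdio_bus` and the `demo_*` functions are hypothetical, and the real value of FEC_QUIRK_HAS_MDIO_C45 and the driver's actual registration code are not shown in this log.

```c
#include <stddef.h>

/* Hypothetical quirk bit; the real flag's value is not shown in the log. */
#define DEMO_QUIRK_HAS_MDIO_C45 (1u << 0)

typedef int (*demo_mdio_read_fn)(int addr, int reg);

/* Hypothetical bus object with one read method per framing type. */
struct demo_mdio_bus {
	demo_mdio_read_fn read_c22;
	demo_mdio_read_fn read_c45;
};

int demo_read_c22(int addr, int reg)
{
	(void)addr;
	(void)reg;
	return 0;
}

int demo_read_c45(int addr, int reg)
{
	(void)addr;
	(void)reg;
	return 0;
}

/* Register the C45 method only when the silicon is flagged as supporting it. */
void demo_setup_bus(struct demo_mdio_bus *bus, unsigned int quirks)
{
	bus->read_c22 = demo_read_c22;
	bus->read_c45 = (quirks & DEMO_QUIRK_HAS_MDIO_C45) ? demo_read_c45 : NULL;
}
```

With the C45 method left unset on parts without the quirk, a probe that walks the bus has no C45 accessor to call on that hardware.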
      
      Fixes: 8d03ad1a ("net: fec: Separate C22 and C45 transactions")
      Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>
      Reviewed-by: Wei Fang <wei.fang@nxp.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20230404052207.3064861-1-gerg@linux-m68k.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • maple_tree: fix a potential concurrency bug in RCU mode · c45ea315
      Peng Zhang authored
      There is a concurrency bug that may cause the wrong value to be loaded
      when a CPU is modifying the maple tree.
      
      CPU1:
      mtree_insert_range()
        mas_insert()
          mas_store_root()
            ...
            mas_root_expand()
              ...
              rcu_assign_pointer(mas->tree->ma_root, mte_mk_root(mas->node));
              ma_set_meta(node, maple_leaf_64, 0, slot);    <---IP
      
      CPU2:
      mtree_load()
        mtree_lookup_walk()
          ma_data_end();
      
      When CPU1 is about to execute the instruction pointed to by IP, the
      ma_data_end() executed by CPU2 may return the wrong end position, which
      will cause the value loaded by mtree_load() to be wrong.
      
      An example of triggering the bug:
      
      Add mdelay(100) between rcu_assign_pointer() and ma_set_meta() in
      mas_root_expand().
      
      static DEFINE_MTREE(tree);
      int work(void *p) {
      	unsigned long val;
      	for (int i = 0 ; i< 30; ++i) {
      		val = (unsigned long)mtree_load(&tree, 8);
      		mdelay(5);
      		pr_info("%lu",val);
      	}
      	return 0;
      }
      
      mt_init_flags(&tree, MT_FLAGS_USE_RCU);
      mtree_insert(&tree, 0, (void*)12345, GFP_KERNEL);
      run_thread(work)
      mtree_insert(&tree, 1, (void*)56789, GFP_KERNEL);
      
      In RCU mode, mtree_load() should always return the value from either
      before or after the data structure is modified. In this example,
      mtree_load(&tree, 8) may return 56789, which is not expected; it should
      always return NULL.  Fix it by putting ma_set_meta() before
      rcu_assign_pointer().
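The ordering rule behind this fix (finish initializing a node, then publish the pointer) can be illustrated with C11 atomics in userspace. This is an analogy, not the maple tree code: `struct demo_node`, `demo_publish()` and `demo_lookup()` are hypothetical. The release store stands in for rcu_assign_pointer(), and the acquire load is a conservative stand-in for the reader side.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node whose metadata readers rely on. */
struct demo_node {
	unsigned int data_end;
	unsigned long slots[4];
};

/* Shared root pointer; NULL until a node is published. */
_Atomic(struct demo_node *) demo_root;

/* Writer: complete all metadata before making the node reachable. */
void demo_publish(struct demo_node *node, unsigned int end)
{
	node->data_end = end;
	atomic_store_explicit(&demo_root, node, memory_order_release);
}

/* Reader: a node seen through this load has its metadata already set. */
const struct demo_node *demo_lookup(void)
{
	return atomic_load_explicit(&demo_root, memory_order_acquire);
}
```

If the two statements in `demo_publish()` were swapped, a concurrent reader could observe the node with a stale `data_end`, which is the window the commit describes.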
      
      Link: https://lkml.kernel.org/r/20230314124203.91572-4-zhangpeng.00@bytedance.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: fix get wrong data_end in mtree_lookup_walk() · ec07967d
      Peng Zhang authored
      if (likely(offset > end))
      	max = pivots[offset];
      
      The condition above should read if (likely(offset < end)).  The mistake
      affects the correctness of ma_data_end().  It does not currently appear
      to produce a wrong final result, but it is best fixed.  Rather than only
      flipping the comparison, this patch simplifies the code along the way.
      
      Link: https://lkml.kernel.org/r/20230314124203.91572-1-zhangpeng.00@bytedance.com
      Link: https://lkml.kernel.org/r/20230314124203.91572-2-zhangpeng.00@bytedance.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/swap: fix swap_info_struct race between swapoff and get_swap_pages() · 6fe7d6b9
      Rongwei Wang authored
      The si->lock must be held when deleting the si from the available list. 
      Otherwise, another thread can re-add the si to the available list, which
      can lead to memory corruption.  The only place we have found where this
      happens is in the swapoff path.  This case can be described as below:
      
      core 0                       core 1
      swapoff
      
      del_from_avail_list(si)      waiting
      
      try lock si->lock            acquire swap_avail_lock
                                   and re-add si into
                                   swap_avail_head
      
      Core 0 then acquires si->lock, but misses that si has already been added
      again, and continues to clear SWP_WRITEOK, etc.
      
      A flood of warning messages is easily triggered inside get_swap_pages()
      in some special cases, for example by calling madvise(MADV_PAGEOUT)
      concurrently on blocks of touched memory while running many
      swapon/swapoff operations (e.g.  stress-ng-swap).
      
      In the worst case, however, this scenario can cause a panic.  After
      swapoff(), the memory used by si may remain in swap_info[].  The memory
      corruption therefore does not show up immediately, but only once that si
      is allocated and reset for a new swap device in the swapon path.  The
      resulting panic message (with CONFIG_PLIST_DEBUG enabled):
      
      ------------[ cut here ]------------
      top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
      prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
      next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
      WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
      Modules linked in: rfkill(E) crct10dif_ce(E)...
      CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
      Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
      pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
      pc : plist_check_prev_next_node+0x50/0x70
      lr : plist_check_prev_next_node+0x50/0x70
      sp : ffff0018009d3c30
      x29: ffff0018009d3c40 x28: ffff800011b32a98
      x27: 0000000000000000 x26: ffff001803908000
      x25: ffff8000128ea088 x24: ffff800011b32a48
      x23: 0000000000000028 x22: ffff001800875c00
      x21: ffff800010f9e520 x20: ffff001800875c00
      x19: ffff001800fdc6e0 x18: 0000000000000030
      x17: 0000000000000000 x16: 0000000000000000
      x15: 0736076307640766 x14: 0730073007380731
      x13: 0736076307640766 x12: 0730073007380731
      x11: 000000000004058d x10: 0000000085a85b76
      x9 : ffff8000101436e4 x8 : ffff800011c8ce08
      x7 : 0000000000000000 x6 : 0000000000000001
      x5 : ffff0017df9ed338 x4 : 0000000000000001
      x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
      x1 : 0000000000000000 x0 : 0000000000000000
      Call trace:
       plist_check_prev_next_node+0x50/0x70
       plist_check_head+0x80/0xf0
       plist_add+0x28/0x140
       add_to_avail_list+0x9c/0xf0
       _enable_swap_info+0x78/0xb4
       __do_sys_swapon+0x918/0xa10
       __arm64_sys_swapon+0x20/0x30
       el0_svc_common+0x8c/0x220
       do_el0_svc+0x2c/0x90
       el0_svc+0x1c/0x30
       el0_sync_handler+0xa8/0xb0
       el0_sync+0x148/0x180
      irq event stamp: 2082270
      
      Now si->lock is taken before calling 'del_from_avail_list()', so that
      other threads see the deletion of si and the clearing of SWP_WRITEOK
      together and do not reinsert it.
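The locking rule can be modelled in a few lines of userspace C. This is a simplified sketch under stated assumptions, not the swap code: `struct demo_swap_info` and the `demo_*` functions are hypothetical, a tiny spinlock built on `atomic_flag` stands in for si->lock, and two booleans stand in for SWP_WRITEOK and membership of the available list.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_swap_info {
	atomic_flag lock;	/* hypothetical stand-in for si->lock */
	bool writeok;		/* stand-in for the SWP_WRITEOK flag */
	bool on_avail_list;	/* stand-in for available-list membership */
};

void demo_init(struct demo_swap_info *si)
{
	atomic_flag_clear(&si->lock);
	si->writeok = true;
	si->on_avail_list = true;
}

void demo_lock(struct demo_swap_info *si)
{
	while (atomic_flag_test_and_set_explicit(&si->lock, memory_order_acquire))
		;	/* spin until the lock is free */
}

void demo_unlock(struct demo_swap_info *si)
{
	atomic_flag_clear_explicit(&si->lock, memory_order_release);
}

/* swapoff side: list removal and flag clearing happen in one critical section. */
void demo_swapoff(struct demo_swap_info *si)
{
	demo_lock(si);
	si->on_avail_list = false;
	si->writeok = false;
	demo_unlock(si);
}

/* Re-add side: under the same lock, re-add only while still writable. */
bool demo_try_readd(struct demo_swap_info *si)
{
	bool added = false;

	demo_lock(si);
	if (si->writeok) {
		si->on_avail_list = true;
		added = true;
	}
	demo_unlock(si);
	return added;
}
```

Because both steps of `demo_swapoff()` sit inside one critical section, a re-add attempt either runs entirely before it or sees `writeok` already cleared and leaves the entry off the list.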
      
      This problem exists in versions after stable 5.10.y.
      
      Link: https://lkml.kernel.org/r/20230404154716.23058-1-rongwei.wang@linux.alibaba.com
      Fixes: a2468cc9 ("swap: choose swap device according to numa node") 
      Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com>
      Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nilfs2: fix sysfs interface lifetime · 42560f9c
      Ryusuke Konishi authored
      The current nilfs2 sysfs support has issues with the timing of creation
      and deletion of sysfs entries, potentially leading to null pointer
      dereferences, use-after-free, and lockdep warnings.
      
      Some of the sysfs attributes for nilfs2 per-filesystem instance refer to
      metadata file "cpfile", "sufile", or "dat", but
      nilfs_sysfs_create_device_group that creates those attributes is executed
      before the inodes for these metadata files are loaded, and
      nilfs_sysfs_delete_device_group which deletes these sysfs entries is
      called after releasing their metadata file inodes.
      
      Therefore, access to some of these sysfs attributes may occur outside of
      the lifetime of these metadata files, resulting in inode NULL pointer
      dereferences or use-after-free.
      
      In addition, the call to nilfs_sysfs_create_device_group() is made while
      the semaphore "ns_sem" of the nilfs object is held, so a shrinker call
      triggered by the memory allocation for the sysfs entries may create the
      lock dependency "ns_sem" -> (shrinker) -> "locks acquired in
      nilfs_evict_inode()".
      
      Since nilfs2 may acquire "ns_sem" deep in the call stack holding other
      locks via its error handler __nilfs_error(), this causes lockdep to report
      circular locking.  This is a false positive and no circular locking
      actually occurs as no inodes exist yet when
      nilfs_sysfs_create_device_group() is called.  Fortunately, the lockdep
      warnings can be resolved by simply moving the call to
      nilfs_sysfs_create_device_group() out of "ns_sem".
      
      This fixes these sysfs issues by revising where the device's sysfs
      interface is created/deleted and keeping its lifetime within the lifetime
      of the metadata files above.
      
      Link: https://lkml.kernel.org/r/20230330205515.6167-1-konishi.ryusuke@gmail.com
      Fixes: dd70edbd ("nilfs2: integrate sysfs support into driver")
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: syzbot+979fa7f9c0d086fdc282@syzkaller.appspotmail.com
        Link: https://lkml.kernel.org/r/0000000000003414b505f7885f7e@google.com
      Reported-by: syzbot+5b7d542076d9bddc3c6a@syzkaller.appspotmail.com
        Link: https://lkml.kernel.org/r/0000000000006ac86605f5f44eb9@google.com
      Cc: Viacheslav Dubeyko <slava@dubeyko.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      42560f9c
    • Alistair Popple's avatar
      mm: take a page reference when removing device exclusive entries · 7c7b9629
      Alistair Popple authored
      Device exclusive page table entries are used to prevent CPU access to a
      page whilst it is being accessed from a device.  Typically this is used to
      implement atomic operations when the underlying bus does not support
      atomic access.  When a CPU thread encounters a device exclusive entry it
      locks the page and restores the original entry after calling mmu notifiers
      to signal drivers that exclusive access is no longer available.
      
      The device exclusive entry holds a reference to the page making it safe to
      access the struct page whilst the entry is present.  However the fault
      handling code does not hold the PTL when taking the page lock.  This means
      if there are multiple threads faulting concurrently on the device
      exclusive entry one will remove the entry whilst others will wait on the
      page lock without holding a reference.
      
      This can lead to threads locking or waiting on a folio with a zero
      refcount.  Whilst mmap_lock prevents the pages getting freed via munmap()
      they may still be freed by a migration.  This leads to warnings such as
      PAGE_FLAGS_CHECK_AT_FREE due to the page being locked when the refcount
      drops to zero.
      
      Fix this by trying to take a reference on the folio before locking it. 
      The code already checks the PTE under the PTL and aborts if the entry is
      no longer there.  It is also possible the folio has been unmapped, freed
      and re-allocated allowing a reference to be taken on an unrelated folio. 
      This case is also detected by the PTE check and the folio is unlocked
      without further changes.
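
The ordering described above (try to take a reference, then lock, then recheck the entry) can be sketched in user-space C. This is only an illustration under stated assumptions: `struct obj`, `obj_try_get()`, `obj_put()` and `restore_entry()` are hypothetical names standing in for the folio, its refcount helpers and the fault path, and a simple trylock replaces the sleeping page lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio: a reference count and a lock bit. */
struct obj {
    atomic_int refcount;
    atomic_bool locked;
};

/* Take a reference only if the object is still live (refcount > 0). */
static bool obj_try_get(struct obj *o)
{
    int old = atomic_load(&o->refcount);

    while (old > 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
            return true;
    }
    return false;
}

static void obj_put(struct obj *o)
{
    atomic_fetch_sub(&o->refcount, 1);
}

/*
 * Model of the fault path: pin the object before locking it, then
 * recheck that the exclusive entry is still present (the kernel does
 * this check under the PTL). Returns true only if this caller removed
 * the entry.
 */
static bool restore_entry(struct obj *o, bool *entry_present)
{
    bool expected = false;
    bool removed = false;

    if (!obj_try_get(o))
        return false;            /* already being freed: back off */

    if (!atomic_compare_exchange_strong(&o->locked, &expected, true)) {
        obj_put(o);
        return false;            /* lock contended: caller retries */
    }

    if (*entry_present) {        /* entry may have gone away meanwhile */
        *entry_present = false;
        removed = true;
    }

    atomic_store(&o->locked, false);
    obj_put(o);
    return removed;
}
```

Because the reference is taken before the lock, a waiter can never be left holding or waiting on the lock of an object whose count has already reached zero.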
      
      Link: https://lkml.kernel.org/r/20230330012519.804116-1-apopple@nvidia.com
      Fixes: b756a3b5 ("mm: device exclusive memory access")
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7c7b9629
    • Yafang Shao's avatar
      mm: vmalloc: avoid warn_alloc noise caused by fatal signal · f349b15e
      Yafang Shao authored
      There are some suspicious warn_alloc reports on my test server, for example:
      
      [13366.518837] warn_alloc: 81 callbacks suppressed
      [13366.518841] test_verifier: vmalloc error: size 4096, page order 0, failed to allocate pages, mode:0x500dc2(GFP_HIGHUSER|__GFP_ZERO|__GFP_ACCOUNT), nodemask=(null),cpuset=/,mems_allowed=0-1
      [13366.522240] CPU: 30 PID: 722463 Comm: test_verifier Kdump: loaded Tainted: G        W  O       6.2.0+ #638
      [13366.524216] Call Trace:
      [13366.524702]  <TASK>
      [13366.525148]  dump_stack_lvl+0x6c/0x80
      [13366.525712]  dump_stack+0x10/0x20
      [13366.526239]  warn_alloc+0x119/0x190
      [13366.526783]  ? alloc_pages_bulk_array_mempolicy+0x9e/0x2a0
      [13366.527470]  __vmalloc_area_node+0x546/0x5b0
      [13366.528066]  __vmalloc_node_range+0xc2/0x210
      [13366.528660]  __vmalloc_node+0x42/0x50
      [13366.529186]  ? bpf_prog_realloc+0x53/0xc0
      [13366.529743]  __vmalloc+0x1e/0x30
      [13366.530235]  bpf_prog_realloc+0x53/0xc0
      [13366.530771]  bpf_patch_insn_single+0x80/0x1b0
      [13366.531351]  bpf_jit_blind_constants+0xe9/0x1c0
      [13366.531932]  ? __free_pages+0xee/0x100
      [13366.532457]  ? free_large_kmalloc+0x58/0xb0
      [13366.533002]  bpf_int_jit_compile+0x8c/0x5e0
      [13366.533546]  bpf_prog_select_runtime+0xb4/0x100
      [13366.534108]  bpf_prog_load+0x6b1/0xa50
      [13366.534610]  ? perf_event_task_tick+0x96/0xb0
      [13366.535151]  ? security_capable+0x3a/0x60
      [13366.535663]  __sys_bpf+0xb38/0x2190
      [13366.536120]  ? kvm_clock_get_cycles+0x9/0x10
      [13366.536643]  __x64_sys_bpf+0x1c/0x30
      [13366.537094]  do_syscall_64+0x38/0x90
      [13366.537554]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
      [13366.538107] RIP: 0033:0x7f78310f8e29
      [13366.538561] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 e0 2c 00 f7 d8 64 89 01 48
      [13366.540286] RSP: 002b:00007ffe2a61fff8 EFLAGS: 00000206 ORIG_RAX: 0000000000000141
      [13366.541031] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f78310f8e29
      [13366.541749] RDX: 0000000000000080 RSI: 00007ffe2a6200b0 RDI: 0000000000000005
      [13366.542470] RBP: 00007ffe2a620010 R08: 00007ffe2a6202a0 R09: 00007ffe2a6200b0
      [13366.543183] R10: 00000000000f423e R11: 0000000000000206 R12: 0000000000407800
      [13366.543900] R13: 00007ffe2a620540 R14: 0000000000000000 R15: 0000000000000000
      [13366.544623]  </TASK>
      [13366.545260] Mem-Info:
      [13366.546121] active_anon:81319 inactive_anon:20733 isolated_anon:0
       active_file:69450 inactive_file:5624 isolated_file:0
       unevictable:0 dirty:10 writeback:0
       slab_reclaimable:69649 slab_unreclaimable:48930
       mapped:27400 shmem:12868 pagetables:4929
       sec_pagetables:0 bounce:0
       kernel_misc_reclaimable:0
       free:15870308 free_pcp:142935 free_cma:0
      [13366.551886] Node 0 active_anon:224836kB inactive_anon:33528kB active_file:175692kB inactive_file:13752kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:59248kB dirty:32kB writeback:0kB shmem:18252kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:4616kB pagetables:10664kB sec_pagetables:0kB all_unreclaimable? no
      [13366.555184] Node 1 active_anon:100440kB inactive_anon:49404kB active_file:102108kB inactive_file:8744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:50352kB dirty:8kB writeback:0kB shmem:33220kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:3896kB pagetables:9052kB sec_pagetables:0kB all_unreclaimable? no
      [13366.558262] Node 0 DMA free:15360kB boost:0kB min:304kB low:380kB high:456kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
      [13366.560821] lowmem_reserve[]: 0 2735 31873 31873 31873
      [13366.561981] Node 0 DMA32 free:2790904kB boost:0kB min:56028kB low:70032kB high:84036kB reserved_highatomic:0KB active_anon:1936kB inactive_anon:20kB active_file:396kB inactive_file:344kB unevictable:0kB writepending:0kB present:3129200kB managed:2801520kB mlocked:0kB bounce:0kB free_pcp:5188kB local_pcp:0kB free_cma:0kB
      [13366.565148] lowmem_reserve[]: 0 0 29137 29137 29137
      [13366.566168] Node 0 Normal free:28533824kB boost:0kB min:596740kB low:745924kB high:895108kB reserved_highatomic:28672KB active_anon:222900kB inactive_anon:33508kB active_file:175296kB inactive_file:13408kB unevictable:0kB writepending:32kB present:30408704kB managed:29837172kB mlocked:0kB bounce:0kB free_pcp:295724kB local_pcp:0kB free_cma:0kB
      [13366.569485] lowmem_reserve[]: 0 0 0 0 0
      [13366.570416] Node 1 Normal free:32141144kB boost:0kB min:660504kB low:825628kB high:990752kB reserved_highatomic:69632KB active_anon:100440kB inactive_anon:49404kB active_file:102108kB inactive_file:8744kB unevictable:0kB writepending:8kB present:33554432kB managed:33025372kB mlocked:0kB bounce:0kB free_pcp:270880kB local_pcp:46860kB free_cma:0kB
      [13366.573403] lowmem_reserve[]: 0 0 0 0 0
      [13366.574015] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15360kB
      [13366.575474] Node 0 DMA32: 782*4kB (UME) 756*8kB (UME) 736*16kB (UME) 745*32kB (UME) 694*64kB (UME) 653*128kB (UME) 595*256kB (UME) 552*512kB (UME) 454*1024kB (UME) 347*2048kB (UME) 246*4096kB (UME) = 2790904kB
      [13366.577442] Node 0 Normal: 33856*4kB (UMEH) 51815*8kB (UMEH) 42418*16kB (UMEH) 36272*32kB (UMEH) 22195*64kB (UMEH) 10296*128kB (UMEH) 7238*256kB (UMEH) 5638*512kB (UEH) 5337*1024kB (UMEH) 3506*2048kB (UMEH) 1470*4096kB (UME) = 28533784kB
      [13366.580460] Node 1 Normal: 15776*4kB (UMEH) 37485*8kB (UMEH) 29509*16kB (UMEH) 21420*32kB (UMEH) 14818*64kB (UMEH) 13051*128kB (UMEH) 9918*256kB (UMEH) 7374*512kB (UMEH) 5397*1024kB (UMEH) 3887*2048kB (UMEH) 2002*4096kB (UME) = 32141240kB
      [13366.583027] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      [13366.584380] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      [13366.585702] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      [13366.587042] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      [13366.588372] 87386 total pagecache pages
      [13366.589266] 0 pages in swap cache
      [13366.590327] Free swap  = 0kB
      [13366.591227] Total swap = 0kB
      [13366.592142] 16777082 pages RAM
      [13366.593057] 0 pages HighMem/MovableOnly
      [13366.594037] 357226 pages reserved
      [13366.594979] 0 pages hwpoisoned
      
      This failure really confused me, as there were still lots of available
      pages.  Finally I figured out it was caused by a fatal signal.  When a
      process is allocating memory via vm_area_alloc_pages(), it breaks out
      of the allocation as soon as it receives a fatal signal, even if it
      hasn't allocated the requested pages.  In that case, we shouldn't show
      this warn_alloc, as it is useless.  We only need to show this warning
      when there really are not enough pages.
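
The rule described above can be written as a small predicate. This is a user-space sketch rather than the kernel change itself: `should_warn_alloc_failure()` and its parameters are hypothetical, and the `fatal_signal` flag stands in for whatever the kernel uses to test for a pending fatal signal on the current task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Decide whether a short allocation deserves a warning.
 *   allocated:    pages actually obtained
 *   requested:    pages the caller asked for
 *   fatal_signal: true if the allocating task was interrupted by a
 *                 fatal signal (assumed to be supplied by the caller)
 */
static bool should_warn_alloc_failure(size_t allocated, size_t requested,
                                      bool fatal_signal)
{
    if (allocated >= requested)
        return false;    /* nothing failed */
    if (fatal_signal)
        return false;    /* loop was cut short on purpose, not by OOM */
    return true;         /* genuinely ran out of pages */
}
```

With this check, the trace above would not be printed for the test_verifier process, because its allocation stopped due to the signal and not for lack of memory.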
      
      Link: https://lkml.kernel.org/r/20230330162625.13604-1-laoar.shao@gmail.com
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f349b15e
    • Tetsuo Handa's avatar
      nilfs2: initialize "struct nilfs_binfo_dat"->bi_pad field · 73970316
      Tetsuo Handa authored
      nilfs_btree_assign_p() and nilfs_direct_assign_p() are not initializing
      the "struct nilfs_binfo_dat"->bi_pad field, causing uninit-value reports
      when the structure is passed to the CRC function.
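
To illustrate why an uninitialized padding field matters once a structure is checksummed byte by byte, here is a small user-space sketch. The layout of `struct binfo` is an assumption chosen for illustration (it is not copied from the nilfs2 headers), and `checksum()` is a simple FNV-1a hash standing in for the real CRC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record with an explicit padding field. */
struct binfo {
    uint64_t blkoff;
    uint8_t  level;
    uint8_t  pad[7];
};

/* FNV-1a over raw bytes; a stand-in for a CRC, not the kernel's crc32. */
static uint32_t checksum(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Fill every byte of the record, including the padding. */
static void binfo_fill(struct binfo *b, uint64_t blkoff, uint8_t level)
{
    b->blkoff = blkoff;
    b->level = level;
    memset(b->pad, 0, sizeof(b->pad));   /* the step that was missing */
}
```

Without the `memset()`, two records with identical `blkoff` and `level` could hash differently depending on whatever bytes happened to be in `pad`, which is the kind of read that uninit-value checkers flag.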
      
      Link: https://lkml.kernel.org/r/20230326152146.15872-1-konishi.ryusuke@gmail.com
      Reported-by: syzbot <syzbot+048585f3f4227bb2b49b@syzkaller.appspotmail.com>
        Link: https://syzkaller.appspot.com/bug?extid=048585f3f4227bb2b49b
      Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
        Link: https://lkml.kernel.org/r/CANX2M5bVbzRi6zH3PTcNE_31TzerstOXUa9Bay4E6y6dX23_pg@mail.gmail.com
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      73970316
    • Ryusuke Konishi's avatar
      nilfs2: fix potential UAF of struct nilfs_sc_info in nilfs_segctor_thread() · 6be49d10
      Ryusuke Konishi authored
      The finalization of nilfs_segctor_thread() can race with
      nilfs_segctor_kill_thread() which terminates that thread, potentially
      causing a use-after-free BUG as KASAN detected.
      
      At the end of nilfs_segctor_thread(), it assigns NULL to "sc_task" member
      of "struct nilfs_sc_info" to indicate the thread has finished, and then
      notifies nilfs_segctor_kill_thread() of this using waitqueue
      "sc_wait_task" on the struct nilfs_sc_info.
      
      However, here, immediately after the NULL assignment to "sc_task", it is
      possible that nilfs_segctor_kill_thread() will detect it and return to
      continue the deallocation, freeing the nilfs_sc_info structure before the
      thread does the notification.
      
      This fixes the issue by protecting the NULL assignment to "sc_task" and
      its notification, with spinlock "sc_state_lock" of the struct
      nilfs_sc_info.  Since nilfs_segctor_kill_thread() does a final check to
      see if "sc_task" is NULL with "sc_state_lock" locked, this can eliminate
      the race.
      
      Link: https://lkml.kernel.org/r/20230327175318.8060-1-konishi.ryusuke@gmail.com
      Reported-by: syzbot+b08ebcc22f8f3e6be43a@syzkaller.appspotmail.com
      Link: https://lkml.kernel.org/r/00000000000000660d05f7dfa877@google.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6be49d10
    • Sergey Senozhatsky's avatar
      zsmalloc: document freeable stats · 618a8a91
      Sergey Senozhatsky authored
      When freeable class stat was added to classes file (back in 2016) we
      forgot to update zsmalloc documentation.  Fix that.
      
      Link: https://lkml.kernel.org/r/20230325024631.2817153-3-senozhatsky@chromium.org
      Fixes: 1120ed54 ("mm/zsmalloc: add `freeable' column to pool stat")
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      618a8a91
    • Sergey Senozhatsky's avatar
      zsmalloc: document new fullness grouping · 119b57ea
      Sergey Senozhatsky authored
      Patch series "zsmalloc: minor documentation updates".
      
      Two minor patches that bring zsmalloc documentation up to date.
      
      
      This patch (of 2):
      
      Update documentation and reflect new zspages fullness grouping (we don't
      use almost_empty and almost_full anymore).
      
      Link: https://lkml.kernel.org/r/20230325024631.2817153-1-senozhatsky@chromium.org
      Link: https://lkml.kernel.org/r/20230325024631.2817153-2-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Fixes: 67e157eb3639 ("zsmalloc: show per fullness group class stats")
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      119b57ea
    • Shiyang Ruan's avatar
      fsdax: force clear dirty mark if CoW · f76b3a32
      Shiyang Ruan authored
      XFS allows CoW on non-shared extents to combat fragmentation[1].  The old
      non-shared extent may have been written through mmap before, in which
      case its dax entry is marked dirty.
      
      This results in a WARNing:
      
      [   28.512349] ------------[ cut here ]------------
      [   28.512622] WARNING: CPU: 2 PID: 5255 at fs/dax.c:390 dax_insert_entry+0x342/0x390
      [   28.513050] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables
      [   28.515462] CPU: 2 PID: 5255 Comm: fsstress Kdump: loaded Not tainted 6.3.0-rc1-00001-g85e1481e19c1-dirty #117
      [   28.515902] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.1-1-1 04/01/2014
      [   28.516307] RIP: 0010:dax_insert_entry+0x342/0x390
      [   28.516536] Code: 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 48 8b 45 20 48 83 c0 01 e9 e2 fe ff ff 48 8b 45 20 48 83 c0 01 e9 cd fe ff ff <0f> 0b e9 53 ff ff ff 48 8b 7c 24 08 31 f6 e8 1b 61 a1 00 eb 8c 48
      [   28.517417] RSP: 0000:ffffc9000845fb18 EFLAGS: 00010086
      [   28.517721] RAX: 0000000000000053 RBX: 0000000000000155 RCX: 000000000018824b
      [   28.518113] RDX: 0000000000000000 RSI: ffffffff827525a6 RDI: 00000000ffffffff
      [   28.518515] RBP: ffffea00062092c0 R08: 0000000000000000 R09: ffffc9000845f9c8
      [   28.518905] R10: 0000000000000003 R11: ffffffff82ddb7e8 R12: 0000000000000155
      [   28.519301] R13: 0000000000000000 R14: 000000000018824b R15: ffff88810cfa76b8
      [   28.519703] FS:  00007f14a0c94740(0000) GS:ffff88817bd00000(0000) knlGS:0000000000000000
      [   28.520148] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   28.520472] CR2: 00007f14a0c8d000 CR3: 000000010321c004 CR4: 0000000000770ee0
      [   28.520863] PKRU: 55555554
      [   28.521043] Call Trace:
      [   28.521219]  <TASK>
      [   28.521368]  dax_fault_iter+0x196/0x390
      [   28.521595]  dax_iomap_pte_fault+0x19b/0x3d0
      [   28.521852]  __xfs_filemap_fault+0x234/0x2b0
      [   28.522116]  __do_fault+0x30/0x130
      [   28.522334]  do_fault+0x193/0x340
      [   28.522586]  __handle_mm_fault+0x2d3/0x690
      [   28.522975]  handle_mm_fault+0xe6/0x2c0
      [   28.523259]  do_user_addr_fault+0x1bc/0x6f0
      [   28.523521]  exc_page_fault+0x60/0x140
      [   28.523763]  asm_exc_page_fault+0x22/0x30
      [   28.524001] RIP: 0033:0x7f14a0b589ca
      [   28.524225] Code: c5 fe 7f 07 c5 fe 7f 47 20 c5 fe 7f 47 40 c5 fe 7f 47 60 c5 f8 77 c3 66 0f 1f 84 00 00 00 00 00 40 0f b6 c6 48 89 d1 48 89 fa <f3> aa 48 89 d0 c5 f8 77 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90
      [   28.525198] RSP: 002b:00007fff1dea1c98 EFLAGS: 00010202
      [   28.525505] RAX: 000000000000001e RBX: 000000000014a000 RCX: 0000000000006046
      [   28.525895] RDX: 00007f14a0c82000 RSI: 000000000000001e RDI: 00007f14a0c8d000
      [   28.526290] RBP: 000000000000006f R08: 0000000000000004 R09: 000000000014a000
      [   28.526681] R10: 0000000000000008 R11: 0000000000000246 R12: 028f5c28f5c28f5c
      [   28.527067] R13: 8f5c28f5c28f5c29 R14: 0000000000011046 R15: 00007f14a0c946c0
      [   28.527449]  </TASK>
      [   28.527600] ---[ end trace 0000000000000000 ]---
      
      
      To be able to delete this entry, clear its dirty mark before
      invalidate_inode_pages2_range().
      
      [1] https://lore.kernel.org/linux-xfs/20230321151339.GA11376@frogsfrogsfrogs/
      
      Link: https://lkml.kernel.org/r/1679653680-2-1-git-send-email-ruansy.fnst@fujitsu.com
      Fixes: f80e1668 ("fsdax: invalidate pages when CoW")
      Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f76b3a32
    • Peter Xu's avatar
      mm/hugetlb: fix uffd wr-protection for CoW optimization path · 60d5b473
      Peter Xu authored
      This patch fixes an issue where a hugetlb uffd-wr-protected mapping can be
      writable even with the uffd-wp bit set.  It only happens with hugetlb
      private mappings, when someone first wr-protects a missing pte (which
      installs a pte marker) and then writes to the same page without any prior
      access to it.
      
      Userfaultfd-wp trap for hugetlb was implemented in hugetlb_fault() before
      reaching hugetlb_wp() to avoid taking more locks that userfault won't
      need.  However there's one CoW optimization path that can trigger
      hugetlb_wp() inside hugetlb_no_page(), which will bypass the trap.
      
      This patch skips hugetlb_wp() for CoW and retries the fault if uffd-wp bit
      is detected.  The new path will only trigger in the CoW optimization path
      because generic hugetlb_fault() (e.g.  when a present pte was
      wr-protected) will resolve the uffd-wp bit already.  Also make sure
      anonymous UNSHARE won't be affected and can still be resolved, IOW only
      skip CoW not CoR.
      
      This patch will be needed for v5.19+ hence copy stable.
      
      [peterx@redhat.com: v2]
        Link: https://lkml.kernel.org/r/ZBzOqwF2wrHgBVZb@x1n
      [peterx@redhat.com: v3]
        Link: https://lkml.kernel.org/r/20230324142620.2344140-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20230321191840.1897940-1-peterx@redhat.com
      Fixes: 166f3ecc ("mm/hugetlb: hook page faults for uffd write protection")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      60d5b473
    • Liam R. Howlett's avatar
      mm: enable maple tree RCU mode by default · 3dd44325
      Liam R. Howlett authored
      Use the maple tree in RCU mode for VMA tracking.
      
      The maple tree tracks the stack and is able to update the pivot
      (lower/upper boundary) in-place to allow the page fault handler to write
      to the tree while holding just the mmap read lock.  This is safe as the
      writes to the stack have a guard VMA which ensures there will always be a
      NULL in the direction of the growth and thus will only update a pivot.
      
      It is possible, but not recommended, to have VMAs that grow up/down
      without guard VMAs.  syzbot has constructed a testcase which sets up a VMA
      to grow and consume the empty space.  Overwriting the entire NULL entry
      causes the tree to be altered in a way that is not safe for concurrent
      readers; the readers may see a node being rewritten or one that does not
      match the maple state they are using.
      
      Enabling RCU mode allows the concurrent readers to see a stable node and
      will return the expected result.
      
      [Liam.Howlett@Oracle.com: we don't need to free the nodes with RCU]
      Link: https://lore.kernel.org/linux-mm/000000000000b0a65805f663ace6@google.com/
      Link: https://lkml.kernel.org/r/20230227173632.3292573-9-surenb@google.com
      Fixes: d4af56c5 ("mm: start tracking VMAs with maple tree")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reported-by: syzbot+8d95422d3537159ca390@syzkaller.appspotmail.com
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3dd44325
    • Liam R. Howlett's avatar
      maple_tree: add RCU lock checking to rcu callback functions · 790e1fa8
      Liam R. Howlett authored
      Dereferencing RCU objects within the RCU callback without the RCU check
      has caused lockdep to complain.  Fix the RCU dereferencing by using the
      RCU callback lock to ensure the operation is safe.
      
      Also stop creating a new lock to use for dereferencing during destruction
      of the tree or subtree.  Instead, pass through a pointer to the tree that
      has the lock that is held for RCU dereferencing checking.  It also does
      not make sense to use the maple state in the freeing scenario as the tree
      walk is a special case where the tree no longer has the normal encodings
      and parent pointers.
      
      Link: https://lkml.kernel.org/r/20230227173632.3292573-8-surenb@google.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reported-by: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      790e1fa8
    • Liam R. Howlett's avatar
      maple_tree: add smp_rmb() to dead node detection · 0a2b18d9
      Liam R. Howlett authored
      Add an smp_rmb() before reading the parent pointer to ensure that anything
      read from the node prior to the parent pointer hasn't been reordered ahead
      of this check.
      
      This is necessary for RCU mode.
      
      Link: https://lkml.kernel.org/r/20230227173632.3292573-7-surenb@google.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0a2b18d9
    • Liam R. Howlett's avatar
      maple_tree: fix write memory barrier of nodes once dead for RCU mode · c13af03d
      Liam R. Howlett authored
      During the development of the maple tree, the strategy of freeing multiple
      nodes changed and, in the process, the pivots were reused to store
      pointers to dead nodes.  To ensure the readers see accurate pivots, the
      writers need to mark the nodes as dead and call smp_wmb() to ensure any
      readers can identify the node as dead before using the pivot values.
      
      There were two places where the old method of marking the node as dead
      without smp_wmb() was being used, which resulted in RCU readers seeing
      the wrong pivot value before seeing the node was dead.  Fix this race
      condition by using mte_set_node_dead(), which includes the smp_wmb()
      call, to ensure the race is closed.
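
The writer/reader ordering described above can be modelled in user space with C11 fences. This is a sketch under stated assumptions, not the maple tree code: `struct node` and the `node_*()` helpers are hypothetical, a node is treated as dead when its parent pointer points to itself, and release/acquire fences play the roles of smp_wmb() and smp_rmb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tree node: "dead" is encoded as parent == self. */
struct node {
    _Atomic(struct node *) parent;
    _Atomic uint64_t pivot;
};

/* Writer: publish the dead mark before anything that follows. */
static void node_set_dead(struct node *n)
{
    atomic_store_explicit(&n->parent, n, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* role of smp_wmb() */
}

/* Writer: only reuse the pivot slot after the node is marked dead. */
static void node_retire(struct node *n, uint64_t scratch)
{
    node_set_dead(n);
    atomic_store_explicit(&n->pivot, scratch, memory_order_relaxed);
}

/* Reader: order earlier loads from the node before the dead check. */
static bool node_is_dead(struct node *n)
{
    atomic_thread_fence(memory_order_acquire);   /* role of smp_rmb() */
    return atomic_load_explicit(&n->parent, memory_order_relaxed) == n;
}

/* Reader: load the pivot first, then trust it only if the node is live. */
static bool node_read_pivot(struct node *n, uint64_t *out)
{
    uint64_t v = atomic_load_explicit(&n->pivot, memory_order_relaxed);

    if (node_is_dead(n))
        return false;    /* value may be a reused slot: discard it */
    *out = v;
    return true;
}
```

If a reader observes the scratch value written by `node_retire()`, the fence pairing guarantees its subsequent dead check also observes the self-pointing parent, so the stale pivot is never used.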
      
      Add a WARN_ON() to the ma_free_rcu() call to ensure all nodes being freed
      are marked as dead to ensure there are no other call paths besides the two
      updated paths.
      
      This is necessary for the RCU mode of the maple tree.
      
      Link: https://lkml.kernel.org/r/20230227173632.3292573-6-surenb@google.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c13af03d
    • Liam Howlett's avatar
      maple_tree: remove extra smp_wmb() from mas_dead_leaves() · 8372f4d8
      Liam Howlett authored
      The call to mte_set_node_dead() before the smp_wmb() already calls
      smp_wmb(), so the extra barrier is not needed.  This is an optimization
      for the RCU mode of the maple tree.
      
      Link: https://lkml.kernel.org/r/20230227173632.3292573-5-surenb@google.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Liam Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8372f4d8