1. 31 Mar, 2022 1 commit
    • Andrew Price's avatar
      gfs2: Make sure FITRIM minlen is rounded up to fs block size · 27ca8273
      Andrew Price authored
      Per fstrim(8) we must round up the minlen argument to the fs block size.
      The current calculation doesn't take into account devices that have a
      discard granularity and requested minlen less than 1 fs block, so the
      value can get shifted away to zero in the translation to fs blocks.
      
      The zero minlen passed to gfs2_rgrp_send_discards() then allows
      sb_issue_discard() to be called with nr_sects == 0 which returns -EINVAL
      and results in gfs2_rgrp_send_discards() returning -EIO.
      
      Make sure minlen is never < 1 fs block by taking the max of the
      requested minlen and the fs block size before comparing to the device's
      discard granularity and shifting to fs blocks.
      
      Fixes: 076f0faa ("GFS2: Fix FITRIM argument handling")
      Signed-off-by: default avatarAndrew Price <anprice@redhat.com>
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      27ca8273
  2. 24 Mar, 2022 3 commits
  3. 23 Mar, 2022 3 commits
    • Andreas Gruenbacher's avatar
      gfs2: Minor retry logic cleanup · 124c458a
      Andreas Gruenbacher authored
      Clean up the retry logic in the read and write functions somewhat.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      124c458a
    • Andreas Gruenbacher's avatar
      gfs2: Disable page faults during lockless buffered reads · 52f3f033
      Andreas Gruenbacher authored
      During lockless buffered reads, filemap_read() holds page cache page
      references while trying to copy data to the user-space buffer.  The
      calling process isn't holding the inode glock, but the page references
      it holds prevent those pages from being removed from the page cache, and
      that prevents the underlying inode glock from being moved to another
      node.  Thus, we can end up in the same kinds of distributed deadlock
      situations as with normal (non-lockless) buffered reads.
      
      Fix that by disabling page faults during lockless reads as well.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      52f3f033
    • Andreas Gruenbacher's avatar
      gfs2: Fix should_fault_in_pages() logic · bb7f5d96
      Andreas Gruenbacher authored
      Fix the fault-in window size logic:
      * Use a maximum window size of 1 MiB instead of BIO_MAX_VECS * PAGE_SIZE.
        The previous window size was always one page because the pages variable
        was accidentally being defined and then redefined in
        should_fault_in_pages().
      * The nr_dirtied heuristic for guessing when there might be memory
        pressure often results in very small window sizes.  Don't let
        nr_dirtied drop below 8 pages (as btrfs does).
      * Compute the window size in units of bytes, not pages.
      * Account for page overlap (unaligned iterators).
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      bb7f5d96
  4. 03 Mar, 2022 1 commit
  5. 15 Feb, 2022 7 commits
    • Andreas Gruenbacher's avatar
      gfs2: Initialize gh_error in gfs2_glock_nq · a4e8145e
      Andreas Gruenbacher authored
      The gh_error field if a glock holder is initialized to zero in
      gfs2_holder_init().  When a locking operation fails, gh_error is set to
      an error code; when it succeeds, the gh_error value is left unchanged.
      The field isn't initialized in gfs2_holder_reinit(), which is a problem.
      Instead of fixing that directly, initialize gh_error in gfs2_glock_nq().
      That also obsoletes the assignment in do_flock().
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      a4e8145e
    • Andreas Gruenbacher's avatar
      gfs2: Make use of list_is_first · 5a27a43e
      Andreas Gruenbacher authored
      Use list_is_first() instead of open-coding it.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      5a27a43e
    • Andreas Gruenbacher's avatar
      gfs2: Switch lock order of inode and iopen glock · 29464ee3
      Andreas Gruenbacher authored
      This patch tries to fix the continual ABBA deadlocks we keep having
      between the iopen and inode glocks. This switches the lock order in
      gfs2_inode_lookup and gfs2_create_inode so the iopen glock is always
      locked first.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      29464ee3
    • Andreas Gruenbacher's avatar
      gfs2: cancel timed-out glock requests · 1fc05c8d
      Andreas Gruenbacher authored
      The gfs2 evict code tries to upgrade the iopen glock from SH to EX. If
      the attempt to upgrade times out, gfs2 needs to tell dlm to cancel the
      lock request or it can deadlock. We also need to wake up the process
      waiting for the lock when dlm sends its AST back to gfs2.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      1fc05c8d
    • Andreas Gruenbacher's avatar
      gfs2: Expect -EBUSY after canceling dlm locking requests · a892b123
      Andreas Gruenbacher authored
      Due to the asynchronous nature of the dlm api, when we request a pending
      locking request to be canceled with dlm_unlock(DLM_LKF_CANCEL), the
      locking request will either complete before it could be canceled, or the
      cancellation will succeed.  In either case, gdlm_ast will be called once
      and the status will indicate the outcome of the locking request, with
      -DLM_ECANCEL indicating a canceled request.
      
      Inside dlm, when a locking request completes before its cancel request
      could be processed, gdlm_ast will be called, but the lock will still be
      considered busy until a DLM_MSG_CANCEL_REPLY message completes the
      cancel request.  During that time, successive dlm_lock() or dlm_unlock()
      requests for that lock will return -EBUSY.  In other words, waiting for
      the gdlm_ast call before issuing the next locking request is not enough.
      There is no way of waiting for a cancel request to actually complete,
      either.
      
      We rarely cancel locking requests, but when we do, we don't know when
      the next locking request for that lock will occur.  This means that any
      dlm_lock() or dlm_unlock() call can potentially return -EBUSY.  When
      that happens, this patch simply repeats the request after a short pause.
      
      This workaround could be improved upon by tracking for which dlm locks
      cancel requests have been issued, but that isn't strictly necessary and
      it would complicate the code.  We haven't seen -EBUSY errors from dlm
      without cancel requests.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      a892b123
    • Andreas Gruenbacher's avatar
      gfs2: gfs2_setattr_size error path fix · 7336905a
      Andreas Gruenbacher authored
      When gfs2_setattr_size() fails, it calls gfs2_rs_delete(ip, NULL) to get
      rid of any reservations the inode may have.  Instead, it should pass in
      the inode's write count as the second parameter to allow
      gfs2_rs_delete() to figure out if the inode has any writers left.
      
      In a next step, there are two instances of gfs2_rs_delete(ip, NULL) left
      where we know that there can be no other users of the inode.  Replace
      those with gfs2_rs_deltree(&ip->i_res) to avoid the unnecessary write
      count check.
      
      With that, gfs2_rs_delete() is only called with the inode's actual write
      count, so get rid of the second parameter.
      
      Fixes: a097dc7e ("GFS2: Make rgrp reservations part of the gfs2_inode structure")
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      7336905a
    • Bob Peterson's avatar
      gfs2: assign rgrp glock before compute_bitstructs · 428f651c
      Bob Peterson authored
      Before this patch, function read_rindex_entry called compute_bitstructs
      before it allocated a glock for the rgrp. But if compute_bitstructs found
      a problem with the rgrp, it called gfs2_consist_rgrpd, and that called
      gfs2_dump_glock for rgd->rd_gl which had not yet been assigned.
      
      read_rindex_entry
         compute_bitstructs
            gfs2_consist_rgrpd
               gfs2_dump_glock <---------rgd->rd_gl was not set.
      
      This patch changes read_rindex_entry so it assigns an rgrp glock before
      calling compute_bitstructs so gfs2_dump_glock does not reference an
      unassigned pointer. If an error is discovered, the glock must also be
      put, so a new goto and label were added.
      
      Reported-by: syzbot+c6fd14145e2f62ca0784@syzkaller.appspotmail.com
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      428f651c
  6. 13 Feb, 2022 9 commits
  7. 12 Feb, 2022 16 commits
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · b81b1829
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Two minor fixes in the lpfc driver. One changing the classification of
        trace messages and the other fixing a build issue when NVME_FC is
        disabled"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: lpfc: Reduce log messages seen after firmware download
        scsi: lpfc: Remove NVMe support if kernel has NVME_FC disabled
      b81b1829
    • Linus Torvalds's avatar
      Merge tag 'char-misc-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 080eba78
      Linus Torvalds authored
      Pull char/misc driver fixes from Greg KH:
       "Here are a small number of char/misc driver fixes for 5.17-rc4 for
        reported issues. They contain:
      
         - phy driver fixes
      
         - iio driver fix
      
         - eeprom driver fix
      
         - speakup regression fix
      
         - fastrpc fix
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'char-misc-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        iio: buffer: Fix file related error handling in IIO_BUFFER_GET_FD_IOCTL
        speakup-dectlk: Restore pitch setting
        bus: mhi: pci_generic: Add mru_default for Cinterion MV31-W
        bus: mhi: pci_generic: Add mru_default for Foxconn SDX55
        eeprom: ee1004: limit i2c reads to I2C_SMBUS_BLOCK_MAX
        misc: fastrpc: avoid double fput() on failed usercopy
        phy: dphy: Correct clk_pre parameter
        phy: phy-mtk-tphy: Fix duplicated argument in phy-mtk-tphy
        phy: stm32: fix a refcount leak in stm32_usbphyc_pll_enable()
        phy: xilinx: zynqmp: Fix bus width setting for SGMII
        phy: cadence: Sierra: fix error handling bugs in probe()
        phy: ti: Fix missing sentinel for clk_div_table
        phy: broadcom: Kconfig: Fix PHY_BRCM_USB config option
        phy: usb: Leave some clocks running during suspend
      080eba78
    • Linus Torvalds's avatar
      Merge tag 'staging-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · dcd72f54
      Linus Torvalds authored
      Pullstaging driver fixes from Greg KH:
       "Here are two staging driver fixes for 5.17-rc4.  These are:
      
         - fbtft error path fix
      
         - vc04_services rcu dereference fix
      
        Both of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'staging-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: fbtft: Fix error path in fbtft_driver_module_init()
        staging: vc04_services: Fix RCU dereference check
      dcd72f54
    • Linus Torvalds's avatar
      Merge tag 'tty-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 522e7d03
      Linus Torvalds authored
      Pull tty/serial fixes from Greg KH:
       "Here are four small tty/serial fixes for 5.17-rc4.  They are:
      
         - 8250_pericom change revert to fix a reported regression
      
         - two speculation fixes for vt_ioctl
      
         - n_tty regression fix for polling
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'tty-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        vt_ioctl: add array_index_nospec to VT_ACTIVATE
        vt_ioctl: fix array_index_nospec in vt_setactivate
        serial: 8250_pericom: Revert "Re-enable higher baud rates"
        n_tty: wake up poll(POLLRDNORM) on receiving data
      522e7d03
    • Linus Torvalds's avatar
      Merge tag 'usb-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 85187378
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are some small USB driver fixes for 5.17-rc4 that resolve some
        reported issues and add new device ids:
      
         - usb-serial new device ids
      
         - ulpi cleanup fixes
      
         - f_fs use-after-free fix
      
         - dwc3 driver fixes
      
         - ax88179_178a usb network driver fix
      
         - usb gadget fixes
      
        There is a revert at the end of this series to resolve a build problem
        that 0-day found yesterday. Most of these have been in linux-next,
        except for the last few, and all have now passed 0-day tests"
      
      * tag 'usb-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        Revert "usb: dwc2: drd: fix soft connect when gadget is unconfigured"
        usb: dwc2: drd: fix soft connect when gadget is unconfigured
        usb: gadget: rndis: check size of RNDIS_MSG_SET command
        USB: gadget: validate interface OS descriptor requests
        usb: core: Unregister device on component_add() failure
        net: usb: ax88179_178a: Fix out-of-bounds accesses in RX fixup
        usb: dwc3: gadget: Prevent core from processing stale TRBs
        USB: serial: cp210x: add CPI Bulk Coin Recycler id
        USB: serial: cp210x: add NCR Retail IO box id
        USB: serial: ftdi_sio: add support for Brainboxes US-159/235/320
        usb: gadget: f_uac2: Define specific wTerminalType
        usb: gadget: udc: renesas_usb3: Fix host to USB_ROLE_NONE transition
        usb: raw-gadget: fix handling of dual-direction-capable endpoints
        usb: usb251xb: add boost-up property support
        usb: ulpi: Call of_node_put correctly
        usb: ulpi: Move of_node_put to ulpi_dev_release
        USB: serial: option: add ZTE MF286D modem
        USB: serial: ch341: add support for GW Instek USB2.0-Serial devices
        usb: f_fs: Fix use-after-free for epfile
        usb: dwc3: xilinx: fix uninitialized return value
      85187378
    • Linus Torvalds's avatar
      Merge tag 's390-5.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · a4fd49cd
      Linus Torvalds authored
      Pull s390 updates from Vasily Gorbik:
       "Maintainers and reviewers changes:
      
          - Add Alexander Gordeev as maintainer for s390.
      
          - Christian Borntraeger will focus on s390 KVM maintainership and
            stays as s390 reviewer.
      
        Fixes:
      
         - Fix clang build of modules loader KUnit test.
      
         - Fix kernel panic in CIO code on FCES path-event when no driver is
           attached to a device or the driver does not provide the path_event
           function"
      
      * tag 's390-5.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/cio: verify the driver availability for path_event call
        s390/module: fix building test_modules_helpers.o with clang
        MAINTAINERS: downgrade myself to Reviewer for s390
        MAINTAINERS: add Alexander Gordeev as maintainer for s390
      a4fd49cd
    • Linus Torvalds's avatar
      Merge tag 'for-linus-5.17a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 4a387c98
      Linus Torvalds authored
      Pull xen fixes from Juergen Gross:
      
       - Two small cleanups
      
       - Another fix for addressing the EFI framebuffer above 4GB when running
         as Xen dom0
      
       - A patch to let Xen guests use reserved bits in MSI- and IO-APIC-
         registers for extended APIC-IDs the same way KVM guests are doing it
         already
      
      * tag 'for-linus-5.17a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen/pci: Make use of the helper macro LIST_HEAD()
        xen/x2apic: Fix inconsistent indenting
        xen/x86: detect support for extended destination ID
        xen/x86: obtain full video frame buffer address for Dom0 also under EFI
      4a387c98
    • Linus Torvalds's avatar
      Merge tag 'seccomp-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · eef8cffc
      Linus Torvalds authored
      Pull seccomp fixes from Kees Cook:
       "This fixes a corner case of fatal SIGSYS being ignored since v5.15.
        Along with the signal fix is a change to seccomp so that seeing
        another syscall after a fatal filter result will cause seccomp to kill
        the process harder.
      
        Summary:
      
         - Force HANDLER_EXIT even for SIGNAL_UNKILLABLE
      
         - Make seccomp self-destruct after fatal filter results
      
         - Update seccomp samples for easier behavioral demonstration"
      
      * tag 'seccomp-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        samples/seccomp: Adjust sample to also provide kill option
        seccomp: Invalidate seccomp mode to catch death failures
        signal: HANDLER_EXIT should clear SIGNAL_UNKILLABLE
      eef8cffc
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 9917ff5f
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "5 patches.
      
        Subsystems affected by this patch series: binfmt, procfs, and mm
        (vmscan, memcg, and kfence)"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        kfence: make test case compatible with run time set sample interval
        mm: memcg: synchronize objcg lists with a dedicated spinlock
        mm: vmscan: remove deadlock due to throttling failing to make progress
        fs/proc: task_mmu.c: don't read mapcount for migration entry
        fs/binfmt_elf: fix PT_LOAD p_align values for loaders
      9917ff5f
    • Jing Leng's avatar
      kconfig: fix failing to generate auto.conf · 1b9e740a
      Jing Leng authored
      When the KCONFIG_AUTOCONFIG is specified (e.g. export \
      KCONFIG_AUTOCONFIG=output/config/auto.conf), the directory of
      include/config/ will not be created, so kconfig can't create deps
      files in it and auto.conf can't be generated.
      Signed-off-by: default avatarJing Leng <jleng@ambarella.com>
      Signed-off-by: default avatarMasahiro Yamada <masahiroy@kernel.org>
      1b9e740a
    • Greg Kroah-Hartman's avatar
      Revert "usb: dwc2: drd: fix soft connect when gadget is unconfigured" · 736e8d89
      Greg Kroah-Hartman authored
      This reverts commit 269cbcf7.
      
      It causes build errors as reported by the kernel test robot.
      
      Link: https://lore.kernel.org/r/202202112236.AwoOTtHO-lkp@intel.comReported-by: default avatarkernel test robot <lkp@intel.com>
      Fixes: 269cbcf7 ("usb: dwc2: drd: fix soft connect when gadget is unconfigured")
      Cc: stable@kernel.org
      Cc: Amelie Delaunay <amelie.delaunay@foss.st.com>
      Cc: Minas Harutyunyan <Minas.Harutyunyan@synopsys.com>
      Cc: Fabrice Gasnier <fabrice.gasnier@foss.st.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      736e8d89
    • Peng Liu's avatar
      kfence: make test case compatible with run time set sample interval · 8913c610
      Peng Liu authored
      The parameter kfence_sample_interval can be set via boot parameter and
      late shell command, which is convenient for automated tests and KFENCE
      parameter optimization.  However, KFENCE test case just uses
      compile-time CONFIG_KFENCE_SAMPLE_INTERVAL, which will make KFENCE test
      case not run as users desired.  Export kfence_sample_interval, so that
      KFENCE test case can use run-time-set sample interval.
      
      Link: https://lkml.kernel.org/r/20220207034432.185532-1-liupeng256@huawei.comSigned-off-by: default avatarPeng Liu <liupeng256@huawei.com>
      Reviewed-by: default avatarMarco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Christian Knig <christian.koenig@amd.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8913c610
    • Roman Gushchin's avatar
      mm: memcg: synchronize objcg lists with a dedicated spinlock · 0764db9b
      Roman Gushchin authored
      Alexander reported a circular lock dependency revealed by the mmap1 ltp
      test:
      
        LOCKDEP_CIRCULAR (suite: ltp, case: mtest06 (mmap1))
                WARNING: possible circular locking dependency detected
                5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1 Not tainted
                ------------------------------------------------------
                mmap1/202299 is trying to acquire lock:
                00000001892c0188 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x4a/0xe0
                but task is already holding lock:
                00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
                which lock already depends on the new lock.
                the existing dependency chain (in reverse order) is:
                -> #1 (&sighand->siglock){-.-.}-{2:2}:
                       __lock_acquire+0x604/0xbd8
                       lock_acquire.part.0+0xe2/0x238
                       lock_acquire+0xb0/0x200
                       _raw_spin_lock_irqsave+0x6a/0xd8
                       __lock_task_sighand+0x90/0x190
                       cgroup_freeze_task+0x2e/0x90
                       cgroup_migrate_execute+0x11c/0x608
                       cgroup_update_dfl_csses+0x246/0x270
                       cgroup_subtree_control_write+0x238/0x518
                       kernfs_fop_write_iter+0x13e/0x1e0
                       new_sync_write+0x100/0x190
                       vfs_write+0x22c/0x2d8
                       ksys_write+0x6c/0xf8
                       __do_syscall+0x1da/0x208
                       system_call+0x82/0xb0
                -> #0 (css_set_lock){..-.}-{2:2}:
                       check_prev_add+0xe0/0xed8
                       validate_chain+0x736/0xb20
                       __lock_acquire+0x604/0xbd8
                       lock_acquire.part.0+0xe2/0x238
                       lock_acquire+0xb0/0x200
                       _raw_spin_lock_irqsave+0x6a/0xd8
                       obj_cgroup_release+0x4a/0xe0
                       percpu_ref_put_many.constprop.0+0x150/0x168
                       drain_obj_stock+0x94/0xe8
                       refill_obj_stock+0x94/0x278
                       obj_cgroup_charge+0x164/0x1d8
                       kmem_cache_alloc+0xac/0x528
                       __sigqueue_alloc+0x150/0x308
                       __send_signal+0x260/0x550
                       send_signal+0x7e/0x348
                       force_sig_info_to_task+0x104/0x180
                       force_sig_fault+0x48/0x58
                       __do_pgm_check+0x120/0x1f0
                       pgm_check_handler+0x11e/0x180
                other info that might help us debug this:
                 Possible unsafe locking scenario:
                       CPU0                    CPU1
                       ----                    ----
                  lock(&sighand->siglock);
                                               lock(css_set_lock);
                                               lock(&sighand->siglock);
                  lock(css_set_lock);
                 *** DEADLOCK ***
                2 locks held by mmap1/202299:
                 #0: 00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
                 #1: 00000001892ad560 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x168
                stack backtrace:
                CPU: 15 PID: 202299 Comm: mmap1 Not tainted 5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1
                Hardware name: IBM 3906 M04 704 (LPAR)
                Call Trace:
                  dump_stack_lvl+0x76/0x98
                  check_noncircular+0x136/0x158
                  check_prev_add+0xe0/0xed8
                  validate_chain+0x736/0xb20
                  __lock_acquire+0x604/0xbd8
                  lock_acquire.part.0+0xe2/0x238
                  lock_acquire+0xb0/0x200
                  _raw_spin_lock_irqsave+0x6a/0xd8
                  obj_cgroup_release+0x4a/0xe0
                  percpu_ref_put_many.constprop.0+0x150/0x168
                  drain_obj_stock+0x94/0xe8
                  refill_obj_stock+0x94/0x278
                  obj_cgroup_charge+0x164/0x1d8
                  kmem_cache_alloc+0xac/0x528
                  __sigqueue_alloc+0x150/0x308
                  __send_signal+0x260/0x550
                  send_signal+0x7e/0x348
                  force_sig_info_to_task+0x104/0x180
                  force_sig_fault+0x48/0x58
                  __do_pgm_check+0x120/0x1f0
                  pgm_check_handler+0x11e/0x180
                INFO: lockdep is turned off.
      
      In this example a slab allocation from __send_signal() caused a
      refilling and draining of a percpu objcg stock, resulted in a releasing
      of another non-related objcg.  Objcg release path requires taking the
      css_set_lock, which is used to synchronize objcg lists.
      
      This can create a circular dependency with the sighandler lock, which is
      taken with the locked css_set_lock by the freezer code (to freeze a
      task).
      
      In general it seems that using css_set_lock to synchronize objcg lists
      makes any slab allocations and deallocation with the locked css_set_lock
      and any intervened locks risky.
      
      To fix the problem and make the code more robust let's stop using
      css_set_lock to synchronize objcg lists and use a new dedicated spinlock
      instead.
      
      Link: https://lkml.kernel.org/r/Yfm1IHmoGdyUR81T@carbon.dhcp.thefacebook.com
      Fixes: bf4f0599 ("mm: memcg/slab: obj_cgroup API")
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Reported-by: default avatarAlexander Egorenkov <egorenar@linux.ibm.com>
      Tested-by: default avatarAlexander Egorenkov <egorenar@linux.ibm.com>
      Reviewed-by: default avatarWaiman Long <longman@redhat.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarJeremy Linton <jeremy.linton@arm.com>
      Tested-by: default avatarJeremy Linton <jeremy.linton@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0764db9b
    • Mel Gorman's avatar
      mm: vmscan: remove deadlock due to throttling failing to make progress · b485c6f1
      Mel Gorman authored
      A soft lockup bug in kcompactd was reported in a private bugzilla with
      the following visible in dmesg;
      
        watchdog: BUG: soft lockup - CPU#33 stuck for 26s! [kcompactd0:479]
        watchdog: BUG: soft lockup - CPU#33 stuck for 52s! [kcompactd0:479]
        watchdog: BUG: soft lockup - CPU#33 stuck for 78s! [kcompactd0:479]
        watchdog: BUG: soft lockup - CPU#33 stuck for 104s! [kcompactd0:479]
      
      The machine had 256G of RAM with no swap and an earlier failed
      allocation indicated that node 0 where kcompactd was run was potentially
      unreclaimable;
      
        Node 0 active_anon:29355112kB inactive_anon:2913528kB active_file:0kB
          inactive_file:0kB unevictable:64kB isolated(anon):0kB isolated(file):0kB
          mapped:8kB dirty:0kB writeback:0kB shmem:26780kB shmem_thp:
          0kB shmem_pmdmapped: 0kB anon_thp: 23480320kB writeback_tmp:0kB
          kernel_stack:2272kB pagetables:24500kB all_unreclaimable? yes
      
      Vlastimil Babka investigated a crash dump and found that a task
      migrating pages was trying to drain PCP lists;
      
        PID: 52922  TASK: ffff969f820e5000  CPU: 19  COMMAND: "kworker/u128:3"
        Call Trace:
           __schedule
           schedule
           schedule_timeout
           wait_for_completion
           __flush_work
           __drain_all_pages
           __alloc_pages_slowpath.constprop.114
           __alloc_pages
           alloc_migration_target
           migrate_pages
           migrate_to_node
           do_migrate_pages
           cpuset_migrate_mm_workfn
           process_one_work
           worker_thread
           kthread
           ret_from_fork
      
      This failure is specific to CONFIG_PREEMPT=n builds.  The root of the
      problem is that kcompact0 is not rescheduling on a CPU while a task that
      has isolated a large number of the pages from the LRU is waiting on
      kcompact0 to reschedule so the pages can be released.  While
      shrink_inactive_list() only loops once around too_many_isolated, reclaim
      can continue without rescheduling if sc->skipped_deactivate == 1 which
      could happen if there was no file LRU and the inactive anon list was not
      low.
      
      Link: https://lkml.kernel.org/r/20220203100326.GD3301@suse.de
      Fixes: d818fca1 ("mm/vmscan: throttle reclaim and compaction when too may pages are isolated")
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Debugged-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b485c6f1
    • Yang Shi's avatar
      fs/proc: task_mmu.c: don't read mapcount for migration entry · 24d7275c
      Yang Shi authored
      The syzbot reported the below BUG:
      
        kernel BUG at include/linux/page-flags.h:785!
        invalid opcode: 0000 [#1] PREEMPT SMP KASAN
        CPU: 1 PID: 4392 Comm: syz-executor560 Not tainted 5.16.0-rc6-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:PageDoubleMap include/linux/page-flags.h:785 [inline]
        RIP: 0010:__page_mapcount+0x2d2/0x350 mm/util.c:744
        Call Trace:
          page_mapcount include/linux/mm.h:837 [inline]
          smaps_account+0x470/0xb10 fs/proc/task_mmu.c:466
          smaps_pte_entry fs/proc/task_mmu.c:538 [inline]
          smaps_pte_range+0x611/0x1250 fs/proc/task_mmu.c:601
          walk_pmd_range mm/pagewalk.c:128 [inline]
          walk_pud_range mm/pagewalk.c:205 [inline]
          walk_p4d_range mm/pagewalk.c:240 [inline]
          walk_pgd_range mm/pagewalk.c:277 [inline]
          __walk_page_range+0xe23/0x1ea0 mm/pagewalk.c:379
          walk_page_vma+0x277/0x350 mm/pagewalk.c:530
          smap_gather_stats.part.0+0x148/0x260 fs/proc/task_mmu.c:768
          smap_gather_stats fs/proc/task_mmu.c:741 [inline]
          show_smap+0xc6/0x440 fs/proc/task_mmu.c:822
          seq_read_iter+0xbb0/0x1240 fs/seq_file.c:272
          seq_read+0x3e0/0x5b0 fs/seq_file.c:162
          vfs_read+0x1b5/0x600 fs/read_write.c:479
          ksys_read+0x12d/0x250 fs/read_write.c:619
          do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The reproducer was trying to read /proc/$PID/smaps when calling
      MADV_FREE at the mean time.  MADV_FREE may split THPs if it is called
      for partial THP.  It may trigger the below race:
      
                 CPU A                         CPU B
                 -----                         -----
        smaps walk:                      MADV_FREE:
        page_mapcount()
          PageCompound()
                                         split_huge_page()
          page = compound_head(page)
          PageDoubleMap(page)
      
      When calling PageDoubleMap() this page is not a tail page of THP anymore
      so the BUG is triggered.
      
      This could be fixed by elevated refcount of the page before calling
      mapcount, but that would prevent it from counting migration entries, and
      it seems overkilling because the race just could happen when PMD is
      split so all PTE entries of tail pages are actually migration entries,
      and smaps_account() does treat migration entries as mapcount == 1 as
      Kirill pointed out.
      
      Add a new parameter for smaps_account() to tell this entry is migration
      entry then skip calling page_mapcount().  Don't skip getting mapcount
      for device private entries since they do track references with mapcount.
      
      Pagemap also has the similar issue although it was not reported.  Fixed
      it as well.
      
      [shy828301@gmail.com: v4]
        Link: https://lkml.kernel.org/r/20220203182641.824731-1-shy828301@gmail.com
      [nathan@kernel.org: avoid unused variable warning in pagemap_pmd_range()]
        Link: https://lkml.kernel.org/r/20220207171049.1102239-1-nathan@kernel.org
      Link: https://lkml.kernel.org/r/20220120202805.3369-1-shy828301@gmail.com
      Fixes: e9b61f19 ("thp: reintroduce split_huge_page()")
      Signed-off-by: default avatarYang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarNathan Chancellor <nathan@kernel.org>
      Reported-by: syzbot+1f52b3a18d5633fa7f82@syzkaller.appspotmail.com
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      24d7275c
    • Mike Rapoport's avatar
      fs/binfmt_elf: fix PT_LOAD p_align values for loaders · 925346c1
      Mike Rapoport authored
      Rui Salvaterra reported that Aisleroit solitaire crashes with "Wrong
      __data_start/_end pair" assertion from libgc after update to v5.17-rc1.
      
      Bisection pointed to commit 9630f0d6 ("fs/binfmt_elf: use PT_LOAD
      p_align values for static PIE") that fixed handling of static PIEs, but
      made the condition that guards load_bias calculation to exclude loader
      binaries.
      
      Restoring the check for presence of interpreter fixes the problem.
      
      Link: https://lkml.kernel.org/r/20220202121433.3697146-1-rppt@kernel.org
      Fixes: 9630f0d6 ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE")
      Signed-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Reported-by: default avatarRui Salvaterra <rsalvaterra@gmail.com>
      Tested-by: default avatarRui Salvaterra <rsalvaterra@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: "H.J. Lu" <hjl.tools@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      925346c1