1. 08 May, 2022 11 commits
  2. 07 May, 2022 3 commits
  3. 06 May, 2022 18 commits
    • Merge tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 4b97bac0
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
       "Regression fixes in zone activation:
      
         - move a loop invariant out of the loop to avoid checking space
           status
      
         - properly handle unlimited activation
      
        Other fixes:
      
         - for subpage, force the free space v2 mount to avoid a warning and
           make it easy to switch a filesystem on different page size systems
      
         - export sysfs status of exclusive operation 'balance paused', so the
           user space tools can recognize it and allow adding a device with
           paused balance
      
         - fix assertion failure when logging directory key range item"
      
      * tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: sysfs: export the balance paused state of exclusive operation
        btrfs: fix assertion failure when logging directory key range item
        btrfs: zoned: activate block group properly on unlimited active zone device
        btrfs: zoned: move non-changing condition check out of the loop
        btrfs: force v2 space cache usage for subpage mount
    • Merge tag 'nfs-for-5.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · adcffc17
      Linus Torvalds authored
      Pull NFS client fixes from Trond Myklebust:
       "Highlights include:
      
        Stable fixes:
      
         - Fix a socket leak when setting up an AF_LOCAL RPC client
      
         - Ensure that knfsd connects to the gss-proxy daemon on setup
      
        Bugfixes:
      
         - Fix a refcount leak when migrating a task off an offlined transport
      
         - Don't gratuitously invalidate inode attributes on delegation return
      
         - Don't leak sockets in xs_local_connect()
      
         - Ensure timely close of disconnected AF_LOCAL sockets"
      
      * tag 'nfs-for-5.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        Revert "SUNRPC: attempt AF_LOCAL connect on setup"
        SUNRPC: Ensure gss-proxy connects on setup
        SUNRPC: Ensure timely close of disconnected AF_LOCAL sockets
        SUNRPC: Don't leak sockets in xs_local_connect()
        NFSv4: Don't invalidate inode attributes on delegation return
        SUNRPC release the transport of a relocated task with an assigned transport
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · bce58da1
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "x86:
      
         - Account for family 17h event renumberings in AMD PMU emulation
      
         - Remove CPUID leaf 0xA on AMD processors
      
         - Fix lockdep issue with locking all vCPUs
      
         - Fix loss of A/D bits in SPTEs
      
         - Fix syzkaller issue with invalid guest state"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: VMX: Exit to userspace if vCPU has injected exception and invalid state
        KVM: SEV: Mark nested locking of vcpu->lock
        kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMU
        KVM: x86/svm: Account for family 17h event renumberings in amd_pmc_perf_hw_id
        KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
        KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits()
        KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D)
    • Merge tag 'riscv-for-linus-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 497fe3bb
      Linus Torvalds authored
      Pull RISC-V fix from Palmer Dabbelt:
      
       - A fix to relocate the DTB early in boot, in cases where the
         bootloader doesn't put the DTB in a region that will end up
         mapped by the kernel.
      
         This manifests as a crash early in boot on a handful of
         configurations.
      
      * tag 'riscv-for-linus-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        RISC-V: relocate DTB if it's outside memory region
    • KVM: VMX: Exit to userspace if vCPU has injected exception and invalid state · 053d2290
      Sean Christopherson authored
      Exit to userspace with an emulation error if KVM encounters an injected
      exception with invalid guest state, in addition to the existing check of
      bailing if there's a pending exception (KVM doesn't support emulating
      exceptions except when emulating real mode via vm86).
      
      In theory, KVM should never get to such a situation as KVM is supposed to
      exit to userspace before injecting an exception with invalid guest state.
      But in practice, userspace can intervene and manually inject an exception
      and/or stuff registers to force invalid guest state while a previously
      injected exception is awaiting reinjection.
      
      Fixes: fc4fad79 ("KVM: VMX: Reject KVM_RUN if emulation is required with pending exception")
      Reported-by: syzbot+cfafed3bb76d3e37581b@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220502221850.131873-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SEV: Mark nested locking of vcpu->lock · 0c2c7c06
      Peter Gonda authored
      svm_vm_migrate_from() uses sev_lock_vcpus_for_migration() to lock all
      source and target vcpu->locks. Unfortunately there is an 8 subclass
      limit, so a new subclass cannot be used for each vCPU. Instead maintain
      ownership of the first vcpu's mutex.dep_map using a role specific
      subclass: source vs target. Release the other vcpu's mutex.dep_maps.
      
      Fixes: b5663931 ("KVM: SEV: Add support for SEV intra host migration")
      Reported-by: John Sperbeck <jsperbeck@google.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Peter Gonda <pgonda@google.com>
      
      Message-Id: <20220502165807.529624-1-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 4df22ca8
      Linus Torvalds authored
      Pull rdma fixes from Jason Gunthorpe:
       "A few recent regressions in rxe's multicast code, and some old driver
        bugs:
      
         - Error case unwind bug in rxe for rkeys
      
         - Do not call netdev functions under a spinlock in rxe multicast
           code
      
         - Use the proper BH lock type in rxe multicast code
      
         - Fix irdma deadlock and crash
      
         - Add a missing flush to drain irdma QPs when in error
      
         - Fix high userspace latency in irdma during destroy due to
           synchronize_rcu()
      
         - Rare race in siw MPA processing"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        RDMA/rxe: Change mcg_lock to a _bh lock
        RDMA/rxe: Do not call  dev_mc_add/del() under a spinlock
        RDMA/siw: Fix a condition race issue in MPA request processing
        RDMA/irdma: Fix possible crash due to NULL netdev in notifier
        RDMA/irdma: Reduce iWARP QP destroy time
        RDMA/irdma: Flush iWARP QP if modified to ERR from RTR state
        RDMA/rxe: Recheck the MR in when generating a READ reply
        RDMA/irdma: Fix deadlock in irdma_cleanup_cm_core()
        RDMA/rxe: Fix "Replace mr by rkey in responder resources"
    • Merge tag 'mmc-v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · 64267926
      Linus Torvalds authored
      Pull mmc fixes from Ulf Hansson:
       "MMC core:
      
         - Fix initialization for eMMC's HS200/HS400 mode
      
        MMC host:
      
         - sdhci-msm: Reset GCC_SDCC_BCR register to prevent timeout issues
      
         - sunxi-mmc: Fix DMA descriptors allocated above 32 bits"
      
      * tag 'mmc-v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
        mmc: sdhci-msm: Reset GCC_SDCC_BCR register for SDHC
        mmc: sunxi-mmc: Fix DMA descriptors allocated above 32 bits
        mmc: core: Set HS clock speed before sending HS CMD13
    • Merge tag 'drm-fixes-2022-05-06' of git://anongit.freedesktop.org/drm/drm · 5fa576d7
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "A pretty quiet week, one fbdev, msm, kconfig, and two amdgpu fixes,
        about what I'd expect for rc6.
      
        fbdev:
      
         - hotunplugging fix
      
        amdgpu:
      
         - Fix a xen dom0 regression on APUs
      
         - Fix a potential array overflow if a receiver were to send an
           erroneous audio channel count
      
        msm:
      
         - lockdep fix.
      
        it6505:
      
         - kconfig fix"
      
      * tag 'drm-fixes-2022-05-06' of git://anongit.freedesktop.org/drm/drm:
        drm/amd/display: Avoid reading audio pattern past AUDIO_CHANNELS_COUNT
        drm/amdgpu: do not use passthrough mode in Xen dom0
        drm/bridge: ite-it6505: add missing Kconfig option select
        fbdev: Make fb_release() return -ENODEV if fbdev was unregistered
        drm/msm/dp: remove fail safe mode related code
    • gpio: pca953x: fix irq_stat not updated when irq is disabled (irq_mask not set) · dba78579
      Puyou Lu authored
      When a port's input state is inverted (e.g. from low to high) after
      pca953x_irq_setup but before irq_mask is set (by some other driver such as
      "gpio-keys"), the next inversion of this port (e.g. from high to low) will not
      trigger an interrupt (because irq_stat was not updated the first time). This
      commit fixes the issue.
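
The sequence described above can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct model_chip`, `model_irq_pending()` and `second_edge_trigger()` are hypothetical names, a single 8-bit port is assumed, and the driver's real bitmap and register handling is not reproduced.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one 8-bit port: cached input state and enabled lines. */
struct model_chip {
    uint8_t irq_stat;   /* last input state the driver remembers */
    uint8_t irq_mask;   /* lines with interrupts enabled */
};

/*
 * Return the lines that changed and are enabled. With always_update == false
 * the cached state is refreshed only when something triggered (the behaviour
 * the commit message describes as broken); with true it is refreshed on
 * every call.
 */
static uint8_t model_irq_pending(struct model_chip *chip, uint8_t cur_stat,
                                 bool always_update)
{
    uint8_t trigger = (uint8_t)((cur_stat ^ chip->irq_stat) & chip->irq_mask);

    if (always_update || trigger)
        chip->irq_stat = cur_stat;
    return trigger;
}

/* Replay: line 0 goes high before its mask bit is set, then goes low again. */
uint8_t second_edge_trigger(bool always_update)
{
    struct model_chip chip = { .irq_stat = 0x00, .irq_mask = 0x00 };

    model_irq_pending(&chip, 0x01, always_update);        /* low -> high, still masked */
    chip.irq_mask = 0x01;                                 /* a consumer enables the line */
    return model_irq_pending(&chip, 0x00, always_update); /* high -> low */
}
```

In this model, `second_edge_trigger(false)` returns 0 (the second edge is missed), while `second_edge_trigger(true)` returns 0x01.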
      
      Fixes: 89ea8bbe ("gpio: pca953x.c: add interrupt handling capability")
      Signed-off-by: Puyou Lu <puyou.lu@gmail.com>
      Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
    • powerpc/papr_scm: Fix buffer overflow issue with CONFIG_FORTIFY_SOURCE · 348c7134
      Kajol Jain authored
      With CONFIG_FORTIFY_SOURCE enabled, string functions also perform
      dynamic checks on string size, which can panic the kernel, for example
      when an overflow is detected.
      
      In papr_scm, papr_scm_pmu_check_events function uses stat->stat_id with
      string operations, to populate the nvdimm_events_map array. Since
      stat_id variable is not NULL terminated, the kernel panics with
      CONFIG_FORTIFY_SOURCE enabled at boot time.
      
      Below are the logs of kernel panic:
      
        detected buffer overflow in __fortify_strlen
        ------------[ cut here ]------------
        kernel BUG at lib/string_helpers.c:980!
        Oops: Exception in kernel mode, sig: 5 [#1]
        NIP [c00000000077dad0] fortify_panic+0x28/0x38
        LR [c00000000077dacc] fortify_panic+0x24/0x38
        Call Trace:
        [c0000022d77836e0] [c00000000077dacc] fortify_panic+0x24/0x38 (unreliable)
        [c00800000deb2660] papr_scm_pmu_check_events.constprop.0+0x118/0x220 [papr_scm]
        [c00800000deb2cb0] papr_scm_probe+0x288/0x62c [papr_scm]
        [c0000000009b46a8] platform_probe+0x98/0x150
      
      Fix this issue by using kmemdup_nul() to copy the content of
      stat->stat_id directly to the nvdimm_events_map array.
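
As a rough userspace illustration of why a length-bounded copy avoids the problem, the sketch below copies a fixed-width field that has no terminating NUL into a freshly allocated, terminated string. `model_memdup_nul()` and `model_copied_len()` are hypothetical stand-ins written for this example; they only imitate what the kernel's kmemdup_nul() does and are not the code from the patch.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Userspace stand-in for the kernel's kmemdup_nul(): copy exactly len bytes
 * and append a terminating NUL. Returns NULL on allocation failure; the
 * caller frees the result.
 */
char *model_memdup_nul(const char *src, size_t len)
{
    char *buf = malloc(len + 1);

    if (!buf)
        return NULL;
    memcpy(buf, src, len);   /* never reads past the fixed-width field */
    buf[len] = '\0';
    return buf;
}

/* Length of the NUL-terminated copy made from a fixed-width field. */
size_t model_copied_len(const char *field, size_t field_len)
{
    char *copy = model_memdup_nul(field, field_len);
    size_t n;

    if (!copy)
        return 0;
    n = strlen(copy);        /* safe: the copy is always terminated */
    free(copy);
    return n;
}
```

Because the copy is bounded by the explicit length rather than by a search for a terminator, calling strlen() on the result can no longer run past the end of the source field.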
      
      mpe: stat->stat_id comes from the hypervisor, not userspace, so there is
      no security exposure.
      
      Fixes: 4c08d4bb ("powerpc/papr_scm: Add perf interface support")
      Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20220505153451.35503-1-kjain@linux.ibm.com
    • s390/dasd: Use kzalloc instead of kmalloc/memset · f1c8781a
      Haowen Bai authored
      Use kzalloc rather than duplicating its implementation, which
      makes code simple and easy to understand.
      Signed-off-by: Haowen Bai <baihaowen@meizu.com>
      Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
      Link: https://lore.kernel.org/r/20220505141733.1989450-6-sth@linux.ibm.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • s390/dasd: Fix read inconsistency for ESE DASD devices · b9c10f68
      Jan Höppner authored
      Read requests that return with NRF error are partially completed in
      dasd_eckd_ese_read(). The function keeps track of the amount of
      processed bytes and the driver will eventually return this information
      back to the block layer for further processing via __dasd_cleanup_cqr()
      when the request is in the final stage of processing (from the driver's
      perspective).
      
      For this, blk_update_request() is used which requires the number of
      bytes to complete the request. As per documentation the nr_bytes
      parameter is described as follows:
         "number of bytes to complete for @req".
      
      This was mistakenly interpreted as "number of bytes _left_ for @req"
      leading to new requests with incorrect data length. The consequences are
      inconsistent and completely wrong read requests, as data from random
      memory areas is read back.
      
      Fix this by correctly specifying the amount of bytes that should be used
      to complete the request.
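
The difference between the two readings of nr_bytes can be shown with a small userspace model. This is a sketch only: `struct model_req`, `model_update_request()` and the two wrapper functions are hypothetical and merely imitate the "complete this many bytes" semantics quoted above; they are not the block layer API.

```c
#include <stddef.h>

/* Hypothetical model of a request: bytes already completed out of a total. */
struct model_req {
    size_t done;
    size_t total;
};

/*
 * Complete nr_bytes of the request, following the documented meaning
 * "number of bytes to complete". Returns the bytes still outstanding.
 */
static size_t model_update_request(struct model_req *req, size_t nr_bytes)
{
    if (nr_bytes > req->total - req->done)
        nr_bytes = req->total - req->done;
    req->done += nr_bytes;
    return req->total - req->done;
}

/* Correct use: pass the number of bytes the driver actually processed. */
size_t remaining_when_passing_processed(size_t total, size_t processed)
{
    struct model_req req = { .done = 0, .total = total };

    return model_update_request(&req, processed);
}

/* Misreading: pass the number of bytes that are still left. */
size_t remaining_when_passing_left(size_t total, size_t processed)
{
    struct model_req req = { .done = 0, .total = total };

    return model_update_request(&req, total - processed);
}
```

For an 8192-byte request of which 2048 bytes were processed, the correct call leaves 6144 bytes outstanding, while the misreading marks 6144 bytes as complete and leaves only 2048, so the follow-up request covers the wrong range.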
      
      Fixes: 5e6bdd37 ("s390/dasd: fix data corruption for thin provisioned devices")
      Cc: stable@vger.kernel.org # 5.3+
      Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
      Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
      Link: https://lore.kernel.org/r/20220505141733.1989450-5-sth@linux.ibm.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • s390/dasd: Fix read for ESE with blksize < 4k · cd68c48e
      Jan Höppner authored
      When reading unformatted tracks on ESE devices, the corresponding memory
      areas are simply set to zero for each segment. This is done incorrectly
      for blocksizes < 4096.
      
      There are two problems. First, the increment of dst is done using the
      counter of the loop (off), which is increased by blksize every
      iteration. This leads to a much bigger increment for dst than actually
      intended. Second, the increment of dst is done before the memory area
      is set to 0, skipping a significant amount of bytes of memory.
      
      This leads to illegal overwriting of memory and ultimately to a kernel
      panic.
      
      This is not a problem with 4k blocksize because
      blk_queue_max_segment_size is set to PAGE_SIZE, always resulting in a
      single iteration for the inner segment loop (bv.bv_len == blksize). The
      incorrectly used 'off' value to increment dst is 0 and the correct
      memory area is used.
      
      In order to fix this for blksize < 4k, increment dst correctly using the
      blksize and only do it at the end of the loop.
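
The arithmetic behind both problems can be checked with a small userspace model that tracks only the byte offset at which each block would be zeroed. This is a sketch under assumptions: the function names are hypothetical, blksize is assumed to be non-zero, and no memory is actually written, so the out-of-range offsets are harmless here.

```c
#include <stddef.h>

/*
 * Offset of the last block that would be zeroed when dst is advanced by the
 * running loop counter before use, as the commit message describes.
 */
size_t last_offset_buggy(size_t seg_len, size_t blksize)
{
    size_t dst = 0, last = 0;

    for (size_t off = 0; off < seg_len; off += blksize) {
        dst += off;     /* grows by 0, blksize, 2*blksize, ... */
        last = dst;     /* the block would be zeroed starting here */
    }
    return last;
}

/*
 * Offset of the last block when dst is advanced by blksize at the end of
 * each iteration, as the fix describes.
 */
size_t last_offset_fixed(size_t seg_len, size_t blksize)
{
    size_t dst = 0, last = 0;

    for (size_t off = 0; off < seg_len; off += blksize) {
        last = dst;     /* zero first ... */
        dst += blksize; /* ... then step to the next block */
    }
    return last;
}
```

For a 4096-byte segment with 1024-byte blocks, the first variant ends at offset 6144, which lies outside the segment, while the second ends at 3072. With 4096-byte blocks both variants stay at offset 0, matching the observation that the 4k case is unaffected.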
      
      Fixes: 5e2b17e7 ("s390/dasd: Add dynamic formatting support for ESE volumes")
      Cc: stable@vger.kernel.org # v5.3+
      Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
      Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
      Link: https://lore.kernel.org/r/20220505141733.1989450-4-sth@linux.ibm.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • s390/dasd: prevent double format of tracks for ESE devices · 71f38716
      Stefan Haberland authored
      For ESE devices we get an error for write operations on an unformatted
      track. Afterwards the track will be formatted and the IO operation
      restarted.
      When using alias devices a track might be accessed by multiple requests
      simultaneously and there is a race window that a track gets formatted
      twice resulting in data loss.
      
      Prevent this by remembering the number of formatted tracks when starting
      a request and comparing this number before actually formatting a track
      on the fly. If the number has changed, there is a chance that the current
      track was formatted in the meantime. In that case, do not format the
      track and restart the current IO to check.
      
      The number of formatted tracks does not match the overall number of
      formatted tracks on the device and it might wrap around but this is no
      problem. It is only needed to recognize that a track has been formatted at
      all in between.
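
The snapshot-and-compare idea can be sketched in userspace C as follows. The names `struct model_dev`, `model_try_format()` and `formats_for_two_racing_requests()` are hypothetical, the counter width is an assumption, and locking is left out entirely; the sketch only shows why a changed counter is enough to suppress the second format, including across a wrap-around.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device counter of on-the-fly track formats. */
struct model_dev {
    uint32_t fmt_count;
};

/*
 * Format only if nothing was formatted since the request took its snapshot.
 * Returns true if this call performed the format.
 */
static bool model_try_format(struct model_dev *dev, uint32_t snapshot)
{
    if (dev->fmt_count != snapshot)
        return false;   /* something was formatted: restart the IO instead */
    dev->fmt_count++;   /* unsigned wrap-around is well defined */
    return true;
}

/* Two requests start with the same snapshot and both hit the same track. */
int formats_for_two_racing_requests(uint32_t initial_count)
{
    struct model_dev dev = { .fmt_count = initial_count };
    uint32_t snap_a = dev.fmt_count;
    uint32_t snap_b = dev.fmt_count;
    int formats = 0;

    formats += model_try_format(&dev, snap_a);
    formats += model_try_format(&dev, snap_b);
    return formats;
}
```

Only one of the two requests formats the track, regardless of whether the counter starts at 0 or at its maximum value and wraps.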
      
      Fixes: 5e2b17e7 ("s390/dasd: Add dynamic formatting support for ESE volumes")
      Cc: stable@vger.kernel.org # 5.3+
      Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
      Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
      Link: https://lore.kernel.org/r/20220505141733.1989450-3-sth@linux.ibm.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • s390/dasd: fix data corruption for ESE devices · 5b53a405
      Stefan Haberland authored
      For ESE devices we get an error when accessing an unformatted track.
      The handling of this error will return zero data for read requests and
      format the track on demand before writing to it. To do this the code needs
      to distinguish between read and write requests. This is done with data from
      the blocklayer request. A pointer to the blocklayer request is stored in
      the CQR.
      
      If there is an error on the device an ERP request is built to do error
      recovery. While the ERP request is mostly a copy of the original CQR the
      pointer to the blocklayer request is not copied to not accidentally pass
      it back to the blocklayer without cleanup.
      
      This leads to the error that during ESE handling after an ERP request was
      built it is not possible to determine the IO direction. This leads to the
      formatting of a track for read requests which might in turn lead to data
      corruption.
      
      Fixes: 5e2b17e7 ("s390/dasd: Add dynamic formatting support for ESE volumes")
      Cc: stable@vger.kernel.org # 5.3+
      Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
      Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
      Link: https://lore.kernel.org/r/20220505141733.1989450-2-sth@linux.ibm.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge tag 'drm-misc-fixes-2022-05-05' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes · ca5e2f4d
      Dave Airlie authored
      drm-misc-fixes for v5.18-rc6:
      - Small fix for hot-unplugging fb devices.
      - Kconfig fix for it6505.
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      
      From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/69e51773-8c6f-4ff7-9a06-5c2922a43999@linux.intel.com
  4. 05 May, 2022 8 commits
    • Merge tag 'amd-drm-fixes-5.18-2022-05-04' of... · ebbc04bd
      Dave Airlie authored
      Merge tag 'amd-drm-fixes-5.18-2022-05-04' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
      
      amd-drm-fixes-5.18-2022-05-04:
      
      amdgpu:
      - Fix a xen dom0 regression on APUs
      - Fix a potential array overflow if a receiver were to
        send an erroneous audio channel count
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      From: Alex Deucher <alexander.deucher@amd.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20220504190439.5723-1-alexander.deucher@amd.com
    • Merge tag 'folio-5.18f' of git://git.infradead.org/users/willy/pagecache · fe27d189
      Linus Torvalds authored
      Pull folio fixes from Matthew Wilcox:
       "Two folio fixes for 5.18.
      
        Darrick and Brian have done amazing work debugging the race I created
        in the folio BIO iterator. The readahead problem was deterministic, so
        easy to fix.
      
         - Fix a race when we were calling folio_next() in the BIO folio iter
           without holding a reference, meaning the folio could be split or
           freed, and we'd jump to the next page instead of the intended next
           folio.
      
         - Fix readahead creating single-page folios instead of the intended
           large folios when doing reads that are not a power of two in size"
      
      * tag 'folio-5.18f' of git://git.infradead.org/users/willy/pagecache:
        mm/readahead: Fix readahead with large folios
        block: Do not call folio_next() on an unreferenced folio
    • Merge tag 'devicetree-fixes-for-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux · f47c960e
      Linus Torvalds authored
      Pull devicetree fixes from Rob Herring:
      
       - Drop unused 'max-link-speed' in Apple PCIe
      
       - More redundant 'maxItems/minItems' schema fixes
      
       - Support values for pinctrl 'drive-push-pull' and 'drive-open-drain'
      
       - Fix redundant 'unevaluatedProperties' in MT6360 LEDs binding
      
       - Add missing 'power-domains' property to Cadence UFSHC
      
      * tag 'devicetree-fixes-for-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
        dt-bindings: pci: apple,pcie: Drop max-link-speed from example
        dt-bindings: Drop redundant 'maxItems/minItems' in if/then schemas
        dt-bindings: pinctrl: Allow values for drive-push-pull and drive-open-drain
        dt-bindings: leds-mt6360: Drop redundant 'unevaluatedProperties'
        dt-bindings: ufs: cdns,ufshc: Add power-domains
    • btrfs: sysfs: export the balance paused state of exclusive operation · 3e1ad196
      David Sterba authored
      The new state allowing device addition with paused balance is not
      exported to user space so it can't recognize it and actually start the
      operation.
      
      Fixes: efc0e69c ("btrfs: introduce exclusive operation BALANCE_PAUSED state")
      CC: stable@vger.kernel.org # 5.17
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix assertion failure when logging directory key range item · 750ee454
      Filipe Manana authored
      When inserting a key range item (BTRFS_DIR_LOG_INDEX_KEY) while logging
      a directory, we don't expect the insertion to fail with -EEXIST, because
      we are holding the directory's log_mutex and we have dropped all existing
      BTRFS_DIR_LOG_INDEX_KEY keys from the log tree before we started to log
      the directory. However it's possible that during the logging we attempt
      to insert the same BTRFS_DIR_LOG_INDEX_KEY key twice, but for this to
      happen we need to race with insertions of items from other inodes in the
      subvolume's tree while we are logging a directory. Here's how this can
      happen:
      
      1) We are logging a directory with inode number 1000 that has its items
         spread across 3 leaves in the subvolume's tree:
      
         leaf A - has index keys from the range 2 to 20 for example. The last
         item in the leaf corresponds to a dir item for index number 20. All
         these dir items were created in a past transaction.
      
         leaf B - has index keys from the range 22 to 100 for example. It has
         no keys from other inodes, all its keys are dir index keys for our
         directory inode number 1000. Its first key is for the dir item with
         a sequence number of 22. All these dir items were also created in a
         past transaction.
      
         leaf C - has index keys for our directory for the range 101 to 120 for
         example. This leaf also has items from other inodes, and its first
         item corresponds to the dir item for index number 101 for our directory
         with inode number 1000;
      
      2) When we finish processing the items from leaf A at log_dir_items(),
         we log a BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and a last
         offset of 21, meaning the log is authoritative for the index range
         from 21 to 21 (a single sequence number). At this point leaf B was
         not yet modified in the current transaction;
      
      3) When we return from log_dir_items() we have released our read lock on
         leaf B, and have set *last_offset_ret to 21 (index number of the first
         item on leaf B minus 1);
      
      4) Some other task inserts an item for other inode (inode number 1001 for
         example) into leaf C. That resulted in pushing some items from leaf C
         into leaf B, in order to make room for the new item, so now leaf B
         has dir index keys for the sequence number range from 22 to 102 and
         leaf C has the dir items for the sequence number range 103 to 120;
      
      5) At log_directory_changes() we call log_dir_items() again, passing it
         a 'min_offset' / 'min_key' value of 22 (*last_offset_ret from step 3
         plus 1, so 21 + 1). Then btrfs_search_forward() leaves us at slot 0
         of leaf B, since leaf B was modified in the current transaction.
      
         We have also initialized 'last_old_dentry_offset' to 20 after calling
         btrfs_previous_item() at log_dir_items(), as it left us at the last
         item of leaf A, which refers to the dir item with sequence number 20;
      
      6) We then call process_dir_items_leaf() to process the dir items of
         leaf B, and when we process the first item, corresponding to slot 0,
         sequence number 22, we notice the dir item was created in a past
         transaction and its sequence number is greater than the value of
         *last_old_dentry_offset + 1 (20 + 1), so we decide to log again a
         BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and an end range
         of 21 (key.offset - 1 == 22 - 1 == 21), which results in an -EEXIST
         error from insert_dir_log_key(), as we have already inserted that
         key at step 2, triggering the assertion at process_dir_items_leaf().
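
The offset arithmetic of steps 2 and 6 can be replayed with a small userspace model. This is only a sketch: `struct range_log`, `insert_range_key()` and `second_insert_result()` are hypothetical stand-ins for the log tree and its insertion helper, keys are identified by their first offset alone, and leaves, locking and transactions are not modeled.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGE_KEYS 8

/* Hypothetical stand-in for the log tree: one entry per logged range key. */
struct range_log {
    uint64_t first[MAX_RANGE_KEYS];
    uint64_t last[MAX_RANGE_KEYS];
    size_t count;
};

/* Insert the range key [first, last]; fail with -EEXIST on a duplicate. */
static int insert_range_key(struct range_log *log, uint64_t first, uint64_t last)
{
    for (size_t i = 0; i < log->count; i++)
        if (log->first[i] == first)
            return -EEXIST;
    if (log->count == MAX_RANGE_KEYS)
        return -ENOSPC;
    log->first[log->count] = first;
    log->last[log->count] = last;
    log->count++;
    return 0;
}

/* Replay steps 2 and 6 and return the result of the second insertion. */
int second_insert_result(void)
{
    struct range_log log = { .count = 0 };
    uint64_t last_index_leaf_a = 20;
    uint64_t first_index_leaf_b = 22;
    int ret;

    /* Step 2: after leaf A, log the gap [21, 21]. */
    ret = insert_range_key(&log, last_index_leaf_a + 1, first_index_leaf_b - 1);
    if (ret)
        return ret;

    /* Step 6: slot 0 of leaf B has index 22, last_old_dentry_offset is 20. */
    uint64_t last_old_dentry_offset = 20;
    uint64_t key_offset = 22;

    if (key_offset > last_old_dentry_offset + 1)
        ret = insert_range_key(&log, last_old_dentry_offset + 1, key_offset - 1);
    return ret;
}
```

In this model the second insertion computes the same range [21, 21] as the first and therefore returns -EEXIST, which is the condition the assertion does not expect.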
      
      The trace produced in dmesg is like the following:
      
      assertion failed: ret != -EEXIST, in fs/btrfs/tree-log.c:3857
      [198255.980839][ T7460] ------------[ cut here ]------------
      [198255.981666][ T7460] kernel BUG at fs/btrfs/ctree.h:3617!
      [198255.983141][ T7460] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
      [198255.984080][ T7460] CPU: 0 PID: 7460 Comm: repro-ghost-dir Not tainted 5.18.0-5314c78ac373-misc-next+
      [198255.986027][ T7460] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
      [198255.988600][ T7460] RIP: 0010:assertfail.constprop.0+0x1c/0x1e
      [198255.989465][ T7460] Code: 8b 4c 89 (...)
      [198255.992599][ T7460] RSP: 0018:ffffc90007387188 EFLAGS: 00010282
      [198255.993414][ T7460] RAX: 000000000000003d RBX: 0000000000000065 RCX: 0000000000000000
      [198255.996056][ T7460] RDX: 0000000000000001 RSI: ffffffff8b62b180 RDI: fffff52000e70e24
      [198255.997668][ T7460] RBP: ffffc90007387188 R08: 000000000000003d R09: ffff8881f0e16507
      [198255.999199][ T7460] R10: ffffed103e1c2ca0 R11: 0000000000000001 R12: 00000000ffffffef
      [198256.000683][ T7460] R13: ffff88813befc630 R14: ffff888116c16e70 R15: ffffc90007387358
      [198256.007082][ T7460] FS:  00007fc7f7c24640(0000) GS:ffff8881f0c00000(0000) knlGS:0000000000000000
      [198256.009939][ T7460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [198256.014133][ T7460] CR2: 0000560bb16d0b78 CR3: 0000000140b34005 CR4: 0000000000170ef0
      [198256.015239][ T7460] Call Trace:
      [198256.015674][ T7460]  <TASK>
      [198256.016313][ T7460]  log_dir_items.cold+0x16/0x2c
      [198256.018858][ T7460]  ? replay_one_extent+0xbf0/0xbf0
      [198256.025932][ T7460]  ? release_extent_buffer+0x1d2/0x270
      [198256.029658][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.031114][ T7460]  ? lock_acquired+0xbe/0x660
      [198256.032633][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.034386][ T7460]  ? lock_release+0xcf/0x8a0
      [198256.036152][ T7460]  log_directory_changes+0xf9/0x170
      [198256.036993][ T7460]  ? log_dir_items+0xba0/0xba0
      [198256.037661][ T7460]  ? do_raw_write_unlock+0x7d/0xe0
      [198256.038680][ T7460]  btrfs_log_inode+0x233b/0x26d0
      [198256.041294][ T7460]  ? log_directory_changes+0x170/0x170
      [198256.042864][ T7460]  ? btrfs_attach_transaction_barrier+0x60/0x60
      [198256.045130][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.046568][ T7460]  ? lock_release+0xcf/0x8a0
      [198256.047504][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.048712][ T7460]  ? ilookup5_nowait+0x81/0xa0
      [198256.049747][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.050652][ T7460]  ? do_raw_spin_unlock+0xa9/0x100
      [198256.051618][ T7460]  ? __might_resched+0x128/0x1c0
      [198256.052511][ T7460]  ? __might_sleep+0x66/0xc0
      [198256.053442][ T7460]  ? __kasan_check_read+0x11/0x20
      [198256.054251][ T7460]  ? iget5_locked+0xbd/0x150
      [198256.054986][ T7460]  ? run_delayed_iput_locked+0x110/0x110
      [198256.055929][ T7460]  ? btrfs_iget+0xc7/0x150
      [198256.056630][ T7460]  ? btrfs_orphan_cleanup+0x4a0/0x4a0
      [198256.057502][ T7460]  ? free_extent_buffer+0x13/0x20
      [198256.058322][ T7460]  btrfs_log_inode+0x2654/0x26d0
      [198256.059137][ T7460]  ? log_directory_changes+0x170/0x170
      [198256.060020][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.060930][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.061905][ T7460]  ? lock_contended+0x770/0x770
      [198256.062682][ T7460]  ? btrfs_log_inode_parent+0xd04/0x1750
      [198256.063582][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.064432][ T7460]  ? preempt_count_sub+0x18/0xc0
      [198256.065550][ T7460]  ? __mutex_lock+0x580/0xdc0
      [198256.066654][ T7460]  ? stack_trace_save+0x94/0xc0
      [198256.068008][ T7460]  ? __kasan_check_write+0x14/0x20
      [198256.072149][ T7460]  ? __mutex_unlock_slowpath+0x12a/0x430
      [198256.073145][ T7460]  ? mutex_lock_io_nested+0xcd0/0xcd0
      [198256.074341][ T7460]  ? wait_for_completion_io_timeout+0x20/0x20
      [198256.075345][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.076142][ T7460]  ? lock_contended+0x770/0x770
      [198256.076939][ T7460]  ? do_raw_spin_lock+0x1c0/0x1c0
      [198256.078401][ T7460]  ? btrfs_sync_file+0x5e6/0xa40
      [198256.080598][ T7460]  btrfs_log_inode_parent+0x523/0x1750
      [198256.081991][ T7460]  ? wait_current_trans+0xc8/0x240
      [198256.083320][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.085450][ T7460]  ? btrfs_end_log_trans+0x70/0x70
      [198256.086362][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.087544][ T7460]  ? lock_release+0xcf/0x8a0
      [198256.088305][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.090375][ T7460]  ? dget_parent+0x8e/0x300
      [198256.093538][ T7460]  ? do_raw_spin_lock+0x1c0/0x1c0
      [198256.094918][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.097815][ T7460]  ? do_raw_spin_unlock+0xa9/0x100
      [198256.101822][ T7460]  ? dget_parent+0xb7/0x300
      [198256.103345][ T7460]  btrfs_log_dentry_safe+0x48/0x60
      [198256.105052][ T7460]  btrfs_sync_file+0x629/0xa40
      [198256.106829][ T7460]  ? start_ordered_ops.constprop.0+0x120/0x120
      [198256.109655][ T7460]  ? __fget_files+0x161/0x230
      [198256.110760][ T7460]  vfs_fsync_range+0x6d/0x110
      [198256.111923][ T7460]  ? start_ordered_ops.constprop.0+0x120/0x120
      [198256.113556][ T7460]  __x64_sys_fsync+0x45/0x70
      [198256.114323][ T7460]  do_syscall_64+0x5c/0xc0
      [198256.115084][ T7460]  ? syscall_exit_to_user_mode+0x3b/0x50
      [198256.116030][ T7460]  ? do_syscall_64+0x69/0xc0
      [198256.116768][ T7460]  ? do_syscall_64+0x69/0xc0
      [198256.117555][ T7460]  ? do_syscall_64+0x69/0xc0
      [198256.118324][ T7460]  ? sysvec_call_function_single+0x57/0xc0
      [198256.119308][ T7460]  ? asm_sysvec_call_function_single+0xa/0x20
      [198256.120363][ T7460]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [198256.121334][ T7460] RIP: 0033:0x7fc7fe97b6ab
      [198256.122067][ T7460] Code: 0f 05 48 (...)
      [198256.125198][ T7460] RSP: 002b:00007fc7f7c23950 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
      [198256.126568][ T7460] RAX: ffffffffffffffda RBX: 00007fc7f7c239f0 RCX: 00007fc7fe97b6ab
      [198256.127942][ T7460] RDX: 0000000000000002 RSI: 000056167536bcf0 RDI: 0000000000000004
      [198256.129302][ T7460] RBP: 0000000000000004 R08: 0000000000000000 R09: 000000007ffffeb8
      [198256.130670][ T7460] R10: 00000000000001ff R11: 0000000000000293 R12: 0000000000000001
      [198256.132046][ T7460] R13: 0000561674ca8140 R14: 00007fc7f7c239d0 R15: 000056167536dab8
      [198256.133403][ T7460]  </TASK>
      
      Fix this by treating -EEXIST as expected at insert_dir_log_key() and have
      it update the item with an end offset corresponding to the maximum between
      the previously logged end offset and the new requested end offset. The end
      offsets may be different due to dir index key deletions that happened as
      part of unlink operations while we are logging a directory (triggered when
      fsyncing some other inode parented by the directory) or during renames
      which always attempt to log a single dir index deletion.
      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Link: https://lore.kernel.org/linux-btrfs/YmyefE9mc2xl5ZMz@hungrycats.org/
      Fixes: 732d591a ("btrfs: stop copying old dir items when logging a directory")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      750ee454
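The -EEXIST handling described in the commit above can be sketched in user space as follows. This is a minimal model, not the kernel code: `struct dir_log`, `insert_range()` and `log_dir_range()` are hypothetical names chosen for illustration, and a fixed-size array stands in for the log tree. It only shows the idea of treating an existing key as expected and keeping the larger of the two end offsets.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 16

/* Hypothetical stand-in for a logged directory key range item. */
struct dir_log_range {
	uint64_t first_offset;	/* key of the item */
	uint64_t last_offset;	/* end offset stored in the item */
};

/* Hypothetical stand-in for the log tree. */
struct dir_log {
	struct dir_log_range items[MAX_RANGES];
	size_t count;
};

static struct dir_log_range *find_range(struct dir_log *dlog, uint64_t first)
{
	for (size_t i = 0; i < dlog->count; i++)
		if (dlog->items[i].first_offset == first)
			return &dlog->items[i];
	return NULL;
}

/* Plain insertion: fails with -EEXIST if the key is already present. */
static int insert_range(struct dir_log *dlog, uint64_t first, uint64_t last)
{
	if (find_range(dlog, first))
		return -EEXIST;
	if (dlog->count == MAX_RANGES)
		return -ENOSPC;
	dlog->items[dlog->count].first_offset = first;
	dlog->items[dlog->count].last_offset = last;
	dlog->count++;
	return 0;
}

/*
 * Insert a range, treating -EEXIST as expected: the existing item is
 * updated to the maximum of the old and the newly requested end offset.
 */
int log_dir_range(struct dir_log *dlog, uint64_t first, uint64_t last)
{
	int ret = insert_range(dlog, first, last);

	if (ret == -EEXIST) {
		struct dir_log_range *r = find_range(dlog, first);

		if (last > r->last_offset)
			r->last_offset = last;
		ret = 0;
	}
	return ret;
}
```

In this model, logging the same start offset twice never shrinks the recorded range, which mirrors the "maximum between the previously logged end offset and the new requested end offset" wording of the commit message.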
    • Naohiro Aota's avatar
      btrfs: zoned: activate block group properly on unlimited active zone device · ceb4f608
      Naohiro Aota authored
      btrfs_zone_activate() checks if it activated all the underlying zones in
      the loop. However, that check is never hit on an unlimited active zone
      device (max_active_zones == 0).
      
      Fortunately, it still works without ENOSPC because btrfs_zone_activate()
      returns true in the end, even if block_group->zone_is_active == 0. But it
      is confusing to have a block group without zone_is_active set still usable
      for allocation. Also, we waste CPU time iterating the loop every time
      btrfs_zone_activate() is called for the block groups.
      
      Since the error case in the loop is handled by out_unlock, we can just set
      zone_is_active and do the list handling after the loop.
      
      Fixes: f9a912a3 ("btrfs: zoned: make zone activation multi stripe capable")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ceb4f608
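The control flow described in the commit above (errors leave the loop early, success is recorded once after the loop) might look roughly like the sketch below. It is a simplified user-space model under stated assumptions: `struct zone_dev`, `struct block_group_model` and `activate_block_group()` are hypothetical names, and the rollback on failure is an addition for the sake of a self-consistent example, not something the commit message describes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model; max_active_zones == 0 means "unlimited". */
struct zone_dev {
	unsigned int max_active_zones;
	unsigned int active_zones;
};

/* Hypothetical block group model with one zone per stripe. */
struct block_group_model {
	struct zone_dev **stripes;
	size_t num_stripes;
	bool zone_is_active;
};

bool activate_block_group(struct block_group_model *bg)
{
	size_t i;

	if (bg->zone_is_active)
		return true;

	for (i = 0; i < bg->num_stripes; i++) {
		struct zone_dev *dev = bg->stripes[i];

		/* Unlimited device: nothing to account for this stripe. */
		if (dev->max_active_zones == 0)
			continue;
		/* No free active zone slot: take the error path. */
		if (dev->active_zones >= dev->max_active_zones)
			goto out_undo;
		dev->active_zones++;
	}

	/*
	 * Every stripe was handled, so mark the block group active here,
	 * after the loop. This is reached on unlimited devices as well.
	 */
	bg->zone_is_active = true;
	return true;

out_undo:
	/* Sketch-only rollback of the stripes accounted before the failure. */
	while (i-- > 0) {
		struct zone_dev *dev = bg->stripes[i];

		if (dev->max_active_zones != 0)
			dev->active_zones--;
	}
	return false;
}
```

Because the success bookkeeping no longer depends on a per-iteration "was this the last zone" check, a block group on an unlimited device ends up with `zone_is_active` set, and later calls return early instead of walking the loop again.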
    • Naohiro Aota's avatar
      btrfs: zoned: move non-changing condition check out of the loop · 54957712
      Naohiro Aota authored
      btrfs_zone_activate() checks if block_group->alloc_offset ==
      block_group->zone_capacity every time it iterates the loop. But the
      condition does not depend on the loop index. Move the check out of the
      loop and do it only once.
      
      Fixes: f9a912a3 ("btrfs: zoned: make zone activation multi stripe capable")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      54957712
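Hoisting the index-independent check, as the commit above describes, can be illustrated with a small stand-alone sketch. All names here (`struct bg_model`, `activate_stripes()`, `count_stripe()`) are hypothetical, and the per-stripe work is reduced to a callback so the example stays self-contained.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block group model. */
struct bg_model {
	uint64_t alloc_offset;
	uint64_t zone_capacity;
	size_t num_stripes;
};

/* Per-stripe work, passed in as a callback for illustration only. */
typedef bool (*activate_stripe_fn)(size_t index, void *ctx);

bool activate_stripes(const struct bg_model *bg, activate_stripe_fn fn,
		      void *ctx)
{
	/*
	 * Neither field changes inside the loop, so the "block group is
	 * already full" test is done once, before iterating.
	 */
	if (bg->alloc_offset == bg->zone_capacity)
		return false;

	for (size_t i = 0; i < bg->num_stripes; i++) {
		if (!fn(i, ctx))
			return false;
	}
	return true;
}

/* Example callback: counts how many stripes were visited. */
bool count_stripe(size_t index, void *ctx)
{
	(void)index;
	(*(size_t *)ctx)++;
	return true;
}
```

With the check placed before the loop, a full block group causes no per-stripe work at all, which is the behavior the original per-iteration check produced at a higher cost.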
    • Qu Wenruo's avatar
      btrfs: force v2 space cache usage for subpage mount · 9f73f1ae
      Qu Wenruo authored
      [BUG]
      For a 4K sector sized btrfs with v1 cache enabled that has only been
      mounted on systems with 4K page size, mounting it on a subpage (64K page
      size) system can cause the following warning on the v1 space cache:
      
       BTRFS error (device dm-1): csum mismatch on free space cache
       BTRFS warning (device dm-1): failed to load free space cache for block group 84082688, rebuilding it now
      
      Although not a big deal, as the kernel can rebuild it without problems,
      such a warning will bother end users, especially if they want to switch
      the same btrfs seamlessly between systems with different page sizes.
      
      [CAUSE]
      The v1 free space cache still uses a fixed PAGE_SIZE for various bitmap
      parameters, like BITS_PER_BITMAP.

      Such hard-coded PAGE_SIZE usage causes various mismatches, from v1
      cache size to checksum.

      Thus the kernel will always reject a v1 cache created with a different
      PAGE_SIZE, reporting a csum mismatch.
      
      [FIX]
      Although we should fix the v1 cache, it's already going to be marked
      deprecated soon.

      And we have the v2 cache based on metadata (which is already fully
      subpage compatible), and it is superior to the v1 cache in almost every
      respect.

      So just force subpage mounts to use the v2 cache on mount.
      Reported-by: Matt Corallo <blnxfsl@bluematt.me>
      CC: stable@vger.kernel.org # 5.15+
      Link: https://lore.kernel.org/linux-btrfs/61aa27d1-30fc-c1a9-f0f4-9df544395ec3@bluematt.me/
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9f73f1ae
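The cause and the fix in the commit above can be illustrated with a short sketch. It assumes that the v1 bitmap size is derived as page size times 8 bits, and it uses hypothetical option flags (`OPT_SPACE_CACHE_V1`, `OPT_FREE_SPACE_TREE`) and a hypothetical helper `adjust_cache_opts()`; none of these names are taken from the kernel source.

```c
#include <stdint.h>

/* Hypothetical mount option bits, for illustration only. */
#define OPT_SPACE_CACHE_V1	(1u << 0)
#define OPT_FREE_SPACE_TREE	(1u << 1)	/* v2 cache */

/*
 * Assumption: a v1 bitmap covers one page worth of bits, so the value
 * differs between 4K and 64K page systems.
 */
unsigned long bits_per_bitmap(unsigned long page_size)
{
	return page_size * 8UL;
}

/*
 * On a subpage mount (sector size smaller than page size), drop the v1
 * cache option and force the v2 cache instead. Otherwise keep the
 * options unchanged.
 */
unsigned int adjust_cache_opts(unsigned int opts, uint32_t sectorsize,
			       uint32_t page_size)
{
	if (sectorsize < page_size) {
		opts &= ~OPT_SPACE_CACHE_V1;
		opts |= OPT_FREE_SPACE_TREE;
	}
	return opts;
}
```

Under that assumption, a cache written with a 4K page size and one read with a 64K page size disagree on the bitmap geometry, which is consistent with the csum mismatch described in the [CAUSE] section.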