  1. 13 Mar, 2024 5 commits
    • iommu/dma: Force swiotlb_max_mapping_size on an untrusted device · afc5aa46
      Nicolin Chen authored
      The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
      On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that
      an NVME device can map a size between 300KB~512KB, which certainly failed
      the swiotlb mappings, though the default pool of swiotlb has many slots:
          systemd[1]: Started Journal Service.
       => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
          note: journal-offline[392] exited with irqs disabled
          note: journal-offline[392] exited with preempt_count 1
      
      Call trace:
      [    3.099918]  swiotlb_tbl_map_single+0x214/0x240
      [    3.099921]  iommu_dma_map_page+0x218/0x328
      [    3.099928]  dma_map_page_attrs+0x2e8/0x3a0
      [    3.101985]  nvme_prep_rq.part.0+0x408/0x878 [nvme]
      [    3.102308]  nvme_queue_rqs+0xc0/0x300 [nvme]
      [    3.102313]  blk_mq_flush_plug_list.part.0+0x57c/0x600
      [    3.102321]  blk_add_rq_to_plug+0x180/0x2a0
      [    3.102323]  blk_mq_submit_bio+0x4c8/0x6b8
      [    3.103463]  __submit_bio+0x44/0x220
      [    3.103468]  submit_bio_noacct_nocheck+0x2b8/0x360
      [    3.103470]  submit_bio_noacct+0x180/0x6c8
      [    3.103471]  submit_bio+0x34/0x130
      [    3.103473]  ext4_bio_write_folio+0x5a4/0x8c8
      [    3.104766]  mpage_submit_folio+0xa0/0x100
      [    3.104769]  mpage_map_and_submit_buffers+0x1a4/0x400
      [    3.104771]  ext4_do_writepages+0x6a0/0xd78
      [    3.105615]  ext4_writepages+0x80/0x118
      [    3.105616]  do_writepages+0x90/0x1e8
      [    3.105619]  filemap_fdatawrite_wbc+0x94/0xe0
      [    3.105622]  __filemap_fdatawrite_range+0x68/0xb8
      [    3.106656]  file_write_and_wait_range+0x84/0x120
      [    3.106658]  ext4_sync_file+0x7c/0x4c0
      [    3.106660]  vfs_fsync_range+0x3c/0xa8
      [    3.106663]  do_fsync+0x44/0xc0
      
      Since untrusted devices might go down the swiotlb pathway with dma-iommu,
      these devices should not map a size larger than swiotlb_max_mapping_size.
      
      To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to
      take into account swiotlb_max_mapping_size() vs. iova_rcache_range() from
      the iommu_dma_opt_mapping_size().
      
      Fixes: 82612d66 ("iommu: Allow the dma-iommu api to use bounce buffers")
      Link: https://lore.kernel.org/r/ee51a3a5c32cf885b18f6416171802669f4a718a.1707851466.git.nicolinc@nvidia.com
      Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
      [will: Drop redundant is_swiotlb_active(dev) check]
      Signed-off-by: Will Deacon <will@kernel.org>
      Reviewed-by: Michael Kelley <mhklinux@outlook.com>
      Acked-by: Robin Murphy <robin.murphy@arm.com>
      Tested-by: Nicolin Chen <nicolinc@nvidia.com>
      Tested-by: Michael Kelley <mhklinux@outlook.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      afc5aa46
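The size limit described in this commit can be illustrated with a minimal user-space sketch. It is not the kernel patch: `struct sketch_device` and the `sketch_*` functions are hypothetical names, and the slot geometry (2 KiB slots, 128 contiguous slots) is an assumption that also ignores any reduction for the device's minimum alignment mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed swiotlb geometry: 2 KiB slots, at most 128 contiguous slots. */
#define IO_TLB_SHIFT   11
#define IO_TLB_SIZE    ((size_t)1 << IO_TLB_SHIFT)
#define IO_TLB_SEGSIZE 128

/* Hypothetical stand-in for struct device; only the trust bit matters here. */
struct sketch_device {
	bool untrusted;
};

/* Largest single bounce-buffer mapping under the assumed geometry (256 KiB). */
static size_t sketch_swiotlb_max_mapping_size(void)
{
	return IO_TLB_SIZE * IO_TLB_SEGSIZE;
}

/* Untrusted devices may be bounced, so they inherit the swiotlb limit. */
static size_t sketch_dma_max_mapping_size(const struct sketch_device *dev)
{
	assert(dev != NULL);
	if (dev->untrusted)
		return sketch_swiotlb_max_mapping_size();
	return SIZE_MAX;
}
```

Under these assumptions the 327680-byte request from the log above exceeds the 262144-byte limit, which is why a driver that honours the reported maximum would have to split it.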
    • swiotlb: Fix alignment checks when both allocation and DMA masks are present · 51b30ecb
      Will Deacon authored
      Nicolin reports that swiotlb buffer allocations fail for an NVME device
      behind an IOMMU using 64KiB pages. This is because we end up with a
      minimum allocation alignment of 64KiB (for the IOMMU to map the buffer
      safely) but a minimum DMA alignment mask corresponding to a 4KiB NVME
      page (i.e. preserving the 4KiB page offset from the original allocation).
      If the original address is not 4KiB-aligned, the allocation will fail
      because swiotlb_search_pool_area() erroneously compares these unmasked
      bits with the 64KiB-aligned candidate allocation.
      
      Tweak swiotlb_search_pool_area() so that the DMA alignment mask is
      reduced based on the required alignment of the allocation.
      
      Fixes: 82612d66 ("iommu: Allow the dma-iommu api to use bounce buffers")
      Link: https://lore.kernel.org/r/cover.1707851466.git.nicolinc@nvidia.com
      Reported-by: Nicolin Chen <nicolinc@nvidia.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      Reviewed-by: Michael Kelley <mhklinux@outlook.com>
      Tested-by: Nicolin Chen <nicolinc@nvidia.com>
      Tested-by: Michael Kelley <mhklinux@outlook.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      51b30ecb
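The mask reduction described in this commit can be sketched in user space as follows. The function names are hypothetical and the 2 KiB slot size is an assumption; the sketch only shows why bits already covered by the allocation alignment should be left out of the comparison, not how swiotlb_search_pool_area() is actually written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IO_TLB_SIZE 2048u  /* assumed swiotlb slot size in bytes */

/*
 * Drop from the device's DMA alignment mask every bit that the allocation
 * alignment already forces to zero in the candidate address.
 */
static uint64_t sketch_reduce_dma_mask(uint64_t dma_align_mask,
				       uint64_t alloc_align_mask)
{
	/* Every allocation is at least slot-aligned. */
	alloc_align_mask |= IO_TLB_SIZE - 1;
	return dma_align_mask & ~alloc_align_mask;
}

/* A candidate fits if it agrees with the original address on all masked bits. */
static bool sketch_candidate_ok(uint64_t candidate, uint64_t orig_addr,
				uint64_t mask)
{
	return (candidate & mask) == (orig_addr & mask);
}
```

With a 64KiB allocation mask (0xffff) and a 4KiB DMA mask (0xfff), the reduced mask is zero, so a 64KiB-aligned candidate is accepted even when the original address has a non-zero 4KiB page offset. Comparing with the unreduced 0xfff mask rejects the same candidate.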
    • swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc() · cbf53074
      Will Deacon authored
      core-api/dma-api-howto.rst states the following properties of
      dma_alloc_coherent():
      
        | The CPU virtual address and the DMA address are both guaranteed to
        | be aligned to the smallest PAGE_SIZE order which is greater than or
        | equal to the requested size.
      
      However, swiotlb_alloc() passes zero for the 'alloc_align_mask'
      parameter of swiotlb_find_slots() and so this property is not upheld.
      Instead, allocations larger than a page are aligned to PAGE_SIZE.
      
      Calculate the mask corresponding to the page order suitable for holding
      the allocation and pass that to swiotlb_find_slots().
      
      Fixes: e81e99ba ("swiotlb: Support aligned swiotlb buffers")
      Signed-off-by: Will Deacon <will@kernel.org>
      Reviewed-by: Michael Kelley <mhklinux@outlook.com>
      Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
      Tested-by: Nicolin Chen <nicolinc@nvidia.com>
      Tested-by: Michael Kelley <mhklinux@outlook.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      cbf53074
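The mask calculation mentioned in this commit can be approximated outside the kernel as below. The helper names are hypothetical, a 4 KiB PAGE_SIZE is assumed, and `sketch_get_order()` is a simple loop standing in for the kernel's get_order(); very large sizes that would overflow the shift are not handled.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12  /* assumed 4 KiB pages */
#define PAGE_SIZE  ((size_t)1 << PAGE_SHIFT)

/* Smallest order such that (PAGE_SIZE << order) >= size. */
static unsigned int sketch_get_order(size_t size)
{
	unsigned int order = 0;

	assert(size > 0);
	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Alignment mask for a coherent allocation of 'size' bytes. */
static size_t sketch_alloc_align_mask(size_t size)
{
	return (PAGE_SIZE << sketch_get_order(size)) - 1;
}
```

For example, a 0x3000-byte request needs order 2, so the mask is 0x3fff and the buffer would be aligned to 16 KiB rather than to a single page.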
    • swiotlb: Enforce page alignment in swiotlb_alloc() · 823353b7
      Will Deacon authored
      When allocating pages from a restricted DMA pool in swiotlb_alloc(),
      the buffer address is blindly converted to a 'struct page *' that is
      returned to the caller. In the unlikely event of an allocation bug,
      page-unaligned addresses are not detected and slots can silently be
      double-allocated.
      
      Add a simple check of the buffer alignment in swiotlb_alloc() to make
      debugging a little easier if something has gone wonky.
      Signed-off-by: Will Deacon <will@kernel.org>
      Reviewed-by: Michael Kelley <mhklinux@outlook.com>
      Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
      Tested-by: Nicolin Chen <nicolinc@nvidia.com>
      Tested-by: Michael Kelley <mhklinux@outlook.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      823353b7
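A rough illustration of the kind of check this commit adds, written as stand-alone C. The names are hypothetical and 4 KiB pages are assumed; the real code warns and releases the slots, whereas this sketch just reports failure with -1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u  /* assumed 4 KiB pages */

/* True when 'addr' starts exactly on a page boundary. */
static bool sketch_page_aligned(uint64_t addr)
{
	return (addr & (PAGE_SIZE - 1)) == 0;
}

/*
 * Returns the page frame number for a bounce-buffer address, or -1 when the
 * address is not page-aligned and converting it would silently round down.
 */
static int64_t sketch_addr_to_pfn(uint64_t addr)
{
	if (!sketch_page_aligned(addr))
		return -1;
	return (int64_t)(addr / PAGE_SIZE);
}
```

Without the check, an address ending in 0x800 would map to the same page frame as the page-aligned address 0x800 bytes below it, which is how two callers can end up sharing one page.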
    • swiotlb: Fix double-allocation of slots due to broken alignment handling · 04867a7a
      Will Deacon authored
      Commit bbb73a10 ("swiotlb: fix a braino in the alignment check fix"),
      which was a fix for commit 0eee5ae1 ("swiotlb: fix slot alignment
      checks"), causes a functional regression with vsock in a virtual machine
      using bouncing via a restricted DMA SWIOTLB pool.
      
      When virtio allocates the virtqueues for the vsock device using
      dma_alloc_coherent(), the SWIOTLB search can return page-unaligned
      allocations if 'area->index' was left unaligned by a previous allocation
      from the buffer:
      
       # Final address in brackets is the SWIOTLB address returned to the caller
       | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800)
       | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800)
       | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800)
      
      This ends badly (typically buffer corruption and/or a hang) because
      swiotlb_alloc() is expecting a page-aligned allocation and so blindly
      returns a pointer to the 'struct page' corresponding to the allocation,
      therefore double-allocating the first half (2KiB slot) of the 4KiB page.
      
      Fix the problem by treating the allocation alignment separately to any
      additional alignment requirements from the device, using the maximum
      of the two as the stride to search the buffer slots and taking care
      to ensure a minimum of page-alignment for buffers larger than a page.
      
      This also resolves swiotlb allocation failures occurring due to the
      inclusion of ~PAGE_MASK in 'iotlb_align_mask' for large allocations and
      resulting in alignment requirements exceeding swiotlb_max_mapping_size().
      
      Fixes: bbb73a10 ("swiotlb: fix a braino in the alignment check fix")
      Fixes: 0eee5ae1 ("swiotlb: fix slot alignment checks")
      Signed-off-by: Will Deacon <will@kernel.org>
      Reviewed-by: Michael Kelley <mhklinux@outlook.com>
      Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
      Tested-by: Nicolin Chen <nicolinc@nvidia.com>
      Tested-by: Michael Kelley <mhklinux@outlook.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      04867a7a
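The stride selection described in this commit might look roughly like the following when reduced to plain C. This is a simplified reading, not the actual swiotlb code: all `sketch_*` names are hypothetical, 2 KiB slots and 4 KiB pages are assumed, and the pool base is assumed to be page-aligned.

```c
#include <assert.h>
#include <stdint.h>

#define IO_TLB_SHIFT 11  /* assumed 2 KiB slots */
#define PAGE_SHIFT   12  /* assumed 4 KiB pages */
#define PAGE_SIZE    (1u << PAGE_SHIFT)

/* Number of slots spanned by an alignment mask (mask 0x800 -> 2 slots). */
static unsigned int sketch_slots_for_mask(uint64_t mask)
{
	return (unsigned int)(mask >> IO_TLB_SHIFT) + 1;
}

/*
 * Search stride in slots: the stricter of the allocation and device
 * alignments, and at least one page for buffers of a page or more.
 */
static unsigned int sketch_stride(uint64_t alloc_align_mask,
				  uint64_t dev_align_mask, uint64_t alloc_size)
{
	uint64_t mask = alloc_align_mask > dev_align_mask ?
			alloc_align_mask : dev_align_mask;
	unsigned int stride = sketch_slots_for_mask(mask);
	unsigned int page_slots = 1u << (PAGE_SHIFT - IO_TLB_SHIFT);

	if (alloc_size >= PAGE_SIZE && stride < page_slots)
		stride = page_slots;
	return stride;
}

/* Round a starting slot index up to the stride so the search stays aligned. */
static unsigned int sketch_align_index(unsigned int index, unsigned int stride)
{
	assert(stride > 0);
	return (index + stride - 1) / stride * stride;
}
```

In the log above the search started at slot 1645 with a stride of 2, so every returned slot was odd and every address ended in 0x800. Rounding the start up to 1646 first would keep each 0x2000-byte allocation on a page boundary under the stated assumptions.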
  2. 28 Feb, 2024 4 commits
  3. 26 Feb, 2024 3 commits
  4. 25 Feb, 2024 28 commits
    • Linux 6.8-rc6 · d206a76d
      Linus Torvalds authored
      d206a76d
    • Merge tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs · e231dbd4
      Linus Torvalds authored
      Pull bcachefs fixes from Kent Overstreet:
       "Some more mostly boring fixes, but some not
      
        User reported ones:
      
         - the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty
           performance bug; user reported an untar initially taking two
           seconds and then ~2 minutes
      
         - kill a __GFP_NOFAIL in the buffered read path; this was a leftover
           from the trickier fix to kill __GFP_NOFAIL in readahead, where we
           can't return errors (and have to silently truncate the read
           ourselves).
      
           bcachefs can't use GFP_NOFAIL for folio state unlike iomap based
           filesystems because our folio state is just barely too big, 2MB
           hugepages cause us to exceed the 2 page threshold for GFP_NOFAIL.
      
           additionally, the flags argument was just buggy, we weren't
           supplying GFP_KERNEL previously (!)"
      
      * tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs:
        bcachefs: fix bch2_save_backtrace()
        bcachefs: Fix check_snapshot() memcpy
        bcachefs: Fix bch2_journal_flush_device_pins()
        bcachefs: fix iov_iter count underflow on sub-block dio read
        bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree
        bcachefs: Kill __GFP_NOFAIL in buffered read path
        bcachefs: fix backpointer_to_text() when dev does not exist
      e231dbd4
    • bcachefs: fix bch2_save_backtrace() · 5197728f
      Kent Overstreet authored
      Missed a call in the previous fix.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      5197728f
    • Merge tag 'docs-6.8-fixes3' of git://git.lwn.net/linux · 70ff1fe6
      Linus Torvalds authored
      Pull two documentation build fixes from Jonathan Corbet:
      
       - The XFS online fsck documentation uses incredibly deeply nested
         subsection and list nesting; that broke the PDF docs build. Tweak a
         parameter to tell LaTeX to allow the deeper nesting.
      
       - Fix a 6.8 PDF-build regression
      
      * tag 'docs-6.8-fixes3' of git://git.lwn.net/linux:
        docs: translations: use attribute to store current language
        docs: Instruct LaTeX to cope with deeper nesting
      70ff1fe6
    • Merge tag 'usb-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · c46ac50e
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are some small USB fixes for 6.8-rc6 to resolve some reported
        problems. These include:
      
         - regression fixes with typec tcpm code as reported by many
      
         - cdnsp and cdns3 driver fixes
      
         - usb role setting code bugfixes
      
         - build fix for uhci driver
      
         - ncm gadget driver bugfix
      
         - MAINTAINERS entry update
      
        All of these have been in linux-next all week with no reported issues
        and there is at least one fix in here that is in Thorsten's regression
        list that is being tracked"
      
      * tag 'usb-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usb: typec: tpcm: Fix issues with power being removed during reset
        MAINTAINERS: Drop myself as maintainer of TYPEC port controller drivers
        usb: gadget: ncm: Avoid dropping datagrams of properly parsed NTBs
        Revert "usb: typec: tcpm: reset counter when enter into unattached state after try role"
        usb: gadget: omap_udc: fix USB gadget regression on Palm TE
        usb: dwc3: gadget: Don't disconnect if not started
        usb: cdns3: fix memory double free when handle zero packet
        usb: cdns3: fixed memory use after free at cdns3_gadget_ep_disable()
        usb: roles: don't get/set_role() when usb_role_switch is unregistered
        usb: roles: fix NULL pointer issue when put module's reference
        usb: cdnsp: fixed issue with incorrect detecting CDNSP family controllers
        usb: cdnsp: blocked some cdns3 specific code
        usb: uhci-grlib: Explicitly include linux/platform_device.h
      c46ac50e
    • Merge tag 'tty-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 1e592e95
      Linus Torvalds authored
      Pull tty/serial driver fixes from Greg KH:
       "Here are three small serial/tty driver fixes for 6.8-rc6 that resolve
        the following reported errors:
      
         - riscv hvc console driver fix that was reported by many
      
         - amba-pl011 serial driver fix for RS485 mode
      
         - stm32 serial driver fix for RS485 mode
      
        All of these have been in linux-next all week with no reported
        problems"
      
      * tag 'tty-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        serial: amba-pl011: Fix DMA transmission in RS485 mode
        serial: stm32: do not always set SER_RS485_RX_DURING_TX if RS485 is enabled
        tty: hvc: Don't enable the RISC-V SBI console by default
      1e592e95
    • Merge tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1eee4ef3
      Linus Torvalds authored
      Pull x86 fixes from Borislav Petkov:
      
       - Make sure clearing CPU buffers using VERW happens at the latest
         possible point in the return-to-userspace path, otherwise memory
         accesses after the VERW execution could cause data to land in CPU
         buffers again
      
      * tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        KVM/VMX: Move VERW closer to VMentry for MDS mitigation
        KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH
        x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
        x86/entry_32: Add VERW just before userspace transition
        x86/entry_64: Add VERW just before userspace transition
        x86/bugs: Add asm helpers for executing VERW
      1eee4ef3
    • Merge tag 'irq_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 8c46ed37
      Linus Torvalds authored
      Pull irq fixes from Borislav Petkov:
      
       - Make sure GICv4 always gets initialized to prevent a kexec-ed kernel
         from silently failing to set it up
      
       - Do not call bus_get_dev_root() for the mbigen irqchip as it always
         returns NULL - use NULL directly
      
       - Fix hardware interrupt number truncation when assigning MSI
         interrupts
      
       - Correct sending end-of-interrupt messages to disabled interrupts
         lines on RISC-V PLIC
      
      * tag 'irq_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/gic-v3-its: Do not assume vPE tables are preallocated
        irqchip/mbigen: Don't use bus_get_dev_root() to find the parent
        PCI/MSI: Prevent MSI hardware interrupt number truncation
        irqchip/sifive-plic: Enable interrupt if needed before EOI
      8c46ed37
    • Merge tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs · 4ca0d989
      Linus Torvalds authored
      Pull erofs fix from Gao Xiang:
      
       - Fix page refcount leak when looking up specific inodes
         introduced by metabuf reworking
      
      * tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
        erofs: fix refcount on the metabuf used for inode lookup
      4ca0d989
    • Merge tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 66a97c2e
      Linus Torvalds authored
      Pull RCU pathwalk fixes from Al Viro:
       "We still have some races in filesystem methods when exposed to RCU
        pathwalk. This series is a result of code audit (the second round of
        it) and it should deal with most of that stuff.
      
        Still pending: ntfs3 ->d_hash()/->d_compare() and ceph_d_revalidate().
        Up to maintainers (a note for NTFS folks - when documentation says
        that a method may not block, it *does* imply that blocking allocations
        are to be avoided. Really)"
      
      [ More explanations for people who aren't familiar with the vagaries of
        RCU path walking: most of it is hidden from filesystems, but if a
        filesystem actively participates in the low-level path walking it
        needs to make sure the fields involved in that walk are RCU-safe.
      
        That "actively participate in low-level path walking" includes things
        like having its own ->d_hash()/->d_compare() routines, or by having
        its own directory permission function that doesn't just use the common
        helpers.  Having a ->d_revalidate() function will also have this issue.
      
        Note that instead of making everything RCU safe you can also choose to
        abort the RCU pathwalk if your operation cannot be done safely under
        RCU, but that obviously comes with a performance penalty. One common
        pattern is to allow the simple cases under RCU, and abort only if you
        need to do something more complicated.
      
        So not everything needs to be RCU-safe, and things like the inode etc
        that the VFS itself maintains obviously already are. But these fixes
        tend to be about properly RCU-delaying things like ->s_fs_info that
        are maintained by the filesystem and that got potentially released too
        early.   - Linus ]
      
      * tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        ext4_get_link(): fix breakage in RCU mode
        cifs_get_link(): bail out in unsafe case
        fuse: fix UAF in rcu pathwalks
        procfs: make freeing proc_fs_info rcu-delayed
        procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()
        nfs: fix UAF on pathwalk running into umount
        nfs: make nfs_set_verifier() safe for use in RCU pathwalk
        afs: fix __afs_break_callback() / afs_drop_open_mmap() race
        hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info
        exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper
        affs: free affs_sb_info with kfree_rcu()
        rcu pathwalk: prevent bogus hard errors from may_lookup()
        fs/super.c: don't drop ->s_user_ns until we free struct super_block itself
      66a97c2e
    • Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 9b243492
      Linus Torvalds authored
      Pull vfs fixes from Al Viro:
       "A couple of fixes - revert of regression from this cycle and a fix for
        erofs failure exit breakage (had been there since way back)"
      
      * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        erofs: fix handling kern_mount() failure
        Revert "get rid of DCACHE_GENOCIDE"
      9b243492
    • ext4_get_link(): fix breakage in RCU mode · 9fa8e282
      Al Viro authored
      1) errors from ext4_getblk() should not be propagated to caller
      unless we are really sure that we would've gotten the same error
      in non-RCU pathwalk.
      2) we leak buffer_heads if ext4_getblk() is successful, but bh is
      not uptodate.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      9fa8e282
    • cifs_get_link(): bail out in unsafe case · 0511fdb4
      Al Viro authored
      ->d_revalidate() bails out there, anyway.  It's not enough
      to prevent getting into ->get_link() in RCU mode, but that
      could happen only in a very contrived setup.  Not worth
      trying to do anything fancy here unless ->d_revalidate()
      stops kicking out of RCU mode at least in some cases.
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Acked-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      0511fdb4
    • fuse: fix UAF in rcu pathwalks · 053fc4f7
      Al Viro authored
      ->permission(), ->get_link() and ->inode_get_acl() might dereference
      ->s_fs_info (and, in case of ->permission(), ->s_fs_info->fc->user_ns
      as well) when called from rcu pathwalk.
      
      Freeing ->s_fs_info->fc is rcu-delayed; we need to make freeing ->s_fs_info
      and dropping ->user_ns rcu-delayed too.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      053fc4f7
    • procfs: make freeing proc_fs_info rcu-delayed · e31f0a57
      Al Viro authored
      makes proc_pid_ns() safe from rcu pathwalk (put_pid_ns()
      is still synchronous, but that's not a problem - it does
      rcu-delay everything that needs to be)
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      e31f0a57
    • procfs: move dropping pde and pid from ->evict_inode() to ->free_inode() · 47458802
      Al Viro authored
      that keeps both around until struct inode is freed, making access
      to them safe from rcu-pathwalk
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      47458802
    • nfs: fix UAF on pathwalk running into umount · c1b967d0
      Al Viro authored
      NFS ->d_revalidate(), ->permission() and ->get_link() need to access
      some parts of nfs_server when called in RCU mode:
      	server->flags
      	server->caps
      	*(server->io_stats)
      and, worst of all, call
      	server->nfs_client->rpc_ops->have_delegation
      (the last one - as NFS_PROTO(inode)->have_delegation()).  We really
      don't want to RCU-delay the entire nfs_free_server() (it would have
      to be done with schedule_work() from RCU callback, since it can't
      be made to run from interrupt context), but actual freeing of
      nfs_server and ->io_stats can be done via call_rcu() just fine.
      nfs_client part is handled simply by making nfs_free_client() use
      kfree_rcu().
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c1b967d0
    • nfs: make nfs_set_verifier() safe for use in RCU pathwalk · 10a973fc
      Al Viro authored
      nfs_set_verifier() relies upon dentry being pinned; if that's
      the case, grabbing ->d_lock stabilizes ->d_parent and guarantees
      that ->d_parent points to a positive dentry.  For something
      we'd run into in RCU mode that is *not* true - dentry might've
      been through dentry_kill() just as we grabbed ->d_lock, with
      its parent going through the same just as we get into
      nfs_set_verifier_locked().  It might get to detaching inode
      (and zeroing ->d_inode) before nfs_set_verifier_locked() gets
      to fetching that; we get an oops as the result.
      
      That can happen in nfs{,4} ->d_revalidate(); the call chain in
      question is nfs_set_verifier_locked() <- nfs_set_verifier() <-
      nfs_lookup_revalidate_delegated() <- nfs{,4}_do_lookup_revalidate().
      We have checked that the parent had been positive, but that's
      done before we get to nfs_set_verifier() and it's possible for
      memory pressure to pick our dentry as eviction candidate by that
      time.  If that happens, back-to-back attempts to kill dentry and
      its parent are quite normal.  Sure, in case of eviction we'll
      fail the ->d_seq check in the caller, but we need to survive
      until we return there...
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      10a973fc
    • afs: fix __afs_break_callback() / afs_drop_open_mmap() race · 275655d3
      Al Viro authored
      In __afs_break_callback() we might check ->cb_nr_mmap and if it's non-zero
      do queue_work(&vnode->cb_work).  In afs_drop_open_mmap() we decrement
      ->cb_nr_mmap and do flush_work(&vnode->cb_work) if it reaches zero.
      
      The trouble is, there's nothing to prevent __afs_break_callback() from
      seeing ->cb_nr_mmap before the decrement and do queue_work() after both
      the decrement and flush_work().  If that happens, we might be in trouble -
      vnode might get freed before the queued work runs.
      
      __afs_break_callback() is always done under ->cb_lock, so let's make
      sure that ->cb_nr_mmap can change from non-zero to zero only while holding
      ->cb_lock (the spinlock component of it - it's a seqlock and we don't
      need to mess with the counter).
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      275655d3
    • hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info · af072cf6
      Al Viro authored
      ->d_hash() and ->d_compare() use those, so we need to delay freeing
      them.
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      af072cf6
    • exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper · a13d1a4d
      Al Viro authored
      That stuff can be accessed by ->d_hash()/->d_compare(); as it is, we have
      a hard-to-hit UAF if rcu pathwalk manages to get into ->d_hash() on a filesystem
      that is in process of getting shut down.
      
      Besides, having nls and upcase table cleanup moved from ->put_super() towards
      the place where sbi is freed makes for simpler failure exits.
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      a13d1a4d
    • affs: free affs_sb_info with kfree_rcu() · 529f89a9
      Al Viro authored
      one of the flags in it is used by ->d_hash()/->d_compare()
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      529f89a9
    • rcu pathwalk: prevent bogus hard errors from may_lookup() · cdb67fde
      Al Viro authored
      If lazy call of ->permission() returns a hard error, check that
      try_to_unlazy() succeeds before returning it.  That both makes
      life easier for ->permission() instances and closes the race
      in ENOTDIR handling - it is possible that positive d_can_lookup()
      seen in link_path_walk() applies to the state *after* unlink() +
      mkdir(), while nd->inode matches the state prior to that.
      
      Normally seeing e.g. EACCES from permission check in rcu pathwalk
      means that with some timings non-rcu pathwalk would've run into
      the same; however, running into a non-executable regular file
      in the middle of a pathname would not get to permission check -
      it would fail with ENOTDIR instead.
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      cdb67fde
    • fs/super.c: don't drop ->s_user_ns until we free struct super_block itself · 583340de
      Al Viro authored
      Avoids fun races in RCU pathwalk...  Same goes for freeing LSM shite
      hanging off super_block's arse.
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      583340de
    • bcachefs: Fix check_snapshot() memcpy · c4333eb5
      Kent Overstreet authored
      check_snapshot() copies the bch_snapshot to a temporary to easily handle
      older versions that don't have all the fields of the current version,
      but it lacked a min() to correctly handle keys newer and larger than the
      current version.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      c4333eb5
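The bounded copy described in this commit can be sketched as below. `struct sketch_snapshot` is a hypothetical three-field layout, not the real bch_snapshot, and the function only illustrates the min()-style clamp on the copy length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "current version" layout; the real bch_snapshot differs. */
struct sketch_snapshot {
	unsigned int flags;
	unsigned int parent;
	unsigned int depth;
};

/*
 * Copy an on-disk value of 'src_len' bytes into the current-version struct.
 * Older (shorter) values leave the tail zeroed; newer (longer) values are
 * truncated instead of overrunning 'dst'.
 */
static void sketch_copy_snapshot(struct sketch_snapshot *dst,
				 const void *src, size_t src_len)
{
	size_t n = src_len < sizeof(*dst) ? src_len : sizeof(*dst);

	assert(dst != NULL && src != NULL);
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, n);
}
```

Clamping in both directions means one temporary can serve keys written by older and newer versions alike, at the cost of ignoring any fields the current version does not know about.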
    • bcachefs: Fix bch2_journal_flush_device_pins() · 097471f9
      Kent Overstreet authored
      If a journal write errored, the list of devices it was written to could
      be empty - we're not supposed to mark an empty replicas list.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      097471f9
    • bcachefs: fix iov_iter count underflow on sub-block dio read · b58b1b88
      Brian Foster authored
      bch2_direct_IO_read() checks the request offset and size for sector
      alignment and then falls through to a couple calculations to shrink
      the size of the request based on the inode size. The problem is that
      these checks round up to the fs block size, which runs the risk of
      underflowing iter->count if the block size happens to be large
      enough. This is triggered by fstest generic/361 with a 4k block
      size, which subsequently leads to a crash. To avoid this crash,
      check that the shorten length doesn't exceed the overall length of
      the iter.
      
      Fixes:
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Su Yue <glass.su@suse.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      b58b1b88
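The underflow described in this commit can be reproduced in miniature with unsigned arithmetic. The sketch below is one reading of the message, not the bch2_direct_IO_read() code: the function names and the `bytes_to_eof` parameter are hypothetical, and the guard simply declines to shorten when the block-rounded length is not smaller than the request.

```c
#include <assert.h>
#include <stddef.h>

/* Round 'x' up to a multiple of 'align' (align must be non-zero). */
static size_t sketch_round_up(size_t x, size_t align)
{
	assert(align > 0);
	return (x + align - 1) / align * align;
}

/*
 * Bytes to trim from a read of 'count' bytes when only 'bytes_to_eof' bytes
 * remain in the file, keeping whole filesystem blocks.
 */
static size_t sketch_shorten(size_t count, size_t bytes_to_eof,
			     size_t block_size)
{
	size_t want = count < bytes_to_eof ? count : bytes_to_eof;
	size_t rounded = sketch_round_up(want, block_size);

	/* A sub-block read rounds up past 'count'; trimming would underflow. */
	if (rounded >= count)
		return 0;
	return count - rounded;
}
```

A 512-byte read with a 4k block size rounds up to 4096, so the unguarded `count - rounded` would wrap around to a huge unsigned value; with the guard the request is left at its original length.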
    • bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree · 204f4514
      Kent Overstreet authored
      If we're in FILTER_SNAPSHOTS mode and we start scanning a range of the
      keyspace where no keys are visible in the current snapshot, we have a
      problem - we'll scan for a very long time before scanning terminates.
      
      A while back, this was fixed for most cases with peek_upto() (and
      assertions that enforce that it's being used).
      
      But the fix missed the fact that the inodes btree is different - every
      key offset is in a different snapshot tree, not just the inode field.
      
      Fixes:
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      204f4514