1. 18 Nov, 2020 2 commits
  2. 11 Nov, 2020 5 commits
  3. 05 Nov, 2020 1 commit
  4. 04 Nov, 2020 5 commits
    • xfs: fix scrub flagging rtinherit even if there is no rt device · c1f6b1ac
      Darrick J. Wong authored
      The kernel has always allowed directories to have the rtinherit flag
      set, even if there is no rt device, so this check is wrong.
      
      Fixes: 80e4e126 ("xfs: scrub inodes")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: fix missing CoW blocks writeback conversion retry · c2f09217
      Darrick J. Wong authored
      In commit 7588cbee, we tried to fix a race stemming from the lack of
      coordination between higher level code that wants to allocate and remap
      CoW fork extents into the data fork.  Christoph cites as examples the
      always_cow mode, and a directio write completion racing with writeback.
      
      According to the comments before the goto retry, we want to restart the
      lookup to catch the extent in the data fork, but we don't actually reset
      whichfork or cow_fsb, which means the second try executes using stale
      information.  Up until now I think we've gotten lucky that either
      there's something left in the CoW fork to cause cow_fsb to be reset, or
      either data/cow fork sequence numbers have advanced enough to force a
      fresh lookup from the data fork.  However, if we reach the retry with an
      empty stable CoW fork and a stable data fork, neither of those things
      happens.  The retry foolishly re-calls xfs_convert_blocks on the CoW
      fork which fails again.  This time, we toss the write.
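
      The stale-state retry described above can be illustrated with a small
      userspace model. This is only a sketch under stated assumptions: the
      names file_model, convert_cow and map_blocks are hypothetical, the
      sequence-number checks are left out, and none of this is the XFS code.

```c
#include <stdbool.h>

enum fork { DATA_FORK, COW_FORK };

/* Hypothetical model of a file; not an XFS structure. */
struct file_model {
	int  cow_extents;   /* extents currently in the CoW fork */
	bool racing_remap;  /* another thread is about to remap CoW -> data */
};

/* Try to convert a CoW extent; fails if the extent is no longer there. */
static int convert_cow(struct file_model *f)
{
	if (f->racing_remap) {
		/* The racing thread wins: the extent now lives in the data fork. */
		f->cow_extents = 0;
		f->racing_remap = false;
		return -1;
	}
	return f->cow_extents > 0 ? 0 : -1;
}

/* Returns 0 if the write can be mapped, -1 if it would be tossed. */
static int map_blocks(struct file_model *f, bool reset_on_retry)
{
	enum fork whichfork = DATA_FORK;
	bool retried = false;

retry:
	if (reset_on_retry)
		whichfork = DATA_FORK;  /* start the lookup from scratch */
	if (f->cow_extents > 0)
		whichfork = COW_FORK;   /* only updated when the CoW fork has data */

	if (whichfork == DATA_FORK)
		return 0;               /* extent found in the data fork */
	if (convert_cow(f) == 0)
		return 0;
	if (!retried) {
		retried = true;
		goto retry;
	}
	return -1;                      /* second failure: the write is lost */
}
```

      In this model, a racing remap with reset_on_retry == false fails twice
      because whichfork still points at the now-empty CoW fork, while
      resetting it lets the second pass find the extent in the data fork.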
      
      I've recently been working on extending reflink to the realtime device.
      When the realtime extent size is larger than a single block, we have to
      force the page cache to CoW the entire rt extent if a write (or
      fallocate) is not aligned with the rt extent size.  The strategy I've
      chosen to deal with this is derived from Dave's blocksize > pagesize
      series: dirtying around the write range, and ensuring that writeback
      always starts mapping on an rt extent boundary.  This has brought this
      race front and center, since generic/522 blows up immediately.
      
      However, I'm pretty sure this is a bug outright, independent of that.
      
      Fixes: 7588cbee ("xfs: retry COW fork delalloc conversion when no extent was found")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • iomap: clean up writeback state logic on writepage error · 50e7d6c7
      Brian Foster authored
      The iomap writepage error handling logic is a mash of old and
      slightly broken XFS writepage logic. When keepwrite writeback state
      tracking was introduced in XFS in commit 0d085a52 ("xfs: ensure
      WB_SYNC_ALL writeback handles partial pages correctly"), XFS had an
      additional cluster writeback context that scanned ahead of
      ->writepage() to process dirty pages over the current ->writepage()
      extent mapping. This context expected a dirty page and required
      retention of the TOWRITE tag on partial page processing so the
      higher level writeback context would revisit the page (in contrast
      to ->writepage(), which passes a page with the dirty bit already
      cleared).
      
      The cluster writeback mechanism was eventually removed and some of
      the error handling logic folded into the primary writeback path in
      commit 150d5be0 ("xfs: remove xfs_cancel_ioend"). This patch
      accidentally conflated the two contexts by using the keepwrite logic
      in ->writepage() without accounting for the fact that the page is
      not dirty. Further, the keepwrite logic has no practical effect on
      the core ->writepage() caller (write_cache_pages()) because it never
      revisits a page in the current function invocation.
      
      Technically, the page should be redirtied for the keepwrite logic to
      have any effect. Otherwise, write_cache_pages() may find the tagged
      page but will skip it since it is clean. Even if the page was
      redirtied, however, there is still no practical effect to keepwrite
      since write_cache_pages() does not wrap around within a single
      invocation of the function. Therefore, the dirty page would simply
      end up retagged on the next writeback sequence over the associated
      range.
      
      All that being said, none of this really matters because redirtying
      a partially processed page introduces a potential infinite redirty
      -> writeback failure loop that deviates from the current design
      principle of clearing the dirty state on writepage failure to avoid
      building up too much dirty, unreclaimable memory on the system.
      Therefore, drop the spurious keepwrite usage and dirty state
      clearing logic from iomap_writepage_map(), treat the partially
      processed page the same as a fully processed page, and let the
      imminent ioend failure clean up the writeback state.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • iomap: support partial page discard on writeback block mapping failure · 763e4cdc
      Brian Foster authored
      iomap writeback mapping failure only calls into ->discard_page() if
      the current page has not been added to the ioend. Accordingly, the
      XFS callback assumes a full page discard and invalidation. This is
      problematic for sub-page block size filesystems where some portion
      of a page might have been mapped successfully before a failure to
      map a delalloc block occurs. ->discard_page() is not called in that
      error scenario and the bio is explicitly failed by iomap via the
      error return from ->prepare_ioend(). As a result, the filesystem
      leaks delalloc blocks and corrupts the filesystem block counters.
      
      Since XFS is the only user of ->discard_page(), tweak the semantics
      to invoke the callback unconditionally on mapping errors and provide
      the file offset that failed to map. Update xfs_discard_page() to
      discard the corresponding portion of the file and pass the range
      along to iomap_invalidatepage(). The latter already properly handles
      both full and sub-page scenarios by not changing any iomap or page
      state on sub-page invalidations.
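
      The range arithmetic implied above can be sketched as follows. The
      assumptions are mine: a fixed 4096-byte page, a discard that runs from
      the failing file offset to the end of its page, and the hypothetical
      names discard_range and discard_from. It is not the xfs_discard_page()
      implementation.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Hypothetical description of what gets discarded and invalidated. */
struct discard_range {
	uint64_t file_start;  /* first file byte to discard */
	uint32_t page_off;    /* offset of that byte within its page */
	uint32_t len;         /* bytes from page_off to the end of the page */
};

/* Compute the sub-page range starting at the offset that failed to map. */
static struct discard_range discard_from(uint64_t fileoff)
{
	struct discard_range r;

	r.file_start = fileoff;
	r.page_off = (uint32_t)(fileoff % SKETCH_PAGE_SIZE);
	r.len = SKETCH_PAGE_SIZE - r.page_off;
	return r;
}
```

      A failure at the start of a page yields a full-page range (page_off 0,
      len 4096); a failure 1024 bytes into a page leaves the first 1024
      bytes, which may already be attached to the ioend, untouched.
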
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: flush new eof page on truncate to avoid post-eof corruption · 869ae85d
      Brian Foster authored
      It is possible to expose non-zeroed post-EOF data in XFS if the new
      EOF page is dirty, backed by an unwritten block and the truncate
      happens to race with writeback. iomap_truncate_page() will not zero
      the post-EOF portion of the page if the underlying block is
      unwritten. The subsequent call to truncate_setsize() will, but
      doesn't dirty the page. Therefore, if writeback happens to complete
      after iomap_truncate_page() (so it still sees the unwritten block)
      but before truncate_setsize(), the cached page becomes inconsistent
      with the on-disk block. A mapped read after the associated page is
      reclaimed or invalidated exposes non-zero post-EOF data.
      
      For example, consider the following sequence when run on a kernel
      modified to explicitly flush the new EOF page within the race
      window:
      
      $ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
      $ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
        ...
      $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
      00000400:  00 00 00 00 00 00 00 00  ........
      $ umount /mnt/; mount <dev> /mnt/
      $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
      00000400:  cd cd cd cd cd cd cd cd  ........
      
      Update xfs_setattr_size() to explicitly flush the new EOF page prior
      to the page truncate to ensure iomap has the latest state of the
      underlying block.
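
      The ordering problem can be reproduced in a toy userspace model. This
      is a sketch under simplifying assumptions: one block equals one page,
      writeback is a plain memcpy, and the struct and function names are
      hypothetical stand-ins rather than kernel code.

```c
#include <stdbool.h>
#include <string.h>

#define BLK 4096  /* one block == one page in this model */

struct model {
	unsigned char cache[BLK];  /* page cache copy */
	unsigned char disk[BLK];   /* on-disk block contents */
	bool dirty;                /* page needs writeback */
	bool unwritten;            /* block is still an unwritten extent */
};

/* Buffered write of the whole block with a 0xcd pattern. */
static void write_all(struct model *m)
{
	memset(m->cache, 0xcd, BLK);
	m->dirty = true;
}

/* Writeback: copy a dirty page to disk and convert the block to written. */
static void writeback(struct model *m)
{
	if (!m->dirty)
		return;
	memcpy(m->disk, m->cache, BLK);
	m->unwritten = false;
	m->dirty = false;
}

/* Stand-in for iomap_truncate_page(): skips zeroing over an unwritten block. */
static void truncate_page(struct model *m, int new_eof)
{
	if (m->unwritten)
		return;
	memset(m->cache + new_eof, 0, BLK - new_eof);
	m->dirty = true;
}

/* Stand-in for truncate_setsize(): zeroes the cached tail without dirtying. */
static void setsize(struct model *m, int new_eof)
{
	memset(m->cache + new_eof, 0, BLK - new_eof);
}

/* Returns the on-disk byte just past the new EOF after a truncate. */
static unsigned char truncate_model(bool flush_first, int new_eof)
{
	struct model m = { .unwritten = true };

	write_all(&m);
	if (flush_first)
		writeback(&m);       /* the fix: flush before the page truncate */
	truncate_page(&m, new_eof);
	writeback(&m);               /* racing writeback inside the window */
	setsize(&m, new_eof);
	writeback(&m);               /* later writeback; no-op on a clean page */
	return m.disk[new_eof];
}
```

      Without the flush, the model leaves the written pattern on disk past
      the new EOF, as in the second mread above; with the flush, the tail is
      zeroed while the page is dirty and the zeroes reach disk.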
      
      Fixes: 68a9f5e7 ("xfs: implement iomap based buffered write path")
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  5. 29 Oct, 2020 1 commit
  6. 25 Oct, 2020 17 commits
  7. 24 Oct, 2020 9 commits
    • Merge tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block · d7691390
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - NVMe pull request from Christoph
           - rdma error handling fixes (Chao Leng)
           - fc error handling and reconnect fixes (James Smart)
           - fix the qid displace when tracing ioctl command (Keith Busch)
           - don't use BLK_MQ_REQ_NOWAIT for passthru (Chaitanya Kulkarni)
           - fix MDTS for passthru (Logan Gunthorpe)
           - blacklist Write Same on more devices (Kai-Heng Feng)
           - fix an uninitialized work struct (zhenwei pi)
      
       - lightnvm out-of-bounds fix (Colin)
      
       - SG allocation leak fix (Doug)
      
       - rnbd fixes (Gioh, Guoqing, Jack)
      
       - zone error translation fixes (Keith)
      
       - kerneldoc markup fix (Mauro)
      
       - zram lockdep fix (Peter)
      
       - Kill unused io_context members (Yufen)
      
       - NUMA memory allocation cleanup (Xianting)
      
       - NBD config wakeup fix (Xiubo)
      
      * tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block: (27 commits)
        block: blk-mq: fix a kernel-doc markup
        nvme-fc: shorten reconnect delay if possible for FC
        nvme-fc: wait for queues to freeze before calling update_hr_hw_queues
        nvme-fc: fix error loop in create_hw_io_queues
        nvme-fc: fix io timeout to abort I/O
        null_blk: use zone status for max active/open
        nvmet: don't use BLK_MQ_REQ_NOWAIT for passthru
        nvmet: cleanup nvmet_passthru_map_sg()
        nvmet: limit passthru MTDS by BIO_MAX_PAGES
        nvmet: fix uninitialized work for zero kato
        nvme-pci: disable Write Zeroes on Sandisk Skyhawk
        nvme: use queuedata for nvme_req_qid
        nvme-rdma: fix crash due to incorrect cqe
        nvme-rdma: fix crash when connect rejected
        block: remove unused members for io_context
        blk-mq: remove the calling of local_memory_node()
        zram: Fix __zram_bvec_{read,write}() locking order
        skd_main: remove unused including <linux/version.h>
        sgl_alloc_order: fix memory leak
        lightnvm: fix out-of-bounds write to array devices->info[]
        ...
    • Merge tag 'io_uring-5.10-2020-10-24' of git://git.kernel.dk/linux-block · af004187
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
      
       - fsize was missed in previous unification of work flags
      
       - Few fixes cleaning up the flags unification creds cases (Pavel)
      
       - Fix NUMA affinities for completely unplugged/replugged node for io-wq
      
       - Two fallout fixes from the set_fs changes. One local to io_uring, one
         for the splice entry point that io_uring uses.
      
       - Linked timeout fixes (Pavel)
      
       - Removal of ->flush() ->files work-around that we don't need anymore
         with referenced files (Pavel)
      
       - Various cleanups (Pavel)
      
      * tag 'io_uring-5.10-2020-10-24' of git://git.kernel.dk/linux-block:
        splice: change exported internal do_splice() helper to take kernel offset
        io_uring: make loop_rw_iter() use original user supplied pointers
        io_uring: remove req cancel in ->flush()
        io-wq: re-set NUMA node affinities if CPUs come online
        io_uring: don't reuse linked_timeout
        io_uring: unify fsize with def->work_flags
        io_uring: fix racy REQ_F_LINK_TIMEOUT clearing
        io_uring: do poll's hash_node init in common code
        io_uring: inline io_poll_task_handler()
        io_uring: remove extra ->file check in poll prep
        io_uring: make cached_cq_overflow non atomic_t
        io_uring: inline io_fail_links()
        io_uring: kill ref get/drop in personality init
        io_uring: flags-based creds init in queue
    • Merge tag 'libata-5.10-2020-10-24' of git://git.kernel.dk/linux-block · cb6b2897
      Linus Torvalds authored
      Pull libata fixes from Jens Axboe:
       "Two minor libata fixes:
      
         - Fix a DMA boundary mask regression for sata_rcar (Geert)
      
         - kerneldoc markup fix (Mauro)"
      
      * tag 'libata-5.10-2020-10-24' of git://git.kernel.dk/linux-block:
        ata: fix some kernel-doc markups
        ata: sata_rcar: Fix DMA boundary mask
    • Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 0eac1102
      Linus Torvalds authored
      Pull misc vfs updates from Al Viro:
       "Assorted stuff all over the place (the largest group here is
        Christoph's stat cleanups)"
      
      * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        fs: remove KSTAT_QUERY_FLAGS
        fs: remove vfs_stat_set_lookup_flags
        fs: move vfs_fstatat out of line
        fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat
        fs: remove vfs_statx_fd
        fs: omfs: use kmemdup() rather than kmalloc+memcpy
        [PATCH] reduce boilerplate in fsid handling
        fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS
        selftests: mount: add nosymfollow tests
        Add a "nosymfollow" mount option.
    • Merge tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mapping · 1b307ac8
      Linus Torvalds authored
      Pull dma-mapping fixes from Christoph Hellwig:
      
       - document the new dma_{alloc,free}_pages() API
      
       - two fixups for the dma-mapping.h split
      
      * tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mapping:
        dma-mapping: document dma_{alloc,free}_pages
        dma-mapping: move more functions to dma-map-ops.h
        ARM/sa1111: add a missing include of dma-map-ops.h
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 9bf8d8bc
      Linus Torvalds authored
      Pull KVM fixes from Paolo Bonzini:
       "Two fixes for this merge window, and an unrelated bugfix for a host
        hang"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: ioapic: break infinite recursion on lazy EOI
        KVM: vmx: rename pi_init to avoid conflict with paride
        KVM: x86/mmu: Avoid modulo operator on 64-bit value to fix i386 build
    • Merge tag 'x86_seves_fixes_for_v5.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c51ae124
      Linus Torvalds authored
      Pull x86 SEV-ES fixes from Borislav Petkov:
       "Three fixes to SEV-ES to correct setting up the new early pagetable on
        5-level paging machines, to always map boot_params and the kernel
        cmdline, and disable stack protector for ../compressed/head{32,64}.c.
        (Arvind Sankar)"
      
      * tag 'x86_seves_fixes_for_v5.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/boot/64: Explicitly map boot_params and command line
        x86/head/64: Disable stack protection for head$(BITS).o
        x86/boot/64: Initialize 5-level paging variables earlier
    • random32: add a selftest for the prandom32 code · c6e169bc
      Willy Tarreau authored
      Given that this code is new, let's add a selftest for it as well.
      It doesn't rely on fixed sets; instead, it picks 1024 numbers and
      verifies that they're not more correlated than desired.
      
      Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
      Cc: George Spelvin <lkml@sdf.org>
      Cc: Amit Klein <aksecurity@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: tytso@mit.edu
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Marc Plumb <lkml.mplumb@gmail.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • random32: add noise from network and scheduling activity · 3744741a
      Willy Tarreau authored
      With the removal of the interrupt perturbations in the previous random32
      change (random32: make prandom_u32() output unpredictable), the PRNG
      has become 100% deterministic again. While SipHash is expected to be
      way more robust against brute force than the previous Tausworthe LFSR,
      there's still the risk that whoever has even one temporary access to
      the PRNG's internal state is able to predict all subsequent draws till
      the next reseed (roughly every minute). This may happen through a side
      channel attack or any data leak.
      
      This patch restores the spirit of commit f227e3ec ("random32: update
      the net random state on interrupt and activity") in that it will perturb
      the internal PRNG's state using externally collected noise, except that
      it will not pick that noise from the random pool's bits nor upon
      interrupt, but will rather combine a few elements along the Tx path
      that are collectively hard to predict, such as dev, skb and txq
      pointers, packet length and jiffies values. These ones are combined
      using a single round of SipHash into a single long variable that is
      mixed with the net_rand_state upon each invocation.
      
      The operation was inlined because it produces very small and efficient
      code, typically 3 xor, 2 add and 2 rol. The performance was measured
      to be the same as (even very slightly better than) before the switch to
      SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC
      (i40e), the connection rate dropped from 556k/s to 555k/s while the
      SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.
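
      For readers unfamiliar with the primitive, the textbook SipHash round
      on four 64-bit words looks roughly like the sketch below. The
      sip_state type and the add_noise() helper that XORs four noise words
      into the state first are hypothetical illustrations, not the kernel's
      code. The full round shown here uses more operations than the count
      quoted above, which refers to the patch's own inlined code.

```c
#include <stdint.h>

/* Rotate left; r must be in the range 1..63. */
static inline uint64_t rotl64(uint64_t x, unsigned int r)
{
	return (x << r) | (x >> (64 - r));
}

/* Hypothetical four-word state, named after the SipHash variables. */
struct sip_state {
	uint64_t v0, v1, v2, v3;
};

/* One standard SipHash round (SipRound): only adds, rotates and xors. */
static void sip_round(struct sip_state *s)
{
	s->v0 += s->v1; s->v1 = rotl64(s->v1, 13); s->v1 ^= s->v0;
	s->v0 = rotl64(s->v0, 32);
	s->v2 += s->v3; s->v3 = rotl64(s->v3, 16); s->v3 ^= s->v2;
	s->v0 += s->v3; s->v3 = rotl64(s->v3, 21); s->v3 ^= s->v0;
	s->v2 += s->v1; s->v1 = rotl64(s->v1, 17); s->v1 ^= s->v2;
	s->v2 = rotl64(s->v2, 32);
}

/*
 * Illustrative mixing step (an assumption, not the patch): fold four
 * hard-to-predict values into the state, then run a single round.
 */
static void add_noise(struct sip_state *s, uint64_t a, uint64_t b,
		      uint64_t c, uint64_t d)
{
	s->v0 ^= a;
	s->v1 ^= b;
	s->v2 ^= c;
	s->v3 ^= d;
	sip_round(s);
}
```

      Every step of the round is invertible, so two different noise inputs
      applied to the same starting state always produce different states.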
      
      Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
      Cc: George Spelvin <lkml@sdf.org>
      Cc: Amit Klein <aksecurity@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: tytso@mit.edu
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Marc Plumb <lkml.mplumb@gmail.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>