1. 24 Jan, 2024 1 commit
    • Linus Torvalds's avatar
      Merge tag 'trace-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 615d3006
      Linus Torvalds authored
      Pull tracing and eventfs fixes from Steven Rostedt:
      
       - Fix histogram tracing_map insertion.
      
         The tracing_map_insert copies the value into the elt variable and
         then assigns the elt to the entry value. But it is possible that the
         entry value becomes visible on other CPUs before the elt is fully
         initialized. This is fixed by adding a wmb() between the
         initialization of the elt variable and assigning it.
      
       - Have eventfs directory have unique inode numbers.
      
         Having them be all the same proved to be a failure as the 'find'
         application will think that the directories are causing loops, as it
         checks for directory loops via their inodes. Have the evenfs dir
         entries get their inodes assigned when they are referenced and then
         save them in the eventfs_inode structure.
      
      * tag 'trace-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        eventfs: Save directory inodes in the eventfs_inode structure
        tracing: Ensure visibility when inserting an element into tracing_map
      615d3006
  2. 23 Jan, 2024 2 commits
    • Steven Rostedt (Google)'s avatar
      eventfs: Save directory inodes in the eventfs_inode structure · 834bf76a
      Steven Rostedt (Google) authored
      The eventfs inodes and directories are allocated when referenced. But this
      leaves the issue of keeping consistent inode numbers and the number is
      only saved in the inode structure itself. When the inode is no longer
      referenced, it can be freed. When the file that the inode was representing
      is referenced again, the inode is once again created, but the inode number
      needs to be the same as it was before.
      
      Just making the inode numbers the same for all files is fine, but that
      does not work with directories. The find command will check for loops via
      the inode number and having the same inode number for directories triggers:
      
        # find /sys/kernel/tracing
      find: File system loop detected;
      '/sys/kernel/debug/tracing/events/initcall/initcall_finish' is part of the same file system loop as
      '/sys/kernel/debug/tracing/events/initcall'.
      [..]
      
      Linus pointed out that the eventfs_inode structure ends with a single
      32bit int, and on 64 bit machines, there's likely a 4 byte hole due to
      alignment. We can use this hole to store the inode number for the
      eventfs_inode. All directories in eventfs are represented by an
      eventfs_inode and that data structure can hold its inode number.
      
      That last int was also purposely placed at the end of the structure to
      prevent holes from within. Now that there's a 4 byte number to hold the
      inode, both the inode number and the last integer can be moved up in the
      structure for better cache locality, where the llist and rcu fields can be
      moved to the end as they are only used when the eventfs_inode is being
      deleted.
      
      Link: https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Ynrfb8=X+c7x0Eewxn-YRdCA@mail.gmail.com/
      Link: https://lore.kernel.org/linux-trace-kernel/20240122152748.46897388@gandalf.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Reported-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Fixes: 53c41052 ("eventfs: Have the inodes all for files and directories all be the same")
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      834bf76a
    • Fedor Pchelkin's avatar
      drm/ttm: fix ttm pool initialization for no-dma-device drivers · 7ed2632e
      Fedor Pchelkin authored
      The QXL driver doesn't use any device for DMA mappings or allocations so
      dev_to_node() will panic inside ttm_device_init() on NUMA systems:
      
        general protection fault, probably for non-canonical address 0xdffffc000000007a: 0000 [#1] PREEMPT SMP KASAN NOPTI
        KASAN: null-ptr-deref in range [0x00000000000003d0-0x00000000000003d7]
        CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.7.0+ #9
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
        RIP: 0010:ttm_device_init+0x10e/0x340
        Call Trace:
          qxl_ttm_init+0xaa/0x310
          qxl_device_init+0x1071/0x2000
          qxl_pci_probe+0x167/0x3f0
          local_pci_probe+0xe1/0x1b0
          pci_device_probe+0x29d/0x790
          really_probe+0x251/0x910
          __driver_probe_device+0x1ea/0x390
          driver_probe_device+0x4e/0x2e0
          __driver_attach+0x1e3/0x600
          bus_for_each_dev+0x12d/0x1c0
          bus_add_driver+0x25a/0x590
          driver_register+0x15c/0x4b0
          qxl_pci_driver_init+0x67/0x80
          do_one_initcall+0xf5/0x5d0
          kernel_init_freeable+0x637/0xb10
          kernel_init+0x1c/0x2e0
          ret_from_fork+0x48/0x80
          ret_from_fork_asm+0x1b/0x30
        RIP: 0010:ttm_device_init+0x10e/0x340
      
      Fall back to NUMA_NO_NODE if there is no device for DMA.
      
      Found by Linux Verification Center (linuxtesting.org).
      
      Fixes: b0a7ce53 ("drm/ttm: Schedule delayed_delete worker closer")
      Signed-off-by: default avatarFedor Pchelkin <pchelkin@ispras.ru>
      Reviewed-by: default avatarChristian König <christian.koenig@amd.com>
      Reported-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ed2632e
  3. 22 Jan, 2024 5 commits
    • Linus Torvalds's avatar
      Revert "btrfs: zstd: fix and simplify the inline extent decompression" · e01a83e1
      Linus Torvalds authored
      This reverts commit 1e7f6def.
      
      It causes my machine to not even boot, and Klara Modin reports that the
      cause is that small zstd-compressed files return garbage when read.
      Reported-by: default avatarKlara Modin <klarasmodin@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CABq1_vj4GpUeZpVG49OHCo-3sdbe2-2ROcu_xDvUG-6-5zPRXg@mail.gmail.com/Reported-and-bisected-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: default avatarDavid Sterba <dsterba@suse.com>
      Cc: Qu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e01a83e1
    • Petr Pavlu's avatar
      tracing: Ensure visibility when inserting an element into tracing_map · 2b447606
      Petr Pavlu authored
      Running the following two commands in parallel on a multi-processor
      AArch64 machine can sporadically produce an unexpected warning about
      duplicate histogram entries:
      
       $ while true; do
           echo hist:key=id.syscall:val=hitcount > \
             /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
           cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
           sleep 0.001
         done
       $ stress-ng --sysbadaddr $(nproc)
      
      The warning looks as follows:
      
      [ 2911.172474] ------------[ cut here ]------------
      [ 2911.173111] Duplicates detected: 1
      [ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408
      [ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E)
      [ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1
      [ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G            E      6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01
      [ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018
      [ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
      [ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408
      [ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408
      [ 2911.185310] sp : ffff8000a1513900
      [ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001
      [ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008
      [ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180
      [ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff
      [ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8
      [ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731
      [ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c
      [ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8
      [ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000
      [ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480
      [ 2911.194259] Call trace:
      [ 2911.194626]  tracing_map_sort_entries+0x3e0/0x408
      [ 2911.195220]  hist_show+0x124/0x800
      [ 2911.195692]  seq_read_iter+0x1d4/0x4e8
      [ 2911.196193]  seq_read+0xe8/0x138
      [ 2911.196638]  vfs_read+0xc8/0x300
      [ 2911.197078]  ksys_read+0x70/0x108
      [ 2911.197534]  __arm64_sys_read+0x24/0x38
      [ 2911.198046]  invoke_syscall+0x78/0x108
      [ 2911.198553]  el0_svc_common.constprop.0+0xd0/0xf8
      [ 2911.199157]  do_el0_svc+0x28/0x40
      [ 2911.199613]  el0_svc+0x40/0x178
      [ 2911.200048]  el0t_64_sync_handler+0x13c/0x158
      [ 2911.200621]  el0t_64_sync+0x1a8/0x1b0
      [ 2911.201115] ---[ end trace 0000000000000000 ]---
      
      The problem appears to be caused by CPU reordering of writes issued from
      __tracing_map_insert().
      
      The check for the presence of an element with a given key in this
      function is:
      
       val = READ_ONCE(entry->val);
       if (val && keys_match(key, val->key, map->key_size)) ...
      
      The write of a new entry is:
      
       elt = get_free_elt(map);
       memcpy(elt->key, key, map->key_size);
       entry->val = elt;
      
      The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;"
      stores may become visible in the reversed order on another CPU. This
      second CPU might then incorrectly determine that a new key doesn't match
      an already present val->key and subsequently insert a new element,
      resulting in a duplicate.
      
      Fix the problem by adding a write barrier between
      "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for
      good measure, also use WRITE_ONCE(entry->val, elt) for publishing the
      element. The sequence pairs with the mentioned "READ_ONCE(entry->val);"
      and the "val->key" check which has an address dependency.
      
      The barrier is placed on a path executed when adding an element for
      a new key. Subsequent updates targeting the same key remain unaffected.
      
      From the user's perspective, the issue was introduced by commit
      c193707d ("tracing: Remove code which merges duplicates"), which
      followed commit cbf4100e ("tracing: Add support to detect and avoid
      duplicates"). The previous code operated differently; it inherently
      expected potential races which result in duplicates but merged them
      later when they occurred.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240122150928.27725-1-petr.pavlu@suse.com
      
      Fixes: c193707d ("tracing: Remove code which merges duplicates")
      Signed-off-by: default avatarPetr Pavlu <petr.pavlu@suse.com>
      Acked-by: default avatarTom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      2b447606
    • Linus Torvalds's avatar
      Merge tag 'for-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 5d9248ee
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - zoned mode fixes:
           - fix slowdown when writing large file sequentially by looking up
             block groups with enough space faster
           - locking fixes when activating a zone
      
       - new mount API fixes:
           - preserve mount options for a ro/rw mount of the same subvolume
      
       - scrub fixes:
           - fix use-after-free in case the chunk length is not aligned to
             64K, this does not happen normally but has been reported on
             images converted from ext4
           - similar alignment check was missing with raid-stripe-tree
      
       - subvolume deletion fixes:
           - prevent calling ioctl on already deleted subvolume
           - properly track flag tracking a deleted subvolume
      
       - in subpage mode, fix decompression of an inline extent (zlib, lzo,
         zstd)
      
       - fix crash when starting writeback on a folio, after integration with
         recent MM changes this needs to be started conditionally
      
       - reject unknown flags in defrag ioctl
      
       - error handling, API fixes, minor warning fixes
      
      * tag 'for-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: scrub: limit RST scrub to chunk boundary
        btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned
        btrfs: don't unconditionally call folio_start_writeback in subpage
        btrfs: use the original mount's mount options for the legacy reconfigure
        btrfs: don't warn if discard range is not aligned to sector
        btrfs: tree-checker: fix inline ref size in error messages
        btrfs: zstd: fix and simplify the inline extent decompression
        btrfs: lzo: fix and simplify the inline extent decompression
        btrfs: zlib: fix and simplify the inline extent decompression
        btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args
        btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted
        btrfs: don't abort filesystem when attempting to snapshot deleted subvolume
        btrfs: zoned: fix lock ordering in btrfs_zone_activate()
        btrfs: fix unbalanced unlock of mapping_tree_lock
        btrfs: ref-verify: free ref cache before clearing mount opt
        btrfs: fix kvcalloc() arguments order in btrfs_ioctl_send()
        btrfs: zoned: optimize hint byte for zoned allocator
        btrfs: zoned: factor out prepare_allocation_zoned()
      5d9248ee
    • Linus Torvalds's avatar
      Merge tag 'Wstringop-overflow-for-6.8-rc2' of... · 610347ef
      Linus Torvalds authored
      Merge tag 'Wstringop-overflow-for-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
      
      Pull stringop-overflow warning update from Gustavo A. R. Silva:
       "Enable -Wstringop-overflow globally.
      
        I waited for the release of -rc1 to run a final build-test on top of
        it before sending this pull request. Fortunatelly, after building 358
        kernels overnight (basically all supported archs with a wide variety
        of configs), no more warnings have surfaced! :)
      
        Thus, we are in a good position to enable this compiler option for all
        versions of GCC that support it, with the exception of GCC-11, which
        appears to have some issues with this option [1]"
      
      Link: https://lore.kernel.org/lkml/b3c99290-40bc-426f-b3d2-1aa903f95c4e@embeddedor.com/ [1]
      
      * tag 'Wstringop-overflow-for-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
        init: Kconfig: Disable -Wstringop-overflow for GCC-11
        Makefile: Enable -Wstringop-overflow globally
      610347ef
    • Linus Torvalds's avatar
      Merge tag 'xsa448-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 0f0d819a
      Linus Torvalds authored
      Pull xen netback fix from Juergen Gross:
       "Transmit requests in Xen's virtual network protocol can consist of
        multiple parts. While not really useful, except for the initial part
        any of them may be of zero length, i.e. carry no data at all.
      
        Besides a certain initial portion of the to be transferred data, these
        parts are directly translated into what Linux calls SKB fragments.
        Such converted request parts can, when for a particular SKB they are
        all of length zero, lead to a de-reference of NULL in core networking
        code"
      
      * tag 'xsa448-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen-netback: don't produce zero-size SKB frags
      0f0d819a
  4. 21 Jan, 2024 32 commits
    • Gustavo A. R. Silva's avatar
      init: Kconfig: Disable -Wstringop-overflow for GCC-11 · a5e0ace0
      Gustavo A. R. Silva authored
      -Wstringop-overflow is buggy in GCC-11. Therefore, we should disable
      this option specifically for that compiler version. To achieve this,
      we introduce a new configuration option: GCC11_NO_STRINGOP_OVERFLOW.
      
      The compiler option related to string operation overflow is now managed
      under configuration CC_STRINGOP_OVERFLOW. This option is enabled by
      default for all other versions of GCC that support it.
      
      Link: https://lore.kernel.org/lkml/b3c99290-40bc-426f-b3d2-1aa903f95c4e@embeddedor.com/
      Link: https://lore.kernel.org/lkml/20231128091351.2bfb38dd@canb.auug.org.au/Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/linux-hardening/ZWj1+jkweEDWbmAR@work/Signed-off-by: default avatarGustavo A. R. Silva <gustavoars@kernel.org>
      a5e0ace0
    • Gustavo A. R. Silva's avatar
      Makefile: Enable -Wstringop-overflow globally · 113a6186
      Gustavo A. R. Silva authored
      It seems that we have finished addressing all the remaining
      issues regarding -Wstringop-overflow. So, we are now in good
      shape to enable this compiler option globally.
      Signed-off-by: default avatarGustavo A. R. Silva <gustavoars@kernel.org>
      113a6186
    • Linus Torvalds's avatar
      Linux 6.8-rc1 · 6613476e
      Linus Torvalds authored
      6613476e
    • Linus Torvalds's avatar
      Merge tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs · 35a4474b
      Linus Torvalds authored
      Pull more bcachefs updates from Kent Overstreet:
       "Some fixes, Some refactoring, some minor features:
      
         - Assorted prep work for disk space accounting rewrite
      
         - BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this
           makes our trigger context more explicit
      
         - A few fixes to avoid excessive transaction restarts on
           multithreaded workloads: fstests (in addition to ktest tests) are
           now checking slowpath counters, and that's shaking out a few bugs
      
         - Assorted tracepoint improvements
      
         - Starting to break up bcachefs_format.h and move on disk types so
           they're with the code they belong to; this will make room to start
           documenting the on disk format better.
      
         - A few minor fixes"
      
      * tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits)
        bcachefs: Improve inode_to_text()
        bcachefs: logged_ops_format.h
        bcachefs: reflink_format.h
        bcachefs; extents_format.h
        bcachefs: ec_format.h
        bcachefs: subvolume_format.h
        bcachefs: snapshot_format.h
        bcachefs: alloc_background_format.h
        bcachefs: xattr_format.h
        bcachefs: dirent_format.h
        bcachefs: inode_format.h
        bcachefs; quota_format.h
        bcachefs: sb-counters_format.h
        bcachefs: counters.c -> sb-counters.c
        bcachefs: comment bch_subvolume
        bcachefs: bch_snapshot::btime
        bcachefs: add missing __GFP_NOWARN
        bcachefs: opts->compression can now also be applied in the background
        bcachefs: Prep work for variable size btree node buffers
        bcachefs: grab s_umount only if snapshotting
        ...
      35a4474b
    • Linus Torvalds's avatar
      Merge tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4fbbed78
      Linus Torvalds authored
      Pull timer updates from Thomas Gleixner:
       "Updates for time and clocksources:
      
         - A fix for the idle and iowait time accounting vs CPU hotplug.
      
           The time is reset on CPU hotplug which makes the accumulated
           systemwide time jump backwards.
      
         - Assorted fixes and improvements for clocksource/event drivers"
      
      * tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
        clocksource/drivers/ep93xx: Fix error handling during probe
        clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
        clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
        clocksource/timer-riscv: Add riscv_clock_shutdown callback
        dt-bindings: timer: Add StarFive JH8100 clint
        dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs
      4fbbed78
    • Linus Torvalds's avatar
      Merge tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 7b297a5c
      Linus Torvalds authored
      Pull powerpc fixes from Aneesh Kumar:
      
       - Increase default stack size to 32KB for Book3S
      
      Thanks to Michael Ellerman.
      
      * tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/64s: Increase default stack size to 32KB
      7b297a5c
    • Kent Overstreet's avatar
      bcachefs: Improve inode_to_text() · 249f441f
      Kent Overstreet authored
      Add line breaks - inode_to_text() is now much easier to read.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      249f441f
    • Kent Overstreet's avatar
      bcachefs: logged_ops_format.h · d826cc57
      Kent Overstreet authored
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      d826cc57
    • Kent Overstreet's avatar
      bcachefs: reflink_format.h · 8d52ba60
      Kent Overstreet authored
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      8d52ba60
    • Kent Overstreet's avatar
      bcachefs; extents_format.h · b2fa1b63
      Kent Overstreet authored
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      b2fa1b63
    • Kent Overstreet's avatar
      bcachefs: ec_format.h · 0560eb9a
      Kent Overstreet authored
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      0560eb9a
    • Kent Overstreet's avatar
      bcachefs: subvolume_format.h · c6c4ff65
      Kent Overstreet authored
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      c6c4ff65
    • Kent Overstreet's avatar
      bcachefs: snapshot_format.h · 8fed323b
      Kent Overstreet authored
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      8fed323b
    • Kent Overstreet's avatar
      d455179f
    • Kent Overstreet's avatar
      bcachefs: xattr_format.h · 72e08010
      Kent Overstreet authored
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      72e08010
    • Kent Overstreet's avatar
      bcachefs: dirent_format.h · 7ffc4daa
      Kent Overstreet authored
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      7ffc4daa
    • Kent Overstreet's avatar
      bcachefs: inode_format.h · b36425da
      Kent Overstreet authored
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      b36425da
    • Kent Overstreet's avatar
      bcachefs; quota_format.h · 82de6207
      Kent Overstreet authored
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      82de6207
    • Kent Overstreet's avatar
      bcachefs: sb-counters_format.h · 43314801
      Kent Overstreet authored
      bcachefs_format.h has gotten too big; let's do some organizing.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      43314801
    • Kent Overstreet's avatar
      3a58dfbc
    • Kent Overstreet's avatar
      12207f49
    • Kent Overstreet's avatar
      bcachefs: bch_snapshot::btime · d32088f2
      Kent Overstreet authored
      Add a field to bch_snapshot for creation time; this will be important
      when we start exposing the snapshot tree to userspace.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      d32088f2
    • Kent Overstreet's avatar
      7be0208f
    • Kent Overstreet's avatar
      bcachefs: opts->compression can now also be applied in the background · d7e77f53
      Kent Overstreet authored
      The "apply this compression method in the background" paths now use the
      compression option if background_compression is not set; this means that
      setting or changing the compression option will cause existing data to
      be compressed accordingly in the background.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      d7e77f53
    • Kent Overstreet's avatar
      bcachefs: Prep work for variable size btree node buffers · ec4edd7b
      Kent Overstreet authored
      bcachefs btree nodes are big - typically 256k - and btree roots are
      pinned in memory. As we're now up to 18 btrees, we now have significant
      memory overhead in mostly empty btree roots.
      
      And in the future we're going to start enforcing that certain btree node
      boundaries exist, to solve lock contention issues - analagous to XFS's
      AGIs.
      
      Thus, we need to start allocating smaller btree node buffers when we
      can. This patch changes code that refers to the filesystem constant
      c->opts.btree_node_size to refer to the btree node buffer size -
      btree_buf_bytes() - where appropriate.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      ec4edd7b
    • Su Yue's avatar
      bcachefs: grab s_umount only if snapshotting · 2acc59dd
      Su Yue authored
      When I was testing mongodb over bcachefs with compression,
      there is a lockdep warning when snapshotting mongodb data volume.
      
      $ cat test.sh
      prog=bcachefs
      
      $prog subvolume create /mnt/data
      $prog subvolume create /mnt/data/snapshots
      
      while true;do
          $prog subvolume snapshot /mnt/data /mnt/data/snapshots/$(date +%s)
          sleep 1s
      done
      
      $ cat /etc/mongodb.conf
      systemLog:
        destination: file
        logAppend: true
        path: /mnt/data/mongod.log
      
      storage:
        dbPath: /mnt/data/
      
      lockdep reports:
      [ 3437.452330] ======================================================
      [ 3437.452750] WARNING: possible circular locking dependency detected
      [ 3437.453168] 6.7.0-rc7-custom+ #85 Tainted: G            E
      [ 3437.453562] ------------------------------------------------------
      [ 3437.453981] bcachefs/35533 is trying to acquire lock:
      [ 3437.454325] ffffa0a02b2b1418 (sb_writers#10){.+.+}-{0:0}, at: filename_create+0x62/0x190
      [ 3437.454875]
                     but task is already holding lock:
      [ 3437.455268] ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
      [ 3437.456009]
                     which lock already depends on the new lock.
      
      [ 3437.456553]
                     the existing dependency chain (in reverse order) is:
      [ 3437.457054]
                     -> #3 (&type->s_umount_key#48){.+.+}-{3:3}:
      [ 3437.457507]        down_read+0x3e/0x170
      [ 3437.457772]        bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
      [ 3437.458206]        __x64_sys_ioctl+0x93/0xd0
      [ 3437.458498]        do_syscall_64+0x42/0xf0
      [ 3437.458779]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.459155]
                     -> #2 (&c->snapshot_create_lock){++++}-{3:3}:
      [ 3437.459615]        down_read+0x3e/0x170
      [ 3437.459878]        bch2_truncate+0x82/0x110 [bcachefs]
      [ 3437.460276]        bchfs_truncate+0x254/0x3c0 [bcachefs]
      [ 3437.460686]        notify_change+0x1f1/0x4a0
      [ 3437.461283]        do_truncate+0x7f/0xd0
      [ 3437.461555]        path_openat+0xa57/0xce0
      [ 3437.461836]        do_filp_open+0xb4/0x160
      [ 3437.462116]        do_sys_openat2+0x91/0xc0
      [ 3437.462402]        __x64_sys_openat+0x53/0xa0
      [ 3437.462701]        do_syscall_64+0x42/0xf0
      [ 3437.462982]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.463359]
                     -> #1 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}:
      [ 3437.463843]        down_write+0x3b/0xc0
      [ 3437.464223]        bch2_write_iter+0x5b/0xcc0 [bcachefs]
      [ 3437.464493]        vfs_write+0x21b/0x4c0
      [ 3437.464653]        ksys_write+0x69/0xf0
      [ 3437.464839]        do_syscall_64+0x42/0xf0
      [ 3437.465009]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.465231]
                     -> #0 (sb_writers#10){.+.+}-{0:0}:
      [ 3437.465471]        __lock_acquire+0x1455/0x21b0
      [ 3437.465656]        lock_acquire+0xc6/0x2b0
      [ 3437.465822]        mnt_want_write+0x46/0x1a0
      [ 3437.465996]        filename_create+0x62/0x190
      [ 3437.466175]        user_path_create+0x2d/0x50
      [ 3437.466352]        bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs]
      [ 3437.466617]        __x64_sys_ioctl+0x93/0xd0
      [ 3437.466791]        do_syscall_64+0x42/0xf0
      [ 3437.466957]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.467180]
                     other info that might help us debug this:
      
      [ 3437.469670] 2 locks held by bcachefs/35533:
                     other info that might help us debug this:
      
      [ 3437.467507] Chain exists of:
                       sb_writers#10 --> &c->snapshot_create_lock --> &type->s_umount_key#48
      
      [ 3437.467979]  Possible unsafe locking scenario:
      
      [ 3437.468223]        CPU0                    CPU1
      [ 3437.468405]        ----                    ----
      [ 3437.468585]   rlock(&type->s_umount_key#48);
      [ 3437.468758]                                lock(&c->snapshot_create_lock);
      [ 3437.469030]                                lock(&type->s_umount_key#48);
      [ 3437.469291]   rlock(sb_writers#10);
      [ 3437.469434]
                      *** DEADLOCK ***
      
      [ 3437.469670] 2 locks held by bcachefs/35533:
      [ 3437.469838]  #0: ffffa0a02ce00a88 (&c->snapshot_create_lock){++++}-{3:3}, at: bch2_fs_file_ioctl+0x1e3/0xc90 [bcachefs]
      [ 3437.470294]  #1: ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
      [ 3437.470744]
                     stack backtrace:
      [ 3437.470922] CPU: 7 PID: 35533 Comm: bcachefs Kdump: loaded Tainted: G            E      6.7.0-rc7-custom+ #85
      [ 3437.471313] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
      [ 3437.471694] Call Trace:
      [ 3437.471795]  <TASK>
      [ 3437.471884]  dump_stack_lvl+0x57/0x90
      [ 3437.472035]  check_noncircular+0x132/0x150
      [ 3437.472202]  __lock_acquire+0x1455/0x21b0
      [ 3437.472369]  lock_acquire+0xc6/0x2b0
      [ 3437.472518]  ? filename_create+0x62/0x190
      [ 3437.472683]  ? lock_is_held_type+0x97/0x110
      [ 3437.472856]  mnt_want_write+0x46/0x1a0
      [ 3437.473025]  ? filename_create+0x62/0x190
      [ 3437.473204]  filename_create+0x62/0x190
      [ 3437.473380]  user_path_create+0x2d/0x50
      [ 3437.473555]  bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs]
      [ 3437.473819]  ? lock_acquire+0xc6/0x2b0
      [ 3437.474002]  ? __fget_files+0x2a/0x190
      [ 3437.474195]  ? __fget_files+0xbc/0x190
      [ 3437.474380]  ? lock_release+0xc5/0x270
      [ 3437.474567]  ? __x64_sys_ioctl+0x93/0xd0
      [ 3437.474764]  ? __pfx_bch2_fs_file_ioctl+0x10/0x10 [bcachefs]
      [ 3437.475090]  __x64_sys_ioctl+0x93/0xd0
      [ 3437.475277]  do_syscall_64+0x42/0xf0
      [ 3437.475454]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.475691] RIP: 0033:0x7f2743c313af
      ======================================================
      
      In __bch2_ioctl_subvolume_create(), we grab s_umount unconditionally
      and unlock it at the end of the function. There is a comment
      "why do we need this lock?" about the lock coming from
      commit 42d23732 ("bcachefs: Snapshot creation, deletion")
      The reason is that __bch2_ioctl_subvolume_create() calls
      sync_inodes_sb() which enforce locked s_umount to writeback all dirty
      nodes before doing snapshot works.
      
      Fix it by read locking s_umount for snapshotting only and unlocking
      s_umount after sync_inodes_sb().
      Signed-off-by: default avatarSu Yue <glass.su@suse.com>
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      2acc59dd
    • Su Yue's avatar
      bcachefs: kvfree bch_fs::snapshots in bch2_fs_snapshots_exit · 369acf97
      Su Yue authored
      bch_fs::snapshots is allocated by kvzalloc in __snapshot_t_mut.
      It should be freed by kvfree not kfree.
      Or umount will triger:
      
      [  406.829178 ] BUG: unable to handle page fault for address: ffffe7b487148008
      [  406.830676 ] #PF: supervisor read access in kernel mode
      [  406.831643 ] #PF: error_code(0x0000) - not-present page
      [  406.832487 ] PGD 0 P4D 0
      [  406.832898 ] Oops: 0000 [#1] PREEMPT SMP PTI
      [  406.833512 ] CPU: 2 PID: 1754 Comm: umount Kdump: loaded Tainted: G           OE      6.7.0-rc7-custom+ #90
      [  406.834746 ] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
      [  406.835796 ] RIP: 0010:kfree+0x62/0x140
      [  406.836197 ] Code: 80 48 01 d8 0f 82 e9 00 00 00 48 c7 c2 00 00 00 80 48 2b 15 78 9f 1f 01 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 56 9f 1f 01 <48> 8b 50 08 48 89 c7 f6 c2 01 0f 85 b0 00 00 00 66 90 48 8b 07 f6
      [  406.837810 ] RSP: 0018:ffffb9d641607e48 EFLAGS: 00010286
      [  406.838213 ] RAX: ffffe7b487148000 RBX: ffffb9d645200000 RCX: ffffb9d641607dc4
      [  406.838738 ] RDX: 000065bb00000000 RSI: ffffffffc0d88b84 RDI: ffffb9d645200000
      [  406.839217 ] RBP: ffff9a4625d00068 R08: 0000000000000001 R09: 0000000000000001
      [  406.839650 ] R10: 0000000000000001 R11: 000000000000001f R12: ffff9a4625d4da80
      [  406.840055 ] R13: ffff9a4625d00000 R14: ffffffffc0e2eb20 R15: 0000000000000000
      [  406.840451 ] FS:  00007f0a264ffb80(0000) GS:ffff9a4e2d500000(0000) knlGS:0000000000000000
      [  406.840851 ] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  406.841125 ] CR2: ffffe7b487148008 CR3: 000000018c4d2000 CR4: 00000000000006f0
      [  406.841464 ] Call Trace:
      [  406.841583 ]  <TASK>
      [  406.841682 ]  ? __die+0x1f/0x70
      [  406.841828 ]  ? page_fault_oops+0x159/0x470
      [  406.842014 ]  ? fixup_exception+0x22/0x310
      [  406.842198 ]  ? exc_page_fault+0x1ed/0x200
      [  406.842382 ]  ? asm_exc_page_fault+0x22/0x30
      [  406.842574 ]  ? bch2_fs_release+0x54/0x280 [bcachefs]
      [  406.842842 ]  ? kfree+0x62/0x140
      [  406.842988 ]  ? kfree+0x104/0x140
      [  406.843138 ]  bch2_fs_release+0x54/0x280 [bcachefs]
      [  406.843390 ]  kobject_put+0xb7/0x170
      [  406.843552 ]  deactivate_locked_super+0x2f/0xa0
      [  406.843756 ]  cleanup_mnt+0xba/0x150
      [  406.843917 ]  task_work_run+0x59/0xa0
      [  406.844083 ]  exit_to_user_mode_prepare+0x197/0x1a0
      [  406.844302 ]  syscall_exit_to_user_mode+0x16/0x40
      [  406.844510 ]  do_syscall_64+0x4e/0xf0
      [  406.844675 ]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [  406.844907 ] RIP: 0033:0x7f0a2664e4fb
      Signed-off-by: default avatarSu Yue <glass.su@suse.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      369acf97
    • Kent Overstreet's avatar
      bcachefs: bios must be 512 byte algined · 00fff4dd
      Kent Overstreet authored
      Fixes: 023f9ac9 bcachefs: Delete dio read alignment check
      Reported-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      00fff4dd
    • Colin Ian King's avatar
      bcachefs: remove redundant variable tmp · aead3428
      Colin Ian King authored
      The variable tmp is being assigned a value but it isn't being
      read afterwards. The assignment is redundant and so tmp can be
      removed.
      
      Cleans up clang scan build warning:
      warning: Although the value stored to 'ret' is used in the enclosing
      expression, the value is never actually read from 'ret'
      [deadcode.DeadStores]
      Signed-off-by: default avatarColin Ian King <colin.i.king@gmail.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      aead3428
    • Kent Overstreet's avatar
    • Kent Overstreet's avatar
      bcachefs: Fix excess transaction restarts in __bchfs_fallocate() · 46bf2e9c
      Kent Overstreet authored
      drop_locks_do() should not be used in a fastpath without first trying
      the do in nonblocking mode - the unlock and relock will cause excessive
      transaction restarts and potentially livelocking with other threads that
      are contending for the same locks.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      46bf2e9c
    • Kent Overstreet's avatar
      bcachefs: extents_to_bp_state · 1a503904
      Kent Overstreet authored
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      1a503904