1. 22 Jan, 2024 12 commits
    • Yang Jihong's avatar
      perf record: Fix possible incorrect free in record__switch_output() · aff10a16
      Yang Jihong authored
      perf_data__switch() may not assign a legal value to 'new_filename'.
      In this case, 'new_filename' uses the on-stack value, which may cause an
      incorrect free and unexpected results.
      
      Fixes: 03724b2e ("perf record: Allow to limit number of reported perf.data files")
      Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20240119040304.3708522-2-yangjihong1@huawei.com
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      aff10a16
    • Namhyung Kim's avatar
      perf dwarf-aux: Check allowed DWARF Ops · 55442cc2
      Namhyung Kim authored
      The DWARF location expression can be fairly complex and it'd be hard
      to match it with the condition correctly.  So let's be conservative
      and only allow simple expressions.  For now it just checks the first
      operation in the list.  The following operations look OK:
      
       * DW_OP_stack_value
       * DW_OP_deref_size
       * DW_OP_deref
       * DW_OP_piece
      
      To refuse complex (and unsupported) location expressions, add
      check_allowed_ops() to compare the rest of the list.  It seems earlier
      results contained those unsupported expressions.  For example, I found
      some local struct variables placed like below.
      
       <2><43d1517>: Abbrev Number: 62 (DW_TAG_variable)
          <43d1518>   DW_AT_location    : 15 byte block: 91 50 93 8 91 78 93 4 93 84 8 91 68 93 4
              (DW_OP_fbreg: -48; DW_OP_piece: 8;
               DW_OP_fbreg: -8; DW_OP_piece: 4;
               DW_OP_piece: 1028;
               DW_OP_fbreg: -24; DW_OP_piece: 4)
      
      Another example is something like this.
      
          0057c8be ffffffffffffffff ffffffff812109f0 (base address)
          0057c8ce ffffffff812112b5 ffffffff812112c8 (DW_OP_breg3 (rbx): 0;
                                                      DW_OP_constu: 18446744073709551612;
                                                      DW_OP_and;
                                                      DW_OP_stack_value)
      
      It should refuse them.  After the change, the stat shows:
      
        Annotate data type stats:
        total 294, ok 158 (53.7%), bad 136 (46.3%)
        -----------------------------------------------------------
                30 : no_sym
                32 : no_mem_ops
                53 : no_var
                14 : no_typeinfo
                 7 : bad_offset
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Reviewed-by: Ian Rogers <irogers@google.com>
      Cc: Stephane Eranian <eranian@google.com>
      Link: https://lore.kernel.org/r/20240117062657.985479-10-namhyung@kernel.org
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      55442cc2
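The conservative check described in this commit can be sketched as follows. This is an illustrative Python model, not the perf C code: the real check_allowed_ops() works on libdw operation arrays, and its exact behaviour is not spelled out in the message. Representing an expression as a list of operation names is an assumption made for the example.

```python
# Operations the commit message lists as acceptable after the first one.
ALLOWED_TRAILING_OPS = {
    "DW_OP_stack_value",
    "DW_OP_deref_size",
    "DW_OP_deref",
    "DW_OP_piece",
}


def check_allowed_ops(ops):
    """Return True if every operation after the first is in the allowed set.

    `ops` is a list of operation names for one location expression. The first
    entry (the register-based operation) is assumed to be checked elsewhere.
    """
    return all(op in ALLOWED_TRAILING_OPS for op in ops[1:])


# The two expressions quoted in the commit message, reduced to operation names.
pieces = [
    "DW_OP_fbreg", "DW_OP_piece",
    "DW_OP_fbreg", "DW_OP_piece",
    "DW_OP_piece",
    "DW_OP_fbreg", "DW_OP_piece",
]
masked = ["DW_OP_breg3", "DW_OP_constu", "DW_OP_and", "DW_OP_stack_value"]
```

Both quoted expressions are rejected under this model: the first because DW_OP_fbreg reappears after the first operation, the second because of DW_OP_constu and DW_OP_and.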
    • Namhyung Kim's avatar
      perf annotate-data: Support stack variables · bc10db8e
      Namhyung Kim authored
      Local variables are allocated on the stack and the location list
      should look like base register(s) and an offset.  Extend
      die_find_variable_by_reg() to handle the following expressions:
      
       * DW_OP_breg{0..31}
       * DW_OP_bregx
       * DW_OP_fbreg
      
      Usually DWARF subprogram entries have frame base information and
      use it to locate stack variables like below:
      
       <2><43d1575>: Abbrev Number: 62 (DW_TAG_variable)
          <43d1576>   DW_AT_location    : 2 byte block: 91 7c         (DW_OP_fbreg: -4)  <--- here
          <43d1579>   DW_AT_name        : (indirect string, offset: 0x2c00c9): i
          <43d157d>   DW_AT_decl_file   : 1
          <43d157e>   DW_AT_decl_line   : 78
          <43d157f>   DW_AT_type        : <0x43d19d7>
      
      I found some differences in how gcc and clang save the frame base.
      gcc uses the CFA to get the base so it needs to check the current
      frame's CFI info.  In this case, the stack offset needs to be adjusted
      from the start of the CFA.
      
       <1><1bb8d>: Abbrev Number: 102 (DW_TAG_subprogram)
          <1bb8e>   DW_AT_name        : (indirect string, offset: 0x74d41): kernel_init
          <1bb92>   DW_AT_decl_file   : 2
          <1bb92>   DW_AT_decl_line   : 1440
          <1bb94>   DW_AT_decl_column : 18
          <1bb95>   DW_AT_prototyped  : 1
          <1bb95>   DW_AT_type        : <0xcc>
          <1bb99>   DW_AT_low_pc      : 0xffffffff81bab9e0
          <1bba1>   DW_AT_high_pc     : 0x1b2
          <1bba9>   DW_AT_frame_base  : 1 byte block: 9c      (DW_OP_call_frame_cfa)  <------ here
          <1bbab>   DW_AT_call_all_calls: 1
          <1bbab>   DW_AT_sibling     : <0x1bf5a>
      
      Clang, in contrast, sets it to a register, so the register and offset
      in the instruction can be checked directly.
      
       <1><43d1542>: Abbrev Number: 60 (DW_TAG_subprogram)
          <43d1543>   DW_AT_low_pc      : 0xffffffff816a7c60
          <43d154b>   DW_AT_high_pc     : 0x98
          <43d154f>   DW_AT_frame_base  : 1 byte block: 56    (DW_OP_reg6 (rbp))  <---------- here
          <43d1551>   DW_AT_GNU_all_call_sites: 1
          <43d1551>   DW_AT_name        : (indirect string, offset: 0x3bce91): foo
          <43d1555>   DW_AT_decl_file   : 1
          <43d1556>   DW_AT_decl_line   : 75
          <43d1557>   DW_AT_prototyped  : 1
          <43d1557>   DW_AT_type        : <0x43c7332>
          <43d155b>   DW_AT_external    : 1
      
      Also, as with global variables, the offset needs to be updated after
      finding the type, since the offset was relative to the frame base.  Factor
      out match_var_offset() to check global and local variables in the same way.
      
      The type stats are improved too:
      
        Annotate data type stats:
        total 294, ok 160 (54.4%), bad 134 (45.6%)
        -----------------------------------------------------------
                30 : no_sym
                32 : no_mem_ops
                51 : no_var
                14 : no_typeinfo
                 7 : bad_offset
      Reviewed-by: Ian Rogers <irogers@google.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: https://lore.kernel.org/r/20240117062657.985479-9-namhyung@kernel.org
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      bc10db8e
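The offset arithmetic behind the gcc and clang cases can be sketched as below. This is a Python illustration under stated assumptions, not the perf implementation: the function signatures, the "cfa"/"reg" labels and the idea that the CFA equals a register plus an offset at the sampled pc are all assumptions for the example. match_var_offset() borrows the name from the commit, but its signature here is hypothetical.

```python
def frame_base_offset(insn_reg, insn_offset, fb_kind, fb_reg, cfa_offset=0):
    """Translate an instruction's reg+offset into an offset from the frame base.

    fb_kind is "cfa" (gcc style, DW_OP_call_frame_cfa) or "reg" (clang style,
    DW_OP_regN). For "cfa", the CFA is assumed to be fb_reg + cfa_offset at the
    sampled pc. Returns None when the instruction uses a different register.
    """
    if insn_reg != fb_reg:
        return None
    if fb_kind == "cfa":
        # address = reg + insn_offset = (CFA - cfa_offset) + insn_offset
        return insn_offset - cfa_offset
    # Register frame base: the instruction offset is already frame-base relative.
    return insn_offset


def match_var_offset(access_offset, var_offset, var_size):
    """Return the offset inside the variable, or None if the access is outside it."""
    if var_offset <= access_offset < var_offset + var_size:
        return access_offset - var_offset
    return None
```

For instance, with a variable at DW_OP_fbreg -48 that is 8 bytes wide, an access at frame-base offset -44 lands 4 bytes into the variable, which is the updated offset used for the type lookup.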
    • Namhyung Kim's avatar
      perf dwarf-aux: Add die_get_cfa() · 6fed025f
      Namhyung Kim authored
      die_get_cfa() gets the frame base register and offset at the given
      instruction address (pc).  This info will be used to locate stack
      variables whose location expressions use DW_OP_fbreg.
      Reviewed-by: Ian Rogers <irogers@google.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: https://lore.kernel.org/r/20240117062657.985479-8-namhyung@kernel.org
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      6fed025f
    • Namhyung Kim's avatar
      perf annotate-data: Support global variables · 5f7cdde8
      Namhyung Kim authored
      Global variables are accessed using PC-relative addressing, so they need
      to be handled separately.  PC-relative addressing is detected by using
      DWARF_REG_PC.  On x86, the %rip register would be used.
      
      The address can be calculated using the ip and offset in the
      instruction.  But it should start from the next instruction so add
      calculate_pcrel_addr() to do it properly.
      
      But global variables defined in a different file would only have a
      declaration which doesn't include a location list.  So it first tries
      to get the type info using the address, and then looks up the variable
      declarations using name.  The name of global variables should be obtained
      from the symbol table.  The declaration would have the type info.
      
      So extend find_var_type() to take both address and name for global
      variables.
      
      The stat now looks like:
      
        Annotate data type stats:
        total 294, ok 153 (52.0%), bad 141 (48.0%)
        -----------------------------------------------------------
                30 : no_sym
                32 : no_mem_ops
                61 : no_var
                10 : no_typeinfo
                 8 : bad_offset
      Reviewed-by: Ian Rogers <irogers@google.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: https://lore.kernel.org/r/20240117062657.985479-7-namhyung@kernel.org
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      5f7cdde8
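The PC-relative calculation mentioned in this commit can be illustrated with a few lines of Python. The name calculate_pcrel_addr() comes from the commit message, but the signature below (instruction address, instruction length, displacement) and the sample numbers are assumptions for the sketch.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def calculate_pcrel_addr(ip, insn_len, offset):
    """Return the address referenced by a PC-relative operand.

    The displacement is relative to the address of the *next* instruction,
    so the current instruction's length is added before the offset.
    """
    return (ip + insn_len + offset) & MASK64
```

The mask keeps the result within 64 bits so that negative displacements and kernel addresses near the top of the address space behave the same way.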
    • Namhyung Kim's avatar
      perf annotate-data: Handle PC-relative addressing · 83bfa06d
      Namhyung Kim authored
      Extend find_data_type_die() to find data type from PC-relative address
      using die_find_variable_by_addr().  Users need to pass the address for
      the (global) variable.
      
      The offset for the variable should be updated after finding the type
      because the offset in the instruction is just to calculate the address
      for the variable.  So the code now passes a pointer to the offset, renamed
      to 'poffset'.
      
      First it searches variables in the CU DIE as it's likely that the global
      variables are defined at the file level.  Then it iterates the scope
      DIEs to find a local (static) variable.
      Reviewed-by: Ian Rogers <irogers@google.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: https://lore.kernel.org/r/20240117062657.985479-6-namhyung@kernel.org
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      83bfa06d
    • Namhyung Kim's avatar
      perf annotate-data: Add stack operation pseudo type · 7a54f1d8
      Namhyung Kim authored
      A typical function prologue and epilogue include multiple stack
      operations to save and restore the current value of registers.
      On x86, it looks like below:
      
        push  r15
        push  r14
        push  r13
        push  r12
      
        ...
      
        pop   r12
        pop   r13
        pop   r14
        pop   r15
        ret
      
      As these all touch the stack memory region, chances are high that they
      appear in memory profile data.  But these are not used for any real
      purpose yet so it'd return no types.
      
      One of my profiles shows that a non-negligible portion of data came from
      the stack operations.  It also seems GCC generates more stack operations
      than clang.
      
      Annotate Instruction stats
      total 264, ok 169 (64.0%), bad 95 (36.0%)
      
          Name      :  Good   Bad
        -----------------------------------------------------------
          movq      :    49    27
          movl      :    24     9
          popq      :     0    19   <-- here
          cmpl      :    17     2
          addq      :    14     1
          cmpq      :    12     2
          cmpxchgl  :     3     7
      
      Instead of treating them as unknown, let's create a separate pseudo type
      to represent those stack operations.
      Reviewed-by: Ian Rogers <irogers@google.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: https://lore.kernel.org/r/20240117062657.985479-5-namhyung@kernel.org
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      7a54f1d8
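A rough Python sketch of the classification idea follows. It assumes that only mnemonics beginning with push or pop count as stack operations; the commit message does not list the exact set, and the STACK_OP marker below is a hypothetical stand-in for the pseudo type.

```python
# Hypothetical marker standing in for the stack operation pseudo type.
STACK_OP = "stack-operation"


def is_stack_operation(mnemonic):
    """True for push/pop style mnemonics such as pushq or popq (assumed set)."""
    return mnemonic.startswith(("push", "pop"))


def classify(mnemonic, lookup_type):
    """Return the pseudo type for stack operations, else defer to lookup_type."""
    if is_stack_operation(mnemonic):
        return STACK_OP
    return lookup_type(mnemonic)
```

With this split, the popq samples in the table above would be attributed to the pseudo type rather than counted as failures.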
    • Namhyung Kim's avatar
      perf annotate-data: Handle array style accesses · d3030191
      Namhyung Kim authored
      On x86, instructions for array access often look like below.
      
        mov  0x1234(%rax,%rbx,8), %rcx
      
      Usually the first register holds the type information and the second one
      has the index.  And the current code only looks up a variable for the
      first register.  But it can be the other way around, so it
      needs to check the second register if the first one fails.
      
      The stat changed like this.
      
        Annotate data type stats:
        total 294, ok 148 (50.3%), bad 146 (49.7%)
        -----------------------------------------------------------
                30 : no_sym
                32 : no_mem_ops
                66 : no_var
                10 : no_typeinfo
                 8 : bad_offset
      Reviewed-by: Ian Rogers <irogers@google.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: https://lore.kernel.org/r/20240117062657.985479-4-namhyung@kernel.org
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      d3030191
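The base-then-index fallback can be sketched in Python as below. The regular expression, the function names and the `lookup` callback are all assumptions made for illustration; perf parses operands in C and looks variables up through DWARF.

```python
import re

# AT&T-style memory operand: disp(%base,%index,scale); index and scale optional.
MEM_OPERAND = re.compile(
    r"(?P<disp>-?(?:0x[0-9a-fA-F]+|\d+))?"
    r"\(%(?P<base>\w+)(?:,%(?P<index>\w+)(?:,(?P<scale>[1248]))?)?\)"
)


def parse_mem_operand(text):
    """Return (disp, base, index, scale) for the first memory operand, or None."""
    m = MEM_OPERAND.search(text)
    if m is None:
        return None
    disp = int(m.group("disp"), 0) if m.group("disp") else 0
    scale = int(m.group("scale")) if m.group("scale") else 1
    return disp, m.group("base"), m.group("index"), scale


def find_type(operand, lookup):
    """Try the base register first, then fall back to the index register."""
    parsed = parse_mem_operand(operand)
    if parsed is None:
        return None
    _disp, base, index, _scale = parsed
    for reg in (base, index):
        if reg is not None:
            found = lookup(reg)
            if found is not None:
                return found
    return None
```

Passing a dictionary's `get` method as `lookup` is enough to try it: if only the index register has a known variable, find_type() still returns it.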
    • Namhyung Kim's avatar
      perf annotate-data: Handle macro fusion on x86 · 1cf4df03
      Namhyung Kim authored
      When a sample comes from a conditional branch without a memory
      operand, it could be due to macro fusion with the previous instruction.
      So it needs to check the memory operand in the previous one.
      
      This improves the stat like below:
      
        Annotate data type stats:
        total 294, ok 147 (50.0%), bad 147 (50.0%)
        -----------------------------------------------------------
                30 : no_sym
                32 : no_mem_ops
                71 : no_var
                 6 : no_typeinfo
                 8 : bad_offset
      Reviewed-by: Ian Rogers <irogers@google.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: https://lore.kernel.org/r/20240117062657.985479-3-namhyung@kernel.org
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      1cf4df03
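A minimal Python sketch of the fallback is shown here. It assumes instructions are available as (mnemonic, operands) pairs, that a memory operand can be recognised by a parenthesis, and that conditional branches are the "j" mnemonics other than jmp. None of these details come from the commit, and the sample instructions are made up.

```python
def is_conditional_branch(mnemonic):
    """Assumed rule: x86 conditional jumps start with 'j' but are not jmp variants."""
    return mnemonic.startswith("j") and not mnemonic.startswith("jmp")


def insn_for_sample(insns, idx):
    """Return the index of the instruction whose memory operand should be used.

    If the sampled instruction is a conditional branch without a memory
    operand, fall back to the previous instruction, which may have been
    fused with the branch.
    """
    mnemonic, operands = insns[idx]
    if is_conditional_branch(mnemonic) and "(" not in operands and idx > 0:
        return idx - 1
    return idx


# Hypothetical instruction sequence for illustration.
insns = [
    ("cmpl", "$0x0,0x10(%rdi)"),
    ("je", "ffffffff81000700"),
    ("movq", "0x8(%rsi),%rax"),
]
```

A sample attributed to the `je` above would be resolved through the preceding `cmpl`, which carries the memory operand.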
    • Namhyung Kim's avatar
      perf annotate-data: Parse 'lock' prefix from llvm-objdump · a3397d69
      Namhyung Kim authored
      For performance reasons, I prefer llvm-objdump over GNU's.  But I
      found that llvm-objdump puts the x86 lock prefix on a separate line like
      below.
      
        ffffffff81000695: f0                    lock
        ffffffff81000696: ff 83 54 0b 00 00     incl    2900(%rbx)
      
      This should be parsed properly, but for now I just changed it to find
      the insn at the next offset.
      
      This improves the statistics as it can process more instructions.
      
        Annotate data type stats:
        total 294, ok 144 (49.0%), bad 150 (51.0%)
        -----------------------------------------------------------
                30 : no_sym
                35 : no_mem_ops
                71 : no_var
                 6 : no_typeinfo
                 8 : bad_offset
      Reviewed-by: Ian Rogers <irogers@google.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: https://lore.kernel.org/r/20240117062657.985479-2-namhyung@kernel.org
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      a3397d69
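The "use the next instruction" workaround can be modelled in Python using the two disassembly lines quoted above. The line parser is a simplified assumption (it treats leading two-character hex tokens as raw bytes) and is not how perf reads objdump output.

```python
HEX_DIGITS = set("0123456789abcdef")


def parse_line(line):
    """Split 'addr: bytes  mnemonic operands' into (addr, mnemonic, operands)."""
    addr_text, rest = line.split(":", 1)
    fields = rest.split()
    i = 0
    # Skip the raw instruction bytes, assumed to be two-digit hex tokens.
    while i < len(fields) and len(fields[i]) == 2 and set(fields[i]) <= HEX_DIGITS:
        i += 1
    mnemonic = fields[i] if i < len(fields) else ""
    operands = " ".join(fields[i + 1:])
    return int(addr_text, 16), mnemonic, operands


def find_insn(lines, addr):
    """Find the instruction at addr; a bare 'lock' line defers to the next line."""
    parsed = [parse_line(line) for line in lines]
    for n, insn in enumerate(parsed):
        if insn[0] == addr:
            if insn[1] == "lock" and not insn[2] and n + 1 < len(parsed):
                return parsed[n + 1]
            return insn
    return None


lines = [
    "ffffffff81000695: f0                    lock",
    "ffffffff81000696: ff 83 54 0b 00 00     incl    2900(%rbx)",
]
```

Looking up the address of the `lock` line returns the `incl` instruction, so the memory operand `2900(%rbx)` becomes available for type lookup.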
    • Yang Jihong's avatar
      perf build: Check whether pkg-config is installed when libtraceevent is linked · 8462247f
      Yang Jihong authored
      If pkg-config is not installed when libtraceevent is linked, the build fails.
      
      The error information is as follows:
      
        $ make
        <SNIP>
        In file included from /home/yjh/projects_linux/perf-tool-next/linux/tools/perf/util/evsel.c:43:
        /home/yjh/projects_linux/perf-tool-next/linux/tools/perf/util/trace-event.h:149:62: error: operator '&&' has no right operand
          149 | #if defined(LIBTRACEEVENT_VERSION) &&  LIBTRACEEVENT_VERSION >= MAKE_LIBTRACEEVENT_VERSION(1, 5, 0)
              |                                                              ^~
        error: command '/usr/bin/gcc' failed with exit code 1
        cp: cannot stat 'python_ext_build/lib/perf*.so': No such file or directory
        make[2]: *** [Makefile.perf:668: python/perf.cpython-310-x86_64-linux-gnu.so] Error 1
        make[2]: *** Waiting for unfinished jobs....
      
      Because pkg-config is not installed, the libtraceevent version cannot be
      obtained in the Makefile.config file. As a result, LIBTRACEEVENT_VERSION
      is empty. However, the preceding error information is not user-friendly.
      
      Identify errors in advance by checking that pkg-config is installed at
      compile time.
      
      The build results of various scenarios are as follows:
      
      1. build successful when libtraceevent is not linked and pkg-config is not installed
      
        $ pkg-config --version
        -bash: /usr/bin/pkg-config: No such file or directory
        $ make clean >/dev/null
        $ make NO_LIBTRACEEVENT=1 >/dev/null
        Makefile.config:1133: No alternatives command found, you need to set JDIR= to point to the root of your Java directory
          PERF_VERSION = 6.7.rc6.gd988c9f5
        $ echo $?
        0
      
      2. dummy pkg-config is missing when libtraceevent is linked
      
        $ pkg-config --version
        -bash: /usr/bin/pkg-config: No such file or directory
        $ make clean >/dev/null
        $ make >/dev/null
        Makefile.config:221: *** Error: pkg-config needed by libtraceevent is missing on this system, please install it.  Stop.
        make[1]: *** [Makefile.perf:251: sub-make] Error 2
        make: *** [Makefile:70: all] Error 2
        $ echo $?
        2
      
      3. build successful when libtraceevent is linked and pkg-config is installed
      
        $ pkg-config --version
        0.29.2
        $ make clean >/dev/null
        $ make >/dev/null
        Makefile.config:1133: No alternatives command found, you need to set JDIR= to point to the root of your Java directory
          PERF_VERSION = 6.7.rc6.gd988c9f5
        $ echo $?
        0
      Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20240112034019.3558584-1-yangjihong1@huawei.com
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      8462247f
    • Thomas Richter's avatar
      perf test: raise limit to 20 percent for perf_stat_--bpf-counters_test · 999eea92
      Thomas Richter authored
      This test case often fails on s390 (about 2 out of 10 runs) because the
      difference between --bpf-counters event counting and s390 hardware
      counting exceeds the 10% limit in all failure cases.
      Raise the limit to 20% on s390 and the test case succeeds.
      Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Cc: gor@linux.ibm.com
      Cc: hca@linux.ibm.com
      Cc: sumanthk@linux.ibm.com
      Cc: svens@linux.ibm.com
      Link: https://lore.kernel.org/r/20240108084009.3959211-1-tmricht@linux.ibm.com
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      999eea92
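The effect of raising the threshold can be illustrated with a small Python check. The actual test is a shell script whose comparison may be written differently; the function name, the choice of the hardware count as the reference, and the sample counts are assumptions.

```python
def within_limit(base_count, bpf_count, limit_pct):
    """True if bpf_count differs from base_count by at most limit_pct percent."""
    # Integer arithmetic avoids rounding issues when comparing percentages.
    return abs(bpf_count - base_count) * 100 <= limit_pct * base_count
```

A run where the two counts differ by 15% would fail with a limit of 10 and pass with a limit of 20.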
  2. 21 Jan, 2024 28 commits
    • Linus Torvalds's avatar
      Linux 6.8-rc1 · 6613476e
      Linus Torvalds authored
      6613476e
    • Linus Torvalds's avatar
      Merge tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs · 35a4474b
      Linus Torvalds authored
      Pull more bcachefs updates from Kent Overstreet:
       "Some fixes, some refactoring, some minor features:
      
         - Assorted prep work for disk space accounting rewrite
      
         - BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this
           makes our trigger context more explicit
      
         - A few fixes to avoid excessive transaction restarts on
           multithreaded workloads: fstests (in addition to ktest tests) are
           now checking slowpath counters, and that's shaking out a few bugs
      
         - Assorted tracepoint improvements
      
         - Starting to break up bcachefs_format.h and move on disk types so
           they're with the code they belong to; this will make room to start
           documenting the on disk format better.
      
         - A few minor fixes"
      
      * tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits)
        bcachefs: Improve inode_to_text()
        bcachefs: logged_ops_format.h
        bcachefs: reflink_format.h
        bcachefs; extents_format.h
        bcachefs: ec_format.h
        bcachefs: subvolume_format.h
        bcachefs: snapshot_format.h
        bcachefs: alloc_background_format.h
        bcachefs: xattr_format.h
        bcachefs: dirent_format.h
        bcachefs: inode_format.h
        bcachefs; quota_format.h
        bcachefs: sb-counters_format.h
        bcachefs: counters.c -> sb-counters.c
        bcachefs: comment bch_subvolume
        bcachefs: bch_snapshot::btime
        bcachefs: add missing __GFP_NOWARN
        bcachefs: opts->compression can now also be applied in the background
        bcachefs: Prep work for variable size btree node buffers
        bcachefs: grab s_umount only if snapshotting
        ...
      35a4474b
    • Linus Torvalds's avatar
      Merge tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4fbbed78
      Linus Torvalds authored
      Pull timer updates from Thomas Gleixner:
       "Updates for time and clocksources:
      
         - A fix for the idle and iowait time accounting vs CPU hotplug.
      
           The time is reset on CPU hotplug which makes the accumulated
           systemwide time jump backwards.
      
         - Assorted fixes and improvements for clocksource/event drivers"
      
      * tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
        clocksource/drivers/ep93xx: Fix error handling during probe
        clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
        clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
        clocksource/timer-riscv: Add riscv_clock_shutdown callback
        dt-bindings: timer: Add StarFive JH8100 clint
        dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs
      4fbbed78
    • Linus Torvalds's avatar
      Merge tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 7b297a5c
      Linus Torvalds authored
      Pull powerpc fixes from Aneesh Kumar:
      
       - Increase default stack size to 32KB for Book3S
      
      Thanks to Michael Ellerman.
      
      * tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/64s: Increase default stack size to 32KB
      7b297a5c
    • Kent Overstreet's avatar
      bcachefs: Improve inode_to_text() · 249f441f
      Kent Overstreet authored
      Add line breaks - inode_to_text() is now much easier to read.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      249f441f
    • Kent Overstreet's avatar
      bcachefs: logged_ops_format.h · d826cc57
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      d826cc57
    • Kent Overstreet's avatar
      bcachefs: reflink_format.h · 8d52ba60
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      8d52ba60
    • Kent Overstreet's avatar
      bcachefs; extents_format.h · b2fa1b63
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      b2fa1b63
    • Kent Overstreet's avatar
      bcachefs: ec_format.h · 0560eb9a
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      0560eb9a
    • Kent Overstreet's avatar
      bcachefs: subvolume_format.h · c6c4ff65
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      c6c4ff65
    • Kent Overstreet's avatar
      bcachefs: snapshot_format.h · 8fed323b
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      8fed323b
    • Kent Overstreet's avatar
      d455179f
    • Kent Overstreet's avatar
      bcachefs: xattr_format.h · 72e08010
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      72e08010
    • Kent Overstreet's avatar
      bcachefs: dirent_format.h · 7ffc4daa
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      7ffc4daa
    • Kent Overstreet's avatar
      bcachefs: inode_format.h · b36425da
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      b36425da
    • Kent Overstreet's avatar
      bcachefs; quota_format.h · 82de6207
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      82de6207
    • Kent Overstreet's avatar
      bcachefs: sb-counters_format.h · 43314801
      Kent Overstreet authored
      bcachefs_format.h has gotten too big; let's do some organizing.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      43314801
    • Kent Overstreet's avatar
      3a58dfbc
    • Kent Overstreet's avatar
      12207f49
    • Kent Overstreet's avatar
      bcachefs: bch_snapshot::btime · d32088f2
      Kent Overstreet authored
      Add a field to bch_snapshot for creation time; this will be important
      when we start exposing the snapshot tree to userspace.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      d32088f2
    • Kent Overstreet's avatar
      7be0208f
    • Kent Overstreet's avatar
      bcachefs: opts->compression can now also be applied in the background · d7e77f53
      Kent Overstreet authored
      The "apply this compression method in the background" paths now use the
      compression option if background_compression is not set; this means that
      setting or changing the compression option will cause existing data to
      be compressed accordingly in the background.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      d7e77f53
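The fallback between the two options can be pictured with a short Python sketch. Representing the options as a dictionary and an unset option as a missing or None entry are assumptions for illustration; bcachefs stores and resolves options differently in C.

```python
def background_compression_target(opts):
    """Pick the compression method used by the background paths.

    Prefer background_compression when it is set; otherwise fall back to
    the foreground compression option, as described in the commit message.
    """
    if opts.get("background_compression") is not None:
        return opts["background_compression"]
    return opts.get("compression")
```

Under this model, setting only `compression` is enough for existing data to be recompressed in the background with that method.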
    • Kent Overstreet's avatar
      bcachefs: Prep work for variable size btree node buffers · ec4edd7b
      Kent Overstreet authored
      bcachefs btree nodes are big - typically 256k - and btree roots are
      pinned in memory. As we're now up to 18 btrees, we now have significant
      memory overhead in mostly empty btree roots.
      
      And in the future we're going to start enforcing that certain btree node
      boundaries exist, to solve lock contention issues - analogous to XFS's
      AGIs.
      
      Thus, we need to start allocating smaller btree node buffers when we
      can. This patch changes code that refers to the filesystem constant
      c->opts.btree_node_size to refer to the btree node buffer size -
      btree_buf_bytes() - where appropriate.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      ec4edd7b
    • Su Yue's avatar
      bcachefs: grab s_umount only if snapshotting · 2acc59dd
      Su Yue authored
      When I was testing mongodb over bcachefs with compression,
      there was a lockdep warning when snapshotting the mongodb data volume.
      
      $ cat test.sh
      prog=bcachefs
      
      $prog subvolume create /mnt/data
      $prog subvolume create /mnt/data/snapshots
      
      while true;do
          $prog subvolume snapshot /mnt/data /mnt/data/snapshots/$(date +%s)
          sleep 1s
      done
      
      $ cat /etc/mongodb.conf
      systemLog:
        destination: file
        logAppend: true
        path: /mnt/data/mongod.log
      
      storage:
        dbPath: /mnt/data/
      
      lockdep reports:
      [ 3437.452330] ======================================================
      [ 3437.452750] WARNING: possible circular locking dependency detected
      [ 3437.453168] 6.7.0-rc7-custom+ #85 Tainted: G            E
      [ 3437.453562] ------------------------------------------------------
      [ 3437.453981] bcachefs/35533 is trying to acquire lock:
      [ 3437.454325] ffffa0a02b2b1418 (sb_writers#10){.+.+}-{0:0}, at: filename_create+0x62/0x190
      [ 3437.454875]
                     but task is already holding lock:
      [ 3437.455268] ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
      [ 3437.456009]
                     which lock already depends on the new lock.
      
      [ 3437.456553]
                     the existing dependency chain (in reverse order) is:
      [ 3437.457054]
                     -> #3 (&type->s_umount_key#48){.+.+}-{3:3}:
      [ 3437.457507]        down_read+0x3e/0x170
      [ 3437.457772]        bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
      [ 3437.458206]        __x64_sys_ioctl+0x93/0xd0
      [ 3437.458498]        do_syscall_64+0x42/0xf0
      [ 3437.458779]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.459155]
                     -> #2 (&c->snapshot_create_lock){++++}-{3:3}:
      [ 3437.459615]        down_read+0x3e/0x170
      [ 3437.459878]        bch2_truncate+0x82/0x110 [bcachefs]
      [ 3437.460276]        bchfs_truncate+0x254/0x3c0 [bcachefs]
      [ 3437.460686]        notify_change+0x1f1/0x4a0
      [ 3437.461283]        do_truncate+0x7f/0xd0
      [ 3437.461555]        path_openat+0xa57/0xce0
      [ 3437.461836]        do_filp_open+0xb4/0x160
      [ 3437.462116]        do_sys_openat2+0x91/0xc0
      [ 3437.462402]        __x64_sys_openat+0x53/0xa0
      [ 3437.462701]        do_syscall_64+0x42/0xf0
      [ 3437.462982]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.463359]
                     -> #1 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}:
      [ 3437.463843]        down_write+0x3b/0xc0
      [ 3437.464223]        bch2_write_iter+0x5b/0xcc0 [bcachefs]
      [ 3437.464493]        vfs_write+0x21b/0x4c0
      [ 3437.464653]        ksys_write+0x69/0xf0
      [ 3437.464839]        do_syscall_64+0x42/0xf0
      [ 3437.465009]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.465231]
                     -> #0 (sb_writers#10){.+.+}-{0:0}:
      [ 3437.465471]        __lock_acquire+0x1455/0x21b0
      [ 3437.465656]        lock_acquire+0xc6/0x2b0
      [ 3437.465822]        mnt_want_write+0x46/0x1a0
      [ 3437.465996]        filename_create+0x62/0x190
      [ 3437.466175]        user_path_create+0x2d/0x50
      [ 3437.466352]        bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs]
      [ 3437.466617]        __x64_sys_ioctl+0x93/0xd0
      [ 3437.466791]        do_syscall_64+0x42/0xf0
      [ 3437.466957]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.467180]
                     other info that might help us debug this:
      
      [ 3437.469670] 2 locks held by bcachefs/35533:
                     other info that might help us debug this:
      
      [ 3437.467507] Chain exists of:
                       sb_writers#10 --> &c->snapshot_create_lock --> &type->s_umount_key#48
      
      [ 3437.467979]  Possible unsafe locking scenario:
      
      [ 3437.468223]        CPU0                    CPU1
      [ 3437.468405]        ----                    ----
      [ 3437.468585]   rlock(&type->s_umount_key#48);
      [ 3437.468758]                                lock(&c->snapshot_create_lock);
      [ 3437.469030]                                lock(&type->s_umount_key#48);
      [ 3437.469291]   rlock(sb_writers#10);
      [ 3437.469434]
                      *** DEADLOCK ***
      
      [ 3437.469670] 2 locks held by bcachefs/35533:
      [ 3437.469838]  #0: ffffa0a02ce00a88 (&c->snapshot_create_lock){++++}-{3:3}, at: bch2_fs_file_ioctl+0x1e3/0xc90 [bcachefs]
      [ 3437.470294]  #1: ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
      [ 3437.470744]
                     stack backtrace:
      [ 3437.470922] CPU: 7 PID: 35533 Comm: bcachefs Kdump: loaded Tainted: G            E      6.7.0-rc7-custom+ #85
      [ 3437.471313] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
      [ 3437.471694] Call Trace:
      [ 3437.471795]  <TASK>
      [ 3437.471884]  dump_stack_lvl+0x57/0x90
      [ 3437.472035]  check_noncircular+0x132/0x150
      [ 3437.472202]  __lock_acquire+0x1455/0x21b0
      [ 3437.472369]  lock_acquire+0xc6/0x2b0
      [ 3437.472518]  ? filename_create+0x62/0x190
      [ 3437.472683]  ? lock_is_held_type+0x97/0x110
      [ 3437.472856]  mnt_want_write+0x46/0x1a0
      [ 3437.473025]  ? filename_create+0x62/0x190
      [ 3437.473204]  filename_create+0x62/0x190
      [ 3437.473380]  user_path_create+0x2d/0x50
      [ 3437.473555]  bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs]
      [ 3437.473819]  ? lock_acquire+0xc6/0x2b0
      [ 3437.474002]  ? __fget_files+0x2a/0x190
      [ 3437.474195]  ? __fget_files+0xbc/0x190
      [ 3437.474380]  ? lock_release+0xc5/0x270
      [ 3437.474567]  ? __x64_sys_ioctl+0x93/0xd0
      [ 3437.474764]  ? __pfx_bch2_fs_file_ioctl+0x10/0x10 [bcachefs]
      [ 3437.475090]  __x64_sys_ioctl+0x93/0xd0
      [ 3437.475277]  do_syscall_64+0x42/0xf0
      [ 3437.475454]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.475691] RIP: 0033:0x7f2743c313af
      ======================================================
      
      In __bch2_ioctl_subvolume_create(), we grab s_umount unconditionally
      and unlock it at the end of the function. A comment there asks
      "why do we need this lock?"; the lock was introduced by
      commit 42d23732 ("bcachefs: Snapshot creation, deletion").
      The reason is that __bch2_ioctl_subvolume_create() calls
      sync_inodes_sb(), which requires s_umount to be held, to write back
      all dirty inodes before doing the snapshot work.
      Holding s_umount across user_path_create(), which takes sb_writers
      via mnt_want_write(), is what closes the cycle reported above.
      
      Fix it by read-locking s_umount only when snapshotting and unlocking
      s_umount after sync_inodes_sb().
      Signed-off-by: Su Yue <glass.su@suse.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      2acc59dd
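The locking change described in this commit can be modelled in userspace. The sketch below is not the kernel patch: `toy_rwsem`, `toy_sb` and all `toy_*` functions are hypothetical stand-ins, and the "lock" is only a reader counter. It illustrates the intended shape, assuming the fix is as the message states: the read lock is taken only for snapshots, held across the sync step, and dropped before the path is created.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a reader/writer semaphore such as s_umount. */
struct toy_rwsem {
	int readers;
};

static void toy_down_read(struct toy_rwsem *sem)
{
	sem->readers++;
}

static void toy_up_read(struct toy_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

/* Toy superblock that records under which conditions each step ran. */
struct toy_sb {
	struct toy_rwsem s_umount;
	int synced_with_lock;	/* sync steps run while s_umount was held */
	int created_with_lock;	/* path creations run while s_umount was held */
	int created;		/* total path creations */
};

/* Stand-in for sync_inodes_sb(): expects s_umount to be held. */
static void toy_sync_inodes(struct toy_sb *sb)
{
	if (sb->s_umount.readers > 0)
		sb->synced_with_lock++;
}

/* Stand-in for user_path_create(): should run without s_umount held. */
static void toy_path_create(struct toy_sb *sb)
{
	sb->created++;
	if (sb->s_umount.readers > 0)
		sb->created_with_lock++;
}

/* Take the read lock only when snapshotting, and only around the sync. */
static void toy_subvolume_create(struct toy_sb *sb, bool snapshot)
{
	if (snapshot) {
		toy_down_read(&sb->s_umount);
		toy_sync_inodes(sb);
		toy_up_read(&sb->s_umount);
	}
	toy_path_create(sb);
}
```

In this model the path creation never runs with the toy lock held, which is the property that removes the s_umount to sb_writers dependency from the chain.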
    • bcachefs: kvfree bch_fs::snapshots in bch2_fs_snapshots_exit · 369acf97
      Su Yue authored
      bch_fs::snapshots is allocated by kvzalloc() in __snapshot_t_mut().
      It must be freed with kvfree(), not kfree().
      Otherwise umount will trigger:
      
      [  406.829178 ] BUG: unable to handle page fault for address: ffffe7b487148008
      [  406.830676 ] #PF: supervisor read access in kernel mode
      [  406.831643 ] #PF: error_code(0x0000) - not-present page
      [  406.832487 ] PGD 0 P4D 0
      [  406.832898 ] Oops: 0000 [#1] PREEMPT SMP PTI
      [  406.833512 ] CPU: 2 PID: 1754 Comm: umount Kdump: loaded Tainted: G           OE      6.7.0-rc7-custom+ #90
      [  406.834746 ] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
      [  406.835796 ] RIP: 0010:kfree+0x62/0x140
      [  406.836197 ] Code: 80 48 01 d8 0f 82 e9 00 00 00 48 c7 c2 00 00 00 80 48 2b 15 78 9f 1f 01 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 56 9f 1f 01 <48> 8b 50 08 48 89 c7 f6 c2 01 0f 85 b0 00 00 00 66 90 48 8b 07 f6
      [  406.837810 ] RSP: 0018:ffffb9d641607e48 EFLAGS: 00010286
      [  406.838213 ] RAX: ffffe7b487148000 RBX: ffffb9d645200000 RCX: ffffb9d641607dc4
      [  406.838738 ] RDX: 000065bb00000000 RSI: ffffffffc0d88b84 RDI: ffffb9d645200000
      [  406.839217 ] RBP: ffff9a4625d00068 R08: 0000000000000001 R09: 0000000000000001
      [  406.839650 ] R10: 0000000000000001 R11: 000000000000001f R12: ffff9a4625d4da80
      [  406.840055 ] R13: ffff9a4625d00000 R14: ffffffffc0e2eb20 R15: 0000000000000000
      [  406.840451 ] FS:  00007f0a264ffb80(0000) GS:ffff9a4e2d500000(0000) knlGS:0000000000000000
      [  406.840851 ] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  406.841125 ] CR2: ffffe7b487148008 CR3: 000000018c4d2000 CR4: 00000000000006f0
      [  406.841464 ] Call Trace:
      [  406.841583 ]  <TASK>
      [  406.841682 ]  ? __die+0x1f/0x70
      [  406.841828 ]  ? page_fault_oops+0x159/0x470
      [  406.842014 ]  ? fixup_exception+0x22/0x310
      [  406.842198 ]  ? exc_page_fault+0x1ed/0x200
      [  406.842382 ]  ? asm_exc_page_fault+0x22/0x30
      [  406.842574 ]  ? bch2_fs_release+0x54/0x280 [bcachefs]
      [  406.842842 ]  ? kfree+0x62/0x140
      [  406.842988 ]  ? kfree+0x104/0x140
      [  406.843138 ]  bch2_fs_release+0x54/0x280 [bcachefs]
      [  406.843390 ]  kobject_put+0xb7/0x170
      [  406.843552 ]  deactivate_locked_super+0x2f/0xa0
      [  406.843756 ]  cleanup_mnt+0xba/0x150
      [  406.843917 ]  task_work_run+0x59/0xa0
      [  406.844083 ]  exit_to_user_mode_prepare+0x197/0x1a0
      [  406.844302 ]  syscall_exit_to_user_mode+0x16/0x40
      [  406.844510 ]  do_syscall_64+0x4e/0xf0
      [  406.844675 ]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [  406.844907 ] RIP: 0033:0x7f0a2664e4fb
      Signed-off-by: Su Yue <glass.su@suse.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      369acf97
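Why the free routine has to match the allocator can be shown with a small userspace model. Everything here is an assumption made for illustration: the `toy_*` names are hypothetical, the size threshold is a simplification of how kvzalloc() picks a backend, and the backend is recorded in a header, whereas the kernel's kvfree() tells the two apart by checking whether the address lies in the vmalloc range.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Which toy backend served an allocation. */
enum toy_backend { TOY_SLAB, TOY_VMALLOC };

/* Header placed before the payload; the union keeps the payload aligned. */
union toy_hdr {
	enum toy_backend backend;
	max_align_t align;
};

#define TOY_SLAB_MAX 4096	/* assumed cut-over size for this model */

/* Models kvzalloc(): zeroed memory from one of two backends. */
static void *toy_kvzalloc(size_t size)
{
	union toy_hdr *h = calloc(1, sizeof(*h) + size);

	if (!h)
		return NULL;
	h->backend = size <= TOY_SLAB_MAX ? TOY_SLAB : TOY_VMALLOC;
	return h + 1;
}

static enum toy_backend toy_backend_of(const void *p)
{
	return ((const union toy_hdr *)p - 1)->backend;
}

/*
 * Models kfree(): only valid for slab-backed memory. The model returns
 * false and frees nothing on a mismatch, where the kernel would fault.
 */
static bool toy_kfree(void *p)
{
	if (toy_backend_of(p) != TOY_SLAB)
		return false;
	free((union toy_hdr *)p - 1);
	return true;
}

/* Models kvfree(): correct for either backend. */
static void toy_kvfree(void *p)
{
	if (p)
		free((union toy_hdr *)p - 1);
}
```

A small buffer can be released with either routine in this model, but a large one is only handled by `toy_kvfree()`, which mirrors why a large snapshot table only fails at umount.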
    • bcachefs: bios must be 512 byte aligned · 00fff4dd
      Kent Overstreet authored
      Fixes: 023f9ac9 ("bcachefs: Delete dio read alignment check")
      Reported-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      00fff4dd
    • bcachefs: remove redundant variable tmp · aead3428
      Colin Ian King authored
      The variable tmp is being assigned a value but it isn't being
      read afterwards. The assignment is redundant and so tmp can be
      removed.
      
      Cleans up a clang scan-build warning:
      warning: Although the value stored to 'ret' is used in the enclosing
      expression, the value is never actually read from 'ret'
      [deadcode.DeadStores]
      Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      aead3428
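The kind of dead store this warning describes can be reproduced in a few lines. The functions below are hypothetical and unrelated to the bcachefs code. They only show an assignment whose value feeds the enclosing expression but is never read back from the variable, and the equivalent code once the variable is dropped.

```c
#include <stddef.h>

/* Hypothetical helper: reports whether a value is positive. */
static int is_positive(int x)
{
	return x > 0;
}

/* Before: 'ret' is assigned inside the condition but never read again. */
static int count_positive_before(const int *v, size_t n)
{
	int ret;
	int hits = 0;

	for (size_t i = 0; i < n; i++)
		if ((ret = is_positive(v[i])))
			hits++;
	return hits;
}

/* After: the result is tested directly, so no variable is needed. */
static int count_positive_after(const int *v, size_t n)
{
	int hits = 0;

	for (size_t i = 0; i < n; i++)
		if (is_positive(v[i]))
			hits++;
	return hits;
}
```

Both versions return the same count; the cleanup removes only the unused store.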
    • Kent Overstreet's avatar