1. 31 Jul, 2024 17 commits
    • perf annotate: Add parse function for memory instructions in powerpc · 1acdad68
      Athira Rajeev authored
      Use the raw instruction code and macros to identify memory instructions,
      extract register fields and also offset.
      
      The implementation addresses the D-form, X-form, DS-form instructions.
      Two main functions are added.
      
      New parse function "load_store__parse" as instruction ops parser for
      memory instructions.
      
      Unlike other parsers (like mov__parse), this one fills in only the
      "multi_regs" field for source/target and the newly added "mem_ref"
      field. No other fields are set because there is no need to parse the
      disassembled code here: arch-specific macros extract the offset and
      registers, which is easier and more precise.
      
      In powerpc, all instructions with a primary opcode from 32 to 63
      are memory instructions. Update "ins__find" function to have "raw_insn"
      also as a parameter.
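
As a rough illustration of the opcode check described above (a minimal sketch, not the code from the patch; the helper names `ppc_primary_opcode` and `ppc_is_mem_insn` are hypothetical), the primary opcode is the top 6 bits of the 32-bit instruction word:

```c
#include <stdbool.h>
#include <stdint.h>

/* Primary opcode: the 6 most significant bits (bits 0..5 in Power ISA numbering). */
static inline uint32_t ppc_primary_opcode(uint32_t raw_insn)
{
	return (raw_insn >> 26) & 0x3f;
}

/* Per the commit message, primary opcodes 32..63 are treated as memory instructions. */
static inline bool ppc_is_mem_insn(uint32_t raw_insn)
{
	uint32_t op = ppc_primary_opcode(raw_insn);

	return op >= 32 && op <= 63;
}
```

For instance, the word 0x81290008 has primary opcode 32, so this check would classify it as a memory instruction, while 0x38f60001 (primary opcode 14) would not qualify.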
      Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
      Reviewed-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Tested-by: Kajol Jain <kjain@linux.ibm.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Akanksha J N <akanksha@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Disha Goel <disgoel@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Link: https://lore.kernel.org/lkml/20240718084358.72242-8-atrajeev@linux.vnet.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf annotate: Update parameters for reg extract functions to use raw instruction on powerpc · 1b4406d2
      Athira Rajeev authored
      Use the raw instruction code and macros to identify memory instructions,
      extract register fields and also offset.
      
      The implementation addresses the D-form, X-form, DS-form instructions.
      
      Adds "mem_ref" field to check whether source/target has memory
      reference.
      
      Add function "get_powerpc_regs", which sets these fields: reg1, reg2
      and offset, depending on whether it is a source or target operand.

      Update the "parse" callback of "struct ins_ops" to also pass "struct
      disasm_line" as an argument. This is needed in parse functions where the
      opcode is used to determine whether to set multi_regs and other fields.
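
To make the register/offset extraction concrete, here is a minimal sketch for the D-form case only. It is not the patch's "get_powerpc_regs"; the struct and function names are hypothetical, and the field positions follow the usual D-form layout (RT/RS in bits 6..10, RA in bits 11..15, a signed 16-bit displacement in bits 16..31):

```c
#include <stdint.h>

/* Hypothetical container for the decoded operands; not the perf structure. */
struct dform_operands {
	int reg1;    /* RT/RS field */
	int reg2;    /* RA field */
	int offset;  /* sign-extended 16-bit displacement */
};

static inline struct dform_operands decode_dform(uint32_t raw_insn)
{
	struct dform_operands ops;
	int d = (int)(raw_insn & 0xffffu);

	ops.reg1 = (int)((raw_insn >> 21) & 0x1f);
	ops.reg2 = (int)((raw_insn >> 16) & 0x1f);
	/* Sign-extend the 16-bit value without relying on narrowing casts. */
	ops.offset = (d ^ 0x8000) - 0x8000;
	return ops;
}
```

With this sketch, the word 0x81290008 decodes to reg1 = 9, reg2 = 9, offset = 8, which corresponds to "lwz r9,8(r9)".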
      Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
      Reviewed-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Tested-by: Kajol Jain <kjain@linux.ibm.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Akanksha J N <akanksha@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Disha Goel <disgoel@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Link: https://lore.kernel.org/lkml/20240718084358.72242-7-atrajeev@linux.vnet.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf annotate: Add support to capture and parse raw instruction in powerpc using dso__data_read_offset utility · 0b971e6b
      Athira Rajeev authored
      
      Add support to capture and parse raw instruction in powerpc.
      Currently, the perf tool infrastructure uses two ways to disassemble
      and understand the instruction. One is objdump and other option is
      via libcapstone.
      
      Currently, the perf tool infrastructure uses the "--no-show-raw-insn"
      option with "objdump" when disassembling. An example from powerpc with
      this option for an instruction address is:
      
      Snippet from:
      
        objdump  --start-address=<address> --stop-address=<address>  -d --no-show-raw-insn -C <vmlinux>
      
        c0000000010224b4:	lwz     r10,0(r9)
      
      This line "lwz r10,0(r9)" is parsed to extract instruction name,
      registers names and offset. Also to find whether there is a memory
      reference in the operands, "memory_ref_char" field of objdump is used.
      For x86, "(" is used as memory_ref_char to tackle instructions of the
      form "mov  (%rax), %rcx".
      
      On powerpc, instructions using "(" are not the only memory
      instructions. For example, the above instruction can also be of the
      extended form (X form), "lwzx r10,0,r19". In order to easily identify
      the instruction category and extract the source/target registers, this
      patch adds support for using the raw instruction on powerpc. The
      approach is to read the raw instruction directly from the DSO file
      using the "dso__data_read_offset" utility, which is already implemented
      in the perf infrastructure in "util/dso.c".
      
      Example:
      
      38 01 81 e8     ld      r4,312(r1)
      
      Here "38 01 81 e8" is the raw instruction representation. In powerpc,
      this translates to instruction form: "ld RT,DS(RA)" and binary code
      as:
      
         | 58 |  RT  |  RA |      DS       | |
         -------------------------------------
         0    6     11    16              30 31
      
      Function "symbol__disassemble_dso" is updated to read raw instruction
      directly from DSO using dso__data_read_offset utility. In case of
      above example, this captures:
      line:    38 01 81 e8
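
The mapping from the captured bytes to the "ld RT,DS(RA)" fields can be sketched as follows. This is only an illustration under stated assumptions: the four bytes are taken to be stored least-significant first (as the "38 01 81 e8" dump for "ld r4,312(r1)" suggests), and all helper names are hypothetical rather than the macros added by the patch.

```c
#include <stdint.h>

/* Assemble a 32-bit instruction word from 4 bytes stored least-significant first. */
static inline uint32_t insn_from_le_bytes(const uint8_t b[4])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* DS-form fields, following the bit layout shown above (bit 0 is the MSB). */
static inline unsigned int ds_opcode(uint32_t insn) { return insn >> 26; }
static inline unsigned int ds_rt(uint32_t insn) { return (insn >> 21) & 0x1f; }
static inline unsigned int ds_ra(uint32_t insn) { return (insn >> 16) & 0x1f; }

/* DS occupies bits 16..29; the byte displacement is DS << 2, sign-extended. */
static inline int ds_offset(uint32_t insn)
{
	int v = (int)(insn & 0xfffcu);

	return (v ^ 0x8000) - 0x8000;
}
```

Applied to the bytes 38 01 81 e8, this gives the word 0xe8810138 with opcode 58, RT = 4, RA = 1 and a displacement of 312, matching "ld r4,312(r1)".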
      
      The above works well when 'perf report' is invoked with only the sort
      keys for data type, i.e. type and typeoff, because no instruction-level
      annotation is needed if only data type information is requested.

      For annotating a sample, the "sym" sort key is also needed along with
      the type and typeoff sort keys. And by default, invoking just "perf
      report" uses the sort key "sym", which displays the symbol information.

      With the powerpc approach that first reads the DSO for the raw
      instruction, "perf annotate" and "perf report" + a key break, since
      no instruction-level disassembly is done.
      
      Snippet of result from 'perf report':
      
        Samples: 1K of event 'mem-loads', 4000 Hz, Event count (approx.): 937238
        do_work  /usr/bin/pmlogger [Percent: local period]
        Percent│        ea230010
               │        3a550010
               │        3a600000
      
               │        38f60001
               │        39490008
               │        42400438
         51.44 │        81290008
               │        7d485378
      
      Here, raw instruction is displayed in the output instead of human
      readable annotated form.
      
      One way to get the appropriate data is to specify "--objdump path", with
      which code annotation will be done. But that changes the default
      behaviour. To fix this breakage, check if the "sym" sort key is set. If
      so, fall back to the libcapstone/objdump way of disassembling the sample.
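
The decision described here could look roughly like the sketch below. The helper name `raw_insn_only` is hypothetical, treating a NULL sort order as "defaults, which include sym" is an assumption, and the plain substring search is a simplification of real sort key parsing:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true when only the raw instruction is needed, i.e. when explicit
 * sort keys were given and none of them mentions "sym".
 */
static inline bool raw_insn_only(const char *sort_order)
{
	/* Assumption: NULL means no explicit keys, so the default ("sym") applies. */
	if (sort_order == NULL)
		return false;

	return strstr(sort_order, "sym") == NULL;
}
```

Under these assumptions, "type,typeoff" would take the raw instruction path, while "type,typeoff,sym" would fall back to the libcapstone/objdump disassembly.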
      
      With the changes, "perf report" shows:
      
      Samples: 1K of event 'mem-loads', 4000 Hz, Event count (approx.): 937238
      do_work  /usr/bin/pmlogger [Percent: local period]
      Percent│        ld        r17,16(r3)
             │        addi      r18,r21,16
             │        li        r19,0
      
             │ 8b0:   rldicl    r10,r10,63,33
             │        addi      r10,r10,1
             │        mtctr     r10
             │      ↓ b         8e4
             │ 8c0:   addi      r7,r22,1
             │        addi      r10,r9,8
             │      ↓ bdz       d00
       51.44 │        lwz       r9,8(r9)
             │        mr        r8,r10
             │        cmpw      r20,r9
      
      Committer notes:
      
      Just add the extern for 'sort_order' in disasm.c so that we don't end up
      breaking the build due to this type collision with capstone and libbpf:
      
        In file included from /usr/include/capstone/capstone.h:325,
                         from /git/perf-6.10.0/tools/perf/util/print_insn.h:23,
                         from builtin-script.c:38:
        /usr/include/capstone/bpf.h:94:14: error: 'bpf_insn' defined as wrong kind of tag
           94 | typedef enum bpf_insn {
      
      I reported this to the bpf mailing list, see one of the links below.
      Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
      Reviewed-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Tested-by: Kajol Jain <kjain@linux.ibm.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Akanksha J N <akanksha@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Disha Goel <disgoel@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Link: https://lore.kernel.org/lkml/20240718084358.72242-6-atrajeev@linux.vnet.ibm.com
      Link: https://lore.kernel.org/bpf/ZqOltPk9VQGgJZAA@x1/T/#u
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf annotate: Add disasm_line__parse() to parse raw instruction for powerpc · 06dd4c5a
      Athira Rajeev authored
      Currently, the perf tool infrastructure uses the disasm_line__parse
      function to parse disassembled line.
      
      Example snippet from objdump:
      
        objdump  --start-address=<address> --stop-address=<address>  -d --no-show-raw-insn -C <vmlinux>
      
        c0000000010224b4:	lwz     r10,0(r9)
      
      This line "lwz r10,0(r9)" is parsed to extract instruction name,
      registers names and offset.
      
      In powerpc, the approach for data type profiling uses raw instruction
      instead of result from objdump to identify the instruction category and
      extract the source/target registers.
      
      Example: 38 01 81 e8     ld      r4,312(r1)
      
      Here "38 01 81 e8" is the raw instruction representation. Add function
      "disasm_line__parse_powerpc" to handle parsing of the raw instruction.
      Also update "struct disasm_line" to save the binary code.
      With the change, the function captures:
      
      line -> "38 01 81 e8     ld      r4,312(r1)"
      raw instruction "38 01 81 e8"
      
      The raw instruction is used later to extract the reg/offset fields.
      Macros are added to extract the opcode and register fields. "struct
      disasm_line" is updated to carry a union of "bytes" and a 32-bit
      "raw_insn" to hold the raw code.

      Function "disasm_line__parse_powerpc" fills in the raw instruction hex
      value and can use the macros to get the opcode. There are no changes in
      the existing code paths that parse the disassembled code. The size of
      the raw instruction depends on the architecture.
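
A minimal sketch of capturing the raw bytes from such a line is shown below. It is not the patch's "disasm_line__parse_powerpc": the union and function names are hypothetical, and using sscanf() is just one simple way to read the four leading hex bytes.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the raw-code member added to "struct disasm_line". */
union raw_code {
	uint8_t  bytes[4];
	uint32_t raw_insn;  /* same storage viewed as one 32-bit word */
};

/*
 * Parse the four leading hex bytes of a line such as
 * "38 01 81 e8     ld      r4,312(r1)". Returns 0 on success, -1 otherwise.
 */
static inline int parse_raw_bytes(const char *line, union raw_code *raw)
{
	unsigned int b[4];

	if (sscanf(line, "%2x %2x %2x %2x", &b[0], &b[1], &b[2], &b[3]) != 4)
		return -1;

	for (int i = 0; i < 4; i++)
		raw->bytes[i] = (uint8_t)b[i];
	return 0;
}
```

Note that the value seen through `raw_insn` depends on host byte order, so a real implementation would normalize it before applying opcode and register macros.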
      
      On powerpc, parsing the disasm line needs to handle both reading the
      binary code directly from the DSO and parsing the objdump result. Hence
      the logic is added in a separate function instead of updating
      "disasm_line__parse". Architectures that use the instruction name and
      the present approach are not altered. Since this approach targets
      powerpc, the macro implementation is added only for powerpc for now.

      Since disasm_line__parse is used in other cases (perf annotate) and
      not only data type profiling, the powerpc callback includes changes to
      work with binary code as well as the mnemonic representation.

      Also, if the DSO read fails and libcapstone is not supported, the
      approach falls back to objdump. Hence the patch includes changes to
      ensure the objdump option also works well.
      Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
      Reviewed-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Tested-by: Kajol Jain <kjain@linux.ibm.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Akanksha J N <akanksha@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Disha Goel <disgoel@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Link: https://lore.kernel.org/lkml/20240718084358.72242-5-atrajeev@linux.vnet.ibm.com
      [ Add check for strndup() result ]
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf annotate: Update TYPE_STATE_MAX_REGS to include max of regs in powerpc · b1d8d968
      Athira Rajeev authored
      TYPE_STATE_MAX_REGS is arch-dependent. Currently this is defined to be
      16.
      
      While checking if reg is valid using has_reg_type, max value is checked
      using TYPE_STATE_MAX_REGS value.
      
      Define this conditionally for powerpc.
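
A conditional definition of this kind could be sketched as below. The value 32 for powerpc is an assumption based on its 32 general purpose registers (the commit message does not state the number), and `reg_in_range` is a hypothetical helper, not perf's has_reg_type:

```c
#include <stdbool.h>

#if defined(__powerpc__)
#define TYPE_STATE_MAX_REGS	32	/* assumption: one slot per powerpc GPR */
#else
#define TYPE_STATE_MAX_REGS	16	/* value stated in the commit message */
#endif

/* Range check of the kind a register validity test needs before indexing. */
static inline bool reg_in_range(int reg)
{
	return reg >= 0 && reg < TYPE_STATE_MAX_REGS;
}
```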
      Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
      Reviewed-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Tested-by: Kajol Jain <kjain@linux.ibm.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Akanksha J N <akanksha@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Disha Goel <disgoel@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Link: https://lore.kernel.org/lkml/20240718084358.72242-4-atrajeev@linux.vnet.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf annotate: Add "update_insn_state" callback function to handle arch specific instruction tracking · 782959ac
      Athira Rajeev authored
      
      Add "update_insn_state" callback to "struct arch" to handle instruction
      tracking. Currently updating instruction state is handled by static
      function "update_insn_state_x86" which is defined in "annotate-data.c".
      
      Make this a callback for a specific arch and move it to the
      arch-specific file "arch/x86/annotate/instructions.c". This will help
      to add helper functions for other platforms in the file
      "arch/<platform>/annotate/instructions.c" and make changes/updates
      easier.
      
      Define callback "update_insn_state" as part of "struct arch", and also
      make some of the debug functions non-static so that they can be
      referenced from other places.
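
The callback pattern can be illustrated with the simplified sketch below. The struct layouts, the callback signature and every name ending in `_sketch` are hypothetical; perf's real "struct arch" and "update_insn_state" take different arguments.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced tracking state. */
struct insn_state_sketch {
	int updates;
};

/* Reduced stand-in for "struct arch" carrying an optional per-arch callback. */
struct arch_sketch {
	const char *name;
	void (*update_insn_state)(struct insn_state_sketch *state,
				  unsigned int raw_insn);
};

/* An arch-specific implementation, as it might live in an arch file. */
static inline void update_insn_state_x86_sketch(struct insn_state_sketch *state,
						unsigned int raw_insn)
{
	(void)raw_insn;
	state->updates++;
}

/* Generic code only calls the callback when the arch provides one. */
static inline void track_insn(const struct arch_sketch *arch,
			      struct insn_state_sketch *state,
			      unsigned int raw_insn)
{
	if (arch->update_insn_state)
		arch->update_insn_state(state, raw_insn);
}
```

The point of the design is that the generic tracking loop stays arch-neutral, and an architecture without instruction tracking simply leaves the pointer NULL.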
      Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
      Reviewed-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Tested-by: Kajol Jain <kjain@linux.ibm.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Akanksha J N <akanksha@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Disha Goel <disgoel@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Link: https://lore.kernel.org/lkml/20240718084358.72242-3-atrajeev@linux.vnet.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf annotate: Move the data structures related to register type to header file · 1d303dee
      Athira Rajeev authored
      Data type profiling uses instruction tracking by checking each
      instruction and updating the register type state in some data
      structures.
      
      This is useful to find the data type in cases when the register state
      gets transferred from one reg to another.
      
      For example, the "mov" instruction on x86 and the "mr" instruction on powerpc.
      
      Currently these structures are defined in annotate-data.c and
      instruction tracking is implemented only for x86.
      
      Move these data structures to "annotate-data.h" header file so that
      other arch implementations can use it in arch specific files as well.
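
How a register move transfers type state can be sketched as follows. The structures here are simplified, hypothetical stand-ins (names ending in `_sketch` or prefixed `sketch_`), not the ones moved to "annotate-data.h":

```c
#include <stdbool.h>

#define SKETCH_MAX_REGS 32

struct sketch_reg_state {
	bool ok;       /* true when the register holds a known type */
	int type_id;   /* hypothetical type identifier */
};

struct sketch_type_state {
	struct sketch_reg_state regs[SKETCH_MAX_REGS];
};

static inline bool sketch_has_reg_type(const struct sketch_type_state *s, int reg)
{
	return reg >= 0 && reg < SKETCH_MAX_REGS && s->regs[reg].ok;
}

/* Model a register-to-register move ("mov" on x86, "mr" on powerpc). */
static inline void sketch_move_reg(struct sketch_type_state *s, int dst, int src)
{
	if (dst < 0 || dst >= SKETCH_MAX_REGS)
		return;

	if (sketch_has_reg_type(s, src))
		s->regs[dst] = s->regs[src];   /* the type follows the value */
	else
		s->regs[dst].ok = false;       /* unknown source invalidates the target */
}
```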
      Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
      Reviewed-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Tested-by: Kajol Jain <kjain@linux.ibm.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Akanksha J N <akanksha@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Disha Goel <disgoel@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Link: https://lore.kernel.org/lkml/20240718084358.72242-2-atrajeev@linux.vnet.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf test: Avoid python leak sanitizer test failures · e293f4b1
      Ian Rogers authored
      Leak sanitizer will report memory leaks from python and the leak
      sanitizer output causes tests to fail. For example:
      
        ```
        $ perf test 98 -v
         98: perf script tests:
        --- start ---
        test child forked, pid 1272962
        DB test
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.046 MB /tmp/perf-test-script.x0EktdCel8/perf.data (8 samples) ]
        call_path_table((1, 0, 0, 0)
        call_path_table((2, 1, 0, 140339508617447)
        call_path_table((3, 2, 2, 0)
        call_path_table((4, 3, 3, 0)
        call_path_table((5, 4, 4, 0)
        call_path_table((6, 5, 5, 0)
        call_path_table((7, 6, 6, 0)
        call_path_table((8, 7, 7, 0)
        call_path_table((9, 8, 8, 0)
        call_path_table((10, 9, 9, 0)
        call_path_table((11, 10, 10, 0)
        call_path_table((12, 11, 11, 0)
        call_path_table((13, 12, 1, 0)
        sample_table((1, 1, 1, 1, 1, 1, 1, 8, -2058824120, 588306954119000, -1, 0, 0, 0, 0, 1, 0, 0, 128933429281, 0, 0, 13, 0, 0, 0, -1, -1))
        sample_table((2, 1, 1, 1, 1, 1, 1, 8, -2058824120, 588306954137053, -1, 0, 0, 0, 0, 1, 0, 0, 128933429281, 0, 0, 13, 0, 0, 0, -1, -1))
        sample_table((3, 1, 1, 1, 1, 1, 1, 8, -2058824120, 588306954140089, -1, 0, 0, 0, 0, 9, 0, 0, 128933429281, 0, 0, 13, 0, 0, 0, -1, -1))
        sample_table((4, 1, 1, 1, 1, 1, 1, 8, -2058824120, 588306954142376, -1, 0, 0, 0, 0, 155, 0, 0, 128933429281, 0, 0, 13, 0, 0, 0, -1, -1))
        sample_table((5, 1, 1, 1, 1, 1, 1, 8, -2058824120, 588306954144045, -1, 0, 0, 0, 0, 2493, 0, 0, 128933429281, 0, 0, 13, 0, 0, 0, -1, -1))
        sample_table((6, 1, 1, 1, 1, 1, 12, 77, -2046828595, 588306954145722, -1, 0, 0, 0, 0, 47555, 0, 0, 128933429281, 0, 0, 13, 0, 0, 0, -1, -1))
        call_path_table((14, 9, 14, 0)
        call_path_table((15, 14, 15, 0)
        call_path_table((16, 15, 0, -1040969624)
        call_path_table((17, 16, 16, 0)
        call_path_table((18, 17, 17, 0)
        call_path_table((19, 18, 18, 0)
        call_path_table((20, 19, 19, 0)
        call_path_table((21, 20, 13, 0)
        sample_table((7, 1, 1, 1, 2, 1, 13, 46, -2053700898, 588306954157436, -1, 0, 0, 0, 0, 964078, 0, 0, 128933429281, 0, 0, 21, 0, 0, 0, -1, -1))
        call_path_table((22, 1, 21, 0)
        call_path_table((23, 22, 22, 0)
        call_path_table((24, 23, 23, 0)
        call_path_table((25, 24, 24, 0)
        call_path_table((26, 25, 25, 0)
        call_path_table((27, 26, 26, 0)
        call_path_table((28, 27, 27, 0)
        call_path_table((29, 28, 28, 0)
        call_path_table((30, 29, 29, 0)
        call_path_table((31, 30, 30, 0)
        call_path_table((32, 31, 31, 0)
        call_path_table((33, 32, 32, 0)
        call_path_table((34, 33, 33, 0)
        call_path_table((35, 34, 20, 0)
        sample_table((8, 1, 1, 1, 2, 1, 20, 49, -2046878127, 588306954378624, -1, 0, 0, 0, 0, 2534317, 0, 0, 128933429281, 0, 0, 35, 0, 0, 0, -1, -1))
      
        =================================================================
        ==1272975==ERROR: LeakSanitizer: detected memory leaks
      
        Direct leak of 13628 byte(s) in 6 object(s) allocated from:
            #0 0x56354f60c092 in malloc (/tmp/perf/perf+0x29c092)
            #1 0x7ff25c7d02e7 in _PyObject_Malloc /build/python3.11/../Objects/obmalloc.c:2003:11
            #2 0x7ff25c7d02e7 in _PyObject_Malloc /build/python3.11/../Objects/obmalloc.c:1996:1
      
        SUMMARY: AddressSanitizer: 13628 byte(s) leaked in 6 allocation(s).
        --- Cleaning up ---
        ---- end(-1) ----
         98: perf script tests                                               : FAILED!
        ```
      
      Disable leak sanitizer when running specific perf+python tests to
      avoid this. This causes the tests to pass when run with leak
      sanitizer.
      Reviewed-by: Aditya Gupta <adityag@linux.ibm.com>
      Signed-off-by: Ian Rogers <irogers@google.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Richter <tmricht@linux.ibm.com>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf trace: Remove arg_fmt->is_enum, we can get that from the BTF type · c3d74713
      Arnaldo Carvalho de Melo authored
      This is to pave the way for other BTF types, i.e. we try to find BTF
      type then use things like btf_is_enum(btf_type) that we cached to find
      the right strtoul and scnprintf routines.
      
      For now only enum is supported, all the other types simply return zero
      for scnprintf which makes it have the same behaviour as when BTF isn't
      available, i.e. fallback to no pretty printing. Ditto for strtoul.
      
        root@x1:~# perf test -v enum
        124: perf trace enum augmentation tests                              : Ok
        root@x1:~# perf test -v enum
        124: perf trace enum augmentation tests                              : Ok
        root@x1:~# perf test -v enum
        124: perf trace enum augmentation tests                              : Ok
        root@x1:~# perf test -v enum
        124: perf trace enum augmentation tests                              : Ok
        root@x1:~# perf test -v enum
        124: perf trace enum augmentation tests                              : Ok
        root@x1:~#
      Signed-off-by: Howard Chu <howardchu95@gmail.com>
      Tested-by: Howard Chu <howardchu95@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Howard Chu <howardchu95@gmail.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20240624181345.124764-9-howardchu95@gmail.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf trace: Introduce trace__btf_scnprintf() · 62284329
      Arnaldo Carvalho de Melo authored
      To have a central place that will look at the BTF type and call the
      right scnprintf routine or return zero, meaning BTF pretty printing
      isn't available or not implemented for a specific type.
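
The "central place" idea can be sketched without libbpf as below. All types and names here are hypothetical stand-ins (a real implementation would inspect a BTF type); the sketch only shows the dispatch shape: call the routine for a supported kind, otherwise return zero.

```c
#include <stddef.h>
#include <stdio.h>

enum sketch_kind { SKETCH_KIND_INT, SKETCH_KIND_ENUM };

struct sketch_enum_val {
	const char *name;
	int val;
};

struct sketch_type {
	enum sketch_kind kind;
	const struct sketch_enum_val *vals;  /* only used for SKETCH_KIND_ENUM */
	size_t nr_vals;
};

static inline size_t sketch_enum_scnprintf(const struct sketch_type *type,
					   char *bf, size_t size, int val)
{
	for (size_t i = 0; i < type->nr_vals; i++) {
		if (type->vals[i].val == val) {
			int n = snprintf(bf, size, "%s", type->vals[i].name);

			if (n < 0 || size == 0)
				return 0;
			/* Report the number of characters actually stored. */
			return (size_t)n < size ? (size_t)n : size - 1;
		}
	}
	return 0;
}

/* Central dispatch: zero means "no pretty printing available". */
static inline size_t sketch_btf_scnprintf(const struct sketch_type *type,
					  char *bf, size_t size, int val)
{
	if (type == NULL)
		return 0;   /* no type information found */

	switch (type->kind) {
	case SKETCH_KIND_ENUM:
		return sketch_enum_scnprintf(type, bf, size, val);
	default:
		return 0;   /* not implemented for this kind */
	}
}
```

A caller that receives zero would then fall back to its plain formatting, which keeps the behaviour identical to the case where no type information exists.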
      Signed-off-by: Howard Chu <howardchu95@gmail.com>
      Tested-by: Howard Chu <howardchu95@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20240624181345.124764-8-howardchu95@gmail.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf test trace_btf_enum: Add regression test for the BTF augmentation of enums in 'perf trace' · d66763fe
      Howard Chu authored
      Trace landlock_add_rule syscall to see if the output is desirable.
      
      Trace the non-syscall tracepoints 'timer:hrtimer_init' and
      'timer:hrtimer_start' to see if the 'mode' argument is augmented:
      an augmented 'mode' enum argument has the 'HRTIMER_MODE_' prefix
      in its name.
      
      Committer testing:
      
        root@x1:~# perf test enum
        124: perf trace enum augmentation tests                              : Ok
        root@x1:~# perf test -v enum
        124: perf trace enum augmentation tests                              : Ok
        root@x1:~# perf trace -e landlock_add_rule perf test -v enum
             0.000 ( 0.010 ms): perf/749827 landlock_add_rule(ruleset_fd: 11, rule_type: LANDLOCK_RULE_PATH_BENEATH, rule_attr: 0x7ffd324171d4, flags: 45) = -1 EINVAL (Invalid argument)
             0.012 ( 0.002 ms): perf/749827 landlock_add_rule(ruleset_fd: 11, rule_type: LANDLOCK_RULE_NET_PORT, rule_attr: 0x7ffd324171e0, flags: 45) = -1 EINVAL (Invalid argument)
           457.821 ( 0.007 ms): perf/749830 landlock_add_rule(ruleset_fd: 11, rule_type: LANDLOCK_RULE_PATH_BENEATH, rule_attr: 0x7ffd4acd31e4, flags: 45) = -1 EINVAL (Invalid argument)
           457.832 ( 0.003 ms): perf/749830 landlock_add_rule(ruleset_fd: 11, rule_type: LANDLOCK_RULE_NET_PORT, rule_attr: 0x7ffd4acd31f0, flags: 45) = -1 EINVAL (Invalid argument)
        124: perf trace enum augmentation tests                              : Ok
        root@x1:~#
      Suggested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
      Signed-off-by: Howard Chu <howardchu95@gmail.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/lkml/20240619082042.4173621-6-howardchu95@gmail.com
      Link: https://lore.kernel.org/r/20240624181345.124764-7-howardchu95@gmail.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf test: Add landlock workload · 3656e566
      Howard Chu authored
      We'll use it to add a regression test for the BTF augmentation of enum
      arguments for tracepoints in 'perf trace':
      
        root@x1:~# perf trace -e landlock_add_rule perf test -w landlock
             0.000 ( 0.009 ms): perf/747160 landlock_add_rule(ruleset_fd: 11, rule_type: LANDLOCK_RULE_PATH_BENEATH, rule_attr: 0x7ffd8e258594, flags: 45) = -1 EINVAL (Invalid argument)
             0.011 ( 0.002 ms): perf/747160 landlock_add_rule(ruleset_fd: 11, rule_type: LANDLOCK_RULE_NET_PORT, rule_attr: 0x7ffd8e2585a0, flags: 45) = -1 EINVAL (Invalid argument)
        root@x1:~#
      
      Committer notes:
      
      It was agreed in the discussion (see Link below) to shorten the name of
      the workload from 'landlock_add_rule' to 'landlock', and I moved it to a
      separate patch.
      
      Also, to address a build failure from Namhyung, I stopped loading
      linux/landlock.h and instead added the used defines, enums and types to
      make this build in older systems. All we want is to emit the syscall and
      intercept it.
      Suggested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
      Signed-off-by: Howard Chu <howardchu95@gmail.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/lkml/CAH0uvohaypdTV6Z7O5QSK+va_qnhZ6BP6oSJ89s1c1E0CjgxDA@mail.gmail.com
      Link: https://lore.kernel.org/r/20240624181345.124764-1-howardchu95@gmail.com
      Link: https://lore.kernel.org/r/20240624181345.124764-6-howardchu95@gmail.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf trace: Filter enum arguments with enum names · 95586588
      Howard Chu authored
      Before:
      
      perf $ ./perf trace -e timer:hrtimer_start --filter='mode!=HRTIMER_MODE_ABS_PINNED_HARD' --max-events=1
      No resolver (strtoul) for "mode" in "timer:hrtimer_start", can't set filter "(mode!=HRTIMER_MODE_ABS_PINNED_HARD) && (common_pid != 281988)"
      
      After:
      
      perf $ ./perf trace -e timer:hrtimer_start --filter='mode!=HRTIMER_MODE_ABS_PINNED_HARD' --max-events=1
           0.000 :0/0 timer:hrtimer_start(hrtimer: 0xffff9498a6ca5f18, function: 0xffffffffa77a5be0, expires: 12351248764875, softexpires: 12351248764875, mode: HRTIMER_MODE_ABS)
      
      && and ||:
      
      perf $ ./perf trace -e timer:hrtimer_start --filter='mode != HRTIMER_MODE_ABS_PINNED_HARD && mode != HRTIMER_MODE_ABS' --max-events=1
           0.000 Hyprland/534 timer:hrtimer_start(hrtimer: 0xffff9497801a84d0, function: 0xffffffffc04cdbe0, expires: 12639434638458, softexpires: 12639433638458, mode: HRTIMER_MODE_REL)
      
      perf $ ./perf trace -e timer:hrtimer_start --filter='mode == HRTIMER_MODE_REL || mode == HRTIMER_MODE_PINNED' --max-events=1
           0.000 ldlck-test/60639 timer:hrtimer_start(hrtimer: 0xffffb16404ee7bf8, function: 0xffffffffa7790420, expires: 12772614418016, softexpires: 12772614368016, mode: HRTIMER_MODE_REL)
      
      Switching it up, using both enum name and integer value (--filter='mode == HRTIMER_MODE_ABS_PINNED_HARD || mode == 0'):
      
      perf $ ./perf trace -e timer:hrtimer_start --filter='mode == HRTIMER_MODE_ABS_PINNED_HARD || mode == 0' --max-events=3
           0.000 :0/0 timer:hrtimer_start(hrtimer: 0xffff9498a6ca5f18, function: 0xffffffffa77a5be0, expires: 12601748739825, softexpires: 12601748739825, mode: HRTIMER_MODE_ABS_PINNED_HARD)
           0.036 :0/0 timer:hrtimer_start(hrtimer: 0xffff9498a6ca5f18, function: 0xffffffffa77a5be0, expires: 12518758748124, softexpires: 12518758748124, mode: HRTIMER_MODE_ABS_PINNED_HARD)
           0.172 tmux: server/41881 timer:hrtimer_start(hrtimer: 0xffffb164081e7838, function: 0xffffffffa7790420, expires: 12518768255836, softexpires: 12518768205836, mode: HRTIMER_MODE_ABS)
      
      P.S.
      perf $ pahole hrtimer_mode
      enum hrtimer_mode {
              HRTIMER_MODE_ABS             = 0,
              HRTIMER_MODE_REL             = 1,
              HRTIMER_MODE_PINNED          = 2,
              HRTIMER_MODE_SOFT            = 4,
              HRTIMER_MODE_HARD            = 8,
              HRTIMER_MODE_ABS_PINNED      = 2,
              HRTIMER_MODE_REL_PINNED      = 3,
              HRTIMER_MODE_ABS_SOFT        = 4,
              HRTIMER_MODE_REL_SOFT        = 5,
              HRTIMER_MODE_ABS_PINNED_SOFT = 6,
              HRTIMER_MODE_REL_PINNED_SOFT = 7,
              HRTIMER_MODE_ABS_HARD        = 8,
              HRTIMER_MODE_REL_HARD        = 9,
              HRTIMER_MODE_ABS_PINNED_HARD = 10,
              HRTIMER_MODE_REL_PINNED_HARD = 11,
      };
      
      Committer testing:
      
        root@x1:~# perf trace -e timer:hrtimer_start --filter='mode != HRTIMER_MODE_ABS' --max-events=2
             0.000 :0/0 timer:hrtimer_start(hrtimer: 0xffff8d4eff2a5050, function: 0xffffffff9e22ddd0, expires: 241502326000000, softexpires: 241502326000000, mode: HRTIMER_MODE_ABS_PINNED_HARD)
        18446744073709.488 :0/0 timer:hrtimer_start(hrtimer: 0xffff8d4eff425050, function: 0xffffffff9e22ddd0, expires: 241501814000000, softexpires: 241501814000000, mode: HRTIMER_MODE_ABS_PINNED_HARD)
        root@x1:~# perf trace -e timer:hrtimer_start --filter='mode != HRTIMER_MODE_ABS && mode != HRTIMER_MODE_ABS_PINNED_HARD' --max-events=2
             0.000 podman/510644 timer:hrtimer_start(hrtimer: 0xffffa2024f5f7dd0, function: 0xffffffff9e2170c0, expires: 241530497418194, softexpires: 241530497368194, mode: HRTIMER_MODE_REL)
            40.251 gnome-shell/2484 timer:hrtimer_start(hrtimer: 0xffff8d48bda17650, function: 0xffffffffc0661550, expires: 241550528619247, softexpires: 241550527619247, mode: HRTIMER_MODE_REL)
        root@x1:~# perf trace -v -e timer:hrtimer_start --filter='mode != HRTIMER_MODE_ABS && mode != HRTIMER_MODE_ABS_PINNED_HARD && mode != HRTIMER_MODE_REL' --max-events=2
        Using CPUID GenuineIntel-6-BA-3
        vmlinux BTF loaded
        <SNIP>
        0
        0xa
        0x1
        New filter for timer:hrtimer_start: (mode != 0 && mode != 0xa && mode != 0x1) && (common_pid != 524049 && common_pid != 4041)
        mmap size 528384B
        ^Croot@x1:~#
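The verbose output above shows each enum name being replaced by its integer value before the filter is installed. A minimal sketch of that substitution step follows, assuming a hard-coded table in place of the vmlinux BTF lookup that perf performs. `resolve_enum_name()` and `hrtimer_modes` are hypothetical names used only for illustration, and the table copies just a few entries from the pahole output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for BTF enum data; values copied from the pahole output. */
struct enum_entry {
	const char *name;
	long long value;
};

static const struct enum_entry hrtimer_modes[] = {
	{ "HRTIMER_MODE_ABS",             0 },
	{ "HRTIMER_MODE_REL",             1 },
	{ "HRTIMER_MODE_PINNED",          2 },
	{ "HRTIMER_MODE_ABS_PINNED_HARD", 10 },
};

/*
 * Look up an enum name and store its integer value in *out.
 * Returns 0 on success, -1 if the name is not in the table.
 */
int resolve_enum_name(const char *name, long long *out)
{
	size_t i;

	assert(name != NULL && out != NULL);

	for (i = 0; i < sizeof(hrtimer_modes) / sizeof(hrtimer_modes[0]); i++) {
		if (strcmp(hrtimer_modes[i].name, name) == 0) {
			*out = hrtimer_modes[i].value;
			return 0;
		}
	}
	return -1;
}
```

With this table, `mode != HRTIMER_MODE_ABS_PINNED_HARD` would be rewritten to `mode != 10`, which corresponds to the `0xa` printed in the verbose run.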
      Suggested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
      Signed-off-by: Howard Chu <howardchu95@gmail.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/lkml/ZnCcliuecJABD5FN@x1
      Link: https://lore.kernel.org/r/20240624181345.124764-5-howardchu95@gmail.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      95586588
    • Howard Chu's avatar
      perf trace: Augment non-syscall tracepoints with enum arguments with BTF · 607bbdb4
      Howard Chu authored
      Before:
      
      perf $ ./perf trace -e timer:hrtimer_start --max-events=1
           0.000 :0/0 timer:hrtimer_start(hrtimer: 0xffff974466c25f18, function: 0xffffffff89da5be0, expires: 377432432256753, softexpires: 377432432256753, mode: 10)
      
      After:
      
      perf $ ./perf trace -e timer:hrtimer_start --max-events=1
           0.000 :0/0 timer:hrtimer_start(hrtimer: 0xffff9498a6ca5f18, function: 0xffffffffa77a5be0, expires: 4382442895089, softexpires: 4382442895089, mode: HRTIMER_MODE_ABS_PINNED_HARD)
      
      in which HRTIMER_MODE_ABS_PINNED_HARD is:
      
      perf $ pahole hrtimer_mode
      enum hrtimer_mode {
              HRTIMER_MODE_ABS             = 0,
              HRTIMER_MODE_REL             = 1,
              HRTIMER_MODE_PINNED          = 2,
              HRTIMER_MODE_SOFT            = 4,
              HRTIMER_MODE_HARD            = 8,
              HRTIMER_MODE_ABS_PINNED      = 2,
              HRTIMER_MODE_REL_PINNED      = 3,
              HRTIMER_MODE_ABS_SOFT        = 4,
              HRTIMER_MODE_REL_SOFT        = 5,
              HRTIMER_MODE_ABS_PINNED_SOFT = 6,
              HRTIMER_MODE_REL_PINNED_SOFT = 7,
              HRTIMER_MODE_ABS_HARD        = 8,
              HRTIMER_MODE_REL_HARD        = 9,
              HRTIMER_MODE_ABS_PINNED_HARD = 10,
              HRTIMER_MODE_REL_PINNED_HARD = 11,
      };
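The step from the raw value 10 to HRTIMER_MODE_ABS_PINNED_HARD can be sketched as a reverse lookup. This is an illustration only: perf reads the names from vmlinux BTF, whereas the table and the helpers `hrtimer_mode_name()` and `format_mode()` below are hypothetical and hard-code part of the pahole output. Several names share one value (2, for example), so the sketch simply returns the first match, and it falls back to the raw number for unknown values, as in the "Before" output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical table; a subset of the enum shown by pahole. */
struct mode_name {
	int value;
	const char *name;
};

static const struct mode_name mode_names[] = {
	{ 0,  "HRTIMER_MODE_ABS" },
	{ 1,  "HRTIMER_MODE_REL" },
	{ 2,  "HRTIMER_MODE_PINNED" },
	{ 2,  "HRTIMER_MODE_ABS_PINNED" },
	{ 10, "HRTIMER_MODE_ABS_PINNED_HARD" },
	{ 11, "HRTIMER_MODE_REL_PINNED_HARD" },
};

/* Return the first name whose value matches, or NULL if none does. */
const char *hrtimer_mode_name(int value)
{
	size_t i;

	for (i = 0; i < sizeof(mode_names) / sizeof(mode_names[0]); i++) {
		if (mode_names[i].value == value)
			return mode_names[i].name;
	}
	return NULL;
}

/* Write "mode: NAME" into buf, or "mode: N" when the value has no name. */
int format_mode(int value, char *buf, size_t len)
{
	const char *name = hrtimer_mode_name(value);

	assert(buf != NULL && len > 0);

	if (name != NULL)
		return snprintf(buf, len, "mode: %s", name);
	return snprintf(buf, len, "mode: %d", value);
}
```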
      
      Can also be tested by
      
      ./perf trace -e pagemap:mm_lru_insertion,timer:hrtimer_start,timer:hrtimer_init,skb:kfree_skb --max-events=10
      
      (Chose these 4 events because they happen quite frequently.)
      
      However some enum arguments may not be contained in vmlinux BTF. To see
      what enum arguments are supported, use:
      
      vmlinux_dir $ bpftool btf dump file /sys/kernel/btf/vmlinux > vmlinux
      
      vmlinux_dir $  while read l; do grep "ENUM '$l'" vmlinux; done < <(grep field:enum /sys/kernel/tracing/events/*/*/format | awk '{print $3}' | sort | uniq) | awk '{print $3}' | sed "s/'\(.*\)'/\1/g"
      dev_pm_qos_req_type
      error_detector
      hrtimer_mode
      i2c_slave_event
      ieee80211_bss_type
      lru_list
      migrate_mode
      nl80211_auth_type
      nl80211_band
      nl80211_iftype
      numa_vmaskip_reason
      pm_qos_req_action
      pwm_polarity
      skb_drop_reason
      thermal_trip_type
      xen_lazy_mode
      xen_mc_extend_args
      xen_mc_flush_reason
      zone_type
      
      And what tracepoints have these enum types as their arguments:
      
      vmlinux_dir $ while read l; do grep "ENUM '$l'" vmlinux; done < <(grep field:enum /sys/kernel/tracing/events/*/*/format | awk '{print $3}' | sort | uniq) | awk '{print $3}' | sed "s/'\(.*\)'/\1/g" > good_enums
      
      vmlinux_dir $ cat good_enums
      dev_pm_qos_req_type
      error_detector
      hrtimer_mode
      i2c_slave_event
      ieee80211_bss_type
      lru_list
      migrate_mode
      nl80211_auth_type
      nl80211_band
      nl80211_iftype
      numa_vmaskip_reason
      pm_qos_req_action
      pwm_polarity
      skb_drop_reason
      thermal_trip_type
      xen_lazy_mode
      xen_mc_extend_args
      xen_mc_flush_reason
      zone_type
      
      vmlinux_dir $ grep -f good_enums -l /sys/kernel/tracing/events/*/*/format
      /sys/kernel/tracing/events/cfg80211/cfg80211_chandef_dfs_required/format
      /sys/kernel/tracing/events/cfg80211/cfg80211_ch_switch_notify/format
      /sys/kernel/tracing/events/cfg80211/cfg80211_ch_switch_started_notify/format
      /sys/kernel/tracing/events/cfg80211/cfg80211_get_bss/format
      /sys/kernel/tracing/events/cfg80211/cfg80211_ibss_joined/format
      /sys/kernel/tracing/events/cfg80211/cfg80211_inform_bss_frame/format
      /sys/kernel/tracing/events/cfg80211/cfg80211_radar_event/format
      /sys/kernel/tracing/events/cfg80211/cfg80211_ready_on_channel_expired/format
      /sys/kernel/tracing/events/cfg80211/cfg80211_ready_on_channel/format
      /sys/kernel/tracing/events/cfg80211/cfg80211_reg_can_beacon/format
      /sys/kernel/tracing/events/cfg80211/cfg80211_return_bss/format
      /sys/kernel/tracing/events/cfg80211/cfg80211_tx_mgmt_expired/format
      /sys/kernel/tracing/events/cfg80211/rdev_add_virtual_intf/format
      /sys/kernel/tracing/events/cfg80211/rdev_auth/format
      /sys/kernel/tracing/events/cfg80211/rdev_change_virtual_intf/format
      /sys/kernel/tracing/events/cfg80211/rdev_channel_switch/format
      /sys/kernel/tracing/events/cfg80211/rdev_connect/format
      /sys/kernel/tracing/events/cfg80211/rdev_inform_bss/format
      /sys/kernel/tracing/events/cfg80211/rdev_libertas_set_mesh_channel/format
      /sys/kernel/tracing/events/cfg80211/rdev_mgmt_tx/format
      /sys/kernel/tracing/events/cfg80211/rdev_remain_on_channel/format
      /sys/kernel/tracing/events/cfg80211/rdev_return_chandef/format
      /sys/kernel/tracing/events/cfg80211/rdev_return_int_survey_info/format
      /sys/kernel/tracing/events/cfg80211/rdev_set_ap_chanwidth/format
      /sys/kernel/tracing/events/cfg80211/rdev_set_monitor_channel/format
      /sys/kernel/tracing/events/cfg80211/rdev_set_radar_background/format
      /sys/kernel/tracing/events/cfg80211/rdev_start_ap/format
      /sys/kernel/tracing/events/cfg80211/rdev_start_radar_detection/format
      /sys/kernel/tracing/events/cfg80211/rdev_tdls_channel_switch/format
      /sys/kernel/tracing/events/compaction/mm_compaction_defer_compaction/format
      /sys/kernel/tracing/events/compaction/mm_compaction_deferred/format
      /sys/kernel/tracing/events/compaction/mm_compaction_defer_reset/format
      /sys/kernel/tracing/events/compaction/mm_compaction_finished/format
      /sys/kernel/tracing/events/compaction/mm_compaction_kcompactd_wake/format
      /sys/kernel/tracing/events/compaction/mm_compaction_suitable/format
      /sys/kernel/tracing/events/compaction/mm_compaction_wakeup_kcompactd/format
      /sys/kernel/tracing/events/error_report/error_report_end/format
      /sys/kernel/tracing/events/i2c_slave/i2c_slave/format
      /sys/kernel/tracing/events/migrate/mm_migrate_pages/format
      /sys/kernel/tracing/events/migrate/mm_migrate_pages_start/format
      /sys/kernel/tracing/events/pagemap/mm_lru_insertion/format
      /sys/kernel/tracing/events/power/dev_pm_qos_add_request/format
      /sys/kernel/tracing/events/power/dev_pm_qos_remove_request/format
      /sys/kernel/tracing/events/power/dev_pm_qos_update_request/format
      /sys/kernel/tracing/events/power/pm_qos_update_flags/format
      /sys/kernel/tracing/events/power/pm_qos_update_target/format
      /sys/kernel/tracing/events/pwm/pwm_apply/format
      /sys/kernel/tracing/events/pwm/pwm_get/format
      /sys/kernel/tracing/events/sched/sched_skip_vma_numa/format
      /sys/kernel/tracing/events/skb/kfree_skb/format
      /sys/kernel/tracing/events/thermal/thermal_zone_trip/format
      /sys/kernel/tracing/events/timer/hrtimer_init/format
      /sys/kernel/tracing/events/timer/hrtimer_start/format
      /sys/kernel/tracing/events/xen/xen_mc_batch/format
      /sys/kernel/tracing/events/xen/xen_mc_extend_args/format
      /sys/kernel/tracing/events/xen/xen_mc_flush_reason/format
      /sys/kernel/tracing/events/xen/xen_mc_issue/format
      
      Committer testing:
      
        root@x1:~# perf trace -e timer:hrtimer_start --max-events=2
             0.000 :0/0 timer:hrtimer_start(hrtimer: 0xffff8d4eff225050, function: 0xffffffff9e22ddd0, expires: 241152380000000, softexpires: 241152380000000, mode: HRTIMER_MODE_ABS)
             0.028 :0/0 timer:hrtimer_start(hrtimer: 0xffff8d4eff225050, function: 0xffffffff9e22ddd0, expires: 241153654000000, softexpires: 241153654000000, mode: HRTIMER_MODE_ABS_PINNED_HARD)
        root@x1:~#
      Suggested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
      Reviewed-by: Arnaldo Carvalho de Melo <acme@kernel.org>
      Signed-off-by: Howard Chu <howardchu95@gmail.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/lkml/20240615032743.112750-1-howardchu95@gmail.com
      Link: https://lore.kernel.org/r/20240624181345.124764-4-howardchu95@gmail.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      607bbdb4
    • Howard Chu's avatar
      perf trace: BTF-based enum pretty printing for syscall args · 45a0c928
      Howard Chu authored
      In this patch, BTF is used to turn an enum value into the corresponding
      name. There is only one system call that takes an enum value as its
      argument: `landlock_add_rule()`.
      
      The vmlinux BTF is loaded lazily, when the user decides to trace the
      `landlock_add_rule` syscall. But if one decides to run `perf trace`
      without any arguments, the behaviour is to trace `landlock_add_rule`,
      so the vmlinux BTF will be loaded by default.
      
      The laziest behaviour is to load vmlinux btf when a
      `landlock_add_rule` syscall hits. But I think you could lose some
      samples when loading vmlinux btf at run time, for it can delay the
      handling of other samples. I might need your precious opinions on
      this...
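
The trade-off described above, loading once on first use rather than up front, can be sketched with a generic lazy-initialisation helper. Everything here is hypothetical: `struct type_info`, `load_type_info()` and `get_type_info()` stand in for the vmlinux BTF object and its loader, and `load_count` exists only to make the single load observable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the expensive-to-load type information. */
struct type_info {
	int loaded;
};

static struct type_info cached_info;
static struct type_info *cached;	/* NULL until the first request */
int load_count;				/* how many times the loader ran */

/* Pretend to do the expensive load; runs at most once in this sketch. */
static struct type_info *load_type_info(void)
{
	load_count++;
	cached_info.loaded = 1;
	return &cached_info;
}

/* Return the cached object, loading it on the first call only. */
struct type_info *get_type_info(void)
{
	if (cached == NULL)
		cached = load_type_info();
	assert(cached->loaded);
	return cached;
}
```

The first caller pays the loading cost, which is the delay the message worries about if the load were to happen while samples are already arriving.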
      
      before:
      
      ```
      perf $ ./perf trace -e landlock_add_rule
           0.000 ( 0.008 ms): ldlck-test/438194 landlock_add_rule(rule_type: 2) = -1 EBADFD (File descriptor in bad state)
           0.010 ( 0.001 ms): ldlck-test/438194 landlock_add_rule(rule_type: 1) = -1 EBADFD (File descriptor in bad state)
      ```
      
      after:
      
      ```
      perf $ ./perf trace -e landlock_add_rule
           0.000 ( 0.029 ms): ldlck-test/438194 landlock_add_rule(rule_type: LANDLOCK_RULE_NET_PORT)     = -1 EBADFD (File descriptor in bad state)
           0.036 ( 0.004 ms): ldlck-test/438194 landlock_add_rule(rule_type: LANDLOCK_RULE_PATH_BENEATH) = -1 EBADFD (File descriptor in bad state)
      ```
      
      Committer notes:
      
      Made it build with NO_LIBBPF=1, simplified btf_enum_fprintf(), see [1]
      for the discussion.
      Signed-off-by: Howard Chu <howardchu95@gmail.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Günther Noack <gnoack@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mickaël Salaün <mic@digikod.net>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/lkml/20240613022757.3589783-1-howardchu95@gmail.com
      Link: https://lore.kernel.org/lkml/ZnXAhFflUl_LV1QY@x1 # [1]
      Link: https://lore.kernel.org/r/20240624181345.124764-3-howardchu95@gmail.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      45a0c928
    • Linus Torvalds's avatar
      Merge tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · e4fc196f
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - fix regression in extent map rework when handling insertion of
         overlapping compressed extent
      
       - fix unexpected file length when appending to a file using direct io
         and buffer not faulted in
      
       - in zoned mode, fix accounting of unusable space when flipping
         read-only block group back to read-write
      
       - fix page locking when COWing an inline range, assertion failure found
         by syzbot
      
       - fix calculation of space info in debugging print
      
       - tree-checker, add validation of data reference item
      
       - fix a few -Wmaybe-uninitialized build warnings
      
      * tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
        btrfs: fix corruption after buffer fault in during direct IO append write
        btrfs: zoned: fix zone_unusable accounting on making block group read-write again
        btrfs: do not subtract delalloc from avail bytes
        btrfs: make cow_file_range_inline() honor locked_page on error
        btrfs: fix corrupt read due to bad offset of a compressed extent map
        btrfs: tree-checker: validate dref root and objectid
      e4fc196f
    • Linus Torvalds's avatar
      Merge tag 'perf-tools-fixes-for-v6.11-2024-07-30' of... · e254e0c5
      Linus Torvalds authored
      Merge tag 'perf-tools-fixes-for-v6.11-2024-07-30' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
      
      Pull perf tools fixes from Namhyung Kim:
       "Some more build fixes and a random crash fix:
      
         - Fix cross-build by setting pkg-config env according to the arch
      
         - Fix static build for missing library dependencies
      
         - Fix Segfault when callchain has no symbols"
      
      * tag 'perf-tools-fixes-for-v6.11-2024-07-30' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
        perf docs: Document cross compilation
        perf: build: Link lib 'zstd' for static build
        perf: build: Link lib 'lzma' for static build
        perf: build: Only link libebl.a for old libdw
        perf: build: Set Python configuration for cross compilation
        perf: build: Setup PKG_CONFIG_LIBDIR for cross compilation
        perf tool: fix dereferencing NULL al->maps
      e254e0c5
  2. 30 Jul, 2024 4 commits
    • Linus Torvalds's avatar
      Merge tag 'chrome-platform-fixes-for-v6.11-rc2' of... · c91a7dee
      Linus Torvalds authored
      Merge tag 'chrome-platform-fixes-for-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
      
      Pull chrome-platform fix from Tzung-Bi Shih:
       "Fix a race condition that sends multiple host commands at a time"
      
      * tag 'chrome-platform-fixes-for-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
        platform/chrome: cros_ec_proto: Lock device when updating MKBP version
      c91a7dee
    • Linus Torvalds's avatar
      minmax: improve macro expansion and type checking · 22f54687
      Linus Torvalds authored
      This clarifies the rules for min()/max()/clamp() type checking and makes
      them a much more efficient macro expansion.
      
      In particular, we now look at the type and range of the inputs to see
      whether they work together, generating a mask of acceptable comparisons,
      and then just verifying that the inputs have a shared case:
      
       - an expression with a signed type can be used for
          (1) signed comparisons
          (2) unsigned comparisons if it is statically known to have a
              non-negative value
      
       - an expression with an unsigned type can be used for
          (3) unsigned comparison
          (4) signed comparisons if the type is smaller than 'int' and thus
              the C integer promotion rules will make it signed anyway
      
      Here rule (1) and (3) are obvious, and rule (2) is important in order to
      allow obvious trivial constants to be used together with unsigned
      values.
      
      Rule (4) is not necessarily a good idea, but matches what we used to do,
      and we have extant cases of this situation in the kernel.  Notably with
      bcachefs having an expression like
      
      	min(bch2_bucket_sectors_dirty(a), ca->mi.bucket_size)
      
      where bch2_bucket_sectors_dirty() returns an 's64', and
      'ca->mi.bucket_size' is of type 'u16'.
      
      Technically that bcachefs comparison is clearly sensible on a C type
      level, because the 'u16' will go through the normal C integer promotion,
      and become 'int', and then we're comparing two signed values and
      everything looks sane.
      
      However, it's not entirely clear that a 'min(s64,u16)' operation makes a
      lot of conceptual sense, and it's possible that we will remove rule (4).
      After all, the _reason_ we have these complicated type checks is exactly
      that the C type promotion rules are not very intuitive.
      
      But at least for now the rule is in place for backwards compatibility.
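
The promotion behaviour behind rules (2) and (4) can be seen in a small standalone sketch. The helper names are made up for illustration and none of this is the kernel's min() implementation; it only shows which plain C comparisons keep their mathematical meaning.

```c
#include <assert.h>
#include <stdint.h>

/*
 * int vs unsigned int: C converts the int operand to unsigned before
 * comparing. The cast below spells out that implicit conversion, so a
 * negative 'a' turns into a very large unsigned value.
 */
int signed_less_than_unsigned(int a, unsigned int b)
{
	return (unsigned int)a < b;
}

/*
 * int64_t vs uint16_t: the u16 is promoted to int and then widened to
 * int64_t, so the comparison stays signed and negative values of 'a'
 * are ordered correctly.
 */
int64_t min_s64_u16(int64_t a, uint16_t b)
{
	return a < b ? a : (int64_t)b;
}
```

Here `signed_less_than_unsigned(-1, 1)` reports that -1 is not less than 1, which is the kind of surprise the type checks exist to catch, while `min_s64_u16(-5, 10)` yields -5 as one would expect.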
      
      Also note that rule (2) existed before, but is hugely relaxed by this
      commit.  It used to be true only for the simplest compile-time
      non-negative integer constants.  The new macro model will allow cases
      where the compiler can trivially see that an expression is non-negative
      even if it isn't necessarily a constant.
      
      For example, the amdgpu driver does
      
      	min_t(size_t, sizeof(fru_info->serial), pia[addr] & 0x3F);
      
      because our old 'min()' macro would see that 'pia[addr] & 0x3F' is of
      type 'int' and clearly not a C constant expression, so doing a 'min()'
      with a 'size_t' is a signedness violation.
      
      Our new 'min()' macro still sees that 'pia[addr] & 0x3F' is of type
      'int', but is smart enough to also see that it is clearly non-negative,
      and thus would allow that case without any complaints.
      
      Cc: Arnd Bergmann <arnd@kernel.org>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22f54687
    • David Sterba's avatar
      btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry() · b8e947e9
      David Sterba authored
      Some arch + compiler combinations report a potentially uninitialized
      variable location in btrfs_lookup_dentry(). This is a false alert, as the
      variable is passed by value and is either valid or there's an error. The
      compilers probably cannot reason about that, although
      btrfs_inode_by_name() is in the same file.
      
         >  + /kisskb/src/fs/btrfs/inode.c: error: 'location.objectid' may be used
         +uninitialized in this function [-Werror=maybe-uninitialized]:  => 5603:9
         >  + /kisskb/src/fs/btrfs/inode.c: error: 'location.type' may be used
         +uninitialized in this function [-Werror=maybe-uninitialized]:  => 5674:5
      
         m68k-gcc8/m68k-allmodconfig
         mips-gcc8/mips-allmodconfig
         powerpc-gcc5/powerpc-all{mod,yes}config
         powerpc-gcc5/ppc64_defconfig
      
      Initialize it to zero. This should fix the warnings and won't change the
      behaviour, as btrfs_inode_by_name() accepts only root or inode item
      types and otherwise returns an error.
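
A minimal sketch of the pattern, using hypothetical names (`struct demo_key`, `demo_lookup()`, `demo_caller()`) rather than the actual btrfs code: the callee fills the key only on success, and the caller zero-initialises it so that every path sees a defined value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key type, loosely modelled on an (objectid, type, offset) triple. */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/*
 * Fill *key and return 0 on success; on failure return -1 and leave
 * *key untouched.
 */
int demo_lookup(int id, struct demo_key *key)
{
	assert(key != NULL);

	if (id < 0)
		return -1;

	key->objectid = (uint64_t)id;
	key->type = 1;
	key->offset = 0;
	return 0;
}

/* Return the looked-up objectid, or 0 when the lookup fails. */
uint64_t demo_caller(int id)
{
	/* Zero-initialised so no path can read an indeterminate field. */
	struct demo_key location = { 0 };
	int ret = demo_lookup(id, &location);

	if (ret < 0)
		return 0;
	return location.objectid;
}
```

The initialiser does not change behaviour on the success path, since every field is overwritten there; it only gives the compiler a definite value to reason about.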
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Link: https://lore.kernel.org/linux-btrfs/bd4e9928-17b3-9257-8ba7-6b7f9bbb639a@linux-m68k.org/
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b8e947e9
    • Patryk Duda's avatar
      platform/chrome: cros_ec_proto: Lock device when updating MKBP version · df615907
      Patryk Duda authored
      The cros_ec_get_host_command_version_mask() function requires that the
      caller hold the ec_dev->lock mutex before calling it. This requirement
      was not met, and as a result two commands could be sent to the device
      at the same time.
      
      The problem was observed while using the UART backend, which doesn't use
      any additional locks, unlike the SPI backend, which locks the controller
      until a response is received.
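
The locking contract can be sketched in user space with a pthread mutex. This is an illustration under assumptions, not the driver code: `struct demo_dev`, `demo_send_locked()` and `demo_update_version()` are hypothetical, and the `in_flight` counter only models "a command is currently on the wire".

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical device with a lock that serialises host commands. */
struct demo_dev {
	pthread_mutex_t lock;
	int in_flight;	/* commands currently being sent */
	int sent;	/* total commands sent so far */
};

/*
 * Send one command. The caller must hold dev->lock; the assertion
 * trips if two commands overlap.
 */
int demo_send_locked(struct demo_dev *dev)
{
	assert(dev->in_flight == 0);
	dev->in_flight++;
	dev->sent++;
	dev->in_flight--;
	return 0;
}

/* Take the lock around the helper that requires it. */
int demo_update_version(struct demo_dev *dev)
{
	int ret;

	pthread_mutex_lock(&dev->lock);
	ret = demo_send_locked(dev);
	pthread_mutex_unlock(&dev->lock);
	return ret;
}
```

A backend that serialises transfers by other means can hide a missing lock like this one, which fits the observation that the problem surfaced with the UART backend.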
      
      Fixes: f74c7557 ("platform/chrome: cros_ec_proto: Update version on GET_NEXT_EVENT failure")
      Cc: stable@vger.kernel.org
      Signed-off-by: Patryk Duda <patrykd@google.com>
      Link: https://lore.kernel.org/r/20240730104425.607083-1-patrykd@google.com
      Signed-off-by: Tzung-Bi Shih <tzungbi@kernel.org>
      df615907
  3. 29 Jul, 2024 15 commits
    • Linus Torvalds's avatar
      profiling: remove stale percpu flip buffer variables · 94ede2a3
      Linus Torvalds authored
      For some reason I didn't see this issue on my arm64 or x86-64 builds,
      but Stephen Rothwell reports that commit 2accfdb7 ("profiling:
      attempt to remove per-cpu profile flip buffer") left these static
      variables around, and the powerpc build is unhappy about them:
      
        kernel/profile.c:52:28: warning: 'cpu_profile_flip' defined but not used [-Wunused-variable]
           52 | static DEFINE_PER_CPU(int, cpu_profile_flip);
              |                            ^~~~~~~~~~~~~~~~
        ..
      
      So remove these stale left-over remnants too.
      
      Fixes: 2accfdb7 ("profiling: attempt to remove per-cpu profile flip buffer")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94ede2a3
    • Linus Torvalds's avatar
      Merge tag 'for-linus-2024072901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid · 6b5faec9
      Linus Torvalds authored
      Pull HID fixes from Benjamin Tissoires:
      
       - fixes for HID-BPF after the merge with the bpf tree (Arnd Bergmann
         and Benjamin Tissoires)
      
       - some tool type fix for the Wacom driver (Tatsunosuke Tobita)
      
       - a reorder of the sensor discovery to ensure the HID AMD SFH is
         removed when no sensors are available (Basavaraj Natikar)
      
      * tag 'for-linus-2024072901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
        selftests/hid: add test for attaching multiple time the same struct_ops
        HID: bpf: prevent the same struct_ops to be attached more than once
        selftests/hid: disable struct_ops auto-attach
        selftests/hid: fix bpf_wq new API
        HID: amd_sfh: Move sensor discovery before HID device initialization
        hid: bpf: add BPF_JIT dependency
        HID: wacom: more appropriate tool type categorization
        HID: wacom: Modify pen IDs
      6b5faec9
    • Linus Torvalds's avatar
      Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost · 10826505
      Linus Torvalds authored
      Pull virtio fixes from Michael Tsirkin:
       "The biggest thing here is the adminq change - but it looks like the
        only way to avoid headq blocking causing indefinite stalls.
      
        This fixes three issues:
      
         - Prevent admin commands on one VF blocking another.
      
           This prevents a bad VF from blocking a good one, as well as fixing
           a scalability issue with large # of VFs
      
         - Correctly return error on command failure on octeon. We used to
           treat failed commands as a success.
      
         - Fix modpost warning when building virtio_dma_buf. Harmless, but the
           fix is trivial"
      
      * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
        virtio_pci_modern: remove admin queue serialization lock
        virtio_pci_modern: use completion instead of busy loop to wait on admin cmd result
        virtio_pci_modern: pass cmd as an identification token
        virtio_pci_modern: create admin queue of queried size
        virtio: create admin queues alongside other virtqueues
        virtio_pci: pass vq info as an argument to vp_setup_vq()
        virtio: push out code to vp_avq_index()
        virtio_pci_modern: treat vp_dev->admin_vq.info.vq pointer as static
        virtio_pci: introduce vector allocation fallback for slow path virtqueues
        virtio_pci: pass vector policy enum to vp_find_one_vq_msix()
        virtio_pci: pass vector policy enum to vp_find_vqs_msix()
        virtio_pci: simplify vp_request_msix_vectors() call a bit
        virtio_pci: push out single vq find code to vp_find_one_vq_msix()
        vdpa/octeon_ep: Fix error code in octep_process_mbox()
        virtio: add missing MODULE_DESCRIPTION() macro
      10826505
    • Linus Torvalds's avatar
      task_work: make TWA_NMI_CURRENT handling conditional on IRQ_WORK · cec6937d
      Linus Torvalds authored
      The TWA_NMI_CURRENT handling very much depends on IRQ_WORK, but that
      isn't universally enabled everywhere.
      
      Maybe the IRQ_WORK infrastructure should just be unconditional - x86
      ends up indirectly enabling it through unconditionally enabling
      PERF_EVENTS, for example.  But it also gets enabled by having SMP
      support, or even if you just have PRINTK enabled.
      
      But in the meantime TWA_NMI_CURRENT causes tons of build failures on
      various odd minimal configs.  Which did show up in linux-next, but
      despite that nobody bothered to fix it or even inform me until -rc1 was
      out.
      
      Fixes: 466e4d80 ("task_work: Add TWA_NMI_CURRENT as an additional notify mode")
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Reported-by: kernelci.org bot <bot@kernelci.org>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cec6937d
    • Linus Torvalds's avatar
      profiling: attempt to remove per-cpu profile flip buffer · 2accfdb7
      Linus Torvalds authored
      This is the really old legacy kernel profiling code, which has long
      since been obviated by "real profiling" (ie 'prof' and company), and
      mainly remains as a source of syzbot reports.
      
      There are anecdotal reports that people still use it for boot-time
      profiling, but it's unlikely that such use would care about the old NUMA
      optimizations in this code from 2004 (commit ad02973d: "profile: 512x
      Altix timer interrupt livelock fix" in the BK import archive at [1])
      
      So in order to head off future syzbot reports, let's try to simplify
      this code and get rid of the per-cpu profile buffers that are quite a
      large portion of the complexity footprint of this thing (including CPU
      hotplug callbacks etc).
      
      It's unlikely anybody will actually notice, or possibly, as Thomas put
      it: "Only people who indulge in nostalgia will notice :)".
      
      That said, if it turns out that this code is actually actively used by
      somebody, we can always revert this removal.  Thus the "attempt" in the
      summary line.
      
      [ Note: in a small nod to "the profiling code can cause NUMA problems",
        this also removes the "increment the last entry in the profiling array
        on any unknown hits" logic. That would account any program counter in
        a module to that single counter location, and might exacerbate any
        NUMA cacheline bouncing issues ]
      
      Link: https://lore.kernel.org/all/CAHk-=wgs52BxT4Zjmjz8aNvHWKxf5_ThBY4bYL1Y6CTaNL2dTw@mail.gmail.com/
      Link:  https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git [1]
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2accfdb7
    • Tetsuo Handa's avatar
      profiling: remove prof_cpu_mask · 7c51f7bb
      Tetsuo Handa authored
      syzbot is reporting uninit-value at profile_hits(), for there is a race
      window between
      
        if (!alloc_cpumask_var(&prof_cpu_mask, GFP_KERNEL))
          return -ENOMEM;
        cpumask_copy(prof_cpu_mask, cpu_possible_mask);
      
      in profile_init() and
      
        cpumask_available(prof_cpu_mask) &&
        cpumask_test_cpu(smp_processor_id(), prof_cpu_mask))
      
      in profile_tick(); prof_cpu_mask remains uninitialized until cpumask_copy()
      completes while cpumask_available(prof_cpu_mask) returns true as soon as
      alloc_cpumask_var(&prof_cpu_mask) completes.
      
      We could replace alloc_cpumask_var() with zalloc_cpumask_var() and
      call cpumask_copy() from create_proc_profile() only on UP kernels, since
      profile_online_cpu() calls cpumask_set_cpu() as needed via
      cpuhp_setup_state(CPUHP_AP_ONLINE_DYN) on SMP kernels. But this patch
      removes prof_cpu_mask instead, because it seems unnecessary.
      
      The cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) test
      in profile_tick() is likely always true, because
      
        a CPU cannot call profile_tick() if that CPU is offline
      
      and
      
        cpumask_set_cpu(cpu, prof_cpu_mask) is called when that CPU becomes
        online and cpumask_clear_cpu(cpu, prof_cpu_mask) is called when that
        CPU becomes offline.
      
      This test could be false during the transition between online and offline.
      
      But according to include/linux/cpuhotplug.h, CPUHP_PROFILE_PREPARE
      belongs to the PREPARE section, which means that the CPU subjected to
      profile_dead_cpu() cannot be inside profile_tick() (i.e. no risk of a
      use-after-free bug) because interrupts for that CPU are disabled during
      the PREPARE section. Therefore, this test is guaranteed to be true, and
      can be removed. (Since profile_hits() checks prof_buffer != NULL, we
      don't need to check prof_buffer != NULL here unless get_irq_regs() or
      user_mode() is so slow that we want to avoid calling it when
      prof_buffer == NULL.)
      
      do_profile_hits() is called from profile_tick() from the timer interrupt
      only if cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) is true and
      prof_buffer is not NULL. But syzbot is also reporting that sometimes
      do_profile_hits() is called while the current thread is still doing
      vzalloc(), where prof_buffer must be NULL at this moment. This indicates
      that multiple threads concurrently tried to write to the
      /sys/kernel/profiling interface, which caused one thread to try to
      re-allocate prof_buffer even though another had already allocated it.
      Fix this by using serialization.
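      
      The race window described above can be modelled in plain userspace C.
      This is a minimal single-threaded sketch, not kernel code: fake_mask,
      mask_alloc(), mask_zalloc(), mask_available(), mask_test() and
      reader_sees_cpu() are hypothetical stand-ins for the cpumask helpers,
      and the poison byte only simulates uninitialized memory.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a dynamically allocated cpumask. */
struct fake_mask {
	unsigned char *bits;	/* NULL until allocated */
};

#define FAKE_MASK_BYTES 8
#define POISON 0xAA	/* stands in for whatever garbage the allocator returns */

/* Like alloc_cpumask_var(): memory is allocated but its content is undefined. */
static bool mask_alloc(struct fake_mask *m)
{
	m->bits = malloc(FAKE_MASK_BYTES);
	if (!m->bits)
		return false;
	memset(m->bits, POISON, FAKE_MASK_BYTES);	/* simulate garbage */
	return true;
}

/* Like zalloc_cpumask_var(): memory is allocated and zeroed. */
static bool mask_zalloc(struct fake_mask *m)
{
	m->bits = calloc(1, FAKE_MASK_BYTES);
	return m->bits != NULL;
}

/* Like cpumask_available(): true as soon as the allocation has happened. */
static bool mask_available(const struct fake_mask *m)
{
	return m->bits != NULL;
}

/* Like cpumask_test_cpu(): reads one bit of the mask. */
static bool mask_test(const struct fake_mask *m, unsigned int cpu)
{
	return m->bits[cpu / 8] & (1u << (cpu % 8));
}

/* What a reader in the window between allocation and copy would observe. */
static bool reader_sees_cpu(const struct fake_mask *m, unsigned int cpu)
{
	return mask_available(m) && mask_test(m, cpu);
}
```

      With mask_alloc() a reader can pass the "available" check and still
      read garbage bits; with mask_zalloc() it reads zeroes until the mask
      is populated. The patch itself avoids the question by removing the
      mask altogether.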
      Reported-by: syzbot <syzbot+b1a83ab2a9eb9321fbdd@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=b1a83ab2a9eb9321fbdd
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Tested-by: syzbot <syzbot+b1a83ab2a9eb9321fbdd@syzkaller.appspotmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c51f7bb
    • Tetsuo Handa's avatar
      Input: MT - limit max slots · 99d3bf5f
      Tetsuo Handa authored
      syzbot is reporting a too-large allocation at input_mt_init_slots(),
      because num_slots is supplied from userspace using ioctl(UI_DEV_CREATE).
      
      Since nobody knows the maximum possible number of slots, this patch
      chooses 1024.
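      
      The idea of capping a caller-supplied count before sizing an
      allocation can be sketched in userspace C as follows. This is only an
      illustration under assumptions: SKETCH_MAX_SLOTS, struct sketch_slot
      and sketch_init_slots() are hypothetical names, and the error codes
      are chosen for the sketch rather than taken from the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Upper bound chosen by the patch; the macro name is hypothetical. */
#define SKETCH_MAX_SLOTS 1024

struct sketch_slot {
	int abs[16];	/* placeholder payload, not the real slot layout */
};

/*
 * Allocate a slot array for a caller-supplied count.
 * Returns 0 on success, -EINVAL if the count exceeds the cap,
 * or -ENOMEM if the allocation fails.
 */
static int sketch_init_slots(struct sketch_slot **out, unsigned int num_slots)
{
	*out = NULL;
	if (num_slots == 0)
		return 0;	/* nothing to allocate */
	if (num_slots > SKETCH_MAX_SLOTS)
		return -EINVAL;	/* reject before sizing the allocation */

	*out = calloc(num_slots, sizeof(**out));
	return *out ? 0 : -ENOMEM;
}
```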
      Reported-by: syzbot <syzbot+0122fa359a69694395d5@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=0122fa359a69694395d5
      Suggested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99d3bf5f
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux · 3894840a
      Linus Torvalds authored
      Pull ARM updates from Russell King:
      
       - ftrace: don't assume stack frames are contiguous in memory
      
       - remove unused mod_unwind_map structure
      
       - spelling fixes
      
       - allow use of LD dead code/data elimination
      
       - fix callchain_trace() return value
      
       - add support for stackleak gcc plugin
      
       - correct some reset asm function prototypes for CFI
      
      [ Missed the merge window because Russell forgot to push out ]
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux:
        ARM: 9408/1: mm: CFI: Fix some erroneous reset prototypes
        ARM: 9407/1: Add support for STACKLEAK gcc plugin
        ARM: 9406/1: Fix callchain_trace() return value
        ARM: 9404/1: arm32: enable HAVE_LD_DEAD_CODE_DATA_ELIMINATION
        ARM: 9403/1: Alpine: Spelling s/initialiing/initializing/
        ARM: 9402/1: Kconfig: Spelling s/Cortex A-/Cortex-A/
        ARM: 9400/1: Remove unused struct 'mod_unwind_map'
      3894840a
    • Filipe Manana's avatar
      btrfs: fix corruption after buffer fault in during direct IO append write · 939b656b
      Filipe Manana authored
      During an append (O_APPEND write flag) direct IO write if the input buffer
      was not previously faulted in, we can corrupt the file in a way that the
      final size is unexpected and it includes an unexpected hole.
      
      The problem happens like this:
      
      1) We have an empty file, with size 0, for example;
      
      2) We do an O_APPEND direct IO with a length of 4096 bytes and the input
         buffer is not currently faulted in;
      
      3) We enter btrfs_direct_write(), lock the inode and call
         generic_write_checks(), which calls generic_write_checks_count(), and
         that function sets the iocb position to 0 with the following code:
      
      	if (iocb->ki_flags & IOCB_APPEND)
      		iocb->ki_pos = i_size_read(inode);
      
      4) We call btrfs_dio_write() and enter into iomap, which will end up
         calling btrfs_dio_iomap_begin() and that calls
         btrfs_get_blocks_direct_write(), where we update the i_size of the
         inode to 4096 bytes;
      
      5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access
         the page of the write input buffer (at iomap_dio_bio_iter(), with a
         call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets
         returned to btrfs at btrfs_direct_write() via btrfs_dio_write();
      
      6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode,
         fault in the write buffer and then go to the label 'relock';
      
      7) We lock the inode again, do all the necessary checks again and call
         generic_write_checks() again, which calls generic_write_checks_count()
         again, and there we set the iocb's position to 4K, which is the current
         i_size of the inode, with the same code shown above:
      
              if (iocb->ki_flags & IOCB_APPEND)
                      iocb->ki_pos = i_size_read(inode);
      
      8) Then we go to btrfs_dio_write() again and enter iomap, and the write
         succeeds, but it wrote to the file range [4K, 8K), leaving a hole in
         the [0, 4K) range and an i_size of 8K, which goes against the
         expectation of having the data written to the range [0, 4K) and
         getting an i_size of 4K.
      
      Fix this by not unlocking the inode before faulting in the input buffer
      when we get -EFAULT or an incomplete write, and by not jumping to the
      'relock' label after faulting in the buffer. Instead, jump to a location
      immediately before calling iomap, skipping all the write checks and
      relocking. This solves the problem, and it's fine even if the input
      buffer is memory mapped to the same file range: only holding the range
      locked in the inode's io tree can cause a deadlock, so it's safe to
      keep the inode lock (VFS lock), as was fixed and described in commit
      51bd9563 ("btrfs: fix deadlock due to page faults during direct IO
      reads and writes").
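      
      The sequence in steps 1 to 8 can be reduced to a small toy model. This
      is a hedged userspace sketch, not btrfs code: struct toy_inode,
      struct toy_result and toy_append_write() are hypothetical, and the
      only thing modelled is when the append position is derived from
      i_size.

```c
#include <stdbool.h>

/* Toy inode: only the size matters for this model. */
struct toy_inode {
	long long i_size;
};

struct toy_result {
	long long write_pos;	/* offset where the data finally landed */
	long long i_size;	/* file size after the write */
};

/*
 * Model of an O_APPEND direct write of 'len' bytes whose first attempt
 * fails with a fault after i_size has already been extended.
 * If 'redo_checks_on_retry' is true, the retry recomputes the position
 * from i_size (the old 'relock' path); otherwise it reuses the position
 * computed by the first pass (the fixed behaviour).
 */
static struct toy_result toy_append_write(struct toy_inode *inode,
					  long long len,
					  bool redo_checks_on_retry)
{
	struct toy_result res;
	long long pos;

	/* Write checks: O_APPEND sets the position to the current size. */
	pos = inode->i_size;

	/* First attempt: i_size is extended, then the buffer access faults. */
	if (pos + len > inode->i_size)
		inode->i_size = pos + len;

	/* Retry after faulting in the buffer. */
	if (redo_checks_on_retry)
		pos = inode->i_size;	/* position moves past the first range */

	if (pos + len > inode->i_size)
		inode->i_size = pos + len;

	res.write_pos = pos;
	res.i_size = inode->i_size;
	return res;
}
```

      Starting from an empty file with len = 4096, the model lands the data
      at offset 4096 with a final size of 8192 when the checks are redone,
      and at offset 0 with a final size of 4096 when they are skipped.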
      
      A sample reproducer provided by a reporter is the following:
      
         $ cat test.c
         #ifndef _GNU_SOURCE
         #define _GNU_SOURCE
         #endif
      
         #include <fcntl.h>
         #include <stdio.h>
         #include <sys/mman.h>
         #include <sys/stat.h>
         #include <unistd.h>
      
         int main(int argc, char *argv[])
         {
             if (argc < 2) {
                 fprintf(stderr, "Usage: %s <test file>\n", argv[0]);
                 return 1;
             }
      
             int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT |
                           O_APPEND, 0644);
             if (fd < 0) {
                 perror("creating test file");
                 return 1;
             }
      
             char *buf = mmap(NULL, 4096, PROT_READ,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
             ssize_t ret = write(fd, buf, 4096);
             if (ret < 0) {
                 perror("write");
                 return 1;
             }
      
             struct stat stbuf;
             ret = fstat(fd, &stbuf);
             if (ret < 0) {
                 perror("stat");
                 return 1;
             }
      
             printf("size: %llu\n", (unsigned long long)stbuf.st_size);
             return stbuf.st_size == 4096 ? 0 : 1;
         }
      
      A test case for fstests will be sent soon.
      Reported-by: Hanna Czenczek <hreitz@redhat.com>
      Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/
      Fixes: 8184620a ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb")
      CC: stable@vger.kernel.org # 6.1+
      Tested-by: Hanna Czenczek <hreitz@redhat.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      939b656b
    • Naohiro Aota's avatar
      btrfs: zoned: fix zone_unusable accounting on making block group read-write again · 8cd44dd1
      Naohiro Aota authored
      When btrfs makes a block group read-only, it adds all free regions in the
      block group to space_info->bytes_readonly. That free space excludes
      reserved and pinned regions. OTOH, when btrfs makes the block group
      read-write again, it moves all the unused regions into the block group's
      zone_unusable. That unused region includes reserved and pinned regions.
      As a result, it counts too many zone_unusable bytes.
      
      Fortunately (or unfortunately), having erroneous zone_unusable does not
      affect the calculation of space_info->bytes_readonly, because free
      space (num_bytes in btrfs_dec_block_group_ro) calculation is done based on
      the erroneous zone_unusable and it reduces the num_bytes just to cancel the
      error.
      
      This behavior can be easily discovered by adding a WARN_ON to check, e.g.,
      "bg->pinned > 0" in btrfs_dec_block_group_ro(), and running an fstests
      test case such as btrfs/282.
      
      Fix it by properly considering pinned and reserved in
      btrfs_dec_block_group_ro(). Also, add a WARN_ON and introduce
      btrfs_space_info_update_bytes_zone_unusable() to catch a similar mistake.
      
      Fixes: 169e0da9 ("btrfs: zoned: track unusable bytes for zones")
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8cd44dd1
    • Naohiro Aota's avatar
      btrfs: do not subtract delalloc from avail bytes · d89c285d
      Naohiro Aota authored
      The block group's avail bytes printed when dumping a space info subtract
      the delalloc_bytes. However, as shown in btrfs_add_reserved_bytes() and
      btrfs_free_reserved_bytes(), it is added or subtracted along with
      "reserved" for the delalloc case, which means the "delalloc_bytes" is a
      part of the "reserved" bytes. So subtracting it again when calculating
      the avail space counts delalloc_bytes twice, which can lead to an invalid
      result.
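      
      The double counting can be illustrated with a simplified model. The
      struct and helper names below are hypothetical and only a subset of
      the counters is modelled; other terms of the real calculation are
      omitted on purpose.

```c
/* Simplified, hypothetical subset of a block group's counters. */
struct toy_block_group {
	unsigned long long length;
	unsigned long long used;
	unsigned long long pinned;
	unsigned long long reserved;		/* already includes delalloc_bytes */
	unsigned long long delalloc_bytes;
};

/* Old calculation: subtracts delalloc_bytes on top of reserved. */
static unsigned long long toy_avail_old(const struct toy_block_group *bg)
{
	return bg->length - bg->used - bg->pinned - bg->reserved -
	       bg->delalloc_bytes;
}

/* Corrected calculation: delalloc_bytes is already covered by reserved. */
static unsigned long long toy_avail_new(const struct toy_block_group *bg)
{
	return bg->length - bg->used - bg->pinned - bg->reserved;
}
```

      For a group with length 1024, used 512 and 384 reserved bytes that are
      all delalloc, the corrected helper yields 128, while the old one
      subtracts past zero and wraps around in unsigned arithmetic.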
      
      Fixes: e50b122b ("btrfs: print available space for a block group when dumping a space info")
      CC: stable@vger.kernel.org # 6.6+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d89c285d
    • Boris Burkov's avatar
      btrfs: make cow_file_range_inline() honor locked_page on error · 47857437
      Boris Burkov authored
      The btrfs buffered write path runs through __extent_writepage() which
      has some tricky return value handling for writepage_delalloc().
      Specifically, when that returns 1, we exit, but for other return values
      we continue and end up calling btrfs_folio_end_all_writers(). If the
      folio has been unlocked (note that we check the PageLocked bit at the
      start of __extent_writepage()), this results in an assert panic like
      this one from syzbot:
      
        BTRFS: error (device loop0 state EAL) in free_log_tree:3267: errno=-5 IO failure
        BTRFS warning (device loop0 state EAL): Skipping commit of aborted transaction.
        BTRFS: error (device loop0 state EAL) in cleanup_transaction:2018: errno=-5 IO failure
        assertion failed: folio_test_locked(folio), in fs/btrfs/subpage.c:871
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/subpage.c:871!
        Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
        CPU: 1 PID: 5090 Comm: syz-executor225 Not tainted
        6.10.0-syzkaller-05505-gb1bc554e #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
        Google 06/27/2024
        RIP: 0010:btrfs_folio_end_all_writers+0x55b/0x610 fs/btrfs/subpage.c:871
        Code: e9 d3 fb ff ff e8 25 22 c2 fd 48 c7 c7 c0 3c 0e 8c 48 c7 c6 80 3d
        0e 8c 48 c7 c2 60 3c 0e 8c b9 67 03 00 00 e8 66 47 ad 07 90 <0f> 0b e8
        6e 45 b0 07 4c 89 ff be 08 00 00 00 e8 21 12 25 fe 4c 89
        RSP: 0018:ffffc900033d72e0 EFLAGS: 00010246
        RAX: 0000000000000045 RBX: 00fff0000000402c RCX: 663b7a08c50a0a00
        RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
        RBP: ffffc900033d73b0 R08: ffffffff8176b98c R09: 1ffff9200067adfc
        R10: dffffc0000000000 R11: fffff5200067adfd R12: 0000000000000001
        R13: dffffc0000000000 R14: 0000000000000000 R15: ffffea0001cbee80
        FS:  0000000000000000(0000) GS:ffff8880b9500000(0000)
        knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f5f076012f8 CR3: 000000000e134000 CR4: 00000000003506f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
        <TASK>
        __extent_writepage fs/btrfs/extent_io.c:1597 [inline]
        extent_write_cache_pages fs/btrfs/extent_io.c:2251 [inline]
        btrfs_writepages+0x14d7/0x2760 fs/btrfs/extent_io.c:2373
        do_writepages+0x359/0x870 mm/page-writeback.c:2656
        filemap_fdatawrite_wbc+0x125/0x180 mm/filemap.c:397
        __filemap_fdatawrite_range mm/filemap.c:430 [inline]
        __filemap_fdatawrite mm/filemap.c:436 [inline]
        filemap_flush+0xdf/0x130 mm/filemap.c:463
        btrfs_release_file+0x117/0x130 fs/btrfs/file.c:1547
        __fput+0x24a/0x8a0 fs/file_table.c:422
        task_work_run+0x24f/0x310 kernel/task_work.c:222
        exit_task_work include/linux/task_work.h:40 [inline]
        do_exit+0xa2f/0x27f0 kernel/exit.c:877
        do_group_exit+0x207/0x2c0 kernel/exit.c:1026
        __do_sys_exit_group kernel/exit.c:1037 [inline]
        __se_sys_exit_group kernel/exit.c:1035 [inline]
        __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1035
        x64_sys_call+0x2634/0x2640
        arch/x86/include/generated/asm/syscalls_64.h:232
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
        entry_SYSCALL_64_after_hwframe+0x77/0x7f
        RIP: 0033:0x7f5f075b70c9
        Code: Unable to access opcode bytes at
        0x7f5f075b709f.
      
      I was hitting the same issue by doing hundreds of accelerated runs of
      generic/475, which also hits IO errors by design.
      
      I instrumented that reproducer with bpftrace and found that the
      undesirable folio_unlock was coming from the following callstack:
      
        folio_unlock+5
        __process_pages_contig+475
        cow_file_range_inline.constprop.0+230
        cow_file_range+803
        btrfs_run_delalloc_range+566
        writepage_delalloc+332
        __extent_writepage # inlined in my stacktrace, but I added it here
        extent_write_cache_pages+622
      
      Looking at the bisected-to patch in the syzbot report, Josef realized
      that the logic of the cow_file_range_inline error path had subtly changed.
      In the past, on error, it jumped to out_unlock in cow_file_range(),
      which honors the locked_page, so when we ultimately call
      folio_end_all_writers(), the folio of interest is still locked. After
      the change, we always unlocked ignoring the locked_page, on both success
      and error. On the success path, this all results in returning 1 to
      __extent_writepage(), which skips the folio_end_all_writers() call,
      which makes it OK to have unlocked.
      
      Fix the bug by wiring the locked_page into cow_file_range_inline() and
      only setting locked_page to NULL on success.
      
      Reported-by: syzbot+a14d8ac9af3a2a4fd0c8@syzkaller.appspotmail.com
      Fixes: 0586d0a8 ("btrfs: move extent bit and page cleanup into cow_file_range_inline")
      CC: stable@vger.kernel.org # 6.10+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      47857437
    • Linus Torvalds's avatar
      minmax: simplify min()/max()/clamp() implementation · dc1c8034
      Linus Torvalds authored
      Now that we no longer have any C constant expression contexts (i.e. array
      size declarations or static initializers) that use min() or max(), we
      can simplify the implementation by not having to worry about the result
      staying as a C constant expression.
      
      So now we can unconditionally just use temporary variables of the right
      type, and get rid of the excessive expansion that used to come from the
      use of
      
         __builtin_choose_expr(__is_constexpr(...), ..
      
      to pick the specialized code for constant expressions.
      
      Another expansion simplification is to pass the temporary variables (in
      addition to the original expression) to our __types_ok() macro.  That
      may superficially look like it complicates the macro, but when we only
      want the type of the expression, expanding the temporary variable names
      is much simpler and smaller than expanding the potentially complicated
      original expression.
      
      As a result, on my machine, doing a
      
        $ time make drivers/staging/media/atomisp/pci/isp/kernels/ynr/ynr_1.0/ia_css_ynr.host.i
      
      goes from
      
      	real	0m16.621s
      	user	0m15.360s
      	sys	0m1.221s
      
      to
      
      	real	0m2.532s
      	user	0m2.091s
      	sys	0m0.452s
      
      because the token expansion goes down dramatically.
      
      In particular, the longest line expansion (which was line 71 of that
      'ia_css_ynr.host.c' file) shrinks from 23,338kB (yes, 23MB for one
      single line) to "just" 1,444kB (now "only" 1.4MB).
      
      And yes, that line is still the line from hell, because it's doing
      multiple levels of "min()/max()" expansion thanks to some of them being
      hidden inside the uDIGIT_FITTING() macro.
      
      Lorenzo has a nice cleanup patch that makes that driver use inline
      functions instead of macros for sDIGIT_FITTING() and uDIGIT_FITTING(),
      which will fix that line once and for all, but the 16-fold reduction in
      this case does show why we need to simplify these helpers.
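      
      The "temporary variables of the right type" approach can be sketched
      as below. This is a simplified illustration, not the kernel's macros:
      sketch_min(), sketch_max() and sketch_min_evaluates_once() are
      hypothetical names, the code relies on the GNU C statement-expression
      and __typeof__ extensions, and the type checking and unique temporary
      naming that the real helpers perform are left out.

```c
/*
 * Simplified min()/max() using temporaries of the argument types.
 * Relies on GNU C statement expressions and __typeof__ (GCC, Clang).
 * Type compatibility checks and unique temporary names are omitted.
 */
#define sketch_min(x, y) ({			\
	__typeof__(x) sketch_x_ = (x);		\
	__typeof__(y) sketch_y_ = (y);		\
	sketch_x_ < sketch_y_ ? sketch_x_ : sketch_y_; })

#define sketch_max(x, y) ({			\
	__typeof__(x) sketch_x_ = (x);		\
	__typeof__(y) sketch_y_ = (y);		\
	sketch_x_ > sketch_y_ ? sketch_x_ : sketch_y_; })

/* Each argument is evaluated once, even when it has side effects. */
static int sketch_min_evaluates_once(void)
{
	int i = 0;
	int m = sketch_min(i++, 10);	/* i++ runs exactly once */

	return m == 0 && i == 1;
}
```

      Because each argument appears once in the expansion (plus once inside
      __typeof__, which does not evaluate it), nesting such macros grows
      the preprocessed output far more slowly than a form that repeats the
      full argument text in several branches.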
      
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc1c8034
    • Linus Torvalds's avatar
      minmax: don't use max() in situations that want a C constant expression · cb04e8b1
      Linus Torvalds authored
      We only had a couple of array[] declarations, and changing them to just
      use 'MAX()' instead of 'max()' fixes the issue.
      
      This will allow us to simplify our min/max macros enormously, since they
      can now unconditionally use temporary variables to avoid using the
      argument values multiple times.
      
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb04e8b1
    • Linus Torvalds's avatar
      minmax: scsi: fix mis-use of 'clamp()' in sr.c · 9f499b8c
      Linus Torvalds authored
      While working on simplifying the minmax functions, and avoiding
      excessive macro expansion, it turns out that the sr.c use of the
      'clamp()' macro has the arguments the wrong way around.
      
      The clamp logic is
      
      	val = clamp(in, low, high);
      
      and it returns the input clamped to the low/high limits. But sr.c did
      
      	speed = clamp(0, speed, 0xffff / 177);
      
      which clamps the value '0' to the range '[speed, 0xffff / 177]' and ends
      up being nonsensical.
      
      Happily, I don't think anybody ever cared.
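      
      For reference, the intended argument order looks like this in a plain
      function form. This is a sketch with hypothetical names
      (sketch_clamp(), sketch_limit_speed()); it is not the kernel's
      clamp() macro and makes no claim about how that macro behaves when
      the arguments are swapped.

```c
/* Plain-function clamp with the conventional (value, low, high) order. */
static int sketch_clamp(int val, int lo, int hi)
{
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/*
 * Limit a requested speed to the range [0, 0xffff / 177], i.e. [0, 370].
 * The value being limited goes first; the bounds follow.
 */
static int sketch_limit_speed(int speed)
{
	return sketch_clamp(speed, 0, 0xffff / 177);
}
```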
      
      Fixes: 9fad9d56 ("scsi: sr: Fix unintentional arithmetic wraparound")
      Cc: Justin Stitt <justinstitt@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f499b8c
  4. 28 Jul, 2024 4 commits
    • Linus Torvalds's avatar
      minmax: make generic MIN() and MAX() macros available everywhere · 1a251f52
      Linus Torvalds authored
      This just standardizes the use of MIN() and MAX() macros, with the very
      traditional semantics.  The goal is to use these for C constant
      expressions and for top-level / static initializers, and so be able to
      simplify the min()/max() macros.
      
      These macro names were used by various kernel code - they are very
      traditional, after all - and all such users have been fixed up, with a
      few different approaches:
      
       - trivial duplicated macro definitions have been removed
      
         Note that 'trivial' here means that it's obviously kernel code that
         already included all the major kernel headers, and thus gets the new
         generic MIN/MAX macros automatically.
      
       - non-trivial duplicated macro definitions are guarded with #ifndef
      
         This is the "yes, they define their own versions, but no, the include
         situation is not entirely obvious, and maybe they don't get the
         generic version automatically" case.
      
       - strange use case #1
      
         A couple of drivers decided that the way they want to describe their
         versioning is with
      
      	#define MAJ 1
      	#define MIN 2
      	#define DRV_VERSION __stringify(MAJ) "." __stringify(MIN)
      
         which adds zero value and I just did my Alexander the Great
         impersonation, and rewrote that pointless Gordian knot as
      
      	#define DRV_VERSION "1.2"
      
         instead.
      
       - strange use case #2
      
         A couple of drivers thought that it's a good idea to have a random
         'MIN' or 'MAX' define for a value or index into a table, rather than
         the traditional macro that takes arguments.
      
         These values were re-written as C enum's instead. The new
         function-like macros only expand when followed by an open
         parenthesis, and thus don't clash with enum use.
      
      Happily, there weren't really all that many of these cases, and a lot of
      users already had the pattern of using '#ifndef' guarding (or in one
      case just using '#undef MIN') before defining their own private version
      that does the same thing. I left such cases alone.
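      
      The coexistence of function-like macros and same-named enum constants
      can be shown in a few lines. This is a standalone sketch: the macro
      bodies are the traditional textbook form, and enum sketch_level,
      sketch_table and sketch_lowest_level() are hypothetical.

```c
/* Traditional function-like macros: usable in constant expressions. */
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/*
 * Hypothetical driver-style enum reusing the names. A function-like
 * macro is only expanded when its name is followed by '(', so these
 * bare identifiers are left alone by the preprocessor.
 */
enum sketch_level {
	MIN = 1,
	MAX = 9,
};

/* Array size must be a constant expression, so MAX(...) works here. */
static int sketch_table[MAX(4, 6)];

/* Both uses in one expression: macro call with enum constants as arguments. */
static int sketch_lowest_level(void)
{
	return MIN(MIN, MAX);
}
```

      A plain object-like "#define MIN 2" would not mix with the macro in
      this way, which is why such defines had to be rewritten.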
      
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a251f52
    • Linus Torvalds's avatar
      Linux 6.11-rc1 · 8400291e
      Linus Torvalds authored
      8400291e
    • Linus Torvalds's avatar
      Merge tag 'kbuild-fixes-v6.11' of... · a0c04bd5
      Linus Torvalds authored
      Merge tag 'kbuild-fixes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
      
      Pull Kbuild fixes from Masahiro Yamada:
      
       - Fix RPM package build error caused by an incorrect locale setup
      
       - Mark modules.weakdep as ghost in RPM package
      
       - Fix the odd combination of -S and -c in stack protector scripts,
         which is an error with the latest Clang
      
      * tag 'kbuild-fixes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        kbuild: Fix '-S -c' in x86 stack protector scripts
        kbuild: rpm-pkg: ghost modules.weakdep file
        kbuild: rpm-pkg: Fix C locale setup
      a0c04bd5
    • Linus Torvalds's avatar
      minmax: simplify and clarify min_t()/max_t() implementation · 017fa3e8
      Linus Torvalds authored
      This simplifies the min_t() and max_t() macros by no longer making them
      work in the context of a C constant expression.
      
      That means that you can no longer use them for static initializers or
      for array sizes in type definitions, but there were only a couple of
      such uses, and all of them were converted (famous last words) to use
      MIN_T/MAX_T instead.
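      
      A typed constant-expression variant can be sketched as follows. The
      SKETCH_MIN_T/SKETCH_MAX_T macros, the length constants and
      struct sketch_packet are hypothetical and only illustrate the two
      contexts named above (array sizes in type definitions and static
      initializers); they are not the kernel's MIN_T/MAX_T definitions.

```c
/* Constant-expression variants: cast both operands, then compare. */
#define SKETCH_MIN_T(type, a, b) \
	((type)(a) < (type)(b) ? (type)(a) : (type)(b))
#define SKETCH_MAX_T(type, a, b) \
	((type)(a) > (type)(b) ? (type)(a) : (type)(b))

#define SKETCH_HDR_LEN	24
#define SKETCH_BUF_LEN	64

struct sketch_packet {
	/* Array size in a type definition: must be a constant expression. */
	unsigned char data[SKETCH_MAX_T(unsigned int, SKETCH_HDR_LEN,
					SKETCH_BUF_LEN)];
};

/* Static initializer: also requires a constant expression. */
static const unsigned int sketch_limit =
	SKETCH_MIN_T(unsigned int, SKETCH_HDR_LEN, SKETCH_BUF_LEN);
```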
      
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      017fa3e8