1. 09 Oct, 2019 11 commits
    • perf trace: Add a strtoul() method to 'struct syscall_arg_fmt' · 3f41b778
      Arnaldo Carvalho de Melo authored
      This will go from a string to a number, so that filter expressions can
      be constructed with strings and then, before applying the tracepoint
      filters (or eBPF, in the future) we can map those strings to numbers.
      
      The first one will be for 'msr' tracepoint arguments, but real quickly
      we will be able to reuse all strarrays for that.
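
      A minimal sketch of the string-to-number direction, assuming a sparse
      name table with a base offset like the generated MSR arrays. The
      struct and function names below (strarray_sketch,
      strarray_sketch__strtoul) are hypothetical and are not the ones used
      in perf:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a perf 'strarray': a sparse name table plus the
 * numeric value that corresponds to index 0. */
struct strarray_sketch {
	uint64_t offset;
	size_t nr_entries;
	const char **entries;
};

/* A few entries copied from the generated x86_64_specific_MSRs table. */
static const char *x86_64_names_sketch[] = {
	[0xc0000080 - 0xc0000080] = "EFER",
	[0xc0000084 - 0xc0000080] = "SYSCALL_MASK",
	[0xc0000100 - 0xc0000080] = "FS_BASE",
};

static const struct strarray_sketch x86_64_names_array_sketch = {
	.offset = 0xc0000080,
	.nr_entries = sizeof(x86_64_names_sketch) / sizeof(x86_64_names_sketch[0]),
	.entries = x86_64_names_sketch,
};

/* Map a name to its number; returns 0 on success, -1 if the name is unknown. */
int strarray_sketch__strtoul(const struct strarray_sketch *sa,
			     const char *name, uint64_t *val)
{
	for (size_t i = 0; i < sa->nr_entries; i++) {
		/* Sparse tables have NULL holes between the defined entries. */
		if (sa->entries[i] != NULL && strcmp(sa->entries[i], name) == 0) {
			*val = sa->offset + i;
			return 0;
		}
	}
	return -1;
}
```

      With such a helper, a filter written with the name "FS_BASE" could be
      turned into the number 0xc0000100 before the tracepoint filter is set.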
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Luis Cláudio Gonçalves <lclaudio@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-wgqq48agcgr95b8dmn6fygtr@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf trace: Introduce --filter for tracepoint events · d4097f19
      Arnaldo Carvalho de Melo authored
      Similar to what is in 'perf record', works just like there:
      
        # perf trace -e msr:*
         328.297 :0/0 msr:write_msr(msr: FS_BASE, val: 140240388381888)
         328.302 :0/0 msr:write_msr(msr: FS_BASE, val: 140240388381888)
         328.306 :0/0 msr:write_msr(msr: FS_BASE, val: 140240388381888)
         328.317 :0/0 msr:write_msr(msr: FS_BASE, val: 140240388381888)
         328.322 :0/0 msr:write_msr(msr: FS_BASE, val: 140240388381888)
         328.327 :0/0 msr:write_msr(msr: FS_BASE, val: 140240388381888)
         328.331 :0/0 msr:write_msr(msr: FS_BASE, val: 140240388381888)
         328.336 :0/0 msr:write_msr(msr: FS_BASE, val: 140240388381888)
         328.340 :0/0 ^Cmsr:write_msr(msr: FS_BASE, val: 140240388381888)
        #
      
      So, in a system-wide trace session looking at the write_msr tracepoint
      we see a flood of MSR_FS_BASE writes. To filter them out we need the
      number for that MSR:
      
        # grep FS_BASE /tmp/build/perf/trace/beauty/generated/x86_arch_MSRs_array.c
      	[0xc0000100 - x86_64_specific_MSRs_offset] = "FS_BASE",
        #
      
      And then use it in a filter:
      
        # perf trace -e msr:* --filter="msr!=0xc0000100"
        <SNIP>
         942.177 :0/0 msr:write_msr(msr: IA32_TSC_DEADLINE, val: 3056931068232)
         942.199 :0/0 msr:write_msr(msr: IA32_TSC_DEADLINE, val: 3057135655252)
         942.203 :0/0 msr:write_msr(msr: IA32_TSC_DEADLINE, val: 3056931068222)
         942.231 :0/0 msr:write_msr(msr: IA32_TSC_DEADLINE, val: 3056998373022)
         942.241 :0/0 msr:write_msr(msr: IA32_TSC_DEADLINE, val: 3056931068236)
        <SNIP>
        #
      
      OK, that one is too noisy as well, so let's filter it out too:
      
        # grep TSC_DEADLINE /tmp/build/perf/trace/beauty/generated/x86_arch_MSRs_array.c
      	[0x000006E0] = "IA32_TSC_DEADLINE",
        #
      
        # perf trace -e msr:* --filter="msr!=0xc0000100 && msr!=0x6e0" -a sleep 0.1
           0.000 :0/0 msr:read_msr(msr: IA32_TSC_ADJUST)
           0.066 CPU 0/KVM/4895 msr:write_msr(msr: IA32_SPEC_CTRL, val: 6)
           0.070 CPU 0/KVM/4895 msr:write_msr(msr: 0x830, val: 34359740667)
           0.099 CPU 0/KVM/4895 msr:read_msr(msr: IA32_SYSENTER_ESP, val: -2199021993472)
           0.100 CPU 0/KVM/4895 msr:read_msr(msr: IA32_APICBASE, val: 4276096000)
           0.101 CPU 0/KVM/4895 msr:read_msr(msr: IA32_DEBUGCTLMSR)
           0.109 :0/0 msr:write_msr(msr: IA32_SPEC_CTRL)
           1.000 :0/0 msr:write_msr(msr: 0x830, val: 17179871485)
          18.893 :0/0 msr:write_msr(msr: 0x83f, val: 246)
          28.810 :0/0 msr:write_msr(msr: 0x830, val: 68719479037)
          40.117 CPU 0/KVM/4895 msr:write_msr(msr: IA32_SPEC_CTRL, val: 6)
          40.127 CPU 0/KVM/4895 msr:read_msr(msr: IA32_DEBUGCTLMSR)
          40.139 CPU 0/KVM/4895 msr:write_msr(msr: LSTAR, val: -2130661312)
          40.141 CPU 0/KVM/4895 msr:write_msr(msr: SYSCALL_MASK, val: 14080)
          40.142 CPU 0/KVM/4895 msr:write_msr(msr: TSC_AUX)
          40.144 CPU 0/KVM/4895 msr:write_msr(msr: KERNEL_GS_BASE)
          40.147 CPU 0/KVM/4895 msr:write_msr(msr: IA32_SPEC_CTRL)
          40.148 CPU 0/KVM/4895 msr:write_msr(msr: IA32_FLUSH_CMD, val: 1)
          40.151 CPU 0/KVM/4895 msr:write_msr(msr: IA32_SPEC_CTRL, val: 6)
        ^C
        #
      
      One can combine that with filtering pids as well:
      
        # perf trace -e msr:* --filter="msr!=0xc0000100 && msr!=0x6e0" --filter-pids 4895 -a sleep 0.09
           0.000 :0/0 msr:write_msr(msr: 0x830, val: 4294969597)
           0.291 gnome-terminal/2790 msr:write_msr(msr: SYSCALL_MASK, val: 292608)
           0.294 gnome-terminal/2790 msr:write_msr(msr: LSTAR, val: -1935671280)
           0.295 gnome-terminal/2790 msr:write_msr(msr: TSC_AUX, val: 6)
          10.940 gnome-terminal/2790 msr:write_msr(msr: 0x830, val: 4294969597)
          15.943 gnome-shell/2096 msr:write_msr(msr: 0x830, val: 4294969597)
          16.975 :0/0 msr:write_msr(msr: 0x830, val: 4294969597)
          19.560 :0/0 msr:write_msr(msr: 0x83f, val: 246)
          25.162 :0/0 msr:read_msr(msr: IA32_TSC_ADJUST)
          25.807 JS Watchdog/3635 msr:write_msr(msr: IA32_SPEC_CTRL, val: 6)
          25.820 :0/0 msr:write_msr(msr: IA32_SPEC_CTRL)
          25.941 gnome-terminal/2790 msr:write_msr(msr: 0x830, val: 4294969597)
          26.941 gnome-terminal/2790 msr:write_msr(msr: 0x830, val: 4294969597)
          29.942 gnome-terminal/2790 msr:write_msr(msr: 0x830, val: 4294969597)
          45.313 :0/0 msr:write_msr(msr: 0x83f, val: 246)
          56.945 gnome-terminal/2790 msr:write_msr(msr: 0x830, val: 4294969597)
          60.946 gnome-terminal/2790 msr:write_msr(msr: 0x830, val: 4294969597)
          74.096 JS Watchdog/8971 msr:write_msr(msr: IA32_SPEC_CTRL, val: 6)
          74.130 :0/0 msr:write_msr(msr: IA32_SPEC_CTRL)
          79.673 :0/0 msr:write_msr(msr: 0x83f, val: 246)
          79.947 gnome-terminal/2790 msr:write_msr(msr: 0x830, val: 17179871485)
        #
      
      Or for just a pid, with callchains:
      
        # grep SYSCALL_MAS /tmp/build/perf/trace/beauty/generated/x86_arch_MSRs_array.c
      	[0xc0000084 - x86_64_specific_MSRs_offset] = "SYSCALL_MASK",
        # perf trace -e msr:* --filter="msr==0xc0000084" --pid 2790 --call-graph=dwarf
      
           0.000 gnome-terminal/2790 msr:write_msr(msr: SYSCALL_MASK, val: 292608)
                                             do_trace_write_msr ([kernel.kallsyms])
                                             do_trace_write_msr ([kernel.kallsyms])
                                             kvm_on_user_return ([kvm])
                                             fire_user_return_notifiers ([kernel.kallsyms])
                                             exit_to_usermode_loop ([kernel.kallsyms])
                                             do_syscall_64 ([kernel.kallsyms])
                                             entry_SYSCALL_64 ([kernel.kallsyms])
                                             __GI___poll (inlined)
        9299.073 gnome-terminal/2790 msr:write_msr(msr: SYSCALL_MASK, val: 292608)
                                             do_trace_write_msr ([kernel.kallsyms])
                                             do_trace_write_msr ([kernel.kallsyms])
                                             kvm_on_user_return ([kvm])
                                             fire_user_return_notifiers ([kernel.kallsyms])
                                             exit_to_usermode_loop ([kernel.kallsyms])
                                             do_syscall_64 ([kernel.kallsyms])
                                             entry_SYSCALL_64 ([kernel.kallsyms])
                                             __GI___poll (inlined)
        9348.374 gnome-terminal/2790 msr:write_msr(msr: SYSCALL_MASK, val: 292608)
                                             do_trace_write_msr ([kernel.kallsyms])
                                             do_trace_write_msr ([kernel.kallsyms])
                                             kvm_on_user_return ([kvm])
                                             fire_user_return_notifiers ([kernel.kallsyms])
                                             exit_to_usermode_loop ([kernel.kallsyms])
                                             do_syscall_64 ([kernel.kallsyms])
                                             entry_SYSCALL_64 ([kernel.kallsyms])
                                             __GI___poll (inlined)
        <SNIP>
        #
      
      Ok, just another form of KVM to emit MSRs :-)
      
      Next step: eliminate those greps by parsing the filter expression,
      looking for arg names, then using the arrays associated with them to
      do a reverse lookup.
      
      Also allow those filters to be associated with strace-like syscall
      names.
      
      After that: augment the 'val' arg for 'msr:write_msr' based on the first
      arg, 'msr'.
      
      Then, do that with eBPF too, not just with tracepoint filters.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Luis Cláudio Gonçalves <lclaudio@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-95bfe5d4tzy5f66bx49d05rj@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf evlist: Introduce append_tp_filter_pid() and append_tp_filter_pids() · 1827ab5b
      Arnaldo Carvalho de Melo authored
      We'll need this to support 'perf trace -e tracepoint --filter=expr', as
      the command line tracepoint filter is attached to the preceding evsel,
      just like in 'perf record'. When we go to set pid filters, which we do
      at a minimum to filter out the syscalls made by 'perf trace' itself, we
      need to append to the tp filter, not set it.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Luis Cláudio Gonçalves <lclaudio@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-daynpknni44ywuzi8iua57nn@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf evlist: Introduce append_tp_filter() method · 53c92f73
      Arnaldo Carvalho de Melo authored
      Will be used by 'perf trace' to support 'perf trace --filter', where we
      need to append to any pre-existing filter.
      
      When parse_filter() gets invoked to process --filter, it sets the
      filter to the one specified on the command line. Later on, when we
      filter out the pid of 'perf trace' itself to avoid an event feedback
      loop, we need to preserve the command line filter put in place by
      parse_filter().
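
      The difference between setting and appending can be sketched as
      follows. The helper name tp_filter_append_sketch() is hypothetical,
      and joining the two expressions as "(old) && (new)" is an assumption
      of this sketch, not something stated in this commit:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: return a newly allocated filter that keeps 'old' and
 * additionally requires 'extra'. With no previous filter, 'extra' is simply
 * copied. The caller frees the result. */
char *tp_filter_append_sketch(const char *old, const char *extra)
{
	int has_old = old != NULL && old[0] != '\0';
	/* First pass: ask snprintf() how many bytes the result needs. */
	int len = has_old ? snprintf(NULL, 0, "(%s) && (%s)", old, extra)
			  : snprintf(NULL, 0, "%s", extra);
	char *combined;

	if (len < 0)
		return NULL;

	combined = malloc((size_t)len + 1);
	if (combined == NULL)
		return NULL;

	/* Second pass: format into the exactly sized buffer. */
	if (has_old)
		snprintf(combined, (size_t)len + 1, "(%s) && (%s)", old, extra);
	else
		snprintf(combined, (size_t)len + 1, "%s", extra);

	return combined;
}
```

      Appending "common_pid != 4895" to a command line filter of
      "msr!=0xc0000100" this way keeps both conditions, whereas setting the
      filter would silently drop the user's expression.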
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Luis Cláudio Gonçalves <lclaudio@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-h9rot08qmxlnfmte0holt68x@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf evlist: Factor out asprintf routine to build a tracepoint pid filter · 05cea449
      Arnaldo Carvalho de Melo authored
      Will be used to append such lists to existing filters.
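
      A rough sketch of such a routine, assuming the filter excludes each pid
      through the tracepoint 'common_pid' field. The function name
      tp_filter_pids_sketch() is hypothetical, and the sketch uses
      malloc() plus snprintf() instead of asprintf() to stay within
      standard C:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: build "common_pid != A && common_pid != B ..." for the
 * given pids. Returns a malloc()ed string (empty for npids == 0) or NULL on
 * failure. The caller frees the result. */
char *tp_filter_pids_sketch(size_t npids, const pid_t *pids)
{
	/* Each term needs at most " && " (4) + "common_pid != " (14) + a
	 * signed 64-bit number (20), so 48 bytes per pid is enough. */
	size_t cap = npids * 48 + 1;
	size_t used = 0;
	char *filter = malloc(cap);

	if (filter == NULL)
		return NULL;
	filter[0] = '\0';

	for (size_t i = 0; i < npids; i++) {
		int n = snprintf(filter + used, cap - used, "%scommon_pid != %ld",
				 i ? " && " : "", (long)pids[i]);

		if (n < 0 || (size_t)n >= cap - used) {
			free(filter);
			return NULL;
		}
		used += (size_t)n;
	}

	return filter;
}
```

      Having the string built in one place means the same expression can be
      either set as the whole filter or appended to an existing one.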
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Luis Cláudio Gonçalves <lclaudio@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-798vlyqfqw938ehoe8etivx1@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf trace: Associate the "msr" tracepoint arg name with x86_MSR__scnprintf() · c330ef28
      Arnaldo Carvalho de Melo authored
      So that we can go from:
      
        # perf trace -e msr:write_msr --max-stack=16 sleep 1
             0.000 sleep/6740 msr:write_msr(msr: 3221225728, val: 139636317451648)
                                               do_trace_write_msr ([kernel.kallsyms])
                                               do_trace_write_msr ([kernel.kallsyms])
                                               do_arch_prctl_64 ([kernel.kallsyms])
                                               __x64_sys_arch_prctl ([kernel.kallsyms])
                                               do_syscall_64 ([kernel.kallsyms])
                                               entry_SYSCALL_64 ([kernel.kallsyms])
                                               init_tls (/usr/lib64/ld-2.29.so)
                                               dl_main (/usr/lib64/ld-2.29.so)
                                               _dl_sysdep_start (/usr/lib64/ld-2.29.so)
                                               _dl_start (/usr/lib64/ld-2.29.so)
        #
      
      To:
      
        # perf trace -e msr:write_msr --max-stack=16 sleep 1
           0.000 sleep/8519 msr:write_msr(msr: FS_BASE, val: 139878031705472)
                                             do_trace_write_msr ([kernel.kallsyms])
                                             do_trace_write_msr ([kernel.kallsyms])
                                             do_arch_prctl_64 ([kernel.kallsyms])
                                             __x64_sys_arch_prctl ([kernel.kallsyms])
                                             do_syscall_64 ([kernel.kallsyms])
                                             entry_SYSCALL_64 ([kernel.kallsyms])
                                             init_tls (/usr/lib64/ld-2.29.so)
                                             dl_main (/usr/lib64/ld-2.29.so)
                                             _dl_sysdep_start (/usr/lib64/ld-2.29.so)
                                             _dl_start (/usr/lib64/ld-2.29.so)
        #
      
      This, in reverse, will allow for symbolic system call/tracepoint
      filtering.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Luis Cláudio Gonçalves <lclaudio@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-q1q4unmqja5ex7dy0kb5cjaa@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf trace beauty: Add the glue for the autogenerated MSR arrays · 646b3e2c
      Arnaldo Carvalho de Melo authored
      We need to wrap those autogenerated string arrays with the
      strarrays__scnprintf() formatter, so do it.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Luis Cláudio Gonçalves <lclaudio@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-wqjz4kwi4a0ot6lsis3kc65j@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf trace: Allow associating scnprintf routines with well known arg names · 5d88099b
      Arnaldo Carvalho de Melo authored
      For instance 'msr' appears in several tracepoints, so we can associate
      it with a single scnprintf() routine auto-generated from kernel headers,
      as will be done in followup patches.
      
      Start with an empty array of associations.
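
      The idea of keying formatters by argument name can be sketched like
      this. All identifiers are hypothetical, the table is pre-filled with
      one "msr" entry purely for illustration (the commit itself starts
      with an empty array), and a linear scan stands in for whatever lookup
      perf actually uses:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical formatter signature: print 'val' into 'bf' and return the
 * number of characters written, excluding the terminating NUL. */
typedef size_t (*arg_scnprintf_sketch_t)(char *bf, size_t size, uint64_t val);

struct arg_name_fmt_sketch {
	const char *name;
	arg_scnprintf_sketch_t scnprintf;
};

/* Illustrative formatter: print the value in hex. */
size_t hex_scnprintf_sketch(char *bf, size_t size, uint64_t val)
{
	int n = snprintf(bf, size, "%#llx", (unsigned long long)val);

	if (n < 0 || size == 0)
		return 0;
	/* Report what actually fit, the way scnprintf-style helpers do. */
	return (size_t)n < size ? (size_t)n : size - 1;
}

/* One association shared by every tracepoint that has an arg named "msr". */
static const struct arg_name_fmt_sketch fmts_by_name_sketch[] = {
	{ .name = "msr", .scnprintf = hex_scnprintf_sketch },
};

/* Return the formatter registered for 'arg_name', or NULL if there is none. */
arg_scnprintf_sketch_t find_fmt_by_arg_name_sketch(const char *arg_name)
{
	size_t nr = sizeof(fmts_by_name_sketch) / sizeof(fmts_by_name_sketch[0]);

	for (size_t i = 0; i < nr; i++) {
		if (strcmp(fmts_by_name_sketch[i].name, arg_name) == 0)
			return fmts_by_name_sketch[i].scnprintf;
	}
	return NULL;
}
```

      An argument without an entry, such as 'val', gets NULL back and would
      fall through to the default formatting.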
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-89ptht6s5fez82lykuwq1eyb@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf beauty: Hook up the x86 MSR table generator · fd218347
      Arnaldo Carvalho de Melo authored
      This way we generate the source with the table for later use by plugins,
      etc.
      
      I.e. after running:
      
        $ make -C tools/perf O=/tmp/build/perf
      
      We end up with:
      
        $ head /tmp/build/perf/trace/beauty/generated/x86_arch_MSRs_array.c
        static const char *x86_MSRs[] = {
        	[0x00000000] = "IA32_P5_MC_ADDR",
        	[0x00000001] = "IA32_P5_MC_TYPE",
        	[0x00000010] = "IA32_TSC",
        	[0x00000017] = "IA32_PLATFORM_ID",
        	[0x0000001b] = "IA32_APICBASE",
        	[0x00000020] = "KNC_PERFCTR0",
        	[0x00000021] = "KNC_PERFCTR1",
        	[0x00000028] = "KNC_EVNTSEL0",
        	[0x00000029] = "KNC_EVNTSEL1",
        $
      
      Now it's just a matter of using it, first in a libtraceevent plugin.
      
      At some point we should move tools/perf/trace/beauty to tools/beauty/,
      so that it can be used more generally and even made available externally
      like libbpf, libperf, libtraceevent, etc.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Luis Cláudio Gonçalves <lclaudio@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-b3rmutg4igcohx6kpo67qh4j@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf trace beauty: Add a x86 MSR cmd id->str table generator · 693d3458
      Arnaldo Carvalho de Melo authored
      Without parameters it'll parse tools/arch/x86/include/asm/msr-index.h
      and output a table usable by tools, that will be wired up later to a
      libtraceevent plugin registered from perf's glue code:
      
        $ tools/perf/trace/beauty/tracepoints/x86_msr.sh
        static const char *x86_MSRs[] = {
       <SNIP>
        	[0x00000034] = "SMI_COUNT",
        	[0x0000003a] = "IA32_FEATURE_CONTROL",
        	[0x0000003b] = "IA32_TSC_ADJUST",
        	[0x00000040] = "LBR_CORE_FROM",
        	[0x00000048] = "IA32_SPEC_CTRL",
        	[0x00000049] = "IA32_PRED_CMD",
       <SNIP>
        	[0x0000010b] = "IA32_FLUSH_CMD",
        	[0x0000010F] = "TSX_FORCE_ABORT",
       <SNIP>
        	[0x00000198] = "IA32_PERF_STATUS",
        	[0x00000199] = "IA32_PERF_CTL",
        <SNIP>
        	[0x00000da0] = "IA32_XSS",
        	[0x00000dc0] = "LBR_INFO_0",
        	[0x00000ffc] = "IA32_BNDCFGS_RSVD",
        };
      
        #define x86_64_specific_MSRs_offset 0xc0000080
        static const char *x86_64_specific_MSRs[] = {
        	[0xc0000080 - x86_64_specific_MSRs_offset] = "EFER",
        	[0xc0000081 - x86_64_specific_MSRs_offset] = "STAR",
        	[0xc0000082 - x86_64_specific_MSRs_offset] = "LSTAR",
        	[0xc0000083 - x86_64_specific_MSRs_offset] = "CSTAR",
        	[0xc0000084 - x86_64_specific_MSRs_offset] = "SYSCALL_MASK",
        <SNIP>
        	[0xc0000103 - x86_64_specific_MSRs_offset] = "TSC_AUX",
        	[0xc0000104 - x86_64_specific_MSRs_offset] = "AMD64_TSC_RATIO",
        };
      
        #define x86_AMD_V_KVM_MSRs_offset 0xc0010000
        static const char *x86_AMD_V_KVM_MSRs[] = {
        	[0xc0010000 - x86_AMD_V_KVM_MSRs_offset] = "K7_EVNTSEL0",
        <SNIP>
        	[0xc0010114 - x86_AMD_V_KVM_MSRs_offset] = "VM_CR",
        	[0xc0010115 - x86_AMD_V_KVM_MSRs_offset] = "VM_IGNNE",
        	[0xc0010117 - x86_AMD_V_KVM_MSRs_offset] = "VM_HSAVE_PA",
        <SNIP>
        	[0xc0010240 - x86_AMD_V_KVM_MSRs_offset] = "F15H_NB_PERF_CTL",
        	[0xc0010241 - x86_AMD_V_KVM_MSRs_offset] = "F15H_NB_PERF_CTR",
        	[0xc0010280 - x86_AMD_V_KVM_MSRs_offset] = "F15H_PTSC",
        };
      
      Then these will in turn be hooked up in a follow up patch to be used by
      strarrays__scnprintf().
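
      A minimal sketch of how several offset-based tables like the ones above
      could be searched to turn an MSR number into a name. The struct and
      function names are hypothetical, only a handful of entries are copied
      from the generated tables, and falling back to hex for unknown
      numbers is an assumption modelled on trace output such as
      'msr: 0x830':

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical description of one generated table: 'offset' is the MSR
 * number stored at index 0. */
struct msr_table_sketch {
	uint64_t offset;
	size_t nr_entries;
	const char **entries;
};

static const char *low_msrs_sketch[] = {
	[0x0000003b] = "IA32_TSC_ADJUST",
	[0x00000048] = "IA32_SPEC_CTRL",
};

static const char *high_msrs_sketch[] = {
	[0xc0000082 - 0xc0000080] = "LSTAR",
	[0xc0000084 - 0xc0000080] = "SYSCALL_MASK",
};

static const struct msr_table_sketch msr_tables_sketch[] = {
	{ 0, sizeof(low_msrs_sketch) / sizeof(low_msrs_sketch[0]), low_msrs_sketch },
	{ 0xc0000080, sizeof(high_msrs_sketch) / sizeof(high_msrs_sketch[0]), high_msrs_sketch },
};

/* Print the name for 'val' if any table has one, else the number in hex.
 * Returns the length snprintf() reported, or 0 on error. */
size_t msr_scnprintf_sketch(char *bf, size_t size, uint64_t val)
{
	size_t nr_tables = sizeof(msr_tables_sketch) / sizeof(msr_tables_sketch[0]);
	int n;

	for (size_t t = 0; t < nr_tables; t++) {
		const struct msr_table_sketch *tbl = &msr_tables_sketch[t];
		uint64_t idx;

		if (val < tbl->offset)
			continue;
		idx = val - tbl->offset;
		/* Skip out-of-range indexes and NULL holes in sparse tables. */
		if (idx < tbl->nr_entries && tbl->entries[idx] != NULL) {
			n = snprintf(bf, size, "%s", tbl->entries[idx]);
			return n < 0 ? 0 : (size_t)n;
		}
	}

	n = snprintf(bf, size, "%#llx", (unsigned long long)val);
	return n < 0 ? 0 : (size_t)n;
}
```

      Subtracting the per-table offset keeps each array small even though
      the MSR numbers themselves start at 0xc0000080 or higher.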
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Luis Cláudio Gonçalves <lclaudio@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-ja080xawx08kedez855usnon@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf beauty: Make strarray's offset be u64 · 8d6505ba
      Arnaldo Carvalho de Melo authored
      We need it for things like MSRs, which are sparse and have values that
      go over INT_MAX.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Luis Cláudio Gonçalves <lclaudio@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-g8t2d0jr0mg3yimg2qrjkvlt@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  2. 07 Oct, 2019 27 commits
  3. 06 Oct, 2019 2 commits
    • Linux 5.4-rc2 · da0c9ea1
      Linus Torvalds authored
    • elf: don't use MAP_FIXED_NOREPLACE for elf executable mappings · b212921b
      Linus Torvalds authored
      In commit 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map") we
      changed elf to use MAP_FIXED_NOREPLACE instead of MAP_FIXED for the
      executable mappings.
      
      Then, people reported that it broke some binaries that had overlapping
      segments from the same file, and commit ad55eac7 ("elf: enforce
      MAP_FIXED on overlaying elf segments") re-instated MAP_FIXED for some
      overlaying elf segment cases.  But only some - despite the summary line
      of that commit, it only did it when it also does a temporary brk vma for
      one obvious overlapping case.
      
      Now Russell King reports another overlapping case with old 32-bit x86
      binaries, which doesn't trigger that limited case.  End result: we had
      better just drop MAP_FIXED_NOREPLACE entirely, and go back to MAP_FIXED.
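
      The behavioural difference between the two flags can be seen from
      userspace with a small sketch. It uses anonymous mappings rather than
      ELF segments, the function names are hypothetical, and the fallback
      value for MAP_FIXED_NOREPLACE is an assumption that holds on x86 and
      most other architectures when the libc headers do not define it:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000 /* assumption: generic Linux value */
#endif

/* Returns 1 if mapping over an existing page with MAP_FIXED_NOREPLACE was
 * refused with EEXIST, 0 if it was not refused, -1 on setup failure. */
int noreplace_refuses_overlap_sketch(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	void *first = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	void *second;
	int refused;

	if (first == MAP_FAILED)
		return -1;

	second = mmap(first, len, PROT_READ,
		      MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
	refused = (second == MAP_FAILED && errno == EEXIST);

	/* Kernels older than 4.17 ignore the flag and treat the address as a
	 * hint, so the mapping may land somewhere else. */
	if (second != MAP_FAILED && second != first)
		munmap(second, len);
	munmap(first, len);
	return refused;
}

/* Returns 1 if MAP_FIXED silently replaced the existing mapping at the same
 * address, 0 if not, -1 on setup failure. */
int fixed_replaces_overlap_sketch(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	void *first = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	void *second;
	int replaced;

	if (first == MAP_FAILED)
		return -1;

	second = mmap(first, len, PROT_READ,
		      MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	replaced = (second == first);

	munmap(first, len);
	return replaced;
}
```

      A binary whose segments overlap hits the refusing case with
      MAP_FIXED_NOREPLACE, which is why it fails to load, while MAP_FIXED
      just maps over the earlier segment.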
      
      Yes, it's a sign of old binaries generated with old tool-chains, but we
      do pride ourselves on not breaking existing setups.
      
      This still leaves MAP_FIXED_NOREPLACE in place for the load_elf_interp()
      and the old load_elf_library() use-cases, because nobody has reported
      breakage for those. Yet.
      
      Note that in all the cases seen so far, the overlapping elf sections
      seem to be just re-mapping of the same executable with different section
      attributes.  We could possibly introduce a new MAP_FIXED_NOFILECHANGE
      flag or similar, which acts like NOREPLACE, but allows just remapping
      the same executable file using different protection flags.
      
      It's not clear that would make a huge difference to anything, but if
      people really hate that "elf remaps over previous maps" behavior, maybe
      at least a more limited form of remapping would alleviate some concerns.
      
      Alternatively, we should take a look at our elf_map() logic to see if we
      end up not mapping things properly the first time.
      
      In the meantime, this is the minimal "don't do that then" patch while
      people hopefully think about it more.
      
      Reported-by: Russell King <linux@armlinux.org.uk>
      Fixes: 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map")
      Fixes: ad55eac7 ("elf: enforce MAP_FIXED on overlaying elf segments")
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>