1. 11 Mar, 2024 (7 commits)
    • bpf: Recognize addr_space_cast instruction in the verifier. · 6082b6c3
      Alexei Starovoitov authored
      rY = addr_space_cast(rX, 0, 1) tells the verifier that rY->type = PTR_TO_ARENA.
      Any further operations on a PTR_TO_ARENA register have to be in the 32-bit domain.
      
      The verifier will mark load/store through PTR_TO_ARENA with PROBE_MEM32.
      JIT will generate them as kern_vm_start + 32bit_addr memory accesses.
      
      rY = addr_space_cast(rX, 1, 0) tells the verifier that rY->type = unknown scalar.
      If arena->map_flags has BPF_F_NO_USER_CONV set, then the verifier converts cast_user to mov32 as well.
      Otherwise JIT will convert it to:
        rY = (u32)rX;
        if (rY)
           rY |= arena->user_vm_start & ~(u64)~0U; /* i.e. the upper 32 bits of user_vm_start */
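      
      For orientation, the bpf selftests wrap the two cast directions in helper
      macros along these lines (a sketch modeled on the selftests'
      bpf_arena_common.h, where bpf_addr_space_cast() is an inline-asm macro
      that emits this instruction):
      
        /* make an arena pointer dereferenceable in the program (-> PTR_TO_ARENA) */
        #define cast_kern(ptr) bpf_addr_space_cast(ptr, 0, 1)
        /* turn an arena pointer into a user-visible address (-> unknown scalar) */
        #define cast_user(ptr) bpf_addr_space_cast(ptr, 1, 0)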
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240308010812.89848-6-alexei.starovoitov@gmail.com
    • bpf: Add x86-64 JIT support for bpf_addr_space_cast instruction. · 142fd4d2
      Alexei Starovoitov authored
      LLVM generates the bpf_addr_space_cast instruction while translating
      pointers between the native (zero) address space and
      __attribute__((address_space(N))).
      Address space 1 is reserved as the bpf_arena address space.
      
      rY = addr_space_cast(rX, 0, 1) is processed by the verifier and
      converted to a normal 32-bit move: wY = wX
      
      rY = addr_space_cast(rX, 1, 0) has to be converted by JIT:
      
      aux_reg = upper_32_bits of arena->user_vm_start
      aux_reg <<= 32
      wY = wX // clear upper 32 bits of dst register
      if (wY) // if not zero add upper bits of user_vm_start
        rY |= aux_reg
      
      JIT can do it more efficiently:
      
      mov dst_reg32, src_reg32  // 32-bit move, clears upper 32 bits
      shl dst_reg, 32
      or dst_reg, user_vm_start >> 32  // or in the upper 32 bits of user_vm_start
      rol dst_reg, 32                  // rotate them into the upper half
      xor r11, r11
      test dst_reg32, dst_reg32 // check if lower 32 bits are zero
      cmove r11, dst_reg	  // if so, set dst_reg to zero
      			  // Intel swapped src/dst register encoding in CMOVcc
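      
      A C model of this branch-free sequence may help; this is a sketch of the
      semantics, not the emitted machine code, and it assumes user addresses
      stay below 2^47 so the sign-extended OR immediate behaves:
      
        #include <stdint.h>
        
        static uint64_t rol64(uint64_t v, unsigned int n)
        {
                return (v << n) | (v >> (64 - n));
        }
        
        /* model of the JITed cast_user: splice upper_32_bits(user_vm_start)
           into the pointer while keeping NULL as NULL */
        static uint64_t cast_user_model(uint64_t src, uint64_t user_vm_start)
        {
                uint64_t dst = (uint32_t)src;  /* mov dst_reg32, src_reg32 */
                dst <<= 32;                    /* shl dst_reg, 32 */
                dst |= user_vm_start >> 32;    /* or dst_reg, user_vm_start >> 32 */
                dst = rol64(dst, 32);          /* rol dst_reg, 32 */
                if ((uint32_t)dst == 0)        /* test dst_reg32, dst_reg32 */
                        dst = 0;               /* cmove r11, dst_reg */
                return dst;
        }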
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/bpf/20240308010812.89848-5-alexei.starovoitov@gmail.com
    • bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions. · 2fe99eb0
      Alexei Starovoitov authored
      Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW] instructions.
      They are similar to PROBE_MEM instructions with the following differences:
      - PROBE_MEM has to check that the address is in the kernel range with a
        src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE check
      - PROBE_MEM doesn't support store
      - PROBE_MEM32 relies on the verifier to clear the upper 32 bits in the register
      - PROBE_MEM32 adds the 64-bit kern_vm_start address (which is stored in %r12
        in the prologue). Due to the way bpf_arena is constructed, such a
        %r12 + %reg + off16 access is guaranteed to be within the arena virtual
        range, so no address check is needed at run-time.
      - PROBE_MEM32 allows STX and ST. If they fault, the store is a nop. When
        LDX faults, the destination register is zeroed.
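      
      For intuition, a conceptual C model of a PROBE_MEM32 8-byte load follows;
      this is a sketch of the semantics, not the emitted x86, and arena_fault()
      is a hypothetical stand-in for the exception-table fixup:
      
        #include <stdint.h>
        #include <stdbool.h>
        
        /* hypothetical: reports whether the access would fault */
        extern bool arena_fault(const void *addr);
        
        static uint64_t probe_mem32_ldx_dw(uint64_t kern_vm_start, /* kept in %r12 */
                                           uint64_t src_reg, int16_t off)
        {
                /* the verifier guarantees src_reg's upper 32 bits are clear,
                   so the sum stays inside the arena's kernel mapping */
                uint64_t addr = kern_vm_start + src_reg + off;
        
                if (arena_fault((const void *)addr))
                        return 0;               /* fixup zeroes dst_reg on fault */
                return *(const uint64_t *)addr; /* otherwise a plain load */
        }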
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/bpf/20240308010812.89848-4-alexei.starovoitov@gmail.com
    • bpf: Disasm support for addr_space_cast instruction. · 667a86ad
      Alexei Starovoitov authored
      LLVM generates the rX = addr_space_cast(rY, dst_addr_space, src_addr_space)
      instruction when pointers in a non-zero address space are used by the bpf
      program. Recognize this insn in the uapi and in the bpf disassembler.
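      
      For reference, the encoding being recognized (a sketch of the uapi
      convention; the authoritative definition is in include/uapi/linux/bpf.h
      as modified by this commit):
      
        /* BPF_ALU64 | BPF_MOV | BPF_X with insn->off == BPF_ADDR_SPACE_CAST (1),
         * and insn->imm == (dst_addr_space << 16) | src_addr_space */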
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/bpf/20240308010812.89848-3-alexei.starovoitov@gmail.com
    • bpf: Introduce bpf_arena. · 31746031
      Alexei Starovoitov authored
      Introduce bpf_arena, which is a sparse shared memory region between the bpf
      program and user space.
      
      Use cases:
      1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed
         anonymous region, like memcached or any key/value storage. The bpf
         program implements an in-kernel accelerator. An XDP prog can search
         for a key in bpf_arena and return a value without going to user space.
      2. The bpf program builds arbitrary data structures in bpf_arena (hash
         tables, rb-trees, sparse arrays), while user space consumes them.
      3. bpf_arena is a "heap" of memory from the bpf program's point of view.
         User space may mmap it, but the bpf program will not convert pointers
         to the user base at run-time, which improves bpf program speed.
      
      Initially, the kernel vm_area and user vma are not populated. User space
      can fault in pages within the range. While servicing a page fault,
      bpf_arena logic will insert a new page into the kernel and user vmas. The
      bpf program can allocate pages from that region via
      bpf_arena_alloc_pages(). This kernel function will insert pages into the
      kernel vm_area. The subsequent fault-in from user space will populate that
      page into the user vma. The BPF_F_SEGV_ON_FAULT flag at arena creation time
      can be used to prevent fault-in from user space. In such a case, if a page
      is not allocated by the bpf program and not present in the kernel vm_area,
      the user process will segfault. This is useful for use cases 2 and 3 above.
      
      bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
      either at a specific address within the arena or allocates a range with the
      maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees
      pages and removes the range from the kernel vm_area and from user process
      vmas.
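      
      For illustration, a minimal bpf-program-side usage sketch (it assumes the
      selftests' conventions: vmlinux.h/libbpf headers, an __arena tag for
      address_space(1) pointers, and the kernel's NUMA_NO_NODE):
      
        struct {
                __uint(type, BPF_MAP_TYPE_ARENA);
                __uint(map_flags, BPF_F_MMAPABLE);
                __uint(max_entries, 1000);  /* max number of arena pages */
        } arena SEC(".maps");
        
        SEC("syscall")
        int alloc_and_free(void *ctx)
        {
                /* allocate one page anywhere within the arena ... */
                void __arena *page = bpf_arena_alloc_pages(&arena, NULL, 1,
                                                           NUMA_NO_NODE, 0);
                if (!page)
                        return 1;
                /* ... and hand it back */
                bpf_arena_free_pages(&arena, page, 1);
                return 0;
        }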
      
      bpf_arena can be used as a bpf program "heap" of up to 4GB. In this use
      case (case 3 above), the speed of the bpf program is more important than
      ease of sharing with user space, and the BPF_F_NO_USER_CONV flag is
      recommended. It tells the verifier to treat the rX = bpf_arena_cast_user(rY)
      instruction as a 32-bit move wX = wY, which improves bpf prog
      performance. Otherwise, bpf_arena_cast_user is translated by JIT to
      conditionally add the upper 32 bits of user vm_start (if the pointer is not
      NULL) to arena pointers before they are stored into memory. This way, user
      space sees them as valid 64-bit pointers.
      
      Diff https://github.com/llvm/llvm-project/pull/84410 enables the LLVM BPF
      backend to generate the bpf_addr_space_cast() instruction to cast pointers
      between address_space(1), which is reserved for bpf_arena pointers, and the
      default address space zero. All arena pointers in a bpf program written in
      C are tagged as __attribute__((address_space(1))). Hence, clang provides
      helpful diagnostics when pointers cross address spaces. Libbpf and the
      kernel support only address_space == 1. All other address space
      identifiers are reserved.
      
      rX = bpf_addr_space_cast(rY, /* dst_as */ 0, /* src_as */ 1) tells the
      verifier that rX->type = PTR_TO_ARENA. Any further operations on a
      PTR_TO_ARENA register have to be in the 32-bit domain. The verifier will
      mark load/store through PTR_TO_ARENA with PROBE_MEM32. JIT will generate
      them as kern_vm_start + 32bit_addr memory accesses. The behavior is similar
      to copy_from_kernel_nofault() except that no address checks are necessary.
      The address is guaranteed to be in the 4GB range. If the page is not
      present, the destination register is zeroed on read, and the operation is
      ignored on write.
      
      rX = bpf_addr_space_cast(rY, 1, 0) tells the verifier that rX->type =
      unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then the
      verifier converts such cast instructions to mov32. Otherwise, JIT will emit
      native code equivalent to:
      rX = (u32)rY;
      if (rX)
        rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */
      
      After such a conversion, the pointer becomes a valid user pointer within
      the bpf_arena range. The user process can access data structures created in
      bpf_arena without any additional computations. For example, a linked list
      built by a bpf program can be walked natively by user space.
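      
      On the user-space side, the sharing boils down to an mmap of the arena map
      fd (a sketch; it assumes a BPF_F_MMAPABLE arena whose user address range
      corresponds to where this mapping lands, which libbpf normally arranges):
      
        #include <stddef.h>
        #include <sys/mman.h>
        
        /* map_fd is the arena map's fd; sz is the arena size in bytes */
        static void *map_arena(int map_fd, size_t sz)
        {
                /* page faults here are serviced by bpf_arena; pointers the bpf
                   program stored after cast_user are then directly usable */
                return mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED,
                            map_fd, 0);
        }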
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Barret Rhoden <brho@google.com>
      Link: https://lore.kernel.org/bpf/20240308010812.89848-2-alexei.starovoitov@gmail.com
    • selftests/bpf: Add fexit and kretprobe triggering benchmarks · 365c2b32
      Andrii Nakryiko authored
      We already have kprobe and fentry benchmarks. Let's add kretprobe and
      fexit ones for completeness.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/bpf/20240309005124.3004446-1-andrii@kernel.org
    • mm: Introduce vmap_page_range() to map pages in PCI address space · d7bca919
      Alexei Starovoitov authored
      ioremap_page_range() should be used only for ranges within the vmalloc
      range, which are allocated by get_vm_area(). PCI has its own "resource"
      allocator that manages the PCI_IOBASE, IO_SPACE_LIMIT address range, hence
      introduce vmap_page_range() to be used exclusively to map pages
      in the PCI address space.
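      
      For reference, the new helper's shape (a sketch; mm/vmalloc.c in this
      commit has the authoritative version):
      
        /* map the virtual range [addr, end) to the physically contiguous
         * range starting at phys_addr; unlike ioremap_page_range(), the
         * virtual range is not required to come from get_vm_area() */
        int vmap_page_range(unsigned long addr, unsigned long end,
                            phys_addr_t phys_addr, pgprot_t prot);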
      
      Fixes: 3e49a866 ("mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.")
      Reported-by: Miguel Ojeda <ojeda@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Miguel Ojeda <ojeda@kernel.org>
      Link: https://lore.kernel.org/bpf/CANiq72ka4rir+RTN2FQoT=Vvprp_Ao-CvoYEkSNqtSY+RZj+AA@mail.gmail.com
  2. 09 Mar, 2024 (1 commit)
  3. 08 Mar, 2024 (4 commits)
    • Merge branch 'fix-hash-bucket-overflow-checks-for-32-bit-arches' · a27e8967
      Alexei Starovoitov authored
      Toke Høiland-Jørgensen says:
      
      ====================
      Fix hash bucket overflow checks for 32-bit arches
      
      Syzbot managed to trigger a crash by creating a DEVMAP_HASH map with a
      large number of buckets, because the overflow check relies on overflow
      behaviour that is only well-defined on 64-bit arches.
      
      Fix the overflow checks to happen before values are rounded up in all
      the affected map types.
      
      v3:
      - Keep the htab->n_buckets > U32_MAX / sizeof(struct bucket) check
      - Use 1UL << 31 instead of U32_MAX / 2 + 1 as the constant to check
        against
      - Add patch to fix stackmap.c
      v2:
      - Fix off-by-one error in overflow check
      - Apply the same fix to hashtab, where the devmap_hash code was copied
        from (John)
      
      Toke Høiland-Jørgensen (3):
        bpf: Fix DEVMAP_HASH overflow check on 32-bit arches
        bpf: Fix hashtab overflow check on 32-bit arches
        bpf: Fix stackmap overflow check on 32-bit arches
      
       kernel/bpf/devmap.c   | 11 ++++++-----
       kernel/bpf/hashtab.c  | 14 +++++++++-----
       kernel/bpf/stackmap.c |  9 ++++++---
       3 files changed, 21 insertions(+), 13 deletions(-)
      ====================
      
      Link: https://lore.kernel.org/r/20240307120340.99577-1-toke@redhat.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Fix stackmap overflow check on 32-bit arches · 7a4b2125
      Toke Høiland-Jørgensen authored
      The stackmap code relies on roundup_pow_of_two() to compute the number
      of hash buckets, and contains an overflow check by checking if the
      resulting value is 0. However, on 32-bit arches, the roundup code itself
      can overflow by doing a left-shift of 32 bits on an unsigned long value,
      which is undefined behaviour when unsigned long is 32 bits wide, so it is
      not guaranteed to truncate
      neatly. This was triggered by syzbot on the DEVMAP_HASH type, which
      contains the same check, copied from the hashtab code.
      
      The commit in the fixes tag actually attempted to fix this, but the fix
      did not account for the UB, so the fix only works on CPUs where an
      overflow does result in a neat truncation to zero, which is not
      guaranteed. Checking the value before rounding does not have this
      problem.
      
      Fixes: 6183f4d3 ("bpf: Check for integer overflow when using roundup_pow_of_two()")
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Reviewed-by: Bui Quang Minh <minhquangbui99@gmail.com>
      Message-ID: <20240307120340.99577-4-toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Fix hashtab overflow check on 32-bit arches · 6787d916
      Toke Høiland-Jørgensen authored
      The hashtab code relies on roundup_pow_of_two() to compute the number of
      hash buckets, and contains an overflow check by checking if the
      resulting value is 0. However, on 32-bit arches, the roundup code itself
      can overflow by doing a left-shift of 32 bits on an unsigned long value,
      which is undefined behaviour when unsigned long is 32 bits wide, so it is
      not guaranteed to truncate
      neatly. This was triggered by syzbot on the DEVMAP_HASH type, which
      contains the same check, copied from the hashtab code. So apply the same
      fix to hashtab, by moving the overflow check to before the roundup.
      
      Fixes: daaf427c ("bpf: fix arraymap NULL deref and missing overflow and zero size checks")
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Message-ID: <20240307120340.99577-3-toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Fix DEVMAP_HASH overflow check on 32-bit arches · 281d464a
      Toke Høiland-Jørgensen authored
      The devmap code allocates a number of hash buckets equal to the next power
      of two of the max_entries value provided when creating the map. When
      rounding up to the next power of two, the 32-bit variable storing the
      number of buckets can overflow, and the code checks for overflow by
      checking if the truncated 32-bit value is equal to 0. However, on 32-bit
      arches the rounding up itself can overflow mid-way through, because it
      ends up doing a left-shift of 32 bits on an unsigned long value. If the
      size of an unsigned long is four bytes, this is undefined behaviour, so
      there is no guarantee that we'll end up with a nice and tidy 0-value at
      the end.
      
      Syzbot managed to turn this into a crash on arm32 by creating a
      DEVMAP_HASH with max_entries > 0x80000000 and then trying to update it.
      Fix this by moving the overflow check to before the rounding up
      operation.
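      
      The shape of the fix, sketched (the same pattern is applied in all three
      patches of this series; the bound and error path here follow the devmap
      case only approximately):
      
        /* a roundup of anything above 1UL << 31 cannot be represented in
         * 32 bits, so reject it before calling roundup_pow_of_two() */
        if (attr->max_entries > (1UL << 31))
                return -EINVAL;
        dtab->n_buckets = roundup_pow_of_two(attr->max_entries);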
      
      Fixes: 6f9d451a ("xdp: Add devmap_hash map type for looking up devices by hashed index")
      Link: https://lore.kernel.org/r/000000000000ed666a0611af6818@google.com
      Reported-and-tested-by: syzbot+8cd36f6b65f3cafd400a@syzkaller.appspotmail.com
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Message-ID: <20240307120340.99577-2-toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  4. 07 Mar, 2024 (7 commits)
  5. 06 Mar, 2024 (21 commits)