    bpf, sockmap: Fix preempt_rt splat when using raw_spin_lock_t
    Sockmap and sockhash maps are collections of psocks, objects
    representing a socket plus the metadata needed to manage the
    BPF programs associated with that socket. These maps use the
    stab->lock to protect against concurrent operations on the map,
    e.g. two updates trying to insert objects into the same array
    slot at the same time. Additionally, a sockhash map has a
    per-bucket lock to protect iteration and insert/delete of the
    hash entries.
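
    For orientation, the map state guarded by these locks looks
    roughly like the following sketch (fields abbreviated; the
    authoritative definitions live in net/core/sock_map.c):

       /* sketch only; see the real structs in net/core/sock_map.c */
       struct bpf_stab {
               struct bpf_map map;
               struct sock **sks;                /* array of socket slots */
               struct bpf_sk_psock_progs progs;
               raw_spinlock_t lock;              /* converted by this patch */
       };

       struct bpf_shtab_bucket {
               struct hlist_head head;
               raw_spinlock_t lock;              /* converted by this patch */
       };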
    
    Each psock has a psock->link list, a linked list of all the
    maps that the psock is attached to. This allows a psock (socket)
    to be included in multiple sockmap and sockhash maps at once.
    The linked list is protected by the psock->link_lock.
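
    In sketch form (fields abbreviated; the real definitions are in
    include/linux/skmsg.h), that per-psock bookkeeping is:

       struct sk_psock_link {
               struct list_head list;      /* entry in psock->link */
               struct bpf_map *map;        /* sockmap/sockhash it belongs to */
               void *link_raw;             /* slot/bucket inside that map */
       };

       struct sk_psock {
               ...
               struct list_head link;      /* list of sk_psock_link */
               spinlock_t link_lock;       /* protects the list above */
               ...
       };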
    
    The stab->lock and psock->link_lock _must_ be nested correctly
    to avoid deadlock:
    
      lock(stab->lock)
        : do BPF map operations and psock insert/delete
        lock(psock->link_lock)
           : add map to psock linked list of maps
        unlock(psock->link_lock)
      unlock(stab->lock)
    
    On non-PREEMPT_RT kernels both raw_spinlock_t and spinlock_t are
    guaranteed not to sleep. With PREEMPT_RT kernels, however, the
    spinlock_t variants may sleep. The current code has many patterns
    like this:
    
       rcu_critical_section:
          raw_spin_lock(stab->lock)
             spin_lock(psock->link_lock) <- may sleep ouch
             spin_unlock(psock->link_lock)
          raw_spin_unlock(stab->lock)
       rcu_critical_section
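
    Concretely, the sockmap update path nests the locks roughly like
    this (a simplified sketch of sock_map_update_common() and
    sock_map_add_link(), not the verbatim code):

       raw_spin_lock_bh(&stab->lock);
       osk = stab->sks[idx];
       ...
       sock_map_add_link(psock, link, map, &stab->sks[idx]);
               /* which internally does:
                *   spin_lock_bh(&psock->link_lock);   <- sleeps on RT
                *   list_add_tail(&link->list, &psock->link);
                *   spin_unlock_bh(&psock->link_lock);
                */
       stab->sks[idx] = sk;
       raw_spin_unlock_bh(&stab->lock);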
    
    Nesting spin_lock() inside a raw_spin_lock() violates the locking
    rules for PREEMPT_RT kernels. Additionally, we do GFP_ATOMIC
    allocations inside the stab->lock, and those might also sleep on
    PREEMPT_RT kernels. The result is splats like this:
    
    ./test_progs -t sockmap_basic
    [   33.344330] bpf_testmod: loading out-of-tree module taints kernel.
    [   33.441933]
    [   33.442089] =============================
    [   33.442421] [ BUG: Invalid wait context ]
    [   33.442763] 6.5.0-rc5-01731-gec0ded2e #4958 Tainted: G           O
    [   33.443320] -----------------------------
    [   33.443624] test_progs/2073 is trying to lock:
    [   33.443960] ffff888102a1c290 (&psock->link_lock){....}-{3:3}, at: sock_map_update_common+0x2c2/0x3d0
    [   33.444636] other info that might help us debug this:
    [   33.444991] context-{5:5}
    [   33.445183] 3 locks held by test_progs/2073:
    [   33.445498]  #0: ffff88811a208d30 (sk_lock-AF_INET){+.+.}-{0:0}, at: sock_map_update_elem_sys+0xff/0x330
    [   33.446159]  #1: ffffffff842539e0 (rcu_read_lock){....}-{1:3}, at: sock_map_update_elem_sys+0xf5/0x330
    [   33.446809]  #2: ffff88810d687240 (&stab->lock){+...}-{2:2}, at: sock_map_update_common+0x177/0x3d0
    [   33.447445] stack backtrace:
    [   33.447655] CPU: 10 PID
    
    To fix this, observe that we can't readily remove the allocations
    (for that we would need to use/create something similar to
    bpf_map_alloc), so instead convert the raw_spinlock_t locks to
    spinlock_t. Note that the sock_map update path, which triggers
    the allocation and the potential sleep, is only reachable through
    sys_bpf ops and via sock_ops, which precludes hard interrupts and
    low-level atomic sections on PREEMPT_RT kernels. On non-PREEMPT_RT
    kernels nothing changes here: the spin lock sections and the
    GFP_ATOMIC allocations still do not sleep.
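
    In sketch form, the conversion is a type and accessor change in
    sock_map.c along these lines (illustrative, not the full diff;
    the sockhash bucket lock follows the same pattern):

       struct bpf_stab {
               ...
               spinlock_t lock;        /* was: raw_spinlock_t lock; */
       };

       spin_lock_bh(&stab->lock);      /* was: raw_spin_lock_bh(...) */
       ...
       spin_unlock_bh(&stab->lock);    /* was: raw_spin_unlock_bh(...) */
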
    Signed-off-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20230830053517.166611-1-john.fastabend@gmail.com