    bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt · eb18b49e
    Martin KaFai Lau authored
    This patch allows the bpf-tcp-cc to call bpf_setsockopt.  One use
    case is to allow a bpf-tcp-cc to switch to another cc during init().
    For example, when the tcp flow is not ECN ready, the bpf_dctcp
    can switch to another cc by calling setsockopt(TCP_CONGESTION).
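    
    The fallback described above can be pictured with a small userspace
    model.  This is only a sketch: struct model_sock, its ecn_ok flag, and
    the model_* functions are hypothetical stand-ins, not kernel or libbpf
    APIs, and the real bpf_dctcp would call bpf_setsockopt() on the sk.

```c
#include <stdbool.h>
#include <string.h>

#define MODEL_CA_NAME_MAX 16 /* assumed buffer size for a cc name */

/* Hypothetical stand-in for the socket state a cc cares about. */
struct model_sock {
    bool ecn_ok;                     /* models "the flow is ECN ready" */
    char ca_name[MODEL_CA_NAME_MAX]; /* name of the active cc */
};

/* Models setsockopt(TCP_CONGESTION): replace the active cc by name. */
static int model_set_congestion(struct model_sock *sk, const char *name)
{
    size_t len = strlen(name);

    if (len == 0 || len >= MODEL_CA_NAME_MAX)
        return -1;
    memcpy(sk->ca_name, name, len + 1);
    return 0;
}

/* Models a dctcp-like init(): without ECN, hand the flow to a fallback. */
static void model_dctcp_init(struct model_sock *sk, const char *fallback)
{
    if (!sk->ecn_ok) {
        model_set_congestion(sk, fallback);
        return;
    }
    /* ECN is available: the normal dctcp setup would go here. */
}
```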
    
    During setsockopt(TCP_CONGESTION), the new tcp-cc's init() will be
    called.  This could cause a recursion, but it is stopped by the
    current trampoline's logic (in the prog->active counter).
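    
    A minimal userspace model of such a counter-based guard is shown
    below.  It is a sketch under assumptions: the struct, the function
    names, and the exact increment-and-compare scheme are illustrative
    and are not copied from the trampoline code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a bpf prog; "active" is per-CPU in the kernel. */
struct model_prog {
    int active; /* nesting depth of this prog on the current CPU */
    int runs;   /* how many times the body actually ran */
    int misses; /* how many re-entrant calls were skipped */
};

/* Returns true only for the outermost entry. */
static bool model_prog_enter(struct model_prog *p)
{
    if (++p->active != 1) {
        p->misses++;
        return false;
    }
    return true;
}

/* Always paired with model_prog_enter(), even when the body was skipped. */
static void model_prog_exit(struct model_prog *p)
{
    p->active--;
}

/*
 * Models init() calling setsockopt(TCP_CONGESTION), which calls init()
 * again.  Here the "new" cc is the same prog, so the nested call is skipped.
 */
static void model_call_init(struct model_prog *p, int nested_calls)
{
    if (model_prog_enter(p)) {
        p->runs++;
        if (nested_calls > 0)
            model_call_init(p, nested_calls - 1);
    }
    model_prog_exit(p);
}
```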
    
    While retiring a bpf-tcp-cc (e.g. in tcp_v[46]_destroy_sock()),
    the tcp stack calls bpf-tcp-cc's release().  To avoid the retiring
    bpf-tcp-cc making further changes to the sk, bpf_setsockopt is not
    available to the bpf-tcp-cc's release().  This prevents release()
    from making a setsockopt() call that could allocate new resources.
    
    Although the bpf-tcp-cc already has a more powerful way to read tcp_sock
    from the PTR_TO_BTF_ID, it is usually expected that bpf_getsockopt and
    bpf_setsockopt are available together.  Thus, bpf_getsockopt() is also
    added to all tcp_congestion_ops except release().
    
    When the old bpf-tcp-cc calls setsockopt(TCP_CONGESTION)
    to switch to a new cc, the old bpf-tcp-cc will be released by
    bpf_struct_ops_put().  Thus, this patch also puts the bpf_struct_ops_map
    after an RCU grace period because the trampoline's image cannot be freed
    while the old bpf-tcp-cc is still running.
    
    bpf-tcp-cc can only access icsk_ca_priv as SCALAR.  All of the kernel's
    tcp-ccs also access icsk_ca_priv as SCALAR.  The size
    of icsk_ca_priv has already been raised a few times to avoid
    extra kmalloc and memory referencing.  The only exception is the
    kernel's tcp_cdg.c that stores a kmalloc()-ed pointer in icsk_ca_priv.
    To avoid the old bpf-tcp-cc accidentally overwriting tcp_cdg's pointer
    value stored in icsk_ca_priv after switching, and without over-complicating
    the bpf verifier for this one exception in tcp_cdg, this patch does not
    allow switching to tcp_cdg.  If there is a need, a bpf_tcp_cdg can be
    implemented that uses bpf_sk_storage as the extended storage.
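    
    The tcp_cdg restriction amounts to a name check before the cc switch
    is attempted.  The sketch below models that idea in userspace; the
    function name, the comparison, and the error code are assumptions
    for illustration and are not taken from the patch itself.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical check for a TCP_CONGESTION request: refuse the name "cdg",
 * whether or not optlen counts a trailing NUL, and allow everything else.
 */
static int model_check_cc_switch(const char *optval, int optlen)
{
    if (optlen >= (int)(sizeof("cdg") - 1) &&
        strncmp("cdg", optval, (size_t)optlen) == 0)
        return -EOPNOTSUPP; /* assumed error code for this model */

    return 0;
}
```

    A name such as "cubic" passes the check, while "cdg" is refused with
    or without the terminating NUL included in optlen.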
    
    The bpf_sk_setsockopt proto was only recently added and is used only
    in bpf-sockopt and bpf-iter-tcp, so impose the tcp_cdg limitation in the
    same proto instead of adding a new proto specifically for bpf-tcp-cc.
    
    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20210824173007.3976921-1-kafai@fb.com
bpf_struct_ops.c 16.6 KB