    netkit, bpf: Add bpf programmable net device · 35dfaad7
    Daniel Borkmann authored
    This work adds a new, minimal BPF-programmable device called "netkit"
    (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
    core idea is that BPF programs are executed within the driver's xmit routine,
    which, e.g. in the case of containers/Pods, moves BPF processing closer to
    the source.
    
    One of the goals was that, in the case of Pod egress traffic, BPF programs
    can be moved from hostns tcx ingress into the device itself, providing
    earlier drop or forward mechanisms. For example, if the BPF program
    determines that the skb must be sent out of the node, then a redirect to
    the physical device can take place directly without going through the per-CPU
    backlog queue. This helps to shift processing for such traffic from softirq
    to process context, leading to better scheduling decisions/performance (see
    measurements in the slides).
    
    In this initial version, the netkit device ships as a pair, but we plan to
    extend this further so it can also operate in single device mode. The pair
    comes with a primary and a peer device. Only the primary device, typically
    residing in hostns, can manage BPF programs for itself and its peer. The
    peer device is designated for containers/Pods and cannot attach/detach
    BPF programs. Upon device creation, the user can set the default policy
    to 'pass' or 'drop' for the case when no BPF program is attached.
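
The role of the default policy can be illustrated with a small userspace model. This is a minimal sketch, not the driver code: the verdict names, the `run_progs()` helper and the stand-in "programs" are all hypothetical, and the rule that a packet passes when every attached program defers is an assumption of the model. It only shows that the configured policy decides when no program is attached, and that an attached program's verdict takes precedence otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical verdicts for this model; the names are illustrative only. */
enum verdict { V_NEXT, V_PASS, V_DROP, V_REDIRECT };

/* Stand-in for an attached BPF program: inspects a packet, returns a verdict. */
typedef enum verdict (*prog_fn)(const void *pkt);

/*
 * Run the attached programs in order. With no program attached, the
 * device's default policy decides. A program returning V_NEXT defers
 * to the next one; any other verdict ends the run.
 */
static enum verdict run_progs(const prog_fn *progs, size_t n,
                              enum verdict default_policy, const void *pkt)
{
    enum verdict ret = default_policy;

    for (size_t i = 0; i < n; i++) {
        ret = progs[i](pkt);
        if (ret != V_NEXT)
            return ret;
    }
    /* Assumption of this model: if every program defers, the packet passes. */
    return ret == V_NEXT ? V_PASS : ret;
}

/* Trivial stand-in programs used by the example below. */
static enum verdict prog_defer(const void *pkt)    { (void)pkt; return V_NEXT; }
static enum verdict prog_redirect(const void *pkt) { (void)pkt; return V_REDIRECT; }

static void example(void)
{
    const prog_fn chain[] = { prog_defer, prog_redirect };

    /* No program attached: the default policy is the verdict. */
    assert(run_progs(NULL, 0, V_DROP, NULL) == V_DROP);
    assert(run_progs(NULL, 0, V_PASS, NULL) == V_PASS);

    /* Programs attached: the first non-deferring verdict wins. */
    assert(run_progs(chain, 2, V_DROP, NULL) == V_REDIRECT);
}
```

In this model, a peer created with a 'drop' default emits nothing until a program is attached.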
    
    Additionally, the device can be operated in L3 (default) or L2 mode. The
    management of BPF programs is done via bpf_mprog, so that multi-attach is
    supported right from the beginning with a similar API and dependency controls
    as tcx. For details on the latter see commit 053c8e1f ("bpf: Add generic
    attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
    so that existing programs can be easily migrated.
    
    Going forward, we plan to use netkit devices in Cilium as the main device
    type for connecting Pods. They will be operated in L3 mode in order to
    simplify a Pod's neighbor management, and the peer will operate in default
    drop mode, so that no traffic leaves between the time when a Pod is
    brought up by the CNI plugin and the time programs are attached by the agent.
    Additionally, the programs we attach via tcx on the physical devices use
    bpf_redirect_peer() for inbound traffic into the netkit device, hence the
    latter also supports the ndo_get_peer_dev callback. Similarly, we use
    bpf_redirect_neigh() for the way out, pushing from the netkit peer to the
    phys device directly. Also, BIG TCP is supported on netkit devices. For the
    follow-up work in single device mode, we plan to convert Cilium's
    cilium_host/_net devices into a single one.
    
    An extensive test suite for checking device operations and the BPF program
    and link management API comes as BPF selftests in this series.
    Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
    Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://github.com/borkmann/iproute2/tree/pr/netkit
    Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
    Link: https://lore.kernel.org/r/20231024214904.29825-2-daniel@iogearbox.net
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>