• Daniel Borkmann's avatar
    bpf: fix unconnected udp hooks · 591c18e3
    Daniel Borkmann authored
    commit 983695fa upstream.
    
    Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
    to applications as also stated in original motivation in 7828f20e ("Merge
    branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
    two hooks into Cilium to enable host based load-balancing with Kubernetes,
    I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
    typically sets up DNS as a service and is thus subject to load-balancing.
    
    Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
    is currently insufficient and thus not usable as-is for standard applications
    shipped with most distros. To break down the issue we ran into with a simple
    example:
    
      # cat /etc/resolv.conf
      nameserver 147.75.207.207
      nameserver 147.75.207.208
    
    For the purpose of a simple test, we set up above IPs as service IPs and
    transparently redirect traffic to a different DNS backend server for that
    node:
    
      # cilium service list
      ID   Frontend            Backend
      1    147.75.207.207:53   1 => 8.8.8.8:53
      2    147.75.207.208:53   1 => 8.8.8.8:53
    
    The attached BPF program is basically selecting one of the backends if the
    service IP/port matches on the cgroup hook. DNS breaks here, because the
    hooks are not transparent enough to applications which have built-in msg_name
    address checks:
    
      # nslookup 1.1.1.1
      ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
      ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
      ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
      [...]
      ;; connection timed out; no servers could be reached
    
      # dig 1.1.1.1
      ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
      ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
      ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
      [...]
    
      ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
      ;; global options: +cmd
      ;; connection timed out; no servers could be reached
    
    For comparison, if none of the service IPs is used, and we tell nslookup
    to use 8.8.8.8 directly it works just fine, of course:
    
      # nslookup 1.1.1.1 8.8.8.8
      1.1.1.1.in-addr.arpa	name = one.one.one.one.
    
    In order to fix this and thus act more transparent to the application,
    this needs reverse translation on recvmsg() side. A minimal fix for this
    API is to add similar recvmsg() hooks behind the BPF cgroups static key
    such that the program can track state and replace the current sockaddr_in{,6}
    with the original service IP. From BPF side, this basically tracks the
    service tuple plus socket cookie in an LRU map where the reverse NAT can
    then be retrieved via map value as one example. Side-note: the BPF cgroups
    static key should be converted to a per-hook static key in future.
    
    Same example after this fix:
    
      # cilium service list
      ID   Frontend            Backend
      1    147.75.207.207:53   1 => 8.8.8.8:53
      2    147.75.207.208:53   1 => 8.8.8.8:53
    
    Lookups work fine now:
    
      # nslookup 1.1.1.1
      1.1.1.1.in-addr.arpa    name = one.one.one.one.
    
      Authoritative answers can be found from:
    
      # dig 1.1.1.1
    
      ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
      ;; global options: +cmd
      ;; Got answer:
      ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
      ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
    
      ;; OPT PSEUDOSECTION:
      ; EDNS: version: 0, flags:; udp: 512
      ;; QUESTION SECTION:
      ;1.1.1.1.                       IN      A
    
      ;; AUTHORITY SECTION:
      .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
    
      ;; Query time: 17 msec
      ;; SERVER: 147.75.207.207#53(147.75.207.207)
      ;; WHEN: Tue May 21 12:59:38 UTC 2019
      ;; MSG SIZE  rcvd: 111
    
    And from an actual packet level it shows that we're using the back end
    server when talking via 147.75.207.20{7,8} front end:
    
      # tcpdump -i any udp
      [...]
      12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
      12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
      12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
      12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
      [...]
    
    In order to be flexible and to have same semantics as in sendmsg BPF
    programs, we only allow return codes in [1,1] range. In the sendmsg case
    the program is called if msg->msg_name is present which can be the case
    in both, connected and unconnected UDP.
    
    The former only relies on the sockaddr_in{,6} passed via connect(2) if
    passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
    way to call into the BPF program whenever a non-NULL msg->msg_name was
    passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
    that for TCP case, the msg->msg_name is ignored in the regular recvmsg
    path and therefore not relevant.
    
    For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
    the hook is not called. This is intentional as it aligns with the same
    semantics as in case of TCP cgroup BPF hooks right now. This might be
    better addressed in future through a different bpf_attach_type such
    that this case can be distinguished from the regular recvmsg paths,
    for example.
    
    Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
    Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
    Acked-by: default avatarAndrey Ignatov <rdna@fb.com>
    Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
    Acked-by: default avatarMartynas Pumputis <m@lambda.lt>
    Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
    Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    591c18e3
syscall.c 64.3 KB