  1. 06 Jun, 2019 7 commits
    • Merge branch 'fix-unconnected-udp' · 4aeba328
      Alexei Starovoitov authored
      Daniel Borkmann says:
      
      ====================
      Please refer to the patch 1/6 as the main patch with the details
      on the current sendmsg hook API limitations and proposal to fix
      it in order to work with basic applications like DNS. Remaining
      patches are the usual uapi and tooling updates as well as test
      cases. Thanks a lot!
      
      v2 -> v3:
        - Add attach types to test_section_names.c and libbpf (Andrey)
        - Added given Acks, rest as-is
      v1 -> v2:
        - Split off uapi header sync and bpftool bits (Martin, Alexei)
        - Added missing bpftool doc and bash completion as well
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      4aeba328
    • bpf: expand section tests for test_section_names · b714560f
      Daniel Borkmann authored
      Add cgroup/recvmsg{4,6} to test_section_names as well. Test run output:
      
        # ./test_section_names
        libbpf: failed to guess program type based on ELF section name 'InvAliD'
        libbpf: supported section(type) names are: [...]
        libbpf: failed to guess attach type based on ELF section name 'InvAliD'
        libbpf: attachable section(type) names are: [...]
        libbpf: failed to guess program type based on ELF section name 'cgroup'
        libbpf: supported section(type) names are: [...]
        libbpf: failed to guess attach type based on ELF section name 'cgroup'
        libbpf: attachable section(type) names are: [...]
        Summary: 38 PASSED, 0 FAILED
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b714560f
    • bpf: more msg_name rewrite tests to test_sock_addr · 1812291e
      Daniel Borkmann authored
      Extend test_sock_addr for recvmsg test cases, bigger parts of the
      sendmsg code can be reused for this. Below is the strace view of
      the recvmsg rewrites; the sendmsg side does not have a BPF prog
      connected to it for the context of this test:
      
      IPv4 test case:
      
        [pid  4846] bpf(BPF_PROG_ATTACH, {target_fd=3, attach_bpf_fd=4, attach_type=0x13 /* BPF_??? */, attach_flags=BPF_F_ALLOW_OVERRIDE}, 112) = 0
        [pid  4846] socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
        [pid  4846] bind(5, {sa_family=AF_INET, sin_port=htons(4444), sin_addr=inet_addr("127.0.0.1")}, 128) = 0
        [pid  4846] socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 6
        [pid  4846] sendmsg(6, {msg_name={sa_family=AF_INET, sin_port=htons(4444), sin_addr=inet_addr("127.0.0.1")}, msg_namelen=128, msg_iov=[{iov_base="a", iov_len=1}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1
        [pid  4846] select(6, [5], NULL, NULL, {tv_sec=2, tv_usec=0}) = 1 (in [5], left {tv_sec=1, tv_usec=999995})
        [pid  4846] recvmsg(5, {msg_name={sa_family=AF_INET, sin_port=htons(4040), sin_addr=inet_addr("192.168.1.254")}, msg_namelen=128->16, msg_iov=[{iov_base="a", iov_len=64}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1
        [pid  4846] close(6)                    = 0
        [pid  4846] close(5)                    = 0
        [pid  4846] bpf(BPF_PROG_DETACH, {target_fd=3, attach_type=0x13 /* BPF_??? */}, 112) = 0
      
      IPv6 test case:
      
        [pid  4846] bpf(BPF_PROG_ATTACH, {target_fd=3, attach_bpf_fd=4, attach_type=0x14 /* BPF_??? */, attach_flags=BPF_F_ALLOW_OVERRIDE}, 112) = 0
        [pid  4846] socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 5
        [pid  4846] bind(5, {sa_family=AF_INET6, sin6_port=htons(6666), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 128) = 0
        [pid  4846] socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 6
        [pid  4846] sendmsg(6, {msg_name={sa_family=AF_INET6, sin6_port=htons(6666), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=128, msg_iov=[{iov_base="a", iov_len=1}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1
        [pid  4846] select(6, [5], NULL, NULL, {tv_sec=2, tv_usec=0}) = 1 (in [5], left {tv_sec=1, tv_usec=999996})
        [pid  4846] recvmsg(5, {msg_name={sa_family=AF_INET6, sin6_port=htons(6060), inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=128->28, msg_iov=[{iov_base="a", iov_len=64}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1
        [pid  4846] close(6)                    = 0
        [pid  4846] close(5)                    = 0
        [pid  4846] bpf(BPF_PROG_DETACH, {target_fd=3, attach_type=0x14 /* BPF_??? */}, 112) = 0
      
      test_sock_addr run w/o strace view:
      
        # ./test_sock_addr.sh
        [...]
        Test case: recvmsg4: return code ok .. [PASS]
        Test case: recvmsg4: return code !ok .. [PASS]
        Test case: recvmsg6: return code ok .. [PASS]
        Test case: recvmsg6: return code !ok .. [PASS]
        Test case: recvmsg4: rewrite IP & port (asm) .. [PASS]
        Test case: recvmsg6: rewrite IP & port (asm) .. [PASS]
        [...]
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1812291e
    • bpf, bpftool: enable recvmsg attach types · 000aa125
      Daniel Borkmann authored
      Trivial patch to bpftool in order to complete enabling attaching programs
      to BPF_CGROUP_UDP{4,6}_RECVMSG.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      000aa125
    • bpf, libbpf: enable recvmsg attach types · 9bb59ac1
      Daniel Borkmann authored
      Another trivial patch to libbpf in order to enable identifying and
      attaching programs to BPF_CGROUP_UDP{4,6}_RECVMSG by section name.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9bb59ac1
    • bpf: sync tooling uapi header · 3dbc6ada
      Daniel Borkmann authored
      Sync BPF uapi header in order to pull in BPF_CGROUP_UDP{4,6}_RECVMSG
      attach types. This is done and preferred as an extra patch in order
      to ease sync of libbpf.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3dbc6ada
    • bpf: fix unconnected udp hooks · 983695fa
      Daniel Borkmann authored
      Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
      to applications as also stated in original motivation in 7828f20e ("Merge
      branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
      two hooks into Cilium to enable host based load-balancing with Kubernetes,
      I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
      typically sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
      is currently insufficient and thus not usable as-is for standard applications
      shipped with most distros. To break down the issue we ran into with a simple
      example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparently to the application,
      this needs reverse translation on recvmsg() side. A minimal fix for this
      API is to add similar recvmsg() hooks behind the BPF cgroups static key
      such that the program can track state and replace the current sockaddr_in{,6}
      with the original service IP. From BPF side, this basically tracks the
      service tuple plus socket cookie in an LRU map where the reverse NAT can
      then be retrieved via map value as one example. Side-note: the BPF cgroups
      static key should be converted to a per-hook static key in future.
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And from an actual packet level it shows that we're using the back end
      server when talking via 147.75.207.20{7,8} front end:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have same semantics as in sendmsg BPF
      programs, we only allow return codes in [1,1] range. In the sendmsg case
      the program is called if msg->msg_name is present which can be the case
      in both, connected and unconnected UDP.
      
      The former only relies on the sockaddr_in{,6} passed via connect(2) if
      passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
      way to call into the BPF program whenever a non-NULL msg->msg_name was
      passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
      that for TCP case, the msg->msg_name is ignored in the regular recvmsg
      path and therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martynas Pumputis <m@lambda.lt>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      983695fa
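The reverse translation described in the commit above (track the service tuple plus socket cookie on sendmsg, then restore the original service address on recvmsg) can be illustrated with a small userspace simulation. This is only a sketch under stated assumptions: it is not BPF code and not Cilium's implementation, the names `on_sendmsg`, `on_recvmsg`, `struct addr4` and `rev_nat` are hypothetical, and a fixed round-robin table stands in for the LRU map mentioned in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a host-order IPv4 address from dotted-quad parts (illustrative only). */
#define IP4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

struct addr4 {
	uint32_t ip;
	uint16_t port;
};

/* One reverse-NAT record: (socket cookie, backend) -> original service. */
struct rev_nat_entry {
	int used;
	uint64_t cookie;
	struct addr4 backend;
	struct addr4 service;
};

#define REV_NAT_SLOTS 4
static struct rev_nat_entry rev_nat[REV_NAT_SLOTS];
static unsigned int rev_nat_next; /* round-robin slot, standing in for LRU eviction */

static int addr_eq(const struct addr4 *a, const struct addr4 *b)
{
	return a->ip == b->ip && a->port == b->port;
}

/*
 * sendmsg side: if dst is the service address, remember the mapping for this
 * socket and rewrite dst to the backend. Returns 1 if dst was rewritten.
 */
int on_sendmsg(uint64_t cookie, struct addr4 *dst,
	       const struct addr4 *service, const struct addr4 *backend)
{
	struct rev_nat_entry *e;

	assert(dst != NULL && service != NULL && backend != NULL);
	if (!addr_eq(dst, service))
		return 0;

	e = &rev_nat[rev_nat_next++ % REV_NAT_SLOTS];
	e->used = 1;
	e->cookie = cookie;
	e->backend = *backend;
	e->service = *service;
	*dst = *backend;
	return 1;
}

/*
 * recvmsg side: if the reply source is a backend this socket was redirected
 * to, put the original service address back. Returns 1 if src was rewritten.
 */
int on_recvmsg(uint64_t cookie, struct addr4 *src)
{
	size_t i;

	assert(src != NULL);
	for (i = 0; i < REV_NAT_SLOTS; i++) {
		const struct rev_nat_entry *e = &rev_nat[i];

		if (e->used && e->cookie == cookie && addr_eq(&e->backend, src)) {
			*src = e->service;
			return 1;
		}
	}
	return 0;
}
```

With the addresses from the example, a socket that sends to 147.75.207.207:53 has its destination rewritten to 8.8.8.8:53, and a reply from 8.8.8.8:53 on the same socket is reported to the application as coming from 147.75.207.207:53, which is the source a resolver such as nslookup expects. A socket with a different cookie sees the backend address unchanged.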
  2. 05 Jun, 2019 2 commits
    • tools: bpftool: Fix JSON output when lookup fails · 1884c066
      Krzesimir Nowak authored
      In commit 9a5ab8bf ("tools: bpftool: turn err() and info() macros
      into functions") one case of error reporting was special cased, so it
      could report a lookup error for a specific key when dumping the map
      element. What the code forgot to do is to wrap the key and value keys
      into a JSON object, so an example output of pretty JSON dump of a
      sockhash map (which does not support looking up its values) is:
      
      [
          "key": ["0x0a","0x41","0x00","0x02","0x1f","0x78","0x00","0x00"
          ],
          "value": {
              "error": "Operation not supported"
          },
          "key": ["0x0a","0x41","0x00","0x02","0x1f","0x78","0x00","0x01"
          ],
          "value": {
              "error": "Operation not supported"
          }
      ]
      
      Note the key-value pairs inside the toplevel array. They should be
      wrapped inside a JSON object, otherwise the output is invalid JSON. This
      commit fixes this, so the output now is:
      
      [{
              "key": ["0x0a","0x41","0x00","0x02","0x1f","0x78","0x00","0x00"
              ],
              "value": {
                  "error": "Operation not supported"
              }
          },{
              "key": ["0x0a","0x41","0x00","0x02","0x1f","0x78","0x00","0x01"
              ],
              "value": {
                  "error": "Operation not supported"
              }
          }
      ]
      
      Fixes: 9a5ab8bf ("tools: bpftool: turn err() and info() macros into functions")
      Cc: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      1884c066
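The structural point of the fix above is that each key/value pair must sit inside its own JSON object within the top-level array. The following is a minimal sketch of that shape using plain `snprintf`; it does not use bpftool's JSON writer, and the function name `dump_lookup_errors` and its parameters are hypothetical. It assumes the keys are already rendered as JSON arrays and that the error text needs no escaping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write a JSON array with one object per failed lookup into buf.
 * keys[i] is a pre-rendered JSON array such as ["0x0a","0x41"].
 * Returns the number of characters written (excluding the NUL),
 * or -1 if buf is too small.
 */
int dump_lookup_errors(char *buf, size_t len, const char *const *keys,
		       size_t nkeys, const char *err)
{
	size_t off, i;
	int n;

	assert(buf != NULL && keys != NULL && err != NULL);

	n = snprintf(buf, len, "[");
	if (n < 0 || (size_t)n >= len)
		return -1;
	off = (size_t)n;

	for (i = 0; i < nkeys; i++) {
		/* The braces open and close the per-element object. */
		n = snprintf(buf + off, len - off,
			     "%s{\"key\":%s,\"value\":{\"error\":\"%s\"}}",
			     i ? "," : "", keys[i], err);
		if (n < 0 || (size_t)n >= len - off)
			return -1;
		off += (size_t)n;
	}

	n = snprintf(buf + off, len - off, "]");
	if (n < 0 || (size_t)n >= len - off)
		return -1;
	return (int)(off + (size_t)n);
}
```

Dropping the `{` and `}` from the format string reproduces the invalid output quoted in the commit message: bare `"key": ...` members placed directly inside an array.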
    • selftests/bpf: move test_lirc_mode2_user to TEST_GEN_PROGS_EXTENDED · 25a7991c
      Hangbin Liu authored
      test_lirc_mode2_user is included in test_lirc_mode2.sh test and should
      not be run directly.
      
      Fixes: 6bdd533c ("bpf: add selftest for lirc_mode2 type program")
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      25a7991c
  3. 03 Jun, 2019 3 commits
    • Merge branch 'reuseport-fixes' · e7f3dd28
      Alexei Starovoitov authored
      Martin Lau says:
      
      ====================
      This series has fixes when running reuseport's bpf_prog for udp lookup.
      If there is reuseport's bpf_prog, the common issue is the reuseport code
      path expects skb->data pointing to the transport header (udphdr here).
      A couple of commits broke this expectation.  The issue is specific
      to running bpf_prog, so bpf tag is used for this series.
      
      Please refer to the individual commit message for details.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e7f3dd28
    • bpf: udp: Avoid calling reuseport's bpf_prog from udp_gro · 257a525f
      Martin KaFai Lau authored
      When the commit a6024562 ("udp: Add GRO functions to UDP socket")
      added udp[46]_lib_lookup_skb to the udp_gro code path, it broke
      the reuseport_select_sock() assumption that skb->data is pointing
      to the transport header.
      
      This patch follows an earlier __udp6_lib_err() fix by
      passing a NULL skb to avoid calling the reuseport's bpf_prog.
      
      Fixes: a6024562 ("udp: Add GRO functions to UDP socket")
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      257a525f
    • bpf: udp: ipv6: Avoid running reuseport's bpf_prog from __udp6_lib_err · 4ac30c4b
      Martin KaFai Lau authored
      __udp6_lib_err() may be called when handling icmpv6 message. For example,
      the icmpv6 toobig(type=2).  __udp6_lib_lookup() is then called
      which may call reuseport_select_sock().  reuseport_select_sock() will
      call into a bpf_prog (if there is one).
      
      reuseport_select_sock() is expecting the skb->data pointing to the
      transport header (udphdr in this case).  For example, run_bpf_filter()
      is pulling the transport header.
      
      However, in the __udp6_lib_err() path, the skb->data is pointing to the
      ipv6hdr instead of the udphdr.
      
      One option is to pull and push the ipv6hdr in __udp6_lib_err().
      Instead of doing this, this patch follows how the original
      commit 538950a1 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
      was done in IPv4, which has passed a NULL skb pointer to
      reuseport_select_sock().
      
      Fixes: 538950a1 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
      Cc: Craig Gallek <kraig@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Craig Gallek <kraig@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      4ac30c4b
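Why it matters whether skb->data points at the ipv6hdr or at the udphdr can be seen in a simplified userspace model. The sketch below uses a flat byte buffer in place of an skb, and the helper names `udp_dport_at` and `build_packet` are hypothetical; it only shows that code which assumes a UDP header at the data pointer reads unrelated bytes when the pointer still sits at the IPv6 header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IPV6_HDR_LEN 40 /* fixed IPv6 header size */
#define UDP_HDR_LEN   8

/* Reads the UDP destination port, assuming data points at the UDP header. */
uint16_t udp_dport_at(const uint8_t *data)
{
	assert(data != NULL);
	return (uint16_t)((data[2] << 8) | data[3]); /* bytes 2..3, network order */
}

/* Fill pkt with a zeroed IPv6 header followed by a UDP header. */
void build_packet(uint8_t *pkt, uint16_t sport, uint16_t dport)
{
	uint8_t *udp = pkt + IPV6_HDR_LEN;

	assert(pkt != NULL);
	memset(pkt, 0, IPV6_HDR_LEN + UDP_HDR_LEN);
	pkt[0] = 0x60; /* IP version 6 in the top nibble */
	udp[0] = (uint8_t)(sport >> 8);
	udp[1] = (uint8_t)(sport & 0xff);
	udp[2] = (uint8_t)(dport >> 8);
	udp[3] = (uint8_t)(dport & 0xff);
}
```

Calling `udp_dport_at(pkt + IPV6_HDR_LEN)` returns the real destination port, while `udp_dport_at(pkt)` returns bytes from the IPv6 traffic class and flow label. Passing a NULL skb, as the commit does, sidesteps the problem by not running the bpf_prog on such a buffer at all.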
  4. 01 Jun, 2019 2 commits
    • bpf, riscv: clear high 32 bits for ALU32 add/sub/neg/lsh/rsh/arsh · 1e692f09
      Luke Nelson authored
      In BPF, 32-bit ALU operations should zero-extend their results into
      the 64-bit registers.
      
      The current BPF JIT on RISC-V emits incorrect instructions that perform
      sign extension only (e.g., addw, subw) on 32-bit add, sub, lsh, rsh,
      arsh, and neg. This behavior diverges from the interpreter and JITs
      for other architectures.
      
      This patch fixes the bugs by performing zero extension on the destination
      register of 32-bit ALU operations.
      
      Fixes: 2353ecc6 ("bpf, riscv: add BPF JIT for RV64G")
      Cc: Xi Wang <xi.wang@gmail.com>
      Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Björn Töpel <bjorn.topel@gmail.com>
      Reviewed-by: Palmer Dabbelt <palmer@sifive.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1e692f09
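The difference between the zero extension BPF requires and the sign extension produced by RV64 word instructions such as addw can be shown in portable C. This is an illustration of the semantics only, not JIT code; the function names are hypothetical, and the `int32_t` conversion relies on the two's-complement behaviour that mainstream compilers provide.

```c
#include <assert.h>
#include <stdint.h>

/* What BPF requires: a 32-bit add whose result is zero-extended to 64 bits. */
uint64_t alu32_add_zext(uint64_t dst, uint64_t src)
{
	uint32_t lo = (uint32_t)dst + (uint32_t)src; /* wraps modulo 2^32 */

	return (uint64_t)lo; /* upper 32 bits are 0 */
}

/* What a bare addw yields: the 32-bit result sign-extended to 64 bits. */
uint64_t alu32_add_sext(uint64_t dst, uint64_t src)
{
	uint32_t lo = (uint32_t)dst + (uint32_t)src;

	return (uint64_t)(int64_t)(int32_t)lo; /* bit 31 is copied into bits 32..63 */
}

/* 0x7fffffff + 1 sets bit 31, so the two extensions disagree. */
void alu32_selfcheck(void)
{
	assert(alu32_add_zext(0x7fffffffULL, 1) == 0x0000000080000000ULL);
	assert(alu32_add_sext(0x7fffffffULL, 1) == 0xffffffff80000000ULL);
}
```

The two functions agree whenever bit 31 of the 32-bit result is clear, which is why the bug only shows up for some inputs.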
    • libbpf: Return btf_fd for load_sk_storage_btf · cfd49210
      Michal Rostecki authored
      Before this change, function load_sk_storage_btf expected that
      libbpf__probe_raw_btf was returning a BTF descriptor, but in fact it was
      returning information about whether the probe was successful (0 or
      1). load_sk_storage_btf was using that value as an argument of the close
      function, which was resulting in closing stdout and thus terminating the
      process which called that function.
      
      That bug was visible in bpftool. `bpftool feature` subcommand was always
      exiting too early (because of closed stdout) and it didn't display all
      requested probes. `bpftool -j feature` or `bpftool -p feature` were not
      returning a valid json object.
      
      This change renames the libbpf__probe_raw_btf function to
      libbpf__load_raw_btf, which now returns a BTF descriptor, as expected in
      load_sk_storage_btf.
      
      v2:
      - Fix typo in the commit message.
      
      v3:
      - Simplify BTF descriptor handling in bpf_object__probe_btf_* functions.
      - Rename libbpf__probe_raw_btf function to libbpf__load_raw_btf and
      return a BTF descriptor.
      
      v4:
      - Fix typo in the commit message.
      
      Fixes: d7c4b398 ("libbpf: detect supported kernel BTF features and sanitize BTF")
      Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      cfd49210
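The bug pattern described above, a 0/1 success flag passed to close() as if it were a file descriptor, can be reproduced in a few lines of POSIX C. This is a sketch, not libbpf code: `probe_returns_flag` is a hypothetical stand-in for the old helper, and the demonstration saves and restores stdout so that the calling process keeps working.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical stand-in for the old helper: reports success as 1, not an fd. */
int probe_returns_flag(void)
{
	return 1;
}

/* Returns 1 if fd is currently open in this process, 0 otherwise. */
int fd_is_open(int fd)
{
	return fcntl(fd, F_GETFD) != -1;
}

/*
 * Reproduces the bug pattern: the success flag is closed as a descriptor.
 * Returns 1 if stdout ended up closed. stdout is restored before returning.
 */
int closing_flag_closes_stdout(void)
{
	int saved, ret, closed;

	saved = dup(STDOUT_FILENO); /* keep a copy so stdout can be restored */
	assert(saved >= 0);

	ret = probe_returns_flag();
	close(ret); /* ret == 1 == STDOUT_FILENO */

	closed = !fd_is_open(STDOUT_FILENO);
	dup2(saved, STDOUT_FILENO);
	close(saved);
	return closed;
}
```

Returning a real descriptor from the helper, as the renamed libbpf__load_raw_btf does, makes the later close() act on the BTF object instead of on stdin or stdout.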
  5. 29 May, 2019 4 commits
  6. 24 May, 2019 1 commit
    • bpf: sockmap, fix use after free from sleep in psock backlog workqueue · bd95e678
      John Fastabend authored
      Backlog work for psock (sk_psock_backlog) might sleep while waiting
      for memory to free up when sending packets. However, while sleeping
      the socket may be closed and removed from the map by the user space
      side.
      
      This breaks an assumption in sk_stream_wait_memory, which expects the
      wait queue to be still there when it wakes up, resulting in the
      use-after-free shown below. To fix this, mark sendmsg as MSG_DONTWAIT
      to avoid the sleep altogether. We already set the flag for the
      sendpage case but we missed the case where sendmsg is used.
      Sockmap is currently the only user of skb_send_sock_locked() so only
      the sockmap paths should be impacted.
      
      ==================================================================
      BUG: KASAN: use-after-free in remove_wait_queue+0x31/0x70
      Write of size 8 at addr ffff888069a0c4e8 by task kworker/0:2/110
      
      CPU: 0 PID: 110 Comm: kworker/0:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3-dirty #14
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
      Workqueue: events sk_psock_backlog
      Call Trace:
       print_address_description+0x6e/0x2b0
       ? remove_wait_queue+0x31/0x70
       kasan_report+0xfd/0x177
       ? remove_wait_queue+0x31/0x70
       ? remove_wait_queue+0x31/0x70
       remove_wait_queue+0x31/0x70
       sk_stream_wait_memory+0x4dd/0x5f0
       ? sk_stream_wait_close+0x1b0/0x1b0
       ? wait_woken+0xc0/0xc0
       ? tcp_current_mss+0xc5/0x110
       tcp_sendmsg_locked+0x634/0x15d0
       ? tcp_set_state+0x2e0/0x2e0
       ? __kasan_slab_free+0x1d1/0x230
       ? kmem_cache_free+0x70/0x140
       ? sk_psock_backlog+0x40c/0x4b0
       ? process_one_work+0x40b/0x660
       ? worker_thread+0x82/0x680
       ? kthread+0x1b9/0x1e0
       ? ret_from_fork+0x1f/0x30
       ? check_preempt_curr+0xaf/0x130
       ? iov_iter_kvec+0x5f/0x70
       ? kernel_sendmsg_locked+0xa0/0xe0
       skb_send_sock_locked+0x273/0x3c0
       ? skb_splice_bits+0x180/0x180
       ? start_thread+0xe0/0xe0
       ? update_min_vruntime.constprop.27+0x88/0xc0
       sk_psock_backlog+0xb3/0x4b0
       ? strscpy+0xbf/0x1e0
       process_one_work+0x40b/0x660
       worker_thread+0x82/0x680
       ? process_one_work+0x660/0x660
       kthread+0x1b9/0x1e0
       ? __kthread_create_on_node+0x250/0x250
       ret_from_fork+0x1f/0x30
      
      Fixes: 20bf50de ("skbuff: Function to send an skbuf on a socket")
      Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
      Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      bd95e678
  7. 23 May, 2019 3 commits
    • bpf: sockmap, restore sk_write_space when psock gets dropped · 186bcc3d
      Jakub Sitnicki authored
      Once psock gets unlinked from its sock (sk_psock_drop), user-space can
      still trigger a call to sk->sk_write_space by setting TCP_NOTSENT_LOWAT
      socket option. This causes a null-ptr-deref because we try to read
      psock->saved_write_space from sk_psock_write_space:
      
      ==================================================================
      BUG: KASAN: null-ptr-deref in sk_psock_write_space+0x69/0x80
      Read of size 8 at addr 00000000000001a0 by task sockmap-echo/131
      
      CPU: 0 PID: 131 Comm: sockmap-echo Not tainted 5.2.0-rc1-00094-gf49aa1de #81
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
      Call Trace:
       ? sk_psock_write_space+0x69/0x80
       __kasan_report.cold.2+0x5/0x3f
       ? sk_psock_write_space+0x69/0x80
       kasan_report+0xe/0x20
       sk_psock_write_space+0x69/0x80
       tcp_setsockopt+0x69a/0xfc0
       ? tcp_shutdown+0x70/0x70
       ? fsnotify+0x5b0/0x5f0
       ? remove_wait_queue+0x90/0x90
       ? __fget_light+0xa5/0xf0
       __sys_setsockopt+0xe6/0x180
       ? sockfd_lookup_light+0xb0/0xb0
       ? vfs_write+0x195/0x210
       ? ksys_write+0xc9/0x150
       ? __x64_sys_read+0x50/0x50
       ? __bpf_trace_x86_fpu+0x10/0x10
       __x64_sys_setsockopt+0x61/0x70
       do_syscall_64+0xc5/0x520
       ? vmacache_find+0xc0/0x110
       ? syscall_return_slowpath+0x110/0x110
       ? handle_mm_fault+0xb4/0x110
       ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
       ? trace_hardirqs_off_caller+0x4b/0x120
       ? trace_hardirqs_off_thunk+0x1a/0x3a
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x7f2e5e7cdcce
      Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b1 66 2e 0f 1f 84 00 00 00 00 00
      0f 1f 44 00 00 f3 0f 1e fa 49 89 ca b8 36 00 00 00 0f 05 <48> 3d 01 f0 ff
      ff 73 01 c3 48 8b 0d 8a 11 0c 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffed011b778 EFLAGS: 00000206 ORIG_RAX: 0000000000000036
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2e5e7cdcce
      RDX: 0000000000000019 RSI: 0000000000000006 RDI: 0000000000000007
      RBP: 00007ffed011b790 R08: 0000000000000004 R09: 00007f2e5e84ee80
      R10: 00007ffed011b788 R11: 0000000000000206 R12: 00007ffed011b78c
      R13: 00007ffed011b788 R14: 0000000000000007 R15: 0000000000000068
      ==================================================================
      
      Restore the saved sk_write_space callback when psock is being dropped to
      fix the crash.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      186bcc3d
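The save-and-restore pattern behind this fix can be modelled with plain function pointers. The structures below are heavily simplified stand-ins for the kernel's sock and psock objects, and every name in the sketch is hypothetical apart from the fields mentioned in the commit message (sk_write_space, saved_write_space).

```c
#include <assert.h>
#include <stddef.h>

struct sock;
typedef void (*write_space_fn)(struct sock *sk);

struct psock {
	write_space_fn saved_write_space; /* callback that was installed before attach */
};

struct sock {
	write_space_fn sk_write_space;
	struct psock *psock;
	int default_calls; /* counts calls to the original callback */
};

void default_write_space(struct sock *sk)
{
	sk->default_calls++;
}

/* Wrapper installed while a psock is attached; it needs sk->psock to be valid. */
void psock_write_space(struct sock *sk)
{
	assert(sk->psock != NULL); /* the kernel crash corresponds to this being NULL */
	sk->psock->saved_write_space(sk);
}

void psock_attach(struct sock *sk, struct psock *ps)
{
	ps->saved_write_space = sk->sk_write_space;
	sk->psock = ps;
	sk->sk_write_space = psock_write_space;
}

void psock_drop(struct sock *sk)
{
	struct psock *ps = sk->psock;

	if (ps == NULL)
		return;
	/* The fix: put the saved callback back before unlinking the psock. */
	sk->sk_write_space = ps->saved_write_space;
	sk->psock = NULL;
}
```

Without the restoring assignment in `psock_drop`, a later call through `sk_write_space` would still reach `psock_write_space` after the psock was unlinked, which matches the null-ptr-deref in the KASAN report.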
    • selftests: bpf: add zero extend checks for ALU32 and/or/xor · 00d83045
      Björn Töpel authored
      Add three tests to test_verifier/basic_instr that make sure that the
      high 32 bits of the destination register are cleared after an ALU32
      and/or/xor.
      Signed-off-by: Björn Töpel <bjorn.topel@gmail.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      00d83045
    • bpf, riscv: clear target register high 32-bits for and/or/xor on ALU32 · fe121ee5
      Björn Töpel authored
      When using 32-bit subregisters (ALU32), the RISC-V JIT would not clear
      the high 32-bits of the target register and therefore generate
      incorrect code.
      
      E.g., in the following code:
      
        $ cat test.c
        unsigned int f(unsigned long long a,
        	       unsigned int b)
        {
        	return (unsigned int)a & b;
        }
      
        $ clang-9 -target bpf -O2 -emit-llvm -S test.c -o - | \
        	llc-9 -mattr=+alu32 -mcpu=v3
        	.text
        	.file	"test.c"
        	.globl	f
        	.p2align	3
        	.type	f,@function
        f:
        	r0 = r1
        	w0 &= w2
        	exit
        .Lfunc_end0:
        	.size	f, .Lfunc_end0-f
      
      The JIT would not clear the high 32-bits of r0 after the
      and-operation, which in this case might give an incorrect return
      value.
      
      After this patch, that is not the case, and the upper 32-bits are
      cleared.
      Reported-by: Jiong Wang <jiong.wang@netronome.com>
      Fixes: 2353ecc6 ("bpf, riscv: add BPF JIT for RV64G")
      Signed-off-by: Björn Töpel <bjorn.topel@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      fe121ee5
  8. 21 May, 2019 18 commits