1. 13 Oct, 2015 21 commits
    • tun: use sk_fullsock() before reading sk->sk_tsflags · 5fcd2d8b
      Eric Dumazet authored
      Timewait and request sockets are small and do not contain sk->sk_tsflags.
      
      Without this fix, we might read garbage, and crash later in
      
      __skb_complete_tx_timestamp()
       -> sock_queue_err_skb()
      
      (These pseudo sockets do not have an error queue either)
      
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
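The guard described in this commit can be sketched in user space as follows. This is a toy model, not the kernel code: `toy_sock`, `toy_sock_common`, `toy_fullsock()` and `toy_read_tsflags()` are hypothetical names, and the state values are local to the example. Only the idea comes from the commit message: check that a socket is a full socket before reading fields that the smaller timewait and request sockets do not have.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy socket states; the values are local to this sketch. */
enum toy_state { TOY_ESTABLISHED = 1, TOY_TIME_WAIT = 2, TOY_NEW_SYN_RECV = 3 };

/* Common header shared by every socket flavour in this model. */
struct toy_sock_common {
    enum toy_state state;
};

/* A "full" socket carries extra fields such as tsflags. */
struct toy_sock {
    struct toy_sock_common common;
    unsigned int tsflags;
};

/* True only for objects that really are a struct toy_sock. */
bool toy_fullsock(const struct toy_sock_common *sk)
{
    return sk->state != TOY_TIME_WAIT && sk->state != TOY_NEW_SYN_RECV;
}

/* Return tsflags for full sockets and 0 otherwise, so that the
 * memory behind a small pseudo socket is never read. */
unsigned int toy_read_tsflags(const struct toy_sock_common *sk)
{
    if (sk == NULL || !toy_fullsock(sk))
        return 0;
    return ((const struct toy_sock *)sk)->tsflags;
}
```

The cast is only valid because the state check has already established which structure the pointer refers to; without the check, the read would run past the end of the smaller object.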
    • Merge branch 'netns-defrag' · b7a46095
      David S. Miller authored
      Eric W. Biederman says:
      
      ====================
      net: Pass net into defragmentation
      
      This is the next installment of my work to pass struct net through the
      output path so the code does not need to guess how to figure out which
      network namespace it is in, and ultimately routes can have output
      devices in another network namespace.
      
      In netfilter and af_packet we defragment packets in the output path,
      and there is the usual amount of confusion about how to compute which
      net we are processing the packets in.  This patchset clears that
      confusion up by explicitly passing in struct net in ip_defrag,
      ip_check_defrag, and nf_ct_frag6_gather.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Pass struct net into nf_ct_frag6_gather · b7277597
      Eric W. Biederman authored
      The function nf_ct_frag6_gather is called on both the input and the
      output paths of the networking stack.  In particular ipv6_defrag which
      calls nf_ct_frag6_gather is called from both the PRE_ROUTING chain
      on input and the LOCAL_OUT chain on output.
      
      The addition of a net parameter makes it explicit which network
      namespace the packets are being reassembled in, and removes the need
      for nf_ct_frag6_gather to guess.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Pass struct net into ip_defrag and ip_check_defrag · 19bcf9f2
      Eric W. Biederman authored
      The function ip_defrag is called on both the input and the output
      paths of the networking stack.  In particular conntrack when it is
      tracking outbound packets from the local machine calls ip_defrag.
      
      So add a struct net parameter and stop making ip_defrag guess which
      network namespace it needs to defragment packets in.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Only compute net once in ip_call_ra_chain · 37fcbab6
      Eric W. Biederman authored
      ip_call_ra_chain is called early in the forwarding chain from
      ip_forward and ip_mr_input, which makes skb->dev the correct
      expression to get the input network device and dev_net(skb->dev) a
      correct expression for the network namespace the packet is being
      processed in.
      
      Compute the network namespace and store it in a variable to make the
      code clearer.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: fix match_fanout_group() · 161642e2
      Eric Dumazet authored
      Recent TCP listener patches exposed a prior af_packet bug:
      match_fanout_group() blindly assumes it is always safe
      to cast sk to a packet socket to compare fanout with af_packet_priv.
      
      But SYNACK packets can be sent while attached to a request_sock, which
      is smaller than a "struct sock".
      
      We can read nonexistent memory and crash.
      
      Fixes: c0de08d0 ("af_packet: don't emit packet on orig fanout group")
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Eric Leblond <eric@regit.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'wireless-drivers-next-for-davem-2015-10-09' of... · 99165967
      David S. Miller authored
      Merge tag 'wireless-drivers-next-for-davem-2015-10-09' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
      
      Kalle Valo says:
      
      ====================
      Major changes:
      
      iwlwifi
      
      * some debugfs improvements
      * fix signedness in beacon statistics
      * deinline some functions to reduce size when device tracing is enabled
      * filter beacons out in AP mode when no stations are associated
      * deprecate firmwares version -12
      * fix a runtime PM vs. legacy suspend race
      * one-liner fix for a ToF bug
      * clean-ups in the rx code
      * small debugging improvement
      * fix WoWLAN with new firmware versions
      * more clean-ups towards multiple RX queues
      * some rate scaling fixes and improvements
      * some time-of-flight fixes
      * other generic improvements and clean-ups
      
      brcmfmac
      
      * rework code dealing with multiple interfaces
      * allow logging firmware console using debug level
      * support for BCM4350, BCM4365, and BCM4366 PCIE devices
      * fixes for legacy P2P and P2P device handling
      * correct set and get tx-power
      
      ath9k
      
      * add support for Outside Context of a BSS (OCB) mode
      
      mwifiex
      
      * add USB multichannel feature
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4/icmp: redirect messages can use the ingress daddr as source · e2ca690b
      Paolo Abeni authored
      This patch allows configuring how the source address of ICMP
      redirect messages is selected; by default the old behaviour is
      retained, while setting icmp_redirects_use_orig_daddr forces the
      use of the destination address of the packet that caused the
      redirect.
      
      The new behaviour closely follows RFC 5798 section 8.1.1 and fixes the
      following scenario:
      
      Two machines are set up with VRRP to act as routers out of a subnet,
      they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
      x.x.x.254/24.
      
      If a host in said subnet needs to get an ICMP redirect from the VRRP
      router, i.e. to reach a destination behind a different gateway, the
      source IP in the ICMP redirect is chosen as the primary IP on the
      interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.
      
      The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
      and will continue to use the wrong next-hop.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: try switchdev op first in __vlan_vid_add/del · 0944d6b5
      Jiri Pirko authored
      Some drivers need to implement both switchdev vlan ops and
      vid_add/kill ndos. For that to work in bridge code, we need to try
      switchdev op first when adding/deleting vlan id.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • BNX2: free temp_stats_blk on error path · 3703ebe4
      wangweidong authored
      bnx2_init_board does not free temp_stats_blk on the error path when
      some operations fail. Add the missing kfree() call.
      Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'setsockopt_incoming_cpu' · 76973dd7
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: better smp listener behavior
      
      As promised in last patch series, we implement a better SO_REUSEPORT
      strategy, based on cpu hints if given by the application.
      
      We also moved sk_refcnt out of the cache line containing the lookup
      keys, as it was considerably slowing down smp operations because
      of false sharing. This was simpler than converting listen sockets
      to conventional RCU (to avoid sk_refcnt dirtying)
      
      Could process 6.0 Mpps SYN instead of 4.2 Mpps on my test server.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: shrink tcp_timewait_sock by 8 bytes · d475f090
      Eric Dumazet authored
      Reducing tcp_timewait_sock from 280 bytes to 272 bytes
      allows SLAB to pack 15 objects per page instead of 14 (on x86)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
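The packing claim in this commit can be checked with simple integer arithmetic. The sketch below assumes a 4096-byte page and ignores any per-slab metadata, both of which are simplifications made for the example; `objs_per_page()` is a hypothetical helper, not an allocator API.

```c
#include <stddef.h>

/* Number of objects of obj_size bytes that fit in one page,
 * ignoring allocator metadata and alignment padding. */
size_t objs_per_page(size_t page_size, size_t obj_size)
{
    return obj_size ? page_size / obj_size : 0;
}
```

With these assumptions, `objs_per_page(4096, 280)` gives 14 and `objs_per_page(4096, 272)` gives 15, which matches the figures quoted in the commit message.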
    • net: shrink struct sock and request_sock by 8 bytes · ed53d0ab
      Eric Dumazet authored
      A 32-bit hole follows skc_refcnt; use it.
      skc_incoming_cpu can also be in a union with the request_sock rcv_wnd.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: align sk_refcnt on 128 bytes boundary · 8e5eb54d
      Eric Dumazet authored
      sk->sk_refcnt is dirtied for every TCP/UDP incoming packet.
      This is a performance issue if multiple cpus hit a common socket,
      or multiple sockets are chained due to SO_REUSEPORT.
      
      By moving sk_refcnt 8 bytes further, the first 128 bytes of sockets
      are read-mostly. As they contain the lookup keys, this has
      a considerable performance impact, as cpus can cache them.
      
      These 8 bytes are not wasted: we use them as a placeholder
      for various fields, depending on the socket type.
      
      Tested:
       SYN flood hitting a 16 RX queues NIC.
       TCP listener using 16 sockets and SO_REUSEPORT
       and SO_INCOMING_CPU for proper siloing.
      
       Could process 6.0 Mpps SYN instead of 4.2 Mpps
      
       Kernel profile looked like :
          11.68%  [kernel]  [k] sha_transform
           6.51%  [kernel]  [k] __inet_lookup_listener
           5.07%  [kernel]  [k] __inet_lookup_established
           4.15%  [kernel]  [k] memcpy_erms
           3.46%  [kernel]  [k] ipt_do_table
           2.74%  [kernel]  [k] fib_table_lookup
           2.54%  [kernel]  [k] tcp_make_synack
           2.34%  [kernel]  [k] tcp_conn_request
           2.05%  [kernel]  [k] __netif_receive_skb_core
           2.03%  [kernel]  [k] kmem_cache_alloc
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
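The layout idea in this commit can be illustrated with a small user-space struct. Everything here is hypothetical: `toy_sock_layout`, its field names and the 120-byte size of the read-mostly area are made up for the example and do not reflect the real struct sock. The point is only that a frequently written counter placed at offset 128 leaves the preceding 128 bytes free of writes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: read-mostly lookup keys first, hot counter after. */
struct toy_sock_layout {
    unsigned char lookup_keys[120]; /* stand-in for read-mostly fields */
    uint64_t per_type_slot;         /* the 8 bytes reused per socket type */
    int32_t refcnt;                 /* written for every incoming packet */
};

/* Byte offset of the hot counter inside the toy structure. */
size_t toy_refcnt_offset(void)
{
    return offsetof(struct toy_sock_layout, refcnt);
}
```

In this toy layout `toy_refcnt_offset()` is 128. The counter only lands on a 128-byte boundary in memory if the object itself is allocated with that alignment, which the sketch does not enforce.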
    • net: SO_INCOMING_CPU setsockopt() support · 70da268b
      Eric Dumazet authored
      SO_INCOMING_CPU, as added in commit 2c8c56e1, was a getsockopt() command
      to fetch the incoming cpu handling a particular TCP flow after accept().
      
      This commit adds setsockopt() support and extends the SO_REUSEPORT selection
      logic: if a TCP listener or UDP socket has this option set, a packet is
      delivered to this socket only if the CPU handling the packet matches the
      specified one.
      
      This allows building very efficient TCP servers, using one listener per
      RX queue, as the associated TCP listener should only accept flows handled
      in softirq by the same cpu.
      This provides optimal NUMA behavior and keeps cpu caches hot.
      
      Note that __inet_lookup_listener() still has to iterate over the list of
      all listeners. The following patch puts sk_refcnt in a different cache line
      to let this iteration hit only shared and read-mostly cache lines.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: support per-packet fwmark for af_packet sendmsg · c7d39e32
      Edward Jee authored
      Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sock: support per-packet fwmark · f28ea365
      Edward Jee authored
      It's useful to allow users to set fwmark for an individual packet,
      without changing the socket state. The function this patch adds in
      sock layer can be used by the protocols that need such a feature.
      Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bpf-unprivileged' · c1bf5fe0
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      bpf: unprivileged
      
      v1-v2:
      - this set logically depends on cb patch
        "bpf: fix cb access in socket filter programs":
        http://patchwork.ozlabs.org/patch/527391/
        which is a must-have to allow unprivileged programs.
        Thanks Daniel for finding that issue.
      - refactored sysctl to be similar to 'modules_disabled'
      - dropped bpf_trace_printk
      - split tests into separate patch and added more tests
        based on discussion
      
      v1 cover letter:
      I think it is time to liberate eBPF from CAP_SYS_ADMIN.
      As was discussed when eBPF was first introduced two years ago
      the only piece missing in eBPF verifier is 'pointer leak detection'
      to make it available to non-root users.
      Patch 1 adds this pointer analysis.
      The eBPF programs, obviously, need to see and operate on kernel addresses,
      but with these extra checks they won't be able to pass these addresses
      to user space.
      Patch 2 adds accounting of kernel memory used by programs and maps.
      It changes behavior for existing root users, but I think it needs
      to be done consistently for both root and non-root, since today
      programs and maps are only limited by number of open FDs (RLIMIT_NOFILE).
      Patch 2 accounts program's and map's kernel memory as RLIMIT_MEMLOCK.
      
      Unprivileged eBPF is only meaningful for 'socket filter'-like programs.
      eBPF programs for tracing and TC classifiers/actions will stay root only.
      
      In parallel the bpf fuzzing effort is ongoing and so far
      we've found only one verifier bug and that was already fixed.
      The 'constant blinding' pass is also being worked on.
      It will obfuscate constant-like values that are part of eBPF ISA
      to make jit spraying attacks even harder.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add unprivileged bpf tests · bf508877
      Alexei Starovoitov authored
      Add new tests to samples/bpf/test_verifier:
      
      unpriv: return pointer
        checks that pointer cannot be returned from the eBPF program
      
      unpriv: add const to pointer
      unpriv: add pointer to pointer
      unpriv: neg pointer
        checks that pointer arithmetic is disallowed
      
      unpriv: cmp pointer with const
      unpriv: cmp pointer with pointer
        checks that comparison of pointers is disallowed
        Only one case allowed 'void *value = bpf_map_lookup_elem(..); if (value == 0) ...'
      
      unpriv: check that printk is disallowed
        since bpf_trace_printk is not available to unprivileged
      
      unpriv: pass pointer to helper function
        checks that pointers cannot be passed to functions that expect integers
        If function expects a pointer the verifier allows only that type of pointer.
        Like 1st argument of bpf_map_lookup_elem() must be pointer to map.
        (applies to non-root as well)
      
      unpriv: indirectly pass pointer on stack to helper function
        checks that pointer stored into stack cannot be used as part of key
        passed into bpf_map_lookup_elem()
      
      unpriv: mangle pointer on stack 1
      unpriv: mangle pointer on stack 2
        checks that writing into stack slot that already contains a pointer
        is disallowed
      
      unpriv: read pointer from stack in small chunks
        checks that < 8 byte read from stack slot that contains a pointer is
        disallowed
      
      unpriv: write pointer into ctx
        checks that storing pointers into skb->fields is disallowed
      
      unpriv: write pointer into map elem value
        checks that storing pointers into element values is disallowed
        For example:
        int bpf_prog(struct __sk_buff *skb)
        {
          u32 key = 0;
          u64 *value = bpf_map_lookup_elem(&map, &key);
          if (value)
            *value = (u64) skb;  /* stores a kernel pointer in the map */
          return 0;
        }
        will be rejected.
      
      unpriv: partial copy of pointer
        checks that doing 32-bit register mov from register containing
        a pointer is disallowed
      
      unpriv: pass pointer to tail_call
        checks that passing pointer as an index into bpf_tail_call
        is disallowed
      
      unpriv: cmp map pointer with zero
        checks that comparing map pointer with constant is disallowed
      
      unpriv: write into frame pointer
        checks that frame pointer is read-only (applies to root too)
      
      unpriv: cmp of frame pointer
        checks that R10 cannot be used in a comparison
      
      unpriv: cmp of stack pointer
        checks that Rx = R10 - imm is ok, but comparing Rx is not
      
      unpriv: obfuscate stack pointer
        checks that Rx = R10 - imm is ok, but Rx -= imm is not
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: charge user for creation of BPF maps and programs · aaac3ba9
      Alexei Starovoitov authored
      Since eBPF programs and maps use kernel memory, consider it 'locked' memory
      from the user accounting point of view and charge it against the RLIMIT_MEMLOCK limit.
      This limit is typically set to 64 Kbytes by distros, so almost all
      bpf+tracing programs would need to increase it, since they use maps
      and the kernel charges the maximum map size upfront.
      For example, a hash map of 1024 elements will be charged as 64 Kbytes.
      It's inconvenient for current users and changes current behavior for root,
      but it is probably worth doing to be consistent between root and non-root.
      
      Similar accounting logic is done by mmap of perf_event.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
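Programs affected by this accounting change would typically raise their RLIMIT_MEMLOCK soft limit before creating maps. A minimal sketch using the standard getrlimit()/setrlimit() calls follows; `ensure_memlock()` is a hypothetical helper, and the amount a given program needs is left to the caller.

```c
#include <sys/resource.h>

/* Try to make the soft RLIMIT_MEMLOCK at least `want` bytes.
 * Returns 0 if the soft limit ends up >= want, -1 otherwise. */
int ensure_memlock(rlim_t want)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= want)
        return 0;                 /* already large enough */
    if (rl.rlim_max != RLIM_INFINITY && rl.rlim_max < want)
        return -1;                /* raising the hard limit needs privilege */
    rl.rlim_cur = want;           /* raise only the soft limit */
    return setrlimit(RLIMIT_MEMLOCK, &rl);
}
```

The helper deliberately leaves the hard limit alone, so an unprivileged process can only move within whatever ceiling the administrator has configured.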
    • bpf: enable non-root eBPF programs · 1be7f75d
      Alexei Starovoitov authored
      In order to let unprivileged users load and execute eBPF programs,
      teach the verifier to prevent pointer leaks.
      The verifier will prevent
      - any arithmetic on pointers
        (except R10+Imm which is used to compute stack addresses)
      - comparison of pointers
        (except if (map_value_ptr == 0) ... )
      - passing pointers to helper functions
      - indirectly passing pointers in stack to helper functions
      - returning pointer from bpf program
      - storing pointers into ctx or maps
      
      Spill/fill of pointers into stack is allowed, but mangling
      of pointers stored in the stack or reading them byte by byte is not.
      
      Within bpf programs the pointers do exist, since programs need to
      be able to access maps, pass skb pointer to LD_ABS insns, etc
      but programs cannot pass such pointer values to the outside
      or obfuscate them.
      
      Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
      so that socket filters (tcpdump), af_packet (quic acceleration)
      and future kcm can use it.
      tracing and tc cls/act program types still require root permissions,
      since tracing actually needs to be able to see all kernel pointers
      and tc is for root only.
      
      For example, the following unprivileged socket filter program is allowed:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
          *value += skb->len;
        return 0;
      }
      
      but the following program is not:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
          *value += (u64) skb;
        return 0;
      }
      since it would leak the kernel address into the map.
      
      Unprivileged socket filter bpf programs have access to the
      following helper functions:
      - map lookup/update/delete (but they cannot store kernel pointers into them)
      - get_random (it's already exposed to unprivileged user space)
      - get_smp_processor_id
      - tail_call into another socket filter program
      - ktime_get_ns
      
      The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
      This toggle defaults to off (0), but can be set to true (1).  Once true,
      bpf programs and maps cannot be accessed from unprivileged processes,
      and the toggle cannot be set back to false.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
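The one-way behaviour of the sysctl described in this commit can be modelled in a few lines. This is a toy model only: `toy_set_bpf_disabled()` is a hypothetical function, the error codes are illustrative assumptions, and the real sysctl handler is not reproduced here.

```c
#include <errno.h>

/* Toy model of a one-way switch: 0 -> 1 is allowed, 1 -> 0 is refused.
 * Returns 0 on success or a negative errno-style value on refusal. */
int toy_set_bpf_disabled(int *state, int value)
{
    if (value != 0 && value != 1)
        return -EINVAL;
    if (*state == 1 && value == 0)
        return -EPERM;   /* assumption: the exact error code is illustrative */
    *state = value;
    return 0;
}
```

The consequence for administrators is that disabling unprivileged bpf is a decision that lasts until reboot, in the same spirit as the 'modules_disabled' toggle mentioned in the cover letter.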
  2. 12 Oct, 2015 9 commits
  3. 11 Oct, 2015 10 commits