1. 01 Apr, 2018 40 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · d4069fe6
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2018-03-31
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) Add raw BPF tracepoint API in order to have a BPF program type that
         can access kernel internal arguments of the tracepoints in their
         raw form similar to kprobes based BPF programs. This infrastructure
         also adds a new BPF_RAW_TRACEPOINT_OPEN command to BPF syscall which
         returns an anon-inode backed fd for the tracepoint object that allows
         for automatic detach of the BPF program resp. unregistering of the
         tracepoint probe on fd release, from Alexei.
      
      2) Add new BPF cgroup hooks at bind() and connect() entry in order to
         allow BPF programs to reject, inspect or modify user space passed
         struct sockaddr, as well as a hook at post bind time once the port
         has been allocated. They are used in FB's container management engine
         for implementing policy, replacing fragile LD_PRELOAD wrapper
         intercepting bind() and connect() calls that only works in limited
         scenarios like glibc based apps but not for other runtimes in
         containerized applications, from Andrey.
      
      3) BPF_F_INGRESS flag support has been added to sockmap programs for
         their redirect helper call bringing it in line with cls_bpf based
         programs. Support is added for both variants of sockmap programs,
         meaning for tx ULP hooks as well as recv skb hooks, from John.
      
      4) Various improvements on BPF side for the nfp driver; among others,
         this work adds BPF map update and delete helper call support from
         the datapath, JITing of 32 and 64 bit XADD instructions as well as
         offload support of bpf_get_prandom_u32() call. Initial implementation
         of nfp packet cache has been tackled that optimizes memory access
         (see merge commit for further details), from Jakub and Jiong.
      
      5) Removal of struct bpf_verifier_env argument from the print_bpf_insn()
         API has been done in order to prepare to use print_bpf_insn() soon
         out of perf tool directly. This makes the print_bpf_insn() API more
         generic and pushes the env into private data. bpftool is adjusted
         as well with the print_bpf_insn() argument removal, from Jiri.
      
      6) Couple of cleanups and prep work for the upcoming BTF (BPF Type
         Format). The latter will reuse the current BPF verifier log as
         well, thus bpf_verifier_log() is further generalized, from Martin.
      
      7) For bpf_getsockopt() and bpf_setsockopt() helpers, IPv4 IP_TOS read
         and write support has been added in similar fashion to existing
         IPv6 IPV6_TCLASS socket option we already have, from Nikita.
      
      8) Fixes in recent sockmap scatterlist API usage, which did not use
         sg_init_table() for initialization thus triggering a BUG_ON() in
         scatterlist API when CONFIG_DEBUG_SG was enabled. This adds and
         uses a small helper sg_init_marker() to properly handle the affected
         cases, from Prashant.
      
      9) Let the BPF core follow IDR code convention and therefore use the
         idr_preload() and idr_preload_end() helpers, which would also help
         idr_alloc_cyclic() under GFP_ATOMIC to better succeed under memory
         pressure, from Shaohua.
      
      10) Last but not least, a spelling fix in an error message for the
          BPF cookie UID helper under BPF sample code, from Colin.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'inet-frags-bring-rhashtables-to-IP-defrag' · 70ae7222
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      inet: frags: bring rhashtables to IP defrag
      
      IP defrag processing is one of the remaining problematic layers in Linux.
      
      It uses static hash tables of 1024 buckets, and up to 128 items per bucket.
      
      A work queue is supposed to garbage collect items when the host is under memory
      pressure, and to do a hash rebuild, changing the seed used in hash computations.
      
      This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
      occurring every 5 seconds if host is under fire.
      
      Then there is the problem of sharing this hash table for all netns.
      
      It is time to switch to rhashtables, and allocate one of them per netns
      to speed up netns dismantle, since this is a critical metric these days.
      
      Lookup is now using RCU, and 64bit hosts can now provision whatever amount
      of memory needed to handle the expected workloads.
      
      v2: Addressed feedback from Herbert and Kirill
          (Use rhashtable_free_and_destroy(), and split the big patch into small units)
      
      v3: Removed the extra add_frag_mem_limit(...) from inet_frag_create()
          Removed the refcount_inc_not_zero() call from inet_frags_free_cb(),
          as we can exploit del_timer() return value.
      
      v4: kbuild robot feedback about one missing static (squashed)
          Additional patches :
            inet: frags: do not clone skb in ip_expire()
            ipv6: frags: rewrite ip6_expire_frag_queue()
            rhashtable: reorganize struct rhashtable layout
            inet: frags: reorganize struct netns_frags
            inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
            ipv6: frags: get rid of ip6frag_skb_cb/FRAG6_CB
            inet: frags: get rid of nf_ct_frag6_skb_cb/NFCT_FRAG6_CB
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: get rid of nf_ct_frag6_skb_cb/NFCT_FRAG6_CB · f2d1c724
      Eric Dumazet authored
      nf_ct_frag6_queue() uses skb->cb[] to store the fragment offset,
      meaning that we could use two cache lines per skb when finding
      the insertion point, if for some reason inet6_skb_parm size
      is increased in the future.
      
      By using skb->ip_defrag_offset instead of skb->cb[] we pack all the fields
      in a single cache line, matching what we did for IPv4.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: frags: get rid of ip6frag_skb_cb/FRAG6_CB · 219badfa
      Eric Dumazet authored
      ip6_frag_queue uses skb->cb[] to store the fragment offset, meaning that
      we could use two cache lines per skb when finding the insertion point,
      if for some reason inet6_skb_parm size is increased in the future.
      
      By using skb->ip_defrag_offset instead of skb->cb[], we pack all
      the fields in a single cache line, matching what we did for IPv4.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: get rid of ipfrag_skb_cb/FRAG_CB · bf663371
      Eric Dumazet authored
      ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
      this integer is currently in a different cache line than skb->next,
      meaning that we use two cache lines per skb when finding the insertion point.
      
      By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
      in a single cache line and save precious memory bandwidth.
      
      Note that after the fast path added by Changli Gao in commit
      d6bebca9 ("fragment: add fast path for in-order fragments")
      this change won't help the fast path, since we still need
      to access prev->len (2nd cache line), but will show great
      benefits when slow path is entered, since we perform
      a linear scan of a potentially long list.
      
      Also, note that this potentially long list is an attack vector;
      we might eventually consider using an rb-tree there.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: reorganize struct netns_frags · c2615cf5
      Eric Dumazet authored
      Put the read-mostly fields in a separate cache line
      at the beginning of struct netns_frags, to reduce
      false sharing noticed in inet_frag_kill()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rhashtable: reorganize struct rhashtable layout · e5d672a0
      Eric Dumazet authored
      While under frags DDOS I noticed unfortunate false sharing between
      @nelems and @params.automatic_shrinking
      
      Move @nelems to the end of struct rhashtable so that the first cache line
      is shared between all cpus, because it is almost never dirtied.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
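
      The false-sharing fix in the commit above can be illustrated with a minimal user-space sketch. It is not the kernel's struct rhashtable: the struct name, the reduced field set, the 64-byte line size, and the explicit alignas() are all assumptions made for illustration (the commit itself only states that @nelems moves to the end of the struct).

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size, hypothetical */

/* Hypothetical reduced table layout, not the kernel definition. */
struct table_sketch {
    /* Read-mostly fields: stay together on the first cache line. */
    void *buckets;
    unsigned int key_len;
    bool automatic_shrinking;

    /* Frequently written counter: forced onto its own cache line. */
    alignas(CACHE_LINE) long nelems;
};

/* True if two member offsets land on the same cache line. */
static bool same_cache_line(size_t a, size_t b)
{
    return a / CACHE_LINE == b / CACHE_LINE;
}

/* Compile-time check that the hot counter starts a new line. */
static_assert(offsetof(struct table_sketch, nelems) % CACHE_LINE == 0,
              "nelems must start a cache line");
```

      With this layout, CPUs that only read automatic_shrinking no longer have their cache line invalidated each time another CPU updates nelems.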
    • ipv6: frags: rewrite ip6_expire_frag_queue() · 05c0b86b
      Eric Dumazet authored
      Make it similar to IPv4 ip_expire(), and release the lock
      before calling icmp functions.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: do not clone skb in ip_expire() · 1eec5d56
      Eric Dumazet authored
      An skb_clone() was added in commit ec4fbd64 ("inet: frag: release
      spinlock before calling icmp_send()")
      
      While fixing the bug at that time, it also added a very high cost
      for DDOS frags, as the ICMP rate limit is applied after this
      expensive operation (skb_clone() + consume_skb(), implying memory
      allocations, copy, and freeing)
      
      We can use skb_get(head) here; all we want is to make sure the skb won't
      be freed by another cpu.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
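
      The idea behind replacing skb_clone() with skb_get() is that taking an extra reference is enough to keep a buffer alive, with no allocation or copy. The following is a minimal user-space sketch of that pattern; buf_sketch, buf_init(), buf_get() and buf_put() are hypothetical names and do not reproduce the kernel's sk_buff API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted buffer, standing in for an skb. */
struct buf_sketch {
    atomic_int users; /* number of current holders */
};

/* A freshly created buffer has exactly one holder. */
static void buf_init(struct buf_sketch *b)
{
    atomic_init(&b->users, 1);
}

/* Take an extra reference: one atomic increment, no allocation, no copy. */
static struct buf_sketch *buf_get(struct buf_sketch *b)
{
    atomic_fetch_add(&b->users, 1);
    return b;
}

/* Drop a reference; returns true when the caller held the last one. */
static bool buf_put(struct buf_sketch *b)
{
    int prev = atomic_fetch_sub(&b->users, 1);

    assert(prev > 0);
    return prev == 1;
}
```

      A holder that calls buf_get() before releasing a lock can keep using the buffer afterwards: a concurrent buf_put() by another holder returns false, so the buffer is not freed until the last reference is dropped.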
    • inet: frags: break the 2GB limit for frags storage · 3e67f106
      Eric Dumazet authored
      Some users are willing to provision huge amounts of memory to be able
      to perform reassembly reasonably well under pressure.
      
      Current memory tracking is using one atomic_t and integers.
      
      Switch to atomic_long_t so that 64bit arches can use more than 2GB,
      without any cost for 32bit arches.
      
      Note that this patch avoids an overflow error, if high_thresh was set
      to ~2GB, since this test in inet_frag_alloc() was never true :
      
      if (... || frag_mem_limit(nf) > nf->high_thresh)
      
      Tested:
      
      $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh
      
      <frag DDOS>
      
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 14705885 memory 16000002880
      
      $ nstat -n ; sleep 1 ; nstat | grep Reas
      IpReasmReqds                    3317150            0.0
      IpReasmFails                    3317112            0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
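
      The overflow noted in the commit above comes down to integer width: with a 32-bit signed counter, no representable usage value can exceed a threshold set at the top of the range, so the limit test never fires. A minimal sketch follows, assuming the threshold is exactly INT32_MAX; over_limit_32() and over_limit_64() are hypothetical stand-ins, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the limit check quoted above; not kernel code. */

/* Old scheme: usage and threshold are both 32-bit signed values. */
static bool over_limit_32(int32_t mem, int32_t high_thresh)
{
    assert(high_thresh >= 0);
    return mem > high_thresh;
}

/* New scheme: 64-bit values, as a long is on 64-bit arches. */
static bool over_limit_64(int64_t mem, int64_t high_thresh)
{
    assert(high_thresh >= 0);
    return mem > high_thresh;
}
```

      With the threshold at INT32_MAX, over_limit_32() is false for every possible input, while over_limit_64() reports true for the values in the test output above (16000002880 bytes in use against a 16000000000 byte threshold).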
    • inet: frags: remove inet_frag_maybe_warn_overflow() · 2d44ed22
      Eric Dumazet authored
      This function is obsolete, after rhashtable addition to inet defrag.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: get rif of inet_frag_evicting() · 399d1404
      Eric Dumazet authored
      This refactors ip_expire() since one indentation level is removed.
      
      Note: in the future, we should try hard to avoid the skb_clone()
      since this is a serious performance cost.
      Under DDOS, the ICMP message won't be sent because of rate limits.
      
      The fact that ip6_expire_frag_queue() does not use skb_clone() is
      disturbing too. Presumably IPv6 has the same
      issue as the one we fixed in commit ec4fbd64
      ("inet: frag: release spinlock before calling icmp_send()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: remove some helpers · 6befe4a7
      Eric Dumazet authored
      Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()
      
      Also since we use rhashtable we can bring back the number of fragments
      in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
      removed in commit 434d3054 ("inet: frag: don't account number
      of fragment queues")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: use rhashtables for reassembly units · 648700f7
      Eric Dumazet authored
      Some applications still rely on IP fragmentation, and to be fair, the Linux
      reassembly unit does not work under any serious load.
      
      It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
      
      A work queue is supposed to garbage collect items when the host is under memory
      pressure, and to do a hash rebuild, changing the seed used in hash computations.
      
      This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
      occurring every 5 seconds if host is under fire.
      
      Then there is the problem of sharing this hash table for all netns.
      
      It is time to switch to rhashtables, and allocate one of them per netns
      to speed up netns dismantle, since this is a critical metric these days.
      
      Lookup is now using RCU. A followup patch will even remove
      the refcount hold/release left from prior implementation and save
      a couple of atomic operations.
      
      Before this patch, 16 cpus (16 RX queue NIC) could not handle more
      than 1 Mpps frags DDOS.
      
      After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
      of storage for the fragments (exact number depends on frags being evicted
      after timeout)
      
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 1966916 memory 2140004608
      
      A followup patch will change the limits for 64bit arches.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Alexander Aring <alex.aring@gmail.com>
      Cc: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rhashtable: add schedule points · ae6da1f5
      Eric Dumazet authored
      Rehashing and destroying a large hash table takes a lot of time,
      and happens in process context. It is safe to add cond_resched()
      in rhashtable_rehash_table() and rhashtable_free_and_destroy().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: refactor ipfrag_init() · 483a6e4f
      Eric Dumazet authored
      We need to call inet_frags_init() before register_pernet_subsys(),
      as a prereq for following patch ("inet: frags: use rhashtables for reassembly units")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: refactor lowpan_net_frag_init() · 807f1844
      Eric Dumazet authored
      We want to call lowpan_net_frag_init() earlier.
      Similar to commit "inet: frags: refactor ipv6_frag_init()"
      
      This is a prereq to "inet: frags: use rhashtables for reassembly units"
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: refactor ipv6_frag_init() · 5b975bab
      Eric Dumazet authored
      We want to call inet_frags_init() earlier.
      
      This is a prereq to "inet: frags: use rhashtables for reassembly units"
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: add a pointer to struct netns_frags · 093ba729
      Eric Dumazet authored
      In order to simplify the API, add a pointer to struct inet_frags.
      This will allow us to make things less complex.
      
      These functions no longer have a struct inet_frags parameter :
      
      inet_frag_destroy(struct inet_frag_queue *q  /*, struct inet_frags *f */)
      inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
      inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
      inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
      ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: change inet_frags_init_net() return value · 787bea77
      Eric Dumazet authored
      We will soon initialize one rhashtable per struct netns_frags
      in inet_frags_init_net().
      
      This patch changes the return value to eventually propagate an
      error.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: frag: remove unused field · c22af22c
      Eric Dumazet authored
      csum field in struct frag_queue is not used, remove it.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bnxt_en-next' · 5749d6af
      David S. Miller authored
      Michael Chan says:
      
      ====================
      bnxt_en: Update for net-next.
      
      Misc. updates including updated firmware interface, some additional
      port statistics, a new IRQ assignment scheme for the RDMA driver, support
      for VF trust, and other changes and improvements for SRIOV.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Add ULP calls to stop and restart IRQs. · ec86f14e
      Michael Chan authored
      When the driver needs to re-initialize the IRQ vectors, we make the
      new ulp_irq_stop() call to tell the RDMA driver to disable and free
      the IRQ vectors.  After IRQ vectors have been re-initialized, we
      make the ulp_irq_restart() call to tell the RDMA driver that
      IRQs can be restarted.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Reserve completion rings and MSIX for bnxt_re RDMA driver. · fbcfc8e4
      Michael Chan authored
      Add additional logic to reserve completion rings for the bnxt_re driver
      when it requests MSIX vectors.  The function bnxt_cp_rings_in_use()
      will return the total number of completion rings used by both drivers
      that need to be reserved.  If the network interface is up, we will
      close and open the NIC to reserve the new set of completion rings and
      re-initialize the vectors.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Refactor bnxt_need_reserve_rings(). · 4e41dc5d
      Michael Chan authored
      Refactor bnxt_need_reserve_rings() slightly so that __bnxt_reserve_rings()
      can call it and remove some duplicated code.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Add IRQ remapping logic. · e5811b8c
      Michael Chan authored
      Add remapping logic so that bnxt_en can use any arbitrary MSIX vectors.
      This will allow the driver to reserve one range of MSIX vectors to be
      used by both bnxt_en and bnxt_re.  bnxt_en can now skip over the MSIX
      vectors used by bnxt_re.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Change IRQ assignment for RDMA driver. · 08654eb2
      Michael Chan authored
      In the current code, the range of MSIX vectors allocated for the RDMA
      driver is disjoint from the network driver.  This creates a problem
      for the new firmware ring reservation scheme.  The new scheme requires
      the reserved completion rings/MSIX vectors to be in a contiguous
      range.
      
      Change the logic to allocate RDMA MSIX vectors to be contiguous with
      the vectors used by bnxt_en on new firmware using the new scheme.
      The new function bnxt_get_num_msix() calculates the exact number of
      vectors needed by both drivers.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Improve ring allocation logic. · 9899bb59
      Michael Chan authored
      Currently, the driver code makes some assumptions about the group index
      and the map index of rings.  This makes the code more difficult to
      understand and less flexible.
      
      Improve it by adding the grp_idx and map_idx fields explicitly to the
      bnxt_ring_struct as a union.  The grp_idx is initialized for each tx ring
      and rx agg ring during init time.  We do the same for the map_idx for
      each cmpl ring.
      
      The grp_idx ties the tx ring to the ring group.  The map_idx is the
      doorbell index of the ring.  With this new infrastructure, we can change
      the ring index mapping scheme easily in the future.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Improve valid bit checking in firmware response message. · 845adfe4
      Michael Chan authored
      When firmware sends a DMA response to the driver, the last byte of the
      message will be set to 1 to indicate that the whole response is valid.
      The driver waits for the message to be valid before reading the message.
      
      The firmware spec allows these response messages to increase in
      length by adding new fields to the end of these messages.  The
      older spec's valid location may become a new field in a newer
      spec.  To guarantee compatibility, the driver should zero the valid
      byte before interpreting the entire message so that any new fields not
      implemented by the older spec will be read as zero.
      
      For messages that are forwarded to VFs, we need to set the length
      and reinstate the valid bit so the VF will see the valid response.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
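
      The valid-byte handling described in the commit above can be sketched as follows. This is a simplified user-space illustration, not the bnxt_en code: consume_response(), restore_valid(), the flat byte buffer, and the absence of polling and timeouts are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: check the trailing valid byte of a response of
 * resp_len bytes and, if set, clear it so that a reader using a newer,
 * longer layout sees zero at that offset instead of a stray 1.
 */
static bool consume_response(uint8_t *buf, size_t buf_size, size_t resp_len)
{
    assert(resp_len >= 1 && resp_len <= buf_size);

    uint8_t *valid = &buf[resp_len - 1];

    if (*valid != 1)
        return false; /* not complete yet; a real driver would keep polling */

    *valid = 0;
    return true;
}

/* Hypothetical helper: set the valid byte again before forwarding to a VF. */
static void restore_valid(uint8_t *buf, size_t buf_size, size_t resp_len)
{
    assert(resp_len >= 1 && resp_len <= buf_size);
    buf[resp_len - 1] = 1;
}
```

      In this sketch, a second call to consume_response() on the same buffer returns false until restore_valid() has been called, which mirrors why a forwarded message needs its valid marker set again.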
    • bnxt_en: Improve resource accounting for SRIOV. · 596f9d55
      Michael Chan authored
      When VFs are created, the current code subtracts the maximum VF
      resources from the PF's pool.  This under-estimates the resources
      remaining in the PF pool.  Instead, we should subtract the minimum
      VF resources.  The VF minimum resources are guaranteed to the VFs
      and only these should be subtracted from the PF's pool.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Check max_tx_scheduler_inputs value from firmware. · db4723b3
      Michael Chan authored
      When checking for the maximum pre-set TX channels for ethtool -l, we
      need to check the current max_tx_scheduler_inputs parameter from firmware.
      This parameter specifies the max input for the internal QoS nodes currently
      available to this function.  The function's TX rings will be capped by this
      parameter.  By adding this logic, we provide a more accurate pre-set max
      TX channels to the user.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Add extended port statistics support · 00db3cba
      Vasundhara Volam authored
      Gather periodic extended port statistics, if the device is PF and
      link is up.
      Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Include additional hardware port statistics in ethtool -S. · 699efed0
      Vasundhara Volam authored
      Include additional hardware port statistics in ethtool -S, which
      are useful for debugging.
      Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Add support for ndo_set_vf_trust · 746df139
      Vasundhara Volam authored
      Trusted VFs are allowed to modify MAC address, even when PF
      has assigned one.
      Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: fix clear flags in ethtool reset handling · 2373d8d6
      Scott Branden authored
      Clear flags when reset command processed successfully for components
      specified.
      
      Fixes: 6502ad59 ("bnxt_en: Add ETH_RESET_AP support")
      Signed-off-by: Scott Branden <scott.branden@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Use a dedicated VNIC mode for RDMA. · abe93ad2
      Michael Chan authored
      If the RDMA driver is registered, use a new VNIC mode that allows
      RDMA traffic to be seen on the netdev in promiscuous mode.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Adjust default rings for multi-port NICs. · 1d3ef13d
      Michael Chan authored
      Change the default ring logic to select default number of rings to be up to
      8 per port if the default rings x NIC ports <= total CPUs.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
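
      One way to read the rule in the commit above is sketched below. The commit only states when up to 8 rings per port are chosen; the fallback used here (capping at total CPUs divided by the number of ports, with a minimum of 1) is an assumption, and default_rings() is a hypothetical function, not the driver's.

```c
#include <assert.h>

/*
 * Hypothetical sketch: pick a per-port default ring count so that
 * rings * ports does not exceed the CPU count, with at most 8 per port.
 */
static int default_rings(int online_cpus, int port_count)
{
    assert(online_cpus >= 1 && port_count >= 1);

    /* Start from "up to 8", bounded by the number of CPUs. */
    int rings = online_cpus < 8 ? online_cpus : 8;

    /* Assumed fallback: share the CPUs evenly across ports, minimum 1. */
    int per_port_cap = online_cpus / port_count;
    if (per_port_cap < 1)
        per_port_cap = 1;

    return rings < per_port_cap ? rings : per_port_cap;
}
```

      Under these assumptions a 2-port NIC on a 16-CPU host keeps 8 rings per port (8 x 2 <= 16), while a 4-port NIC on the same host drops to 4 per port.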
    • bnxt_en: Update firmware interface to 1.9.1.15. · d4f52de0
      Michael Chan authored
      Minor changes, such as new extended port statistics.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vlan: vlan_hw_filter_capable() can be static · eeb0a2a5
      Wei Yongjun authored
      Fixes the following sparse warning:
      
      net/8021q/vlan_core.c:168:6: warning:
       symbol 'vlan_hw_filter_capable' was not declared. Should it be static?
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'mlx5-updates-2018-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 8bde261e
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2018-03-30
      
      This series contains updates to mlx5 core and mlx5e netdev drivers.
      The main highlight of this series is the RX optimizations for striding RQ path,
      introduced by Tariq.
      
      The first four patches are trivial misc cleanups.
       - Spelling mistake fix
       - Dead code removal
       - Warning messages
      
      RX optimizations for striding RQ:
      
      1) RX refactoring, cleanups and micro optimizations
         - MTU calculation simplifications, obsoletes some WQEs-to-packets translation
           functions and helps delete ~60 LOC.
         - Do not busy-wait a pending UMR completion.
         - post the new values of UMR WQE inline, instead of using a data pointer.
         - use pre-initialized structures to save calculations in datapath.
      
      2) Use linear SKB in Striding RQ "build_skb", (Using linear SKB has many advantages):
          - Saves a memcpy of the headers.
          - No page-boundary checks in datapath.
          - No filler CQEs.
          - Significantly smaller CQ.
          - SKB data continuously resides in linear part, and not split to
            small amount (linear part) and large amount (fragment).
            This saves datapath cycles in driver and improves utilization
            of SKB fragments in GRO.
          - The fragments of a resulting GRO SKB follow the IP forwarding
            assumption of equal-size fragments.
      
          implementation details:
          HW writes the packets to the beginning of a stride,
          i.e. does not keep headroom. To overcome this we make sure we can
          extend backwards and use the last bytes of stride i-1.
          Extra care is needed for stride 0 as it has no preceding stride.
          We make sure headroom bytes are available by shifting the buffer
          pointer passed to HW by headroom bytes.
      
          This configuration now becomes default, whenever capable.
          Of course, this implies turning LRO off.
      
          Performance testing:
          ConnectX-5, single core, single RX ring, default MTU.
      
          UDP packet rate, early drop in TC layer:
      
          --------------------------------------------
          | pkt size | before    | after     | ratio |
          --------------------------------------------
          | 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
          |  500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
          |   64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
          --------------------------------------------
      
          TCP streams: ~20% gain
      
      3) Support XDP over Striding RQ:
          Now that linear SKB is supported over Striding RQ,
          we can support XDP by setting stride size to PAGE_SIZE
          and headroom to XDP_PACKET_HEADROOM.
      
          Striding RQ is capable of a higher packet-rate than
          conventional RQ.
      
          Performance testing:
          ConnectX-5, 24 rings, default MTU.
          CQE compression ON (to reduce completions BW in PCI).
      
          XDP_DROP packet rate:
          --------------------------------------------------
          | pkt size | XDP rate   | 100GbE linerate | pct% |
          --------------------------------------------------
          |   64byte | 126.2 Mpps |      148.0 Mpps |  85% |
          |  128byte |  80.0 Mpps |       84.8 Mpps |  94% |
          |  256byte |  42.7 Mpps |       42.7 Mpps | 100% |
          |  512byte |  23.4 Mpps |       23.4 Mpps | 100% |
          --------------------------------------------------
      
      4) Remove mlx5 page_ref bulking in Striding RQ and use page_ref_inc only when needed.
         Without this bulking, we have:
          - no atomic ops on WQE allocation or free
          - one atomic op per SKB
          - In the default MTU configuration (1500, stride size is 2K),
            the non-bulking method executes 2 atomic ops as before
          - For larger MTUs with stride size of 4K, non-bulking method
            executes only a single op.
          - For XDP (stride size of 4K, no SKBs), non-bulking has no atomic ops per packet at all.
      
          Performance testing:
          ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
      
          Single core packet rate (64 bytes).
      
          Early drop in TC: no degradation.
      
          XDP_DROP:
          before: 14,270,188 pps
          after:  20,503,603 pps, 43% improvement.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
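
      The headroom scheme described under "implementation details" in item 2 of the merge message above can be sketched as offset arithmetic. This is only an illustration under stated assumptions: pkt_data_offset() and pkt_headroom_offset() are hypothetical helpers, offsets are measured from the start of the allocated buffer, and the 2048-byte stride and 256-byte headroom used in the usage note are example values.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Offset (from the start of the allocated buffer) at which HW writes
 * packet i, given that the pointer handed to HW was shifted forward by
 * `headroom` bytes.
 */
static size_t pkt_data_offset(size_t i, size_t stride, size_t headroom)
{
    assert(headroom < stride);
    return headroom + i * stride;
}

/* Offset at which the headroom of packet i begins. */
static size_t pkt_headroom_offset(size_t i, size_t stride, size_t headroom)
{
    return pkt_data_offset(i, stride, headroom) - headroom;
}
```

      With a 2048-byte stride and 256 bytes of headroom, the headroom of packet 0 starts at offset 0 (the bytes gained by shifting the pointer), and the headroom of packet 1 starts at offset 2048, which falls in the last bytes of the stride that HW used for packet 0.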