1. 01 Apr, 2018 40 commits
    • mlxsw: core: Fix arg name of MLXSW_CORE_RES_VALID and MLXSW_CORE_RES_GET · 64f45888
      Jiri Pirko authored
      First arg of these helpers should be "mlxsw_core".
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: remove kvd_hash_granularity from config profile struct · 72779c97
      Jiri Pirko authored
      This should not be part of the struct, as the struct fields
      are tightly coupled with the FW command payload of the same name.
      Just use the "granularity" define directly, as in other places.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Change KVD linear parts from list to array · 4f8768be
      Jiri Pirko authored
      The parts info is an array. The parts copy this info array, yet they are
      kept in a list. So index the parts by id and change the list of
      parts into an array of parts. This helps to eliminate lookups and
      constructs like mlxsw_sp_kvdl_part_update() (took me some non-trivial
      time to figure out what is going on there).
      Along with that, introduce a helper macro to define the parts infos.
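The list-to-array change described in this commit can be sketched in user-space C as follows. This is a minimal illustration, not the driver's code: the part ids, the sizes, and the names `PART_INFO`, `kvdl_init` and `kvdl_part` are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical part ids; the real driver's names and values may differ. */
enum part_id {
	PART_ID_SINGLE,
	PART_ID_CHUNKS,
	PART_ID_LARGE_CHUNKS,
	PART_ID_MAX
};

struct part_info {
	unsigned int alloc_size;	/* entries per allocation (made-up values) */
};

struct part {
	struct part_info info;		/* copy of the matching info entry */
	unsigned int usage;
};

/* Helper macro: place one info entry at the slot given by its id. */
#define PART_INFO(id, size) [PART_ID_##id] = { .alloc_size = (size) }

static const struct part_info part_infos[PART_ID_MAX] = {
	PART_INFO(SINGLE, 1),
	PART_INFO(CHUNKS, 32),
	PART_INFO(LARGE_CHUNKS, 512),
};

struct kvdl {
	struct part parts[PART_ID_MAX];	/* array indexed by id, not a list */
};

static void kvdl_init(struct kvdl *kvdl)
{
	for (int i = 0; i < PART_ID_MAX; i++) {
		kvdl->parts[i].info = part_infos[i];
		kvdl->parts[i].usage = 0;
	}
}

/* Direct index replaces the list walk that a lookup would need. */
static struct part *kvdl_part(struct kvdl *kvdl, enum part_id id)
{
	assert(id < PART_ID_MAX);
	return &kvdl->parts[id];
}
```

Because the info table and the parts share one index, no search or update helper is needed to pair a part with its info.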
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: Constify devlink_resource_ops · f9b91201
      Jiri Pirko authored
      devlink_resource_ops should be const as the arg of register function is
      also const.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_kvdl: Fix handling of resource_size_param · c8276dd2
      Jiri Pirko authored
      Current code uses global variables, adjusts them and passes pointer down
      to devlink. With every other mlxsw_core instance, the previously passed
      pointer values are rewritten. Fix this by de-globalizing the variables.
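The effect of de-globalizing can be sketched as below. This is an illustrative user-space model only: `struct size_params`, `struct resource`, `struct instance` and `instance_register` are hypothetical names, and the real devlink structures differ.

```c
#include <assert.h>

struct size_params {
	unsigned long size_min;
	unsigned long size_max;
};

/* Stand-in for the layer that keeps the pointer it was handed. */
struct resource {
	const struct size_params *params;
};

struct instance {
	struct size_params kvdl_params;	/* owned by this instance, not global */
	struct resource res;
};

/*
 * Each instance fills in its own parameters and hands out a pointer to
 * them, so registering a second instance cannot rewrite the values the
 * first one already passed down.
 */
static void instance_register(struct instance *inst, unsigned long max)
{
	assert(inst);
	inst->kvdl_params.size_min = 0;
	inst->kvdl_params.size_max = max;
	inst->res.params = &inst->kvdl_params;
}
```

With a single global `struct size_params`, both instances would end up pointing at the same storage and the first would observe the second's limits.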
      
      Fixes: 7f47b19b ("mlxsw: spectrum_kvdl: Add support for per part occupancy")
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Arkadi Sharshevsky <arkadis@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_acl: Fix flex actions header ifndef define construct · 9270aa0d
      Jiri Pirko authored
      Fix copy&paste error in flex actions header ifndef define construct
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'chelsio-inline-tls' · 06b19fe9
      David S. Miller authored
      Atul Gupta says:
      
      ====================
      Chelsio Inline TLS
      
      Series for Chelsio Inline TLS driver (chtls)
      
      Use tls ULP infrastructure to register chtls as Inline TLS driver.
      Chtls uses TCP sockets to Tx/Rx TLS records.
      TCP sk_proto APIs are enhanced to offload TLS records.
      
      T6 adapter provides the following features:
              -TLS record offload, TLS header, encrypt, digest and transmit
              -TLS record receive and decrypt
              -TLS keys store
              -TCP/IP engine
              -TLS engine
              -GCM crypto engine [support CBC also]
      
      TLS provides security at the transport layer. It uses TCP to provide
      reliable end-to-end transport of application data.
      It relies on TCP for any retransmission.
      A TLS session comprises three parts:
      a. TCP/IP connection
      b. TLS handshake
      c. Record layer processing
      
      The TLS handshake state machine is executed on the host (refer to a
      standard implementation, e.g. OpenSSL). Setsockopt [SOL_TCP, TCP_ULP]
      initializes TCP proto-ops for Chelsio inline tls support.
      setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
      
      Tx and Rx keys are decided during the handshake and programmed on
      the chip after CCS is exchanged.
      struct tls12_crypto_info_aes_gcm_128 crypto_info;
      setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info));
      Finished is the first encrypted/decrypted message tx/rx inline.
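The two setsockopt() calls quoted in the cover letter can be put together roughly as follows. This is a sketch under assumptions: `enable_tls_tx` is a hypothetical helper name, the key material is assumed to come from a user-space handshake library, and the fallback `#define`s are only there for older libc headers. The struct fields and constants are those of the Linux `linux/tls.h` uapi header.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/tls.h>

/* Fallbacks for older libc headers; values match the Linux uapi. */
#ifndef SOL_TCP
#define SOL_TCP 6
#endif
#ifndef TCP_ULP
#define TCP_ULP 31
#endif
#ifndef SOL_TLS
#define SOL_TLS 282
#endif

/*
 * Attach the "tls" ULP to a connected TCP socket and program the Tx key.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt()).
 */
static int enable_tls_tx(int sock, const unsigned char *key,
			 const unsigned char *iv, const unsigned char *salt,
			 const unsigned char *rec_seq)
{
	struct tls12_crypto_info_aes_gcm_128 crypto_info;

	assert(key && iv && salt && rec_seq);

	/* Step 1: switch the socket's proto-ops to the TLS ULP. */
	if (setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
		return -1;

	/* Step 2: hand over the negotiated AES-GCM-128 Tx state. */
	memset(&crypto_info, 0, sizeof(crypto_info));
	crypto_info.info.version = TLS_1_2_VERSION;
	crypto_info.info.cipher_type = TLS_CIPHER_AES_GCM_128;
	memcpy(crypto_info.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
	memcpy(crypto_info.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
	memcpy(crypto_info.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
	memcpy(crypto_info.rec_seq, rec_seq,
	       TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

	if (setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info,
		       sizeof(crypto_info)) < 0)
		return -1;

	/* Wipe the local copy of the key material once it is programmed. */
	memset(&crypto_info, 0, sizeof(crypto_info));
	return 0;
}
```

The Rx direction would use the same structure with the TLS_RX option that this series adds. Whether the record processing then runs in software or inline on the adapter depends on the device and its configuration, not on this calling sequence.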
      
      On the Tx path, the TLS engine receives plain text from openssl, inserts
      the IV, fetches the tx key, creates cipher text records and generates
      the MAC.

      The TLS header is added to the cipher text, which is forwarded to the
      TCP/IP engine for transport layer processing and transmission on wire.
      TX PATH:
      Apps--openssl--chtls---TLS engine---encrypt/auth---TCP/IP engine---wire
      
      On the Rx side, data received is PDU aligned at record boundaries.
      The TLS engine processes only complete records. If the rx key is
      programmed on CCS receive, data is decrypted and plain text is posted
      to the host.
      RX PATH:
      Wire--cipher-text--TCP/IP engine [PDU align]---TLS engine---
      decrypt/auth---plain-text--chtls--openssl--application
      
      v15: indent fix in mark_urg
           -removed unwanted checks in sendmsg, sendpage, recvmsg,
            close, disconnect,shutdown, destroy sock [Sabrina]
           - removed unused chtls_free_kmap [chtls.h]
           - rebase to top of net-next
      
      v14: -Reverse christmas tree style for variable declarations for
           various functions in chtls_hw.c, chtls_io.c [Stefano Brivio]
           - replaced break with return in tcp_state_to_flowc_state
             [Stefano Brivio]
           - renamed tlstx_seq_number to tlstx_incr_seqnum [Stefano Brivio]
           - use bool for corked, should_push and send_should_push
             [Stefano Brivio]
           - removed "Reviewed-by" tag for Stefano, Sabrina, Dave Watson
      
      v13: handle clean ctx free for HW_RECORD in tls_sk_proto_close
          -removed SOCK_INLINE [chtls.h], using csk_conn_inline instead
           in send_abort_rpl,chtls_send_abort_rpl,chtls_sendmsg,chtls_sendpage
          -removed sk_no_receive [chtls_io.c] replaced with sk_shutdown &
           RCV_SHUTDOWN in chtls_pt_recvmsg, peekmsg and chtls_recvmsg
          -cleaned chtls_expansion_size [Stefano Brivio]
          - u8 conf:3 in tls_sw_context to add TLS_HW_RECORD
          -removed is_tls_skb, using tls_skb_inline [Stefano Brivio]
          -reverse christmas tree formatting in chtls_io.c, chtls_cm.c
           [Stefano Brivio]
          -fixed build warning reported by kbuild robot
          -retained ctx conf enum in chtls_main vs earlier versions, tls_prots
           not used in chtls.
          -cleanup [removed syn_sent, base_prot, added synq] [Michael Werner]
          - passing struct fw_wr_hdr * to ofldtxq_stop [Casey]
          - rebased on top of the current net-next
      
      v12: patch against net-next
          -fixed build error [reported by Julia]
          -replace set_queue with skb_set_queue_mapping [Sabrina]
          -copyright year correction [chtls]
      
      v11: formatting and cleanup, few function rename and error
           handling [Stefano Brivio]
           - ctx freed later for TLS_HW_RECORD
           - split tx and rx in different patch
      
      v10: fixed following based on the review comments of Sabrina Dubroca
           -docs header added for struct tls_device [tls.h]
           -changed TLS_FULL_HW to TLS_HW_RECORD
           -similarly using tls-hw-record instead of tls-inline for
           ethtool feature config
           -added more description to patch sets
           -replaced kmalloc/vmalloc/kfree with kvzalloc/kvfree
           -reordered the patch sequence
           -formatted entire patch for func return values
      
      v9: corrected __u8 and similar usage
          -create_ctx to alloc tls_context
          -tls_hw_prot before sk !establish check
      
      v8: tls_main.c cleanup comment [Dave Watson]
      
      v7: func name change, use sk->sk_prot where required
      
      v6: modify prot only for FULL_HW
         -corrected commit message for patch 11
      
      v5: set TLS_FULL_HW for registered inline tls drivers
         -set TLS_FULL_HW prot for offload connection else move
          to TLS_SW_TX
         -Case handled for interface with same IP [Dave Miller]
         -Removed Specific IP and INADDR_ANY handling [v4]
      
      v4: removed chtls ULP type, retained tls ULP
         -registered chtls with net tls
         -defined struct tls_device to register the Inline drivers
         -ethtool interface tls-inline to enable Inline TLS for interface
         -prot update to support inline TLS
      
      v3: fixed the kbuild test issues
         -made few functions static
         -initialized few variables
      
      v2: fixed the following based on the review comments of Stephan Mueller,
          Stefano Brivio and Hannes Frederic
          -Added more details in cover letter
          -Fixed indentation and formatting issues
          -Using aes instead of aes-generic
          -memset key info after programming the key on chip
          -reordered the patch sequence
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • crypto: chtls - Makefile Kconfig · bd7f4857
      Atul Gupta authored
      Entry for Inline TLS as another driver dependent on cxgb4 and chcr
      Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • crypto: chtls - Program the TLS session Key · d25f2f71
      Atul Gupta authored
      Initialize the space reserved for storing the TLS keys,
      get and free the location where key is stored for the TLS
      connection.
      Program the Tx and Rx key as received from user in
      struct tls12_crypto_info_aes_gcm_128 and understood by hardware.
      Added socket option TLS_RX.
      Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • crypto: chtls - Inline TLS record Rx · b647993f
      Atul Gupta authored
      Handler for record receive. Plain text is copied to the user
      buffer.
      Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
      Signed-off-by: Michael Werner <werner@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • crypto: chtls - Inline TLS record Tx · 36bedb3f
      Atul Gupta authored
      TLS handler for record transmit.
      Create Inline TLS work request and post to FW.
      Create Inline TLS record CPLs for hardware
      Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
      Signed-off-by: Michael Werner <werner@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • crypto : chtls - CPL handler definition · cc35c88a
      Atul Gupta authored
      Exchange messages with hardware to program the TLS session
      CPL handlers for messages received from chip.
      Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
      Signed-off-by: Michael Werner <werner@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • crypto: chtls - Register chtls with net tls · a0894394
      Atul Gupta authored
      Register chtls as Inline TLS driver, chtls is ULD to cxgb4.
      Setsockopt to program (tx/rx) keys on chip.
      Support AES GCM of key size 128.
      Support both Inline Rx and Tx.
      Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
      Reviewed-by: Casey Leedom <leedom@chelsio.com>
      Reviewed-by: Michael Werner <werner@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • crypto: chtls - structure and macro for Inline TLS · a6779341
      Atul Gupta authored
      Define Inline TLS state, connection management info.
      Supporting macros definition.
      Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
      Reviewed-by: Michael Werner <werner@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • crypto: chcr - Inline TLS Key Macros · 639d28a1
      Atul Gupta authored
      Define macro for programming the TLS Key context
      Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4: LLD driver changes to support TLS · e383f248
      Atul Gupta authored
      Read the Inline TLS capability from firmware.
      Determine the area reserved for storing the keys
      Dump the Inline TLS tx and rx records count.
      Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
      Reviewed-by: Michael Werner <werner@chelsio.com>
      Reviewed-by: Casey Leedom <leedom@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4: Inline TLS FW Interface · e1087089
      Atul Gupta authored
      Key area size in hw-config file. CPL struct for TLS request
      and response. Work request for Inline TLS.
      Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
      Reviewed-by: Casey Leedom <leedom@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: enable Inline TLS in HW · e0be6bea
      Atul Gupta authored
      Ethtool option enables TLS record offload on HW; the user
      configures the feature for a netdev capable of Inline TLS.
      This allows the user to define a custom sk_prot for the Inline TLS sock.
      Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: support for Inline tls record · dd0bed16
      Atul Gupta authored
      Facility to register Inline TLS drivers to net/tls. Setup
      TLS_HW_RECORD prot to listen on offload device.

      Cases handled
      - Inline TLS device exists, setup prot for TLS_HW_RECORD
      - At least one Inline TLS device exists, sets TLS_HW_RECORD.
      - If a non-inline device establishes the connection, move to TLS_SW_TX
      Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · d4069fe6
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2018-03-31
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) Add raw BPF tracepoint API in order to have a BPF program type that
         can access kernel internal arguments of the tracepoints in their
         raw form similar to kprobes based BPF programs. This infrastructure
         also adds a new BPF_RAW_TRACEPOINT_OPEN command to BPF syscall which
         returns an anon-inode backed fd for the tracepoint object that allows
         for automatic detach of the BPF program resp. unregistering of the
         tracepoint probe on fd release, from Alexei.
      
      2) Add new BPF cgroup hooks at bind() and connect() entry in order to
         allow BPF programs to reject, inspect or modify user space passed
         struct sockaddr, and as well a hook at post bind time once the port
         has been allocated. They are used in FB's container management engine
         for implementing policy, replacing fragile LD_PRELOAD wrapper
         intercepting bind() and connect() calls that only works in limited
         scenarios like glibc based apps but not for other runtimes in
         containerized applications, from Andrey.
      
      3) BPF_F_INGRESS flag support has been added to sockmap programs for
         their redirect helper call bringing it in line with cls_bpf based
         programs. Support is added for both variants of sockmap programs,
         meaning for tx ULP hooks as well as recv skb hooks, from John.
      
      4) Various improvements on BPF side for the nfp driver, besides others
         this work adds BPF map update and delete helper call support from
         the datapath, JITing of 32 and 64 bit XADD instructions as well as
         offload support of bpf_get_prandom_u32() call. Initial implementation
         of nfp packet cache has been tackled that optimizes memory access
         (see merge commit for further details), from Jakub and Jiong.
      
      5) Removal of struct bpf_verifier_env argument from the print_bpf_insn()
         API has been done in order to prepare to use print_bpf_insn() soon
         out of perf tool directly. This makes the print_bpf_insn() API more
         generic and pushes the env into private data. bpftool is adjusted
         as well with the print_bpf_insn() argument removal, from Jiri.
      
      6) Couple of cleanups and prep work for the upcoming BTF (BPF Type
         Format). The latter will reuse the current BPF verifier log as
         well, thus bpf_verifier_log() is further generalized, from Martin.
      
      7) For bpf_getsockopt() and bpf_setsockopt() helpers, IPv4 IP_TOS read
         and write support has been added in similar fashion to existing
         IPv6 IPV6_TCLASS socket option we already have, from Nikita.
      
      8) Fixes in recent sockmap scatterlist API usage, which did not use
         sg_init_table() for initialization thus triggering a BUG_ON() in
         scatterlist API when CONFIG_DEBUG_SG was enabled. This adds and
         uses a small helper sg_init_marker() to properly handle the affected
         cases, from Prashant.
      
      9) Let the BPF core follow IDR code convention and therefore use the
         idr_preload() and idr_preload_end() helpers, which would also help
         idr_alloc_cyclic() under GFP_ATOMIC to better succeed under memory
         pressure, from Shaohua.
      
      10) Last but not least, a spelling fix in an error message for the
          BPF cookie UID helper under BPF sample code, from Colin.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'inet-frags-bring-rhashtables-to-IP-defrag' · 70ae7222
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      inet: frags: bring rhashtables to IP defrag
      
      IP defrag processing is one of the remaining problematic layers in linux.
      
      It uses static hash tables of 1024 buckets, and up to 128 items per bucket.
      
      A work queue is supposed to garbage collect items when host is under memory
      pressure, and doing a hash rebuild, changing seed used in hash computations.
      
      This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
      occurring every 5 seconds if host is under fire.
      
      Then there is the problem of sharing this hash table for all netns.
      
      It is time to switch to rhashtables, and allocate one of them per netns
      to speedup netns dismantle, since this is a critical metric these days.
      
      Lookup is now using RCU, and 64bit hosts can now provision whatever amount
      of memory needed to handle the expected workloads.
      
      v2: Addressed Herbert and Kirill feedbacks
        (Use rhashtable_free_and_destroy(), and split the big patch into small units)
      
      v3: Removed the extra add_frag_mem_limit(...) from inet_frag_create()
          Removed the refcount_inc_not_zero() call from inet_frags_free_cb(),
          as we can exploit del_timer() return value.
      
      v4: kbuild robot feedback about one missing static (squashed)
          Additional patches :
            inet: frags: do not clone skb in ip_expire()
            ipv6: frags: rewrite ip6_expire_frag_queue()
            rhashtable: reorganize struct rhashtable layout
            inet: frags: reorganize struct netns_frags
            inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
            ipv6: frags: get rid of ip6frag_skb_cb/FRAG6_CB
            inet: frags: get rid of nf_ct_frag6_skb_cb/NFCT_FRAG6_CB
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: get rid of nf_ct_frag6_skb_cb/NFCT_FRAG6_CB · f2d1c724
      Eric Dumazet authored
      nf_ct_frag6_queue() uses skb->cb[] to store the fragment offset,
      meaning that we could use two cache lines per skb when finding
      the insertion point, if for some reason inet6_skb_parm size
      is increased in the future.

      By using skb->ip_defrag_offset instead of skb->cb[] we pack all the fields
      in a single cache line, matching what we did for IPv4.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: frags: get rid of ip6frag_skb_cb/FRAG6_CB · 219badfa
      Eric Dumazet authored
      ip6_frag_queue uses skb->cb[] to store the fragment offset, meaning that
      we could use two cache lines per skb when finding the insertion point,
      if for some reason inet6_skb_parm size is increased in the future.

      By using skb->ip_defrag_offset instead of skb->cb[], we pack all
      the fields in a single cache line, matching what we did for IPv4.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: get rid of ipfrag_skb_cb/FRAG_CB · bf663371
      Eric Dumazet authored
      ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
      this integer is currently in a different cache line than skb->next,
      meaning that we use two cache lines per skb when finding the insertion point.
      
      By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
      in a single cache line and save precious memory bandwidth.
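The aliasing idea can be sketched with a simplified, hypothetical packet structure. `struct pkt` below is not `struct sk_buff`: the field sizes, the 64-byte cache line, and the assumption that the `dev` slot is free while a packet sits in a fragment queue are all illustrative, and the layout check assumes the struct starts on a cache line boundary.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64	/* assumed cache line size */

struct pkt {
	struct pkt *next;
	struct pkt *prev;
	union {
		void *dev;		/* assumed unused while queued for defrag */
		int ip_defrag_offset;	/* so the slot can hold the offset */
	};
	char cb[48];			/* control block, the offset's old home */
	unsigned int len;		/* lands in the second cache line */
};

/* The list walk below touches only fields in the first cache line. */
static_assert(offsetof(struct pkt, ip_defrag_offset) / CACHE_LINE ==
	      offsetof(struct pkt, next) / CACHE_LINE,
	      "offset and next pointer must share a cache line");

/* Return the last queued fragment whose offset is <= offset, or NULL. */
static struct pkt *find_prev(struct pkt *head, int offset)
{
	struct pkt *prev = NULL;

	for (struct pkt *p = head; p; p = p->next) {
		if (p->ip_defrag_offset > offset)
			break;
		prev = p;
	}
	return prev;
}
```

A linear scan over a long list then loads one cache line per element instead of two, which is where the commit message expects the gain on the slow path.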
      
      Note that after the fast path added by Changli Gao in commit
      d6bebca9 ("fragment: add fast path for in-order fragments")
      this change won't help the fast path, since we still need
      to access prev->len (2nd cache line), but will show great
      benefits when slow path is entered, since we perform
      a linear scan of a potentially long list.
      
      Also, note that this potential long list is an attack vector,
      we might consider also using an rb-tree there eventually.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: reorganize struct netns_frags · c2615cf5
      Eric Dumazet authored
      Put the read-mostly fields in a separate cache line
      at the beginning of struct netns_frags, to reduce
      false sharing noticed in inet_frag_kill()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rhashtable: reorganize struct rhashtable layout · e5d672a0
      Eric Dumazet authored
      While under frags DDOS I noticed unfortunate false sharing between
      @nelems and @params.automatic_shrinking
      
      Move @nelems at the end of struct rhashtable so that first cache line
      is shared between all cpus, because almost never dirtied.
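The layout principle behind this commit, keeping a frequently written counter off the cache line that holds read-mostly fields, can be sketched as follows. `struct table` is a hypothetical stand-in rather than `struct rhashtable`, and the explicit `alignas` with a 64-byte line is an assumption made for the illustration; the commit itself only describes moving the counter to the end of the struct.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64	/* assumed cache line size */

struct table {
	/* Read-mostly: set up once, then only read by every CPU. */
	unsigned int key_len;
	unsigned int max_elems;
	int automatic_shrinking;

	/* Write-hot: changes on every insert and remove, so it gets
	 * its own cache line at the end of the struct. */
	alignas(CACHE_LINE) unsigned long nelems;
};

/* The hot counter must not share a line with the read-mostly fields. */
static_assert(offsetof(struct table, nelems) / CACHE_LINE !=
	      offsetof(struct table, automatic_shrinking) / CACHE_LINE,
	      "nelems must live on a different cache line");
```

With this split, CPUs that only read the parameters keep their copy of the first line in a shared state, while updates to `nelems` invalidate only the second line.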
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: frags: rewrite ip6_expire_frag_queue() · 05c0b86b
      Eric Dumazet authored
      Make it similar to IPv4 ip_expire(), and release the lock
      before calling icmp functions.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: do not clone skb in ip_expire() · 1eec5d56
      Eric Dumazet authored
      An skb_clone() was added in commit ec4fbd64 ("inet: frag: release
      spinlock before calling icmp_send()")
      
      While fixing the bug at that time, it also added a very high cost
      for DDOS frags, as the ICMP rate limit is applied after this
      expensive operation (skb_clone() + consume_skb(), implying memory
      allocations, copy, and freeing)
      
      We can use skb_get(head) here, all we want is to make sure skb won't
      be freed by another cpu.
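The difference between taking a reference and cloning can be sketched with a small user-space reference-counted buffer. `struct buf`, `buf_alloc`, `buf_get` and `buf_put` are hypothetical names, and the plain `int` counter is a simplification; the kernel uses an atomic reference count for this.

```c
#include <assert.h>
#include <stdlib.h>

struct buf {
	int refcnt;		/* simplified; not atomic in this sketch */
	size_t len;
	unsigned char *data;
};

/* Allocate a buffer holding one reference; returns NULL on failure. */
static struct buf *buf_alloc(size_t len)
{
	struct buf *b = calloc(1, sizeof(*b));

	if (!b)
		return NULL;
	b->data = calloc(1, len ? len : 1);
	if (!b->data) {
		free(b);
		return NULL;
	}
	b->len = len;
	b->refcnt = 1;
	return b;
}

/* Take an extra reference: no allocation, no copy, just a counter bump. */
static struct buf *buf_get(struct buf *b)
{
	assert(b && b->refcnt > 0);
	b->refcnt++;
	return b;
}

/* Drop a reference; the last holder frees the buffer. */
static void buf_put(struct buf *b)
{
	assert(b && b->refcnt > 0);
	if (--b->refcnt == 0) {
		free(b->data);
		free(b);
	}
}
```

Holding a reference is enough when the only goal is to keep the object alive after a lock is dropped, which is the situation the commit message describes; a clone would add an allocation and a free to every expiry.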
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: break the 2GB limit for frags storage · 3e67f106
      Eric Dumazet authored
      Some users are willing to provision huge amounts of memory to be able
      to perform reassembly reasonably well under pressure.
      
      Current memory tracking is using one atomic_t and integers.
      
      Switch to atomic_long_t so that 64bit arches can use more than 2GB,
      without any cost for 32bit arches.
      
      Note that this patch avoids an overflow error, if high_thresh was set
      to ~2GB, since this test in inet_frag_alloc() was never true :
      
      if (... || frag_mem_limit(nf) > nf->high_thresh)
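Why the comparison could never be true with 32-bit signed values can be seen in a small sketch. The function names are hypothetical, and `long long` merely stands in for the wider counter type on a 64-bit architecture.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

static_assert(sizeof(long long) >= 8, "need a 64-bit wide type");

/*
 * Old style: usage and threshold are both plain int. With the threshold
 * at INT_MAX (about 2GB), no representable usage value is larger, so the
 * limit check can never fire.
 */
static bool over_limit_int(int mem, int high_thresh)
{
	return mem > high_thresh;
}

/* New style: a 64-bit type can both hold and exceed multi-GB thresholds. */
static bool over_limit_long(long long mem, long long high_thresh)
{
	return mem > high_thresh;
}
```

For instance, the usage and threshold from the test run in this commit message (16000002880 against 16000000000) compare correctly only in the wide variant; neither value fits in a 32-bit int.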
      
      Tested:
      
      $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh
      
      <frag DDOS>
      
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 14705885 memory 16000002880
      
      $ nstat -n ; sleep 1 ; nstat | grep Reas
      IpReasmReqds                    3317150            0.0
      IpReasmFails                    3317112            0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: remove inet_frag_maybe_warn_overflow() · 2d44ed22
      Eric Dumazet authored
      This function is obsolete, after rhashtable addition to inet defrag.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: get rif of inet_frag_evicting() · 399d1404
      Eric Dumazet authored
      This refactors ip_expire() since one indentation level is removed.
      
      Note: in the future, we should try hard to avoid the skb_clone()
      since this is a serious performance cost.
      Under DDOS, the ICMP message won't be sent because of rate limits.
      
      The fact that ip6_expire_frag_queue() does not use skb_clone() is
      disturbing too. Presumably IPv6 has the same
      issue as the one we fixed in commit ec4fbd64
      ("inet: frag: release spinlock before calling icmp_send()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: remove some helpers · 6befe4a7
      Eric Dumazet authored
      Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()

      Also since we use rhashtable we can bring back the number of fragments
      in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
      removed in commit 434d3054 ("inet: frag: don't account number
      of fragment queues")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: use rhashtables for reassembly units · 648700f7
      Eric Dumazet authored
      Some applications still rely on IP fragmentation, and to be fair linux
      reassembly unit is not working under any serious load.
      
      It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
      
      A work queue is supposed to garbage collect items when host is under memory
      pressure, and doing a hash rebuild, changing seed used in hash computations.
      
      This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
      occurring every 5 seconds if host is under fire.
      
      Then there is the problem of sharing this hash table for all netns.
      
      It is time to switch to rhashtables, and allocate one of them per netns
      to speedup netns dismantle, since this is a critical metric these days.
      
      Lookup is now using RCU. A followup patch will even remove
      the refcount hold/release left from prior implementation and save
      a couple of atomic operations.
      
      Before this patch, 16 cpus (16 RX queue NIC) could not handle more
      than 1 Mpps frags DDOS.
      
      After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
      of storage for the fragments (exact number depends on frags being evicted
      after timeout)
      
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 1966916 memory 2140004608
      
      A followup patch will change the limits for 64bit arches.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Alexander Aring <alex.aring@gmail.com>
      Cc: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rhashtable: add schedule points · ae6da1f5
      Eric Dumazet authored
      Rehashing and destroying a large hash table takes a lot of time,
      and happens in process context. It is safe to add cond_resched()
      in rhashtable_rehash_table() and rhashtable_free_and_destroy()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: refactor ipfrag_init() · 483a6e4f
      Eric Dumazet authored
      We need to call inet_frags_init() before register_pernet_subsys(),
      as a prereq for following patch ("inet: frags: use rhashtables for reassembly units")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: refactor lowpan_net_frag_init() · 807f1844
      Eric Dumazet authored
      We want to call lowpan_net_frag_init() earlier.
      Similar to commit "inet: frags: refactor ipv6_frag_init()"

      This is a prereq to "inet: frags: use rhashtables for reassembly units"
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: refactor ipv6_frag_init() · 5b975bab
      Eric Dumazet authored
      We want to call inet_frags_init() earlier.

      This is a prereq to "inet: frags: use rhashtables for reassembly units"
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: add a pointer to struct netns_frags · 093ba729
      Eric Dumazet authored
      In order to simplify the API, add a pointer to struct inet_frags.
      This will allow us to make things less complex.
      
      These functions no longer have a struct inet_frags parameter :
      
      inet_frag_destroy(struct inet_frag_queue *q  /*, struct inet_frags *f */)
      inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
      inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
      inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
      ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
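How a back pointer lets these functions drop a parameter can be sketched as below. The three structs are simplified, hypothetical stand-ins named after the roles in the commit message (protocol-wide state, per-netns state, one queue); the field names and the `destroyed` counter are assumptions for illustration.

```c
#include <assert.h>

/* Protocol-wide state (role of struct inet_frags). */
struct frags_ops {
	int destroyed;			/* counts destroyed queues in this sketch */
};

/* Per-netns state (role of struct netns_frags), now carrying a back pointer. */
struct ns_frags {
	struct frags_ops *f;
};

/* One reassembly queue (role of struct inet_frag_queue). */
struct frag_queue_s {
	struct ns_frags *net;
};

/*
 * Before: frag_destroy(q, f) needed the caller to pass f explicitly.
 * After: f is reachable from the queue itself through q->net->f.
 */
static void frag_destroy(struct frag_queue_s *q)
{
	struct frags_ops *f;

	assert(q && q->net && q->net->f);
	f = q->net->f;
	f->destroyed++;
}
```

Callers no longer need to know which protocol-wide object a queue belongs to, which also removes the chance of passing a mismatched one.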
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: change inet_frags_init_net() return value · 787bea77
      Eric Dumazet authored
      We will soon initialize one rhashtable per struct netns_frags
      in inet_frags_init_net().

      This patch changes the return value to eventually propagate an
      error.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: frag: remove unused field · c22af22c
      Eric Dumazet authored
      csum field in struct frag_queue is not used, remove it.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>