1. 18 Mar, 2020 3 commits
    • netfilter: Introduce egress hook · 8537f786
      Lukas Wunner authored
      Commit e687ad60 ("netfilter: add netfilter ingress hook after
      handle_ing() under unique static key") introduced the ability to
      classify packets on ingress.
      
      Allow the same on egress.  Position the hook immediately before a packet
      is handed to tc and then sent out on an interface, thereby mirroring the
      ingress order.  This order allows marking packets in the netfilter
      egress hook and subsequently using the mark in tc.  Another benefit of
      this order is consistency with a lot of existing documentation which
      says that egress tc is performed after netfilter hooks.
      
      Egress hooks already exist for the most common protocols, such as
      NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because
      they are executed earlier during packet processing.  However for more
      exotic protocols, there is currently no provision to apply netfilter on
      egress.  A common workaround is to enslave the interface to a bridge and
      use ebtables, or to resort to tc.  But when the ingress hook was
      introduced, consensus was that users should be given the choice to use
      netfilter or tc, whichever tool suits their needs best:
      https://lore.kernel.org/netdev/20150430153317.GA3230@salvia/
      This hook is also useful for NAT46/NAT64, tunneling and filtering of
      locally generated af_packet traffic such as dhclient.
      
      There have also been occasional user requests for a netfilter egress
      hook in the past, e.g.:
      https://www.spinics.net/lists/netfilter/msg50038.html
      
      Performance measurements with pktgen surprisingly show a speedup rather
      than a slowdown with this commit:
      
      * Without this commit:
        Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags)
        2920481pps 1401Mb/sec (1401830880bps) errors: 0
      
      * With this commit:
        Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags)
        2941410pps 1411Mb/sec (1411876800bps) errors: 0
      
      * Without this commit + tc egress:
        Result: OK: 39022386(c39019547+d2839) usec, 100000000 (60byte,0frags)
        2562631pps 1230Mb/sec (1230062880bps) errors: 0
      
      * With this commit + tc egress:
        Result: OK: 37604447(c37601877+d2570) usec, 100000000 (60byte,0frags)
        2659259pps 1276Mb/sec (1276444320bps) errors: 0
      
      * With this commit + nft egress:
        Result: OK: 41436689(c41434088+d2600) usec, 100000000 (60byte,0frags)
        2413320pps 1158Mb/sec (1158393600bps) errors: 0
      
      Tested on a bare-metal Core i7-3615QM, each measurement was performed
      three times to verify that the numbers are stable.
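The relative differences implied by the pktgen figures above can be worked out with a small helper. This is an illustrative sketch, not part of the commit: `pct_change()` and `print_pktgen_deltas()` are hypothetical names, and the packet rates are copied from the results listed above.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change of `after` versus `before`, in percent. */
double pct_change(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Print the comparisons implied by the pktgen results quoted above. */
void print_pktgen_deltas(void)
{
	/* Packets per second, copied from the measurements above. */
	const double plain_without = 2920481.0; /* without this commit */
	const double plain_with    = 2941410.0; /* with this commit */
	const double tc_without    = 2562631.0; /* without this commit + tc egress */
	const double tc_with       = 2659259.0; /* with this commit + tc egress */
	const double nft_with      = 2413320.0; /* with this commit + nft egress */

	printf("no classifier:       %+.2f%%\n",
	       pct_change(plain_without, plain_with));
	printf("tc egress:           %+.2f%%\n",
	       pct_change(tc_without, tc_with));
	printf("nft egress vs. none: %+.2f%%\n",
	       pct_change(plain_with, nft_with));
}
```

Calling `print_pktgen_deltas()` from a `main()` shows a gain of roughly 0.7% without any classifier, roughly 3.8% with tc egress, and a cost of roughly 18% when an nft egress rule is evaluated.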
      
      Commands to perform a measurement:
      modprobe pktgen
      echo "add_device lo@3" > /proc/net/pktgen/kpktgend_3
      samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i 'lo@3' -n 100000000
      
      Commands for testing tc egress:
      tc qdisc add dev lo clsact
      tc filter add dev lo egress protocol ip prio 1 u32 match ip dst 4.3.2.1/32
      
      Commands for testing nft egress:
      nft add table netdev t
      nft add chain netdev t co \{ type filter hook egress device lo priority 0 \; \}
      nft add rule netdev t co ip daddr 4.3.2.1/32 drop
      
      All testing was performed on the loopback interface to avoid distorting
      measurements by the packet handling in the low-level Ethernet driver.
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: Generalize ingress hook · 5418d388
      Lukas Wunner authored
      Prepare for addition of a netfilter egress hook by generalizing the
      ingress hook introduced by commit e687ad60 ("netfilter: add
      netfilter ingress hook after handle_ing() under unique static key").
      
      In particular, rename and refactor the ingress hook's static inlines
      such that they can be reused for an egress hook.
      
      No functional change intended.
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: Rename ingress hook include file · b030f194
      Lukas Wunner authored
      Prepare for addition of a netfilter egress hook by renaming
      <linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.
      
      The egress hook also necessitates a refactoring of the include file,
      but that is done in a separate commit to ease reviewing.
      
      No functional change intended.
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  2. 15 Mar, 2020 37 commits
    • netfilter: conntrack: re-visit sysctls in unprivileged namespaces · d0febd81
      Florian Westphal authored
      Since commit b884fa46 ("netfilter: conntrack: unify sysctl handling")
      conntrack no longer exposes most of its sysctls (e.g. tcp timeout
      settings) to network namespaces that are not owned by the initial user
      namespace.
      
      This patch exposes all sysctls even if the namespace is unprivileged.
      
      Compared to a 4.19 kernel, the newly visible and writeable sysctls are:
        net.netfilter.nf_conntrack_acct
        net.netfilter.nf_conntrack_timestamp
        .. to allow enabling the accounting and timestamp extensions.
      
        net.netfilter.nf_conntrack_events
        .. to turn off conntrack event notifications.
      
        net.netfilter.nf_conntrack_checksum
        .. to disable checksum validation.
      
        net.netfilter.nf_conntrack_log_invalid
        .. to enable logging of packets deemed invalid by conntrack.
      
      Newly visible sysctls that are only exported as read-only:
      
        net.netfilter.nf_conntrack_count
        .. current number of conntrack entries living in this netns.
      
        net.netfilter.nf_conntrack_max
        .. global upper limit (maximum size of the table).
      
        net.netfilter.nf_conntrack_buckets
        .. size of the conntrack table (hash buckets).
      
        net.netfilter.nf_conntrack_expect_max
        .. maximum number of permitted expectations in this netns.
      
        net.netfilter.nf_conntrack_helper
        .. conntrack helper auto assignment.
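Each of the sysctls listed above is reachable as a file under /proc/sys, with the dots in the name replaced by slashes. The sketch below shows one way to read such a value from inside a namespace. The helper names `sysctl_to_proc_path()` and `sysctl_read_long()` are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Convert a dotted sysctl name such as "net.netfilter.nf_conntrack_count"
 * into its /proc/sys path. Returns 0 on success, -1 if buf is too small.
 */
int sysctl_to_proc_path(const char *name, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/proc/sys/%s", name);

	if (n < 0 || (size_t)n >= len)
		return -1;

	/* Only the part after the "/proc/sys/" prefix contains dots. */
	for (char *p = buf + strlen("/proc/sys/"); *p; p++)
		if (*p == '.')
			*p = '/';
	return 0;
}

/*
 * Read a numeric sysctl. Returns 0 on success, -1 if the file is not
 * visible in the current namespace or does not contain a number.
 */
int sysctl_read_long(const char *name, long *value)
{
	char path[256];
	FILE *f;
	int ok;

	if (sysctl_to_proc_path(name, path, sizeof(path)) < 0)
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%ld", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would pass, for example, "net.netfilter.nf_conntrack_count" to `sysctl_read_long()`. A return value of -1 then indicates that the sysctl is not exposed in that namespace or that the conntrack module is not loaded.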
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nft_lookup: update element stateful expression · 339706bc
      Pablo Neira Ayuso authored
      If the set element comes with a stateful expression, update it.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: add nft_set_elem_update_expr() helper function · 76adfafe
      Pablo Neira Ayuso authored
      This helper function runs the eval path of the stateful expression
      of an existing set element.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: add elements with stateful expressions · 40944452
      Pablo Neira Ayuso authored
      Update nft_add_set_elem() to handle the NFTA_SET_ELEM_EXPR netlink
      attribute. This patch allows users to add elements with stateful
      expressions.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: statify nft_expr_init() · 795a6d6b
      Pablo Neira Ayuso authored
      No longer exposed to modules, so make this function static.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: add nft_set_elem_expr_alloc() · a7fc9368
      Pablo Neira Ayuso authored
      Add helper function to create stateful expression.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • nft_set_pipapo: Prepare for single ranged field usage · eb16933a
      Stefano Brivio authored
      A few adjustments in nft_pipapo_init() are needed to allow usage of
      this set back-end for a single, ranged field.
      
      Provide a convenient NFT_PIPAPO_MIN_FIELDS definition that currently
      makes sure that the rbtree back-end is selected instead, for sets
      with a single field.
      
      This finally allows a fair comparison with rbtree sets, by defining
      NFT_PIPAPO_MIN_FIELDS as 0 and skipping rbtree back-end initialisation:
      
       ---------------.--------------------------.-------------------------.
       AMD Epyc 7402  |      baselines, Mpps     |   Mpps, % over rbtree   |
        1 thread      |__________________________|_________________________|
        3.35GHz       |        |        |        |            |            |
        768KiB L1D$   | netdev |  hash  | rbtree |            |   pipapo   |
       ---------------|  hook  |   no   | single |   pipapo   |single field|
       type   entries |  drop  | ranges | field  |single field|    AVX2    |
       ---------------|--------|--------|--------|------------|------------|
       net,port       |        |        |        |            |            |
                1000  |   19.0 |   10.4 |    3.8 | 6.0   +58% | 9.6  +153% |
       ---------------|--------|--------|--------|------------|------------|
       port,net       |        |        |        |            |            |
                 100  |   18.8 |   10.3 |    5.8 | 9.1   +57% |11.6  +100% |
       ---------------|--------|--------|--------|------------|------------|
       net6,port      |        |        |        |            |            |
                1000  |   16.4 |    7.6 |    1.8 | 2.8   +55% | 6.5  +261% |
       ---------------|--------|--------|--------|------------|------------|
       port,proto     |        |        |        |     [1]    |    [1]     |
               30000  |   19.6 |   11.6 |    3.9 | 0.9   -77% | 2.7   -31% |
       ---------------|--------|--------|--------|------------|------------|
       port,proto     |        |        |        |            |            |
               10000  |   19.6 |   11.6 |    4.4 | 2.1   -52% | 5.6   +27% |
       ---------------|--------|--------|--------|------------|------------|
       port,proto     |        |        |        |            |            |
       4 threads 10000|   77.9 |   45.1 |   17.4 | 8.3   -52% |22.4   +29% |
       ---------------|--------|--------|--------|------------|------------|
       net6,port,mac  |        |        |        |            |            |
                  10  |   16.5 |    5.4 |    4.3 | 4.5    +5% | 8.2   +91% |
       ---------------|--------|--------|--------|------------|------------|
       net6,port,mac, |        |        |        |            |            |
       proto    1000  |   16.5 |    5.7 |    1.9 | 2.8   +47% | 6.6  +247% |
       ---------------|--------|--------|--------|------------|------------|
       net,mac        |        |        |        |            |            |
                1000  |   19.0 |    8.4 |    3.9 | 6.0   +54% | 9.9  +154% |
       ---------------'--------'--------'--------'------------'------------'
       [1] Causes switch of lookup table buckets for 'port' to 4-bit groups
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • nft_set_pipapo: Introduce AVX2-based lookup implementation · 7400b063
      Stefano Brivio authored
      If the AVX2 set is available, we can exploit the repetitive
      characteristic of this algorithm to provide a fast, vectorised
      version by using 256-bit wide AVX2 operations for bucket loads and
      bitwise intersections.
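The "bucket loads and bitwise intersections" mentioned above can be illustrated with a heavily simplified scalar sketch: for every group of key bits, one bucket is selected by the group's value, and the per-rule bitmaps of the selected buckets are ANDed together. This is not the kernel implementation. It assumes a single 16-bit field, exact-match rules only, at most 64 rules, and hypothetical names (`struct lookup_table`, `table_add_exact()`, `table_lookup()`). A vectorised version would perform the same AND on wider registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GROUP_BITS 4                    /* bits per lookup group */
#define GROUPS     (16 / GROUP_BITS)    /* groups in a 16-bit key */
#define BUCKETS    (1u << GROUP_BITS)   /* buckets per group */

/* lt[g][v]: one bit per rule, set if the rule accepts value v in group g. */
struct lookup_table {
	uint64_t lt[GROUPS][BUCKETS];
};

/* Value of group g of the key, most significant group first. */
static unsigned int group_value(uint16_t key, int g)
{
	return (key >> (GROUP_BITS * (GROUPS - 1 - g))) & (BUCKETS - 1);
}

/* Add an exact-match rule (numbered 0..63) for a 16-bit key. */
void table_add_exact(struct lookup_table *t, unsigned int rule, uint16_t key)
{
	assert(rule < 64);
	for (int g = 0; g < GROUPS; g++)
		t->lt[g][group_value(key, g)] |= UINT64_C(1) << rule;
}

/* Bitmap of matching rules: intersection of one bucket per group. */
uint64_t table_lookup(const struct lookup_table *t, uint16_t key)
{
	uint64_t res = ~UINT64_C(0);

	for (int g = 0; g < GROUPS; g++)
		res &= t->lt[g][group_value(key, g)];
	return res;
}
```

The loop body in `table_lookup()` is identical for every group, which is the repetitive structure that lends itself to SIMD loads and AND operations.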
      
      In most cases, this implementation consistently outperforms rbtree
      set instances despite the fact they are configured to use a given,
      single, ranged data type out of the ones used for performance
      measurements by the nft_concat_range.sh kselftest.
      
      That script, injecting packets directly on the ingoing device path
      with pktgen, reports, averaged over five runs on a single AMD Epyc
      7402 thread (3.35GHz, 768 KiB L1D$, 12 MiB L2$), the figures below.
      CONFIG_RETPOLINE was not set here.
      
      Note that this is not a fair comparison over hash and rbtree set
      types: non-ranged entries (used to have a reference for hash types)
      would be matched faster than this, and matching on a single field
      only (which is the case for rbtree) is also significantly faster.
      
      However, it's not possible at the moment to choose this set type
      for non-ranged entries, and the current implementation also needs
      a few minor adjustments in order to match on less than two fields.
      
       ---------------.-----------------------------------.------------.
       AMD Epyc 7402  |          baselines, Mpps          | this patch |
        1 thread      |___________________________________|____________|
        3.35GHz       |        |        |        |        |            |
        768KiB L1D$   | netdev |  hash  | rbtree |        |            |
       ---------------|  hook  |   no   | single |        |   pipapo   |
       type   entries |  drop  | ranges | field  | pipapo |    AVX2    |
       ---------------|--------|--------|--------|--------|------------|
       net,port       |        |        |        |        |            |
                1000  |   19.0 |   10.4 |    3.8 |    4.0 | 7.5   +87% |
       ---------------|--------|--------|--------|--------|------------|
       port,net       |        |        |        |        |            |
                 100  |   18.8 |   10.3 |    5.8 |    6.3 | 8.1   +29% |
       ---------------|--------|--------|--------|--------|------------|
       net6,port      |        |        |        |        |            |
                1000  |   16.4 |    7.6 |    1.8 |    2.1 | 4.8  +128% |
       ---------------|--------|--------|--------|--------|------------|
       port,proto     |        |        |        |        |            |
               30000  |   19.6 |   11.6 |    3.9 |    0.5 | 2.6  +420% |
       ---------------|--------|--------|--------|--------|------------|
       net6,port,mac  |        |        |        |        |            |
                  10  |   16.5 |    5.4 |    4.3 |    3.4 | 4.7   +38% |
       ---------------|--------|--------|--------|--------|------------|
       net6,port,mac, |        |        |        |        |            |
       proto    1000  |   16.5 |    5.7 |    1.9 |    1.4 | 3.6   +26% |
       ---------------|--------|--------|--------|--------|------------|
       net,mac        |        |        |        |        |            |
                1000  |   19.0 |    8.4 |    3.9 |    2.5 | 6.4  +156% |
       ---------------'--------'--------'--------'--------'------------'
      
      A similar strategy could be easily reused to implement specialised
      versions for other SIMD sets, and I plan to post at least a NEON
      version at a later time.
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • nft_set_pipapo: Prepare for vectorised implementation: helpers · 8683f4b9
      Stefano Brivio authored
      Move most macros and helpers to a header file, so that they can be
      conveniently used by related implementations.
      
      No functional changes are intended here.
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • nft_set_pipapo: Prepare for vectorised implementation: alignment · bf3e5839
      Stefano Brivio authored
      SIMD vector extension sets require stricter alignment than native
      instruction sets to operate efficiently (AVX, NEON) or for some
      instructions to work at all (AltiVec).
      
      Provide facilities to define arbitrary alignment for lookup tables
      and scratch maps. By defining byte alignment with NFT_PIPAPO_ALIGN,
      lt_aligned and scratch_aligned pointers become available.
      
      Additional headroom is allocated, and pointers to the possibly
      unaligned, originally allocated areas are kept so that they can
      be freed.
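The headroom technique described above can be sketched in userspace C: allocate a little more than requested, round the address up to the required alignment, and remember the original pointer for freeing. The names `EXAMPLE_ALIGN`, `struct aligned_buf`, `aligned_buf_alloc()` and `aligned_buf_free()` are hypothetical, and the sketch uses `calloc()`/`free()` rather than kernel allocators.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical alignment in bytes, e.g. the size of one 256-bit vector. */
#define EXAMPLE_ALIGN 32

struct aligned_buf {
	void *raw;      /* pointer returned by calloc(), needed for free() */
	void *aligned;  /* first EXAMPLE_ALIGN-aligned address inside raw */
};

/* Allocate `size` usable, zeroed bytes starting at an aligned address. */
int aligned_buf_alloc(struct aligned_buf *b, size_t size)
{
	uintptr_t addr;

	b->raw = NULL;
	b->aligned = NULL;
	if (size > SIZE_MAX - (EXAMPLE_ALIGN - 1))
		return -1;

	/* EXAMPLE_ALIGN - 1 bytes of headroom always suffice to round up. */
	b->raw = calloc(1, size + EXAMPLE_ALIGN - 1);
	if (!b->raw)
		return -1;

	addr = ((uintptr_t)b->raw + EXAMPLE_ALIGN - 1) &
	       ~(uintptr_t)(EXAMPLE_ALIGN - 1);
	b->aligned = (void *)addr;
	return 0;
}

/* Free through the original pointer, never through the aligned one. */
void aligned_buf_free(struct aligned_buf *b)
{
	free(b->raw);
	b->raw = NULL;
	b->aligned = NULL;
}
```

Keeping both pointers in one structure mirrors the idea of pairing an aligned pointer for lookups with the originally allocated pointer for release.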
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • nft_set_pipapo: Add support for 8-bit lookup groups and dynamic switch · 4051f431
      Stefano Brivio authored
      While grouping matching bits in groups of four saves memory compared
      to the more natural choice of 8-bit words (lookup table size is one
      eighth), it comes at a performance cost, as the number of lookup
      comparisons is doubled, and those also need bit shifts and masking.
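The memory and lookup trade-off stated above follows from simple arithmetic: a field of N bits split into groups of B bits needs N/B lookups and N/B * 2^B buckets. The following sketch, with hypothetical helper names `groups_for()` and `buckets_for()`, makes the two ratios explicit. It assumes every bucket has the same size regardless of group width.

```c
#include <assert.h>

/* Number of groups in a field, which is also the lookups per packet. */
unsigned int groups_for(unsigned int field_bits, unsigned int group_bits)
{
	assert(group_bits != 0 && field_bits % group_bits == 0);
	return field_bits / group_bits;
}

/* Total buckets in the field's lookup table: groups * 2^group_bits. */
unsigned long buckets_for(unsigned int field_bits, unsigned int group_bits)
{
	return (unsigned long)groups_for(field_bits, group_bits) << group_bits;
}
```

For a 32-bit field, 4-bit groups give 8 lookups and 128 buckets, while 8-bit groups give 4 lookups and 1024 buckets: one eighth of the table size at twice the number of lookups.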
      
      Introduce support for 8-bit lookup groups, together with a mapping
      mechanism to dynamically switch, based on defined per-table size
      thresholds and hysteresis, between 8-bit and 4-bit groups, as tables
      grow and shrink. Empty sets start with 8-bit groups, and per-field
      tables are converted to 4-bit groups if they get too big.
      
      An alternative approach would have been to swap per-set lookup
      operation functions as needed, but this doesn't allow for different
      group sizes in the same set, which looks desirable if some fields
      need significantly more matching data compared to others due to
      heavier impact of ranges (e.g. a big number of subnets with
      relatively simple port specifications).
      
      Allowing different group sizes for the same lookup functions implies
      the need for further conditional clauses, whose cost, however,
      appears to be negligible in tests.
      
      The matching rate figures below were obtained for x86_64 running
      the nft_concat_range.sh "performance" cases, averaged over five
      runs, on a single thread of an AMD Epyc 7402 CPU, and for aarch64
      on a single thread of a BCM2711 (Raspberry Pi 4 Model B 4GB),
      clocked at a stable 2147MHz frequency:
      
      ---------------.-----------------------------------.------------.
      AMD Epyc 7402  |          baselines, Mpps          | this patch |
       1 thread      |___________________________________|____________|
       3.35GHz       |        |        |        |        |            |
       768KiB L1D$   | netdev |  hash  | rbtree |        |            |
      ---------------|  hook  |   no   | single | pipapo |   pipapo   |
      type   entries |  drop  | ranges | field  | 4 bits | bit switch |
      ---------------|--------|--------|--------|--------|------------|
      net,port       |        |        |        |        |            |
               1000  |   19.0 |   10.4 |    3.8 |    2.8 | 4.0   +43% |
      ---------------|--------|--------|--------|--------|------------|
      port,net       |        |        |        |        |            |
                100  |   18.8 |   10.3 |    5.8 |    5.5 | 6.3   +14% |
      ---------------|--------|--------|--------|--------|------------|
      net6,port      |        |        |        |        |            |
               1000  |   16.4 |    7.6 |    1.8 |    1.3 | 2.1   +61% |
      ---------------|--------|--------|--------|--------|------------|
      port,proto     |        |        |        |        |     [1]    |
              30000  |   19.6 |   11.6 |    3.9 |    0.3 | 0.5   +66% |
      ---------------|--------|--------|--------|--------|------------|
      net6,port,mac  |        |        |        |        |            |
                 10  |   16.5 |    5.4 |    4.3 |    2.6 | 3.4   +31% |
      ---------------|--------|--------|--------|--------|------------|
      net6,port,mac, |        |        |        |        |            |
      proto    1000  |   16.5 |    5.7 |    1.9 |    1.0 | 1.4   +40% |
      ---------------|--------|--------|--------|--------|------------|
      net,mac        |        |        |        |        |            |
               1000  |   19.0 |    8.4 |    3.9 |    1.7 | 2.5   +47% |
      ---------------'--------'--------'--------'--------'------------'
      [1] Causes switch of lookup table buckets for 'port', not 'proto',
          to 4-bit groups
      
       ---------------.-----------------------------------.------------.
       BCM2711        |          baselines, Mpps          | this patch |
        1 thread      |___________________________________|____________|
        2147MHz       |        |        |        |        |            |
        32KiB L1D$    | netdev |  hash  | rbtree |        |            |
       ---------------|  hook  |   no   | single | pipapo |   pipapo   |
       type   entries |  drop  | ranges | field  | 4 bits | bit switch |
       ---------------|--------|--------|--------|--------|------------|
       net,port       |        |        |        |        |            |
                1000  |   1.63 |   1.37 |   0.87 |   0.61 | 0.70  +17% |
       ---------------|--------|--------|--------|--------|------------|
       port,net       |        |        |        |        |            |
                 100  |   1.64 |   1.36 |   1.02 |   0.78 | 0.81   +4% |
       ---------------|--------|--------|--------|--------|------------|
       net6,port      |        |        |        |        |            |
                1000  |   1.56 |   1.27 |   0.65 |   0.34 | 0.50  +47% |
       ---------------|--------|--------|--------|--------|------------|
       port,proto [2] |        |        |        |        |            |
               10000  |   1.68 |   1.43 |   0.84 |   0.30 | 0.40  +13% |
       ---------------|--------|--------|--------|--------|------------|
       net6,port,mac  |        |        |        |        |            |
                  10  |   1.56 |   1.14 |   1.02 |   0.62 | 0.66   +6% |
       ---------------|--------|--------|--------|--------|------------|
       net6,port,mac, |        |        |        |        |            |
       proto    1000  |   1.56 |   1.12 |   0.64 |   0.27 | 0.40  +48% |
       ---------------|--------|--------|--------|--------|------------|
       net,mac        |        |        |        |        |            |
                1000  |   1.63 |   1.26 |   0.87 |   0.41 | 0.53  +29% |
       ---------------'--------'--------'--------'--------'------------'
      [2] Using 10000 entries instead of 30000 as it would take way too
          long for the test script to generate all of them
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • nft_set_pipapo: Generalise group size for buckets · e807b13c
      Stefano Brivio authored
      Get rid of all hardcoded assumptions that buckets in lookup tables
      correspond to four-bit groups, and replace them with appropriate
      calculations based on a variable group size, now stored in struct
      field.
      
      The group size could now be in principle any divisor of eight. Note,
      though, that lookup and get functions need an implementation
      intimately depending on the group size, and the only supported size
      there, currently, is four bits, which is also the initial and only
      used size at the moment.
      
      While at it, drop 'groups' from struct nft_pipapo: it was never used.
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: flowtable: add tunnel encap/decap action offload support · 88bf6e41
      wenxu authored
      This patch adds tunnel encap/decap action offload to the flowtable
      offload.
      Signed-off-by: wenxu <wenxu@ucloud.cn>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: flowtable: add tunnel match offload support · cfab6dbd
      wenxu authored
      This patch supports IPv4 and IPv6 tunnel_id, tunnel_src and
      tunnel_dst matching for flowtable offload.
      Signed-off-by: wenxu <wenxu@ucloud.cn>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: flowtable: add indr block setup support · b5140a36
      wenxu authored
      Add indr-block setup support to the netfilter flowtable. This lets the
      flowtable offload VLAN and tunnel devices.
      Signed-off-by: wenxu <wenxu@ucloud.cn>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: flowtable: add nf_flow_table_block_offload_init() · 46798779
      wenxu authored
      Add nf_flow_table_block_offload_init() to prepare for the indr block
      offload patch.
      Signed-off-by: wenxu <wenxu@ucloud.cn>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: xt_IDLETIMER: clean up some indenting · f628c27d
      Dan Carpenter authored
      These lines were indented wrong so Smatch complained.
      net/netfilter/xt_IDLETIMER.c:81 idletimer_tg_show() warn: inconsistent indenting
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: bitwise: use more descriptive variable-names. · 049dee95
      Jeremy Sowden authored
      Name the mask and xor data variables, "mask" and "xor," instead of "d1"
      and "d2."
      Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: Replace zero-length array with flexible-array member · 6daf1414
      Gustavo A. R. Silva authored
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these ones is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kind of undefined behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
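A short sketch of why allocations are unaffected: the size of a structure with a flexible array member excludes the array, so the usual "header plus N elements" calculation stays the same. The example reuses `struct foo` from above. The definition of `struct boo`, the use of `stuff` as an element count and the helper `foo_alloc()` are assumptions made for illustration. Userspace `calloc()` stands in for a kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical element type, the original only names it. */
struct boo {
	int value;
};

struct foo {
	int stuff;           /* used here as the number of array elements */
	struct boo array[];  /* flexible array member, must come last */
};

/* Allocate a zeroed struct foo with room for `count` trailing elements. */
struct foo *foo_alloc(int count)
{
	struct foo *f;

	assert(count >= 0);
	/* sizeof(*f) covers only the fixed part, the array is added on top. */
	f = calloc(1, sizeof(*f) + (size_t)count * sizeof(f->array[0]));
	if (f)
		f->stuff = count;
	return f;
}
```

In kernel code the `struct_size()` helper is commonly used for this calculation because it also guards against arithmetic overflow.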
      
      Lastly, fix checkpatch.pl warning
      WARNING: __aligned(size) is preferred over __attribute__((aligned(size)))
      in net/bridge/netfilter/ebtables.c
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nft_set_pipapo: make the symbol 'nft_pipapo_get' static · eb9d7af3
      Chen Wandun authored
      Fix the following sparse warning:
      
      net/netfilter/nft_set_pipapo.c:739:6: warning: symbol 'nft_pipapo_get' was not declared. Should it be static?
      
      Fixes: 3c4287f6 ("nf_tables: Add set type for arbitrary concatenation of ranges")
      Signed-off-by: Chen Wandun <chenwandun@huawei.com>
      Acked-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: cleanup unused macro · 9325f070
      Li RongQing authored
      TEMPLATE_NULLS_VAL is not used after commit 0838aa7f
      ("netfilter: fix netns dependencies with conntrack templates")
      
      PFX is not used after commit 8bee4bad ("netfilter: xt
      extensions: use pr_<level>")
      Signed-off-by: Li RongQing <lirongqing@baidu.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: make all set structs const · 24d19826
      Florian Westphal authored
      They do not need to be writeable anymore.
      
      v2: remove left-over __read_mostly annotation in set_pipapo.c (Stefano)
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: make sets built-in · e32a4dc6
      Florian Westphal authored
      Placing nftables set support in an extra module is pointless:
      
      1. nf_tables needs a dynamic registration interface for the sake of one module
      2. nft heavily relies on sets, e.g. even a simple rule like
         "nft ... tcp dport { 80, 443 }" will not work with _SETS=n.
      
      IOW, either nftables isn't used or both nf_tables and nf_tables_set
      modules are needed anyway.
      
      With extra module:
       307K net/netfilter/nf_tables.ko
        79K net/netfilter/nf_tables_set.ko
      
         text  data  bss     dec filename
       146416  3072  545  150033 nf_tables.ko
        35496  1817    0   37313 nf_tables_set.ko
      
      This patch:
       373K net/netfilter/nf_tables.ko
      
       178563  4049  545  183157 nf_tables.ko
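The size figures above can be cross-checked with a few lines of arithmetic. This is only a sketch for verifying the numbers quoted in this message. The type `struct size_row` and the helpers are hypothetical, and the values are copied from the `size` output above.

```c
#include <assert.h>

/* One row of `size` output, all values in bytes. */
struct size_row {
	long text, data, bss, dec;
};

/* Figures copied from the listing above. */
const struct size_row old_core = { 146416, 3072, 545, 150033 }; /* nf_tables.ko */
const struct size_row old_sets = {  35496, 1817,   0,  37313 }; /* nf_tables_set.ko */
const struct size_row new_core = { 178563, 4049, 545, 183157 }; /* merged nf_tables.ko */

/* Column-wise sum of two rows. */
struct size_row size_add(struct size_row a, struct size_row b)
{
	struct size_row r = { a.text + b.text, a.data + b.data,
			      a.bss + b.bss, a.dec + b.dec };
	return r;
}

/* Column-wise difference a - b. */
struct size_row size_sub(struct size_row a, struct size_row b)
{
	struct size_row r = { a.text - b.text, a.data - b.data,
			      a.bss - b.bss, a.dec - b.dec };
	return r;
}

/* `dec` should equal text + data + bss. */
int size_row_consistent(struct size_row r)
{
	return r.text + r.data + r.bss == r.dec;
}
```

Subtracting the merged module from the sum of the two separate modules gives a saving of 3349 bytes of text and 840 bytes of data, 4189 bytes in total, in addition to the smaller on-disk footprint (373K instead of 307K + 79K).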
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nft_tunnel: add support for geneve opts · 925d8446
      Xin Long authored
      Like vxlan and erspan opts, geneve opts should also be supported in
      nft_tunnel. The difference is geneve RFC (draft-ietf-nvo3-geneve-14)
      allows a geneve packet to carry multiple geneve opts. So with this
      patch, nftables/libnftnl would do:
      
        # nft add table ip filter
        # nft add chain ip filter input { type filter hook input priority 0 \; }
        # nft add tunnel filter geneve_02 { type geneve\; id 2\; \
          ip saddr 192.168.1.1\; ip daddr 192.168.1.2\; \
          sport 9000\; dport 9001\; dscp 1234\; ttl 64\; flags 1\; \
          opts \"1:1:34567890,2:2:12121212,3:3:1212121234567890\"\; }
        # nft list tunnels table filter
          table ip filter {
          	tunnel geneve_02 {
          		id 2
          		ip saddr 192.168.1.1
          		ip daddr 192.168.1.2
          		sport 9000
          		dport 9001
          		tos 18
          		ttl 64
          		flags 1
          		geneve opts 1:1:34567890,2:2:12121212,3:3:1212121234567890
          	}
          }
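
To make the shape of the `opts` string in the example concrete, here is a minimal parsing sketch. It assumes each comma-separated item is `class:type:data`, with class and type written in hexadecimal and data as a hex byte string whose length is a multiple of 4 bytes (as Geneve option lengths are). The names `GeneveOpt` and `parse_geneve_opts` are hypothetical and do not come from nftables or libnftnl.

```python
from typing import List, NamedTuple


class GeneveOpt(NamedTuple):
    opt_class: int   # 16-bit option class
    opt_type: int    # 8-bit option type
    data: bytes      # option payload


def parse_geneve_opts(spec: str) -> List[GeneveOpt]:
    """Parse 'class:type:data[,class:type:data...]' (assumed format)."""
    opts = []
    for item in spec.split(","):
        class_s, type_s, data_s = item.split(":")
        opt_class = int(class_s, 16)   # assumption: hexadecimal notation
        opt_type = int(type_s, 16)     # assumption: hexadecimal notation
        data = bytes.fromhex(data_s)
        if not 0 <= opt_class <= 0xFFFF:
            raise ValueError(f"class out of range in {item!r}")
        if not 0 <= opt_type <= 0xFF:
            raise ValueError(f"type out of range in {item!r}")
        if len(data) % 4 != 0:
            raise ValueError(f"data length not a multiple of 4 in {item!r}")
        opts.append(GeneveOpt(opt_class, opt_type, data))
    return opts


opts = parse_geneve_opts("1:1:34567890,2:2:12121212,3:3:1212121234567890")
for opt in opts:
    print(opt.opt_class, opt.opt_type, opt.data.hex())
```

The point of the sketch is only that one tunnel object carries a list of options, unlike the single-option vxlan and erspan cases.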
      
      v1->v2:
        - no changes, just post it separately.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • Manoj Basapathi's avatar
      netfilter: xtables: Add snapshot of hardidletimer target · 68983a35
      Manoj Basapathi authored
      This is a snapshot of hardidletimer netfilter target.
      
      This patch implements a hardidletimer Xtables target that can be
      used to identify when interfaces have been idle for a certain period
      of time.
      
      Timers are identified by labels and are created when a rule is set
      with a new label. The rules also take a timeout value (in seconds) as
      an option. If more than one rule uses the same timer label, the timer
      will be restarted whenever any of the rules get a hit.
      
      One entry for each timer is created in sysfs. This attribute contains
      the time remaining until the timer expires. The attributes are
      located under the xt_idletimer class:
      
      /sys/class/xt_idletimer/timers/<label>
      
      When the timer expires, the target module sends a sysfs notification
      to userspace, which can then decide what to do (e.g. disconnect to
      save power).
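
As a rough illustration of the userspace side, the sketch below reads one timer attribute. It assumes the attribute holds the remaining time as a decimal number of seconds followed by a newline; the function name `read_timer_remaining` and the label in the usage comment are hypothetical.

```python
from pathlib import Path

# Directory named in the commit message; one file per timer label.
SYSFS_TIMERS = Path("/sys/class/xt_idletimer/timers")


def read_timer_remaining(label: str, base: Path = SYSFS_TIMERS) -> int:
    """Return the remaining time for the timer `label`.

    Assumption: the attribute contains a single decimal integer (seconds).
    """
    return int((Path(base) / label).read_text().strip())


# Hypothetical usage on a system where a rule created the label "wlan0_idle":
#     remaining = read_timer_remaining("wlan0_idle")
```

A real consumer would more likely wait for the expiry notification than poll this value in a loop; sysfs notifications are normally observed by polling the attribute file for exceptional events.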
      
      Compared to IDLETIMER, HARDIDLETIMER can also send the timer expiry
      notification while the CPU is suspended.
      
      v1->v2: Moved all functionality into IDLETIMER module to avoid
      code duplication per comment from Florian.
      Signed-off-by: Manoj Basapathi <manojbm@codeaurora.org>
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • Paul Blakey's avatar
      netfilter: flowtable: Use nf_flow_offload_tuple for stats as well · c3c831b0
      Paul Blakey authored
      This patch doesn't change any functionality.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • Alexander Bersenev's avatar
      cdc_ncm: Fix the build warning · 5d0ab06b
      Alexander Bersenev authored
      The ndp32->wLength is two bytes long, so replace cpu_to_le32 with cpu_to_le16.
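
The commit message only mentions the build warning. As an illustrative model of why the conversion width should match the field width, the sketch below compares the bytes a 16-bit little-endian store produces with what a 32-bit conversion followed by implicit narrowing to 16 bits would leave behind on a big-endian CPU. Both helper names are hypothetical, and this is a model of C semantics, not driver code.

```python
import struct


def store_le16(value: int) -> bytes:
    """Bytes a correct cpu_to_le16()-style store leaves in a 2-byte field."""
    return struct.pack("<H", value)


def store_le32_truncated_on_big_endian(value: int) -> bytes:
    """Model: 32-bit conversion on a big-endian CPU, then narrowing to 16 bits."""
    # On a big-endian CPU, converting to little-endian is a 32-bit byte swap.
    swapped = int.from_bytes(struct.pack("<I", value), "big")
    # Assigning a 32-bit value to a 16-bit field keeps only the low 16 bits.
    truncated = swapped & 0xFFFF
    # The CPU then stores that 16-bit value in its native (big-endian) order.
    return struct.pack(">H", truncated)


length = 0x0020
print(store_le16(length).hex())                          # bytes of the correct store
print(store_le32_truncated_on_big_endian(length).hex())  # bytes in the modelled mismatch
```

On a little-endian CPU both conversions are no-ops, so under this model the mismatch would only show up as a type warning there.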
      
      Fixes: 0fa81b30 ("cdc_ncm: Implement the 32-bit version of NCM Transfer Block")
      Signed-off-by: Alexander Bersenev <bay@hackerdom.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'mptcp-simplify-mptcp_accept' · a79c838f
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      mptcp: simplify mptcp_accept()
      
      Currently we allocate the MPTCP master socket at accept time.
      
      The above makes mptcp_accept() quite complex, and requires checks in several
      places for a NULL MPTCP master socket.
      
      This series simplifies the MPTCP accept implementation, moving the master socket
      allocation to syn-ack time, so that we can drop unneeded checks with the follow-up
      patch.
      
      v1 -> v2:
      - rebased on top of 2398e399
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Paolo Abeni's avatar
      mptcp: drop unneeded checks · dc093db5
      Paolo Abeni authored
      After the previous patch subflow->conn is always != NULL and
      is never changed. We can drop a bunch of now unneeded checks.
      
      v1 -> v2:
       - rebased on top of commit 2398e399 ("mptcp: always
         include dack if possible.")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Paolo Abeni's avatar
      mptcp: create msk early · 58b09919
      Paolo Abeni authored
      This change moves the mptcp socket allocation from mptcp_accept() to
      subflow_syn_recv_sock(), so that subflow->conn is now always set
      for the non fallback scenario.
      
      It allows cleaning up mptcp_accept() a bit, reducing the additional
      locking, and will allow further cleanup in the next patch.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Dejin Zheng's avatar
      net: stmmac: platform: convert to devm_platform_ioremap_resource · 7a1d0e61
      Dejin Zheng authored
      Use devm_platform_ioremap_resource() to simplify the code; it wraps
      platform_get_resource() and devm_ioremap_resource().
      Signed-off-by: Dejin Zheng <zhengdejin5@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vladimir Oltean's avatar
      net: mscc: ocelot: adjust maxlen on NPI port, not CPU · 4a601f10
      Vladimir Oltean authored
      Being a non-physical port, the CPU port does not have an ocelot_port
      structure, so the ocelot_port_writel call inside the
      ocelot_port_set_maxlen() function would access data behind a NULL
      pointer.
      
      This is a patch for net-next only; the net tree boots fine, as the bug
      was introduced during the net -> net-next merge.
      
      Fixes: 1d343579 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
      Fixes: a8015ded ("net: mscc: ocelot: properly account for VLAN header length when setting MRU")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Hoang Le's avatar
      tipc: add NULL pointer check to prevent kernel oops · 746a1eda
      Hoang Le authored
      Calling:
      tipc_node_link_down()->
         - tipc_node_write_unlock()->tipc_mon_peer_down()
         - tipc_mon_peer_down()
      just after disabling the bearer could cause a kernel oops.
      
      Fix this by adding a sanity check to ensure valid memory
      access.
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Hoang Le's avatar
      tipc: simplify trivial boolean return · e228c5c0
      Hoang Le authored
      Checking and returning a 'true' boolean is pointless, as it will be
      returned at the end of the function anyway.
      Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'ethtool-consolidate-irq-coalescing-part-5' · b8323deb
      David S. Miller authored
      Jakub Kicinski says:
      
      ====================
      ethtool: consolidate irq coalescing - part 5
      
      Convert more drivers following the groundwork laid in a recent
      patch set [1] and continued in [2], [3], [4]. The aim of the effort
      is to consolidate irq coalescing parameter validation in the core.
      
      This set converts a further 15 drivers in drivers/net/ethernet.
      One more conversion set to come.
      
      [1] https://lore.kernel.org/netdev/20200305051542.991898-1-kuba@kernel.org/
      [2] https://lore.kernel.org/netdev/20200306010602.1620354-1-kuba@kernel.org/
      [3] https://lore.kernel.org/netdev/20200310021512.1861626-1-kuba@kernel.org/
      [4] https://lore.kernel.org/netdev/20200311223302.2171564-1-kuba@kernel.org/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jakub Kicinski's avatar
      net: via: reject unsupported coalescing params · 5b71256a
      Jakub Kicinski authored
      Set ethtool_ops->supported_coalesce_params to let
      the core reject unsupported coalescing parameters.
      
      This driver did not previously reject unsupported parameters.
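
The validation idea can be modelled in a few lines. The sketch below is a simplified Python model, not kernel code: the flag names mirror the style of the kernel's ETHTOOL_COALESCE_* bits, but the bit values, the four-field subset, the helper name `check_coalesce`, and the assumed error code (-EOPNOTSUPP) are all illustrative.

```python
import errno
from enum import IntFlag


class Coalesce(IntFlag):
    # Illustrative bit assignments; only a small subset of parameters is modelled.
    RX_USECS = 1 << 0
    RX_MAX_FRAMES = 1 << 1
    TX_USECS = 1 << 2
    TX_MAX_FRAMES = 1 << 3


# Which request field corresponds to which capability bit.
FIELD_FLAGS = {
    "rx_coalesce_usecs": Coalesce.RX_USECS,
    "rx_max_coalesced_frames": Coalesce.RX_MAX_FRAMES,
    "tx_coalesce_usecs": Coalesce.TX_USECS,
    "tx_max_coalesced_frames": Coalesce.TX_MAX_FRAMES,
}


def check_coalesce(supported: Coalesce, request: dict) -> int:
    """Return 0 if every non-zero field in `request` is covered by `supported`."""
    used = 0
    for field, value in request.items():
        if value:
            used |= int(FIELD_FLAGS[field])
    # Any requested bit outside the driver's mask means rejection by the core.
    return -errno.EOPNOTSUPP if used & ~int(supported) else 0


driver_mask = Coalesce.RX_USECS | Coalesce.TX_USECS
print(check_coalesce(driver_mask, {"rx_coalesce_usecs": 50}))
print(check_coalesce(driver_mask, {"rx_max_coalesced_frames": 8}))
```

With a mask declared once per driver, the check lives in one place instead of being repeated (or forgotten) in each driver's set_coalesce handler.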
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jakub Kicinski's avatar
      net: sxgbe: reject unsupported coalescing params · 19d9ec99
      Jakub Kicinski authored
      Set ethtool_ops->supported_coalesce_params to let
      the core reject unsupported coalescing parameters.
      
      This driver did not previously reject unsupported parameters.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>