Commits · 0995210753a26c4fa1a3d8c63cc230e22a8537cd · Kirill Smelkov / linux

08 Jan, 2018 40 commits

netfilter: flow table support for IPv6 · 09952107

Pablo Neira Ayuso authored Jan 07, 2018

This patch adds the IPv6 flow table type, which implements the datapath
flow table to forward IPv6 traffic.

This patch also exports ip6_dst_mtu_forward(), which is required to
check the MTU so that packets that need PMTUD handling are passed up to
the classic forwarding path.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: flow table support for IPv4 · 97add9f0

Pablo Neira Ayuso authored Jan 07, 2018

This patch adds the IPv4 flow table type, which implements the datapath
flow table to forward IPv4 traffic. The rationale is:

1) Look up the packet in the flow table from the ingress hook.
2) If there's a hit, decrement the TTL and pass the packet on to the
   neighbour layer for transmission.
3) If there's a miss, the packet is passed up to the classic forwarding
   path.
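The three steps can be sketched in userspace C as follows. This is only
an illustration under stated assumptions: every type and function name
here (struct flow_table, flow_ingress(), the verdict values) is
hypothetical, a linear scan stands in for the real hash lookup, and the
TTL <= 1 fallback is an assumption of this sketch rather than something
stated in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the kernel's structures. */
struct flow_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t proto;
};

struct flow_entry {
	struct flow_key key;
	int in_use;
};

#define FLOW_TABLE_SLOTS 8

struct flow_table {
	struct flow_entry slots[FLOW_TABLE_SLOTS];
};

struct packet {
	struct flow_key key;
	uint8_t ttl;
};

enum verdict { VERDICT_FAST_XMIT, VERDICT_CLASSIC_PATH };

static int flow_key_equal(const struct flow_key *a, const struct flow_key *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->sport == b->sport && a->dport == b->dport &&
	       a->proto == b->proto;
}

/* Step 1: look the packet up (linear scan instead of a hash table). */
static struct flow_entry *flow_lookup(struct flow_table *ft,
				      const struct flow_key *key)
{
	size_t i;

	for (i = 0; i < FLOW_TABLE_SLOTS; i++)
		if (ft->slots[i].in_use &&
		    flow_key_equal(&ft->slots[i].key, key))
			return &ft->slots[i];
	return NULL;
}

/* Returns 0 on success, -1 if the table is full. */
static int flow_add(struct flow_table *ft, const struct flow_key *key)
{
	size_t i;

	for (i = 0; i < FLOW_TABLE_SLOTS; i++) {
		if (!ft->slots[i].in_use) {
			ft->slots[i].key = *key;
			ft->slots[i].in_use = 1;
			return 0;
		}
	}
	return -1;
}

static enum verdict flow_ingress(struct flow_table *ft, struct packet *pkt)
{
	assert(ft != NULL && pkt != NULL);

	/* Step 3: on a miss, hand the packet to the classic path. */
	if (!flow_lookup(ft, &pkt->key))
		return VERDICT_CLASSIC_PATH;

	/* Assumption of this sketch: let the classic path handle TTL expiry. */
	if (pkt->ttl <= 1)
		return VERDICT_CLASSIC_PATH;

	/* Step 2: on a hit, decrement the TTL and transmit directly. */
	pkt->ttl--;
	return VERDICT_FAST_XMIT;
}
```

A packet whose key has not been added with flow_add() is left untouched
and returns VERDICT_CLASSIC_PATH; once the key is present, the same
packet returns VERDICT_FAST_XMIT with its TTL reduced by one.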

This patch also supports layer 3 source and destination NAT.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: add generic flow table infrastructure · ac2a6666

Pablo Neira Ayuso authored Jan 07, 2018

This patch defines the API to interact with flow tables, which allows
adding, deleting and looking up entries in the flow table. It also adds
the generic garbage collection code that removes entries that have
expired, i.e. no traffic has been seen for a while.

Users of the flow table infrastructure can delete entries via
flow_offload_dead(), which sets the dying bit; this signals the garbage
collector to release the entry from user context.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
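
The interplay between the dying bit and the garbage collector can be
sketched as below. This is a userspace illustration, not the kernel
code: struct flow, FLOW_F_DYING, FLOW_TIMEOUT, flow_mark_dead() and
flow_gc_run() are hypothetical names, and the timeout value is an
arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FLOW_TIMEOUT 30UL	/* hypothetical idle timeout, in ticks */
#define FLOW_F_DYING 0x1u	/* hypothetical "dying" flag bit */

struct flow {
	unsigned long last_seen;	/* tick of the last packet seen */
	unsigned int flags;
	bool present;			/* still stored in the table? */
};

/* What a user of the table calls to request deletion of an entry. */
static void flow_mark_dead(struct flow *f)
{
	assert(f != NULL);
	f->flags |= FLOW_F_DYING;
}

static bool flow_has_expired(const struct flow *f, unsigned long now)
{
	return now - f->last_seen > FLOW_TIMEOUT;
}

/*
 * One garbage collector pass: release every entry that is either
 * marked dying or has been idle for too long. Returns the number of
 * entries released.
 */
static size_t flow_gc_run(struct flow *flows, size_t n, unsigned long now)
{
	size_t i, released = 0;

	for (i = 0; i < n; i++) {
		if (!flows[i].present)
			continue;
		if ((flows[i].flags & FLOW_F_DYING) ||
		    flow_has_expired(&flows[i], now)) {
			flows[i].present = false;
			released++;
		}
	}
	return released;
}
```

The point of the design is that flow_mark_dead() only flips a bit; the
actual release happens later, in one place, when the collector runs.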

netfilter: nf_tables: add flow table netlink frontend · 3b49e2e9

Pablo Neira Ayuso authored Jan 07, 2018

This patch introduces a netlink control plane to create, delete and dump
flow tables. Flow tables are identified by name; rules use this name to
refer to a specific flow table. Flow tables use the rhashtable class and
a generic garbage collector to remove expired entries.

This also adds the infrastructure to add different flow table types, so
we can add one for each layer 3 protocol family.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: add IPS_OFFLOAD status bit · 90964016

Pablo Neira Ayuso authored Jan 07, 2018

This new bit tells us that the conntrack entry is owned by the flow
table offload infrastructure.

 # cat /proc/net/nf_conntrack
 ipv4     2 tcp      6 src=10.141.10.2 dst=147.75.205.195 sport=36392 dport=443 src=147.75.205.195 dst=192.168.2.195 sport=443 dport=36392 [OFFLOAD] mark=0 zone=0 use=2

Note the [OFFLOAD] tag in the listing.

The timer of such conntrack entries looks stopped from userspace.
In practice, to make sure the conntrack entry does not go away, the
conntrack timer is periodically set to an arbitrarily large value that
gets refreshed on every iteration of the garbage collector, so it
never expires. These entries also display no internal state in the case
of TCP flows. This allows us to save a bit check in the packet path via
nf_ct_is_expired().

Conntrack entries that have been offloaded to the flow table
infrastructure cannot be deleted/flushed via ctnetlink. The flow table
infrastructure is also responsible for releasing this conntrack entry.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: remove nft_dereference() · 0befd061

Pablo Neira Ayuso authored Jan 02, 2018

This macro is unnecessary; it just hides details for a single caller.
nfnl_dereference() is enough.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: remove defensive check on malformed packets from raw sockets · a7f87b47

Pablo Neira Ayuso authored Dec 30, 2017

Users cannot forge malformed IPv4/IPv6 headers via raw sockets that they
can inject into the stack. Specifically, not for IPv4 since 55888dfb
("AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl
(v2)"). IPv6 raw sockets also ensure that packets have a well-formed
IPv6 header available in the skbuff.

At a quick glance, br_netfilter also validates layer 3 headers and
drops both malformed IPv4 and IPv6 packets.

Therefore, let's remove this defensive check all over the place.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: meta: secpath support · f6931f5f

Florian Westphal authored Dec 06, 2017

Replacement for iptables "-m policy --dir in --policy {ipsec,none}".
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: remove struct nf_afinfo and its helper functions · b3a61254

Pablo Neira Ayuso authored Dec 09, 2017

This abstraction has no clients anymore, remove it.

This is what remains from previous authors, so correct copyright
statement after recent modifications and code removal.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: remove route_key_size field in struct nf_afinfo · 46435623

Pablo Neira Ayuso authored Nov 27, 2017

This is only needed by nf_queue, place this code where it belongs.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: move reroute indirection to struct nf_ipv6_ops · ce388f45

Pablo Neira Ayuso authored Nov 27, 2017

We cannot make a direct call to nf_ip6_reroute() because that would result
in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define the reroute indirection in nf_ipv6_ops, where it
really belongs.

For IPv4, we can indeed make a direct function call, which is faster,
given that IPv4 is built into the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define an empty
inline stub for IPv4 in that case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
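
The difference between the direct IPv4 call and the IPv6 indirection
can be sketched in userspace C as below. This is a hedged illustration
only: struct v6_ops, registered_v6_ops, v6_ops_register() and
REROUTE_UNAVAILABLE are hypothetical names that merely mimic the idea
of an ops table filled in when the IPv6 module loads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ops table, loosely modelled on the idea of nf_ipv6_ops. */
struct v6_ops {
	int (*reroute)(int pkt_id);
};

/* NULL until the (simulated) IPv6 module registers itself. */
static const struct v6_ops *registered_v6_ops;

#define REROUTE_UNAVAILABLE (-1)	/* hypothetical error value */

static void v6_ops_register(const struct v6_ops *ops)
{
	assert(ops != NULL);
	registered_v6_ops = ops;
}

/* IPv4 is assumed to be built in, so core code may call this directly. */
static int v4_reroute(int pkt_id)
{
	return pkt_id;
}

/* Would live in the IPv6 module; core code never names this symbol. */
static int v6_reroute_impl(int pkt_id)
{
	return pkt_id;
}

static const struct v6_ops v6_module_ops = {
	.reroute = v6_reroute_impl,
};

static int reroute(int is_ipv6, int pkt_id)
{
	const struct v6_ops *ops;

	if (!is_ipv6)
		return v4_reroute(pkt_id);	/* direct call */

	/* Indirect call: no link-time dependency on the IPv6 code. */
	ops = registered_v6_ops;
	if (!ops || !ops->reroute)
		return REROUTE_UNAVAILABLE;
	return ops->reroute(pkt_id);
}
```

Because the core only dereferences a pointer, it carries no symbol
reference to the IPv6 implementation; the IPv6 side becomes reachable
only after v6_ops_register(&v6_module_ops) has run.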

netfilter: move route indirection to struct nf_ipv6_ops · 3f87c08c

Pablo Neira Ayuso authored Nov 27, 2017

We cannot make a direct call to nf_ip6_route() because that would result
in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define the route indirection in nf_ipv6_ops, where it really
belongs.

For IPv4, we can indeed make a direct function call, which is faster,
given that IPv4 is built into the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define an empty
inline stub for IPv4 in that case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: remove saveroute indirection in struct nf_afinfo · 7db9a51e

Pablo Neira Ayuso authored Dec 20, 2017

This is only used by nf_queue.c and this function comes with no symbol
dependencies with IPv6, it just refers to structure layouts. Therefore,
we can replace it by a direct function call from where it belongs.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: move checksum_partial indirection to struct nf_ipv6_ops · f7dcbe2f

Pablo Neira Ayuso authored Dec 20, 2017

We cannot make a direct call to nf_ip6_checksum_partial() because that
would result in autoloading the 'ipv6' module because of symbol
dependencies.  Therefore, define the checksum_partial indirection in
nf_ipv6_ops, where it really belongs.

For IPv4, we can indeed make a direct function call, which is faster,
given that IPv4 is built into the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define an empty
inline stub for IPv4 in that case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: move checksum indirection to struct nf_ipv6_ops · ef71fe27

Pablo Neira Ayuso authored Nov 27, 2017

We cannot make a direct call to nf_ip6_checksum() because that would
result in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define the checksum indirection in nf_ipv6_ops, where it
really belongs.

For IPv4, we can indeed make a direct function call, which is faster,
given that IPv4 is built into the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define an empty
inline stub for IPv4 in that case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: connlimit: split xt_connlimit into front and backend · 625c5561

Florian Westphal authored Dec 09, 2017

This allows reusing the xt_connlimit infrastructure from nf_tables.
The upcoming nf_tables frontend can just pass in an nftables register
as input key; this allows limiting by any nft-supported key, including
concatenations.

For xt_connlimit, pass in the zone and the ip/ipv6 address.

With help from Yi-Hung Wei.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: remove hooks from family definition · c2f9eafe

Pablo Neira Ayuso authored Dec 09, 2017

They don't belong to the family definition, move them to the filter
chain type definition instead.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: remove multihook chains and families · c974a3a3

Pablo Neira Ayuso authored Dec 09, 2017

Since NFPROTO_INET is handled from the core, we don't need to maintain
extra infrastructure in nf_tables to handle the double hook
registration, one for IPv4 and another for IPv6.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables_inet: don't use multihook infrastructure anymore · 12355d36

Pablo Neira Ayuso authored Dec 09, 2017

Use new native NFPROTO_INET support in netfilter core, this gets rid of
ad-hoc code in the nf_tables API codebase.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: core: support for NFPROTO_INET hook registration · cb7ccd83

Pablo Neira Ayuso authored Dec 09, 2017

Expand NFPROTO_INET in two hook registrations, one for NFPROTO_IPV4 and
another for NFPROTO_IPV6. Hence, we handle NFPROTO_INET from the core.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
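
Expanding one combined-family registration into two per-family
registrations can be sketched as follows. This is a userspace
illustration under assumptions: the enum values, struct hook_registry
and the MAX_HOOKS_PER_FAMILY limit are hypothetical, and the rollback
of the IPv4 registration when the IPv6 one fails is a design choice of
this sketch.

```c
#include <assert.h>

#define MAX_HOOKS_PER_FAMILY 4	/* hypothetical limit, makes failure testable */

enum family { FAMILY_IPV4, FAMILY_IPV6, FAMILY_INET };

struct hook_list {
	int count;
};

struct hook_registry {
	struct hook_list ipv4;
	struct hook_list ipv6;
};

static int register_one(struct hook_list *list)
{
	if (list->count >= MAX_HOOKS_PER_FAMILY)
		return -1;
	list->count++;
	return 0;
}

static void unregister_one(struct hook_list *list)
{
	assert(list->count > 0);
	list->count--;
}

/* Returns 0 on success, -1 on failure. */
static int register_hook(struct hook_registry *reg, enum family pf)
{
	switch (pf) {
	case FAMILY_IPV4:
		return register_one(&reg->ipv4);
	case FAMILY_IPV6:
		return register_one(&reg->ipv6);
	case FAMILY_INET:
		/* Expand into one registration per real family. */
		if (register_one(&reg->ipv4) < 0)
			return -1;
		if (register_one(&reg->ipv6) < 0) {
			/* Undo the first half so the call is all-or-nothing. */
			unregister_one(&reg->ipv4);
			return -1;
		}
		return 0;
	}
	return -1;
}
```

With the expansion done in one central place, callers that want both
families issue a single FAMILY_INET registration and need no
double-hook bookkeeping of their own.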

netfilter: core: pass family as parameter to nf_remove_net_hook() · 30259408

Pablo Neira Ayuso authored Dec 09, 2017

So static_key_slow_dec applies to the family behind NFPROTO_INET.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: core: pass hook number, family and device to nf_find_hook_list() · 62a0fe46

Pablo Neira Ayuso authored Dec 09, 2017

Pass these instead of struct nf_hook_ops. This is needed by follow-up
patches to handle NFPROTO_INET from the core.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: core: add nf_remove_net_hook · 3d3cdc38

Pablo Neira Ayuso authored Dec 09, 2017

Just a cleanup, __nf_unregister_net_hook() is used by a follow up patch
when handling NFPROTO_INET as a real family from the core.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add nft_set_is_anonymous() helper · 408070d6

Pablo Neira Ayuso authored Nov 24, 2017

Add helper function to test for the NFT_SET_ANONYMOUS flag.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: explicit nft_set_pktinfo() call from hook path · 7a4473a3

Pablo Neira Ayuso authored Dec 10, 2017

Call this function explicitly from the hook path instead of from the
family-specific variant. This reduces the code size in the fast path
for the netdev, bridge and inet families. After this change, we must
call nft_set_pktinfo() upfront from the chain hook indirection.

Before:

   text    data     bss     dec     hex filename
   2145     208       0    2353     931 net/netfilter/nf_tables_netdev.o

After:

   text    data     bss     dec     hex filename
   2125     208       0    2333     91d net/netfilter/nf_tables_netdev.o
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables_arp: don't set forward chain · fa45a760

Pablo Neira Ayuso authored Dec 10, 2017

46928a0b49f3 ("netfilter: nf_tables: remove multihook chains and
families") already removed this, this is a leftover.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: reject nat hook registration if prio is before conntrack · 84ba7dd7

Florian Westphal authored Dec 08, 2017

No problem for iptables as priorities are fixed values defined in the
nat modules, but in nftables the priority comes from userspace.

Reject in case we see that such a hook would not work.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: core: only allow one nat hook per hook point · f92b40a8

Florian Westphal authored Dec 08, 2017

The netfilter NAT core cannot deal with more than one NAT hook per hook
location (prerouting, input ...), because the NAT hooks install a NAT null
binding in case the iptables nat table (iptable_nat hooks) or the
corresponding nftables chain (nft nat hooks) doesn't specify a nat
transformation.

Null bindings are needed to detect port collisions between NAT-ed and
non-NAT-ed connections.

This causes nftables NAT rules to not work when iptable_nat module is
loaded, and vice versa because nat binding has already been attached
when the second nat hook is consulted.

The netfilter core is not really the correct location to handle this
(hooks are just hooks, the core has no notion of what kinds of side
effects a hook implements), but it's the only place where we can check
for conflicts between both iptables hooks and nftables hooks without
adding dependencies.

So add a nat annotation to hook_ops to describe those hooks that will
add NAT bindings, and then make the core reject registration if such a
hook already exists. The annotation fills a padding hole; in case
further restrictions appear we might change this to a 'u8 type' instead
of bool.
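
The rejection rule can be sketched in userspace C as below. This is an
illustration under assumptions: struct hook_ops here is a reduced,
hypothetical version with only a hook number and a nat_hook flag, the
array sizes are arbitrary, and -EEXIST is chosen because it is the
errno that userspace renders as "File exists".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_HOOKPOINTS 5	/* hypothetical: prerouting, input, ... */
#define MAX_HOOKS 8		/* hypothetical per-hook-point capacity */

/* Reduced, hypothetical version of a hook registration record. */
struct hook_ops {
	int hooknum;
	bool nat_hook;	/* true if this hook installs NAT bindings */
};

struct hook_point {
	struct hook_ops ops[MAX_HOOKS];
	int count;
};

/* Returns 0 on success or a negative errno value. */
static int register_hook(struct hook_point *points, const struct hook_ops *reg)
{
	struct hook_point *hp;
	int i;

	assert(points != NULL && reg != NULL);

	if (reg->hooknum < 0 || reg->hooknum >= NUM_HOOKPOINTS)
		return -EINVAL;

	hp = &points[reg->hooknum];
	if (hp->count >= MAX_HOOKS)
		return -ENOSPC;

	/* Only one NAT hook may exist per hook point. */
	if (reg->nat_hook) {
		for (i = 0; i < hp->count; i++)
			if (hp->ops[i].nat_hook)
				return -EEXIST;
	}

	hp->ops[hp->count++] = *reg;
	return 0;
}
```

Non-NAT hooks are unaffected by the check, and a second NAT hook is
still accepted at a different hook point.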

iptables error if nft nat hook active:
iptables -t nat -A POSTROUTING -j MASQUERADE
iptables v1.4.21: can't initialize iptables table `nat': File exists
Perhaps iptables or your kernel needs to be upgraded.

nftables error if iptables nat table present:
nft -f /etc/nftables/ipv4-nat
/usr/etc/nftables/ipv4-nat:3:1-2: Error: Could not process rule: File exists
table nat {
^^
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: xtables: add and use xt_request_find_table_lock · 03d13b68

Florian Westphal authored Dec 08, 2017

Currently we always return -ENOENT to userspace if we can't find
a particular table, or if the table initialization fails.

A follow-up patch will make nat table init fail if nftables has already
registered a nat hook, so this change makes xt_find_table_lock return
an ERR_PTR that carries the errno value reported by the table init
function.

Add xt_request_find_table_lock as try_then_request_module replacement
and use it where needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
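
The pattern of returning an errno value inside a pointer can be
sketched in userspace C as follows. ERR_PTR(), PTR_ERR() and IS_ERR()
are re-implemented here for illustration after the kernel helpers of
the same names; struct table, find_table() and its nat_hook_taken
parameter are hypothetical and do not reflect the real xtables
signatures.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MAX_ERRNO 4095

/* Userspace re-implementations of the kernel's error-pointer helpers. */
static void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for error codes. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical table object. */
struct table {
	const char *name;
};

static struct table nat_table = { "nat" };

/*
 * Hypothetical lookup: nat_hook_taken simulates a table init failure
 * because another subsystem already owns the NAT hook.
 */
static struct table *find_table(const char *name, int nat_hook_taken)
{
	assert(name != NULL);

	if (strcmp(name, "nat") != 0)
		return ERR_PTR(-ENOENT);	/* no such table */
	if (nat_hook_taken)
		return ERR_PTR(-EEXIST);	/* init failed: report why */
	return &nat_table;
}
```

A caller checks IS_ERR() on the result and forwards PTR_ERR() to
userspace, so a failed init is no longer collapsed into -ENOENT.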

netfilter: reduce NF_MAX_HOOKS define · 256d94ba

Florian Westphal authored Dec 07, 2017

This can be the same as NF_INET_NUMHOOKS if we don't support DECNET.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: don't allocate space for arp/bridge hooks unless needed · 2a95183a

Florian Westphal authored Dec 07, 2017

No need to define hook points if the family isn't supported.
Because we need these hooks for either nftables, arp/ebtables
or the 'call-iptables' hack we have in the bridge layer, add two
new dependencies, NETFILTER_FAMILY_{ARP,BRIDGE}, and have the
users select them.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: don't allocate space for decnet hooks unless needed · bb4badf3

Florian Westphal authored Dec 07, 2017

No need to define hook points if the family isn't supported.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: reduce hook array sizes to what is needed · ef57170b

Florian Westphal authored Dec 07, 2017

Not all families share the same hook count, adjust sizes to what is
needed.

struct net before:
/* size: 6592, cachelines: 103, members: 46 */
after:
/* size: 5952, cachelines: 93, members: 46 */
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: add defines for arp/decnet max hooks · e58f33cc

Florian Westphal authored Dec 07, 2017

The kernel already has defines for this, but they are in uapi exposed
headers.

Including these from netns.h causes build errors and also adds unneeded
dependencies on headers that we don't need.

So move these defines to netfilter_defs.h and place the uapi ones
in ifndef __KERNEL__ to keep them for userspace.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: reduce size of hook entry point locations · b0f38338

Florian Westphal authored Dec 03, 2017

struct net contains:

struct nf_hook_entries __rcu *hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS];

which stores the hook entry point locations for the various protocol
families and hooks.

Using an array results in compact C code when doing accesses, i.e.
  x = rcu_dereference(net->nf.hooks[pf][hook]);

but it also wastes a lot of memory, as most families are not used.

So split the array into per-family arrays for the families that are
used, of which there are only 5 (instead of 13).  In most cases, the
'pf' argument is constant, so gcc removes the switch statement.
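
The split into per-family arrays selected by a switch can be sketched
in userspace C as below. This is an illustration under assumptions:
the enum, the struct and field names, and the per-family hook counts
are hypothetical stand-ins, and the RCU annotations are left out.

```c
#include <assert.h>
#include <stddef.h>

struct hook_entries {
	int num_hooks;
};

/* Hypothetical family identifiers; FAM_OTHER stands for an unused family. */
enum family { FAM_IPV4, FAM_IPV6, FAM_ARP, FAM_BRIDGE, FAM_DECNET, FAM_OTHER };

/* Per-family hook counts are illustrative assumptions. */
#define INET_NUMHOOKS 5
#define ARP_NUMHOOKS 3
#define DECNET_NUMHOOKS 7

/* One array per family in use, each sized to what that family needs. */
struct netns_hooks {
	struct hook_entries *ipv4[INET_NUMHOOKS];
	struct hook_entries *ipv6[INET_NUMHOOKS];
	struct hook_entries *arp[ARP_NUMHOOKS];
	struct hook_entries *bridge[INET_NUMHOOKS];
	struct hook_entries *decnet[DECNET_NUMHOOKS];
};

/*
 * Return the slot for (pf, hook), or NULL for an unsupported family
 * or an out-of-range hook number.
 */
static struct hook_entries **hook_slot(struct netns_hooks *nf,
				       enum family pf, unsigned int hook)
{
	assert(nf != NULL);

	switch (pf) {
	case FAM_IPV4:
		return hook < INET_NUMHOOKS ? &nf->ipv4[hook] : NULL;
	case FAM_IPV6:
		return hook < INET_NUMHOOKS ? &nf->ipv6[hook] : NULL;
	case FAM_ARP:
		return hook < ARP_NUMHOOKS ? &nf->arp[hook] : NULL;
	case FAM_BRIDGE:
		return hook < INET_NUMHOOKS ? &nf->bridge[hook] : NULL;
	case FAM_DECNET:
		return hook < DECNET_NUMHOOKS ? &nf->decnet[hook] : NULL;
	default:
		return NULL;
	}
}
```

When the family argument is a compile-time constant at the call site,
a compiler can typically fold the switch down to a single array access,
which is why the extra indirection costs little in practice.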

struct net before:
 /* size: 5184, cachelines: 81, members: 46 */
after:
 /* size: 4672, cachelines: 73, members: 46 */
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: core: free hooks with call_rcu · 8c873e21

Florian Westphal authored Dec 01, 2017

Giuseppe Scrivano says:
  "SELinux, if enabled, registers for each new network namespace 6
    netfilter hooks."

Cost for this is high.  With synchronize_net() removed:
   "The net benefit on an SMP machine with two cores is that creating a
   new network namespace takes -40% of the original time."

This patch replaces synchronize_net+kvfree with call_rcu().
We store rcu_head at the tail of a structure that has no fixed layout,
i.e. we cannot use offsetof() to compute the start of the original
allocation.  Thus store this information right after the rcu head.

We could simplify this by just placing the rcu_head at the start
of struct nf_hook_entries.  However, this structure is used in
packet processing hotpath, so only place what is needed for that
at the beginning of the struct.
Reported-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
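
Placing the callback head at the tail of a variable-sized blob, with
the allocation start stored right after it, can be sketched in
userspace C as below. This is an illustration under assumptions:
struct cb_head stands in for rcu_head, the "deferred" free runs its
callback immediately instead of waiting for a grace period, and all
names (struct entries, struct entries_tail, tail_offset()) are
hypothetical. It uses C11 _Alignof.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for rcu_head: just a callback pointer. */
struct cb_head {
	void (*func)(struct cb_head *head);
};

/* Hot-path data first; the array length varies per allocation. */
struct entries {
	unsigned short num;
	int hooks[];
};

/* Lives after hooks[num]; its offset depends on num. */
struct entries_tail {
	struct cb_head head;
	void *allocation;	/* start of the original allocation */
};

static int freed_blobs;	/* counts completed frees, for demonstration */

static size_t tail_offset(unsigned short num)
{
	size_t off = sizeof(struct entries) + (size_t)num * sizeof(int);
	size_t align = _Alignof(struct entries_tail);

	return (off + align - 1) / align * align;
}

static struct entries *entries_alloc(unsigned short num)
{
	size_t off = tail_offset(num);
	struct entries *e = calloc(1, off + sizeof(struct entries_tail));
	struct entries_tail *tail;

	if (!e)
		return NULL;
	e->num = num;
	tail = (struct entries_tail *)((char *)e + off);
	tail->allocation = e;
	return e;
}

/* The callback only gets the head; recover the blob via the tail. */
static void entries_free_cb(struct cb_head *head)
{
	struct entries_tail *tail = (struct entries_tail *)
		((char *)head - offsetof(struct entries_tail, head));

	free(tail->allocation);
	freed_blobs++;
}

/* Stand-in for call_rcu(): runs the callback right away. */
static void entries_free_deferred(struct entries *e)
{
	struct entries_tail *tail;

	assert(e != NULL);
	tail = (struct entries_tail *)((char *)e + tail_offset(e->num));
	tail->head.func = entries_free_cb;
	tail->head.func(&tail->head);
}
```

The callback cannot use offsetof() on struct entries to get back to
the start, because the distance from the tail to the start depends on
num; the stored allocation pointer removes that dependency.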

netfilter: core: remove synchronize_net call if nfqueue is used · 26888dfd

Florian Westphal authored Dec 01, 2017

Since commit 960632ec ("netfilter: convert hook list to an array")
nfqueue no longer stores a pointer to the hook that caused the packet
to be queued.  Therefore no extra synchronize_net() call is needed after
dropping the packets enqueued by the old rule blob.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: core: make nf_unregister_net_hooks simple wrapper again · 4e645b47

Florian Westphal authored Dec 01, 2017

This reverts commit d3ad2c17
("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls").

Nothing wrong with it.  However, a follow-up patch will delay freeing of
hooks with call_rcu, so all synchronize_net() calls become obsolete and
there is no need anymore for this batching.

This revert causes a temporary performance degradation when destroying
a network namespace, but it's resolved with the upcoming call_rcu
conversion.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack_h323: Remove unwanted comments. · ca9b0147

Varsha Rao authored Nov 30, 2017

Change old multi-line comment style to kernel comment style and
remove unwanted comments.
Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: ipset: add resched points during set listing · a778a15f

Florian Westphal authored Nov 30, 2017

When sets are extremely large we can get a softlockup during ipset -L.
We could fix this by adding cond_resched_rcu() at the right location
during iteration, but this only works if the RCU nesting depth is 1.

At this time the entire variant->list() is called under rcu_read_lock_bh.
This used to be a read_lock_bh() but as rcu doesn't really lock anything,
it does not appear to be needed, so remove it (ipset increments the set
reference count before this, so a set deletion should not be possible).
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
