Commit 08dbc7a6 authored by Alexei Starovoitov

Merge branch 'AF_XDP-initial-support'

Björn Töpel says:

====================
This patch set introduces a new address family called AF_XDP that is
optimized for high performance packet processing and, in upcoming
patch sets, zero-copy semantics. In this patch set, we have removed
all zero-copy related code in order to make it smaller, simpler and
hopefully more review friendly. This patch set only supports copy-mode
for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
for RX using the XDP_DRV path. Zero-copy support requires XDP and
driver changes that Jesper Dangaard Brouer is working on. Some of his
work has already been accepted. We will publish our zero-copy support
for RX and TX on top of his patch sets at a later point in time.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two queues: the RX queue and the
TX queue. A socket can receive packets on the RX queue and it can send
packets on the TX queue. These queues are registered and sized with
the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
mandatory to have at least one of these queues for each socket. In
contrast to AF_PACKET V2/V3 these descriptor queues are separated from
packet buffers. An RX or TX descriptor points to a data buffer in a
memory area called a UMEM. RX and TX can share the same UMEM so that a
packet does not have to be copied between RX and TX. Moreover, if a
packet needs to be kept for a while due to a possible retransmit, the
descriptor that points to that packet can be changed to point to
another and reused right away. This again avoids copying data.
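
To illustrate why separating descriptors from packet buffers avoids
copies, here is a minimal user-space sketch. The struct mirrors the
layout of struct xdp_desc as declared in this patch set; the names
xdp_desc_model and l2fwd_desc are hypothetical and only serve the
example. Forwarding a packet amounts to copying a 16 byte descriptor
that references the same UMEM frame, not the payload itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of struct xdp_desc from this patch set (hypothetical name). */
struct xdp_desc_model {
	uint32_t idx;        /* frame id inside the UMEM */
	uint32_t len;        /* packet length in bytes */
	uint16_t offset;     /* start of packet data within the frame */
	uint8_t  flags;
	uint8_t  padding[5];
};

_Static_assert(sizeof(struct xdp_desc_model) == 16,
	       "a descriptor is expected to occupy 16 bytes");

/* Build a TX descriptor that points at the frame an RX descriptor
 * delivered. Only the descriptor is copied; the packet data stays
 * where it is in the shared UMEM.
 */
static struct xdp_desc_model l2fwd_desc(const struct xdp_desc_model *rx)
{
	struct xdp_desc_model tx = { 0 };

	assert(rx != NULL);
	tx.idx = rx->idx;
	tx.len = rx->len;
	tx.offset = rx->offset;
	return tx;
}
```

An application would place the returned descriptor on its TX queue
and wait for the frame id to show up in the completion queue before
reusing the frame.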

This new dedicated packet buffer area is called a UMEM. It consists of a
number of equally sized frames and each frame has a unique frame id. A
descriptor in one of the queues references a frame by referencing its
frame id. The user space allocates memory for this UMEM using whatever
means it feels is most appropriate (malloc, mmap, huge pages,
etc). This memory area is then registered with the kernel using the new
setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
and the COMPLETION queue. The fill queue is used by the application to
send down frame ids for the kernel to fill in with RX packet
data. References to these frames will then appear in the RX queue of
the XSK once they have been received. The completion queue, on the
other hand, contains frame ids that the kernel has transmitted
completely and can now be used again by user space, for either TX or
RX. Thus, the frame ids appearing in the completion queue are ids that
were previously transmitted using the TX queue. In summary, the RX and
FILL queues are used for the RX path and the TX and COMPLETION queues
are used for the TX path.
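
The FILL and COMPLETION queues described above can be modelled in
user space as a ring of frame ids with one producer index and one
consumer index. The following is a sketch under stated assumptions,
not the kernel implementation: RING_SIZE, frame_ring, ring_produce,
ring_consume and umem_frame_addr are hypothetical names, and C11
atomics stand in for whatever barriers the real rings use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical size; must be a power of two */

_Static_assert((RING_SIZE & (RING_SIZE - 1)) == 0,
	       "RING_SIZE must be a power of two");

/* Model of a FILL or COMPLETION ring carrying frame ids. */
struct frame_ring {
	_Atomic uint32_t producer; /* written only by the producer */
	_Atomic uint32_t consumer; /* written only by the consumer */
	uint32_t desc[RING_SIZE];
};

/* Producer side: returns false when the ring is full. */
static bool ring_produce(struct frame_ring *r, uint32_t frame_id)
{
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

	if (prod - cons == RING_SIZE)
		return false;
	r->desc[prod & (RING_SIZE - 1)] = frame_id;
	/* Publish the entry only after it has been written. */
	atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
	return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool ring_consume(struct frame_ring *r, uint32_t *frame_id)
{
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

	if (cons == prod)
		return false;
	*frame_id = r->desc[cons & (RING_SIZE - 1)];
	/* Hand the slot back only after the entry has been read. */
	atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
	return true;
}

/* Start address of a frame, or NULL if the frame id is out of range. */
static void *umem_frame_addr(void *base, size_t nframes, size_t frame_size,
			     uint32_t frame_id)
{
	assert(frame_size != 0);
	if (frame_id >= nframes)
		return NULL;
	return (char *)base + (size_t)frame_id * frame_size;
}
```

For the RX path the application would be the producer of the fill
ring and the kernel the consumer; for the completion ring the roles
are reversed.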

The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow. Note that in this patch set,
all packet data is copied out to user-space.

A new feature in this patch set is that the UMEM can be shared between
processes, if desired. If a process wants to do this, it simply skips
the registration of the UMEM and its corresponding two queues, sets a
flag in the bind call and submits the XSK of the process it would like
to share UMEM with as well as its own newly created XSK socket. The
new process will then receive frame id references in its own RX queue
that point to this shared UMEM. Note that since the queue structures
are single-consumer / single-producer (for performance reasons), the
new process has to create its own socket with associated RX and TX
queues, since it cannot share this with the other process. This is
also the reason that there is only one set of FILL and COMPLETION
queues per UMEM. It is the responsibility of a single process to
handle the UMEM. If multiple-producer / multiple-consumer queues are
implemented in the future, this requirement could be relaxed.
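
As an illustration of the bind step with and without a shared UMEM,
the sketch below fills in an address structure. It assumes the field
layout of struct sockaddr_xdp and the values of AF_XDP and
XDP_SHARED_UMEM as declared in this patch set; the layout in released
kernels may differ. The names sockaddr_xdp_v3, MODEL_AF_XDP,
MODEL_XDP_SHARED_UMEM and xsk_bind_addr are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_AF_XDP          44 /* value of AF_XDP in this patch set */
#define MODEL_XDP_SHARED_UMEM 1  /* sxdp_flags option in this patch set */

/* Model of struct sockaddr_xdp as declared by this patch set. */
struct sockaddr_xdp_v3 {
	uint16_t sxdp_family;
	uint32_t sxdp_ifindex;
	uint32_t sxdp_queue_id;
	uint32_t sxdp_shared_umem_fd;
	uint16_t sxdp_flags;
};

/* Build the address for bind(). Pass shared_umem_fd < 0 for a socket
 * that registered its own UMEM, or the fd of the XSK whose UMEM
 * should be shared.
 */
static struct sockaddr_xdp_v3 xsk_bind_addr(uint32_t ifindex,
					    uint32_t queue_id,
					    int shared_umem_fd)
{
	struct sockaddr_xdp_v3 sa;

	assert(ifindex != 0); /* interface index 0 is not a valid device */
	memset(&sa, 0, sizeof(sa));
	sa.sxdp_family = MODEL_AF_XDP;
	sa.sxdp_ifindex = ifindex;
	sa.sxdp_queue_id = queue_id;
	if (shared_umem_fd >= 0) {
		sa.sxdp_flags = MODEL_XDP_SHARED_UMEM;
		sa.sxdp_shared_umem_fd = (uint32_t)shared_umem_fd;
	}
	return sa;
}
```

A real application would pass the resulting structure, cast to
struct sockaddr *, to bind() on its own newly created XSK.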

How are packets then distributed between these two XSKs? We have
introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
full). The user-space application can place an XSK at an arbitrary
place in this map. The XDP program can then redirect a packet to a
specific index in this map and at this point XDP validates that the
XSK in that map was indeed bound to that device and queue number. If
not, the packet is dropped. If the map is empty at that index, the
packet is also dropped. This also means that it is currently mandatory
to have an XDP program loaded (and one XSK in the XSKMAP) to be able
to get any traffic to user space through the XSK.
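
The drop rules above can be summarised with a small user-space
model. This is only a sketch of the described behaviour, not the
kernel code path: xsk_model, XSKMAP_ENTRIES, the verdict enum and
xskmap_redirect are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XSKMAP_ENTRIES 4u /* hypothetical map size */

/* The device and queue an XSK was bound to. */
struct xsk_model {
	uint32_t ifindex;
	uint32_t queue_id;
};

enum verdict { VERDICT_DROP, VERDICT_DELIVER };

/* Decide what happens to a packet that arrived on (rx_ifindex,
 * rx_queue_id) and was redirected to slot 'index' of the map.
 */
static enum verdict xskmap_redirect(struct xsk_model *const *map,
				    uint32_t index,
				    uint32_t rx_ifindex,
				    uint32_t rx_queue_id)
{
	const struct xsk_model *xs;

	assert(map != NULL);
	if (index >= XSKMAP_ENTRIES)
		return VERDICT_DROP;    /* index outside the map */
	xs = map[index];
	if (!xs)
		return VERDICT_DROP;    /* empty slot */
	if (xs->ifindex != rx_ifindex || xs->queue_id != rx_queue_id)
		return VERDICT_DROP;    /* XSK bound to another device/queue */
	return VERDICT_DELIVER;
}
```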

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed. It uses SKBs
together with the generic XDP support and copies out the data to user
space, and serves as a fallback mode that works for any network
device. On the other hand, if the driver has support for XDP, it will
be used by the AF_XDP code to provide better performance, but there is
still a copy of the data into user space.

There is an xdpsock benchmarking/test application included that
demonstrates how to use AF_XDP sockets with both private and shared
UMEMs. Say that you would like your UDP traffic from port 4242 to end
up in queue 16, on which we will enable AF_XDP. Here, we use ethtool
for this:

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode can then be done
using:

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

For XDP_SKB mode, use the switch "-S" instead of "-N". All options
can be displayed with "-h", as usual.

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores which gives a total of 28, but only two cores are used in these
experiments: one for TX/RX and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
8192MB and with 8 of those DIMMs in the system we have 64 GB of total
memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
NIC is Intel I40E 40Gbit/s using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64
and 1500 byte packets, generated by a commercial packet generator HW
outputting packets at full 40 Gbit/s line rate. The results are without
retpoline so that we can compare against previous numbers. With
retpoline, the AF_XDP numbers drop by 10 to 15 percent.

AF_XDP performance 64 byte packets. Results from V2 in parentheses.
Benchmark   XDP_SKB    XDP_DRV
rxdrop      2.9(3.0)   9.6(9.5)
txpush      2.6(2.5)   NA*
l2fwd       1.9(1.9)   2.5(2.5) (TX using XDP_SKB in both cases)

AF_XDP performance 1500 byte packets:
Benchmark   XDP_SKB    XDP_DRV
rxdrop      2.1(2.2)   3.3(3.3)
l2fwd       1.4(1.4)   1.8(1.8) (TX using XDP_SKB in both cases)

* NA since we have no support for TX using the XDP_DRV infrastructure
  in this patch set. This is for a future patch set since it involves
  changes to the XDP NDOs. Some of this has been upstreamed by Jesper
  Dangaard Brouer.

XDP performance on our system as a baseline:

64 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      32.3(32.9)M  0

1500 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      3.3(3.3)M    0
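
To put the Mpps figures above in context, the theoretical packet
rate of a 40 Gbit/s link can be estimated as follows. This is a
back-of-the-envelope sketch that assumes the quoted packet sizes are
Ethernet frame sizes and that each frame costs an extra 20 bytes on
the wire (preamble, start-of-frame delimiter and inter-frame gap);
line_rate_mpps is a hypothetical helper name.

```c
#include <assert.h>

/* 7 bytes preamble + 1 byte SFD + 12 bytes inter-frame gap */
#define ETH_WIRE_OVERHEAD_BYTES 20.0

/* Maximum packet rate in Mpps for a given link speed and frame size. */
static double line_rate_mpps(double link_bits_per_sec, double frame_bytes)
{
	assert(frame_bytes > 0.0);
	return link_bits_per_sec /
	       ((frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8.0) / 1e6;
}
```

Under these assumptions the limit is roughly 59.5 Mpps for 64 byte
frames and roughly 3.29 Mpps for 1500 byte frames, which suggests
that the 3.3 Mpps results for 1500 byte packets are bounded by the
link rather than by the software path.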

Changes from V2:

* Fixed a race in XSKMAP map found by Will. The code has been
  completely rearchitected and is now simpler, faster, and hopefully
  also not racy. Please review and check if it holds.

If you would like to diff V2 against V3, you can find them here:
https://github.com/bjoto/linux/tree/af-xdp-v2-on-bpf-next
https://github.com/bjoto/linux/tree/af-xdp-v3-on-bpf-next

The structure of the patch set is as follows:

Patches 1-3: Basic socket and umem plumbing
Patches 4-9: RX support together with the new XSKMAP
Patches 10-13: TX support
Patch 14: Statistics support with getsockopt()
Patch 15: Sample application

We based this patch set on bpf-next commit a3fe1f6f ("tools:
bpftool: change time format for program 'loaded at:' information")

To do for this patch set:

* Syzkaller torture session being worked on

Post-series plan:

* Optimize performance

* Kernel selftest

* Loadable kernel module support for AF_XDP would be nice. It is unclear
  how to achieve this though, since our XDP code depends on net/core.

* Support for AF_XDP sockets without an XDP program loaded. In this
  case all the traffic on a queue should go up to the user space socket.

* Daniel Borkmann's suggestion for a "copy to XDP socket, and return
  XDP_PASS" for a tcpdump-like functionality.

* And of course getting to zero-copy support in small increments,
  starting with TX then adding RX.

Thanks: Björn and Magnus
====================
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
parents 03f5781b b4b8faa1
......@@ -6,6 +6,7 @@ Contents:
.. toctree::
:maxdepth: 2
af_xdp
batman-adv
can
dpaa2/index
......
......@@ -15424,6 +15424,14 @@ T: git git://linuxtv.org/media_tree.git
S: Maintained
F: drivers/media/tuners/tuner-xc2028.*
XDP SOCKETS (AF_XDP)
M: Björn Töpel <bjorn.topel@intel.com>
M: Magnus Karlsson <magnus.karlsson@intel.com>
L: netdev@vger.kernel.org
S: Maintained
F: kernel/bpf/xskmap.c
F: net/xdp/
XEN BLOCK SUBSYSTEM
M: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
M: Roger Pau Monné <roger.pau@citrix.com>
......
......@@ -676,6 +676,31 @@ static inline int sock_map_prog(struct bpf_map *map,
}
#endif
#if defined(CONFIG_XDP_SOCKETS)
struct xdp_sock;
struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key);
int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
struct xdp_sock *xs);
void __xsk_map_flush(struct bpf_map *map);
#else
struct xdp_sock;
static inline struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map,
u32 key)
{
return NULL;
}
static inline int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
struct xdp_sock *xs)
{
return -EOPNOTSUPP;
}
static inline void __xsk_map_flush(struct bpf_map *map)
{
}
#endif
/* verifier prototypes for helper functions called from eBPF programs */
extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
extern const struct bpf_func_proto bpf_map_update_elem_proto;
......
......@@ -49,4 +49,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
#endif
BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
#if defined(CONFIG_XDP_SOCKETS)
BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
#endif
#endif
......@@ -760,7 +760,7 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
* This does not appear to be a real limitation for existing software.
*/
int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct bpf_prog *prog);
struct xdp_buff *xdp, struct bpf_prog *prog);
int xdp_do_redirect(struct net_device *dev,
struct xdp_buff *xdp,
struct bpf_prog *prog);
......
......@@ -2486,6 +2486,7 @@ void dev_disable_lro(struct net_device *dev);
int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *newskb);
int dev_queue_xmit(struct sk_buff *skb);
int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
int register_netdevice(struct net_device *dev);
void unregister_netdevice_queue(struct net_device *dev, struct list_head *head);
void unregister_netdevice_many(struct list_head *head);
......
......@@ -207,8 +207,9 @@ struct ucred {
* PF_SMC protocol family that
* reuses AF_INET address family
*/
#define AF_XDP 44 /* XDP sockets */
#define AF_MAX 44 /* For now.. */
#define AF_MAX 45 /* For now.. */
/* Protocol families, same as address families. */
#define PF_UNSPEC AF_UNSPEC
......@@ -257,6 +258,7 @@ struct ucred {
#define PF_KCM AF_KCM
#define PF_QIPCRTR AF_QIPCRTR
#define PF_SMC AF_SMC
#define PF_XDP AF_XDP
#define PF_MAX AF_MAX
/* Maximum queue length specifiable by listen. */
......@@ -338,6 +340,7 @@ struct ucred {
#define SOL_NFC 280
#define SOL_KCM 281
#define SOL_TLS 282
#define SOL_XDP 283
/* IPX options */
#define IPX_TYPE 1
......
......@@ -104,6 +104,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
}
void xdp_return_frame(struct xdp_frame *xdpf);
void xdp_return_buff(struct xdp_buff *xdp);
int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
struct net_device *dev, u32 queue_index);
......
/* SPDX-License-Identifier: GPL-2.0
* AF_XDP internal functions
* Copyright(c) 2018 Intel Corporation.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms and conditions of the GNU General Public License,
* version 2, as published by the Free Software Foundation.
*
* This program is distributed in the hope it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
* more details.
*/
#ifndef _LINUX_XDP_SOCK_H
#define _LINUX_XDP_SOCK_H
#include <linux/mutex.h>
#include <net/sock.h>
struct net_device;
struct xsk_queue;
struct xdp_umem;
struct xdp_sock {
/* struct sock must be the first member of struct xdp_sock */
struct sock sk;
struct xsk_queue *rx;
struct net_device *dev;
struct xdp_umem *umem;
struct list_head flush_node;
u16 queue_id;
struct xsk_queue *tx ____cacheline_aligned_in_smp;
/* Protects multiple processes in the control path */
struct mutex mutex;
u64 rx_dropped;
};
struct xdp_buff;
#ifdef CONFIG_XDP_SOCKETS
int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
void xsk_flush(struct xdp_sock *xs);
bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
#else
static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
{
return -ENOTSUPP;
}
static inline int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
{
return -ENOTSUPP;
}
static inline void xsk_flush(struct xdp_sock *xs)
{
}
static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
{
return false;
}
#endif /* CONFIG_XDP_SOCKETS */
#endif /* _LINUX_XDP_SOCK_H */
......@@ -116,6 +116,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_DEVMAP,
BPF_MAP_TYPE_SOCKMAP,
BPF_MAP_TYPE_CPUMAP,
BPF_MAP_TYPE_XSKMAP,
};
enum bpf_prog_type {
......
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
*
* if_xdp: XDP socket user-space interface
* Copyright(c) 2018 Intel Corporation.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms and conditions of the GNU General Public License,
* version 2, as published by the Free Software Foundation.
*
* This program is distributed in the hope it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
* more details.
*
* Author(s): Björn Töpel <bjorn.topel@intel.com>
* Magnus Karlsson <magnus.karlsson@intel.com>
*/
#ifndef _LINUX_IF_XDP_H
#define _LINUX_IF_XDP_H
#include <linux/types.h>
/* Options for the sxdp_flags field */
#define XDP_SHARED_UMEM 1
struct sockaddr_xdp {
__u16 sxdp_family;
__u32 sxdp_ifindex;
__u32 sxdp_queue_id;
__u32 sxdp_shared_umem_fd;
__u16 sxdp_flags;
};
/* XDP socket options */
#define XDP_RX_RING 1
#define XDP_TX_RING 2
#define XDP_UMEM_REG 3
#define XDP_UMEM_FILL_RING 4
#define XDP_UMEM_COMPLETION_RING 5
#define XDP_STATISTICS 6
struct xdp_umem_reg {
__u64 addr; /* Start of packet data area */
__u64 len; /* Length of packet data area */
__u32 frame_size; /* Frame size */
__u32 frame_headroom; /* Frame head room */
};
struct xdp_statistics {
__u64 rx_dropped; /* Dropped for reasons other than invalid desc */
__u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
__u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
};
/* Pgoff for mmaping the rings */
#define XDP_PGOFF_RX_RING 0
#define XDP_PGOFF_TX_RING 0x80000000
#define XDP_UMEM_PGOFF_FILL_RING 0x100000000
#define XDP_UMEM_PGOFF_COMPLETION_RING 0x180000000
struct xdp_desc {
__u32 idx;
__u32 len;
__u16 offset;
__u8 flags;
__u8 padding[5];
};
struct xdp_ring {
__u32 producer __attribute__((aligned(64)));
__u32 consumer __attribute__((aligned(64)));
};
/* Used for the RX and TX queues for packets */
struct xdp_rxtx_ring {
struct xdp_ring ptrs;
struct xdp_desc desc[0] __attribute__((aligned(64)));
};
/* Used for the fill and completion queues for buffers */
struct xdp_umem_ring {
struct xdp_ring ptrs;
__u32 desc[0] __attribute__((aligned(64)));
};
#endif /* _LINUX_IF_XDP_H */
......@@ -8,6 +8,9 @@ obj-$(CONFIG_BPF_SYSCALL) += btf.o
ifeq ($(CONFIG_NET),y)
obj-$(CONFIG_BPF_SYSCALL) += devmap.o
obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
ifeq ($(CONFIG_XDP_SOCKETS),y)
obj-$(CONFIG_BPF_SYSCALL) += xskmap.o
endif
obj-$(CONFIG_BPF_SYSCALL) += offload.o
ifeq ($(CONFIG_STREAM_PARSER),y)
ifeq ($(CONFIG_INET),y)
......
......@@ -2070,8 +2070,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
if (func_id != BPF_FUNC_redirect_map)
goto error;
break;
/* Restrict bpf side of cpumap, open when use-cases appear */
/* Restrict bpf side of cpumap and xskmap, open when use-cases
* appear.
*/
case BPF_MAP_TYPE_CPUMAP:
case BPF_MAP_TYPE_XSKMAP:
if (func_id != BPF_FUNC_redirect_map)
goto error;
break;
......@@ -2118,7 +2121,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
break;
case BPF_FUNC_redirect_map:
if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
map->map_type != BPF_MAP_TYPE_CPUMAP)
map->map_type != BPF_MAP_TYPE_CPUMAP &&
map->map_type != BPF_MAP_TYPE_XSKMAP)
goto error;
break;
case BPF_FUNC_sk_redirect_map:
......
// SPDX-License-Identifier: GPL-2.0
/* XSKMAP used for AF_XDP sockets
* Copyright(c) 2018 Intel Corporation.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms and conditions of the GNU General Public License,
* version 2, as published by the Free Software Foundation.
*
* This program is distributed in the hope it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
* more details.
*/
#include <linux/bpf.h>
#include <linux/capability.h>
#include <net/xdp_sock.h>
#include <linux/slab.h>
#include <linux/sched.h>
struct xsk_map {
struct bpf_map map;
struct xdp_sock **xsk_map;
struct list_head __percpu *flush_list;
};
static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
{
int cpu, err = -EINVAL;
struct xsk_map *m;
u64 cost;
if (!capable(CAP_NET_ADMIN))
return ERR_PTR(-EPERM);
if (attr->max_entries == 0 || attr->key_size != 4 ||
attr->value_size != 4 ||
attr->map_flags & ~(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY))
return ERR_PTR(-EINVAL);
m = kzalloc(sizeof(*m), GFP_USER);
if (!m)
return ERR_PTR(-ENOMEM);
bpf_map_init_from_attr(&m->map, attr);
cost = (u64)m->map.max_entries * sizeof(struct xdp_sock *);
cost += sizeof(struct list_head) * num_possible_cpus();
if (cost >= U32_MAX - PAGE_SIZE)
goto free_m;
m->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
/* Notice returns -EPERM on if map size is larger than memlock limit */
err = bpf_map_precharge_memlock(m->map.pages);
if (err)
goto free_m;
m->flush_list = alloc_percpu(struct list_head);
if (!m->flush_list)
goto free_m;
for_each_possible_cpu(cpu)
INIT_LIST_HEAD(per_cpu_ptr(m->flush_list, cpu));
m->xsk_map = bpf_map_area_alloc(m->map.max_entries *
sizeof(struct xdp_sock *),
m->map.numa_node);
if (!m->xsk_map)
goto free_percpu;
return &m->map;
free_percpu:
free_percpu(m->flush_list);
free_m:
kfree(m);
return ERR_PTR(err);
}
static void xsk_map_free(struct bpf_map *map)
{
struct xsk_map *m = container_of(map, struct xsk_map, map);
int i;
synchronize_net();
for (i = 0; i < map->max_entries; i++) {
struct xdp_sock *xs;
xs = m->xsk_map[i];
if (!xs)
continue;
sock_put((struct sock *)xs);
}
free_percpu(m->flush_list);
bpf_map_area_free(m->xsk_map);
kfree(m);
}
static int xsk_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
{
struct xsk_map *m = container_of(map, struct xsk_map, map);
u32 index = key ? *(u32 *)key : U32_MAX;
u32 *next = next_key;
if (index >= m->map.max_entries) {
*next = 0;
return 0;
}
if (index == m->map.max_entries - 1)
return -ENOENT;
*next = index + 1;
return 0;
}
struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key)
{
struct xsk_map *m = container_of(map, struct xsk_map, map);
struct xdp_sock *xs;
if (key >= map->max_entries)
return NULL;
xs = READ_ONCE(m->xsk_map[key]);
return xs;
}
int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
struct xdp_sock *xs)
{
struct xsk_map *m = container_of(map, struct xsk_map, map);
struct list_head *flush_list = this_cpu_ptr(m->flush_list);
int err;
err = xsk_rcv(xs, xdp);
if (err)
return err;
if (!xs->flush_node.prev)
list_add(&xs->flush_node, flush_list);
return 0;
}
void __xsk_map_flush(struct bpf_map *map)
{
struct xsk_map *m = container_of(map, struct xsk_map, map);
struct list_head *flush_list = this_cpu_ptr(m->flush_list);
struct xdp_sock *xs, *tmp;
list_for_each_entry_safe(xs, tmp, flush_list, flush_node) {
xsk_flush(xs);
__list_del(xs->flush_node.prev, xs->flush_node.next);
xs->flush_node.prev = NULL;
}
}
static void *xsk_map_lookup_elem(struct bpf_map *map, void *key)
{
return NULL;
}
static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
u64 map_flags)
{
struct xsk_map *m = container_of(map, struct xsk_map, map);
u32 i = *(u32 *)key, fd = *(u32 *)value;
struct xdp_sock *xs, *old_xs;
struct socket *sock;
int err;
if (unlikely(map_flags > BPF_EXIST))
return -EINVAL;
if (unlikely(i >= m->map.max_entries))
return -E2BIG;
if (unlikely(map_flags == BPF_NOEXIST))
return -EEXIST;
sock = sockfd_lookup(fd, &err);
if (!sock)
return err;
if (sock->sk->sk_family != PF_XDP) {
sockfd_put(sock);
return -EOPNOTSUPP;
}
xs = (struct xdp_sock *)sock->sk;
if (!xsk_is_setup_for_bpf_map(xs)) {
sockfd_put(sock);
return -EOPNOTSUPP;
}
sock_hold(sock->sk);
old_xs = xchg(&m->xsk_map[i], xs);
if (old_xs) {
/* Make sure we've flushed everything. */
synchronize_net();
sock_put((struct sock *)old_xs);
}
sockfd_put(sock);
return 0;
}
static int xsk_map_delete_elem(struct bpf_map *map, void *key)
{
struct xsk_map *m = container_of(map, struct xsk_map, map);
struct xdp_sock *old_xs;
int k = *(u32 *)key;
if (k >= map->max_entries)
return -EINVAL;
old_xs = xchg(&m->xsk_map[k], NULL);
if (old_xs) {
/* Make sure we've flushed everything. */
synchronize_net();
sock_put((struct sock *)old_xs);
}
return 0;
}
const struct bpf_map_ops xsk_map_ops = {
.map_alloc = xsk_map_alloc,
.map_free = xsk_map_free,
.map_get_next_key = xsk_map_get_next_key,
.map_lookup_elem = xsk_map_lookup_elem,
.map_update_elem = xsk_map_update_elem,
.map_delete_elem = xsk_map_delete_elem,
};
......@@ -59,6 +59,7 @@ source "net/tls/Kconfig"
source "net/xfrm/Kconfig"
source "net/iucv/Kconfig"
source "net/smc/Kconfig"
source "net/xdp/Kconfig"
config INET
bool "TCP/IP networking"
......
......@@ -85,3 +85,4 @@ obj-y += l3mdev/
endif
obj-$(CONFIG_QRTR) += qrtr/
obj-$(CONFIG_NET_NCSI) += ncsi/
obj-$(CONFIG_XDP_SOCKETS) += xdp/
......@@ -3625,6 +3625,44 @@ int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv)
}
EXPORT_SYMBOL(dev_queue_xmit_accel);
int dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
{
struct net_device *dev = skb->dev;
struct sk_buff *orig_skb = skb;
struct netdev_queue *txq;
int ret = NETDEV_TX_BUSY;
bool again = false;
if (unlikely(!netif_running(dev) ||
!netif_carrier_ok(dev)))
goto drop;
skb = validate_xmit_skb_list(skb, dev, &again);
if (skb != orig_skb)
goto drop;
skb_set_queue_mapping(skb, queue_id);
txq = skb_get_tx_queue(dev, skb);
local_bh_disable();
HARD_TX_LOCK(dev, txq, smp_processor_id());
if (!netif_xmit_frozen_or_drv_stopped(txq))
ret = netdev_start_xmit(skb, dev, txq, false);
HARD_TX_UNLOCK(dev, txq);
local_bh_enable();
if (!dev_xmit_complete(ret))
kfree_skb(skb);
return ret;
drop:
atomic_long_inc(&dev->tx_dropped);
kfree_skb_list(skb);
return NET_XMIT_DROP;
}
EXPORT_SYMBOL(dev_direct_xmit);
/*************************************************************************
* Receiver routines
......@@ -3994,12 +4032,12 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
}
static u32 netif_receive_generic_xdp(struct sk_buff *skb,
struct xdp_buff *xdp,
struct bpf_prog *xdp_prog)
{
struct netdev_rx_queue *rxqueue;
void *orig_data, *orig_data_end;
u32 metalen, act = XDP_DROP;
struct xdp_buff xdp;
int hlen, off;
u32 mac_len;
......@@ -4034,19 +4072,19 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
*/
mac_len = skb->data - skb_mac_header(skb);
hlen = skb_headlen(skb) + mac_len;
xdp.data = skb->data - mac_len;
xdp.data_meta = xdp.data;
xdp.data_end = xdp.data + hlen;
xdp.data_hard_start = skb->data - skb_headroom(skb);
orig_data_end = xdp.data_end;
orig_data = xdp.data;
xdp->data = skb->data - mac_len;
xdp->data_meta = xdp->data;
xdp->data_end = xdp->data + hlen;
xdp->data_hard_start = skb->data - skb_headroom(skb);
orig_data_end = xdp->data_end;
orig_data = xdp->data;
rxqueue = netif_get_rxqueue(skb);
xdp.rxq = &rxqueue->xdp_rxq;
xdp->rxq = &rxqueue->xdp_rxq;
act = bpf_prog_run_xdp(xdp_prog, &xdp);
act = bpf_prog_run_xdp(xdp_prog, xdp);
off = xdp.data - orig_data;
off = xdp->data - orig_data;
if (off > 0)
__skb_pull(skb, off);
else if (off < 0)
......@@ -4056,10 +4094,11 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
/* check if bpf_xdp_adjust_tail was used. it can only "shrink"
* pckt.
*/
off = orig_data_end - xdp.data_end;
off = orig_data_end - xdp->data_end;
if (off != 0) {
skb_set_tail_pointer(skb, xdp.data_end - xdp.data);
skb_set_tail_pointer(skb, xdp->data_end - xdp->data);
skb->len -= off;
}
switch (act) {
......@@ -4068,7 +4107,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
__skb_push(skb, mac_len);
break;
case XDP_PASS:
metalen = xdp.data - xdp.data_meta;
metalen = xdp->data - xdp->data_meta;
if (metalen)
skb_metadata_set(skb, metalen);
break;
......@@ -4118,17 +4157,19 @@ static struct static_key generic_xdp_needed __read_mostly;
int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
{
if (xdp_prog) {
u32 act = netif_receive_generic_xdp(skb, xdp_prog);
struct xdp_buff xdp;
u32 act;
int err;
act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
if (act != XDP_PASS) {
switch (act) {
case XDP_REDIRECT:
err = xdp_do_generic_redirect(skb->dev, skb,
xdp_prog);
&xdp, xdp_prog);
if (err)
goto out_redir;
/* fallthru to submit skb */
break;
case XDP_TX:
generic_xdp_tx(skb, xdp_prog);
break;
......
......@@ -59,6 +59,7 @@
#include <net/tcp.h>
#include <net/xfrm.h>
#include <linux/bpf_trace.h>
#include <net/xdp_sock.h>
/**
* sk_filter_trim_cap - run a packet through a socket filter
......@@ -2801,7 +2802,8 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
{
int err;
if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
switch (map->map_type) {
case BPF_MAP_TYPE_DEVMAP: {
struct net_device *dev = fwd;
struct xdp_frame *xdpf;
......@@ -2819,14 +2821,25 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
if (err)
return err;
__dev_map_insert_ctx(map, index);
} else if (map->map_type == BPF_MAP_TYPE_CPUMAP) {
break;
}
case BPF_MAP_TYPE_CPUMAP: {
struct bpf_cpu_map_entry *rcpu = fwd;
err = cpu_map_enqueue(rcpu, xdp, dev_rx);
if (err)
return err;
__cpu_map_insert_ctx(map, index);
break;
}
case BPF_MAP_TYPE_XSKMAP: {
struct xdp_sock *xs = fwd;
err = __xsk_map_redirect(map, xdp, xs);
return err;
}
default:
break;
}
return 0;
}
......@@ -2845,6 +2858,9 @@ void xdp_do_flush_map(void)
case BPF_MAP_TYPE_CPUMAP:
__cpu_map_flush(map);
break;
case BPF_MAP_TYPE_XSKMAP:
__xsk_map_flush(map);
break;
default:
break;
}
......@@ -2859,6 +2875,8 @@ static void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index)
return __dev_map_lookup_elem(map, index);
case BPF_MAP_TYPE_CPUMAP:
return __cpu_map_lookup_elem(map, index);
case BPF_MAP_TYPE_XSKMAP:
return __xsk_map_lookup_elem(map, index);
default:
return NULL;
}
......@@ -2956,13 +2974,14 @@ static int __xdp_generic_ok_fwd_dev(struct sk_buff *skb, struct net_device *fwd)
static int xdp_do_generic_redirect_map(struct net_device *dev,
struct sk_buff *skb,
struct xdp_buff *xdp,
struct bpf_prog *xdp_prog)
{
struct redirect_info *ri = this_cpu_ptr(&redirect_info);
unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
struct net_device *fwd = NULL;
u32 index = ri->ifindex;
void *fwd = NULL;
int err = 0;
ri->ifindex = 0;
......@@ -2984,6 +3003,14 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
if (unlikely((err = __xdp_generic_ok_fwd_dev(skb, fwd))))
goto err;
skb->dev = fwd;
generic_xdp_tx(skb, xdp_prog);
} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
struct xdp_sock *xs = fwd;
err = xsk_generic_rcv(xs, xdp);
if (err)
goto err;
consume_skb(skb);
} else {
/* TODO: Handle BPF_MAP_TYPE_CPUMAP */
err = -EBADRQC;
......@@ -2998,7 +3025,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
}
int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct bpf_prog *xdp_prog)
struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
{
struct redirect_info *ri = this_cpu_ptr(&redirect_info);
u32 index = ri->ifindex;
......@@ -3006,7 +3033,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
int err = 0;
if (ri->map)
return xdp_do_generic_redirect_map(dev, skb, xdp_prog);
return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog);
ri->ifindex = 0;
fwd = dev_get_by_index_rcu(dev_net(dev), index);
......@@ -3020,6 +3047,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
skb->dev = fwd;
_trace_xdp_redirect(dev, xdp_prog, index);
generic_xdp_tx(skb, xdp_prog);
return 0;
err:
_trace_xdp_redirect_err(dev, xdp_prog, index, err);
......
......@@ -226,7 +226,8 @@ static struct lock_class_key af_family_kern_slock_keys[AF_MAX];
x "AF_RXRPC" , x "AF_ISDN" , x "AF_PHONET" , \
x "AF_IEEE802154", x "AF_CAIF" , x "AF_ALG" , \
x "AF_NFC" , x "AF_VSOCK" , x "AF_KCM" , \
x "AF_QIPCRTR", x "AF_SMC" , x "AF_MAX"
x "AF_QIPCRTR", x "AF_SMC" , x "AF_XDP" , \
x "AF_MAX"
static const char *const af_family_key_strings[AF_MAX+1] = {
_sock_locks("sk_lock-")
......@@ -262,7 +263,8 @@ static const char *const af_family_rlock_key_strings[AF_MAX+1] = {
"rlock-AF_RXRPC" , "rlock-AF_ISDN" , "rlock-AF_PHONET" ,
"rlock-AF_IEEE802154", "rlock-AF_CAIF" , "rlock-AF_ALG" ,
"rlock-AF_NFC" , "rlock-AF_VSOCK" , "rlock-AF_KCM" ,
"rlock-AF_QIPCRTR", "rlock-AF_SMC" , "rlock-AF_MAX"
"rlock-AF_QIPCRTR", "rlock-AF_SMC" , "rlock-AF_XDP" ,
"rlock-AF_MAX"
};
static const char *const af_family_wlock_key_strings[AF_MAX+1] = {
"wlock-AF_UNSPEC", "wlock-AF_UNIX" , "wlock-AF_INET" ,
......@@ -279,7 +281,8 @@ static const char *const af_family_wlock_key_strings[AF_MAX+1] = {
"wlock-AF_RXRPC" , "wlock-AF_ISDN" , "wlock-AF_PHONET" ,
"wlock-AF_IEEE802154", "wlock-AF_CAIF" , "wlock-AF_ALG" ,
"wlock-AF_NFC" , "wlock-AF_VSOCK" , "wlock-AF_KCM" ,
"wlock-AF_QIPCRTR", "wlock-AF_SMC" , "wlock-AF_MAX"
"wlock-AF_QIPCRTR", "wlock-AF_SMC" , "wlock-AF_XDP" ,
"wlock-AF_MAX"
};
static const char *const af_family_elock_key_strings[AF_MAX+1] = {
"elock-AF_UNSPEC", "elock-AF_UNIX" , "elock-AF_INET" ,
@@ -296,7 +299,8 @@ static const char *const af_family_elock_key_strings[AF_MAX+1] = {
 "elock-AF_RXRPC" , "elock-AF_ISDN" , "elock-AF_PHONET" ,
 "elock-AF_IEEE802154", "elock-AF_CAIF" , "elock-AF_ALG" ,
 "elock-AF_NFC" , "elock-AF_VSOCK" , "elock-AF_KCM" ,
-"elock-AF_QIPCRTR", "elock-AF_SMC" , "elock-AF_MAX"
+"elock-AF_QIPCRTR", "elock-AF_SMC" , "elock-AF_XDP" ,
+"elock-AF_MAX"
};
/*
......
......@@ -308,11 +308,9 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
}
EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
-void xdp_return_frame(struct xdp_frame *xdpf)
+static void xdp_return(void *data, struct xdp_mem_info *mem)
 {
-struct xdp_mem_info *mem = &xdpf->mem;
 struct xdp_mem_allocator *xa;
-void *data = xdpf->data;
struct page *page;
switch (mem->type) {
......@@ -339,4 +337,15 @@ void xdp_return_frame(struct xdp_frame *xdpf)
break;
}
}
void xdp_return_frame(struct xdp_frame *xdpf)
{
xdp_return(xdpf->data, &xdpf->mem);
}
EXPORT_SYMBOL_GPL(xdp_return_frame);
void xdp_return_buff(struct xdp_buff *xdp)
{
xdp_return(xdp->data, &xdp->rxq->mem);
}
EXPORT_SYMBOL_GPL(xdp_return_buff);
......@@ -209,7 +209,7 @@ static void prb_clear_rxhash(struct tpacket_kbdq_core *,
static void prb_fill_vlan_info(struct tpacket_kbdq_core *,
struct tpacket3_hdr *);
static void packet_flush_mclist(struct sock *sk);
-static void packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb);
+static u16 packet_pick_tx_queue(struct sk_buff *skb);
struct packet_skb_cb {
union {
......@@ -243,40 +243,7 @@ static void __fanout_link(struct sock *sk, struct packet_sock *po);
static int packet_direct_xmit(struct sk_buff *skb)
{
-struct net_device *dev = skb->dev;
-struct sk_buff *orig_skb = skb;
-struct netdev_queue *txq;
-int ret = NETDEV_TX_BUSY;
-bool again = false;
-if (unlikely(!netif_running(dev) ||
-!netif_carrier_ok(dev)))
-goto drop;
-skb = validate_xmit_skb_list(skb, dev, &again);
-if (skb != orig_skb)
-goto drop;
-packet_pick_tx_queue(dev, skb);
-txq = skb_get_tx_queue(dev, skb);
-local_bh_disable();
-HARD_TX_LOCK(dev, txq, smp_processor_id());
-if (!netif_xmit_frozen_or_drv_stopped(txq))
-ret = netdev_start_xmit(skb, dev, txq, false);
-HARD_TX_UNLOCK(dev, txq);
-local_bh_enable();
-if (!dev_xmit_complete(ret))
-kfree_skb(skb);
-return ret;
-drop:
-atomic_long_inc(&dev->tx_dropped);
-kfree_skb_list(skb);
-return NET_XMIT_DROP;
+return dev_direct_xmit(skb, packet_pick_tx_queue(skb));
}
static struct net_device *packet_cached_dev_get(struct packet_sock *po)
......@@ -313,8 +280,9 @@ static u16 __packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
return (u16) raw_smp_processor_id() % dev->real_num_tx_queues;
}
-static void packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
+static u16 packet_pick_tx_queue(struct sk_buff *skb)
 {
+struct net_device *dev = skb->dev;
const struct net_device_ops *ops = dev->netdev_ops;
u16 queue_index;
......@@ -326,7 +294,7 @@ static void packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
queue_index = __packet_pick_tx_queue(dev, skb);
}
-skb_set_queue_mapping(skb, queue_index);
+return queue_index;
}
/* __register_prot_hook must be invoked through register_prot_hook
......
config XDP_SOCKETS
bool "XDP sockets"
depends on BPF_SYSCALL
default n
help
XDP sockets allow a channel between XDP programs and
userspace applications.
obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o
// SPDX-License-Identifier: GPL-2.0
/* XDP user-space packet buffer
* Copyright(c) 2018 Intel Corporation.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms and conditions of the GNU General Public License,
* version 2, as published by the Free Software Foundation.
*
* This program is distributed in the hope it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
* more details.
*/
#include <linux/init.h>
#include <linux/sched/mm.h>
#include <linux/sched/signal.h>
#include <linux/sched/task.h>
#include <linux/uaccess.h>
#include <linux/slab.h>
#include <linux/bpf.h>
#include <linux/mm.h>
#include "xdp_umem.h"
#define XDP_UMEM_MIN_FRAME_SIZE 2048
int xdp_umem_create(struct xdp_umem **umem)
{
*umem = kzalloc(sizeof(**umem), GFP_KERNEL);
if (!(*umem))
return -ENOMEM;
return 0;
}
static void xdp_umem_unpin_pages(struct xdp_umem *umem)
{
unsigned int i;
if (umem->pgs) {
for (i = 0; i < umem->npgs; i++) {
struct page *page = umem->pgs[i];
set_page_dirty_lock(page);
put_page(page);
}
kfree(umem->pgs);
umem->pgs = NULL;
}
}
static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
{
if (umem->user) {
atomic_long_sub(umem->npgs, &umem->user->locked_vm);
free_uid(umem->user);
}
}
static void xdp_umem_release(struct xdp_umem *umem)
{
struct task_struct *task;
struct mm_struct *mm;
if (umem->fq) {
xskq_destroy(umem->fq);
umem->fq = NULL;
}
if (umem->cq) {
xskq_destroy(umem->cq);
umem->cq = NULL;
}
if (umem->pgs) {
xdp_umem_unpin_pages(umem);
task = get_pid_task(umem->pid, PIDTYPE_PID);
put_pid(umem->pid);
if (!task)
goto out;
mm = get_task_mm(task);
put_task_struct(task);
if (!mm)
goto out;
mmput(mm);
umem->pgs = NULL;
}
xdp_umem_unaccount_pages(umem);
out:
kfree(umem);
}
static void xdp_umem_release_deferred(struct work_struct *work)
{
struct xdp_umem *umem = container_of(work, struct xdp_umem, work);
xdp_umem_release(umem);
}
void xdp_get_umem(struct xdp_umem *umem)
{
atomic_inc(&umem->users);
}
void xdp_put_umem(struct xdp_umem *umem)
{
if (!umem)
return;
if (atomic_dec_and_test(&umem->users)) {
INIT_WORK(&umem->work, xdp_umem_release_deferred);
schedule_work(&umem->work);
}
}
static int xdp_umem_pin_pages(struct xdp_umem *umem)
{
unsigned int gup_flags = FOLL_WRITE;
long npgs;
int err;
umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs), GFP_KERNEL);
if (!umem->pgs)
return -ENOMEM;
down_write(&current->mm->mmap_sem);
npgs = get_user_pages(umem->address, umem->npgs,
gup_flags, &umem->pgs[0], NULL);
up_write(&current->mm->mmap_sem);
if (npgs != umem->npgs) {
if (npgs >= 0) {
umem->npgs = npgs;
err = -ENOMEM;
goto out_pin;
}
err = npgs;
goto out_pgs;
}
return 0;
out_pin:
xdp_umem_unpin_pages(umem);
out_pgs:
kfree(umem->pgs);
umem->pgs = NULL;
return err;
}
static int xdp_umem_account_pages(struct xdp_umem *umem)
{
unsigned long lock_limit, new_npgs, old_npgs;
if (capable(CAP_IPC_LOCK))
return 0;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
umem->user = get_uid(current_user());
do {
old_npgs = atomic_long_read(&umem->user->locked_vm);
new_npgs = old_npgs + umem->npgs;
if (new_npgs > lock_limit) {
free_uid(umem->user);
umem->user = NULL;
return -ENOBUFS;
}
} while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
new_npgs) != old_npgs);
return 0;
}
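Review note: the accounting loop above is a lock-free "check the limit, then commit" update of the per-user locked page counter. A minimal user-space sketch of the same pattern with C11 atomics follows. The names `locked_pages`, `charge_pages` and `uncharge_pages` are hypothetical stand-ins for illustration, not kernel symbols.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for user->locked_vm: pages currently charged. */
static atomic_ulong locked_pages;

/* Try to charge npgs pages without exceeding lock_limit.
 * Returns true on success; the counter is left untouched on failure.
 */
static bool charge_pages(unsigned long npgs, unsigned long lock_limit)
{
	unsigned long old_npgs = atomic_load(&locked_pages);
	unsigned long new_npgs;

	do {
		new_npgs = old_npgs + npgs;
		if (new_npgs > lock_limit)
			return false;
		/* If the exchange fails, old_npgs is refreshed with the
		 * current counter value and the limit check is repeated.
		 */
	} while (!atomic_compare_exchange_weak(&locked_pages, &old_npgs,
					       new_npgs));

	return true;
}

/* Give back pages that were charged earlier. */
static void uncharge_pages(unsigned long npgs)
{
	unsigned long prev = atomic_fetch_sub(&locked_pages, npgs);

	assert(prev >= npgs);
}
```

With a limit of 10 pages, two charges of 4 pages succeed and a third is refused, which mirrors the -ENOBUFS path in the patch.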
int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
{
u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
u64 addr = mr->addr, size = mr->len;
unsigned int nframes, nfpp;
int size_chk, err;
if (!umem)
return -EINVAL;
if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
/* Strictly speaking we could support this, if:
* - huge pages, or
* - using an IOMMU, or
* - making sure the memory area is consecutive
* but for now, we simply say "computer says no".
*/
return -EINVAL;
}
if (!is_power_of_2(frame_size))
return -EINVAL;
if (!PAGE_ALIGNED(addr)) {
/* For simplicity, the memory area has to be page-size
 * aligned. This might change.
 */
return -EINVAL;
}
if ((addr + size) < addr)
return -EINVAL;
nframes = size / frame_size;
if (nframes == 0 || nframes > UINT_MAX)
return -EINVAL;
nfpp = PAGE_SIZE / frame_size;
if (nframes < nfpp || nframes % nfpp)
return -EINVAL;
frame_headroom = ALIGN(frame_headroom, 64);
size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
if (size_chk < 0)
return -EINVAL;
umem->pid = get_task_pid(current, PIDTYPE_PID);
umem->size = (size_t)size;
umem->address = (unsigned long)addr;
umem->props.frame_size = frame_size;
umem->props.nframes = nframes;
umem->frame_headroom = frame_headroom;
umem->npgs = size / PAGE_SIZE;
umem->pgs = NULL;
umem->user = NULL;
umem->frame_size_log2 = ilog2(frame_size);
umem->nfpp_mask = nfpp - 1;
umem->nfpplog2 = ilog2(nfpp);
atomic_set(&umem->users, 1);
err = xdp_umem_account_pages(umem);
if (err)
goto out;
err = xdp_umem_pin_pages(umem);
if (err)
goto out_account;
return 0;
out_account:
xdp_umem_unaccount_pages(umem);
out:
put_pid(umem->pid);
return err;
}
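Review note: the parameter checks in xdp_umem_reg() can be exercised outside the kernel. The sketch below restates them as a pure function. It assumes 4 KiB pages and a packet headroom of 256 bytes. `umem_params_ok` and the `SKETCH_*` constants are hypothetical names, and the frame count is kept in 64 bits so that the upper-bound check is meaningful.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE       4096u /* assumption: 4 KiB pages */
#define SKETCH_MIN_FRAME_SIZE  2048u /* mirrors XDP_UMEM_MIN_FRAME_SIZE */
#define SKETCH_PACKET_HEADROOM 256u  /* assumption: XDP_PACKET_HEADROOM */

static bool is_pow2(uint32_t v)
{
	return v && !(v & (v - 1));
}

/* Returns true if the registration parameters pass the same checks,
 * in the same order, as the patch applies them.
 */
static bool umem_params_ok(uint64_t addr, uint64_t len,
			   uint32_t frame_size, uint32_t frame_headroom)
{
	uint64_t nframes, headroom;
	uint32_t nfpp;

	if (frame_size < SKETCH_MIN_FRAME_SIZE ||
	    frame_size > SKETCH_PAGE_SIZE)
		return false;
	if (!is_pow2(frame_size))
		return false;
	if (addr % SKETCH_PAGE_SIZE)      /* area must be page aligned */
		return false;
	if (addr + len < addr)            /* address range wraps around */
		return false;

	nframes = len / frame_size;
	if (nframes == 0 || nframes > UINT32_MAX)
		return false;

	nfpp = SKETCH_PAGE_SIZE / frame_size; /* frames per page */
	if (nframes < nfpp || nframes % nfpp)
		return false;

	/* Round the headroom up to a multiple of 64 bytes. */
	headroom = ((uint64_t)frame_headroom + 63) & ~(uint64_t)63;
	if (headroom + SKETCH_PACKET_HEADROOM > frame_size)
		return false;

	return true;
}
```

For example, a 16 KiB area of 2048-byte frames at a page-aligned address is accepted, while a 1024-byte frame size or an area holding three frames is rejected.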
bool xdp_umem_validate_queues(struct xdp_umem *umem)
{
return (umem->fq && umem->cq);
}
/* SPDX-License-Identifier: GPL-2.0
* XDP user-space packet buffer
* Copyright(c) 2018 Intel Corporation.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms and conditions of the GNU General Public License,
* version 2, as published by the Free Software Foundation.
*
* This program is distributed in the hope it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
* more details.
*/
#ifndef XDP_UMEM_H_
#define XDP_UMEM_H_
#include <linux/mm.h>
#include <linux/if_xdp.h>
#include <linux/workqueue.h>
#include "xsk_queue.h"
#include "xdp_umem_props.h"
struct xdp_umem {
struct xsk_queue *fq;
struct xsk_queue *cq;
struct page **pgs;
struct xdp_umem_props props;
u32 npgs;
u32 frame_headroom;
u32 nfpp_mask;
u32 nfpplog2;
u32 frame_size_log2;
struct user_struct *user;
struct pid *pid;
unsigned long address;
size_t size;
atomic_t users;
struct work_struct work;
};
static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
{
u64 pg, off;
char *data;
pg = idx >> umem->nfpplog2;
off = (idx & umem->nfpp_mask) << umem->frame_size_log2;
data = page_address(umem->pgs[pg]);
return data + off;
}
static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
u32 idx)
{
return xdp_umem_get_data(umem, idx) + umem->frame_headroom;
}
bool xdp_umem_validate_queues(struct xdp_umem *umem);
int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
void xdp_get_umem(struct xdp_umem *umem);
void xdp_put_umem(struct xdp_umem *umem);
int xdp_umem_create(struct xdp_umem **umem);
#endif /* XDP_UMEM_H_ */
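Review note: xdp_umem_get_data() turns a frame id into a page number and a byte offset using only shifts and masks, which works because both the frame size and the number of frames per page are powers of two. A small user-space sketch of that arithmetic follows. `frame_locate`, `struct frame_loc` and `log2_u32` are hypothetical helpers, and the page size is passed in rather than taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

struct frame_loc {
	uint64_t page;   /* index into the pinned page array */
	uint64_t offset; /* byte offset of the frame inside that page */
};

/* Integer log2 for a power-of-two value. */
static uint32_t log2_u32(uint32_t v)
{
	uint32_t n = 0;

	assert(v && !(v & (v - 1)));
	while (v >>= 1)
		n++;
	return n;
}

/* Map a frame id to (page, offset), as the umem helper does. */
static struct frame_loc frame_locate(uint32_t idx, uint32_t frame_size,
				     uint32_t page_size)
{
	struct frame_loc loc;
	uint32_t nfpp;

	assert(frame_size <= page_size);
	nfpp = page_size / frame_size; /* frames per page */

	loc.page = idx >> log2_u32(nfpp);
	loc.offset = (uint64_t)(idx & (nfpp - 1)) << log2_u32(frame_size);
	return loc;
}
```

With 2048-byte frames on 4096-byte pages, frame 5 lands on page 2 at offset 2048, and frame 4 lands on page 2 at offset 0.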
/* SPDX-License-Identifier: GPL-2.0
* XDP user-space packet buffer
* Copyright(c) 2018 Intel Corporation.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms and conditions of the GNU General Public License,
* version 2, as published by the Free Software Foundation.
*
* This program is distributed in the hope it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
* more details.
*/
#ifndef XDP_UMEM_PROPS_H_
#define XDP_UMEM_PROPS_H_
struct xdp_umem_props {
u32 frame_size;
u32 nframes;
};
#endif /* XDP_UMEM_PROPS_H_ */
This diff is collapsed.
// SPDX-License-Identifier: GPL-2.0
/* XDP user-space ring structure
* Copyright(c) 2018 Intel Corporation.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms and conditions of the GNU General Public License,
* version 2, as published by the Free Software Foundation.
*
* This program is distributed in the hope it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
* more details.
*/
#include <linux/slab.h>
#include "xsk_queue.h"
void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props)
{
if (!q)
return;
q->umem_props = *umem_props;
}
static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
{
return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
}
static u32 xskq_rxtx_get_ring_size(struct xsk_queue *q)
{
return (sizeof(struct xdp_ring) +
q->nentries * sizeof(struct xdp_desc));
}
struct xsk_queue *xskq_create(u32 nentries, bool umem_queue)
{
struct xsk_queue *q;
gfp_t gfp_flags;
size_t size;
q = kzalloc(sizeof(*q), GFP_KERNEL);
if (!q)
return NULL;
q->nentries = nentries;
q->ring_mask = nentries - 1;
gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
__GFP_COMP | __GFP_NORETRY;
size = umem_queue ? xskq_umem_get_ring_size(q) :
xskq_rxtx_get_ring_size(q);
q->ring = (struct xdp_ring *)__get_free_pages(gfp_flags,
get_order(size));
if (!q->ring) {
kfree(q);
return NULL;
}
return q;
}
void xskq_destroy(struct xsk_queue *q)
{
if (!q)
return;
page_frag_free(q->ring);
kfree(q);
}
/* SPDX-License-Identifier: GPL-2.0
* XDP user-space ring structure
* Copyright(c) 2018 Intel Corporation.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms and conditions of the GNU General Public License,
* version 2, as published by the Free Software Foundation.
*
* This program is distributed in the hope it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
* more details.
*/
#ifndef _LINUX_XSK_QUEUE_H
#define _LINUX_XSK_QUEUE_H
#include <linux/types.h>
#include <linux/if_xdp.h>
#include "xdp_umem_props.h"
#define RX_BATCH_SIZE 16
struct xsk_queue {
struct xdp_umem_props umem_props;
u32 ring_mask;
u32 nentries;
u32 prod_head;
u32 prod_tail;
u32 cons_head;
u32 cons_tail;
struct xdp_ring *ring;
u64 invalid_descs;
};
/* Common functions operating on both Rx/Tx and umem queues */
static inline u64 xskq_nb_invalid_descs(struct xsk_queue *q)
{
return q ? q->invalid_descs : 0;
}
static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
{
u32 entries = q->prod_tail - q->cons_tail;
if (entries == 0) {
/* Refresh the local pointer */
q->prod_tail = READ_ONCE(q->ring->producer);
entries = q->prod_tail - q->cons_tail;
}
return (entries > dcnt) ? dcnt : entries;
}
static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
{
u32 free_entries = q->nentries - (producer - q->cons_tail);
if (free_entries >= dcnt)
return free_entries;
/* Refresh the local tail pointer */
q->cons_tail = READ_ONCE(q->ring->consumer);
return q->nentries - (producer - q->cons_tail);
}
/* UMEM queue */
static inline bool xskq_is_valid_id(struct xsk_queue *q, u32 idx)
{
if (unlikely(idx >= q->umem_props.nframes)) {
q->invalid_descs++;
return false;
}
return true;
}
static inline u32 *xskq_validate_id(struct xsk_queue *q)
{
while (q->cons_tail != q->cons_head) {
struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
unsigned int idx = q->cons_tail & q->ring_mask;
if (xskq_is_valid_id(q, ring->desc[idx]))
return &ring->desc[idx];
q->cons_tail++;
}
return NULL;
}
static inline u32 *xskq_peek_id(struct xsk_queue *q)
{
struct xdp_umem_ring *ring;
if (q->cons_tail == q->cons_head) {
WRITE_ONCE(q->ring->consumer, q->cons_tail);
q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
/* Order consumer and data */
smp_rmb();
return xskq_validate_id(q);
}
ring = (struct xdp_umem_ring *)q->ring;
return &ring->desc[q->cons_tail & q->ring_mask];
}
static inline void xskq_discard_id(struct xsk_queue *q)
{
q->cons_tail++;
(void)xskq_validate_id(q);
}
static inline int xskq_produce_id(struct xsk_queue *q, u32 id)
{
struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
ring->desc[q->prod_tail++ & q->ring_mask] = id;
/* Order producer and data */
smp_wmb();
WRITE_ONCE(q->ring->producer, q->prod_tail);
return 0;
}
static inline int xskq_reserve_id(struct xsk_queue *q)
{
if (xskq_nb_free(q, q->prod_head, 1) == 0)
return -ENOSPC;
q->prod_head++;
return 0;
}
/* Rx/Tx queue */
static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d)
{
u32 buff_len;
if (unlikely(d->idx >= q->umem_props.nframes)) {
q->invalid_descs++;
return false;
}
buff_len = q->umem_props.frame_size;
if (unlikely(d->len > buff_len || d->len == 0 ||
d->offset > buff_len || d->offset + d->len > buff_len)) {
q->invalid_descs++;
return false;
}
return true;
}
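Review note: the descriptor check above rejects any descriptor whose frame id is out of range or whose payload would not fit inside one frame. The following user-space sketch restates the rule. `struct sketch_desc` is a hypothetical, simplified descriptor layout and `desc_ok` is a hypothetical helper. Unlike the kernel function, it does not count invalid descriptors.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified descriptor: frame id, length, offset. */
struct sketch_desc {
	uint32_t idx;
	uint32_t len;
	uint16_t offset;
};

/* A descriptor is valid if it names an existing frame and the byte
 * range [offset, offset + len) lies entirely within that frame.
 */
static bool desc_ok(const struct sketch_desc *d, uint32_t nframes,
		    uint32_t frame_size)
{
	if (d->idx >= nframes)
		return false;
	if (d->len == 0 || d->len > frame_size)
		return false;
	if (d->offset > frame_size)
		return false;
	/* Widen before adding so the sum cannot wrap. */
	return (uint64_t)d->offset + d->len <= frame_size;
}
```

A full-frame descriptor with offset 0 passes, while the same length with offset 1 fails because the payload would spill into the next frame.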
static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
struct xdp_desc *desc)
{
while (q->cons_tail != q->cons_head) {
struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
unsigned int idx = q->cons_tail & q->ring_mask;
if (xskq_is_valid_desc(q, &ring->desc[idx])) {
if (desc)
*desc = ring->desc[idx];
return desc;
}
q->cons_tail++;
}
return NULL;
}
static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
struct xdp_desc *desc)
{
struct xdp_rxtx_ring *ring;
if (q->cons_tail == q->cons_head) {
WRITE_ONCE(q->ring->consumer, q->cons_tail);
q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
/* Order consumer and data */
smp_rmb();
return xskq_validate_desc(q, desc);
}
ring = (struct xdp_rxtx_ring *)q->ring;
*desc = ring->desc[q->cons_tail & q->ring_mask];
return desc;
}
static inline void xskq_discard_desc(struct xsk_queue *q)
{
q->cons_tail++;
(void)xskq_validate_desc(q, NULL);
}
static inline int xskq_produce_batch_desc(struct xsk_queue *q,
u32 id, u32 len, u16 offset)
{
struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
unsigned int idx;
if (xskq_nb_free(q, q->prod_head, 1) == 0)
return -ENOSPC;
idx = (q->prod_head++) & q->ring_mask;
ring->desc[idx].idx = id;
ring->desc[idx].len = len;
ring->desc[idx].offset = offset;
return 0;
}
static inline void xskq_produce_flush_desc(struct xsk_queue *q)
{
/* Order producer and data */
smp_wmb();
q->prod_tail = q->prod_head;
WRITE_ONCE(q->ring->producer, q->prod_tail);
}
static inline bool xskq_full_desc(struct xsk_queue *q)
{
return (xskq_nb_avail(q, q->nentries) == q->nentries);
}
static inline bool xskq_empty_desc(struct xsk_queue *q)
{
return (xskq_nb_free(q, q->prod_tail, 1) == q->nentries);
}
void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props);
struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
void xskq_destroy(struct xsk_queue *q);
#endif /* _LINUX_XSK_QUEUE_H */
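Review note: the queue helpers rely on free-running 32-bit producer and consumer counters that are only masked when a slot is accessed, so `producer - consumer` gives the fill level even after the counters wrap. The single-threaded sketch below shows just that index arithmetic. It deliberately omits the memory barriers and the cached head/tail copies used in the patch, and every name in it is hypothetical.

```c
#include <stdint.h>

#define SKETCH_NENTRIES 8u
_Static_assert((SKETCH_NENTRIES & (SKETCH_NENTRIES - 1)) == 0,
	       "ring size must be a power of two");

struct sketch_ring {
	uint32_t producer; /* free-running, never masked when stored */
	uint32_t consumer; /* free-running, never masked when stored */
	uint32_t slots[SKETCH_NENTRIES];
};

/* Start both counters at an arbitrary value to exercise wrap-around. */
static void ring_init(struct sketch_ring *r, uint32_t start)
{
	r->producer = start;
	r->consumer = start;
}

/* Unsigned subtraction yields the fill level even across a wrap. */
static uint32_t ring_avail(const struct sketch_ring *r)
{
	return r->producer - r->consumer;
}

static uint32_t ring_free(const struct sketch_ring *r)
{
	return SKETCH_NENTRIES - ring_avail(r);
}

/* Returns 0 on success, -1 if the ring is full. */
static int ring_push(struct sketch_ring *r, uint32_t v)
{
	if (ring_free(r) == 0)
		return -1;
	r->slots[r->producer++ & (SKETCH_NENTRIES - 1)] = v;
	return 0;
}

/* Returns 0 on success, -1 if the ring is empty. */
static int ring_pop(struct sketch_ring *r, uint32_t *v)
{
	if (ring_avail(r) == 0)
		return -1;
	*v = r->slots[r->consumer++ & (SKETCH_NENTRIES - 1)];
	return 0;
}
```

Starting the counters just below UINT32_MAX and pushing eight entries makes the producer wrap past zero, yet the ring still reports eight entries available and none free.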
......@@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info
hostprogs-y += syscall_tp
hostprogs-y += cpustat
hostprogs-y += xdp_adjust_tail
hostprogs-y += xdpsock
# Libbpf dependencies
LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
......@@ -98,6 +99,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
......@@ -151,6 +153,7 @@ always += xdp2skb_meta_kern.o
always += syscall_tp_kern.o
always += cpustat_kern.o
always += xdp_adjust_tail_kern.o
always += xdpsock_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(srctree)/tools/lib/
......@@ -197,6 +200,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf
HOSTLOADLIBES_syscall_tp += -lelf
HOSTLOADLIBES_cpustat += -lelf
HOSTLOADLIBES_xdp_adjust_tail += -lelf
HOSTLOADLIBES_xdpsock += -lelf -pthread
# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
# make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
......
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef XDPSOCK_H_
#define XDPSOCK_H_
/* Power-of-2 number of sockets */
#define MAX_SOCKS 4
/* Round-robin receive */
#define RR_LB 0
#endif /* XDPSOCK_H_ */
// SPDX-License-Identifier: GPL-2.0
#define KBUILD_MODNAME "foo"
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"
#include "xdpsock.h"
struct bpf_map_def SEC("maps") qidconf_map = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(int),
.max_entries = 1,
};
struct bpf_map_def SEC("maps") xsks_map = {
.type = BPF_MAP_TYPE_XSKMAP,
.key_size = sizeof(int),
.value_size = sizeof(int),
.max_entries = 4,
};
struct bpf_map_def SEC("maps") rr_map = {
.type = BPF_MAP_TYPE_PERCPU_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(unsigned int),
.max_entries = 1,
};
SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
int *qidconf, key = 0, idx;
unsigned int *rr;
qidconf = bpf_map_lookup_elem(&qidconf_map, &key);
if (!qidconf)
return XDP_ABORTED;
if (*qidconf != ctx->rx_queue_index)
return XDP_PASS;
#if RR_LB /* NB! RR_LB is configured in xdpsock.h */
rr = bpf_map_lookup_elem(&rr_map, &key);
if (!rr)
return XDP_ABORTED;
*rr = (*rr + 1) & (MAX_SOCKS - 1);
idx = *rr;
#else
idx = 0;
#endif
return bpf_redirect_map(&xsks_map, idx, 0);
}
char _license[] SEC("license") = "GPL";
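Review note: when RR_LB is enabled, the sample program picks the next socket with `(*rr + 1) & (MAX_SOCKS - 1)`, which is only a correct modulo because MAX_SOCKS is a power of two. A user-space sketch of the same step follows. `rr_next` is a hypothetical helper, and MAX_SOCKS is redefined locally to keep the snippet self-contained.

```c
#define MAX_SOCKS 4 /* same value as in xdpsock.h */
_Static_assert((MAX_SOCKS & (MAX_SOCKS - 1)) == 0,
	       "MAX_SOCKS must be a power of two");

/* Advance the round-robin counter and return the socket index to use.
 * The counter is incremented before use, so a counter that starts at
 * zero selects socket 1 first.
 */
static unsigned int rr_next(unsigned int *rr)
{
	*rr = (*rr + 1) & (MAX_SOCKS - 1);
	return *rr;
}
```

Starting from a zeroed counter, successive calls yield 1, 2, 3, 0 and then repeat.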
This diff is collapsed.
......@@ -1471,7 +1471,9 @@ static inline u16 socket_type_to_security_class(int family, int type, int protoc
return SECCLASS_QIPCRTR_SOCKET;
case PF_SMC:
return SECCLASS_SMC_SOCKET;
-#if PF_MAX > 44
+case PF_XDP:
+return SECCLASS_XDP_SOCKET;
+#if PF_MAX > 45
#error New address family defined, please update this function.
#endif
}
......
@@ -240,9 +240,11 @@ struct security_class_mapping secclass_map[] = {
 { "manage_subnet", NULL } },
 { "bpf",
 {"map_create", "map_read", "map_write", "prog_load", "prog_run"} },
+{ "xdp_socket",
+{ COMMON_SOCK_PERMS, NULL } },
 { NULL }
 };
-#if PF_MAX > 44
+#if PF_MAX > 45
#error New address family defined, please update secclass_map.
#endif