Commit d48ce3e5 authored by Daniel Borkmann's avatar Daniel Borkmann

Merge branch 'bpf-sockmap-ulp'

John Fastabend says:

====================
This series adds a BPF hook for sendmsg and senfile by using
the ULP infrastructure and sockmap. A simple pseudocode example
would be,

  // load the programs
  bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
                &obj, &msg_prog);

  // lookup the sockmap
  bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");

  // get fd for sockmap
  map_fd_msg = bpf_map__fd(bpf_map_msg);

  // attach program to sockmap
  bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);

  // Add a socket 'fd' to sockmap at location 'i'
  bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);

After the above snippet any socket attached to the map would run
msg_prog on sendmsg and sendfile system calls.

Three additional helpers are added bpf_msg_apply_bytes(),
bpf_msg_cork_bytes(), and bpf_msg_pull_data(). With
bpf_msg_apply_bytes BPF programs can tell the infrastructure how
many bytes the given verdict should apply to. This has two cases.
First, a BPF program applies verdict to fewer bytes than in the
current sendmsg/sendfile msg this will apply the verdict to the
first N bytes of the message then run the BPF program again with
data pointers recalculated to the N+1 byte. The second case is the
BPF program applies a verdict to more bytes than the current sendmsg
or sendfile system call. In this case the infrastructure will cache
the verdict and apply it to future sendmsg/sendfile calls until the
byte limit is reached. This avoids the overhead of running BPF
programs on large payloads.

The helper bpf_msg_cork_bytes() handles a different case where
a BPF program can not reach a verdict on a msg until it receives
more bytes AND the program doesn't want to forward the packet
until it is known to be "good". The example case being a user
(albeit a dumb one probably) sends a N byte header in 1B system
calls. The BPF program can call bpf_msg_cork_bytes with the
required byte limit to reach a verdict and then the program will
only be called again once N bytes are received.

The last helper added in this series is bpf_msg_pull_data(). It
is used to pull data in for modification or reading. Similar to
how sk_pull_data() works msg_pull_data can be used to access data
not in the initial (data_start, data_end) range. For sendpage()
calls this is needed if any data is accessed because the BPF
sendpage hook initializes the data_start and data_end pointers to
zero. We do this because sendpage data is shared with the user
and can be modified during or after the BPF verdict possibly
invalidating any verdict the BPF program decides. For sendmsg
the data is already copied by the sendmsg bpf infrastructure so
we only copy the data if the user request a data range that is
not already linearized. This happens if the user requests larger
blocks of data that are not in a single scatterlist element. The
common case seems to be accessing headers which normally are
in the first scatterlist element and already linearized.

For more examples please review the sample program. There are
examples for all the actions and helpers there.

Patches 1-8 implement the above sockmap/BPF infrastructure. The
remaining patches flush out some minimal selftests and the sample
sockmap program. The sockmap sample program is the main vehicle
for testing this infrastructure and will be moved into selftests
shortly. The final patch in this series is a simple shell script
to run a set of tests. These are the tests I run after any changes
to sockmap. The next task on the list after this series is to
push those into selftests so we can avoid manually testing.

Couple notes on future items in the pipeline,

  0. move sample sockmap programs into selftests (noted above)
  1. add additional support for tcp flags, most are ignored now.
  2. add a Documentation/bpf/sockmap file with these details
  3. support stacked ULP types to allow this and ktls to cooperate
  4. Ingress flag support, redirect only supports egress here. The
     other redirect helpers support ingress and egress flags.
  5. add optimizations, I cut a few optimizations here in the
     first iteration of the code for later study/implementation

-v3 updates
  : u32 data pointers in msg_md changed to void *
  : page_address NULL check and flag verification in msg_pull_data
  : remove old note in commit msg that is no longer relevant
  : remove enum sk_msg_action its not used anywhere
  : fixup test_verifier W -> DW insn to account for data pointers
  : unintentionally dropped a smap_stop_tx() call in sockmap.c

I propagated the ACKs forward because above changes were small
one/two line changes.

-v2 updates (discussion):

Dave noticed that sendpage call was previously (in v1) running
on the data directly. This allowed users to potentially modify
the data after or during the BPF program. However doing a copy
automatically even if the data is not accessed has measurable
performance impact. So we added another helper modeled after
the existing skb_pull_data() helper to allow users to selectively
pull data from the msg. This is also useful in the sendmsg case
when users need to access data outside the first scatterlist
element or across scatterlist boundaries.

While doing this I also unified the sendmsg and sendfile handlers
a bit. Originally the sendfile call was optimized for never
touching the data. I've decided for a first submission to drop
this optimization and we can add it back later. It introduced
unnecessary complexity, at least for a first posting, for a
use case I have not entirely flushed out yet. When the use
case is deployed we can add it back if needed. Then we can
review concrete performance deltas as well on real-world
use-cases/applications.

Lastly, I reorganized the patches a bit. Now all sockmap
changes are in a single patch and each helper gets its own
patch. This, at least IMO, makes it easier to review because
sockmap changes are not spread across the patch series. On
the other hand now apply_bytes, cork_bytes logic is only
activated later in the series. But that should be OK.
====================
Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
parents 318df9f0 ae30727f
......@@ -21,6 +21,7 @@ struct bpf_verifier_env;
struct perf_event;
struct bpf_prog;
struct bpf_map;
struct sock;
/* map is generic key/value storage optionally accesible by eBPF programs */
struct bpf_map_ops {
......
......@@ -13,6 +13,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
#endif
#ifdef CONFIG_BPF_EVENTS
BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
......
......@@ -507,6 +507,22 @@ struct xdp_buff {
struct xdp_rxq_info *rxq;
};
struct sk_msg_buff {
void *data;
void *data_end;
__u32 apply_bytes;
__u32 cork_bytes;
int sg_copybreak;
int sg_start;
int sg_curr;
int sg_end;
struct scatterlist sg_data[MAX_SKB_FRAGS];
bool sg_copy[MAX_SKB_FRAGS];
__u32 key;
__u32 flags;
struct bpf_map *map;
};
/* Compute the linear packet data range [data, data_end) which
* will be accessed by various program types (cls_bpf, act_bpf,
* lwt, ...). Subsystems allowing direct data access must (!)
......@@ -771,6 +787,7 @@ xdp_data_meta_unsupported(const struct xdp_buff *xdp)
void bpf_warn_invalid_xdp_action(u32 act);
struct sock *do_sk_redirect_map(struct sk_buff *skb);
struct sock *do_msg_redirect_map(struct sk_msg_buff *md);
#ifdef CONFIG_BPF_JIT
extern int bpf_jit_enable;
......
......@@ -287,6 +287,7 @@ struct ucred {
#define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
#define MSG_BATCH 0x40000 /* sendmmsg(): more messages coming */
#define MSG_EOF MSG_FIN
#define MSG_NO_SHARED_FRAGS 0x80000 /* sendpage() internal : page frags are not shared */
#define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */
#define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */
......
......@@ -2141,6 +2141,10 @@ static inline struct page_frag *sk_page_frag(struct sock *sk)
bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag);
int sk_alloc_sg(struct sock *sk, int len, struct scatterlist *sg,
int sg_start, int *sg_curr, unsigned int *sg_size,
int first_coalesce);
/*
* Default write policy as shown to user space via poll/select/SIGIO
*/
......
......@@ -133,6 +133,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SOCK_OPS,
BPF_PROG_TYPE_SK_SKB,
BPF_PROG_TYPE_CGROUP_DEVICE,
BPF_PROG_TYPE_SK_MSG,
};
enum bpf_attach_type {
......@@ -143,6 +144,7 @@ enum bpf_attach_type {
BPF_SK_SKB_STREAM_PARSER,
BPF_SK_SKB_STREAM_VERDICT,
BPF_CGROUP_DEVICE,
BPF_SK_MSG_VERDICT,
__MAX_BPF_ATTACH_TYPE
};
......@@ -718,6 +720,15 @@ union bpf_attr {
* int bpf_override_return(pt_regs, rc)
* @pt_regs: pointer to struct pt_regs
* @rc: the return value to set
*
* int bpf_msg_redirect_map(map, key, flags)
* Redirect msg to a sock in map using key as a lookup key for the
* sock in map.
* @map: pointer to sockmap
* @key: key to lookup sock in map
* @flags: reserved for future use
* Return: SK_PASS
*
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
......@@ -779,7 +790,11 @@ union bpf_attr {
FN(perf_prog_read_value), \
FN(getsockopt), \
FN(override_return), \
FN(sock_ops_cb_flags_set),
FN(sock_ops_cb_flags_set), \
FN(msg_redirect_map), \
FN(msg_apply_bytes), \
FN(msg_cork_bytes), \
FN(msg_pull_data),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
......@@ -942,6 +957,14 @@ enum sk_action {
SK_PASS,
};
/* user accessible metadata for SK_MSG packet hook, new fields must
* be added to the end of this structure
*/
struct sk_msg_md {
void *data;
void *data_end;
};
#define BPF_TAG_SIZE 8
struct bpf_prog_info {
......
......@@ -38,6 +38,7 @@
#include <linux/skbuff.h>
#include <linux/workqueue.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <net/strparser.h>
#include <net/tcp.h>
......@@ -47,6 +48,7 @@
struct bpf_stab {
struct bpf_map map;
struct sock **sock_map;
struct bpf_prog *bpf_tx_msg;
struct bpf_prog *bpf_parse;
struct bpf_prog *bpf_verdict;
};
......@@ -62,8 +64,7 @@ struct smap_psock_map_entry {
struct smap_psock {
struct rcu_head rcu;
/* refcnt is used inside sk_callback_lock */
u32 refcnt;
refcount_t refcnt;
/* datapath variables */
struct sk_buff_head rxqueue;
......@@ -74,7 +75,16 @@ struct smap_psock {
int save_off;
struct sk_buff *save_skb;
/* datapath variables for tx_msg ULP */
struct sock *sk_redir;
int apply_bytes;
int cork_bytes;
int sg_size;
int eval;
struct sk_msg_buff *cork;
struct strparser strp;
struct bpf_prog *bpf_tx_msg;
struct bpf_prog *bpf_parse;
struct bpf_prog *bpf_verdict;
struct list_head maps;
......@@ -92,6 +102,11 @@ struct smap_psock {
void (*save_write_space)(struct sock *sk);
};
static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags);
static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
{
return rcu_dereference_sk_user_data(sk);
......@@ -116,27 +131,41 @@ static int bpf_tcp_init(struct sock *sk)
psock->save_close = sk->sk_prot->close;
psock->sk_proto = sk->sk_prot;
if (psock->bpf_tx_msg) {
tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
}
sk->sk_prot = &tcp_bpf_proto;
rcu_read_unlock();
return 0;
}
static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
static int free_start_sg(struct sock *sk, struct sk_msg_buff *md);
static void bpf_tcp_release(struct sock *sk)
{
struct smap_psock *psock;
rcu_read_lock();
psock = smap_psock_sk(sk);
if (unlikely(!psock))
goto out;
if (likely(psock)) {
sk->sk_prot = psock->sk_proto;
psock->sk_proto = NULL;
if (psock->cork) {
free_start_sg(psock->sock, psock->cork);
kfree(psock->cork);
psock->cork = NULL;
}
sk->sk_prot = psock->sk_proto;
psock->sk_proto = NULL;
out:
rcu_read_unlock();
}
static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
static void bpf_tcp_close(struct sock *sk, long timeout)
{
void (*close_fun)(struct sock *sk, long timeout);
......@@ -175,6 +204,7 @@ enum __sk_action {
__SK_DROP = 0,
__SK_PASS,
__SK_REDIRECT,
__SK_NONE,
};
static struct tcp_ulp_ops bpf_tcp_ulp_ops __read_mostly = {
......@@ -186,10 +216,621 @@ static struct tcp_ulp_ops bpf_tcp_ulp_ops __read_mostly = {
.release = bpf_tcp_release,
};
static int memcopy_from_iter(struct sock *sk,
struct sk_msg_buff *md,
struct iov_iter *from, int bytes)
{
struct scatterlist *sg = md->sg_data;
int i = md->sg_curr, rc = -ENOSPC;
do {
int copy;
char *to;
if (md->sg_copybreak >= sg[i].length) {
md->sg_copybreak = 0;
if (++i == MAX_SKB_FRAGS)
i = 0;
if (i == md->sg_end)
break;
}
copy = sg[i].length - md->sg_copybreak;
to = sg_virt(&sg[i]) + md->sg_copybreak;
md->sg_copybreak += copy;
if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY)
rc = copy_from_iter_nocache(to, copy, from);
else
rc = copy_from_iter(to, copy, from);
if (rc != copy) {
rc = -EFAULT;
goto out;
}
bytes -= copy;
if (!bytes)
break;
md->sg_copybreak = 0;
if (++i == MAX_SKB_FRAGS)
i = 0;
} while (i != md->sg_end);
out:
md->sg_curr = i;
return rc;
}
static int bpf_tcp_push(struct sock *sk, int apply_bytes,
struct sk_msg_buff *md,
int flags, bool uncharge)
{
bool apply = apply_bytes;
struct scatterlist *sg;
int offset, ret = 0;
struct page *p;
size_t size;
while (1) {
sg = md->sg_data + md->sg_start;
size = (apply && apply_bytes < sg->length) ?
apply_bytes : sg->length;
offset = sg->offset;
tcp_rate_check_app_limited(sk);
p = sg_page(sg);
retry:
ret = do_tcp_sendpages(sk, p, offset, size, flags);
if (ret != size) {
if (ret > 0) {
if (apply)
apply_bytes -= ret;
size -= ret;
offset += ret;
if (uncharge)
sk_mem_uncharge(sk, ret);
goto retry;
}
sg->length = size;
sg->offset = offset;
return ret;
}
if (apply)
apply_bytes -= ret;
sg->offset += ret;
sg->length -= ret;
if (uncharge)
sk_mem_uncharge(sk, ret);
if (!sg->length) {
put_page(p);
md->sg_start++;
if (md->sg_start == MAX_SKB_FRAGS)
md->sg_start = 0;
memset(sg, 0, sizeof(*sg));
if (md->sg_start == md->sg_end)
break;
}
if (apply && !apply_bytes)
break;
}
return 0;
}
static inline void bpf_compute_data_pointers_sg(struct sk_msg_buff *md)
{
struct scatterlist *sg = md->sg_data + md->sg_start;
if (md->sg_copy[md->sg_start]) {
md->data = md->data_end = 0;
} else {
md->data = sg_virt(sg);
md->data_end = md->data + sg->length;
}
}
static void return_mem_sg(struct sock *sk, int bytes, struct sk_msg_buff *md)
{
struct scatterlist *sg = md->sg_data;
int i = md->sg_start;
do {
int uncharge = (bytes < sg[i].length) ? bytes : sg[i].length;
sk_mem_uncharge(sk, uncharge);
bytes -= uncharge;
if (!bytes)
break;
i++;
if (i == MAX_SKB_FRAGS)
i = 0;
} while (i != md->sg_end);
}
static void free_bytes_sg(struct sock *sk, int bytes, struct sk_msg_buff *md)
{
struct scatterlist *sg = md->sg_data;
int i = md->sg_start, free;
while (bytes && sg[i].length) {
free = sg[i].length;
if (bytes < free) {
sg[i].length -= bytes;
sg[i].offset += bytes;
sk_mem_uncharge(sk, bytes);
break;
}
sk_mem_uncharge(sk, sg[i].length);
put_page(sg_page(&sg[i]));
bytes -= sg[i].length;
sg[i].length = 0;
sg[i].page_link = 0;
sg[i].offset = 0;
i++;
if (i == MAX_SKB_FRAGS)
i = 0;
}
}
static int free_sg(struct sock *sk, int start, struct sk_msg_buff *md)
{
struct scatterlist *sg = md->sg_data;
int i = start, free = 0;
while (sg[i].length) {
free += sg[i].length;
sk_mem_uncharge(sk, sg[i].length);
put_page(sg_page(&sg[i]));
sg[i].length = 0;
sg[i].page_link = 0;
sg[i].offset = 0;
i++;
if (i == MAX_SKB_FRAGS)
i = 0;
}
return free;
}
static int free_start_sg(struct sock *sk, struct sk_msg_buff *md)
{
int free = free_sg(sk, md->sg_start, md);
md->sg_start = md->sg_end;
return free;
}
static int free_curr_sg(struct sock *sk, struct sk_msg_buff *md)
{
return free_sg(sk, md->sg_curr, md);
}
static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
{
return ((_rc == SK_PASS) ?
(md->map ? __SK_REDIRECT : __SK_PASS) :
__SK_DROP);
}
static unsigned int smap_do_tx_msg(struct sock *sk,
struct smap_psock *psock,
struct sk_msg_buff *md)
{
struct bpf_prog *prog;
unsigned int rc, _rc;
preempt_disable();
rcu_read_lock();
/* If the policy was removed mid-send then default to 'accept' */
prog = READ_ONCE(psock->bpf_tx_msg);
if (unlikely(!prog)) {
_rc = SK_PASS;
goto verdict;
}
bpf_compute_data_pointers_sg(md);
rc = (*prog->bpf_func)(md, prog->insnsi);
psock->apply_bytes = md->apply_bytes;
/* Moving return codes from UAPI namespace into internal namespace */
_rc = bpf_map_msg_verdict(rc, md);
/* The psock has a refcount on the sock but not on the map and because
* we need to drop rcu read lock here its possible the map could be
* removed between here and when we need it to execute the sock
* redirect. So do the map lookup now for future use.
*/
if (_rc == __SK_REDIRECT) {
if (psock->sk_redir)
sock_put(psock->sk_redir);
psock->sk_redir = do_msg_redirect_map(md);
if (!psock->sk_redir) {
_rc = __SK_DROP;
goto verdict;
}
sock_hold(psock->sk_redir);
}
verdict:
rcu_read_unlock();
preempt_enable();
return _rc;
}
static int bpf_tcp_sendmsg_do_redirect(struct sock *sk, int send,
struct sk_msg_buff *md,
int flags)
{
struct smap_psock *psock;
struct scatterlist *sg;
int i, err, free = 0;
sg = md->sg_data;
rcu_read_lock();
psock = smap_psock_sk(sk);
if (unlikely(!psock))
goto out_rcu;
if (!refcount_inc_not_zero(&psock->refcnt))
goto out_rcu;
rcu_read_unlock();
lock_sock(sk);
err = bpf_tcp_push(sk, send, md, flags, false);
release_sock(sk);
smap_release_sock(psock, sk);
if (unlikely(err))
goto out;
return 0;
out_rcu:
rcu_read_unlock();
out:
i = md->sg_start;
while (sg[i].length) {
free += sg[i].length;
put_page(sg_page(&sg[i]));
sg[i].length = 0;
i++;
if (i == MAX_SKB_FRAGS)
i = 0;
}
return free;
}
static inline void bpf_md_init(struct smap_psock *psock)
{
if (!psock->apply_bytes) {
psock->eval = __SK_NONE;
if (psock->sk_redir) {
sock_put(psock->sk_redir);
psock->sk_redir = NULL;
}
}
}
static void apply_bytes_dec(struct smap_psock *psock, int i)
{
if (psock->apply_bytes) {
if (psock->apply_bytes < i)
psock->apply_bytes = 0;
else
psock->apply_bytes -= i;
}
}
static int bpf_exec_tx_verdict(struct smap_psock *psock,
struct sk_msg_buff *m,
struct sock *sk,
int *copied, int flags)
{
bool cork = false, enospc = (m->sg_start == m->sg_end);
struct sock *redir;
int err = 0;
int send;
more_data:
if (psock->eval == __SK_NONE)
psock->eval = smap_do_tx_msg(sk, psock, m);
if (m->cork_bytes &&
m->cork_bytes > psock->sg_size && !enospc) {
psock->cork_bytes = m->cork_bytes - psock->sg_size;
if (!psock->cork) {
psock->cork = kcalloc(1,
sizeof(struct sk_msg_buff),
GFP_ATOMIC | __GFP_NOWARN);
if (!psock->cork) {
err = -ENOMEM;
goto out_err;
}
}
memcpy(psock->cork, m, sizeof(*m));
goto out_err;
}
send = psock->sg_size;
if (psock->apply_bytes && psock->apply_bytes < send)
send = psock->apply_bytes;
switch (psock->eval) {
case __SK_PASS:
err = bpf_tcp_push(sk, send, m, flags, true);
if (unlikely(err)) {
*copied -= free_start_sg(sk, m);
break;
}
apply_bytes_dec(psock, send);
psock->sg_size -= send;
break;
case __SK_REDIRECT:
redir = psock->sk_redir;
apply_bytes_dec(psock, send);
if (psock->cork) {
cork = true;
psock->cork = NULL;
}
return_mem_sg(sk, send, m);
release_sock(sk);
err = bpf_tcp_sendmsg_do_redirect(redir, send, m, flags);
lock_sock(sk);
if (cork) {
free_start_sg(sk, m);
kfree(m);
m = NULL;
}
if (unlikely(err))
*copied -= err;
else
psock->sg_size -= send;
break;
case __SK_DROP:
default:
free_bytes_sg(sk, send, m);
apply_bytes_dec(psock, send);
*copied -= send;
psock->sg_size -= send;
err = -EACCES;
break;
}
if (likely(!err)) {
bpf_md_init(psock);
if (m &&
m->sg_data[m->sg_start].page_link &&
m->sg_data[m->sg_start].length)
goto more_data;
}
out_err:
return err;
}
static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
int flags = msg->msg_flags | MSG_NO_SHARED_FRAGS;
struct sk_msg_buff md = {0};
unsigned int sg_copy = 0;
struct smap_psock *psock;
int copied = 0, err = 0;
struct scatterlist *sg;
long timeo;
/* Its possible a sock event or user removed the psock _but_ the ops
* have not been reprogrammed yet so we get here. In this case fallback
* to tcp_sendmsg. Note this only works because we _only_ ever allow
* a single ULP there is no hierarchy here.
*/
rcu_read_lock();
psock = smap_psock_sk(sk);
if (unlikely(!psock)) {
rcu_read_unlock();
return tcp_sendmsg(sk, msg, size);
}
/* Increment the psock refcnt to ensure its not released while sending a
* message. Required because sk lookup and bpf programs are used in
* separate rcu critical sections. Its OK if we lose the map entry
* but we can't lose the sock reference.
*/
if (!refcount_inc_not_zero(&psock->refcnt)) {
rcu_read_unlock();
return tcp_sendmsg(sk, msg, size);
}
sg = md.sg_data;
sg_init_table(sg, MAX_SKB_FRAGS);
rcu_read_unlock();
lock_sock(sk);
timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
while (msg_data_left(msg)) {
struct sk_msg_buff *m;
bool enospc = false;
int copy;
if (sk->sk_err) {
err = sk->sk_err;
goto out_err;
}
copy = msg_data_left(msg);
if (!sk_stream_memory_free(sk))
goto wait_for_sndbuf;
m = psock->cork_bytes ? psock->cork : &md;
m->sg_curr = m->sg_copybreak ? m->sg_curr : m->sg_end;
err = sk_alloc_sg(sk, copy, m->sg_data,
m->sg_start, &m->sg_end, &sg_copy,
m->sg_end - 1);
if (err) {
if (err != -ENOSPC)
goto wait_for_memory;
enospc = true;
copy = sg_copy;
}
err = memcopy_from_iter(sk, m, &msg->msg_iter, copy);
if (err < 0) {
free_curr_sg(sk, m);
goto out_err;
}
psock->sg_size += copy;
copied += copy;
sg_copy = 0;
/* When bytes are being corked skip running BPF program and
* applying verdict unless there is no more buffer space. In
* the ENOSPC case simply run BPF prorgram with currently
* accumulated data. We don't have much choice at this point
* we could try extending the page frags or chaining complex
* frags but even in these cases _eventually_ we will hit an
* OOM scenario. More complex recovery schemes may be
* implemented in the future, but BPF programs must handle
* the case where apply_cork requests are not honored. The
* canonical method to verify this is to check data length.
*/
if (psock->cork_bytes) {
if (copy > psock->cork_bytes)
psock->cork_bytes = 0;
else
psock->cork_bytes -= copy;
if (psock->cork_bytes && !enospc)
goto out_cork;
/* All cork bytes accounted for re-run filter */
psock->eval = __SK_NONE;
psock->cork_bytes = 0;
}
err = bpf_exec_tx_verdict(psock, m, sk, &copied, flags);
if (unlikely(err < 0))
goto out_err;
continue;
wait_for_sndbuf:
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
wait_for_memory:
err = sk_stream_wait_memory(sk, &timeo);
if (err)
goto out_err;
}
out_err:
if (err < 0)
err = sk_stream_error(sk, msg->msg_flags, err);
out_cork:
release_sock(sk);
smap_release_sock(psock, sk);
return copied ? copied : err;
}
static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags)
{
struct sk_msg_buff md = {0}, *m = NULL;
int err = 0, copied = 0;
struct smap_psock *psock;
struct scatterlist *sg;
bool enospc = false;
rcu_read_lock();
psock = smap_psock_sk(sk);
if (unlikely(!psock))
goto accept;
if (!refcount_inc_not_zero(&psock->refcnt))
goto accept;
rcu_read_unlock();
lock_sock(sk);
if (psock->cork_bytes)
m = psock->cork;
else
m = &md;
/* Catch case where ring is full and sendpage is stalled. */
if (unlikely(m->sg_end == m->sg_start &&
m->sg_data[m->sg_end].length))
goto out_err;
psock->sg_size += size;
sg = &m->sg_data[m->sg_end];
sg_set_page(sg, page, size, offset);
get_page(page);
m->sg_copy[m->sg_end] = true;
sk_mem_charge(sk, size);
m->sg_end++;
copied = size;
if (m->sg_end == MAX_SKB_FRAGS)
m->sg_end = 0;
if (m->sg_end == m->sg_start)
enospc = true;
if (psock->cork_bytes) {
if (size > psock->cork_bytes)
psock->cork_bytes = 0;
else
psock->cork_bytes -= size;
if (psock->cork_bytes && !enospc)
goto out_err;
/* All cork bytes accounted for re-run filter */
psock->eval = __SK_NONE;
psock->cork_bytes = 0;
}
err = bpf_exec_tx_verdict(psock, m, sk, &copied, flags);
out_err:
release_sock(sk);
smap_release_sock(psock, sk);
return copied ? copied : err;
accept:
rcu_read_unlock();
return tcp_sendpage(sk, page, offset, size, flags);
}
static void bpf_tcp_msg_add(struct smap_psock *psock,
struct sock *sk,
struct bpf_prog *tx_msg)
{
struct bpf_prog *orig_tx_msg;
orig_tx_msg = xchg(&psock->bpf_tx_msg, tx_msg);
if (orig_tx_msg)
bpf_prog_put(orig_tx_msg);
}
static int bpf_tcp_ulp_register(void)
{
tcp_bpf_proto = tcp_prot;
tcp_bpf_proto.close = bpf_tcp_close;
/* Once BPF TX ULP is registered it is never unregistered. It
* will be in the ULP list for the lifetime of the system. Doing
* duplicate registers is not a problem.
*/
return tcp_register_ulp(&bpf_tcp_ulp_ops);
}
......@@ -373,15 +1014,13 @@ static void smap_destroy_psock(struct rcu_head *rcu)
static void smap_release_sock(struct smap_psock *psock, struct sock *sock)
{
psock->refcnt--;
if (psock->refcnt)
return;
tcp_cleanup_ulp(sock);
smap_stop_sock(psock, sock);
clear_bit(SMAP_TX_RUNNING, &psock->state);
rcu_assign_sk_user_data(sock, NULL);
call_rcu_sched(&psock->rcu, smap_destroy_psock);
if (refcount_dec_and_test(&psock->refcnt)) {
tcp_cleanup_ulp(sock);
smap_stop_sock(psock, sock);
clear_bit(SMAP_TX_RUNNING, &psock->state);
rcu_assign_sk_user_data(sock, NULL);
call_rcu_sched(&psock->rcu, smap_destroy_psock);
}
}
static int smap_parse_func_strparser(struct strparser *strp,
......@@ -415,7 +1054,6 @@ static int smap_parse_func_strparser(struct strparser *strp,
return rc;
}
static int smap_read_sock_done(struct strparser *strp, int err)
{
return err;
......@@ -485,12 +1123,22 @@ static void smap_gc_work(struct work_struct *w)
bpf_prog_put(psock->bpf_parse);
if (psock->bpf_verdict)
bpf_prog_put(psock->bpf_verdict);
if (psock->bpf_tx_msg)
bpf_prog_put(psock->bpf_tx_msg);
if (psock->cork) {
free_start_sg(psock->sock, psock->cork);
kfree(psock->cork);
}
list_for_each_entry_safe(e, tmp, &psock->maps, list) {
list_del(&e->list);
kfree(e);
}
if (psock->sk_redir)
sock_put(psock->sk_redir);
sock_put(psock->sock);
kfree(psock);
}
......@@ -506,12 +1154,13 @@ static struct smap_psock *smap_init_psock(struct sock *sock,
if (!psock)
return ERR_PTR(-ENOMEM);
psock->eval = __SK_NONE;
psock->sock = sock;
skb_queue_head_init(&psock->rxqueue);
INIT_WORK(&psock->tx_work, smap_tx_work);
INIT_WORK(&psock->gc_work, smap_gc_work);
INIT_LIST_HEAD(&psock->maps);
psock->refcnt = 1;
refcount_set(&psock->refcnt, 1);
rcu_assign_sk_user_data(sock, psock);
sock_hold(sock);
......@@ -714,10 +1363,11 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
{
struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
struct smap_psock_map_entry *e = NULL;
struct bpf_prog *verdict, *parse;
struct bpf_prog *verdict, *parse, *tx_msg;
struct sock *osock, *sock;
struct smap_psock *psock;
u32 i = *(u32 *)key;
bool new = false;
int err;
if (unlikely(flags > BPF_EXIST))
......@@ -740,6 +1390,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
*/
verdict = READ_ONCE(stab->bpf_verdict);
parse = READ_ONCE(stab->bpf_parse);
tx_msg = READ_ONCE(stab->bpf_tx_msg);
if (parse && verdict) {
/* bpf prog refcnt may be zero if a concurrent attach operation
......@@ -758,6 +1409,17 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
}
}
if (tx_msg) {
tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg);
if (IS_ERR(tx_msg)) {
if (verdict)
bpf_prog_put(verdict);
if (parse)
bpf_prog_put(parse);
return PTR_ERR(tx_msg);
}
}
write_lock_bh(&sock->sk_callback_lock);
psock = smap_psock_sk(sock);
......@@ -772,7 +1434,14 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
err = -EBUSY;
goto out_progs;
}
psock->refcnt++;
if (READ_ONCE(psock->bpf_tx_msg) && tx_msg) {
err = -EBUSY;
goto out_progs;
}
if (!refcount_inc_not_zero(&psock->refcnt)) {
err = -EAGAIN;
goto out_progs;
}
} else {
psock = smap_init_psock(sock, stab);
if (IS_ERR(psock)) {
......@@ -780,11 +1449,8 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
goto out_progs;
}
err = tcp_set_ulp_id(sock, TCP_ULP_BPF);
if (err)
goto out_progs;
set_bit(SMAP_TX_RUNNING, &psock->state);
new = true;
}
e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
......@@ -797,6 +1463,14 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
/* 3. At this point we have a reference to a valid psock that is
* running. Attach any BPF programs needed.
*/
if (tx_msg)
bpf_tcp_msg_add(psock, sock, tx_msg);
if (new) {
err = tcp_set_ulp_id(sock, TCP_ULP_BPF);
if (err)
goto out_free;
}
if (parse && verdict && !psock->strp_enabled) {
err = smap_init_sock(psock, sock);
if (err)
......@@ -818,8 +1492,6 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
struct smap_psock *opsock = smap_psock_sk(osock);
write_lock_bh(&osock->sk_callback_lock);
if (osock != sock && parse)
smap_stop_sock(opsock, osock);
smap_list_remove(opsock, &stab->sock_map[i]);
smap_release_sock(opsock, osock);
write_unlock_bh(&osock->sk_callback_lock);
......@@ -832,6 +1504,8 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
bpf_prog_put(verdict);
if (parse)
bpf_prog_put(parse);
if (tx_msg)
bpf_prog_put(tx_msg);
write_unlock_bh(&sock->sk_callback_lock);
kfree(e);
return err;
......@@ -846,6 +1520,9 @@ int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
return -EINVAL;
switch (type) {
case BPF_SK_MSG_VERDICT:
orig = xchg(&stab->bpf_tx_msg, prog);
break;
case BPF_SK_SKB_STREAM_PARSER:
orig = xchg(&stab->bpf_parse, prog);
break;
......@@ -907,6 +1584,10 @@ static void sock_map_release(struct bpf_map *map, struct file *map_file)
orig = xchg(&stab->bpf_verdict, NULL);
if (orig)
bpf_prog_put(orig);
orig = xchg(&stab->bpf_tx_msg, NULL);
if (orig)
bpf_prog_put(orig);
}
const struct bpf_map_ops sock_map_ops = {
......
......@@ -1315,7 +1315,8 @@ static int bpf_obj_get(const union bpf_attr *attr)
#define BPF_PROG_ATTACH_LAST_FIELD attach_flags
static int sockmap_get_from_fd(const union bpf_attr *attr, bool attach)
static int sockmap_get_from_fd(const union bpf_attr *attr,
int type, bool attach)
{
struct bpf_prog *prog = NULL;
int ufd = attr->target_fd;
......@@ -1329,8 +1330,7 @@ static int sockmap_get_from_fd(const union bpf_attr *attr, bool attach)
return PTR_ERR(map);
if (attach) {
prog = bpf_prog_get_type(attr->attach_bpf_fd,
BPF_PROG_TYPE_SK_SKB);
prog = bpf_prog_get_type(attr->attach_bpf_fd, type);
if (IS_ERR(prog)) {
fdput(f);
return PTR_ERR(prog);
......@@ -1382,9 +1382,11 @@ static int bpf_prog_attach(const union bpf_attr *attr)
case BPF_CGROUP_DEVICE:
ptype = BPF_PROG_TYPE_CGROUP_DEVICE;
break;
case BPF_SK_MSG_VERDICT:
return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_MSG, true);
case BPF_SK_SKB_STREAM_PARSER:
case BPF_SK_SKB_STREAM_VERDICT:
return sockmap_get_from_fd(attr, true);
return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, true);
default:
return -EINVAL;
}
......@@ -1437,9 +1439,11 @@ static int bpf_prog_detach(const union bpf_attr *attr)
case BPF_CGROUP_DEVICE:
ptype = BPF_PROG_TYPE_CGROUP_DEVICE;
break;
case BPF_SK_MSG_VERDICT:
return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_MSG, false);
case BPF_SK_SKB_STREAM_PARSER:
case BPF_SK_SKB_STREAM_VERDICT:
return sockmap_get_from_fd(attr, false);
return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, false);
default:
return -EINVAL;
}
......
......@@ -1248,6 +1248,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
case BPF_PROG_TYPE_XDP:
case BPF_PROG_TYPE_LWT_XMIT:
case BPF_PROG_TYPE_SK_SKB:
case BPF_PROG_TYPE_SK_MSG:
if (meta)
return meta->pkt_access;
......@@ -2071,7 +2072,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
case BPF_MAP_TYPE_SOCKMAP:
if (func_id != BPF_FUNC_sk_redirect_map &&
func_id != BPF_FUNC_sock_map_update &&
func_id != BPF_FUNC_map_delete_elem)
func_id != BPF_FUNC_map_delete_elem &&
func_id != BPF_FUNC_msg_redirect_map)
goto error;
break;
default:
......@@ -2109,6 +2111,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
goto error;
break;
case BPF_FUNC_sk_redirect_map:
case BPF_FUNC_msg_redirect_map:
if (map->map_type != BPF_MAP_TYPE_SOCKMAP)
goto error;
break;
......
......@@ -1890,6 +1890,202 @@ static const struct bpf_func_proto bpf_sk_redirect_map_proto = {
.arg4_type = ARG_ANYTHING,
};
BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg_buff *, msg,
struct bpf_map *, map, u32, key, u64, flags)
{
/* If user passes invalid input drop the packet. */
if (unlikely(flags))
return SK_DROP;
msg->key = key;
msg->flags = flags;
msg->map = map;
return SK_PASS;
}
struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
{
struct sock *sk = NULL;
if (msg->map) {
sk = __sock_map_lookup_elem(msg->map, msg->key);
msg->key = 0;
msg->map = NULL;
}
return sk;
}
static const struct bpf_func_proto bpf_msg_redirect_map_proto = {
.func = bpf_msg_redirect_map,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_CONST_MAP_PTR,
.arg3_type = ARG_ANYTHING,
.arg4_type = ARG_ANYTHING,
};
BPF_CALL_2(bpf_msg_apply_bytes, struct sk_msg_buff *, msg, u32, bytes)
{
msg->apply_bytes = bytes;
return 0;
}
static const struct bpf_func_proto bpf_msg_apply_bytes_proto = {
.func = bpf_msg_apply_bytes,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_ANYTHING,
};
BPF_CALL_2(bpf_msg_cork_bytes, struct sk_msg_buff *, msg, u32, bytes)
{
msg->cork_bytes = bytes;
return 0;
}
static const struct bpf_func_proto bpf_msg_cork_bytes_proto = {
.func = bpf_msg_cork_bytes,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_ANYTHING,
};
BPF_CALL_4(bpf_msg_pull_data,
struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
{
unsigned int len = 0, offset = 0, copy = 0;
struct scatterlist *sg = msg->sg_data;
int first_sg, last_sg, i, shift;
unsigned char *p, *to, *from;
int bytes = end - start;
struct page *page;
if (unlikely(flags || end <= start))
return -EINVAL;
/* First find the starting scatterlist element */
i = msg->sg_start;
do {
len = sg[i].length;
offset += len;
if (start < offset + len)
break;
i++;
if (i == MAX_SKB_FRAGS)
i = 0;
} while (i != msg->sg_end);
if (unlikely(start >= offset + len))
return -EINVAL;
if (!msg->sg_copy[i] && bytes <= len)
goto out;
first_sg = i;
/* At this point we need to linearize multiple scatterlist
* elements or a single shared page. Either way we need to
* copy into a linear buffer exclusively owned by BPF. Then
* place the buffer in the scatterlist and fixup the original
* entries by removing the entries now in the linear buffer
* and shifting the remaining entries. For now we do not try
* to copy partial entries to avoid complexity of running out
* of sg_entry slots. The downside is reading a single byte
* will copy the entire sg entry.
*/
do {
copy += sg[i].length;
i++;
if (i == MAX_SKB_FRAGS)
i = 0;
if (bytes < copy)
break;
} while (i != msg->sg_end);
last_sg = i;
if (unlikely(copy < end - start))
return -EINVAL;
page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
if (unlikely(!page))
return -ENOMEM;
p = page_address(page);
offset = 0;
i = first_sg;
do {
from = sg_virt(&sg[i]);
len = sg[i].length;
to = p + offset;
memcpy(to, from, len);
offset += len;
sg[i].length = 0;
put_page(sg_page(&sg[i]));
i++;
if (i == MAX_SKB_FRAGS)
i = 0;
} while (i != last_sg);
sg[first_sg].length = copy;
sg_set_page(&sg[first_sg], page, copy, 0);
/* To repair sg ring we need to shift entries. If we only
* had a single entry though we can just replace it and
* be done. Otherwise walk the ring and shift the entries.
*/
shift = last_sg - first_sg - 1;
if (!shift)
goto out;
i = first_sg + 1;
do {
int move_from;
if (i + shift >= MAX_SKB_FRAGS)
move_from = i + shift - MAX_SKB_FRAGS;
else
move_from = i + shift;
if (move_from == msg->sg_end)
break;
sg[i] = sg[move_from];
sg[move_from].length = 0;
sg[move_from].page_link = 0;
sg[move_from].offset = 0;
i++;
if (i == MAX_SKB_FRAGS)
i = 0;
} while (1);
msg->sg_end -= shift;
if (msg->sg_end < 0)
msg->sg_end += MAX_SKB_FRAGS;
out:
msg->data = sg_virt(&sg[i]) + start - offset;
msg->data_end = msg->data + bytes;
return 0;
}
static const struct bpf_func_proto bpf_msg_pull_data_proto = {
.func = bpf_msg_pull_data,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_ANYTHING,
.arg3_type = ARG_ANYTHING,
.arg4_type = ARG_ANYTHING,
};
BPF_CALL_1(bpf_get_cgroup_classid, const struct sk_buff *, skb)
{
return task_get_classid(skb);
......@@ -2831,7 +3027,8 @@ bool bpf_helper_changes_pkt_data(void *func)
func == bpf_l3_csum_replace ||
func == bpf_l4_csum_replace ||
func == bpf_xdp_adjust_head ||
func == bpf_xdp_adjust_meta)
func == bpf_xdp_adjust_meta ||
func == bpf_msg_pull_data)
return true;
return false;
......@@ -3591,6 +3788,22 @@ static const struct bpf_func_proto *
}
}
static const struct bpf_func_proto *sk_msg_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
case BPF_FUNC_msg_redirect_map:
return &bpf_msg_redirect_map_proto;
case BPF_FUNC_msg_apply_bytes:
return &bpf_msg_apply_bytes_proto;
case BPF_FUNC_msg_cork_bytes:
return &bpf_msg_cork_bytes_proto;
case BPF_FUNC_msg_pull_data:
return &bpf_msg_pull_data_proto;
default:
return bpf_base_func_proto(func_id);
}
}
static const struct bpf_func_proto *sk_skb_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
......@@ -3980,6 +4193,32 @@ static bool sk_skb_is_valid_access(int off, int size,
return bpf_skb_is_valid_access(off, size, type, info);
}
static bool sk_msg_is_valid_access(int off, int size,
enum bpf_access_type type,
struct bpf_insn_access_aux *info)
{
if (type == BPF_WRITE)
return false;
switch (off) {
case offsetof(struct sk_msg_md, data):
info->reg_type = PTR_TO_PACKET;
break;
case offsetof(struct sk_msg_md, data_end):
info->reg_type = PTR_TO_PACKET_END;
break;
}
if (off < 0 || off >= sizeof(struct sk_msg_md))
return false;
if (off % size != 0)
return false;
if (size != sizeof(__u64))
return false;
return true;
}
static u32 bpf_convert_ctx_access(enum bpf_access_type type,
const struct bpf_insn *si,
struct bpf_insn *insn_buf,
......@@ -4778,6 +5017,29 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
return insn - insn_buf;
}
static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
const struct bpf_insn *si,
struct bpf_insn *insn_buf,
struct bpf_prog *prog, u32 *target_size)
{
struct bpf_insn *insn = insn_buf;
switch (si->off) {
case offsetof(struct sk_msg_md, data):
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data),
si->dst_reg, si->src_reg,
offsetof(struct sk_msg_buff, data));
break;
case offsetof(struct sk_msg_md, data_end):
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data_end),
si->dst_reg, si->src_reg,
offsetof(struct sk_msg_buff, data_end));
break;
}
return insn - insn_buf;
}
const struct bpf_verifier_ops sk_filter_verifier_ops = {
.get_func_proto = sk_filter_func_proto,
.is_valid_access = sk_filter_is_valid_access,
......@@ -4868,6 +5130,15 @@ const struct bpf_verifier_ops sk_skb_verifier_ops = {
const struct bpf_prog_ops sk_skb_prog_ops = {
};
const struct bpf_verifier_ops sk_msg_verifier_ops = {
.get_func_proto = sk_msg_func_proto,
.is_valid_access = sk_msg_is_valid_access,
.convert_ctx_access = sk_msg_convert_ctx_access,
};
const struct bpf_prog_ops sk_msg_prog_ops = {
};
int sk_detach_filter(struct sock *sk)
{
int ret = -ENOENT;
......
......@@ -2239,6 +2239,67 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
}
EXPORT_SYMBOL(sk_page_frag_refill);
int sk_alloc_sg(struct sock *sk, int len, struct scatterlist *sg,
int sg_start, int *sg_curr_index, unsigned int *sg_curr_size,
int first_coalesce)
{
int sg_curr = *sg_curr_index, use = 0, rc = 0;
unsigned int size = *sg_curr_size;
struct page_frag *pfrag;
struct scatterlist *sge;
len -= size;
pfrag = sk_page_frag(sk);
while (len > 0) {
unsigned int orig_offset;
if (!sk_page_frag_refill(sk, pfrag)) {
rc = -ENOMEM;
goto out;
}
use = min_t(int, len, pfrag->size - pfrag->offset);
if (!sk_wmem_schedule(sk, use)) {
rc = -ENOMEM;
goto out;
}
sk_mem_charge(sk, use);
size += use;
orig_offset = pfrag->offset;
pfrag->offset += use;
sge = sg + sg_curr - 1;
if (sg_curr > first_coalesce && sg_page(sg) == pfrag->page &&
sg->offset + sg->length == orig_offset) {
sg->length += use;
} else {
sge = sg + sg_curr;
sg_unmark_end(sge);
sg_set_page(sge, pfrag->page, use, orig_offset);
get_page(pfrag->page);
sg_curr++;
if (sg_curr == MAX_SKB_FRAGS)
sg_curr = 0;
if (sg_curr == sg_start) {
rc = -ENOSPC;
break;
}
}
len -= use;
}
out:
*sg_curr_size = size;
*sg_curr_index = sg_curr;
return rc;
}
EXPORT_SYMBOL(sk_alloc_sg);
static void __lock_sock(struct sock *sk)
__releases(&sk->sk_lock.slock)
__acquires(&sk->sk_lock.slock)
......
......@@ -994,7 +994,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
get_page(page);
skb_fill_page_desc(skb, i, page, offset, copy);
}
skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
if (!(flags & MSG_NO_SHARED_FRAGS))
skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
skb->len += copy;
skb->data_len += copy;
......
......@@ -87,71 +87,16 @@ static void trim_both_sgl(struct sock *sk, int target_size)
target_size);
}
static int alloc_sg(struct sock *sk, int len, struct scatterlist *sg,
int *sg_num_elem, unsigned int *sg_size,
int first_coalesce)
{
struct page_frag *pfrag;
unsigned int size = *sg_size;
int num_elem = *sg_num_elem, use = 0, rc = 0;
struct scatterlist *sge;
unsigned int orig_offset;
len -= size;
pfrag = sk_page_frag(sk);
while (len > 0) {
if (!sk_page_frag_refill(sk, pfrag)) {
rc = -ENOMEM;
goto out;
}
use = min_t(int, len, pfrag->size - pfrag->offset);
if (!sk_wmem_schedule(sk, use)) {
rc = -ENOMEM;
goto out;
}
sk_mem_charge(sk, use);
size += use;
orig_offset = pfrag->offset;
pfrag->offset += use;
sge = sg + num_elem - 1;
if (num_elem > first_coalesce && sg_page(sg) == pfrag->page &&
sg->offset + sg->length == orig_offset) {
sg->length += use;
} else {
sge++;
sg_unmark_end(sge);
sg_set_page(sge, pfrag->page, use, orig_offset);
get_page(pfrag->page);
++num_elem;
if (num_elem == MAX_SKB_FRAGS) {
rc = -ENOSPC;
break;
}
}
len -= use;
}
goto out;
out:
*sg_size = size;
*sg_num_elem = num_elem;
return rc;
}
static int alloc_encrypted_sg(struct sock *sk, int len)
{
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
int rc = 0;
rc = alloc_sg(sk, len, ctx->sg_encrypted_data,
&ctx->sg_encrypted_num_elem, &ctx->sg_encrypted_size, 0);
rc = sk_alloc_sg(sk, len,
ctx->sg_encrypted_data, 0,
&ctx->sg_encrypted_num_elem,
&ctx->sg_encrypted_size, 0);
return rc;
}
......@@ -162,9 +107,9 @@ static int alloc_plaintext_sg(struct sock *sk, int len)
struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
int rc = 0;
rc = alloc_sg(sk, len, ctx->sg_plaintext_data,
&ctx->sg_plaintext_num_elem, &ctx->sg_plaintext_size,
tls_ctx->pending_open_record_frags);
rc = sk_alloc_sg(sk, len, ctx->sg_plaintext_data, 0,
&ctx->sg_plaintext_num_elem, &ctx->sg_plaintext_size,
tls_ctx->pending_open_record_frags);
return rc;
}
......
......@@ -67,6 +67,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0;
bool is_sockops = strncmp(event, "sockops", 7) == 0;
bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
size_t insns_cnt = size / sizeof(struct bpf_insn);
enum bpf_prog_type prog_type;
char buf[256];
......@@ -96,6 +97,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
prog_type = BPF_PROG_TYPE_SOCK_OPS;
} else if (is_sk_skb) {
prog_type = BPF_PROG_TYPE_SK_SKB;
} else if (is_sk_msg) {
prog_type = BPF_PROG_TYPE_SK_MSG;
} else {
printf("Unknown event '%s'\n", event);
return -1;
......@@ -113,7 +116,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
return 0;
if (is_socket || is_sockops || is_sk_skb) {
if (is_socket || is_sockops || is_sk_skb || is_sk_msg) {
if (is_socket)
event += 6;
else
......@@ -589,7 +592,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
memcmp(shname, "socket", 6) == 0 ||
memcmp(shname, "cgroup/", 7) == 0 ||
memcmp(shname, "sockops", 7) == 0 ||
memcmp(shname, "sk_skb", 6) == 0) {
memcmp(shname, "sk_skb", 6) == 0 ||
memcmp(shname, "sk_msg", 6) == 0) {
ret = load_and_attach(shname, data->d_buf,
data->d_size);
if (ret != 0)
......
......@@ -43,6 +43,42 @@ struct bpf_map_def SEC("maps") sock_map = {
.max_entries = 20,
};
struct bpf_map_def SEC("maps") sock_map_txmsg = {
.type = BPF_MAP_TYPE_SOCKMAP,
.key_size = sizeof(int),
.value_size = sizeof(int),
.max_entries = 20,
};
struct bpf_map_def SEC("maps") sock_map_redir = {
.type = BPF_MAP_TYPE_SOCKMAP,
.key_size = sizeof(int),
.value_size = sizeof(int),
.max_entries = 1,
};
struct bpf_map_def SEC("maps") sock_apply_bytes = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(int),
.max_entries = 1
};
struct bpf_map_def SEC("maps") sock_cork_bytes = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(int),
.max_entries = 1
};
struct bpf_map_def SEC("maps") sock_pull_bytes = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(int),
.max_entries = 2
};
SEC("sk_skb1")
int bpf_prog1(struct __sk_buff *skb)
{
......@@ -105,4 +141,165 @@ int bpf_sockmap(struct bpf_sock_ops *skops)
return 0;
}
SEC("sk_msg1")
int bpf_prog4(struct sk_msg_md *msg)
{
int *bytes, zero = 0, one = 1;
int *start, *end;
bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero);
if (bytes)
bpf_msg_apply_bytes(msg, *bytes);
bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero);
if (bytes)
bpf_msg_cork_bytes(msg, *bytes);
start = bpf_map_lookup_elem(&sock_pull_bytes, &zero);
end = bpf_map_lookup_elem(&sock_pull_bytes, &one);
if (start && end)
bpf_msg_pull_data(msg, *start, *end, 0);
return SK_PASS;
}
SEC("sk_msg2")
int bpf_prog5(struct sk_msg_md *msg)
{
int err1 = -1, err2 = -1, zero = 0, one = 1;
int *bytes, *start, *end, len1, len2;
bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero);
if (bytes)
err1 = bpf_msg_apply_bytes(msg, *bytes);
bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero);
if (bytes)
err2 = bpf_msg_cork_bytes(msg, *bytes);
len1 = (__u64)msg->data_end - (__u64)msg->data;
start = bpf_map_lookup_elem(&sock_pull_bytes, &zero);
end = bpf_map_lookup_elem(&sock_pull_bytes, &one);
if (start && end) {
int err;
bpf_printk("sk_msg2: pull(%i:%i)\n",
start ? *start : 0, end ? *end : 0);
err = bpf_msg_pull_data(msg, *start, *end, 0);
if (err)
bpf_printk("sk_msg2: pull_data err %i\n",
err);
len2 = (__u64)msg->data_end - (__u64)msg->data;
bpf_printk("sk_msg2: length update %i->%i\n",
len1, len2);
}
bpf_printk("sk_msg2: data length %i err1 %i err2 %i\n",
len1, err1, err2);
return SK_PASS;
}
SEC("sk_msg3")
int bpf_prog6(struct sk_msg_md *msg)
{
int *bytes, zero = 0, one = 1;
int *start, *end;
bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero);
if (bytes)
bpf_msg_apply_bytes(msg, *bytes);
bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero);
if (bytes)
bpf_msg_cork_bytes(msg, *bytes);
start = bpf_map_lookup_elem(&sock_pull_bytes, &zero);
end = bpf_map_lookup_elem(&sock_pull_bytes, &one);
if (start && end)
bpf_msg_pull_data(msg, *start, *end, 0);
return bpf_msg_redirect_map(msg, &sock_map_redir, zero, 0);
}
SEC("sk_msg4")
int bpf_prog7(struct sk_msg_md *msg)
{
int err1 = 0, err2 = 0, zero = 0, one = 1;
int *bytes, *start, *end, len1, len2;
bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero);
if (bytes)
err1 = bpf_msg_apply_bytes(msg, *bytes);
bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero);
if (bytes)
err2 = bpf_msg_cork_bytes(msg, *bytes);
len1 = (__u64)msg->data_end - (__u64)msg->data;
start = bpf_map_lookup_elem(&sock_pull_bytes, &zero);
end = bpf_map_lookup_elem(&sock_pull_bytes, &one);
if (start && end) {
int err;
bpf_printk("sk_msg2: pull(%i:%i)\n",
start ? *start : 0, end ? *end : 0);
err = bpf_msg_pull_data(msg, *start, *end, 0);
if (err)
bpf_printk("sk_msg2: pull_data err %i\n",
err);
len2 = (__u64)msg->data_end - (__u64)msg->data;
bpf_printk("sk_msg2: length update %i->%i\n",
len1, len2);
}
bpf_printk("sk_msg3: redirect(%iB) err1=%i err2=%i\n",
len1, err1, err2);
return bpf_msg_redirect_map(msg, &sock_map_redir, zero, 0);
}
SEC("sk_msg5")
int bpf_prog8(struct sk_msg_md *msg)
{
void *data_end = (void *)(long) msg->data_end;
void *data = (void *)(long) msg->data;
int ret = 0, *bytes, zero = 0;
bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero);
if (bytes) {
ret = bpf_msg_apply_bytes(msg, *bytes);
if (ret)
return SK_DROP;
} else {
return SK_DROP;
}
return SK_PASS;
}
SEC("sk_msg6")
int bpf_prog9(struct sk_msg_md *msg)
{
void *data_end = (void *)(long) msg->data_end;
void *data = (void *)(long) msg->data;
int ret = 0, *bytes, zero = 0;
bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero);
if (bytes) {
if (((__u64)data_end - (__u64)data) >= *bytes)
return SK_PASS;
ret = bpf_msg_cork_bytes(msg, *bytes);
if (ret)
return SK_DROP;
}
return SK_PASS;
}
SEC("sk_msg7")
int bpf_prog10(struct sk_msg_md *msg)
{
int *bytes, zero = 0, one = 1;
int *start, *end;
bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero);
if (bytes)
bpf_msg_apply_bytes(msg, *bytes);
bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero);
if (bytes)
bpf_msg_cork_bytes(msg, *bytes);
start = bpf_map_lookup_elem(&sock_pull_bytes, &zero);
end = bpf_map_lookup_elem(&sock_pull_bytes, &one);
if (start && end)
bpf_msg_pull_data(msg, *start, *end, 0);
return SK_DROP;
}
char _license[] SEC("license") = "GPL";
#Test a bunch of positive cases to verify basic functionality
for prog in "--txmsg" "--txmsg_redir" "--txmsg_drop"; do
for t in "sendmsg" "sendpage"; do
for r in 1 10 100; do
for i in 1 10 100; do
for l in 1 10 100; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
done
done
done
done
#Test max iov
t="sendmsg"
r=1
i=1024
l=1
prog="--txmsg"
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
prog="--txmsg_redir"
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
# Test max iov with 1k send
t="sendmsg"
r=1
i=1024
l=1024
prog="--txmsg"
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
prog="--txmsg_redir"
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
# Test apply with 1B
r=1
i=1024
l=1024
prog="--txmsg_apply 1"
for t in "sendmsg" "sendpage"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test apply with larger value than send
r=1
i=8
l=1024
prog="--txmsg_apply 2048"
for t in "sendmsg" "sendpage"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test apply with apply that never reaches limit
r=1024
i=1
l=1
prog="--txmsg_apply 2048"
for t in "sendmsg" "sendpage"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test apply and redirect with 1B
r=1
i=1024
l=1024
prog="--txmsg_redir --txmsg_apply 1"
for t in "sendmsg" "sendpage"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test apply and redirect with larger value than send
r=1
i=8
l=1024
prog="--txmsg_redir --txmsg_apply 2048"
for t in "sendmsg" "sendpage"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test apply and redirect with apply that never reaches limit
r=1024
i=1
l=1
prog="--txmsg_apply 2048"
for t in "sendmsg" "sendpage"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test cork with 1B not really useful but test it anyways
r=1
i=1024
l=1024
prog="--txmsg_cork 1"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test cork with a more reasonable 100B
r=1
i=1000
l=1000
prog="--txmsg_cork 100"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test cork with larger value than send
r=1
i=8
l=1024
prog="--txmsg_cork 2048"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test cork with cork that never reaches limit
r=1024
i=1
l=1
prog="--txmsg_cork 2048"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
r=1
i=1024
l=1024
prog="--txmsg_redir --txmsg_cork 1"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test cork with a more reasonable 100B
r=1
i=1000
l=1000
prog="--txmsg_redir --txmsg_cork 100"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test cork with larger value than send
r=1
i=8
l=1024
prog="--txmsg_redir --txmsg_cork 2048"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test cork with cork that never reaches limit
r=1024
i=1
l=1
prog="--txmsg_cork 2048"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# mix and match cork and apply not really useful but valid programs
# Test apply < cork
r=100
i=1
l=5
prog="--txmsg_apply 10 --txmsg_cork 100"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Try again with larger sizes so we hit overflow case
r=100
i=1000
l=2048
prog="--txmsg_apply 4096 --txmsg_cork 8096"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test apply > cork
r=100
i=1
l=5
prog="--txmsg_apply 100 --txmsg_cork 10"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Again with larger sizes so we hit overflow cases
r=100
i=1000
l=2048
prog="--txmsg_apply 8096 --txmsg_cork 4096"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test apply = cork
r=100
i=1
l=5
prog="--txmsg_apply 10 --txmsg_cork 10"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
r=100
i=1000
l=2048
prog="--txmsg_apply 4096 --txmsg_cork 4096"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test apply < cork
r=100
i=1
l=5
prog="--txmsg_redir --txmsg_apply 10 --txmsg_cork 100"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Try again with larger sizes so we hit overflow case
r=100
i=1000
l=2048
prog="--txmsg_redir --txmsg_apply 4096 --txmsg_cork 8096"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test apply > cork
r=100
i=1
l=5
prog="--txmsg_redir --txmsg_apply 100 --txmsg_cork 10"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Again with larger sizes so we hit overflow cases
r=100
i=1000
l=2048
prog="--txmsg_redir --txmsg_apply 8096 --txmsg_cork 4096"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Test apply = cork
r=100
i=1
l=5
prog="--txmsg_redir --txmsg_apply 10 --txmsg_cork 10"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
r=100
i=1000
l=2048
prog="--txmsg_redir --txmsg_apply 4096 --txmsg_cork 4096"
for t in "sendpage" "sendmsg"; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog"
echo $TEST
$TEST
sleep 2
done
# Tests for bpf_msg_pull_data()
for i in `seq 99 100 1600`; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t sendpage -r 16 -i 1 -l 100 \
--txmsg --txmsg_start 0 --txmsg_end $i --txmsg_cork 1600"
echo $TEST
$TEST
sleep 2
done
for i in `seq 199 100 1600`; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t sendpage -r 16 -i 1 -l 100 \
--txmsg --txmsg_start 100 --txmsg_end $i --txmsg_cork 1600"
echo $TEST
$TEST
sleep 2
done
TEST="./sockmap --cgroup /mnt/cgroup2/ -t sendpage -r 16 -i 1 -l 100 \
--txmsg --txmsg_start 1500 --txmsg_end 1600 --txmsg_cork 1600"
echo $TEST
$TEST
sleep 2
TEST="./sockmap --cgroup /mnt/cgroup2/ -t sendpage -r 16 -i 1 -l 100 \
--txmsg --txmsg_start 1111 --txmsg_end 1112 --txmsg_cork 1600"
echo $TEST
$TEST
sleep 2
TEST="./sockmap --cgroup /mnt/cgroup2/ -t sendpage -r 16 -i 1 -l 100 \
--txmsg --txmsg_start 1111 --txmsg_end 0 --txmsg_cork 1600"
echo $TEST
$TEST
sleep 2
TEST="./sockmap --cgroup /mnt/cgroup2/ -t sendpage -r 16 -i 1 -l 100 \
--txmsg --txmsg_start 0 --txmsg_end 1601 --txmsg_cork 1600"
echo $TEST
$TEST
sleep 2
TEST="./sockmap --cgroup /mnt/cgroup2/ -t sendpage -r 16 -i 1 -l 100 \
--txmsg --txmsg_start 0 --txmsg_end 1601 --txmsg_cork 1602"
echo $TEST
$TEST
sleep 2
# Run through gamut again with start and end
for prog in "--txmsg" "--txmsg_redir" "--txmsg_drop"; do
for t in "sendmsg" "sendpage"; do
for r in 1 10 100; do
for i in 1 10 100; do
for l in 1 10 100; do
TEST="./sockmap --cgroup /mnt/cgroup2/ -t $t -r $r -i $i -l $l $prog --txmsg_start 1 --txmsg_end 2"
echo $TEST
$TEST
sleep 2
done
done
done
done
done
# Some specific tests to cover specific code paths
./sockmap --cgroup /mnt/cgroup2/ -t sendpage \
-r 5 -i 1 -l 1 --txmsg_redir --txmsg_cork 5 --txmsg_apply 3
./sockmap --cgroup /mnt/cgroup2/ -t sendmsg \
-r 5 -i 1 -l 1 --txmsg_redir --txmsg_cork 5 --txmsg_apply 3
./sockmap --cgroup /mnt/cgroup2/ -t sendpage \
-r 5 -i 1 -l 1 --txmsg_redir --txmsg_cork 5 --txmsg_apply 5
./sockmap --cgroup /mnt/cgroup2/ -t sendmsg \
-r 5 -i 1 -l 1 --txmsg_redir --txmsg_cork 5 --txmsg_apply 5
......@@ -29,6 +29,7 @@
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/sendfile.h>
#include <linux/netlink.h>
#include <linux/socket.h>
......@@ -54,6 +55,16 @@ void running_handler(int a);
/* global sockets */
int s1, s2, c1, c2, p1, p2;
int txmsg_pass;
int txmsg_noisy;
int txmsg_redir;
int txmsg_redir_noisy;
int txmsg_drop;
int txmsg_apply;
int txmsg_cork;
int txmsg_start;
int txmsg_end;
static const struct option long_options[] = {
{"help", no_argument, NULL, 'h' },
{"cgroup", required_argument, NULL, 'c' },
......@@ -62,6 +73,16 @@ static const struct option long_options[] = {
{"iov_count", required_argument, NULL, 'i' },
{"length", required_argument, NULL, 'l' },
{"test", required_argument, NULL, 't' },
{"data_test", no_argument, NULL, 'd' },
{"txmsg", no_argument, &txmsg_pass, 1 },
{"txmsg_noisy", no_argument, &txmsg_noisy, 1 },
{"txmsg_redir", no_argument, &txmsg_redir, 1 },
{"txmsg_redir_noisy", no_argument, &txmsg_redir_noisy, 1},
{"txmsg_drop", no_argument, &txmsg_drop, 1 },
{"txmsg_apply", required_argument, NULL, 'a'},
{"txmsg_cork", required_argument, NULL, 'k'},
{"txmsg_start", required_argument, NULL, 's'},
{"txmsg_end", required_argument, NULL, 'e'},
{0, 0, NULL, 0 }
};
......@@ -195,19 +216,71 @@ struct msg_stats {
struct timespec end;
};
struct sockmap_options {
int verbose;
bool base;
bool sendpage;
bool data_test;
bool drop_expected;
};
static int msg_loop_sendpage(int fd, int iov_length, int cnt,
struct msg_stats *s,
struct sockmap_options *opt)
{
bool drop = opt->drop_expected;
unsigned char k = 0;
FILE *file;
int i, fp;
file = fopen(".sendpage_tst.tmp", "w+");
for (i = 0; i < iov_length * cnt; i++, k++)
fwrite(&k, sizeof(char), 1, file);
fflush(file);
fseek(file, 0, SEEK_SET);
fclose(file);
fp = open(".sendpage_tst.tmp", O_RDONLY);
clock_gettime(CLOCK_MONOTONIC, &s->start);
for (i = 0; i < cnt; i++) {
int sent = sendfile(fd, fp, NULL, iov_length);
if (!drop && sent < 0) {
perror("send loop error:");
close(fp);
return sent;
} else if (drop && sent >= 0) {
printf("sendpage loop error expected: %i\n", sent);
close(fp);
return -EIO;
}
if (sent > 0)
s->bytes_sent += sent;
}
clock_gettime(CLOCK_MONOTONIC, &s->end);
close(fp);
return 0;
}
static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
struct msg_stats *s, bool tx)
struct msg_stats *s, bool tx,
struct sockmap_options *opt)
{
struct msghdr msg = {0};
int err, i, flags = MSG_NOSIGNAL;
struct iovec *iov;
unsigned char k;
bool data_test = opt->data_test;
bool drop = opt->drop_expected;
iov = calloc(iov_count, sizeof(struct iovec));
if (!iov)
return errno;
k = 0;
for (i = 0; i < iov_count; i++) {
char *d = calloc(iov_length, sizeof(char));
unsigned char *d = calloc(iov_length, sizeof(char));
if (!d) {
fprintf(stderr, "iov_count %i/%i OOM\n", i, iov_count);
......@@ -215,21 +288,34 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
}
iov[i].iov_base = d;
iov[i].iov_len = iov_length;
if (data_test && tx) {
int j;
for (j = 0; j < iov_length; j++)
d[j] = k++;
}
}
msg.msg_iov = iov;
msg.msg_iovlen = iov_count;
k = 0;
if (tx) {
clock_gettime(CLOCK_MONOTONIC, &s->start);
for (i = 0; i < cnt; i++) {
int sent = sendmsg(fd, &msg, flags);
if (sent < 0) {
if (!drop && sent < 0) {
perror("send loop error:");
goto out_errno;
} else if (drop && sent >= 0) {
printf("send loop error expected: %i\n", sent);
errno = -EIO;
goto out_errno;
}
s->bytes_sent += sent;
if (sent > 0)
s->bytes_sent += sent;
}
clock_gettime(CLOCK_MONOTONIC, &s->end);
} else {
......@@ -272,6 +358,26 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
}
s->bytes_recvd += recv;
if (data_test) {
int j;
for (i = 0; i < msg.msg_iovlen; i++) {
unsigned char *d = iov[i].iov_base;
for (j = 0;
j < iov[i].iov_len && recv; j++) {
if (d[j] != k++) {
errno = -EIO;
fprintf(stderr,
"detected data corruption @iov[%i]:%i %02x != %02x, %02x ?= %02x\n",
i, j, d[j], k - 1, d[j+1], k + 1);
goto out_errno;
}
recv--;
}
}
}
}
clock_gettime(CLOCK_MONOTONIC, &s->end);
}
......@@ -300,7 +406,7 @@ static inline float recvdBps(struct msg_stats s)
}
static int sendmsg_test(int iov_count, int iov_buf, int cnt,
int verbose, bool base)
struct sockmap_options *opt)
{
float sent_Bps = 0, recvd_Bps = 0;
int rx_fd, txpid, rxpid, err = 0;
......@@ -309,14 +415,20 @@ static int sendmsg_test(int iov_count, int iov_buf, int cnt,
errno = 0;
if (base)
if (opt->base)
rx_fd = p1;
else
rx_fd = p2;
rxpid = fork();
if (rxpid == 0) {
err = msg_loop(rx_fd, iov_count, iov_buf, cnt, &s, false);
if (opt->drop_expected)
exit(1);
if (opt->sendpage)
iov_count = 1;
err = msg_loop(rx_fd, iov_count, iov_buf,
cnt, &s, false, opt);
if (err)
fprintf(stderr,
"msg_loop_rx: iov_count %i iov_buf %i cnt %i err %i\n",
......@@ -339,7 +451,12 @@ static int sendmsg_test(int iov_count, int iov_buf, int cnt,
txpid = fork();
if (txpid == 0) {
err = msg_loop(c1, iov_count, iov_buf, cnt, &s, true);
if (opt->sendpage)
err = msg_loop_sendpage(c1, iov_buf, cnt, &s, opt);
else
err = msg_loop(c1, iov_count, iov_buf,
cnt, &s, true, opt);
if (err)
fprintf(stderr,
"msg_loop_tx: iov_count %i iov_buf %i cnt %i err %i\n",
......@@ -364,7 +481,7 @@ static int sendmsg_test(int iov_count, int iov_buf, int cnt,
return err;
}
static int forever_ping_pong(int rate, int verbose)
static int forever_ping_pong(int rate, struct sockmap_options *opt)
{
struct timeval timeout;
char buf[1024] = {0};
......@@ -429,7 +546,7 @@ static int forever_ping_pong(int rate, int verbose)
if (rate)
sleep(rate);
if (verbose) {
if (opt->verbose) {
printf(".");
fflush(stdout);
......@@ -443,20 +560,34 @@ enum {
PING_PONG,
SENDMSG,
BASE,
BASE_SENDPAGE,
SENDPAGE,
};
int main(int argc, char **argv)
{
int iov_count = 1, length = 1024, rate = 1, verbose = 0;
int iov_count = 1, length = 1024, rate = 1, tx_prog_fd;
struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
int opt, longindex, err, cg_fd = 0;
struct sockmap_options options = {0};
int test = PING_PONG;
char filename[256];
while ((opt = getopt_long(argc, argv, "hvc:r:i:l:t:",
while ((opt = getopt_long(argc, argv, ":dhvc:r:i:l:t:",
long_options, &longindex)) != -1) {
switch (opt) {
/* Cgroup configuration */
case 's':
txmsg_start = atoi(optarg);
break;
case 'e':
txmsg_end = atoi(optarg);
break;
case 'a':
txmsg_apply = atoi(optarg);
break;
case 'k':
txmsg_cork = atoi(optarg);
break;
case 'c':
cg_fd = open(optarg, O_DIRECTORY, O_RDONLY);
if (cg_fd < 0) {
......@@ -470,7 +601,7 @@ int main(int argc, char **argv)
rate = atoi(optarg);
break;
case 'v':
verbose = 1;
options.verbose = 1;
break;
case 'i':
iov_count = atoi(optarg);
......@@ -478,6 +609,9 @@ int main(int argc, char **argv)
case 'l':
length = atoi(optarg);
break;
case 'd':
options.data_test = true;
break;
case 't':
if (strcmp(optarg, "ping") == 0) {
test = PING_PONG;
......@@ -485,11 +619,17 @@ int main(int argc, char **argv)
test = SENDMSG;
} else if (strcmp(optarg, "base") == 0) {
test = BASE;
} else if (strcmp(optarg, "base_sendpage") == 0) {
test = BASE_SENDPAGE;
} else if (strcmp(optarg, "sendpage") == 0) {
test = SENDPAGE;
} else {
usage(argv);
return -1;
}
break;
case 0:
break;
case 'h':
default:
usage(argv);
......@@ -515,16 +655,16 @@ int main(int argc, char **argv)
/* catch SIGINT */
signal(SIGINT, running_handler);
/* If base test skip BPF setup */
if (test == BASE)
goto run;
if (load_bpf_file(filename)) {
fprintf(stderr, "load_bpf_file: (%s) %s\n",
filename, strerror(errno));
return 1;
}
/* If base test skip BPF setup */
if (test == BASE || test == BASE_SENDPAGE)
goto run;
/* Attach programs to sockmap */
err = bpf_prog_attach(prog_fd[0], map_fd[0],
BPF_SK_SKB_STREAM_PARSER, 0);
......@@ -557,13 +697,126 @@ int main(int argc, char **argv)
goto out;
}
if (test == PING_PONG)
err = forever_ping_pong(rate, verbose);
else if (test == SENDMSG)
err = sendmsg_test(iov_count, length, rate, verbose, false);
else if (test == BASE)
err = sendmsg_test(iov_count, length, rate, verbose, true);
/* Attach txmsg program to sockmap */
if (txmsg_pass)
tx_prog_fd = prog_fd[3];
else if (txmsg_noisy)
tx_prog_fd = prog_fd[4];
else if (txmsg_redir)
tx_prog_fd = prog_fd[5];
else if (txmsg_redir_noisy)
tx_prog_fd = prog_fd[6];
else if (txmsg_drop)
tx_prog_fd = prog_fd[9];
/* apply and cork must be last */
else if (txmsg_apply)
tx_prog_fd = prog_fd[7];
else if (txmsg_cork)
tx_prog_fd = prog_fd[8];
else
tx_prog_fd = 0;
if (tx_prog_fd) {
int redir_fd, i = 0;
err = bpf_prog_attach(tx_prog_fd,
map_fd[1], BPF_SK_MSG_VERDICT, 0);
if (err) {
fprintf(stderr,
"ERROR: bpf_prog_attach (txmsg): %d (%s)\n",
err, strerror(errno));
return err;
}
err = bpf_map_update_elem(map_fd[1], &i, &c1, BPF_ANY);
if (err) {
fprintf(stderr,
"ERROR: bpf_map_update_elem (txmsg): %d (%s\n",
err, strerror(errno));
return err;
}
if (txmsg_redir || txmsg_redir_noisy)
redir_fd = c2;
else
redir_fd = c1;
err = bpf_map_update_elem(map_fd[2], &i, &redir_fd, BPF_ANY);
if (err) {
fprintf(stderr,
"ERROR: bpf_map_update_elem (txmsg): %d (%s\n",
err, strerror(errno));
return err;
}
if (txmsg_apply) {
err = bpf_map_update_elem(map_fd[3],
&i, &txmsg_apply, BPF_ANY);
if (err) {
fprintf(stderr,
"ERROR: bpf_map_update_elem (apply_bytes): %d (%s\n",
err, strerror(errno));
return err;
}
}
if (txmsg_cork) {
err = bpf_map_update_elem(map_fd[4],
&i, &txmsg_cork, BPF_ANY);
if (err) {
fprintf(stderr,
"ERROR: bpf_map_update_elem (cork_bytes): %d (%s\n",
err, strerror(errno));
return err;
}
}
if (txmsg_start) {
err = bpf_map_update_elem(map_fd[5],
&i, &txmsg_start, BPF_ANY);
if (err) {
fprintf(stderr,
"ERROR: bpf_map_update_elem (txmsg_start): %d (%s)\n",
err, strerror(errno));
return err;
}
}
if (txmsg_end) {
i = 1;
err = bpf_map_update_elem(map_fd[5],
&i, &txmsg_end, BPF_ANY);
if (err) {
fprintf(stderr,
"ERROR: bpf_map_update_elem (txmsg_end): %d (%s)\n",
err, strerror(errno));
return err;
}
}
}
if (txmsg_drop)
options.drop_expected = true;
if (test == PING_PONG)
err = forever_ping_pong(rate, &options);
else if (test == SENDMSG) {
options.base = false;
options.sendpage = false;
err = sendmsg_test(iov_count, length, rate, &options);
} else if (test == SENDPAGE) {
options.base = false;
options.sendpage = true;
err = sendmsg_test(iov_count, length, rate, &options);
} else if (test == BASE) {
options.base = true;
options.sendpage = false;
err = sendmsg_test(iov_count, length, rate, &options);
} else if (test == BASE_SENDPAGE) {
options.base = true;
options.sendpage = true;
err = sendmsg_test(iov_count, length, rate, &options);
} else
fprintf(stderr, "unknown test\n");
out:
bpf_prog_detach2(prog_fd[2], cg_fd, BPF_CGROUP_SOCK_OPS);
......
......@@ -133,6 +133,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SOCK_OPS,
BPF_PROG_TYPE_SK_SKB,
BPF_PROG_TYPE_CGROUP_DEVICE,
BPF_PROG_TYPE_SK_MSG,
};
enum bpf_attach_type {
......@@ -143,6 +144,7 @@ enum bpf_attach_type {
BPF_SK_SKB_STREAM_PARSER,
BPF_SK_SKB_STREAM_VERDICT,
BPF_CGROUP_DEVICE,
BPF_SK_MSG_VERDICT,
__MAX_BPF_ATTACH_TYPE
};
......@@ -718,6 +720,15 @@ union bpf_attr {
* int bpf_override_return(pt_regs, rc)
* @pt_regs: pointer to struct pt_regs
* @rc: the return value to set
*
* int bpf_msg_redirect_map(map, key, flags)
* Redirect msg to a sock in map using key as a lookup key for the
* sock in map.
* @map: pointer to sockmap
* @key: key to lookup sock in map
* @flags: reserved for future use
* Return: SK_PASS
*
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
......@@ -779,7 +790,11 @@ union bpf_attr {
FN(perf_prog_read_value), \
FN(getsockopt), \
FN(override_return), \
FN(sock_ops_cb_flags_set),
FN(sock_ops_cb_flags_set), \
FN(msg_redirect_map), \
FN(msg_apply_bytes), \
FN(msg_cork_bytes), \
FN(msg_pull_data),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
......@@ -941,6 +956,14 @@ enum sk_action {
SK_PASS,
};
/* user accessible metadata for SK_MSG packet hook, new fields must
* be added to the end of this structure
*/
struct sk_msg_md {
void *data;
void *data_end;
};
#define BPF_TAG_SIZE 8
struct bpf_prog_info {
......
......@@ -1857,6 +1857,7 @@ static const struct {
BPF_PROG_SEC("lwt_xmit", BPF_PROG_TYPE_LWT_XMIT),
BPF_PROG_SEC("sockops", BPF_PROG_TYPE_SOCK_OPS),
BPF_PROG_SEC("sk_skb", BPF_PROG_TYPE_SK_SKB),
BPF_PROG_SEC("sk_msg", BPF_PROG_TYPE_SK_MSG),
};
#undef BPF_PROG_SEC
......
......@@ -29,7 +29,8 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \
sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
sockmap_tcp_msg_prog.o
# Order correspond to 'make run_tests' order
TEST_PROGS := test_kmod.sh \
......
......@@ -86,6 +86,14 @@ static int (*bpf_perf_prog_read_value)(void *ctx, void *buf,
(void *) BPF_FUNC_perf_prog_read_value;
static int (*bpf_override_return)(void *ctx, unsigned long rc) =
(void *) BPF_FUNC_override_return;
static int (*bpf_msg_redirect_map)(void *ctx, void *map, int key, int flags) =
(void *) BPF_FUNC_msg_redirect_map;
static int (*bpf_msg_apply_bytes)(void *ctx, int len) =
(void *) BPF_FUNC_msg_apply_bytes;
static int (*bpf_msg_cork_bytes)(void *ctx, int len) =
(void *) BPF_FUNC_msg_cork_bytes;
static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) =
(void *) BPF_FUNC_msg_pull_data;
/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
......@@ -123,6 +131,8 @@ static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
(void *) BPF_FUNC_skb_under_cgroup;
static int (*bpf_skb_change_head)(void *, int len, int flags) =
(void *) BPF_FUNC_skb_change_head;
static int (*bpf_skb_pull_data)(void *, int len) =
(void *) BPF_FUNC_skb_pull_data;
/* Scan the ARCH passed in from ARCH env variable (see Makefile) */
#if defined(__TARGET_ARCH_x86)
......
......@@ -20,14 +20,25 @@ int bpf_prog1(struct __sk_buff *skb)
__u32 lport = skb->local_port;
__u32 rport = skb->remote_port;
__u8 *d = data;
__u32 len = (__u32) data_end - (__u32) data;
int err;
if (data + 10 > data_end)
return skb->len;
if (data + 10 > data_end) {
err = bpf_skb_pull_data(skb, 10);
if (err)
return SK_DROP;
data_end = (void *)(long)skb->data_end;
data = (void *)(long)skb->data;
if (data + 10 > data_end)
return SK_DROP;
}
/* This write/read is a bit pointless but tests the verifier and
* strparser handler for read/write pkt data and access into sk
* fields.
*/
d = data;
d[7] = 1;
return skb->len;
}
......
#include <linux/bpf.h>
#include "bpf_helpers.h"
#include "bpf_util.h"
#include "bpf_endian.h"
int _version SEC("version") = 1;
#define bpf_printk(fmt, ...) \
({ \
char ____fmt[] = fmt; \
bpf_trace_printk(____fmt, sizeof(____fmt), \
##__VA_ARGS__); \
})
SEC("sk_msg1")
int bpf_prog1(struct sk_msg_md *msg)
{
void *data_end = (void *)(long) msg->data_end;
void *data = (void *)(long) msg->data;
char *d;
if (data + 8 > data_end)
return SK_DROP;
bpf_printk("data length %i\n", (__u64)msg->data_end - (__u64)msg->data);
d = (char *)data;
bpf_printk("hello sendmsg hook %i %i\n", d[0], d[1]);
return SK_PASS;
}
char _license[] SEC("license") = "GPL";
......@@ -26,6 +26,13 @@ struct bpf_map_def SEC("maps") sock_map_tx = {
.max_entries = 20,
};
struct bpf_map_def SEC("maps") sock_map_msg = {
.type = BPF_MAP_TYPE_SOCKMAP,
.key_size = sizeof(int),
.value_size = sizeof(int),
.max_entries = 20,
};
struct bpf_map_def SEC("maps") sock_map_break = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(int),
......
......@@ -464,15 +464,17 @@ static void test_devmap(int task, void *data)
#include <linux/err.h>
#define SOCKMAP_PARSE_PROG "./sockmap_parse_prog.o"
#define SOCKMAP_VERDICT_PROG "./sockmap_verdict_prog.o"
#define SOCKMAP_TCP_MSG_PROG "./sockmap_tcp_msg_prog.o"
static void test_sockmap(int tasks, void *data)
{
int one = 1, map_fd_rx = 0, map_fd_tx = 0, map_fd_break, s, sc, rc;
struct bpf_map *bpf_map_rx, *bpf_map_tx, *bpf_map_break;
struct bpf_map *bpf_map_rx, *bpf_map_tx, *bpf_map_msg, *bpf_map_break;
int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break;
int ports[] = {50200, 50201, 50202, 50204};
int err, i, fd, udp, sfd[6] = {0xdeadbeef};
u8 buf[20] = {0x0, 0x5, 0x3, 0x2, 0x1, 0x0};
int parse_prog, verdict_prog;
int parse_prog, verdict_prog, msg_prog;
struct sockaddr_in addr;
int one = 1, s, sc, rc;
struct bpf_object *obj;
struct timeval to;
__u32 key, value;
......@@ -584,6 +586,12 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
err = bpf_prog_attach(-1, fd, BPF_SK_MSG_VERDICT, 0);
if (!err) {
printf("Failed invalid msg verdict prog attach\n");
goto out_sockmap;
}
err = bpf_prog_attach(-1, fd, __MAX_BPF_ATTACH_TYPE, 0);
if (!err) {
printf("Failed unknown prog attach\n");
......@@ -602,6 +610,12 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
err = bpf_prog_detach(fd, BPF_SK_MSG_VERDICT);
if (err) {
printf("Failed empty msg verdict prog detach\n");
goto out_sockmap;
}
err = bpf_prog_detach(fd, __MAX_BPF_ATTACH_TYPE);
if (!err) {
printf("Detach invalid prog successful\n");
......@@ -616,6 +630,13 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
err = bpf_prog_load(SOCKMAP_TCP_MSG_PROG,
BPF_PROG_TYPE_SK_MSG, &obj, &msg_prog);
if (err) {
printf("Failed to load SK_SKB msg prog\n");
goto out_sockmap;
}
err = bpf_prog_load(SOCKMAP_VERDICT_PROG,
BPF_PROG_TYPE_SK_SKB, &obj, &verdict_prog);
if (err) {
......@@ -631,7 +652,7 @@ static void test_sockmap(int tasks, void *data)
map_fd_rx = bpf_map__fd(bpf_map_rx);
if (map_fd_rx < 0) {
printf("Failed to get map fd\n");
printf("Failed to get map rx fd\n");
goto out_sockmap;
}
......@@ -647,6 +668,18 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
bpf_map_msg = bpf_object__find_map_by_name(obj, "sock_map_msg");
if (IS_ERR(bpf_map_msg)) {
printf("Failed to load map msg from msg_verdict prog\n");
goto out_sockmap;
}
map_fd_msg = bpf_map__fd(bpf_map_msg);
if (map_fd_msg < 0) {
printf("Failed to get map msg fd\n");
goto out_sockmap;
}
bpf_map_break = bpf_object__find_map_by_name(obj, "sock_map_break");
if (IS_ERR(bpf_map_break)) {
printf("Failed to load map tx from verdict prog\n");
......@@ -680,6 +713,12 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
err = bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
if (err) {
printf("Failed msg verdict bpf prog attach\n");
goto out_sockmap;
}
err = bpf_prog_attach(verdict_prog, map_fd_rx,
__MAX_BPF_ATTACH_TYPE, 0);
if (!err) {
......@@ -719,6 +758,14 @@ static void test_sockmap(int tasks, void *data)
}
}
/* Put sfd[2] (sending fd below) into msg map to test sendmsg bpf */
i = 0;
err = bpf_map_update_elem(map_fd_msg, &i, &sfd[2], BPF_ANY);
if (err) {
printf("Failed map_fd_msg update sockmap %i\n", err);
goto out_sockmap;
}
/* Test map send/recv */
for (i = 0; i < 2; i++) {
buf[0] = i;
......
......@@ -1596,6 +1596,60 @@ static struct bpf_test tests[] = {
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SK_SKB,
},
{
"direct packet read for SK_MSG",
.insns = {
BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
offsetof(struct sk_msg_md, data)),
BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
offsetof(struct sk_msg_md, data_end)),
BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SK_MSG,
},
{
"direct packet write for SK_MSG",
.insns = {
BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
offsetof(struct sk_msg_md, data)),
BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
offsetof(struct sk_msg_md, data_end)),
BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
BPF_STX_MEM(BPF_B, BPF_REG_2, BPF_REG_2, 0),
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SK_MSG,
},
{
"overlapping checks for direct packet access SK_MSG",
.insns = {
BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
offsetof(struct sk_msg_md, data)),
BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
offsetof(struct sk_msg_md, data_end)),
BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 4),
BPF_MOV64_REG(BPF_REG_1, BPF_REG_2),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 6),
BPF_JMP_REG(BPF_JGT, BPF_REG_1, BPF_REG_3, 1),
BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_2, 6),
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SK_MSG,
},
{
"check skb->mark is not writeable by sockets",
.insns = {
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment