Commits · 28f7936cdf7d445570b214123f45866d3f6aa836 · nexedi / linux

09 Dec, 2014 34 commits

Alexander Duyck authored Dec 03, 2014

Replace the standard layout for padding an ethernet frame with the
eth_skb_pad call.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

28f7936c

emulex: Use skb_put_padto instead of skb_padto() and skb->len assignment · 74b6939d
Alexander Duyck authored Dec 03, 2014
```
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
```
74b6939d

ethernet/intel: Use eth_skb_pad and skb_put_padto helpers · a94d9e22

Alexander Duyck authored Dec 03, 2014

Update the Intel Ethernet drivers to use eth_skb_pad() and skb_put_padto
instead of doing their own implementations of the function.

Also this cleans up two other spots where skb_pad was called but the length
and tail pointers were being manipulated directly instead of just having
the padding length added via __skb_put.

Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a94d9e22

net: Add functions for handling padding frame and adding to length · 9c0c1124

Alexander Duyck authored Dec 03, 2014

This patch adds two new helper functions skb_put_padto and eth_skb_pad.
These functions deviate from the standard skb_pad or skb_padto in that they
will also update the length and tail pointers so that they reflect the
padding added to the frame.

The eth_skb_pad helper is meant to be used with Ethernet devices to update
either Rx or Tx frames so that they report the correct size. The
skb_put_padto helper is meant to be used primarily in the transmit path for
network devices that need frames to be padded up to some minimum size and
don't wish to simply update the length somewhere external to the frame.

The motivation behind this is that there are a number of implementations
throughout the network device drivers that are all doing the same thing,
but each a little bit differently and as a result several implementations
contain bugs such as updating the length without updating the tail offset
and other similar issues.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9c0c1124

Merge branch 'mlx5-next' · 177211b9

David S. Miller authored Dec 08, 2014

Eli Cohen says:

====================
mlx5 driver updates

The following series contains some fixes to mlx5 as well as update to the list
of supported devices.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

177211b9

mlx5: Fix error flow in add_keys · d14e7110

Eli Cohen authored Dec 02, 2014

If mlx5_core_create_mkey fails, decrease the pending counter to undo the
previous increment.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d14e7110

mlx5: Fix sparse warnings · 6a4f139a

Eli Cohen authored Dec 02, 2014

1. Add required __acquire/__release statements to balance spinlock usage.
2. Change the index parameter of begin_wqe() to be unsigned to match supplied
argument type.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6a4f139a

net/mlx5_core: Add more supported devices · 28c167fa

Eli Cohen authored Dec 02, 2014

Add ConnectX-4LX to the list of supported devices as well as their virtual
functions.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

28c167fa

net/mlx5_core: Clear outbox of dealloc uar · 6b60d5e2

Majd Dibbiny authored Dec 02, 2014

The outbox should be cleared before executing the command.
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6b60d5e2

net/mlx5_core: Print resource number on QP/SRQ async events · ab62924e

Eli Cohen authored Dec 02, 2014

Useful for debugging purposes.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ab62924e

net/mlx5_core: Remove unused dev cap enum fields · 0c7aac85

Eli Cohen authored Dec 02, 2014

These enumerations are not used so remove them.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0c7aac85

net/mlx5_core: Fix command queue size enforcement · 2d446d18

Eli Cohen authored Dec 02, 2014

Command queue descriptor page size is 4KB and not the page size used by the
kernel.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2d446d18

net/mlx5_core: Fix min vectors value in mlx5_enable_msix · 3a9e161a

Eli Cohen authored Dec 02, 2014

mlx5 requires at least one interrupt vector for completions so fix the minvec
argument to pci_enable_msix_range() accordingly.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3a9e161a

net/mlx5_core: Request the mlx5 IB module on driver load · f66f049f

Eli Cohen authored Dec 02, 2014

Call request module on mlx5_ib so it will be available for applications
requiring it, such as installers that require boot over IB.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f66f049f

Merge branch 'r8169-next' · 4945c1d6

David S. Miller authored Dec 08, 2014

Chunhao Lin says:

====================
r8169:change hardware setting

This patch series contains two hardware setting modification to prevent
hardware become abnormal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

4945c1d6

r8169:disable rtl8168ep cmac engine · 003609da

Chun-Hao Lin authored Dec 02, 2014

Cmac engine is the bridge between driver and dash firmware.
Other os may not disable cmac when leave. And r8169 did not allocate any
resources for cmac engine. Disable it to prevent abnormal system behavior.
Signed-off-by: Chunhao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

003609da

r8169:prevent enable hardware tx/rx too early · d6e57291

Chun-Hao Lin authored Dec 02, 2014

For RTL8168G/GU/H/EP and RTL8411B remove enable tx/rx from its own hw_start
function. This will prevent enable tx/rx before complete hardware tx/rx
setting.

Tx/Rx will be enabled in the end of function rtl_hw_start_8168.
Signed-off-by: Chunhao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d6e57291

Merge branch 'tipc-next' · 66813d4d

David S. Miller authored Dec 08, 2014

Ying Xue says:

====================
tipc: convert name table read-write lock to RCU

Now TIPC name table is statically allocated and is protected with a
Read-Write lock. To enhance the performance of TIPC name table lookup,
we are going to involve RCU lock to protect the name table. As a
consequence, it becomes lockless to concurrently look up name table on
read side. However, before the conversion can be successfully made,
the following two things must be first done:

- change allocation way of name table from static to dynamic
- fix several incorrect locking policy issues
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

66813d4d

tipc: convert name table read-write lock to RCU · 97ede29e

Ying Xue authored Dec 02, 2014

Convert tipc name table read-write lock to RCU. After this change,
a new spin lock is used to protect name table on write side while
RCU is applied on read side.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

97ede29e

tipc: remove unnecessary INIT_LIST_HEAD · 834caafa

Ying Xue authored Dec 02, 2014

When a list_head variable is seen as a new entry to be added to a
list head, it's unnecessary to be initialized with INIT_LIST_HEAD().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

834caafa

tipc: simplify relationship between name table lock and node lock · 5492390a

Ying Xue authored Dec 02, 2014

When tipc name sequence is published, name table lock is released
before name sequence buffer is delivered to remote nodes through its
underlying unicast links. However, when name sequence is withdrawn,
the name table lock is held until the transmission of the removal
message of name sequence is finished. During the process, node lock
is nested in name table lock. To prevent node lock from being nested
in name table lock, while withdrawing name, we should adopt the same
locking policy of publishing name sequence: name table lock should
be released before message is sent.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5492390a

tipc: any name table member must be protected under name table lock · 3493d25c

Ying Xue authored Dec 02, 2014

As tipc_nametbl_lock is used to protect name_table structure, the lock
must be held while all members of name_table structure are accessed.
However, the lock is not obtained while a member of name_table
structure - local_publ_count is read in tipc_nametbl_publish(), as
a consequence, an inconsistent value of local_publ_count might be got.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3493d25c

tipc: ensure all name sequences are properly protected with its lock · fb9962f3

Ying Xue authored Dec 02, 2014

TIPC internally created a name table which is used to store name
sequences. Now there is a read-write lock - tipc_nametbl_lock to
protect the table, and each name sequence saved in the table is
protected with its private lock. When a name sequence is inserted
or removed to or from the table, its members might need to change.
Therefore, in normal case, the two locks must be held while TIPC
operates the table. However, there are still several places where
we only hold tipc_nametbl_lock without proprerly obtaining name
sequence lock, which might cause the corruption of name sequence.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fb9962f3

tipc: ensure all name sequences are released when name table is stopped · 38622f41

Ying Xue authored Dec 02, 2014

As TIPC subscriber server is terminated before name table, no user
depends on subscription list of name sequence when name table is
stopped. Therefore, all name sequences stored in name table should
be released whatever their subscriptions lists are empty or not,
otherwise, memory leak might happen.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

38622f41

tipc: make name table allocated dynamically · 993bfe5d

Ying Xue authored Dec 02, 2014

Name table locking policy is going to be adjusted from read-write
lock protection to RCU lock protection in the future commits. But
its essential precondition is to convert the allocation way of name
table from static to dynamic mode.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

993bfe5d

tipc: remove size variable from publ_list struct · 1b61e70a

Ying Xue authored Dec 02, 2014

The size variable is introduced in publ_list struct to help us exactly
calculate SKB buffer sizes needed by publications when all publications
in name table are delivered in bulk in named_distribute(). But if
publication SKB buffer size is assumed to MTU, the size variable in
publ_list struct can be completely eliminated at the cost of wasting
a bit memory space for last SKB.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Tero Aho <tero.aho@coriant.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1b61e70a

udp: Neaten and reduce size of compute_score functions · 60c04aec

Joe Perches authored Dec 01, 2014

The compute_score functions are a bit difficult to read.

Neaten them a bit to reduce object sizes and make them a
bit more intelligible.

Return early to avoid indentation and avoid unnecessary
initializations.

(allyesconfig, but w/ -O2 and no profiling)

$ size net/ipv[46]/udp.o.*
   text    data     bss     dec     hex filename
  28680    1184      25   29889    74c1 net/ipv4/udp.o.new
  28756    1184      25   29965    750d net/ipv4/udp.o.old
  17600    1010       2   18612    48b4 net/ipv6/udp.o.new
  17632    1010       2   18644    48d4 net/ipv6/udp.o.old
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

60c04aec

net: bcmgenet: enable driver to work without a device tree · b0ba512e

Petri Gynther authored Dec 01, 2014

Modify bcmgenet driver so that it can be used on Broadcom 7xxx
MIPS-based STB platforms without a device tree.
Signed-off-by: Petri Gynther <pgynther@google.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b0ba512e

hyperv: Add support for vNIC hot removal · c3582a2c

Haiyang Zhang authored Dec 01, 2014

This patch adds proper handling of the vNIC hot removal event, which includes
a rescind-channel-offer message from the host side that triggers vNIC close and
removal. In this case, the notices to the host during close and removal is not
necessary because the channel is rescinded. This patch blocks these unnecessary
messages, and lets vNIC removal process complete normally.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c3582a2c

test: bpf: expand DIV_KX to DIV_MOD_KX · 6867b17b

Denis Kirjanov authored Dec 01, 2014

Expand DIV_KX to use BPF_MOD operation in the
DIV_KX bpf 'classic' test.

CC: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

6867b17b

Merge branch 'tstamp-next' · aae68bc6

David S. Miller authored Dec 08, 2014

Willem de Bruijn says:

====================
timestamping updates

The main goal for this patchset is to allow correlating timestamps
with the egress interface. Also introduce a warning, as discussed
previously, and update the tests to verify the new feature.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

aae68bc6

net-timestamp: expand documentation and test · cbd3aad5

Willem de Bruijn authored Nov 30, 2014

Documentation:
  expand explanation of timestamp counter

Test:
  new: flag -I requests and prints PKTINFO
  new: flag -x prints payload (possibly truncated)
  fix: remove pretty print that breaks common flag '-l 1'
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cbd3aad5

net-timestamp: allow reading recv cmsg on errqueue with origin tstamp · 829ae9d6

Willem de Bruijn authored Nov 30, 2014

Allow reading of timestamps and cmsg at the same time on all relevant
socket families. One use is to correlate timestamps with egress
device, by asking for cmsg IP_PKTINFO.

on AF_INET sockets, call the relevant function (ip_cmsg_recv). To
avoid changing legacy expectations, only do so if the caller sets a
new timestamping flag SOF_TIMESTAMPING_OPT_CMSG.

on AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already
returned for all origins. only change is to set ifindex, which is
not initialized for all error origins.

In both cases, only generate the pktinfo message if an ifindex is
known. This is not the case for ACK timestamps.

The difference between the protocol families is probably a historical
accident as a result of the different conditions for generating cmsg
in the relevant ip(v6)_recv_error function:

ipv4:        if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
ipv6:        if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {

At one time, this was the same test bar for the ICMP/ICMP6
distinction. This is no longer true.
Signed-off-by: Willem de Bruijn <willemb@google.com>

----

Changes
  v1 -> v2
    large rewrite
    - integrate with existing pktinfo cmsg generation code
    - on ipv4: only send with new flag, to maintain legacy behavior
    - on ipv6: send at most a single pktinfo cmsg
    - on ipv6: initialize fields if not yet initialized

The recv cmsg interfaces are also relevant to the discussion of
whether looping packet headers is problematic. For v6, cmsgs that
identify many headers are already returned. This patch expands
that to v4. If it sounds reasonable, I will follow with patches

1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY
   (http://patchwork.ozlabs.org/patch/366967/)
2. sysctl to conditionally drop all timestamps that have payload or
   cmsg from users without CAP_NET_RAW.
Signed-off-by: David S. Miller <davem@davemloft.net>

829ae9d6

ipv4: warn once on passing AF_INET6 socket to ip_recv_error · 7ce875e5

Willem de Bruijn authored Nov 30, 2014

One line change, in response to catching an occurrence of this bug.
See also fix f4713a3d ("net-timestamp: make tcp_recvmsg call ...")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7ce875e5

06 Dec, 2014 6 commits

Merge branch 'ebpf-next' · 8d0c4697

David S. Miller authored Dec 05, 2014

Alexei Starovoitov says:

====================
allow eBPF programs to be attached to sockets

V1->V2:

fixed comments in sample code to state clearly that packet data is accessed
with LD_ABS instructions and not internal skb fields.
Also replaced constants in:
BPF_LD_ABS(BPF_B, 14 + 9 /* R0 = ip->proto */),
with:
BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */),

V1 cover:

Introduce BPF_PROG_TYPE_SOCKET_FILTER type of eBPF programs that can be
attached to sockets with setsockopt().
Allow such programs to access maps via lookup/update/delete helpers.

This feature was previewed by bpf manpage in commit b4fc1a46("Merge branch 'bpf-next'")
Now it can actually run.

1st patch adds LD_ABS/LD_IND instruction verification and
2nd patch adds new setsockopt() flag.
Patches 3-6 are examples in assembler and in C.

Though native eBPF programs are way more powerful than classic filters
(attachable through similar setsockopt() call), they don't have skb field
accessors yet. Like skb->pkt_type, skb->dev->ifindex are not accessible.
There are sevaral ways to achieve that. That will be in the next set of patches.
So in this set native eBPF programs can only read data from packet and
access maps.

The most powerful example is sockex2_kern.c from patch 6 where ~200 lines of C
are compiled into ~300 of eBPF instructions.
It shows how quite complex packet parsing can be done.

LLVM used to build examples is at https://github.com/iovisor/llvm
which is fork of llvm trunk that I'm cleaning up for upstreaming.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8d0c4697

samples: bpf: large eBPF program in C · fbe33108

Alexei Starovoitov authored Dec 01, 2014

sockex2_kern.c is purposefully large eBPF program in C.
llvm compiles ~200 lines of C code into ~300 eBPF instructions.

It's similar to __skb_flow_dissect() to demonstrate that complex packet parsing
can be done by eBPF.
Then it uses (struct flow_keys)->dst IP address (or hash of ipv6 dst) to keep
stats of number of packets per IP.
User space loads eBPF program, attaches it to loopback interface and prints
dest_ip->#packets stats every second.

Usage:
$sudo samples/bpf/sockex2
ip 127.0.0.1 count 19
ip 127.0.0.1 count 178115
ip 127.0.0.1 count 369437
ip 127.0.0.1 count 559841
ip 127.0.0.1 count 750539
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fbe33108

samples: bpf: trivial eBPF program in C · a8085782

Alexei Starovoitov authored Dec 01, 2014

this example does the same task as previous socket example
in assembler, but this one does it in C.

eBPF program in kernel does:
    /* assume that packet is IPv4, load one byte of IP->proto */
    int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
    long *value;

    value = bpf_map_lookup_elem(&my_map, &index);
    if (value)
        __sync_fetch_and_add(value, 1);

Corresponding user space reads map[tcp], map[udp], map[icmp]
and prints protocol stats every second
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a8085782

samples: bpf: elf_bpf file loader · 249b812d

Alexei Starovoitov authored Dec 01, 2014

simple .o parser and loader using BPF syscall.
.o is a standard ELF generated by LLVM backend

It parses elf file compiled by llvm .c->.o
- parses 'maps' section and creates maps via BPF syscall
- parses 'license' section and passes it to syscall
- parses elf relocations for BPF maps and adjusts BPF_LD_IMM64 insns
  by storing map_fd into insn->imm and marking such insns as BPF_PSEUDO_MAP_FD
- loads eBPF programs via BPF syscall

One ELF file can contain multiple BPF programs.

int load_bpf_file(char *path);
populates prog_fd[] and map_fd[] with FDs received from bpf syscall

bpf_helpers.h - helper functions available to eBPF programs written in C
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

249b812d

samples: bpf: example of stateful socket filtering · 03f4723e

Alexei Starovoitov authored Dec 01, 2014

this socket filter example does:
- creates arraymap in kernel with key 4 bytes and value 8 bytes

- loads eBPF program which assumes that packet is IPv4 and loads one byte of
  IP->proto from the packet and uses it as a key in a map

  r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)];
  *(u32*)(fp - 4) = r0;
  value = bpf_map_lookup_elem(map_fd, fp - 4);
  if (value)
       (*(u64*)value) += 1;

- attaches this program to raw socket

- every second user space reads map[IPPROTO_TCP], map[IPPROTO_UDP], map[IPPROTO_ICMP]
  to see how many packets of given protocol were seen on loopback interface

Usage:
$sudo samples/bpf/sock_example
TCP 0 UDP 0 ICMP 0 packets
TCP 187600 UDP 0 ICMP 4 packets
TCP 376504 UDP 0 ICMP 8 packets
TCP 563116 UDP 0 ICMP 12 packets
TCP 753144 UDP 0 ICMP 16 packets
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

03f4723e

net: sock: allow eBPF programs to be attached to sockets · 89aa0758

Alexei Starovoitov authored Dec 01, 2014

introduce new setsockopt() command:

setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))

where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...)
and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER

setsockopt() calls bpf_prog_get() which increments refcnt of the program,
so it doesn't get unloaded while socket is using the program.

The same eBPF program can be attached to multiple sockets.

User task exit automatically closes socket which calls sk_filter_uncharge()
which decrements refcnt of eBPF program
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

89aa0758