Commit 07f81727 authored by David S. Miller's avatar David S. Miller

Merge branch 'net-ReST-part-two'

Mauro Carvalho Chehab says:

====================
net: manually convert files to ReST format - part 2

That's the second part of my work to convert the networking
text files into ReST. it is based on today's linux-next (next-20200430).

The full series (including those ones) are at:

	https://git.linuxtv.org/mchehab/experimental.git/log/?h=net-docs

I should be sending the remaining patches (another /38 series)
after getting those merged at -next.

The documents, converted to HTML via the building system are at:

	https://www.infradead.org/~mchehab/kernel_docs/networking/
====================
Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
parents 9f049606 4ac0b122
......@@ -638,7 +638,7 @@
See Documentation/admin-guide/serial-console.rst for more
information. See
Documentation/networking/netconsole.txt for an
Documentation/networking/netconsole.rst for an
alternative.
uart[8250],io,<addr>[,options]
......
......@@ -54,7 +54,7 @@ You will need to create a new device to use ``/dev/console``. The official
``/dev/console`` is now character device 5,1.
(You can also use a network device as a console. See
``Documentation/networking/netconsole.txt`` for information on that.)
``Documentation/networking/netconsole.rst`` for information on that.)
Here's an example that will use ``/dev/ttyS1`` (COM2) as the console.
Replace the sample values as needed.
......
......@@ -70,7 +70,7 @@ list of volume location server IP addresses::
The first module is the AF_RXRPC network protocol driver. This provides the
RxRPC remote operation protocol and may also be accessed from userspace. See:
Documentation/networking/rxrpc.txt
Documentation/networking/rxrpc.rst
The second module is the kerberos RxRPC security driver, and the third module
is the actual filesystem driver for the AFS filesystem.
......
......@@ -1639,7 +1639,7 @@ can safely be sent over either interface. Such configurations may be achieved
using the traffic control utilities inherent in linux.
By default the bonding driver is multiqueue aware and 16 queues are created
when the driver initializes (see Documentation/networking/multiqueue.txt
when the driver initializes (see Documentation/networking/multiqueue.rst
for details). If more or less queues are desired the module parameter
tx_queues can be used to change this value. There is no sysfs parameter
available as the allocation is done at module init time.
......
......@@ -1058,7 +1058,7 @@ drivers you mainly have to deal with:
- TX: Put the CAN frame from the socket buffer to the CAN controller.
- RX: Put the CAN frame from the CAN controller to the socket buffer.
See e.g. at Documentation/networking/netdevices.txt . The differences
See e.g. at Documentation/networking/netdevices.rst . The differences
for writing CAN network device driver are described below:
......
......@@ -59,7 +59,7 @@ recomputed for each resulting segment. See the skbuff.h comment (section 'E')
for more details.
A driver declares its offload capabilities in netdev->hw_features; see
Documentation/networking/netdev-features.txt for more. Note that a device
Documentation/networking/netdev-features.rst for more. Note that a device
which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start and
csum_offset given in the SKB; if it tries to deduce these itself in hardware
(as some NICs do) the driver should check that the values in the SKB match
......
......@@ -74,6 +74,43 @@ Contents:
ipvlan
ipvs-sysctl
kcm
l2tp
lapb-module
ltpc
mac80211-injection
mpls-sysctl
multiqueue
netconsole
netdev-features
netdevices
netfilter-sysctl
netif-msg
nf_conntrack-sysctl
nf_flowtable
openvswitch
operstates
packet_mmap
phonet
pktgen
plip
ppp_generic
proc_net_tcp
radiotap-headers
ray_cs
rds
regulatory
rxrpc
sctp
secid
seg6-sysctl
skfp
strparser
switchdev
tc-actions-env-rules
tcp-thin
team
timestamping
tproxy
.. only:: subproject and html
......
......@@ -886,7 +886,7 @@ tcp_thin_linear_timeouts - BOOLEAN
initiated. This improves retransmission latency for
non-aggressive thin streams, often found to be time-dependent.
For more information on thin streams, see
Documentation/networking/tcp-thin.txt
Documentation/networking/tcp-thin.rst
Default: 0
......
.. SPDX-License-Identifier: GPL-2.0
====
L2TP
====
This document describes how to use the kernel's L2TP drivers to
provide L2TP functionality. L2TP is a protocol that tunnels one or
more sessions over an IP tunnel. It is commonly used for VPNs
......@@ -121,14 +127,16 @@ Userspace may control behavior of the tunnel or session using
setsockopt and ioctl on the PPPoX socket. The following socket
options are supported:-
DEBUG - bitmask of debug message categories. See below.
SENDSEQ - 0 => don't send packets with sequence numbers
1 => send packets with sequence numbers
RECVSEQ - 0 => receive packet sequence numbers are optional
1 => drop receive packets without sequence numbers
LNSMODE - 0 => act as LAC.
1 => act as LNS.
REORDERTO - reorder timeout (in millisecs). If 0, don't try to reorder.
========= ===========================================================
DEBUG bitmask of debug message categories. See below.
SENDSEQ - 0 => don't send packets with sequence numbers
- 1 => send packets with sequence numbers
RECVSEQ - 0 => receive packet sequence numbers are optional
- 1 => drop receive packets without sequence numbers
LNSMODE - 0 => act as LAC.
- 1 => act as LNS.
REORDERTO reorder timeout (in millisecs). If 0, don't try to reorder.
========= ===========================================================
Only the DEBUG option is supported by the special tunnel management
PPPoX socket.
......@@ -177,20 +185,22 @@ setsockopt on the PPPoX socket to set a debug mask.
The following debug mask bits are available:
================ ==============================
L2TP_MSG_DEBUG verbose debug (if compiled in)
L2TP_MSG_CONTROL userspace - kernel interface
L2TP_MSG_SEQ sequence numbers handling
L2TP_MSG_DATA data packets
================ ==============================
If enabled, files under a l2tp debugfs directory can be used to dump
kernel state about L2TP tunnels and sessions. To access it, the
debugfs filesystem must first be mounted.
debugfs filesystem must first be mounted::
# mount -t debugfs debugfs /debug
# mount -t debugfs debugfs /debug
Files under the l2tp directory can then be accessed.
Files under the l2tp directory can then be accessed::
# cat /debug/l2tp/tunnels
# cat /debug/l2tp/tunnels
The debugfs files should not be used by applications to obtain L2TP
state information because the file format is subject to change. It is
......@@ -211,14 +221,14 @@ iproute2's ip utility to support this.
To create an L2TPv3 ethernet pseudowire between local host 192.168.1.1
and peer 192.168.1.2, using IP addresses 10.5.1.1 and 10.5.1.2 for the
tunnel endpoints:-
tunnel endpoints::
# ip l2tp add tunnel tunnel_id 1 peer_tunnel_id 1 udp_sport 5000 \
udp_dport 5000 encap udp local 192.168.1.1 remote 192.168.1.2
# ip l2tp add session tunnel_id 1 session_id 1 peer_session_id 1
# ip -s -d show dev l2tpeth0
# ip addr add 10.5.1.2/32 peer 10.5.1.1/32 dev l2tpeth0
# ip li set dev l2tpeth0 up
# ip l2tp add tunnel tunnel_id 1 peer_tunnel_id 1 udp_sport 5000 \
udp_dport 5000 encap udp local 192.168.1.1 remote 192.168.1.2
# ip l2tp add session tunnel_id 1 session_id 1 peer_session_id 1
# ip -s -d show dev l2tpeth0
# ip addr add 10.5.1.2/32 peer 10.5.1.1/32 dev l2tpeth0
# ip li set dev l2tpeth0 up
Choose IP addresses to be the address of a local IP interface and that
of the remote system. The IP addresses of the l2tpeth0 interface can be
......@@ -228,75 +238,78 @@ Repeat the above at the peer, with ports, tunnel/session ids and IP
addresses reversed. The tunnel and session IDs can be any non-zero
32-bit number, but the values must be reversed at the peer.
======================== ===================
Host 1 Host2
======================== ===================
udp_sport=5000 udp_sport=5001
udp_dport=5001 udp_dport=5000
tunnel_id=42 tunnel_id=45
peer_tunnel_id=45 peer_tunnel_id=42
session_id=128 session_id=5196755
peer_session_id=5196755 peer_session_id=128
======================== ===================
When done at both ends of the tunnel, it should be possible to send
data over the network. e.g.
data over the network. e.g.::
# ping 10.5.1.1
# ping 10.5.1.1
Sample Userspace Code
=====================
1. Create tunnel management PPPoX socket
kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
if (kernel_fd >= 0) {
struct sockaddr_pppol2tp sax;
struct sockaddr_in const *peer_addr;
peer_addr = l2tp_tunnel_get_peer_addr(tunnel);
memset(&sax, 0, sizeof(sax));
sax.sa_family = AF_PPPOX;
sax.sa_protocol = PX_PROTO_OL2TP;
sax.pppol2tp.fd = udp_fd; /* fd of tunnel UDP socket */
sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr;
sax.pppol2tp.addr.sin_port = peer_addr->sin_port;
sax.pppol2tp.addr.sin_family = AF_INET;
sax.pppol2tp.s_tunnel = tunnel_id;
sax.pppol2tp.s_session = 0; /* special case: mgmt socket */
sax.pppol2tp.d_tunnel = 0;
sax.pppol2tp.d_session = 0; /* special case: mgmt socket */
if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) {
perror("connect failed");
result = -errno;
goto err;
}
}
2. Create session PPPoX data socket
struct sockaddr_pppol2tp sax;
int fd;
/* Note, the target socket must be bound already, else it will not be ready */
sax.sa_family = AF_PPPOX;
sax.sa_protocol = PX_PROTO_OL2TP;
sax.pppol2tp.fd = tunnel_fd;
sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr;
sax.pppol2tp.addr.sin_port = addr->sin_port;
sax.pppol2tp.addr.sin_family = AF_INET;
sax.pppol2tp.s_tunnel = tunnel_id;
sax.pppol2tp.s_session = session_id;
sax.pppol2tp.d_tunnel = peer_tunnel_id;
sax.pppol2tp.d_session = peer_session_id;
/* session_fd is the fd of the session's PPPoL2TP socket.
* tunnel_fd is the fd of the tunnel UDP socket.
*/
fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
if (fd < 0 ) {
return -errno;
}
return 0;
1. Create tunnel management PPPoX socket::
kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
if (kernel_fd >= 0) {
struct sockaddr_pppol2tp sax;
struct sockaddr_in const *peer_addr;
peer_addr = l2tp_tunnel_get_peer_addr(tunnel);
memset(&sax, 0, sizeof(sax));
sax.sa_family = AF_PPPOX;
sax.sa_protocol = PX_PROTO_OL2TP;
sax.pppol2tp.fd = udp_fd; /* fd of tunnel UDP socket */
sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr;
sax.pppol2tp.addr.sin_port = peer_addr->sin_port;
sax.pppol2tp.addr.sin_family = AF_INET;
sax.pppol2tp.s_tunnel = tunnel_id;
sax.pppol2tp.s_session = 0; /* special case: mgmt socket */
sax.pppol2tp.d_tunnel = 0;
sax.pppol2tp.d_session = 0; /* special case: mgmt socket */
if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) {
perror("connect failed");
result = -errno;
goto err;
}
}
2. Create session PPPoX data socket::
struct sockaddr_pppol2tp sax;
int fd;
/* Note, the target socket must be bound already, else it will not be ready */
sax.sa_family = AF_PPPOX;
sax.sa_protocol = PX_PROTO_OL2TP;
sax.pppol2tp.fd = tunnel_fd;
sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr;
sax.pppol2tp.addr.sin_port = addr->sin_port;
sax.pppol2tp.addr.sin_family = AF_INET;
sax.pppol2tp.s_tunnel = tunnel_id;
sax.pppol2tp.s_session = session_id;
sax.pppol2tp.d_tunnel = peer_tunnel_id;
sax.pppol2tp.d_session = peer_session_id;
/* session_fd is the fd of the session's PPPoL2TP socket.
* tunnel_fd is the fd of the tunnel UDP socket.
*/
fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
if (fd < 0 ) {
return -errno;
}
return 0;
Internal Implementation
=======================
......
The Linux LAPB Module Interface 1.3
.. SPDX-License-Identifier: GPL-2.0
Jonathan Naylor 29.12.96
===============================
The Linux LAPB Module Interface
===============================
Changed (Henner Eisen, 2000-10-29): int return value for data_indication()
Version 1.3
Jonathan Naylor 29.12.96
Changed (Henner Eisen, 2000-10-29): int return value for data_indication()
The LAPB module will be a separately compiled module for use by any parts of
the Linux operating system that require a LAPB service. This document
......@@ -32,16 +38,16 @@ LAPB Initialisation Structure
This structure is used only once, in the call to lapb_register (see below).
It contains information about the device driver that requires the services
of the LAPB module.
of the LAPB module::
struct lapb_register_struct {
void (*connect_confirmation)(int token, int reason);
void (*connect_indication)(int token, int reason);
void (*disconnect_confirmation)(int token, int reason);
void (*disconnect_indication)(int token, int reason);
int (*data_indication)(int token, struct sk_buff *skb);
void (*data_transmit)(int token, struct sk_buff *skb);
};
struct lapb_register_struct {
void (*connect_confirmation)(int token, int reason);
void (*connect_indication)(int token, int reason);
void (*disconnect_confirmation)(int token, int reason);
void (*disconnect_indication)(int token, int reason);
int (*data_indication)(int token, struct sk_buff *skb);
void (*data_transmit)(int token, struct sk_buff *skb);
};
Each member of this structure corresponds to a function in the device driver
that is called when a particular event in the LAPB module occurs. These will
......@@ -54,19 +60,19 @@ LAPB Parameter Structure
This structure is used with the lapb_getparms and lapb_setparms functions
(see below). They are used to allow the device driver to get and set the
operational parameters of the LAPB implementation for a given connection.
struct lapb_parms_struct {
unsigned int t1;
unsigned int t1timer;
unsigned int t2;
unsigned int t2timer;
unsigned int n2;
unsigned int n2count;
unsigned int window;
unsigned int state;
unsigned int mode;
};
operational parameters of the LAPB implementation for a given connection::
struct lapb_parms_struct {
unsigned int t1;
unsigned int t1timer;
unsigned int t2;
unsigned int t2timer;
unsigned int n2;
unsigned int n2count;
unsigned int window;
unsigned int state;
unsigned int mode;
};
T1 and T2 are protocol timing parameters and are given in units of 100ms. N2
is the maximum number of tries on the link before it is declared a failure.
......@@ -78,11 +84,14 @@ link.
The mode variable is a bit field used for setting (at present) three values.
The bit fields have the following meanings:
====== =================================================
Bit Meaning
====== =================================================
0 LAPB operation (0=LAPB_STANDARD 1=LAPB_EXTENDED).
1 [SM]LP operation (0=LAPB_SLP 1=LAPB=MLP).
2 DTE/DCE operation (0=LAPB_DTE 1=LAPB_DCE)
3-31 Reserved, must be 0.
====== =================================================
Extended LAPB operation indicates the use of extended sequence numbers and
consequently larger window sizes, the default is standard LAPB operation.
......@@ -99,8 +108,9 @@ Functions
The LAPB module provides a number of function entry points.
::
int lapb_register(void *token, struct lapb_register_struct);
int lapb_register(void *token, struct lapb_register_struct);
This must be called before the LAPB module may be used. If the call is
successful then LAPB_OK is returned. The token must be a unique identifier
......@@ -111,33 +121,42 @@ For multiple LAPB links in a single device driver, multiple calls to
lapb_register must be made. The format of the lapb_register_struct is given
above. The return values are:
============= =============================
LAPB_OK LAPB registered successfully.
LAPB_BADTOKEN Token is already registered.
LAPB_NOMEM Out of memory
============= =============================
::
int lapb_unregister(void *token);
int lapb_unregister(void *token);
This releases all the resources associated with a LAPB link. Any current
LAPB link will be abandoned without further messages being passed. After
this call, the value of token is no longer valid for any calls to the LAPB
function. The valid return values are:
============= ===============================
LAPB_OK LAPB unregistered successfully.
LAPB_BADTOKEN Invalid/unknown LAPB token.
============= ===============================
::
int lapb_getparms(void *token, struct lapb_parms_struct *parms);
int lapb_getparms(void *token, struct lapb_parms_struct *parms);
This allows the device driver to get the values of the current LAPB
variables, the lapb_parms_struct is described above. The valid return values
are:
============= =============================
LAPB_OK LAPB getparms was successful.
LAPB_BADTOKEN Invalid/unknown LAPB token.
============= =============================
::
int lapb_setparms(void *token, struct lapb_parms_struct *parms);
int lapb_setparms(void *token, struct lapb_parms_struct *parms);
This allows the device driver to set the values of the current LAPB
variables, the lapb_parms_struct is described above. The values of t1timer,
......@@ -145,42 +164,54 @@ t2timer and n2count are ignored, likewise changing the mode bits when
connected will be ignored. An error implies that none of the values have
been changed. The valid return values are:
============= =================================================
LAPB_OK LAPB getparms was successful.
LAPB_BADTOKEN Invalid/unknown LAPB token.
LAPB_INVALUE One of the values was out of its allowable range.
============= =================================================
::
int lapb_connect_request(void *token);
int lapb_connect_request(void *token);
Initiate a connect using the current parameter settings. The valid return
values are:
============== =================================
LAPB_OK LAPB is starting to connect.
LAPB_BADTOKEN Invalid/unknown LAPB token.
LAPB_CONNECTED LAPB module is already connected.
============== =================================
::
int lapb_disconnect_request(void *token);
int lapb_disconnect_request(void *token);
Initiate a disconnect. The valid return values are:
================= ===============================
LAPB_OK LAPB is starting to disconnect.
LAPB_BADTOKEN Invalid/unknown LAPB token.
LAPB_NOTCONNECTED LAPB module is not connected.
================= ===============================
::
int lapb_data_request(void *token, struct sk_buff *skb);
int lapb_data_request(void *token, struct sk_buff *skb);
Queue data with the LAPB module for transmitting over the link. If the call
is successful then the skbuff is owned by the LAPB module and may not be
used by the device driver again. The valid return values are:
================= =============================
LAPB_OK LAPB has accepted the data.
LAPB_BADTOKEN Invalid/unknown LAPB token.
LAPB_NOTCONNECTED LAPB module is not connected.
================= =============================
::
int lapb_data_received(void *token, struct sk_buff *skb);
int lapb_data_received(void *token, struct sk_buff *skb);
Queue data with the LAPB module which has been received from the device. It
is expected that the data passed to the LAPB module has skb->data pointing
......@@ -188,9 +219,10 @@ to the beginning of the LAPB data. If the call is successful then the skbuff
is owned by the LAPB module and may not be used by the device driver again.
The valid return values are:
============= ===========================
LAPB_OK LAPB has accepted the data.
LAPB_BADTOKEN Invalid/unknown LAPB token.
============= ===========================
Callbacks
---------
......@@ -200,49 +232,58 @@ module to call when an event occurs. They are registered with the LAPB
module with lapb_register (see above) in the structure lapb_register_struct
(see above).
::
void (*connect_confirmation)(void *token, int reason);
void (*connect_confirmation)(void *token, int reason);
This is called by the LAPB module when a connection is established after
being requested by a call to lapb_connect_request (see above). The reason is
always LAPB_OK.
::
void (*connect_indication)(void *token, int reason);
void (*connect_indication)(void *token, int reason);
This is called by the LAPB module when the link is established by the remote
system. The value of reason is always LAPB_OK.
::
void (*disconnect_confirmation)(void *token, int reason);
void (*disconnect_confirmation)(void *token, int reason);
This is called by the LAPB module when an event occurs after the device
driver has called lapb_disconnect_request (see above). The reason indicates
what has happened. In all cases the LAPB link can be regarded as being
terminated. The values for reason are:
================= ====================================================
LAPB_OK The LAPB link was terminated normally.
LAPB_NOTCONNECTED The remote system was not connected.
LAPB_TIMEDOUT No response was received in N2 tries from the remote
system.
================= ====================================================
::
void (*disconnect_indication)(void *token, int reason);
void (*disconnect_indication)(void *token, int reason);
This is called by the LAPB module when the link is terminated by the remote
system or another event has occurred to terminate the link. This may be
returned in response to a lapb_connect_request (see above) if the remote
system refused the request. The values for reason are:
================= ====================================================
LAPB_OK The LAPB link was terminated normally by the remote
system.
LAPB_REFUSED The remote system refused the connect request.
LAPB_NOTCONNECTED The remote system was not connected.
LAPB_TIMEDOUT No response was received in N2 tries from the remote
system.
================= ====================================================
::
int (*data_indication)(void *token, struct sk_buff *skb);
int (*data_indication)(void *token, struct sk_buff *skb);
This is called by the LAPB module when data has been received from the
remote system that should be passed onto the next layer in the protocol
......@@ -254,8 +295,9 @@ This method should return NET_RX_DROP (as defined in the header
file include/linux/netdevice.h) if and only if the frame was dropped
before it could be delivered to the upper layer.
::
void (*data_transmit)(void *token, struct sk_buff *skb);
void (*data_transmit)(void *token, struct sk_buff *skb);
This is called by the LAPB module when data is to be transmitted to the
remote system by the device driver. The skbuff becomes the property of the
......
.. SPDX-License-Identifier: GPL-2.0
===========
LTPC Driver
===========
This is the ALPHA version of the ltpc driver.
In order to use it, you will need at least version 1.3.3 of the
......@@ -15,7 +21,7 @@ yourself. (see "Card Configuration" below for how to determine or
change the settings on your card)
When the driver is compiled into the kernel, you can add a line such
as the following to your /etc/lilo.conf:
as the following to your /etc/lilo.conf::
append="ltpc=0x240,9,1"
......@@ -25,13 +31,13 @@ the driver will try to determine them itself.
If you load the driver as a module, you can pass the parameters "io=",
"irq=", and "dma=" on the command line with insmod or modprobe, or add
them as options in a configuration file in /etc/modprobe.d/ directory:
them as options in a configuration file in /etc/modprobe.d/ directory::
alias lt0 ltpc # autoload the module when the interface is configured
options ltpc io=0x240 irq=9 dma=1
Before starting up the netatalk demons (perhaps in rc.local), you
need to add a line such as:
need to add a line such as::
/sbin/ifconfig lt0 127.0.0.42
......@@ -42,7 +48,7 @@ The appropriate netatalk configuration depends on whether you are
attached to a network that includes AppleTalk routers or not. If,
like me, you are simply connecting to your home Macintoshes and
printers, you need to set up netatalk to "seed". The way I do this
is to have the lines
is to have the lines::
dummy -seed -phase 2 -net 2000 -addr 2000.26 -zone "1033"
lt0 -seed -phase 1 -net 1033 -addr 1033.27 -zone "1033"
......@@ -57,13 +63,13 @@ such.
If you are attached to an extended AppleTalk network, with routers on
it, then you don't need to fool around with this -- the appropriate
line in atalkd.conf is
line in atalkd.conf is::
lt0 -phase 1
--------------------------------------
Card Configuration:
Card Configuration
==================
The interrupts and so forth are configured via the dipswitch on the
board. Set the switches so as not to conflict with other hardware.
......@@ -73,26 +79,32 @@ board. Set the switches so as not to conflict with other hardware.
original documentation refers to IRQ2. Since you'll be running
this on an AT (or later) class machine, that really means IRQ9.
=== ===========================================================
SW1 IRQ 4
SW2 IRQ 3
SW3 IRQ 9 (2 in original card documentation only applies to XT)
=== ===========================================================
DMA -- choose DMA 1 or 3, and set both corresponding switches.
=== =====
SW4 DMA 3
SW5 DMA 1
SW6 DMA 3
SW7 DMA 1
=== =====
I/O address -- choose one.
=== =========
SW8 220 / 240
=== =========
--------------------------------------
IP:
IP
==
Yes, it is possible to do IP over LocalTalk. However, you can't just
treat the LocalTalk device like an ordinary Ethernet device, even if
......@@ -102,9 +114,9 @@ Instead, you follow the same procedure as for doing IP in EtherTalk.
See Documentation/networking/ipddp.rst for more information about the
kernel driver and userspace tools needed.
--------------------------------------
BUGS:
Bugs
====
IRQ autoprobing often doesn't work on a cold boot. To get around
this, either compile the driver as a module, or pass the parameters
......@@ -120,12 +132,13 @@ It may theoretically be possible to use two LTPC cards in the same
machine, but this is unsupported, so if you really want to do this,
you'll probably have to hack the initialization code a bit.
______________________________________
THANKS:
Thanks to Alan Cox for helpful discussions early on in this
Thanks
======
Thanks to Alan Cox for helpful discussions early on in this
work, and to Denis Hainsworth for doing the bleeding-edge testing.
-- Bradford Johnson <bradford@math.umn.edu>
Bradford Johnson <bradford@math.umn.edu>
-- Updated 11/09/1998 by David Huggins-Daines <dhd@debian.org>
Updated 11/09/1998 by David Huggins-Daines <dhd@debian.org>
.. SPDX-License-Identifier: GPL-2.0
=========================================
How to use packet injection with mac80211
=========================================
mac80211 now allows arbitrary packets to be injected down any Monitor Mode
interface from userland. The packet you inject needs to be composed in the
following format:
following format::
[ radiotap header ]
[ ieee80211 header ]
[ payload ]
The radiotap format is discussed in
./Documentation/networking/radiotap-headers.txt.
./Documentation/networking/radiotap-headers.rst.
Despite many radiotap parameters being currently defined, most only make sense
to appear on received packets. The following information is parsed from the
......@@ -18,15 +21,19 @@ radiotap headers and used to control injection:
* IEEE80211_RADIOTAP_FLAGS
IEEE80211_RADIOTAP_F_FCS: FCS will be removed and recalculated
IEEE80211_RADIOTAP_F_WEP: frame will be encrypted if key available
IEEE80211_RADIOTAP_F_FRAG: frame will be fragmented if longer than the
========================= ===========================================
IEEE80211_RADIOTAP_F_FCS FCS will be removed and recalculated
IEEE80211_RADIOTAP_F_WEP frame will be encrypted if key available
IEEE80211_RADIOTAP_F_FRAG frame will be fragmented if longer than the
current fragmentation threshold.
========================= ===========================================
* IEEE80211_RADIOTAP_TX_FLAGS
IEEE80211_RADIOTAP_F_TX_NOACK: frame should be sent without waiting for
============================= ========================================
IEEE80211_RADIOTAP_F_TX_NOACK frame should be sent without waiting for
an ACK even if it is a unicast frame
============================= ========================================
* IEEE80211_RADIOTAP_RATE
......@@ -37,8 +44,10 @@ radiotap headers and used to control injection:
HT rate for the transmission (only for devices without own rate control).
Also some flags are parsed
IEEE80211_RADIOTAP_MCS_SGI: use short guard interval
IEEE80211_RADIOTAP_MCS_BW_40: send in HT40 mode
============================ ========================
IEEE80211_RADIOTAP_MCS_SGI use short guard interval
IEEE80211_RADIOTAP_MCS_BW_40 send in HT40 mode
============================ ========================
* IEEE80211_RADIOTAP_DATA_RETRIES
......@@ -51,17 +60,17 @@ radiotap headers and used to control injection:
without own rate control). Also other fields are parsed
flags field
IEEE80211_RADIOTAP_VHT_FLAG_SGI: use short guard interval
IEEE80211_RADIOTAP_VHT_FLAG_SGI: use short guard interval
bandwidth field
1: send using 40MHz channel width
4: send using 80MHz channel width
11: send using 160MHz channel width
* 1: send using 40MHz channel width
* 4: send using 80MHz channel width
* 11: send using 160MHz channel width
The injection code can also skip all other currently defined radiotap fields
facilitating replay of captured radiotap headers directly.
Here is an example valid radiotap header defining some parameters
Here is an example valid radiotap header defining some parameters::
0x00, 0x00, // <-- radiotap version
0x0b, 0x00, // <- radiotap header length
......@@ -71,7 +80,7 @@ Here is an example valid radiotap header defining some parameters
0x01 //<-- antenna
The ieee80211 header follows immediately afterwards, looking for example like
this:
this::
0x08, 0x01, 0x00, 0x00,
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
......@@ -84,10 +93,10 @@ Then lastly there is the payload.
After composing the packet contents, it is sent by send()-ing it to a logical
mac80211 interface that is in Monitor mode. Libpcap can also be used,
(which is easier than doing the work to bind the socket to the right
interface), along the following lines:
interface), along the following lines:::
ppcap = pcap_open_live(szInterfaceName, 800, 1, 20, szErrbuf);
...
...
r = pcap_inject(ppcap, u8aSendBuffer, nLength);
You can also find a link to a complete inject application here:
......
.. SPDX-License-Identifier: GPL-2.0
====================
MPLS Sysfs variables
====================
/proc/sys/net/mpls/* Variables:
===============================
platform_labels - INTEGER
Number of entries in the platform label table. It is not
......@@ -17,6 +24,7 @@ platform_labels - INTEGER
no longer fit in the table.
Possible values: 0 - 1048575
Default: 0
ip_ttl_propagate - BOOL
......@@ -27,8 +35,8 @@ ip_ttl_propagate - BOOL
If disabled, the MPLS transport network will appear as a
single hop to transit traffic.
0 - disabled / RFC 3443 [Short] Pipe Model
1 - enabled / RFC 3443 Uniform Model (default)
* 0 - disabled / RFC 3443 [Short] Pipe Model
* 1 - enabled / RFC 3443 Uniform Model (default)
default_ttl - INTEGER
Default TTL value to use for MPLS packets where it cannot be
......@@ -36,6 +44,7 @@ default_ttl - INTEGER
or ip_ttl_propagate has been disabled.
Possible values: 1 - 255
Default: 255
conf/<interface>/input - BOOL
......@@ -44,5 +53,5 @@ conf/<interface>/input - BOOL
If disabled, packets will be discarded without further
processing.
0 - disabled (default)
not 0 - enabled
* 0 - disabled (default)
* not 0 - enabled
.. SPDX-License-Identifier: GPL-2.0
HOWTO for multiqueue network device support
===========================================
===========================================
HOWTO for multiqueue network device support
===========================================
Section 1: Base driver requirements for implementing multiqueue support
=======================================================================
Intro: Kernel support for multiqueue devices
---------------------------------------------------------
Kernel support for multiqueue devices is always present.
Section 1: Base driver requirements for implementing multiqueue support
-----------------------------------------------------------------------
Base drivers are required to use the new alloc_etherdev_mq() or
alloc_netdev_mq() functions to allocate the subqueues for the device. The
underlying kernel API will take care of the allocation and deallocation of
......@@ -26,8 +26,7 @@ comes online or when it's completely shut down (unregister_netdev(), etc.).
Section 2: Qdisc support for multiqueue devices
-----------------------------------------------
===============================================
Currently two qdiscs are optimized for multiqueue devices. The first is the
default pfifo_fast qdisc. This qdisc supports one qdisc per hardware queue.
......@@ -46,22 +45,22 @@ will be queued to the band associated with the hardware queue.
Section 3: Brief howto using MULTIQ for multiqueue devices
---------------------------------------------------------------
==========================================================
The userspace command 'tc,' part of the iproute2 package, is used to configure
qdiscs. To add the MULTIQ qdisc to your network device, assuming the device
is called eth0, run the following command:
is called eth0, run the following command::
# tc qdisc add dev eth0 root handle 1: multiq
# tc qdisc add dev eth0 root handle 1: multiq
The qdisc will allocate the number of bands to equal the number of queues that
the device reports, and bring the qdisc online. Assuming eth0 has 4 Tx
queues, the band mapping would look like:
queues, the band mapping would look like::
band 0 => queue 0
band 1 => queue 1
band 2 => queue 2
band 3 => queue 3
band 0 => queue 0
band 1 => queue 1
band 2 => queue 2
band 3 => queue 3
Traffic will begin flowing through each queue based on either the simple_tx_hash
function or based on netdev->select_queue() if you have it defined.
......@@ -69,11 +68,11 @@ function or based on netdev->select_queue() if you have it defined.
The behavior of tc filters remains the same. However a new tc action,
skbedit, has been added. Assuming you wanted to route all traffic to a
specific host, for example 192.168.0.3, through a specific queue you could use
this action and establish a filter such as:
this action and establish a filter such as::
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
match ip dst 192.168.0.3 \
action skbedit queue_mapping 3
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
match ip dst 192.168.0.3 \
action skbedit queue_mapping 3
Author: Alexander Duyck <alexander.h.duyck@intel.com>
Original Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
:Author: Alexander Duyck <alexander.h.duyck@intel.com>
:Original Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
.. SPDX-License-Identifier: GPL-2.0
==========
Netconsole
==========
started by Ingo Molnar <mingo@redhat.com>, 2001.09.17
2.6 port and netpoll api by Matt Mackall <mpm@selenic.com>, Sep 9 2003
IPv6 support by Cong Wang <xiyou.wangcong@gmail.com>, Jan 1 2013
Extended console support by Tejun Heo <tj@kernel.org>, May 1 2015
Please send bug reports to Matt Mackall <mpm@selenic.com>
......@@ -23,34 +32,34 @@ Sender and receiver configuration:
==================================
It takes a string configuration parameter "netconsole" in the
following format:
following format::
netconsole=[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
where
+ if present, enable extended console support
src-port source for UDP packets (defaults to 6665)
src-ip source IP to use (interface address)
dev network interface (eth0)
tgt-port port for logging agent (6666)
tgt-ip IP address for logging agent
tgt-macaddr ethernet MAC address for logging agent (broadcast)
+ if present, enable extended console support
src-port source for UDP packets (defaults to 6665)
src-ip source IP to use (interface address)
dev network interface (eth0)
tgt-port port for logging agent (6666)
tgt-ip IP address for logging agent
tgt-macaddr ethernet MAC address for logging agent (broadcast)
Examples:
Examples::
linux netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc
or
or::
insmod netconsole netconsole=@/,@10.0.0.2/
or using IPv6
or using IPv6::
insmod netconsole netconsole=@/,@fd00:1:2:3::1/
It also supports logging to multiple remote agents by specifying
parameters for the multiple agents separated by semicolons and the
complete string enclosed in "quotes", thusly:
complete string enclosed in "quotes", thusly::
modprobe netconsole netconsole="@/,@10.0.0.2/;@/eth1,6892@10.0.0.3/"
......@@ -67,14 +76,19 @@ for example:
On distributions using a BSD-based netcat version (e.g. Fedora,
openSUSE and Ubuntu) the listening port must be specified without
the -p switch:
the -p switch::
nc -u -l -p <port>' / 'nc -u -l <port>
or::
'nc -u -l -p <port>' / 'nc -u -l <port>' or
'netcat -u -l -p <port>' / 'netcat -u -l <port>'
netcat -u -l -p <port>' / 'netcat -u -l <port>
3) socat
'socat udp-recv:<port> -'
::
socat udp-recv:<port> -
Dynamic reconfiguration:
========================
......@@ -92,7 +106,7 @@ netconsole module (or kernel, if netconsole is built-in).
Some examples follow (where configfs is mounted at the /sys/kernel/config
mountpoint).
To add a remote logging target (target names can be arbitrary):
To add a remote logging target (target names can be arbitrary)::
cd /sys/kernel/config/netconsole/
mkdir target1
......@@ -102,12 +116,13 @@ above) and are disabled by default -- they must first be enabled by writing
"1" to the "enabled" attribute (usually after setting parameters accordingly)
as described below.
To remove a target:
To remove a target::
rmdir /sys/kernel/config/netconsole/othertarget/
The interface exposes these parameters of a netconsole target to userspace:
============== ================================= ============
enabled Is this target currently enabled? (read-write)
extended Extended mode enabled (read-write)
dev_name Local network interface name (read-write)
......@@ -117,12 +132,13 @@ The interface exposes these parameters of a netconsole target to userspace:
remote_ip Remote agent's IP address (read-write)
local_mac Local interface's MAC address (read-only)
remote_mac Remote agent's MAC address (read-write)
============== ================================= ============
The "enabled" attribute is also used to control whether the parameters of
a target can be updated or not -- you can modify the parameters of only
disabled targets (i.e. if "enabled" is 0).
To update a target's parameters:
To update a target's parameters::
cat enabled # check if enabled is 1
echo 0 > enabled # disable the target (if required)
......@@ -140,12 +156,12 @@ Extended console:
If '+' is prefixed to the configuration line or "extended" config file
is set to 1, extended console support is enabled. An example boot
param follows.
param follows::
linux netconsole=+4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc
Log messages are transmitted with extended metadata header in the
following format which is the same as /dev/kmsg.
following format which is the same as /dev/kmsg::
<level>,<sequnum>,<timestamp>,<contflag>;<message text>
......@@ -155,12 +171,12 @@ newline is used as the delimeter.
If a message doesn't fit in certain number of bytes (currently 1000),
the message is split into multiple fragments by netconsole. These
fragments are transmitted with "ncfrag" header field added.
fragments are transmitted with "ncfrag" header field added::
ncfrag=<byte-offset>/<total-bytes>
For example, assuming a lot smaller chunk size, a message "the first
chunk, the 2nd chunk." may be split as follows.
chunk, the 2nd chunk." may be split as follows::
6,416,1758426,-,ncfrag=0/31;the first chunk,
6,416,1758426,-,ncfrag=16/31; the 2nd chunk.
......@@ -168,39 +184,52 @@ chunk, the 2nd chunk." may be split as follows.
Miscellaneous notes:
====================
WARNING: the default target ethernet setting uses the broadcast
ethernet address to send packets, which can cause increased load on
other systems on the same ethernet segment.
.. Warning::
the default target ethernet setting uses the broadcast
ethernet address to send packets, which can cause increased load on
other systems on the same ethernet segment.
.. Tip::
some LAN switches may be configured to suppress ethernet broadcasts
so it is advised to explicitly specify the remote agents' MAC addresses
from the config parameters passed to netconsole.
.. Tip::
to find out the MAC address of, say, 10.0.0.2, you may try using::
ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2
TIP: some LAN switches may be configured to suppress ethernet broadcasts
so it is advised to explicitly specify the remote agents' MAC addresses
from the config parameters passed to netconsole.
.. Tip::
TIP: to find out the MAC address of, say, 10.0.0.2, you may try using:
in case the remote logging agent is on a separate LAN subnet than
the sender, it is suggested to try specifying the MAC address of the
default gateway (you may use /sbin/route -n to find it out) as the
remote MAC address instead.
ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2
.. note::
TIP: in case the remote logging agent is on a separate LAN subnet than
the sender, it is suggested to try specifying the MAC address of the
default gateway (you may use /sbin/route -n to find it out) as the
remote MAC address instead.
the network device (eth1 in the above case) can run any kind
of other network traffic, netconsole is not intrusive. Netconsole
might cause slight delays in other traffic if the volume of kernel
messages is high, but should have no other impact.
NOTE: the network device (eth1 in the above case) can run any kind
of other network traffic, netconsole is not intrusive. Netconsole
might cause slight delays in other traffic if the volume of kernel
messages is high, but should have no other impact.
.. note::
NOTE: if you find that the remote logging agent is not receiving or
printing all messages from the sender, it is likely that you have set
the "console_loglevel" parameter (on the sender) to only send high
priority messages to the console. You can change this at runtime using:
if you find that the remote logging agent is not receiving or
printing all messages from the sender, it is likely that you have set
the "console_loglevel" parameter (on the sender) to only send high
priority messages to the console. You can change this at runtime using::
dmesg -n 8
dmesg -n 8
or by specifying "debug" on the kernel command line at boot, to send
all kernel messages to the console. A specific value for this parameter
can also be set using the "loglevel" kernel boot option. See the
dmesg(8) man page and Documentation/admin-guide/kernel-parameters.rst for details.
or by specifying "debug" on the kernel command line at boot, to send
all kernel messages to the console. A specific value for this parameter
can also be set using the "loglevel" kernel boot option. See the
dmesg(8) man page and Documentation/admin-guide/kernel-parameters.rst
for details.
Netconsole was designed to be as instantaneous as possible, to
enable the logging of even the most critical kernel bugs. It works
......
.. SPDX-License-Identifier: GPL-2.0
=====================================================
Netdev features mess and how to get out from it alive
=====================================================
......@@ -6,8 +9,8 @@ Author:
Part I: Feature sets
======================
Part I: Feature sets
====================
Long gone are the days when a network card would just take and give packets
verbatim. Today's devices add multiple features and bugs (read: offloads)
......@@ -39,8 +42,8 @@ one used internally by network core:
Part II: Controlling enabled features
=======================================
Part II: Controlling enabled features
=====================================
When current feature set (netdev->features) is to be changed, new set
is calculated and filtered by calling ndo_fix_features callback
......@@ -65,8 +68,8 @@ driver except by means of ndo_fix_features callback.
Part III: Implementation hints
================================
Part III: Implementation hints
==============================
* ndo_fix_features:
......@@ -94,8 +97,8 @@ Errors returned are not (and cannot be) propagated anywhere except dmesg.
Part IV: Features
===================
Part IV: Features
=================
For current list of features, see include/linux/netdev_features.h.
This section describes semantics of some of them.
......
.. SPDX-License-Identifier: GPL-2.0
=====================================
Network Devices, the Kernel, and You!
=====================================
Introduction
......@@ -75,11 +78,12 @@ ndo_start_xmit:
Don't use it for new drivers.
Context: Process with BHs disabled or BH (timer),
will be called with interrupts disabled by netconsole.
will be called with interrupts disabled by netconsole.
Return codes:
o NETDEV_TX_OK everything ok.
o NETDEV_TX_BUSY Cannot transmit packet, try later
Return codes:
* NETDEV_TX_OK everything ok.
* NETDEV_TX_BUSY Cannot transmit packet, try later
Usually a bug, means queue start/stop flow control is broken in
the driver. Note: the driver must NOT put the skb in its DMA ring.
......@@ -95,10 +99,13 @@ ndo_set_rx_mode:
struct napi_struct synchronization rules
========================================
napi->poll:
Synchronization: NAPI_STATE_SCHED bit in napi->state. Device
Synchronization:
NAPI_STATE_SCHED bit in napi->state. Device
driver's ndo_stop method will invoke napi_disable() on
all NAPI instances which will do a sleeping poll on the
NAPI_STATE_SCHED napi->state bit, waiting for all pending
NAPI activity to cease.
Context: softirq
will be called with interrupts disabled by netconsole.
Context:
softirq
will be called with interrupts disabled by netconsole.
.. SPDX-License-Identifier: GPL-2.0
=========================
Netfilter Sysfs variables
=========================
/proc/sys/net/netfilter/* Variables:
====================================
nf_log_all_netns - BOOLEAN
0 - disabled (default)
not 0 - enabled
- 0 - disabled (default)
- not 0 - enabled
By default, only init_net namespace can log packets into kernel log
with LOG target; this aims to prevent containers from flooding host
......
.. SPDX-License-Identifier: GPL-2.0
________________
===============
NETIF Msg Level
===============
The design of the network interface message level setting.
History
-------
The design of the debugging message interface was guided and
constrained by backwards compatibility previous practice. It is useful
......@@ -18,14 +21,15 @@ History
The message level was not precisely defined past level 3, but were
always implemented within +-1 of the specified level. Drivers tended
to shed the more verbose level messages as they matured.
0 Minimal messages, only essential information on fatal errors.
1 Standard messages, initialization status. No run-time messages
2 Special media selection messages, generally timer-driver.
3 Interface starts and stops, including normal status messages
4 Tx and Rx frame error messages, and abnormal driver operation
5 Tx packet queue information, interrupt events.
6 Status on each completed Tx packet and received Rx packets
7 Initial contents of Tx and Rx packets
- 0 Minimal messages, only essential information on fatal errors.
- 1 Standard messages, initialization status. No run-time messages
- 2 Special media selection messages, generally timer-driver.
- 3 Interface starts and stops, including normal status messages
- 4 Tx and Rx frame error messages, and abnormal driver operation
- 5 Tx packet queue information, interrupt events.
- 6 Status on each completed Tx packet and received Rx packets
- 7 Initial contents of Tx and Rx packets
Initially this message level variable was uniquely named in each driver
e.g. "lance_debug", so that a kernel symbolic debugger could locate and
......@@ -36,44 +40,56 @@ History
This approach worked well. However there is always a demand for
additional features. Over the years the following emerged as
reasonable and easily implemented enhancements
Using an ioctl() call to modify the level.
Per-interface rather than per-driver message level setting.
More selective control over the type of messages emitted.
- Using an ioctl() call to modify the level.
- Per-interface rather than per-driver message level setting.
- More selective control over the type of messages emitted.
The netif_msg recommendation adds these features with only a minor
complexity and code size increase.
The recommendation is the following points
Retaining the per-driver integer variable "debug" as a module
- Retaining the per-driver integer variable "debug" as a module
parameter with a default level of '1'.
Adding a per-interface private variable named "msg_enable". The
variable is a bit map rather than a level, and is initialized as
- Adding a per-interface private variable named "msg_enable". The
variable is a bit map rather than a level, and is initialized as::
1 << debug
Or more precisely
debug < 0 ? 0 : 1 << min(sizeof(int)-1, debug)
Messages should changes from
Or more precisely::
debug < 0 ? 0 : 1 << min(sizeof(int)-1, debug)
Messages should changes from::
if (debug > 1)
printk(MSG_DEBUG "%s: ...
to
printk(MSG_DEBUG "%s: ...
to::
if (np->msg_enable & NETIF_MSG_LINK)
printk(MSG_DEBUG "%s: ...
printk(MSG_DEBUG "%s: ...
The set of message levels is named
Old level Name Bit position
0 NETIF_MSG_DRV 0x0001
1 NETIF_MSG_PROBE 0x0002
2 NETIF_MSG_LINK 0x0004
2 NETIF_MSG_TIMER 0x0004
3 NETIF_MSG_IFDOWN 0x0008
3 NETIF_MSG_IFUP 0x0008
4 NETIF_MSG_RX_ERR 0x0010
4 NETIF_MSG_TX_ERR 0x0010
5 NETIF_MSG_TX_QUEUED 0x0020
5 NETIF_MSG_INTR 0x0020
6 NETIF_MSG_TX_DONE 0x0040
6 NETIF_MSG_RX_STATUS 0x0040
7 NETIF_MSG_PKTDATA 0x0080
========= =================== ============
Old level Name Bit position
========= =================== ============
0 NETIF_MSG_DRV 0x0001
1 NETIF_MSG_PROBE 0x0002
2 NETIF_MSG_LINK 0x0004
2 NETIF_MSG_TIMER 0x0004
3 NETIF_MSG_IFDOWN 0x0008
3 NETIF_MSG_IFUP 0x0008
4 NETIF_MSG_RX_ERR 0x0010
4 NETIF_MSG_TX_ERR 0x0010
5 NETIF_MSG_TX_QUEUED 0x0020
5 NETIF_MSG_INTR 0x0020
6 NETIF_MSG_TX_DONE 0x0040
6 NETIF_MSG_RX_STATUS 0x0040
7 NETIF_MSG_PKTDATA 0x0080
========= =================== ============
.. SPDX-License-Identifier: GPL-2.0
===================================
Netfilter Conntrack Sysfs variables
===================================
/proc/sys/net/netfilter/nf_conntrack_* Variables:
=================================================
nf_conntrack_acct - BOOLEAN
0 - disabled (default)
not 0 - enabled
- 0 - disabled (default)
- not 0 - enabled
Enable connection tracking flow accounting. 64-bit byte and packet
counters per flow are added.
......@@ -16,8 +23,8 @@ nf_conntrack_buckets - INTEGER
This sysctl is only writeable in the initial net namespace.
nf_conntrack_checksum - BOOLEAN
0 - disabled
not 0 - enabled (default)
- 0 - disabled
- not 0 - enabled (default)
Verify checksum of incoming packets. Packets with bad checksums are
in INVALID state. If this is enabled, such packets will not be
......@@ -27,8 +34,8 @@ nf_conntrack_count - INTEGER (read-only)
Number of currently allocated flow entries.
nf_conntrack_events - BOOLEAN
0 - disabled
not 0 - enabled (default)
- 0 - disabled
- not 0 - enabled (default)
If this option is enabled, the connection tracking code will
provide userspace with connection tracking events via ctnetlink.
......@@ -62,8 +69,8 @@ nf_conntrack_generic_timeout - INTEGER (seconds)
protocols.
nf_conntrack_helper - BOOLEAN
0 - disabled (default)
not 0 - enabled
- 0 - disabled (default)
- not 0 - enabled
Enable automatic conntrack helper assignment.
If disabled it is required to set up iptables rules to assign
......@@ -81,14 +88,14 @@ nf_conntrack_icmpv6_timeout - INTEGER (seconds)
Default for ICMP6 timeout.
nf_conntrack_log_invalid - INTEGER
0 - disable (default)
1 - log ICMP packets
6 - log TCP packets
17 - log UDP packets
33 - log DCCP packets
41 - log ICMPv6 packets
136 - log UDPLITE packets
255 - log packets of any protocol
- 0 - disable (default)
- 1 - log ICMP packets
- 6 - log TCP packets
- 17 - log UDP packets
- 33 - log DCCP packets
- 41 - log ICMPv6 packets
- 136 - log UDPLITE packets
- 255 - log packets of any protocol
Log invalid packets of a type specified by value.
......@@ -97,15 +104,15 @@ nf_conntrack_max - INTEGER
nf_conntrack_buckets value * 4.
nf_conntrack_tcp_be_liberal - BOOLEAN
0 - disabled (default)
not 0 - enabled
- 0 - disabled (default)
- not 0 - enabled
Be conservative in what you do, be liberal in what you accept from others.
If it's non-zero, we mark only out of window RST segments as INVALID.
nf_conntrack_tcp_loose - BOOLEAN
0 - disabled
not 0 - enabled (default)
- 0 - disabled
- not 0 - enabled (default)
If it is set to zero, we disable picking up already established
connections.
......@@ -148,8 +155,8 @@ nf_conntrack_tcp_timeout_unacknowledged - INTEGER (seconds)
default 300
nf_conntrack_timestamp - BOOLEAN
0 - disabled (default)
not 0 - enabled
- 0 - disabled (default)
- not 0 - enabled
Enable connection tracking flow timestamping.
......
.. SPDX-License-Identifier: GPL-2.0
====================================
Netfilter's flowtable infrastructure
====================================
......@@ -31,15 +34,17 @@ to use this new alternative forwarding path via nftables policy.
This is represented in Fig.1, which describes the classic forwarding path
including the Netfilter hooks and the flowtable fastpath bypass.
userspace process
^ |
| |
_____|____ ____\/___
/ \ / \
| input | | output |
\__________/ \_________/
^ |
| |
::
userspace process
^ |
| |
_____|____ ____\/___
/ \ / \
| input | | output |
\__________/ \_________/
^ |
| |
_________ __________ --------- _____\/_____
/ \ / \ |Routing | / \
--> ingress ---> prerouting ---> |decision| | postrouting |--> neigh_xmit
......@@ -59,7 +64,7 @@ including the Netfilter hooks and the flowtable fastpath bypass.
\ / |
|__yes_________________fastpath bypass ____________________________|
Fig.1 Netfilter hooks and flowtable interactions
Fig.1 Netfilter hooks and flowtable interactions
The flowtable entry also stores the NAT configuration, so all packets are
mangled according to the NAT policy that matches the initial packets that went
......@@ -72,18 +77,18 @@ Example configuration
---------------------
Enabling the flowtable bypass is relatively easy, you only need to create a
flowtable and add one rule to your forward chain.
flowtable and add one rule to your forward chain::
table inet x {
table inet x {
flowtable f {
hook ingress priority 0; devices = { eth0, eth1 };
}
chain y {
type filter hook forward priority 0; policy accept;
ip protocol tcp flow offload @f
counter packets 0 bytes 0
}
}
chain y {
type filter hook forward priority 0; policy accept;
ip protocol tcp flow offload @f
counter packets 0 bytes 0
}
}
This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1
netdevices. You can create as many flowtables as you want in case you need to
......@@ -101,12 +106,12 @@ forwarding bypass.
More reading
------------
This documentation is based on the LWN.net articles [1][2]. Rafal Milecki also
made a very complete and comprehensive summary called "A state of network
This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki
also made a very complete and comprehensive summary called "A state of network
acceleration" that describes how things were before this infrastructure was
mailined [3] and it also makes a rough summary of this work [4].
mailined [3]_ and it also makes a rough summary of this work [4]_.
[1] https://lwn.net/Articles/738214/
[2] https://lwn.net/Articles/742164/
[3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
[4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html
.. [1] https://lwn.net/Articles/738214/
.. [2] https://lwn.net/Articles/742164/
.. [3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
.. [4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html
.. SPDX-License-Identifier: GPL-2.0
=============================================
Open vSwitch datapath developer documentation
=============================================
......@@ -80,13 +83,13 @@ The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes. For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting. For example, the following could represent a flow key
corresponding to a TCP packet that arrived on vport 1:
corresponding to a TCP packet that arrived on vport 1::
in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0,
frag=no), tcp(src=49163, dst=80)
Often we ellipsize arguments not important to the discussion, e.g.:
Often we ellipsize arguments not important to the discussion, e.g.::
in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
......@@ -151,20 +154,20 @@ Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.
The basic rule is obvious:
The basic rule is obvious::
------------------------------------------------------------------
==================================================================
New network protocol support must only supplement existing flow
key attributes. It must not change the meaning of already defined
flow key attributes.
------------------------------------------------------------------
==================================================================
This rule does have less-obvious consequences so it is worth working
through a few examples. Suppose, for example, that the kernel module
did not already implement VLAN parsing. Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
packet. The flow key for any packet with an 802.1Q header would look
essentially like this, ignoring metadata:
essentially like this, ignoring metadata::
eth(...), eth_type(0x8100)
......@@ -172,7 +175,7 @@ Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions. With this change, a TCP packet in VLAN 10 would have a
flow key much like this:
flow key much like this::
eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)
......@@ -187,7 +190,7 @@ across kernel versions even though it follows the compatibility rules.
The solution is to use a set of nested attributes. This is, for
example, why 802.1Q support uses nested attributes. A TCP packet in
VLAN 10 is actually expressed as:
VLAN 10 is actually expressed as::
eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
ip(proto=6, ...), tcp(...)))
......@@ -215,14 +218,14 @@ For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing. The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
this:
this::
eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)
As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type. The flow key for this packet would include
an all-zero-bits vlan and an empty encap attribute, like this:
an all-zero-bits vlan and an empty encap attribute, like this::
eth(...), eth_type(0x8100), vlan(0), encap()
......
.. SPDX-License-Identifier: GPL-2.0
==================
Operational States
==================
1. Introduction
===============
Linux distinguishes between administrative and operational state of an
interface. Administrative state is the result of "ip link set dev
......@@ -20,6 +27,7 @@ and changeable from userspace under certain rules.
2. Querying from userspace
==========================
Both admin and operational state can be queried via the netlink
operation RTM_GETLINK. It is also possible to subscribe to RTNLGRP_LINK
......@@ -30,16 +38,20 @@ These values contain interface state:
ifinfomsg::if_flags & IFF_UP:
Interface is admin up
ifinfomsg::if_flags & IFF_RUNNING:
Interface is in RFC2863 operational state UP or UNKNOWN. This is for
backward compatibility, routing daemons, dhcp clients can use this
flag to determine whether they should use the interface.
ifinfomsg::if_flags & IFF_LOWER_UP:
Driver has signaled netif_carrier_on()
ifinfomsg::if_flags & IFF_DORMANT:
Driver has signaled netif_dormant_on()
TLV IFLA_OPERSTATE
------------------
contains RFC2863 state of the interface in numeric representation:
......@@ -47,26 +59,33 @@ IF_OPER_UNKNOWN (0):
Interface is in unknown state, neither driver nor userspace has set
operational state. Interface must be considered for user data as
setting operational state has not been implemented in every driver.
IF_OPER_NOTPRESENT (1):
Unused in current kernel (notpresent interfaces normally disappear),
just a numerical placeholder.
IF_OPER_DOWN (2):
Interface is unable to transfer data on L1, f.e. ethernet is not
plugged or interface is ADMIN down.
IF_OPER_LOWERLAYERDOWN (3):
Interfaces stacked on an interface that is IF_OPER_DOWN show this
state (f.e. VLAN).
IF_OPER_TESTING (4):
Unused in current kernel.
IF_OPER_DORMANT (5):
Interface is L1 up, but waiting for an external event, f.e. for a
protocol to establish. (802.1X)
IF_OPER_UP (6):
Interface is operational up and can be used.
This TLV can also be queried via sysfs.
TLV IFLA_LINKMODE
-----------------
contains link policy. This is needed for userspace interaction
described below.
......@@ -75,6 +94,7 @@ This TLV can also be queried via sysfs.
3. Kernel driver API
====================
Kernel drivers have access to two flags that map to IFF_LOWER_UP and
IFF_DORMANT. These flags can be set from everywhere, even from
......@@ -126,6 +146,7 @@ netif_carrier_ok() && !netif_dormant():
4. Setting from userspace
=========================
Applications have to use the netlink interface to influence the
RFC2863 operational state of an interface. Setting IFLA_LINKMODE to 1
......@@ -139,18 +160,18 @@ are multicasted on the netlink group RTNLGRP_LINK.
So basically a 802.1X supplicant interacts with the kernel like this:
-subscribe to RTNLGRP_LINK
-set IFLA_LINKMODE to 1 via RTM_SETLINK
-query RTM_GETLINK once to get initial state
-if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until
netlink multicast signals this state
-do 802.1X, eventually abort if flags go down again
-send RTM_SETLINK to set operstate to IF_OPER_UP if authentication
succeeds, IF_OPER_DORMANT otherwise
-see how operstate and IFF_RUNNING is echoed via netlink multicast
-set interface back to IF_OPER_DORMANT if 802.1X reauthentication
fails
-restart if kernel changes IFF_LOWER_UP or IFF_DORMANT flag
- subscribe to RTNLGRP_LINK
- set IFLA_LINKMODE to 1 via RTM_SETLINK
- query RTM_GETLINK once to get initial state
- if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until
netlink multicast signals this state
- do 802.1X, eventually abort if flags go down again
- send RTM_SETLINK to set operstate to IF_OPER_UP if authentication
succeeds, IF_OPER_DORMANT otherwise
- see how operstate and IFF_RUNNING is echoed via netlink multicast
- set interface back to IF_OPER_DORMANT if 802.1X reauthentication
fails
- restart if kernel changes IFF_LOWER_UP or IFF_DORMANT flag
if supplicant goes down, bring back IFLA_LINKMODE to 0 and
IFLA_OPERSTATE to a sane value.
......
--------------------------------------------------------------------------------
+ ABSTRACT
--------------------------------------------------------------------------------
.. SPDX-License-Identifier: GPL-2.0
===========
Packet MMAP
===========
Abstract
========
This file documents the mmap() facility available with the PACKET
socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for
i) capture network traffic with utilities like tcpdump, ii) transmit network
traffic, or any other that needs raw access to network interface.
i) capture network traffic with utilities like tcpdump,
ii) transmit network traffic, or any other that needs raw
access to network interface.
Howto can be found at:
https://sites.google.com/site/packetmmap/
Please send your comments to
Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
Johann Baudy
- Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
- Johann Baudy
-------------------------------------------------------------------------------
+ Why use PACKET_MMAP
--------------------------------------------------------------------------------
Why use PACKET_MMAP
===================
In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very
inefficient. It uses very limited buffers and requires one system call to
capture each packet, it requires two if you want to get packet's timestamp
(like libpcap always does).
In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
configurable circular buffer mapped in user space that can be used to either
send or receive packets. This way reading packets just needs to wait for them,
most of the time there is no need to issue a single system call. Concerning
......@@ -40,47 +47,45 @@ enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
supported by devices of your network. CPU IRQ pinning of your network interface
card can also be an advantage.
--------------------------------------------------------------------------------
+ How to use mmap() to improve capture process
--------------------------------------------------------------------------------
How to use mmap() to improve capture process
============================================
From the user standpoint, you should use the higher level libpcap library, which
is a de facto standard, portable across nearly all operating systems
including Win32.
including Win32.
Packet MMAP support was integrated into libpcap around the time of version 1.3.0;
TPACKET_V3 support was added in version 1.5.0
--------------------------------------------------------------------------------
+ How to use mmap() directly to improve capture process
--------------------------------------------------------------------------------
How to use mmap() directly to improve capture process
=====================================================
From the system calls stand point, the use of PACKET_MMAP involves
the following process:
the following process::
[setup] socket() -------> creation of the capture socket
setsockopt() ---> allocation of the circular buffer (ring)
option: PACKET_RX_RING
mmap() ---------> mapping of the allocated buffer to the
user process
[setup] socket() -------> creation of the capture socket
setsockopt() ---> allocation of the circular buffer (ring)
option: PACKET_RX_RING
mmap() ---------> mapping of the allocated buffer to the
user process
[capture] poll() ---------> to wait for incoming packets
[capture] poll() ---------> to wait for incoming packets
[shutdown] close() --------> destruction of the capture socket and
deallocation of all associated
resources.
[shutdown] close() --------> destruction of the capture socket and
deallocation of all associated
resources.
socket creation and destruction is straight forward, and is done
the same way with or without PACKET_MMAP:
socket creation and destruction is straight forward, and is done
the same way with or without PACKET_MMAP::
int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
where mode is SOCK_RAW for the raw interface were link level
information can be captured or SOCK_DGRAM for the cooked
interface where link level information capture is not
supported and a link level pseudo-header is provided
interface where link level information capture is not
supported and a link level pseudo-header is provided
by the kernel.
The destruction of the socket and all associated resources
......@@ -92,32 +97,31 @@ allocated RX and TX buffer ring with a single mmap() call.
See "Mapping and use of the circular buffer (ring)".
Next I will describe PACKET_MMAP settings and its constraints,
also the mapping of the circular buffer in the user process and
also the mapping of the circular buffer in the user process and
the use of this buffer.
--------------------------------------------------------------------------------
+ How to use mmap() directly to improve transmission process
--------------------------------------------------------------------------------
Transmission process is similar to capture as shown below.
How to use mmap() directly to improve transmission process
==========================================================
Transmission process is similar to capture as shown below::
[setup] socket() -------> creation of the transmission socket
setsockopt() ---> allocation of the circular buffer (ring)
option: PACKET_TX_RING
bind() ---------> bind transmission socket with a network interface
mmap() ---------> mapping of the allocated buffer to the
user process
[setup] socket() -------> creation of the transmission socket
setsockopt() ---> allocation of the circular buffer (ring)
option: PACKET_TX_RING
bind() ---------> bind transmission socket with a network interface
mmap() ---------> mapping of the allocated buffer to the
user process
[transmission] poll() ---------> wait for free packets (optional)
send() ---------> send all packets that are set as ready in
the ring
The flag MSG_DONTWAIT can be used to return
before end of transfer.
[transmission] poll() ---------> wait for free packets (optional)
send() ---------> send all packets that are set as ready in
the ring
The flag MSG_DONTWAIT can be used to return
before end of transfer.
[shutdown] close() --------> destruction of the transmission socket and
deallocation of all associated resources.
[shutdown] close() --------> destruction of the transmission socket and
deallocation of all associated resources.
Socket creation and destruction is also straight forward, and is done
the same way as in capturing described in the previous paragraph:
the same way as in capturing described in the previous paragraph::
int fd = socket(PF_PACKET, mode, 0);
......@@ -129,46 +133,48 @@ set. Otherwise, htons(ETH_P_ALL) or any other protocol, for example.
Binding the socket to your network interface is mandatory (with zero copy) to
know the header size of frames used in the circular buffer.
As capture, each frame contains two parts:
As capture, each frame contains two parts::
--------------------
| struct tpacket_hdr | Header. It contains the status of
| | of this frame
|--------------------|
| data buffer |
. . Data that will be sent over the network interface.
. .
--------------------
--------------------
| struct tpacket_hdr | Header. It contains the status of
| | of this frame
|--------------------|
| data buffer |
. . Data that will be sent over the network interface.
. .
--------------------
bind() associates the socket to your network interface thanks to
sll_ifindex parameter of struct sockaddr_ll.
Initialization example:
Initialization example::
struct sockaddr_ll my_addr;
struct ifreq s_ifr;
...
struct sockaddr_ll my_addr;
struct ifreq s_ifr;
...
strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
/* get interface index of eth0 */
ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
/* get interface index of eth0 */
ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
/* fill sockaddr_ll struct to prepare binding */
my_addr.sll_family = AF_PACKET;
my_addr.sll_protocol = htons(ETH_P_ALL);
my_addr.sll_ifindex = s_ifr.ifr_ifindex;
/* fill sockaddr_ll struct to prepare binding */
my_addr.sll_family = AF_PACKET;
my_addr.sll_protocol = htons(ETH_P_ALL);
my_addr.sll_ifindex = s_ifr.ifr_ifindex;
/* bind socket to eth0 */
bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
/* bind socket to eth0 */
bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
A complete tutorial is available at: https://sites.google.com/site/packetmmap/
By default, the user should put data at :
By default, the user should put data at::
frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)
So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
the beginning of the user data will be at :
the beginning of the user data will be at::
frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
If you wish to put user data at a custom offset from the beginning of
......@@ -177,112 +183,113 @@ can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
to make this work it must be enabled previously with setsockopt()
and the PACKET_TX_HAS_OFF option.
--------------------------------------------------------------------------------
+ PACKET_MMAP settings
--------------------------------------------------------------------------------
PACKET_MMAP settings
====================
To setup PACKET_MMAP from user level code is done with a call like
- Capture process
- Capture process::
setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
- Transmission process
- Transmission process::
setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
The most significant argument in the previous call is the req parameter,
this parameter must to have the following structure:
The most significant argument in the previous call is the req parameter,
this parameter must to have the following structure::
struct tpacket_req
{
unsigned int tp_block_size; /* Minimal size of contiguous block */
unsigned int tp_block_nr; /* Number of blocks */
unsigned int tp_frame_size; /* Size of frame */
unsigned int tp_frame_nr; /* Total number of frames */
unsigned int tp_block_size; /* Minimal size of contiguous block */
unsigned int tp_block_nr; /* Number of blocks */
unsigned int tp_frame_size; /* Size of frame */
unsigned int tp_frame_nr; /* Total number of frames */
};
This structure is defined in /usr/include/linux/if_packet.h and establishes a
This structure is defined in /usr/include/linux/if_packet.h and establishes a
circular buffer (ring) of unswappable memory.
Being mapped in the capture process allows reading the captured frames and
Being mapped in the capture process allows reading the captured frames and
related meta-information like timestamps without requiring a system call.
Frames are grouped in blocks. Each block is a physically contiguous
region of memory and holds tp_block_size/tp_frame_size frames. The total number
of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because
region of memory and holds tp_block_size/tp_frame_size frames. The total number
of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because::
frames_per_block = tp_block_size/tp_frame_size
indeed, packet_set_ring checks that the following condition is true
indeed, packet_set_ring checks that the following condition is true::
frames_per_block * tp_block_nr == tp_frame_nr
Lets see an example, with the following values:
Lets see an example, with the following values::
tp_block_size= 4096
tp_frame_size= 2048
tp_block_nr = 4
tp_frame_nr = 8
we will get the following buffer structure:
we will get the following buffer structure::
block #1 block #2
+---------+---------+ +---------+---------+
| frame 1 | frame 2 | | frame 3 | frame 4 |
+---------+---------+ +---------+---------+
block #1 block #2
+---------+---------+ +---------+---------+
| frame 1 | frame 2 | | frame 3 | frame 4 |
+---------+---------+ +---------+---------+
block #3 block #4
+---------+---------+ +---------+---------+
| frame 5 | frame 6 | | frame 7 | frame 8 |
+---------+---------+ +---------+---------+
block #3 block #4
+---------+---------+ +---------+---------+
| frame 5 | frame 6 | | frame 7 | frame 8 |
+---------+---------+ +---------+---------+
A frame can be of any size with the only condition it can fit in a block. A block
can only hold an integer number of frames, or in other words, a frame cannot
be spawned across two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
can only hold an integer number of frames, or in other words, a frame cannot
be spawned across two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
buffer (ring)".
--------------------------------------------------------------------------------
+ PACKET_MMAP setting constraints
--------------------------------------------------------------------------------
PACKET_MMAP setting constraints
===============================
In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
16384 in a 64 bit architecture. For information on these kernel versions
see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt
Block size limit
------------------
Block size limit
----------------
As stated earlier, each block is a contiguous physical region of memory. These
memory regions are allocated with calls to the __get_free_pages() function. As
As stated earlier, each block is a contiguous physical region of memory. These
memory regions are allocated with calls to the __get_free_pages() function. As
the name indicates, this function allocates pages of memory, and the second
argument is "order" or a power of two number of pages, that is
(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
order=2 ==> 16384 bytes, etc. The maximum size of a
region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
precisely the limit can be calculated as:
argument is "order" or a power of two number of pages, that is
(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
order=2 ==> 16384 bytes, etc. The maximum size of a
region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
precisely the limit can be calculated as::
PAGE_SIZE << MAX_ORDER
In a i386 architecture PAGE_SIZE is 4096 bytes
In a i386 architecture PAGE_SIZE is 4096 bytes
In a 2.4/i386 kernel MAX_ORDER is 10
In a 2.6/i386 kernel MAX_ORDER is 11
So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
respectively, with an i386 architecture.
User space programs can include /usr/include/sys/user.h and
User space programs can include /usr/include/sys/user.h and
/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations.
The pagesize can also be determined dynamically with the getpagesize (2)
system call.
The pagesize can also be determined dynamically with the getpagesize (2)
system call.
Block number limit
--------------------
Block number limit
------------------
To understand the constraints of PACKET_MMAP, we have to see the structure
To understand the constraints of PACKET_MMAP, we have to see the structure
used to hold the pointers to each block.
Currently, this structure is a dynamically allocated vector with kmalloc
called pg_vec, its size limits the number of blocks that can be allocated.
Currently, this structure is a dynamically allocated vector with kmalloc
called pg_vec, its size limits the number of blocks that can be allocated::
+---+---+---+---+
| x | x | x | x |
......@@ -294,121 +301,123 @@ called pg_vec, its size limits the number of blocks that can be allocated.
v block #2
block #1
kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator which is at the end the responsible for doing the allocation and
hence which imposes the maximum memory that kmalloc can allocate.
kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator which is at the end the responsible for doing the allocation and
hence which imposes the maximum memory that kmalloc can allocate.
In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
entries of /proc/slabinfo
In a 32 bit architecture, pointers are 4 bytes long, so the total number of
pointers to blocks is
In a 32 bit architecture, pointers are 4 bytes long, so the total number of
pointers to blocks is::
131072/4 = 32768 blocks
PACKET_MMAP buffer size calculator
------------------------------------
PACKET_MMAP buffer size calculator
==================================
Definitions:
<size-max> : is the maximum size of allocable with kmalloc (see /proc/slabinfo)
<pointer size>: depends on the architecture -- sizeof(void *)
<page size> : depends on the architecture -- PAGE_SIZE or getpagesize (2)
<max-order> : is the value defined with MAX_ORDER
<frame size> : it's an upper bound of frame's capture size (more on this later)
============== ================================================================
<size-max> is the maximum size of allocable with kmalloc
(see /proc/slabinfo)
<pointer size> depends on the architecture -- ``sizeof(void *)``
<page size> depends on the architecture -- PAGE_SIZE or getpagesize (2)
<max-order> is the value defined with MAX_ORDER
<frame size> it's an upper bound of frame's capture size (more on this later)
============== ================================================================
from these definitions we will derive
from these definitions we will derive::
<block number> = <size-max>/<pointer size>
<block size> = <pagesize> << <max-order>
so, the max buffer size is
so, the max buffer size is::
<block number> * <block size>
and, the number of frames be
and, the number of frames be::
<block number> * <block size> / <frame size>
Suppose the following parameters, which apply for 2.6 kernel and an
i386 architecture:
i386 architecture::
<size-max> = 131072 bytes
<pointer size> = 4 bytes
<pagesize> = 4096 bytes
<max-order> = 11
and a value for <frame size> of 2048 bytes. These parameters will yield
and a value for <frame size> of 2048 bytes. These parameters will yield::
<block number> = 131072/4 = 32768 blocks
<block size> = 4096 << 11 = 8 MiB.
and hence the buffer will have a 262144 MiB size. So it can hold
and hence the buffer will have a 262144 MiB size. So it can hold
262144 MiB / 2048 bytes = 134217728 frames
Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space, in the case of
Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space, in the case of
an i386 kernel's memory size is limited to 1GiB.
All memory allocations are not freed until the socket is closed. The memory
allocations are done with GFP_KERNEL priority, this basically means that
the allocation can wait and swap other process' memory in order to allocate
All memory allocations are not freed until the socket is closed. The memory
allocations are done with GFP_KERNEL priority, this basically means that
the allocation can wait and swap other process' memory in order to allocate
the necessary memory, so normally limits can be reached.
Other constraints
-------------------
Other constraints
-----------------
If you check the source code you will see that what I draw here as a frame
is not only the link level frame. At the beginning of each frame there is a
is not only the link level frame. At the beginning of each frame there is a
header called struct tpacket_hdr used in PACKET_MMAP to hold link level's frame
meta information like timestamp. So what we draw here a frame it's really
the following (from include/linux/if_packet.h):
meta information like timestamp. So what we draw here a frame it's really
the following (from include/linux/if_packet.h)::
/*
/*
Frame structure:
- Start. Frame must be aligned to TPACKET_ALIGNMENT=16
- struct tpacket_hdr
- pad to TPACKET_ALIGNMENT=16
- struct sockaddr_ll
- Gap, chosen so that packet data (Start+tp_net) aligns to
- Gap, chosen so that packet data (Start+tp_net) aligns to
TPACKET_ALIGNMENT=16
- Start+tp_mac: [ Optional MAC header ]
- Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
- Pad to align to TPACKET_ALIGNMENT=16
*/
The following are conditions that are checked in packet_set_ring
tp_block_size must be a multiple of PAGE_SIZE (1)
tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
tp_frame_size must be a multiple of TPACKET_ALIGNMENT
tp_frame_nr must be exactly frames_per_block*tp_block_nr
The following are conditions that are checked in packet_set_ring
- tp_block_size must be a multiple of PAGE_SIZE (1)
- tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
- tp_frame_size must be a multiple of TPACKET_ALIGNMENT
- tp_frame_nr must be exactly frames_per_block*tp_block_nr
Note that tp_block_size should be chosen to be a power of two or there will
be a waste of memory.
--------------------------------------------------------------------------------
+ Mapping and use of the circular buffer (ring)
--------------------------------------------------------------------------------
Mapping and use of the circular buffer (ring)
---------------------------------------------
The mapping of the buffer in the user process is done with the conventional
The mapping of the buffer in the user process is done with the conventional
mmap function. Even the circular buffer is compound of several physically
discontiguous blocks of memory, they are contiguous to the user space, hence
just one call to mmap is needed:
just one call to mmap is needed::
mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
If tp_frame_size is a divisor of tp_block_size frames will be
If tp_frame_size is a divisor of tp_block_size frames will be
contiguously spaced by tp_frame_size bytes. If not, each
tp_block_size/tp_frame_size frames there will be a gap between
tp_block_size/tp_frame_size frames there will be a gap between
the frames. This is because a frame cannot be spawn across two
blocks.
blocks.
To use one socket for capture and transmission, the mapping of both the
RX and TX buffer ring has to be done with one call to mmap:
RX and TX buffer ring has to be done with one call to mmap::
...
setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
......@@ -420,12 +429,14 @@ RX and TX buffer ring has to be done with one call to mmap:
RX must be the first as the kernel maps the TX ring memory right
after the RX one.
At the beginning of each frame there is an status field (see
At the beginning of each frame there is an status field (see
struct tpacket_hdr). If this field is 0 means that the frame is ready
to be used for the kernel, If not, there is a frame the user can read
to be used for the kernel, If not, there is a frame the user can read
and the following flags apply:
+++ Capture process:
Capture process
^^^^^^^^^^^^^^^
from include/linux/if_packet.h
#define TP_STATUS_COPY (1 << 1)
......@@ -433,35 +444,37 @@ and the following flags apply:
#define TP_STATUS_CSUMNOTREADY (1 << 3)
#define TP_STATUS_CSUM_VALID (1 << 7)
TP_STATUS_COPY : This flag indicates that the frame (and associated
meta information) has been truncated because it's
larger than tp_frame_size. This packet can be
read entirely with recvfrom().
In order to make this work it must to be
enabled previously with setsockopt() and
the PACKET_COPY_THRESH option.
====================== =======================================================
TP_STATUS_COPY This flag indicates that the frame (and associated
meta information) has been truncated because it's
larger than tp_frame_size. This packet can be
read entirely with recvfrom().
The number of frames that can be buffered to
be read with recvfrom is limited like a normal socket.
See the SO_RCVBUF option in the socket (7) man page.
In order to make this work it must to be
enabled previously with setsockopt() and
the PACKET_COPY_THRESH option.
TP_STATUS_LOSING : indicates there were packet drops from last time
statistics where checked with getsockopt() and
the PACKET_STATISTICS option.
The number of frames that can be buffered to
be read with recvfrom is limited like a normal socket.
See the SO_RCVBUF option in the socket (7) man page.
TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets which
its checksum will be done in hardware. So while
reading the packet we should not try to check the
checksum.
TP_STATUS_LOSING indicates there were packet drops from last time
statistics where checked with getsockopt() and
the PACKET_STATISTICS option.
TP_STATUS_CSUM_VALID : This flag indicates that at least the transport
header checksum of the packet has been already
validated on the kernel side. If the flag is not set
then we are free to check the checksum by ourselves
provided that TP_STATUS_CSUMNOTREADY is also not set.
TP_STATUS_CSUMNOTREADY currently it's used for outgoing IP packets which
its checksum will be done in hardware. So while
reading the packet we should not try to check the
checksum.
for convenience there are also the following defines:
TP_STATUS_CSUM_VALID This flag indicates that at least the transport
header checksum of the packet has been already
validated on the kernel side. If the flag is not set
then we are free to check the checksum by ourselves
provided that TP_STATUS_CSUMNOTREADY is also not set.
====================== =======================================================
for convenience there are also the following defines::
#define TP_STATUS_KERNEL 0
#define TP_STATUS_USER 1
......@@ -469,11 +482,11 @@ for convenience there are also the following defines:
The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel
receives a packet it puts in the buffer and updates the status with
at least the TP_STATUS_USER flag. Then the user can read the packet,
once the packet is read the user must zero the status field, so the kernel
once the packet is read the user must zero the status field, so the kernel
can use again that frame buffer.
The user can use poll (any other variant should apply too) to check if new
packets are in the ring:
packets are in the ring::
struct pollfd pfd;
......@@ -482,13 +495,15 @@ packets are in the ring:
pfd.events = POLLIN|POLLRDNORM|POLLERR;
if (status == TP_STATUS_KERNEL)
retval = poll(&pfd, 1, timeout);
retval = poll(&pfd, 1, timeout);
It doesn't incur in a race condition to first check the status value and
It doesn't incur in a race condition to first check the status value and
then poll for frames.
++ Transmission process
Those defines are also used for transmission:
Transmission process
^^^^^^^^^^^^^^^^^^^^
Those defines are also used for transmission::
#define TP_STATUS_AVAILABLE 0 // Frame is available
#define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send()
......@@ -502,24 +517,31 @@ This can be done on multiple frames. Once the user is ready to transmit, it
calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
forwarded to the network device. The kernel updates each status of sent
frames with TP_STATUS_SENDING until the end of transfer.
At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.
::
header->tp_len = in_i_size;
header->tp_status = TP_STATUS_SEND_REQUEST;
retval = send(this->socket, NULL, 0, 0);
The user can also use poll() to check if a buffer is available:
(status == TP_STATUS_SENDING)
::
struct pollfd pfd;
pfd.fd = fd;
pfd.revents = 0;
pfd.events = POLLOUT;
retval = poll(&pfd, 1, timeout);
-------------------------------------------------------------------------------
+ What TPACKET versions are available and when to use them?
-------------------------------------------------------------------------------
What TPACKET versions are available and when to use them?
=========================================================
::
int val = tpacket_version;
setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
......@@ -540,17 +562,20 @@ TPACKET_V1 --> TPACKET_V2:
- VLAN metadata information available for packets
(TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
in the tpacket2_hdr structure:
- TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates
that the tp_vlan_tci field has valid VLAN TCI value
- TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field
indicates that the tp_vlan_tpid field has valid VLAN TPID value
- How to switch to TPACKET_V2:
1. Replace struct tpacket_hdr by struct tpacket2_hdr
2. Query header len and save
3. Set protocol version to 2, set up ring as usual
4. For getting the sockaddr_ll,
use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of
(void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
use ``(void *)hdr + TPACKET_ALIGN(hdrlen)`` instead of
``(void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))``
TPACKET_V2 --> TPACKET_V3:
- Flexible buffer implementation for RX_RING:
......@@ -559,8 +584,10 @@ TPACKET_V2 --> TPACKET_V3:
3. Added poll timeout to avoid indefinite user-space wait
on idle links
4. Added user-configurable knobs:
4.1 block::timeout
4.2 tpkt_hdr::sk_rxhash
- RX Hash data available in user space
- TX_RING semantics are conceptually similar to TPACKET_V2;
use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
......@@ -569,9 +596,8 @@ TPACKET_V2 --> TPACKET_V3:
zero, indicating that the ring does not hold variable sized frames.
Packets with non-zero values of tp_next_offset will be dropped.
-------------------------------------------------------------------------------
+ AF_PACKET fanout mode
-------------------------------------------------------------------------------
AF_PACKET fanout mode
=====================
In the AF_PACKET fanout mode, packet reception can be load balanced among
processes. This also works in combination with mmap(2) on packet sockets.
......@@ -586,403 +612,402 @@ Currently implemented fanout policies are:
- PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping
Minimal example code by David S. Miller (try things like "./test eth0 hash",
"./test eth0 lb", etc.):
#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
static const char *device_name;
static int fanout_type;
static int fanout_id;
#ifndef PACKET_FANOUT
# define PACKET_FANOUT 18
# define PACKET_FANOUT_HASH 0
# define PACKET_FANOUT_LB 1
#endif
static int setup_socket(void)
{
int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
struct sockaddr_ll ll;
struct ifreq ifr;
int fanout_arg;
if (fd < 0) {
perror("socket");
return EXIT_FAILURE;
}
memset(&ifr, 0, sizeof(ifr));
strcpy(ifr.ifr_name, device_name);
err = ioctl(fd, SIOCGIFINDEX, &ifr);
if (err < 0) {
perror("SIOCGIFINDEX");
return EXIT_FAILURE;
}
memset(&ll, 0, sizeof(ll));
ll.sll_family = AF_PACKET;
ll.sll_ifindex = ifr.ifr_ifindex;
err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
if (err < 0) {
perror("bind");
return EXIT_FAILURE;
}
fanout_arg = (fanout_id | (fanout_type << 16));
err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
&fanout_arg, sizeof(fanout_arg));
if (err) {
perror("setsockopt");
return EXIT_FAILURE;
}
return fd;
}
static void fanout_thread(void)
{
int fd = setup_socket();
int limit = 10000;
if (fd < 0)
exit(fd);
while (limit-- > 0) {
char buf[1600];
int err;
err = read(fd, buf, sizeof(buf));
if (err < 0) {
perror("read");
exit(EXIT_FAILURE);
}
if ((limit % 10) == 0)
fprintf(stdout, "(%d) \n", getpid());
}
fprintf(stdout, "%d: Received 10000 packets\n", getpid());
close(fd);
exit(0);
}
int main(int argc, char **argp)
{
int fd, err;
int i;
if (argc != 3) {
fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
return EXIT_FAILURE;
}
if (!strcmp(argp[2], "hash"))
fanout_type = PACKET_FANOUT_HASH;
else if (!strcmp(argp[2], "lb"))
fanout_type = PACKET_FANOUT_LB;
else {
fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
exit(EXIT_FAILURE);
}
device_name = argp[1];
fanout_id = getpid() & 0xffff;
for (i = 0; i < 4; i++) {
pid_t pid = fork();
switch (pid) {
case 0:
fanout_thread();
case -1:
perror("fork");
exit(EXIT_FAILURE);
}
}
for (i = 0; i < 4; i++) {
int status;
wait(&status);
}
return 0;
}
-------------------------------------------------------------------------------
+ AF_PACKET TPACKET_V3 example
-------------------------------------------------------------------------------
"./test eth0 lb", etc.)::
#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
static const char *device_name;
static int fanout_type;
static int fanout_id;
#ifndef PACKET_FANOUT
# define PACKET_FANOUT 18
# define PACKET_FANOUT_HASH 0
# define PACKET_FANOUT_LB 1
#endif
static int setup_socket(void)
{
int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
struct sockaddr_ll ll;
struct ifreq ifr;
int fanout_arg;
if (fd < 0) {
perror("socket");
return EXIT_FAILURE;
}
memset(&ifr, 0, sizeof(ifr));
strcpy(ifr.ifr_name, device_name);
err = ioctl(fd, SIOCGIFINDEX, &ifr);
if (err < 0) {
perror("SIOCGIFINDEX");
return EXIT_FAILURE;
}
memset(&ll, 0, sizeof(ll));
ll.sll_family = AF_PACKET;
ll.sll_ifindex = ifr.ifr_ifindex;
err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
if (err < 0) {
perror("bind");
return EXIT_FAILURE;
}
fanout_arg = (fanout_id | (fanout_type << 16));
err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
&fanout_arg, sizeof(fanout_arg));
if (err) {
perror("setsockopt");
return EXIT_FAILURE;
}
return fd;
}
static void fanout_thread(void)
{
int fd = setup_socket();
int limit = 10000;
if (fd < 0)
exit(fd);
while (limit-- > 0) {
char buf[1600];
int err;
err = read(fd, buf, sizeof(buf));
if (err < 0) {
perror("read");
exit(EXIT_FAILURE);
}
if ((limit % 10) == 0)
fprintf(stdout, "(%d) \n", getpid());
}
fprintf(stdout, "%d: Received 10000 packets\n", getpid());
close(fd);
exit(0);
}
int main(int argc, char **argp)
{
int fd, err;
int i;
if (argc != 3) {
fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
return EXIT_FAILURE;
}
if (!strcmp(argp[2], "hash"))
fanout_type = PACKET_FANOUT_HASH;
else if (!strcmp(argp[2], "lb"))
fanout_type = PACKET_FANOUT_LB;
else {
fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
exit(EXIT_FAILURE);
}
device_name = argp[1];
fanout_id = getpid() & 0xffff;
for (i = 0; i < 4; i++) {
pid_t pid = fork();
switch (pid) {
case 0:
fanout_thread();
case -1:
perror("fork");
exit(EXIT_FAILURE);
}
}
for (i = 0; i < 4; i++) {
int status;
wait(&status);
}
return 0;
}
AF_PACKET TPACKET_V3 example
============================
AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
sizes by doing it's own memory management. It is based on blocks where polling
works on a per block basis instead of per ring as in TPACKET_V2 and predecessor.
It is said that TPACKET_V3 brings the following benefits:
*) ~15 - 20% reduction in CPU-usage
*) ~20% increase in packet capture rate
*) ~2x increase in packet density
*) Port aggregation analysis
*) Non static frame size to capture entire packet payload
* ~15% - 20% reduction in CPU-usage
* ~20% increase in packet capture rate
* ~2x increase in packet density
* Port aggregation analysis
* Non static frame size to capture entire packet payload
So it seems to be a good candidate to be used with packet fanout.
Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):
it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.)::
/* Written from scratch, but kernel-to-user space API usage
* dissected from lolpcap:
* Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
* License: GPL, version 2.0
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <poll.h>
#include <unistd.h>
#include <signal.h>
#include <inttypes.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#ifndef likely
# define likely(x) __builtin_expect(!!(x), 1)
#endif
#ifndef unlikely
# define unlikely(x) __builtin_expect(!!(x), 0)
#endif
struct block_desc {
uint32_t version;
uint32_t offset_to_priv;
struct tpacket_hdr_v1 h1;
};
/* Written from scratch, but kernel-to-user space API usage
* dissected from lolpcap:
* Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
* License: GPL, version 2.0
*/
struct ring {
struct iovec *rd;
uint8_t *map;
struct tpacket_req3 req;
};
static unsigned long packets_total = 0, bytes_total = 0;
static sig_atomic_t sigint = 0;
static void sighandler(int num)
{
sigint = 1;
}
static int setup_socket(struct ring *ring, char *netdev)
{
int err, i, fd, v = TPACKET_V3;
struct sockaddr_ll ll;
unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
unsigned int blocknum = 64;
fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if (fd < 0) {
perror("socket");
exit(1);
}
err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
if (err < 0) {
perror("setsockopt");
exit(1);
}
memset(&ring->req, 0, sizeof(ring->req));
ring->req.tp_block_size = blocksiz;
ring->req.tp_frame_size = framesiz;
ring->req.tp_block_nr = blocknum;
ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
ring->req.tp_retire_blk_tov = 60;
ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
sizeof(ring->req));
if (err < 0) {
perror("setsockopt");
exit(1);
}
ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
if (ring->map == MAP_FAILED) {
perror("mmap");
exit(1);
}
ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
assert(ring->rd);
for (i = 0; i < ring->req.tp_block_nr; ++i) {
ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
ring->rd[i].iov_len = ring->req.tp_block_size;
}
memset(&ll, 0, sizeof(ll));
ll.sll_family = PF_PACKET;
ll.sll_protocol = htons(ETH_P_ALL);
ll.sll_ifindex = if_nametoindex(netdev);
ll.sll_hatype = 0;
ll.sll_pkttype = 0;
ll.sll_halen = 0;
err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
if (err < 0) {
perror("bind");
exit(1);
}
return fd;
}
static void display(struct tpacket3_hdr *ppd)
{
struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);
if (eth->h_proto == htons(ETH_P_IP)) {
struct sockaddr_in ss, sd;
char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];
memset(&ss, 0, sizeof(ss));
ss.sin_family = PF_INET;
ss.sin_addr.s_addr = ip->saddr;
getnameinfo((struct sockaddr *) &ss, sizeof(ss),
sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);
memset(&sd, 0, sizeof(sd));
sd.sin_family = PF_INET;
sd.sin_addr.s_addr = ip->daddr;
getnameinfo((struct sockaddr *) &sd, sizeof(sd),
dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);
printf("%s -> %s, ", sbuff, dbuff);
}
printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
}
static void walk_block(struct block_desc *pbd, const int block_num)
{
int num_pkts = pbd->h1.num_pkts, i;
unsigned long bytes = 0;
struct tpacket3_hdr *ppd;
ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
pbd->h1.offset_to_first_pkt);
for (i = 0; i < num_pkts; ++i) {
bytes += ppd->tp_snaplen;
display(ppd);
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <poll.h>
#include <unistd.h>
#include <signal.h>
#include <inttypes.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#ifndef likely
# define likely(x) __builtin_expect(!!(x), 1)
#endif
#ifndef unlikely
# define unlikely(x) __builtin_expect(!!(x), 0)
#endif
struct block_desc {
uint32_t version;
uint32_t offset_to_priv;
struct tpacket_hdr_v1 h1;
};
struct ring {
struct iovec *rd;
uint8_t *map;
struct tpacket_req3 req;
};
static unsigned long packets_total = 0, bytes_total = 0;
static sig_atomic_t sigint = 0;
static void sighandler(int num)
{
sigint = 1;
}
static int setup_socket(struct ring *ring, char *netdev)
{
int err, i, fd, v = TPACKET_V3;
struct sockaddr_ll ll;
unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
unsigned int blocknum = 64;
fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if (fd < 0) {
perror("socket");
exit(1);
}
err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
if (err < 0) {
perror("setsockopt");
exit(1);
}
memset(&ring->req, 0, sizeof(ring->req));
ring->req.tp_block_size = blocksiz;
ring->req.tp_frame_size = framesiz;
ring->req.tp_block_nr = blocknum;
ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
ring->req.tp_retire_blk_tov = 60;
ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
sizeof(ring->req));
if (err < 0) {
perror("setsockopt");
exit(1);
}
ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
if (ring->map == MAP_FAILED) {
perror("mmap");
exit(1);
}
ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
assert(ring->rd);
for (i = 0; i < ring->req.tp_block_nr; ++i) {
ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
ring->rd[i].iov_len = ring->req.tp_block_size;
}
memset(&ll, 0, sizeof(ll));
ll.sll_family = PF_PACKET;
ll.sll_protocol = htons(ETH_P_ALL);
ll.sll_ifindex = if_nametoindex(netdev);
ll.sll_hatype = 0;
ll.sll_pkttype = 0;
ll.sll_halen = 0;
err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
if (err < 0) {
perror("bind");
exit(1);
}
return fd;
}
static void display(struct tpacket3_hdr *ppd)
{
struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);
if (eth->h_proto == htons(ETH_P_IP)) {
struct sockaddr_in ss, sd;
char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];
memset(&ss, 0, sizeof(ss));
ss.sin_family = PF_INET;
ss.sin_addr.s_addr = ip->saddr;
getnameinfo((struct sockaddr *) &ss, sizeof(ss),
sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);
memset(&sd, 0, sizeof(sd));
sd.sin_family = PF_INET;
sd.sin_addr.s_addr = ip->daddr;
getnameinfo((struct sockaddr *) &sd, sizeof(sd),
dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);
printf("%s -> %s, ", sbuff, dbuff);
}
printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
}
static void walk_block(struct block_desc *pbd, const int block_num)
{
int num_pkts = pbd->h1.num_pkts, i;
unsigned long bytes = 0;
struct tpacket3_hdr *ppd;
ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
pbd->h1.offset_to_first_pkt);
for (i = 0; i < num_pkts; ++i) {
bytes += ppd->tp_snaplen;
display(ppd);
ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
ppd->tp_next_offset);
}
packets_total += num_pkts;
bytes_total += bytes;
}
static void flush_block(struct block_desc *pbd)
{
pbd->h1.block_status = TP_STATUS_KERNEL;
}
static void teardown_socket(struct ring *ring, int fd)
{
munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
free(ring->rd);
close(fd);
}
int main(int argc, char **argp)
{
int fd, err;
socklen_t len;
struct ring ring;
struct pollfd pfd;
unsigned int block_num = 0, blocks = 64;
struct block_desc *pbd;
struct tpacket_stats_v3 stats;
if (argc != 2) {
fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
return EXIT_FAILURE;
}
signal(SIGINT, sighandler);
memset(&ring, 0, sizeof(ring));
fd = setup_socket(&ring, argp[argc - 1]);
assert(fd > 0);
memset(&pfd, 0, sizeof(pfd));
pfd.fd = fd;
pfd.events = POLLIN | POLLERR;
pfd.revents = 0;
while (likely(!sigint)) {
pbd = (struct block_desc *) ring.rd[block_num].iov_base;
if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
poll(&pfd, 1, -1);
continue;
}
walk_block(pbd, block_num);
flush_block(pbd);
block_num = (block_num + 1) % blocks;
}
len = sizeof(stats);
err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
if (err < 0) {
perror("getsockopt");
exit(1);
}
fflush(stdout);
printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
stats.tp_packets, bytes_total, stats.tp_drops,
stats.tp_freeze_q_cnt);
teardown_socket(&ring, fd);
return 0;
}
-------------------------------------------------------------------------------
+ PACKET_QDISC_BYPASS
-------------------------------------------------------------------------------
ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
ppd->tp_next_offset);
}
packets_total += num_pkts;
bytes_total += bytes;
}
static void flush_block(struct block_desc *pbd)
{
pbd->h1.block_status = TP_STATUS_KERNEL;
}
static void teardown_socket(struct ring *ring, int fd)
{
munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
free(ring->rd);
close(fd);
}
int main(int argc, char **argp)
{
int fd, err;
socklen_t len;
struct ring ring;
struct pollfd pfd;
unsigned int block_num = 0, blocks = 64;
struct block_desc *pbd;
struct tpacket_stats_v3 stats;
if (argc != 2) {
fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
return EXIT_FAILURE;
}
signal(SIGINT, sighandler);
memset(&ring, 0, sizeof(ring));
fd = setup_socket(&ring, argp[argc - 1]);
assert(fd > 0);
memset(&pfd, 0, sizeof(pfd));
pfd.fd = fd;
pfd.events = POLLIN | POLLERR;
pfd.revents = 0;
while (likely(!sigint)) {
pbd = (struct block_desc *) ring.rd[block_num].iov_base;
if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
poll(&pfd, 1, -1);
continue;
}
walk_block(pbd, block_num);
flush_block(pbd);
block_num = (block_num + 1) % blocks;
}
len = sizeof(stats);
err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
if (err < 0) {
perror("getsockopt");
exit(1);
}
fflush(stdout);
printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
stats.tp_packets, bytes_total, stats.tp_drops,
stats.tp_freeze_q_cnt);
teardown_socket(&ring, fd);
return 0;
}
PACKET_QDISC_BYPASS
===================
If there is a requirement to load the network with many packets in a similar
fashion as pktgen does, you might set the following option after socket
creation:
creation::
int one = 1;
setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
......@@ -997,31 +1022,32 @@ components of a system.
On default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
on PF_PACKET sockets.
-------------------------------------------------------------------------------
+ PACKET_TIMESTAMP
-------------------------------------------------------------------------------
PACKET_TIMESTAMP
================
The PACKET_TIMESTAMP setting determines the source of the timestamp in
the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your
NIC is capable of timestamping packets in hardware, you can request those
hardware timestamps to be used. Note: you may need to enable the generation
of hardware timestamps with SIOCSHWTSTAMP (see related information from
Documentation/networking/timestamping.txt).
Documentation/networking/timestamping.rst).
PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING:
PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING::
int req = SOF_TIMESTAMPING_RAW_HARDWARE;
setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req))
For the mmap(2)ed ring buffers, such timestamps are stored in the
tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine
what kind of timestamp has been reported, the tp_status field is binary |'ed
with the following possible bits ...
``tpacket{,2,3}_hdr`` structure's tp_sec and ``tp_{n,u}sec`` members.
To determine what kind of timestamp has been reported, the tp_status field
is binary or'ed with the following possible bits ...
::
TP_STATUS_TS_RAW_HARDWARE
TP_STATUS_TS_SOFTWARE
... that are equivalent to its SOF_TIMESTAMPING_* counterparts. For the
... that are equivalent to its ``SOF_TIMESTAMPING_*`` counterparts. For the
RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
software fallback was invoked *within* PF_PACKET's processing code (less
precise).
......@@ -1043,19 +1069,16 @@ TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec
members do not contain a valid value. For TX_RINGs, by default no timestamp
is generated!
See include/linux/net_tstamp.h and Documentation/networking/timestamping.txt
See include/linux/net_tstamp.h and Documentation/networking/timestamping.rst
for more information on hardware timestamps.
-------------------------------------------------------------------------------
+ Miscellaneous bits
-------------------------------------------------------------------------------
Miscellaneous bits
==================
- Packet sockets work well together with Linux socket filters, thus you also
might want to have a look at Documentation/networking/filter.rst
--------------------------------------------------------------------------------
+ THANKS
--------------------------------------------------------------------------------
Jesse Brandeburg, for fixing my grammathical/spelling errors
THANKS
======
Jesse Brandeburg, for fixing my grammathical/spelling errors
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
============================
Linux Phonet protocol family
============================
......@@ -11,6 +15,7 @@ device attached to the modem. The modem takes care of routing.
Phonet packets can be exchanged through various hardware connections
depending on the device, such as:
- USB with the CDC Phonet interface,
- infrared,
- Bluetooth,
......@@ -21,7 +26,7 @@ depending on the device, such as:
Packets format
--------------
Phonet packets have a common header as follows:
Phonet packets have a common header as follows::
struct phonethdr {
uint8_t pn_media; /* Media type (link-layer identifier) */
......@@ -72,7 +77,7 @@ only the (default) Linux FIFO qdisc should be used with them.
Network layer
-------------
The Phonet socket address family maps the Phonet packet header:
The Phonet socket address family maps the Phonet packet header::
struct sockaddr_pn {
sa_family_t spn_family; /* AF_PHONET */
......@@ -94,6 +99,8 @@ protocol from the PF_PHONET family. Each socket is bound to one of the
2^10 object IDs available, and can send and receive packets with any
other peer.
::
struct sockaddr_pn addr = { .spn_family = AF_PHONET, };
ssize_t len;
socklen_t addrlen = sizeof(addr);
......@@ -105,7 +112,7 @@ other peer.
sendto(fd, msg, msglen, 0, (struct sockaddr *)&addr, sizeof(addr));
len = recvfrom(fd, buf, sizeof(buf), 0,
(struct sockaddr *)&addr, &addrlen);
(struct sockaddr *)&addr, &addrlen);
This protocol follows the SOCK_DGRAM connection-less semantics.
However, connect() and getpeername() are not supported, as they did
......@@ -116,7 +123,7 @@ Resource subscription
---------------------
A Phonet datagram socket can be subscribed to any number of 8-bits
Phonet resources, as follow:
Phonet resources, as follow::
uint32_t res = 0xXX;
ioctl(fd, SIOCPNADDRESOURCE, &res);
......@@ -137,6 +144,8 @@ socket paradigm. The listening socket is bound to an unique free object
ID. Each listening socket can handle up to 255 simultaneous
connections, one per accept()'d socket.
::
int lfd, cfd;
lfd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE);
......@@ -161,7 +170,7 @@ Connections are traditionally established between two endpoints by a
As of Linux kernel version 2.6.39, it is also possible to connect
two endpoints directly, using connect() on the active side. This is
intended to support the newer Nokia Wireless Modem API, as found in
e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform:
e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform::
struct sockaddr_spn spn;
int fd;
......@@ -177,38 +186,45 @@ e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform:
close(fd);
WARNING:
When polling a connected pipe socket for writability, there is an
intrinsic race condition whereby writability might be lost between the
polling and the writing system calls. In this case, the socket will
block until write becomes possible again, unless non-blocking mode
is enabled.
.. Warning:
When polling a connected pipe socket for writability, there is an
intrinsic race condition whereby writability might be lost between the
polling and the writing system calls. In this case, the socket will
block until write becomes possible again, unless non-blocking mode
is enabled.
The pipe protocol provides two socket options at the SOL_PNPIPE level:
PNPIPE_ENCAP accepts one integer value (int) of:
PNPIPE_ENCAP_NONE: The socket operates normally (default).
PNPIPE_ENCAP_NONE:
The socket operates normally (default).
PNPIPE_ENCAP_IP: The socket is used as a backend for a virtual IP
PNPIPE_ENCAP_IP:
The socket is used as a backend for a virtual IP
interface. This requires CAP_NET_ADMIN capability. GPRS data
support on Nokia modems can use this. Note that the socket cannot
be reliably poll()'d or read() from while in this mode.
PNPIPE_IFINDEX is a read-only integer value. It contains the
interface index of the network interface created by PNPIPE_ENCAP,
or zero if encapsulation is off.
PNPIPE_IFINDEX
is a read-only integer value. It contains the
interface index of the network interface created by PNPIPE_ENCAP,
or zero if encapsulation is off.
PNPIPE_HANDLE is a read-only integer value. It contains the underlying
identifier ("pipe handle") of the pipe. This is only defined for
socket descriptors that are already connected or being connected.
PNPIPE_HANDLE
is a read-only integer value. It contains the underlying
identifier ("pipe handle") of the pipe. This is only defined for
socket descriptors that are already connected or being connected.
Authors
-------
Linux Phonet was initially written by Sakari Ailus.
Other contributors include Mikä Liljeberg, Andras Domokos,
Carlos Chinea and Rémi Denis-Courmont.
Copyright (C) 2008 Nokia Corporation.
Copyright |copy| 2008 Nokia Corporation.
.. SPDX-License-Identifier: GPL-2.0
HOWTO for the linux packet generator
------------------------------------
====================================
HOWTO for the linux packet generator
====================================
Enable CONFIG_NET_PKTGEN to compile and build pktgen either in-kernel
or as a module. A module is preferred; modprobe pktgen if needed. Once
......@@ -9,17 +10,18 @@ running, pktgen creates a thread for each CPU with affinity to that CPU.
Monitoring and controlling is done via /proc. It is easiest to select a
suitable sample script and configure that.
On a dual CPU:
On a dual CPU::
ps aux | grep pkt
root 129 0.3 0.0 0 0 ? SW 2003 523:20 [kpktgend_0]
root 130 0.3 0.0 0 0 ? SW 2003 509:50 [kpktgend_1]
ps aux | grep pkt
root 129 0.3 0.0 0 0 ? SW 2003 523:20 [kpktgend_0]
root 130 0.3 0.0 0 0 ? SW 2003 509:50 [kpktgend_1]
For monitoring and control pktgen creates::
For monitoring and control pktgen creates:
/proc/net/pktgen/pgctrl
/proc/net/pktgen/kpktgend_X
/proc/net/pktgen/ethX
/proc/net/pktgen/ethX
Tuning NIC for max performance
......@@ -28,7 +30,8 @@ Tuning NIC for max performance
The default NIC settings are (likely) not tuned for pktgen's artificial
overload type of benchmarking, as this could hurt the normal use-case.
Specifically increasing the TX ring buffer in the NIC:
Specifically increasing the TX ring buffer in the NIC::
# ethtool -G ethX tx 1024
A larger TX ring can improve pktgen's performance, while it can hurt
......@@ -46,7 +49,8 @@ This cleanup issue is specifically the case for the driver ixgbe
and the cleanup interval is affected by the ethtool --coalesce setting
of parameter "rx-usecs".
For ixgbe use e.g. "30" resulting in approx 33K interrupts/sec (1/30*10^6):
For ixgbe use e.g. "30" resulting in approx 33K interrupts/sec (1/30*10^6)::
# ethtool -C ethX rx-usecs 30
......@@ -55,7 +59,7 @@ Kernel threads
Pktgen creates a thread for each CPU with affinity to that CPU.
Which is controlled through procfile /proc/net/pktgen/kpktgend_X.
Example: /proc/net/pktgen/kpktgend_0
Example: /proc/net/pktgen/kpktgend_0::
Running:
Stopped: eth4@0
......@@ -64,6 +68,7 @@ Example: /proc/net/pktgen/kpktgend_0
Most important are the devices assigned to the thread.
The two basic thread commands are:
* add_device DEVICE@NAME -- adds a single device
* rem_device_all -- remove all associated devices
......@@ -73,7 +78,7 @@ be unique.
To support adding the same device to multiple threads, which is useful
with multi queue NICs, the device naming scheme is extended with "@":
device@something
device@something
The part after "@" can be anything, but it is custom to use the thread
number.
......@@ -83,30 +88,30 @@ Viewing devices
The Params section holds configured information. The Current section
holds running statistics. The Result is printed after a run or after
interruption. Example:
/proc/net/pktgen/eth4@0
Params: count 100000 min_pkt_size: 60 max_pkt_size: 60
frags: 0 delay: 0 clone_skb: 64 ifname: eth4@0
flows: 0 flowlen: 0
queue_map_min: 0 queue_map_max: 0
dst_min: 192.168.81.2 dst_max:
src_min: src_max:
src_mac: 90:e2:ba:0a:56:b4 dst_mac: 00:1b:21:3c:9d:f8
udp_src_min: 9 udp_src_max: 109 udp_dst_min: 9 udp_dst_max: 9
src_mac_count: 0 dst_mac_count: 0
Flags: UDPSRC_RND NO_TIMESTAMP QUEUE_MAP_CPU
Current:
pkts-sofar: 100000 errors: 0
started: 623913381008us stopped: 623913396439us idle: 25us
seq_num: 100001 cur_dst_mac_offset: 0 cur_src_mac_offset: 0
cur_saddr: 192.168.8.3 cur_daddr: 192.168.81.2
cur_udp_dst: 9 cur_udp_src: 42
cur_queue_map: 0
flows: 0
Result: OK: 15430(c15405+d25) usec, 100000 (60byte,0frags)
6480562pps 3110Mb/sec (3110669760bps) errors: 0
interruption. Example::
/proc/net/pktgen/eth4@0
Params: count 100000 min_pkt_size: 60 max_pkt_size: 60
frags: 0 delay: 0 clone_skb: 64 ifname: eth4@0
flows: 0 flowlen: 0
queue_map_min: 0 queue_map_max: 0
dst_min: 192.168.81.2 dst_max:
src_min: src_max:
src_mac: 90:e2:ba:0a:56:b4 dst_mac: 00:1b:21:3c:9d:f8
udp_src_min: 9 udp_src_max: 109 udp_dst_min: 9 udp_dst_max: 9
src_mac_count: 0 dst_mac_count: 0
Flags: UDPSRC_RND NO_TIMESTAMP QUEUE_MAP_CPU
Current:
pkts-sofar: 100000 errors: 0
started: 623913381008us stopped: 623913396439us idle: 25us
seq_num: 100001 cur_dst_mac_offset: 0 cur_src_mac_offset: 0
cur_saddr: 192.168.8.3 cur_daddr: 192.168.81.2
cur_udp_dst: 9 cur_udp_src: 42
cur_queue_map: 0
flows: 0
Result: OK: 15430(c15405+d25) usec, 100000 (60byte,0frags)
6480562pps 3110Mb/sec (3110669760bps) errors: 0
Configuring devices
......@@ -114,11 +119,12 @@ Configuring devices
This is done via the /proc interface, and most easily done via pgset
as defined in the sample scripts.
You need to specify PGDEV environment variable to use functions from sample
scripts, i.e.:
export PGDEV=/proc/net/pktgen/eth4@0
source samples/pktgen/functions.sh
scripts, i.e.::
export PGDEV=/proc/net/pktgen/eth4@0
source samples/pktgen/functions.sh
Examples:
Examples::
pg_ctrl start starts injection.
pg_ctrl stop aborts injection. Also, ^C aborts generator.
......@@ -126,17 +132,17 @@ Examples:
pgset "clone_skb 1" sets the number of copies of the same packet
pgset "clone_skb 0" use single SKB for all transmits
pgset "burst 8" uses xmit_more API to queue 8 copies of the same
packet and update HW tx queue tail pointer once.
"burst 1" is the default
packet and update HW tx queue tail pointer once.
"burst 1" is the default
pgset "pkt_size 9014" sets packet size to 9014
pgset "frags 5" packet will consist of 5 fragments
pgset "count 200000" sets number of packets to send, set to zero
for continuous sends until explicitly stopped.
for continuous sends until explicitly stopped.
pgset "delay 5000" adds delay to hard_start_xmit(). nanoseconds
pgset "dst 10.0.0.1" sets IP destination address
(BEWARE! This generator is very aggressive!)
(BEWARE! This generator is very aggressive!)
pgset "dst_min 10.0.0.1" Same as dst
pgset "dst_max 10.0.0.254" Set the maximum destination IP.
......@@ -149,46 +155,46 @@ Examples:
pgset "queue_map_min 0" Sets the min value of tx queue interval
pgset "queue_map_max 7" Sets the max value of tx queue interval, for multiqueue devices
To select queue 1 of a given device,
use queue_map_min=1 and queue_map_max=1
To select queue 1 of a given device,
use queue_map_min=1 and queue_map_max=1
pgset "src_mac_count 1" Sets the number of MACs we'll range through.
The 'minimum' MAC is what you set with srcmac.
The 'minimum' MAC is what you set with srcmac.
pgset "dst_mac_count 1" Sets the number of MACs we'll range through.
The 'minimum' MAC is what you set with dstmac.
The 'minimum' MAC is what you set with dstmac.
pgset "flag [name]" Set a flag to determine behaviour. Current flags
are: IPSRC_RND # IP source is random (between min/max)
IPDST_RND # IP destination is random
UDPSRC_RND, UDPDST_RND,
MACSRC_RND, MACDST_RND
TXSIZE_RND, IPV6,
MPLS_RND, VID_RND, SVID_RND
FLOW_SEQ,
QUEUE_MAP_RND # queue map random
QUEUE_MAP_CPU # queue map mirrors smp_processor_id()
UDPCSUM,
IPSEC # IPsec encapsulation (needs CONFIG_XFRM)
NODE_ALLOC # node specific memory allocation
NO_TIMESTAMP # disable timestamping
are: IPSRC_RND # IP source is random (between min/max)
IPDST_RND # IP destination is random
UDPSRC_RND, UDPDST_RND,
MACSRC_RND, MACDST_RND
TXSIZE_RND, IPV6,
MPLS_RND, VID_RND, SVID_RND
FLOW_SEQ,
QUEUE_MAP_RND # queue map random
QUEUE_MAP_CPU # queue map mirrors smp_processor_id()
UDPCSUM,
IPSEC # IPsec encapsulation (needs CONFIG_XFRM)
NODE_ALLOC # node specific memory allocation
NO_TIMESTAMP # disable timestamping
pgset 'flag ![name]' Clear a flag to determine behaviour.
Note that you might need to use single quote in
interactive mode, so that your shell wouldn't expand
the specified flag as a history command.
Note that you might need to use single quote in
interactive mode, so that your shell wouldn't expand
the specified flag as a history command.
pgset "spi [SPI_VALUE]" Set specific SA used to transform packet.
pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then
cycle through the port range.
cycle through the port range.
pgset "udp_src_max 9" set UDP source port max.
pgset "udp_dst_min 9" set UDP destination port min, If < udp_dst_max, then
cycle through the port range.
cycle through the port range.
pgset "udp_dst_max 9" set UDP destination port max.
pgset "mpls 0001000a,0002000a,0000000a" set MPLS labels (in this example
outer label=16,middle label=32,
outer label=16,middle label=32,
inner label=0 (IPv4 NULL)) Note that
there must be no spaces between the
arguments. Leading zeros are required.
......@@ -232,10 +238,14 @@ A collection of tutorial scripts and helpers for pktgen is in the
samples/pktgen directory. The helper parameters.sh file support easy
and consistent parameter parsing across the sample scripts.
Usage example and help:
Usage example and help::
./pktgen_sample01_simple.sh -i eth4 -m 00:1B:21:3C:9D:F8 -d 192.168.8.2
Usage: ./pktgen_sample01_simple.sh [-vx] -i ethX
Usage:::
./pktgen_sample01_simple.sh [-vx] -i ethX
-i : ($DEV) output interface/device (required)
-s : ($PKT_SIZE) packet size
-d : ($DEST_IP) destination IP
......@@ -250,13 +260,13 @@ The global variables being set are also listed. E.g. the required
interface/device parameter "-i" sets variable $DEV. Copy the
pktgen_sampleXX scripts and modify them to fit your own needs.
The old scripts:
The old scripts::
pktgen.conf-1-2 # 1 CPU 2 dev
pktgen.conf-1-1-rdos # 1 CPU 1 dev w. route DoS
pktgen.conf-1-1-ip6 # 1 CPU 1 dev ipv6
pktgen.conf-1-1-ip6-rdos # 1 CPU 1 dev ipv6 w. route DoS
pktgen.conf-1-1-flows # 1 CPU 1 dev multiple flows.
pktgen.conf-1-2 # 1 CPU 2 dev
pktgen.conf-1-1-rdos # 1 CPU 1 dev w. route DoS
pktgen.conf-1-1-ip6 # 1 CPU 1 dev ipv6
pktgen.conf-1-1-ip6-rdos # 1 CPU 1 dev ipv6 w. route DoS
pktgen.conf-1-1-flows # 1 CPU 1 dev multiple flows.
Interrupt affinity
......@@ -271,10 +281,10 @@ to the running threads CPU (directly from smp_processor_id()).
Enable IPsec
============
Default IPsec transformation with ESP encapsulation plus transport mode
can be enabled by simply setting:
can be enabled by simply setting::
pgset "flag IPSEC"
pgset "flows 1"
pgset "flag IPSEC"
pgset "flows 1"
To avoid breaking existing testbed scripts for using AH type and tunnel mode,
you can use "pgset spi SPI_VALUE" to specify which transformation mode
......@@ -284,115 +294,117 @@ to employ.
Current commands and configuration options
==========================================
** Pgcontrol commands:
**Pgcontrol commands**::
start
stop
reset
start
stop
reset
** Thread commands:
**Thread commands**::
add_device
rem_device_all
add_device
rem_device_all
** Device commands:
**Device commands**::
count
clone_skb
burst
debug
count
clone_skb
burst
debug
frags
delay
frags
delay
src_mac_count
dst_mac_count
src_mac_count
dst_mac_count
pkt_size
min_pkt_size
max_pkt_size
pkt_size
min_pkt_size
max_pkt_size
queue_map_min
queue_map_max
skb_priority
queue_map_min
queue_map_max
skb_priority
tos (ipv4)
traffic_class (ipv6)
tos (ipv4)
traffic_class (ipv6)
mpls
mpls
udp_src_min
udp_src_max
udp_src_min
udp_src_max
udp_dst_min
udp_dst_max
udp_dst_min
udp_dst_max
node
node
flag
IPSRC_RND
IPDST_RND
UDPSRC_RND
UDPDST_RND
MACSRC_RND
MACDST_RND
TXSIZE_RND
IPV6
MPLS_RND
VID_RND
SVID_RND
FLOW_SEQ
QUEUE_MAP_RND
QUEUE_MAP_CPU
UDPCSUM
IPSEC
NODE_ALLOC
NO_TIMESTAMP
flag
IPSRC_RND
IPDST_RND
UDPSRC_RND
UDPDST_RND
MACSRC_RND
MACDST_RND
TXSIZE_RND
IPV6
MPLS_RND
VID_RND
SVID_RND
FLOW_SEQ
QUEUE_MAP_RND
QUEUE_MAP_CPU
UDPCSUM
IPSEC
NODE_ALLOC
NO_TIMESTAMP
spi (ipsec)
spi (ipsec)
dst_min
dst_max
dst_min
dst_max
src_min
src_max
src_min
src_max
dst_mac
src_mac
dst_mac
src_mac
clear_counters
clear_counters
src6
dst6
dst6_max
dst6_min
src6
dst6
dst6_max
dst6_min
flows
flowlen
flows
flowlen
rate
ratep
rate
ratep
xmit_mode <start_xmit|netif_receive>
xmit_mode <start_xmit|netif_receive>
vlan_cfi
vlan_id
vlan_p
vlan_cfi
vlan_id
vlan_p
svlan_cfi
svlan_id
svlan_p
svlan_cfi
svlan_id
svlan_p
References:
ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/
ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/
- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/
- tp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/
Paper from Linux-Kongress in Erlangen 2004.
ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf
- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf
Thanks to:
Grant Grundler for testing on IA-64 and parisc, Harald Welte, Lennert Buytenhek
Stephen Hemminger, Andi Kleen, Dave Miller and many others.
......
.. SPDX-License-Identifier: GPL-2.0
================================================
PLIP: The Parallel Line Internet Protocol Device
================================================
Donald Becker (becker@super.org)
I.D.A. Supercomputing Research Center, Bowie MD 20715
......@@ -83,7 +87,7 @@ When the PLIP driver is used in IRQ mode, the timeout used for triggering a
data transfer (the maximal time the PLIP driver would allow the other side
before announcing a timeout, when trying to handshake a transfer of some
data) is, by default, 500usec. As IRQ delivery is more or less immediate,
this timeout is quite sufficient.
this timeout is quite sufficient.
When in IRQ-less mode, the PLIP driver polls the parallel port HZ times
per second (where HZ is typically 100 on most platforms, and 1024 on an
......@@ -115,7 +119,7 @@ printer "null" cable to transfer data four bits at a time using
data bit outputs connected to status bit inputs.
The second data transfer method relies on both machines having
bi-directional parallel ports, rather than output-only ``printer''
bi-directional parallel ports, rather than output-only ``printer``
ports. This allows byte-wide transfers and avoids reconstructing
nibbles into bytes, leading to much faster transfers.
......@@ -132,7 +136,7 @@ bits with standard status register implementation.
A cable that implements this protocol is available commercially as a
"Null Printer" or "Turbo Laplink" cable. It can be constructed with
two DB-25 male connectors symmetrically connected as follows:
two DB-25 male connectors symmetrically connected as follows::
STROBE output 1*
D0->ERROR 2 - 15 15 - 2
......@@ -146,7 +150,8 @@ two DB-25 male connectors symmetrically connected as follows:
SLCTIN 17 - 17
extra grounds are 18*,19*,20*,21*,22*,23*,24*
GROUND 25 - 25
* Do not connect these pins on either end
* Do not connect these pins on either end
If the cable you are using has a metallic shield it should be
connected to the metallic DB-25 shell at one end only.
......@@ -155,14 +160,14 @@ Parallel Transfer Mode 1
========================
The second data transfer method relies on both machines having
bi-directional parallel ports, rather than output-only ``printer''
bi-directional parallel ports, rather than output-only ``printer``
ports. This allows byte-wide transfers, and avoids reconstructing
nibbles into bytes. This cable should not be used on unidirectional
``printer'' (as opposed to ``parallel'') ports or when the machine
``printer`` (as opposed to ``parallel``) ports or when the machine
isn't configured for PLIP, as it will result in output driver
conflicts and the (unlikely) possibility of damage.
The cable for this transfer mode should be constructed as follows:
The cable for this transfer mode should be constructed as follows::
STROBE->BUSY 1 - 11
D0->D0 2 - 2
......@@ -179,7 +184,8 @@ The cable for this transfer mode should be constructed as follows:
GND->ERROR 18 - 15
extra grounds are 19*,20*,21*,22*,23*,24*
GROUND 25 - 25
* Do not connect these pins on either end
* Do not connect these pins on either end
Once again, if the cable you are using has a metallic shield it should
be connected to the metallic DB-25 shell at one end only.
......@@ -188,7 +194,7 @@ PLIP Mode 0 transfer protocol
=============================
The PLIP driver is compatible with the "Crynwr" parallel port transfer
standard in Mode 0. That standard specifies the following protocol:
standard in Mode 0. That standard specifies the following protocol::
send header nibble '0x8'
count-low octet
......@@ -196,20 +202,21 @@ standard in Mode 0. That standard specifies the following protocol:
... data octets
checksum octet
Each octet is sent as
Each octet is sent as::
<wait for rx. '0x1?'> <send 0x10+(octet&0x0F)>
<wait for rx. '0x0?'> <send 0x00+((octet>>4)&0x0F)>
To start a transfer the transmitting machine outputs a nibble 0x08.
That raises the ACK line, triggering an interrupt in the receiving
machine. The receiving machine disables interrupts and raises its own ACK
line.
line.
Restated:
Restated::
(OUT is bit 0-4, OUT.j is bit j from OUT. IN likewise)
Send_Byte:
OUT := low nibble, OUT.4 := 1
WAIT FOR IN.4 = 1
OUT := high nibble, OUT.4 := 0
WAIT FOR IN.4 = 0
(OUT is bit 0-4, OUT.j is bit j from OUT. IN likewise)
Send_Byte:
OUT := low nibble, OUT.4 := 1
WAIT FOR IN.4 = 1
OUT := high nibble, OUT.4 := 0
WAIT FOR IN.4 = 0
PPP Generic Driver and Channel Interface
----------------------------------------
.. SPDX-License-Identifier: GPL-2.0
Paul Mackerras
========================================
PPP Generic Driver and Channel Interface
========================================
Paul Mackerras
paulus@samba.org
7 Feb 2002
The generic PPP driver in linux-2.4 provides an implementation of the
......@@ -19,7 +23,7 @@ functionality which is of use in any PPP implementation, including:
* simple packet filtering
For sending and receiving PPP frames, the generic PPP driver calls on
the services of PPP `channels'. A PPP channel encapsulates a
the services of PPP ``channels``. A PPP channel encapsulates a
mechanism for transporting PPP frames from one machine to another. A
PPP channel implementation can be arbitrarily complex internally but
has a very simple interface with the generic PPP code: it merely has
......@@ -102,7 +106,7 @@ communications medium and prepare it to do PPP. For example, with an
async tty, this can involve setting the tty speed and modes, issuing
modem commands, and then going through some sort of dialog with the
remote system to invoke PPP service there. We refer to this process
as `discovery'. Then the user-level process tells the medium to
as ``discovery``. Then the user-level process tells the medium to
become a PPP channel and register itself with the generic PPP layer.
The channel then has to report the channel number assigned to it back
to the user-level process. From that point, the PPP negotiation code
......@@ -111,8 +115,8 @@ negotiation, accessing the channel through the /dev/ppp interface.
At the interface to the PPP generic layer, PPP frames are stored in
skbuff structures and start with the two-byte PPP protocol number.
The frame does *not* include the 0xff `address' byte or the 0x03
`control' byte that are optionally used in async PPP. Nor is there
The frame does *not* include the 0xff ``address`` byte or the 0x03
``control`` byte that are optionally used in async PPP. Nor is there
any escaping of control characters, nor are there any FCS or framing
characters included. That is all the responsibility of the channel
code, if it is needed for the particular medium. That is, the skbuffs
......@@ -121,16 +125,16 @@ protocol number and the data, and the skbuffs presented to ppp_input()
must be in the same format.
The channel must provide an instance of a ppp_channel struct to
represent the channel. The channel is free to use the `private' field
however it wishes. The channel should initialize the `mtu' and
`hdrlen' fields before calling ppp_register_channel() and not change
them until after ppp_unregister_channel() returns. The `mtu' field
represent the channel. The channel is free to use the ``private`` field
however it wishes. The channel should initialize the ``mtu`` and
``hdrlen`` fields before calling ppp_register_channel() and not change
them until after ppp_unregister_channel() returns. The ``mtu`` field
represents the maximum size of the data part of the PPP frames, that
is, it does not include the 2-byte protocol number.
If the channel needs some headroom in the skbuffs presented to it for
transmission (i.e., some space free in the skbuff data area before the
start of the PPP frame), it should set the `hdrlen' field of the
start of the PPP frame), it should set the ``hdrlen`` field of the
ppp_channel struct to the amount of headroom required. The generic
PPP layer will attempt to provide that much headroom but the channel
should still check if there is sufficient headroom and copy the skbuff
......@@ -322,6 +326,8 @@ an interface unit are:
interface. The argument should be a pointer to an int containing
the new flags value. The bits in the flags value that can be set
are:
================ ========================================
SC_COMP_TCP enable transmit TCP header compression
SC_NO_TCP_CCID disable connection-id compression for
TCP header compression
......@@ -335,6 +341,7 @@ an interface unit are:
SC_MP_SHORTSEQ expect short multilink sequence
numbers on received multilink fragments
SC_MP_XSHORTSEQ transmit short multilink sequence nos.
================ ========================================
The values of these flags are defined in <linux/ppp-ioctl.h>. Note
that the values of the SC_MULTILINK, SC_MP_SHORTSEQ and
......@@ -345,17 +352,20 @@ an interface unit are:
interface unit. The argument should point to an int where the ioctl
will store the flags value. As well as the values listed above for
PPPIOCSFLAGS, the following bits may be set in the returned value:
================ =========================================
SC_COMP_RUN CCP compressor is running
SC_DECOMP_RUN CCP decompressor is running
SC_DC_ERROR CCP decompressor detected non-fatal error
SC_DC_FERROR CCP decompressor detected fatal error
================ =========================================
* PPPIOCSCOMPRESS sets the parameters for packet compression or
decompression. The argument should point to a ppp_option_data
structure (defined in <linux/ppp-ioctl.h>), which contains a
pointer/length pair which should describe a block of memory
containing a CCP option specifying a compression method and its
parameters. The ppp_option_data struct also contains a `transmit'
parameters. The ppp_option_data struct also contains a ``transmit``
field. If this is 0, the ioctl will affect the receive path,
otherwise the transmit path.
......@@ -377,7 +387,7 @@ an interface unit are:
ppp_idle structure (defined in <linux/ppp_defs.h>). If the
CONFIG_PPP_FILTER option is enabled, the set of packets which reset
the transmit and receive idle timers is restricted to those which
pass the `active' packet filter.
pass the ``active`` packet filter.
Two versions of this command exist, to deal with user space
expecting times as either 32-bit or 64-bit time_t seconds.
......@@ -391,31 +401,33 @@ an interface unit are:
* PPPIOCSNPMODE sets the network-protocol mode for a given network
protocol. The argument should point to an npioctl struct (defined
in <linux/ppp-ioctl.h>). The `protocol' field gives the PPP protocol
number for the protocol to be affected, and the `mode' field
in <linux/ppp-ioctl.h>). The ``protocol`` field gives the PPP protocol
number for the protocol to be affected, and the ``mode`` field
specifies what to do with packets for that protocol:
============= ==============================================
NPMODE_PASS normal operation, transmit and receive packets
NPMODE_DROP silently drop packets for this protocol
NPMODE_ERROR drop packets and return an error on transmit
NPMODE_QUEUE queue up packets for transmit, drop received
packets
============= ==============================================
At present NPMODE_ERROR and NPMODE_QUEUE have the same effect as
NPMODE_DROP.
* PPPIOCGNPMODE returns the network-protocol mode for a given
protocol. The argument should point to an npioctl struct with the
`protocol' field set to the PPP protocol number for the protocol of
interest. On return the `mode' field will be set to the network-
``protocol`` field set to the PPP protocol number for the protocol of
interest. On return the ``mode`` field will be set to the network-
protocol mode for that protocol.
* PPPIOCSPASS and PPPIOCSACTIVE set the `pass' and `active' packet
* PPPIOCSPASS and PPPIOCSACTIVE set the ``pass`` and ``active`` packet
filters. These ioctls are only available if the CONFIG_PPP_FILTER
option is selected. The argument should point to a sock_fprog
structure (defined in <linux/filter.h>) containing the compiled BPF
instructions for the filter. Packets are dropped if they fail the
`pass' filter; otherwise, if they fail the `active' filter they are
``pass`` filter; otherwise, if they fail the ``active`` filter they are
passed but they do not reset the transmit or receive idle timer.
* PPPIOCSMRRU enables or disables multilink processing for received
......
.. SPDX-License-Identifier: GPL-2.0
============================================
The proc/net/tcp and proc/net/tcp6 variables
============================================
This document describes the interfaces /proc/net/tcp and /proc/net/tcp6.
Note that these interfaces are deprecated in favor of tcp_diag.
These /proc interfaces provide information about currently active TCP
These /proc interfaces provide information about currently active TCP
connections, and are implemented by tcp4_seq_show() in net/ipv4/tcp_ipv4.c
and tcp6_seq_show() in net/ipv6/tcp_ipv6.c, respectively.
It will first list all listening TCP sockets, and next list all established
TCP connections. A typical entry of /proc/net/tcp would look like this (split
up into 3 parts because of the length of the line):
TCP connections. A typical entry of /proc/net/tcp would look like this (split
up into 3 parts because of the length of the line)::
46: 010310AC:9C4C 030310AC:1770 01
46: 010310AC:9C4C 030310AC:1770 01
| | | | | |--> connection state
| | | | |------> remote TCP port number
| | | |-------------> remote IPv4 address
......@@ -17,7 +23,7 @@ up into 3 parts because of the length of the line):
| |---------------------------> local IPv4 address
|----------------------------------> number of entry
00000150:00000000 01:00000019 00000000
00000150:00000000 01:00000019 00000000
| | | | |--> number of unrecovered RTO timeouts
| | | |----------> number of jiffies until timer expires
| | |----------------> timer_active (see below)
......@@ -25,7 +31,7 @@ up into 3 parts because of the length of the line):
|-------------------------------> transmit-queue
1000 0 54165785 4 cd1e6040 25 4 27 3 -1
| | | | | | | | | |--> slow start size threshold,
| | | | | | | | | |--> slow start size threshold,
| | | | | | | | | or -1 if the threshold
| | | | | | | | | is >= 0xFFFF
| | | | | | | | |----> sending congestion window
......@@ -40,9 +46,12 @@ up into 3 parts because of the length of the line):
|---------------------------------------------> uid
timer_active:
== ================================================================
0 no timer is pending
1 retransmit-timer is pending
2 another timer (e.g. delayed ack or keepalive) is pending
3 this is a socket in TIME_WAIT state. Not all fields will contain
3 this is a socket in TIME_WAIT state. Not all fields will contain
data (or even exist)
4 zero window probe timer is pending
== ================================================================
.. SPDX-License-Identifier: GPL-2.0
===========================
How to use radiotap headers
===========================
......@@ -5,9 +8,9 @@ Pointer to the radiotap include file
------------------------------------
Radiotap headers are variable-length and extensible, you can get most of the
information you need to know on them from:
information you need to know on them from::
./include/net/ieee80211_radiotap.h
./include/net/ieee80211_radiotap.h
This document gives an overview and warns on some corner cases.
......@@ -21,6 +24,8 @@ of the it_present member of ieee80211_radiotap_header is set, it means that
the header for argument index 0 (IEEE80211_RADIOTAP_TSFT) is present in the
argument area.
::
< 8-byte ieee80211_radiotap_header >
[ <possible argument bitmap extensions ... > ]
[ <argument> ... ]
......@@ -76,6 +81,8 @@ ieee80211_radiotap_header.
Example valid radiotap header
-----------------------------
::
0x00, 0x00, // <-- radiotap version + pad byte
0x0b, 0x00, // <- radiotap header length
0x04, 0x0c, 0x00, 0x00, // <-- bitmap
......@@ -89,64 +96,64 @@ Using the Radiotap Parser
If you are having to parse a radiotap struct, you can radically simplify the
job by using the radiotap parser that lives in net/wireless/radiotap.c and has
its prototypes available in include/net/cfg80211.h. You use it like this:
its prototypes available in include/net/cfg80211.h. You use it like this::
#include <net/cfg80211.h>
#include <net/cfg80211.h>
/* buf points to the start of the radiotap header part */
/* buf points to the start of the radiotap header part */
int MyFunction(u8 * buf, int buflen)
{
int pkt_rate_100kHz = 0, antenna = 0, pwr = 0;
struct ieee80211_radiotap_iterator iterator;
int ret = ieee80211_radiotap_iterator_init(&iterator, buf, buflen);
int MyFunction(u8 * buf, int buflen)
{
int pkt_rate_100kHz = 0, antenna = 0, pwr = 0;
struct ieee80211_radiotap_iterator iterator;
int ret = ieee80211_radiotap_iterator_init(&iterator, buf, buflen);
while (!ret) {
while (!ret) {
ret = ieee80211_radiotap_iterator_next(&iterator);
ret = ieee80211_radiotap_iterator_next(&iterator);
if (ret)
continue;
if (ret)
continue;
/* see if this argument is something we can use */
/* see if this argument is something we can use */
switch (iterator.this_arg_index) {
/*
* You must take care when dereferencing iterator.this_arg
* for multibyte types... the pointer is not aligned. Use
* get_unaligned((type *)iterator.this_arg) to dereference
* iterator.this_arg for type "type" safely on all arches.
*/
case IEEE80211_RADIOTAP_RATE:
/* radiotap "rate" u8 is in
* 500kbps units, eg, 0x02=1Mbps
*/
pkt_rate_100kHz = (*iterator.this_arg) * 5;
break;
switch (iterator.this_arg_index) {
/*
* You must take care when dereferencing iterator.this_arg
* for multibyte types... the pointer is not aligned. Use
* get_unaligned((type *)iterator.this_arg) to dereference
* iterator.this_arg for type "type" safely on all arches.
*/
case IEEE80211_RADIOTAP_RATE:
/* radiotap "rate" u8 is in
* 500kbps units, eg, 0x02=1Mbps
*/
pkt_rate_100kHz = (*iterator.this_arg) * 5;
break;
case IEEE80211_RADIOTAP_ANTENNA:
/* radiotap uses 0 for 1st ant */
antenna = *iterator.this_arg);
break;
case IEEE80211_RADIOTAP_ANTENNA:
/* radiotap uses 0 for 1st ant */
antenna = *iterator.this_arg);
break;
case IEEE80211_RADIOTAP_DBM_TX_POWER:
pwr = *iterator.this_arg;
break;
case IEEE80211_RADIOTAP_DBM_TX_POWER:
pwr = *iterator.this_arg;
break;
default:
break;
}
} /* while more rt headers */
default:
break;
}
} /* while more rt headers */
if (ret != -ENOENT)
return TXRX_DROP;
if (ret != -ENOENT)
return TXRX_DROP;
/* discard the radiotap header part */
buf += iterator.max_length;
buflen -= iterator.max_length;
/* discard the radiotap header part */
buf += iterator.max_length;
buflen -= iterator.max_length;
...
...
}
}
Andy Green <andy@warmcat.com>
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
=========================
Raylink wireless LAN card
=========================
September 21, 1999
Copyright (c) 1998 Corey Thomas (corey@world.std.com)
Copyright |copy| 1998 Corey Thomas (corey@world.std.com)
This file is the documentation for the Raylink Wireless LAN card driver for
Linux. The Raylink wireless LAN card is a PCMCIA card which provides IEEE
......@@ -13,7 +21,7 @@ wireless LAN cards.
As of kernel 2.3.18, the ray_cs driver is part of the Linux kernel
source. My web page for the development of ray_cs is at
http://web.ralinktech.com/ralink/Home/Support/Linux.html
http://web.ralinktech.com/ralink/Home/Support/Linux.html
and I can be emailed at corey@world.std.com
The kernel driver is based on ray_cs-1.62.tgz
......@@ -29,6 +37,7 @@ with nondefault parameters, they can be edited in
will find them all.
Information on card services is available at:
http://pcmcia-cs.sourceforge.net/
......@@ -39,72 +48,78 @@ the driver.
Currently, ray_cs is not part of David Hinds card services package,
so the following magic is required.
At the end of the /etc/pcmcia/config.opts file, add the line:
source ./ray_cs.opts
At the end of the /etc/pcmcia/config.opts file, add the line:
source ./ray_cs.opts
This will make card services read the ray_cs.opts file
when starting. Create the file /etc/pcmcia/ray_cs.opts containing the
following:
following::
#### start of /etc/pcmcia/ray_cs.opts ###################
# Configuration options for Raylink Wireless LAN PCMCIA card
device "ray_cs"
class "network" module "misc/ray_cs"
#### start of /etc/pcmcia/ray_cs.opts ###################
# Configuration options for Raylink Wireless LAN PCMCIA card
device "ray_cs"
class "network" module "misc/ray_cs"
card "RayLink PC Card WLAN Adapter"
manfid 0x01a6, 0x0000
bind "ray_cs"
card "RayLink PC Card WLAN Adapter"
manfid 0x01a6, 0x0000
bind "ray_cs"
module "misc/ray_cs" opts ""
#### end of /etc/pcmcia/ray_cs.opts #####################
module "misc/ray_cs" opts ""
#### end of /etc/pcmcia/ray_cs.opts #####################
To join an existing network with
different parameters, contact the network administrator for the
different parameters, contact the network administrator for the
configuration information, and edit /etc/pcmcia/ray_cs.opts.
Add the parameters below between the empty quotes.
Parameters for ray_cs driver which may be specified in ray_cs.opts:
bc integer 0 = normal mode (802.11 timing)
1 = slow down inter frame timing to allow
operation with older breezecom access
points.
beacon_period integer beacon period in Kilo-microseconds
legal values = must be integer multiple
of hop dwell
default = 256
country integer 1 = USA (default)
2 = Europe
3 = Japan
4 = Korea
5 = Spain
6 = France
7 = Israel
8 = Australia
=============== =============== =============================================
bc integer 0 = normal mode (802.11 timing),
1 = slow down inter frame timing to allow
operation with older breezecom access
points.
beacon_period integer beacon period in Kilo-microseconds,
legal values = must be integer multiple
of hop dwell
default = 256
country integer 1 = USA (default),
2 = Europe,
3 = Japan,
4 = Korea,
5 = Spain,
6 = France,
7 = Israel,
8 = Australia
essid string ESS ID - network name to join
string with maximum length of 32 chars
default value = "ADHOC_ESSID"
hop_dwell integer hop dwell time in Kilo-microseconds
hop_dwell integer hop dwell time in Kilo-microseconds
legal values = 16,32,64,128(default),256
irq_mask integer linux standard 16 bit value 1bit/IRQ
lsb is IRQ 0, bit 1 is IRQ 1 etc.
Used to restrict choice of IRQ's to use.
Recommended method for controlling
interrupts is in /etc/pcmcia/config.opts
Recommended method for controlling
interrupts is in /etc/pcmcia/config.opts
net_type integer 0 (default) = adhoc network,
net_type integer 0 (default) = adhoc network,
1 = infrastructure
phy_addr string string containing new MAC address in
hex, must start with x eg
x00008f123456
psm integer 0 = continuously active
psm integer 0 = continuously active,
1 = power save mode (not useful yet)
pc_debug integer (0-5) larger values for more verbose
......@@ -114,14 +129,14 @@ ray_debug integer Replaced with pc_debug
ray_mem_speed integer defaults to 500
sniffer integer 0 = not sniffer (default)
1 = sniffer which can be used to record all
network traffic using tcpdump or similar,
but no normal network use is allowed.
sniffer integer 0 = not sniffer (default),
1 = sniffer which can be used to record all
network traffic using tcpdump or similar,
but no normal network use is allowed.
translate integer 0 = no translation (encapsulate frames)
translate integer 0 = no translation (encapsulate frames),
1 = translation (RFC1042/802.1)
=============== =============== =============================================
More on sniffer mode:
......@@ -136,7 +151,7 @@ package which parses the 802.11 headers.
Known Problems and missing features
Does not work with non x86
Does not work with non x86
Does not work with SMP
......
.. SPDX-License-Identifier: GPL-2.0
==
RDS
===
Overview
========
......@@ -24,36 +29,39 @@ as IB.
The high-level semantics of RDS from the application's point of view are
* Addressing
RDS uses IPv4 addresses and 16bit port numbers to identify
the end point of a connection. All socket operations that involve
passing addresses between kernel and user space generally
use a struct sockaddr_in.
The fact that IPv4 addresses are used does not mean the underlying
transport has to be IP-based. In fact, RDS over IB uses a
reliable IB connection; the IP address is used exclusively to
locate the remote node's GID (by ARPing for the given IP).
RDS uses IPv4 addresses and 16bit port numbers to identify
the end point of a connection. All socket operations that involve
passing addresses between kernel and user space generally
use a struct sockaddr_in.
The fact that IPv4 addresses are used does not mean the underlying
transport has to be IP-based. In fact, RDS over IB uses a
reliable IB connection; the IP address is used exclusively to
locate the remote node's GID (by ARPing for the given IP).
The port space is entirely independent of UDP, TCP or any other
protocol.
The port space is entirely independent of UDP, TCP or any other
protocol.
* Socket interface
RDS sockets work *mostly* as you would expect from a BSD
socket. The next section will cover the details. At any rate,
all I/O is performed through the standard BSD socket API.
Some additions like zerocopy support are implemented through
control messages, while other extensions use the getsockopt/
setsockopt calls.
Sockets must be bound before you can send or receive data.
This is needed because binding also selects a transport and
attaches it to the socket. Once bound, the transport assignment
does not change. RDS will tolerate IPs moving around (eg in
a active-active HA scenario), but only as long as the address
doesn't move to a different transport.
RDS sockets work *mostly* as you would expect from a BSD
socket. The next section will cover the details. At any rate,
all I/O is performed through the standard BSD socket API.
Some additions like zerocopy support are implemented through
control messages, while other extensions use the getsockopt/
setsockopt calls.
Sockets must be bound before you can send or receive data.
This is needed because binding also selects a transport and
attaches it to the socket. Once bound, the transport assignment
does not change. RDS will tolerate IPs moving around (eg in
a active-active HA scenario), but only as long as the address
doesn't move to a different transport.
* sysctls
RDS supports a number of sysctls in /proc/sys/net/rds
RDS supports a number of sysctls in /proc/sys/net/rds
Socket Interface
......@@ -66,89 +74,88 @@ Socket Interface
options.
fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
This creates a new, unbound RDS socket.
This creates a new, unbound RDS socket.
setsockopt(SOL_SOCKET): send and receive buffer size
RDS honors the send and receive buffer size socket options.
You are not allowed to queue more than SO_SNDSIZE bytes to
a socket. A message is queued when sendmsg is called, and
it leaves the queue when the remote system acknowledges
its arrival.
The SO_RCVSIZE option controls the maximum receive queue length.
This is a soft limit rather than a hard limit - RDS will
continue to accept and queue incoming messages, even if that
takes the queue length over the limit. However, it will also
mark the port as "congested" and send a congestion update to
the source node. The source node is supposed to throttle any
processes sending to this congested port.
RDS honors the send and receive buffer size socket options.
You are not allowed to queue more than SO_SNDSIZE bytes to
a socket. A message is queued when sendmsg is called, and
it leaves the queue when the remote system acknowledges
its arrival.
The SO_RCVSIZE option controls the maximum receive queue length.
This is a soft limit rather than a hard limit - RDS will
continue to accept and queue incoming messages, even if that
takes the queue length over the limit. However, it will also
mark the port as "congested" and send a congestion update to
the source node. The source node is supposed to throttle any
processes sending to this congested port.
bind(fd, &sockaddr_in, ...)
This binds the socket to a local IP address and port, and a
transport, if one has not already been selected via the
This binds the socket to a local IP address and port, and a
transport, if one has not already been selected via the
SO_RDS_TRANSPORT socket option
sendmsg(fd, ...)
Sends a message to the indicated recipient. The kernel will
transparently establish the underlying reliable connection
if it isn't up yet.
Sends a message to the indicated recipient. The kernel will
transparently establish the underlying reliable connection
if it isn't up yet.
An attempt to send a message that exceeds SO_SNDSIZE will
return with -EMSGSIZE
An attempt to send a message that exceeds SO_SNDSIZE will
return with -EMSGSIZE
An attempt to send a message that would take the total number
of queued bytes over the SO_SNDSIZE threshold will return
EAGAIN.
An attempt to send a message that would take the total number
of queued bytes over the SO_SNDSIZE threshold will return
EAGAIN.
An attempt to send a message to a destination that is marked
as "congested" will return ENOBUFS.
An attempt to send a message to a destination that is marked
as "congested" will return ENOBUFS.
recvmsg(fd, ...)
Receives a message that was queued to this socket. The sockets
recv queue accounting is adjusted, and if the queue length
drops below SO_SNDSIZE, the port is marked uncongested, and
a congestion update is sent to all peers.
Applications can ask the RDS kernel module to receive
notifications via control messages (for instance, there is a
notification when a congestion update arrived, or when a RDMA
operation completes). These notifications are received through
the msg.msg_control buffer of struct msghdr. The format of the
messages is described in manpages.
Receives a message that was queued to this socket. The sockets
recv queue accounting is adjusted, and if the queue length
drops below SO_SNDSIZE, the port is marked uncongested, and
a congestion update is sent to all peers.
Applications can ask the RDS kernel module to receive
notifications via control messages (for instance, there is a
notification when a congestion update arrived, or when a RDMA
operation completes). These notifications are received through
the msg.msg_control buffer of struct msghdr. The format of the
messages is described in manpages.
poll(fd)
RDS supports the poll interface to allow the application
to implement async I/O.
RDS supports the poll interface to allow the application
to implement async I/O.
POLLIN handling is pretty straightforward. When there's an
incoming message queued to the socket, or a pending notification,
we signal POLLIN.
POLLIN handling is pretty straightforward. When there's an
incoming message queued to the socket, or a pending notification,
we signal POLLIN.
POLLOUT is a little harder. Since you can essentially send
to any destination, RDS will always signal POLLOUT as long as
there's room on the send queue (ie the number of bytes queued
is less than the sendbuf size).
POLLOUT is a little harder. Since you can essentially send
to any destination, RDS will always signal POLLOUT as long as
there's room on the send queue (ie the number of bytes queued
is less than the sendbuf size).
However, the kernel will refuse to accept messages to
a destination marked congested - in this case you will loop
forever if you rely on poll to tell you what to do.
This isn't a trivial problem, but applications can deal with
this - by using congestion notifications, and by checking for
ENOBUFS errors returned by sendmsg.
However, the kernel will refuse to accept messages to
a destination marked congested - in this case you will loop
forever if you rely on poll to tell you what to do.
This isn't a trivial problem, but applications can deal with
this - by using congestion notifications, and by checking for
ENOBUFS errors returned by sendmsg.
setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
This allows the application to discard all messages queued to a
specific destination on this particular socket.
This allows the application to cancel outstanding messages if
it detects a timeout. For instance, if it tried to send a message,
and the remote host is unreachable, RDS will keep trying forever.
The application may decide it's not worth it, and cancel the
operation. In this case, it would use RDS_CANCEL_SENT_TO to
nuke any pending messages.
setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
This allows the application to discard all messages queued to a
specific destination on this particular socket.
This allows the application to cancel outstanding messages if
it detects a timeout. For instance, if it tried to send a message,
and the remote host is unreachable, RDS will keep trying forever.
The application may decide it's not worth it, and cancel the
operation. In this case, it would use RDS_CANCEL_SENT_TO to
nuke any pending messages.
``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)``
Set or read an integer defining the underlying
encapsulating transport to be used for RDS packets on the
socket. When setting the option, integer argument may be
......@@ -180,32 +187,39 @@ RDS Protocol
Message header
The message header is a 'struct rds_header' (see rds.h):
Fields:
h_sequence:
per-packet sequence number
per-packet sequence number
h_ack:
piggybacked acknowledgment of last packet received
piggybacked acknowledgment of last packet received
h_len:
length of data, not including header
length of data, not including header
h_sport:
source port
source port
h_dport:
destination port
destination port
h_flags:
CONG_BITMAP - this is a congestion update bitmap
ACK_REQUIRED - receiver must ack this packet
RETRANSMITTED - packet has previously been sent
Can be:
============= ==================================
CONG_BITMAP this is a congestion update bitmap
ACK_REQUIRED receiver must ack this packet
RETRANSMITTED packet has previously been sent
============= ==================================
h_credit:
indicate to other end of connection that
it has more credits available (i.e. there is
more send room)
indicate to other end of connection that
it has more credits available (i.e. there is
more send room)
h_padding[4]:
unused, for future use
unused, for future use
h_csum:
header checksum
header checksum
h_exthdr:
optional data can be passed here. This is currently used for
passing RDMA-related information.
optional data can be passed here. This is currently used for
passing RDMA-related information.
ACK and retransmit handling
......@@ -260,7 +274,7 @@ RDS Protocol
RDS Transport Layer
==================
===================
As mentioned above, RDS is not IB-specific. Its code is divided
into a general RDS layer and a transport layer.
......@@ -281,19 +295,25 @@ RDS Kernel Structures
be sent and sets header fields as needed, based on the socket API.
This is then queued for the individual connection and sent by the
connection's transport.
struct rds_incoming
a generic struct referring to incoming data that can be handed from
the transport to the general code and queued by the general code
while the socket is awoken. It is then passed back to the transport
code to handle the actual copy-to-user.
struct rds_socket
per-socket information
struct rds_connection
per-connection information
struct rds_transport
pointers to transport-specific functions
struct rds_statistics
non-transport-specific statistics
struct rds_cong_map
wraps the raw congestion bitmap, contains rbnode, waitq, etc.
......@@ -317,53 +337,58 @@ The send path
=============
rds_sendmsg()
struct rds_message built from incoming data
CMSGs parsed (e.g. RDMA ops)
transport connection alloced and connected if not already
rds_message placed on send queue
send worker awoken
- struct rds_message built from incoming data
- CMSGs parsed (e.g. RDMA ops)
- transport connection alloced and connected if not already
- rds_message placed on send queue
- send worker awoken
rds_send_worker()
calls rds_send_xmit() until queue is empty
- calls rds_send_xmit() until queue is empty
rds_send_xmit()
transmits congestion map if one is pending
may set ACK_REQUIRED
calls transport to send either non-RDMA or RDMA message
(RDMA ops never retransmitted)
- transmits congestion map if one is pending
- may set ACK_REQUIRED
- calls transport to send either non-RDMA or RDMA message
(RDMA ops never retransmitted)
rds_ib_xmit()
allocs work requests from send ring
adds any new send credits available to peer (h_credits)
maps the rds_message's sg list
piggybacks ack
populates work requests
post send to connection's queue pair
- allocs work requests from send ring
- adds any new send credits available to peer (h_credits)
- maps the rds_message's sg list
- piggybacks ack
- populates work requests
- post send to connection's queue pair
The recv path
=============
rds_ib_recv_cq_comp_handler()
looks at write completions
unmaps recv buffer from device
no errors, call rds_ib_process_recv()
refill recv ring
- looks at write completions
- unmaps recv buffer from device
- no errors, call rds_ib_process_recv()
- refill recv ring
rds_ib_process_recv()
validate header checksum
copy header to rds_ib_incoming struct if start of a new datagram
add to ibinc's fraglist
if competed datagram:
update cong map if datagram was cong update
call rds_recv_incoming() otherwise
note if ack is required
- validate header checksum
- copy header to rds_ib_incoming struct if start of a new datagram
- add to ibinc's fraglist
- if competed datagram:
- update cong map if datagram was cong update
- call rds_recv_incoming() otherwise
- note if ack is required
rds_recv_incoming()
drop duplicate packets
respond to pings
find the sock associated with this datagram
add to sock queue
wake up sock
do some congestion calculations
- drop duplicate packets
- respond to pings
- find the sock associated with this datagram
- add to sock queue
- wake up sock
- do some congestion calculations
rds_recvmsg
copy data into user iovec
handle CMSGs
return to application
- copy data into user iovec
- handle CMSGs
- return to application
Multipath RDS (mprds)
=====================
......
.. SPDX-License-Identifier: GPL-2.0
=======================================
Linux wireless regulatory documentation
---------------------------------------
=======================================
This document gives a brief review over how the Linux wireless
regulatory infrastructure works.
......@@ -57,7 +60,7 @@ Users can use iw:
http://wireless.kernel.org/en/users/Documentation/iw
An example:
An example::
# set regulatory domain to "Costa Rica"
iw reg set CR
......@@ -104,9 +107,9 @@ Example code - drivers hinting an alpha2:
This example comes from the zd1211rw device driver. You can start
by having a mapping of your device's EEPROM country/regulatory
domain value to a specific alpha2 as follows:
domain value to a specific alpha2 as follows::
static struct zd_reg_alpha2_map reg_alpha2_map[] = {
static struct zd_reg_alpha2_map reg_alpha2_map[] = {
{ ZD_REGDOMAIN_FCC, "US" },
{ ZD_REGDOMAIN_IC, "CA" },
{ ZD_REGDOMAIN_ETSI, "DE" }, /* Generic ETSI, use most restrictive */
......@@ -116,10 +119,10 @@ static struct zd_reg_alpha2_map reg_alpha2_map[] = {
{ ZD_REGDOMAIN_FRANCE, "FR" },
Then you can define a routine to map your read EEPROM value to an alpha2,
as follows:
as follows::
static int zd_reg2alpha2(u8 regdomain, char *alpha2)
{
static int zd_reg2alpha2(u8 regdomain, char *alpha2)
{
unsigned int i;
struct zd_reg_alpha2_map *reg_map;
for (i = 0; i < ARRAY_SIZE(reg_alpha2_map); i++) {
......@@ -131,12 +134,14 @@ static int zd_reg2alpha2(u8 regdomain, char *alpha2)
}
}
return 1;
}
}
Lastly, you can then hint to the core of your discovered alpha2, if a match
was found. You need to do this after you have registered your wiphy. You
are expected to do this during initialization.
::
r = zd_reg2alpha2(mac->regdomain, alpha2);
if (!r)
regulatory_hint(hw->wiphy, alpha2);
......@@ -156,9 +161,9 @@ call regulatory_hint() with the regulatory domain structure in it.
Bellow is a simple example, with a regulatory domain cached using the stack.
Your implementation may vary (read EEPROM cache instead, for example).
Example cache of some regulatory domain
Example cache of some regulatory domain::
struct ieee80211_regdomain mydriver_jp_regdom = {
struct ieee80211_regdomain mydriver_jp_regdom = {
.n_reg_rules = 3,
.alpha2 = "JP",
//.alpha2 = "99", /* If I have no alpha2 to map it to */
......@@ -173,9 +178,9 @@ struct ieee80211_regdomain mydriver_jp_regdom = {
NL80211_RRF_NO_IR|
NL80211_RRF_DFS),
}
};
};
Then in some part of your code after your wiphy has been registered:
Then in some part of your code after your wiphy has been registered::
struct ieee80211_regdomain *rd;
int size_of_regd;
......
======================
RxRPC NETWORK PROTOCOL
======================
.. SPDX-License-Identifier: GPL-2.0
======================
RxRPC Network Protocol
======================
The RxRPC protocol driver provides a reliable two-phase transport on top of UDP
that can be used to perform RxRPC remote operations. This is done over sockets
......@@ -9,36 +11,35 @@ receive data, aborts and errors.
Contents of this document:
(*) Overview.
(#) Overview.
(*) RxRPC protocol summary.
(#) RxRPC protocol summary.
(*) AF_RXRPC driver model.
(#) AF_RXRPC driver model.
(*) Control messages.
(#) Control messages.
(*) Socket options.
(#) Socket options.
(*) Security.
(#) Security.
(*) Example client usage.
(#) Example client usage.
(*) Example server usage.
(#) Example server usage.
(*) AF_RXRPC kernel interface.
(#) AF_RXRPC kernel interface.
(*) Configurable parameters.
(#) Configurable parameters.
========
OVERVIEW
Overview
========
RxRPC is a two-layer protocol. There is a session layer which provides
reliable virtual connections using UDP over IPv4 (or IPv6) as the transport
layer, but implements a real network protocol; and there's the presentation
layer which renders structured data to binary blobs and back again using XDR
(as does SunRPC):
(as does SunRPC)::
+-------------+
| Application |
......@@ -85,31 +86,30 @@ The Andrew File System (AFS) is an example of an application that uses this and
that has both kernel (filesystem) and userspace (utility) components.
======================
RXRPC PROTOCOL SUMMARY
RxRPC Protocol Summary
======================
An overview of the RxRPC protocol:
(*) RxRPC sits on top of another networking protocol (UDP is the only option
(#) RxRPC sits on top of another networking protocol (UDP is the only option
currently), and uses this to provide network transport. UDP ports, for
example, provide transport endpoints.
(*) RxRPC supports multiple virtual "connections" from any given transport
(#) RxRPC supports multiple virtual "connections" from any given transport
endpoint, thus allowing the endpoints to be shared, even to the same
remote endpoint.
(*) Each connection goes to a particular "service". A connection may not go
(#) Each connection goes to a particular "service". A connection may not go
to multiple services. A service may be considered the RxRPC equivalent of
a port number. AF_RXRPC permits multiple services to share an endpoint.
(*) Client-originating packets are marked, thus a transport endpoint can be
(#) Client-originating packets are marked, thus a transport endpoint can be
shared between client and server connections (connections have a
direction).
(*) Up to a billion connections may be supported concurrently between one
(#) Up to a billion connections may be supported concurrently between one
local transport endpoint and one service on one remote endpoint. An RxRPC
connection is described by seven numbers:
connection is described by seven numbers::
Local address }
Local port } Transport (UDP) address
......@@ -119,22 +119,22 @@ An overview of the RxRPC protocol:
Connection ID
Service ID
(*) Each RxRPC operation is a "call". A connection may make up to four
(#) Each RxRPC operation is a "call". A connection may make up to four
billion calls, but only up to four calls may be in progress on a
connection at any one time.
(*) Calls are two-phase and asymmetric: the client sends its request data,
(#) Calls are two-phase and asymmetric: the client sends its request data,
which the service receives; then the service transmits the reply data
which the client receives.
(*) The data blobs are of indefinite size, the end of a phase is marked with a
(#) The data blobs are of indefinite size, the end of a phase is marked with a
flag in the packet. The number of packets of data making up one blob may
not exceed 4 billion, however, as this would cause the sequence number to
wrap.
(*) The first four bytes of the request data are the service operation ID.
(#) The first four bytes of the request data are the service operation ID.
(*) Security is negotiated on a per-connection basis. The connection is
(#) Security is negotiated on a per-connection basis. The connection is
initiated by the first data packet on it arriving. If security is
requested, the server then issues a "challenge" and then the client
replies with a "response". If the response is successful, the security is
......@@ -143,146 +143,145 @@ An overview of the RxRPC protocol:
connection lapse before the client, the security will be renegotiated if
the client uses the connection again.
(*) Calls use ACK packets to handle reliability. Data packets are also
(#) Calls use ACK packets to handle reliability. Data packets are also
explicitly sequenced per call.
(*) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs.
(#) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs.
A hard-ACK indicates to the far side that all the data received to a point
has been received and processed; a soft-ACK indicates that the data has
been received but may yet be discarded and re-requested. The sender may
not discard any transmittable packets until they've been hard-ACK'd.
(*) Reception of a reply data packet implicitly hard-ACK's all the data
(#) Reception of a reply data packet implicitly hard-ACK's all the data
packets that make up the request.
(*) An call is complete when the request has been sent, the reply has been
(#) An call is complete when the request has been sent, the reply has been
received and the final hard-ACK on the last packet of the reply has
reached the server.
(*) An call may be aborted by either end at any time up to its completion.
(#) An call may be aborted by either end at any time up to its completion.
=====================
AF_RXRPC DRIVER MODEL
AF_RXRPC Driver Model
=====================
About the AF_RXRPC driver:
(*) The AF_RXRPC protocol transparently uses internal sockets of the transport
(#) The AF_RXRPC protocol transparently uses internal sockets of the transport
protocol to represent transport endpoints.
(*) AF_RXRPC sockets map onto RxRPC connection bundles. Actual RxRPC
(#) AF_RXRPC sockets map onto RxRPC connection bundles. Actual RxRPC
connections are handled transparently. One client socket may be used to
make multiple simultaneous calls to the same service. One server socket
may handle calls from many clients.
(*) Additional parallel client connections will be initiated to support extra
(#) Additional parallel client connections will be initiated to support extra
concurrent calls, up to a tunable limit.
(*) Each connection is retained for a certain amount of time [tunable] after
(#) Each connection is retained for a certain amount of time [tunable] after
the last call currently using it has completed in case a new call is made
that could reuse it.
(*) Each internal UDP socket is retained [tunable] for a certain amount of
(#) Each internal UDP socket is retained [tunable] for a certain amount of
time [tunable] after the last connection using it discarded, in case a new
connection is made that could use it.
(*) A client-side connection is only shared between calls if they have have
(#) A client-side connection is only shared between calls if they have have
the same key struct describing their security (and assuming the calls
would otherwise share the connection). Non-secured calls would also be
able to share connections with each other.
(*) A server-side connection is shared if the client says it is.
(#) A server-side connection is shared if the client says it is.
(*) ACK'ing is handled by the protocol driver automatically, including ping
(#) ACK'ing is handled by the protocol driver automatically, including ping
replying.
(*) SO_KEEPALIVE automatically pings the other side to keep the connection
(#) SO_KEEPALIVE automatically pings the other side to keep the connection
alive [TODO].
(*) If an ICMP error is received, all calls affected by that error will be
(#) If an ICMP error is received, all calls affected by that error will be
aborted with an appropriate network error passed through recvmsg().
Interaction with the user of the RxRPC socket:
(*) A socket is made into a server socket by binding an address with a
(#) A socket is made into a server socket by binding an address with a
non-zero service ID.
(*) In the client, sending a request is achieved with one or more sendmsgs,
(#) In the client, sending a request is achieved with one or more sendmsgs,
followed by the reply being received with one or more recvmsgs.
(*) The first sendmsg for a request to be sent from a client contains a tag to
(#) The first sendmsg for a request to be sent from a client contains a tag to
be used in all other sendmsgs or recvmsgs associated with that call. The
tag is carried in the control data.
(*) connect() is used to supply a default destination address for a client
(#) connect() is used to supply a default destination address for a client
socket. This may be overridden by supplying an alternate address to the
first sendmsg() of a call (struct msghdr::msg_name).
(*) If connect() is called on an unbound client, a random local port will
(#) If connect() is called on an unbound client, a random local port will
bound before the operation takes place.
(*) A server socket may also be used to make client calls. To do this, the
(#) A server socket may also be used to make client calls. To do this, the
first sendmsg() of the call must specify the target address. The server's
transport endpoint is used to send the packets.
(*) Once the application has received the last message associated with a call,
(#) Once the application has received the last message associated with a call,
the tag is guaranteed not to be seen again, and so it can be used to pin
client resources. A new call can then be initiated with the same tag
without fear of interference.
(*) In the server, a request is received with one or more recvmsgs, then the
(#) In the server, a request is received with one or more recvmsgs, then the
the reply is transmitted with one or more sendmsgs, and then the final ACK
is received with a last recvmsg.
(*) When sending data for a call, sendmsg is given MSG_MORE if there's more
(#) When sending data for a call, sendmsg is given MSG_MORE if there's more
data to come on that call.
(*) When receiving data for a call, recvmsg flags MSG_MORE if there's more
(#) When receiving data for a call, recvmsg flags MSG_MORE if there's more
data to come for that call.
(*) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg
(#) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg
to indicate the terminal message for that call.
(*) A call may be aborted by adding an abort control message to the control
(#) A call may be aborted by adding an abort control message to the control
data. Issuing an abort terminates the kernel's use of that call's tag.
Any messages waiting in the receive queue for that call will be discarded.
(*) Aborts, busy notifications and challenge packets are delivered by recvmsg,
(#) Aborts, busy notifications and challenge packets are delivered by recvmsg,
and control data messages will be set to indicate the context. Receiving
an abort or a busy message terminates the kernel's use of that call's tag.
(*) The control data part of the msghdr struct is used for a number of things:
(#) The control data part of the msghdr struct is used for a number of things:
(*) The tag of the intended or affected call.
(#) The tag of the intended or affected call.
(*) Sending or receiving errors, aborts and busy notifications.
(#) Sending or receiving errors, aborts and busy notifications.
(*) Notifications of incoming calls.
(#) Notifications of incoming calls.
(*) Sending debug requests and receiving debug replies [TODO].
(#) Sending debug requests and receiving debug replies [TODO].
(*) When the kernel has received and set up an incoming call, it sends a
(#) When the kernel has received and set up an incoming call, it sends a
message to server application to let it know there's a new call awaiting
its acceptance [recvmsg reports a special control message]. The server
application then uses sendmsg to assign a tag to the new call. Once that
is done, the first part of the request data will be delivered by recvmsg.
(*) The server application has to provide the server socket with a keyring of
(#) The server application has to provide the server socket with a keyring of
secret keys corresponding to the security types it permits. When a secure
connection is being set up, the kernel looks up the appropriate secret key
in the keyring and then sends a challenge packet to the client and
receives a response packet. The kernel then checks the authorisation of
the packet and either aborts the connection or sets up the security.
(*) The name of the key a client will use to secure its communications is
(#) The name of the key a client will use to secure its communications is
nominated by a socket option.
Notes on sendmsg:
(*) MSG_WAITALL can be set to tell sendmsg to ignore signals if the peer is
(#) MSG_WAITALL can be set to tell sendmsg to ignore signals if the peer is
making progress at accepting packets within a reasonable time such that we
manage to queue up all the data for transmission. This requires the
client to accept at least one packet per 2*RTT time period.
......@@ -294,7 +293,7 @@ Notes on sendmsg:
Notes on recvmsg:
(*) If there's a sequence of data messages belonging to a particular call on
(#) If there's a sequence of data messages belonging to a particular call on
the receive queue, then recvmsg will keep working through them until:
(a) it meets the end of that call's received data,
......@@ -320,13 +319,13 @@ Notes on recvmsg:
flagged.
================
CONTROL MESSAGES
Control Messages
================
AF_RXRPC makes use of control messages in sendmsg() and recvmsg() to multiplex
calls, to invoke certain actions and to report certain conditions. These are:
======================= === =========== ===============================
MESSAGE ID SRT DATA MEANING
======================= === =========== ===============================
RXRPC_USER_CALL_ID sr- User ID App's call specifier
......@@ -340,10 +339,11 @@ calls, to invoke certain actions and to report certain conditions. These are:
RXRPC_EXCLUSIVE_CALL s-- n/a Make an exclusive client call
RXRPC_UPGRADE_SERVICE s-- n/a Client call can be upgraded
RXRPC_TX_LENGTH s-- data len Total length of Tx data
======================= === =========== ===============================
(SRT = usable in Sendmsg / delivered by Recvmsg / Terminal message)
(*) RXRPC_USER_CALL_ID
(#) RXRPC_USER_CALL_ID
This is used to indicate the application's call ID. It's an unsigned long
that the app specifies in the client by attaching it to the first data
......@@ -351,7 +351,7 @@ calls, to invoke certain actions and to report certain conditions. These are:
message. recvmsg() passes it in conjunction with all messages except
those of the RXRPC_NEW_CALL message.
(*) RXRPC_ABORT
(#) RXRPC_ABORT
This is can be used by an application to abort a call by passing it to
sendmsg, or it can be delivered by recvmsg to indicate a remote abort was
......@@ -359,13 +359,13 @@ calls, to invoke certain actions and to report certain conditions. These are:
specify the call affected. If an abort is being sent, then error EBADSLT
will be returned if there is no call with that user ID.
(*) RXRPC_ACK
(#) RXRPC_ACK
This is delivered to a server application to indicate that the final ACK
of a call was received from the client. It will be associated with an
RXRPC_USER_CALL_ID to indicate the call that's now complete.
(*) RXRPC_NET_ERROR
(#) RXRPC_NET_ERROR
This is delivered to an application to indicate that an ICMP error message
was encountered in the process of trying to talk to the peer. An
......@@ -373,13 +373,13 @@ calls, to invoke certain actions and to report certain conditions. These are:
indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call
affected.
(*) RXRPC_BUSY
(#) RXRPC_BUSY
This is delivered to a client application to indicate that a call was
rejected by the server due to the server being busy. It will be
associated with an RXRPC_USER_CALL_ID to indicate the rejected call.
(*) RXRPC_LOCAL_ERROR
(#) RXRPC_LOCAL_ERROR
This is delivered to an application to indicate that a local error was
encountered and that a call has been aborted because of it. An
......@@ -387,13 +387,13 @@ calls, to invoke certain actions and to report certain conditions. These are:
indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call
affected.
(*) RXRPC_NEW_CALL
(#) RXRPC_NEW_CALL
This is delivered to indicate to a server application that a new call has
arrived and is awaiting acceptance. No user ID is associated with this,
as a user ID must subsequently be assigned by doing an RXRPC_ACCEPT.
(*) RXRPC_ACCEPT
(#) RXRPC_ACCEPT
This is used by a server application to attempt to accept a call and
assign it a user ID. It should be associated with an RXRPC_USER_CALL_ID
......@@ -402,12 +402,12 @@ calls, to invoke certain actions and to report certain conditions. These are:
return error ENODATA. If the user ID is already in use by another call,
then error EBADSLT will be returned.
(*) RXRPC_EXCLUSIVE_CALL
(#) RXRPC_EXCLUSIVE_CALL
This is used to indicate that a client call should be made on a one-off
connection. The connection is discarded once the call has terminated.
(*) RXRPC_UPGRADE_SERVICE
(#) RXRPC_UPGRADE_SERVICE
This is used to make a client call to probe if the specified service ID
may be upgraded by the server. The caller must check msg_name returned to
......@@ -419,7 +419,7 @@ calls, to invoke certain actions and to report certain conditions. These are:
future communication to that server and RXRPC_UPGRADE_SERVICE should no
longer be set.
(*) RXRPC_TX_LENGTH
(#) RXRPC_TX_LENGTH
This is used to inform the kernel of the total amount of data that is
going to be transmitted by a call (whether in a client request or a
......@@ -443,7 +443,7 @@ SOCKET OPTIONS
AF_RXRPC sockets support a few socket options at the SOL_RXRPC level:
(*) RXRPC_SECURITY_KEY
(#) RXRPC_SECURITY_KEY
This is used to specify the description of the key to be used. The key is
extracted from the calling process's keyrings with request_key() and
......@@ -452,17 +452,17 @@ AF_RXRPC sockets support a few socket options at the SOL_RXRPC level:
The optval pointer points to the description string, and optlen indicates
how long the string is, without the NUL terminator.
(*) RXRPC_SECURITY_KEYRING
(#) RXRPC_SECURITY_KEYRING
Similar to above but specifies a keyring of server secret keys to use (key
type "keyring"). See the "Security" section.
(*) RXRPC_EXCLUSIVE_CONNECTION
(#) RXRPC_EXCLUSIVE_CONNECTION
This is used to request that new connections should be used for each call
made subsequently on this socket. optval should be NULL and optlen 0.
(*) RXRPC_MIN_SECURITY_LEVEL
(#) RXRPC_MIN_SECURITY_LEVEL
This is used to specify the minimum security level required for calls on
this socket. optval must point to an int containing one of the following
......@@ -482,14 +482,14 @@ AF_RXRPC sockets support a few socket options at the SOL_RXRPC level:
Encrypted checksum plus entire packet padded and encrypted, including
actual packet length.
(*) RXRPC_UPGRADEABLE_SERVICE
(#) RXRPC_UPGRADEABLE_SERVICE
This is used to indicate that a service socket with two bindings may
upgrade one bound service to the other if requested by the client. optval
must point to an array of two unsigned short ints. The first is the
service ID to upgrade from and the second the service ID to upgrade to.
(*) RXRPC_SUPPORTED_CMSG
(#) RXRPC_SUPPORTED_CMSG
This is a read-only option that writes an int into the buffer indicating
the highest control message type supported.
......@@ -509,7 +509,7 @@ found at:
http://people.redhat.com/~dhowells/rxrpc/klog.c
The payload provided to add_key() on the client should be of the following
form:
form::
struct rxrpc_key_sec2_v1 {
uint16_t security_index; /* 2 */
......@@ -546,14 +546,14 @@ EXAMPLE CLIENT USAGE
A client would issue an operation by:
(1) An RxRPC socket is set up by:
(1) An RxRPC socket is set up by::
client = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);
Where the third parameter indicates the protocol family of the transport
socket used - usually IPv4 but it can also be IPv6 [TODO].
(2) A local address can optionally be bound:
(2) A local address can optionally be bound::
struct sockaddr_rxrpc srx = {
.srx_family = AF_RXRPC,
......@@ -570,20 +570,20 @@ A client would issue an operation by:
several unrelated RxRPC sockets. Security is handled on a basis of
per-RxRPC virtual connection.
(3) The security is set:
(3) The security is set::
const char *key = "AFS:cambridge.redhat.com";
setsockopt(client, SOL_RXRPC, RXRPC_SECURITY_KEY, key, strlen(key));
This issues a request_key() to get the key representing the security
context. The minimum security level can be set:
context. The minimum security level can be set::
unsigned int sec = RXRPC_SECURITY_ENCRYPTED;
setsockopt(client, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL,
&sec, sizeof(sec));
(4) The server to be contacted can then be specified (alternatively this can
be done through sendmsg):
be done through sendmsg)::
struct sockaddr_rxrpc srx = {
.srx_family = AF_RXRPC,
......@@ -598,7 +598,9 @@ A client would issue an operation by:
(5) The request data should then be posted to the server socket using a series
of sendmsg() calls, each with the following control message attached:
RXRPC_USER_CALL_ID - specifies the user ID for this call
================== ===================================
RXRPC_USER_CALL_ID specifies the user ID for this call
================== ===================================
MSG_MORE should be set in msghdr::msg_flags on all but the last part of
the request. Multiple requests may be made simultaneously.
......@@ -635,13 +637,12 @@ any more calls (further calls to the same destination will be blocked until the
probe is concluded).
====================
EXAMPLE SERVER USAGE
Example Server Usage
====================
A server would be set up to accept operations in the following manner:
(1) An RxRPC socket is created by:
(1) An RxRPC socket is created by::
server = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);
......@@ -649,7 +650,7 @@ A server would be set up to accept operations in the following manner:
socket used - usually IPv4.
(2) Security is set up if desired by giving the socket a keyring with server
secret keys in it:
secret keys in it::
keyring = add_key("keyring", "AFSkeys", NULL, 0,
KEY_SPEC_PROCESS_KEYRING);
......@@ -663,7 +664,7 @@ A server would be set up to accept operations in the following manner:
The keyring can be manipulated after it has been given to the socket. This
permits the server to add more keys, replace keys, etc. while it is live.
(3) A local address must then be bound:
(3) A local address must then be bound::
struct sockaddr_rxrpc srx = {
.srx_family = AF_RXRPC,
......@@ -680,7 +681,7 @@ A server would be set up to accept operations in the following manner:
should be called twice.
(4) If service upgrading is required, first two service IDs must have been
bound and then the following option must be set:
bound and then the following option must be set::
unsigned short service_ids[2] = { from_ID, to_ID };
setsockopt(server, SOL_RXRPC, RXRPC_UPGRADEABLE_SERVICE,
......@@ -690,14 +691,14 @@ A server would be set up to accept operations in the following manner:
to_ID if they request it. This will be reflected in msg_name obtained
through recvmsg() when the request data is delivered to userspace.
(5) The server is then set to listen out for incoming calls:
(5) The server is then set to listen out for incoming calls::
listen(server, 100);
(6) The kernel notifies the server of pending incoming connections by sending
it a message for each. This is received with recvmsg() on the server
socket. It has no data, and has a single dataless control message
attached:
attached::
RXRPC_NEW_CALL
......@@ -709,8 +710,10 @@ A server would be set up to accept operations in the following manner:
(7) The server then accepts the new call by issuing a sendmsg() with two
pieces of control data and no actual data:
RXRPC_ACCEPT - indicate connection acceptance
RXRPC_USER_CALL_ID - specify user ID for this call
================== ==============================
RXRPC_ACCEPT indicate connection acceptance
RXRPC_USER_CALL_ID specify user ID for this call
================== ==============================
(8) The first request data packet will then be posted to the server socket for
recvmsg() to pick up. At that point, the RxRPC address for the call can
......@@ -722,12 +725,17 @@ A server would be set up to accept operations in the following manner:
All data will be delivered with the following control message attached:
RXRPC_USER_CALL_ID - specifies the user ID for this call
================== ===================================
RXRPC_USER_CALL_ID specifies the user ID for this call
================== ===================================
(9) The reply data should then be posted to the server socket using a series
of sendmsg() calls, each with the following control messages attached:
RXRPC_USER_CALL_ID - specifies the user ID for this call
================== ===================================
RXRPC_USER_CALL_ID specifies the user ID for this call
================== ===================================
MSG_MORE should be set in msghdr::msg_flags on all but the last message
for a particular call.
......@@ -736,8 +744,10 @@ A server would be set up to accept operations in the following manner:
when it is received. It will take the form of a dataless message with two
control messages attached:
RXRPC_USER_CALL_ID - specifies the user ID for this call
RXRPC_ACK - indicates final ACK (no data)
================== ===================================
RXRPC_USER_CALL_ID specifies the user ID for this call
RXRPC_ACK indicates final ACK (no data)
================== ===================================
MSG_EOR will be flagged to indicate that this is the final message for
this call.
......@@ -746,8 +756,10 @@ A server would be set up to accept operations in the following manner:
aborted by calling sendmsg() with a dataless message with the following
control messages attached:
RXRPC_USER_CALL_ID - specifies the user ID for this call
RXRPC_ABORT - indicates abort code (4 byte data)
================== ===================================
RXRPC_USER_CALL_ID specifies the user ID for this call
RXRPC_ABORT indicates abort code (4 byte data)
================== ===================================
Any packets waiting in the socket's receive queue will be discarded if
this is issued.
......@@ -757,8 +769,7 @@ the one server socket, using control messages on sendmsg() and recvmsg() to
determine the call affected.
=========================
AF_RXRPC KERNEL INTERFACE
AF_RXRPC Kernel Interface
=========================
The AF_RXRPC module also provides an interface for use by in-kernel utilities
......@@ -786,7 +797,7 @@ then it passes this to the kernel interface functions.
The kernel interface functions are as follows:
(*) Begin a new client call.
(#) Begin a new client call::
struct rxrpc_call *
rxrpc_kernel_begin_call(struct socket *sock,
......@@ -837,7 +848,7 @@ The kernel interface functions are as follows:
returned. The caller now holds a reference on this and it must be
properly ended.
(*) End a client call.
(#) End a client call::
void rxrpc_kernel_end_call(struct socket *sock,
struct rxrpc_call *call);
......@@ -846,7 +857,7 @@ The kernel interface functions are as follows:
from AF_RXRPC's knowledge and will not be seen again in association with
the specified call.
(*) Send data through a call.
(#) Send data through a call::
typedef void (*rxrpc_notify_end_tx_t)(struct sock *sk,
unsigned long user_call_ID,
......@@ -872,7 +883,7 @@ The kernel interface functions are as follows:
called with the call-state spinlock held to prevent any reply or final ACK
from being delivered first.
(*) Receive data from a call.
(#) Receive data from a call::
int rxrpc_kernel_recv_data(struct socket *sock,
struct rxrpc_call *call,
......@@ -902,12 +913,14 @@ The kernel interface functions are as follows:
more data was available, EMSGSIZE is returned.
If a remote ABORT is detected, the abort code received will be stored in
*_abort and ECONNABORTED will be returned.
``*_abort`` and ECONNABORTED will be returned.
The service ID that the call ended up with is returned into *_service.
This can be used to see if a call got a service upgrade.
(*) Abort a call.
(#) Abort a call??
::
void rxrpc_kernel_abort_call(struct socket *sock,
struct rxrpc_call *call,
......@@ -916,7 +929,7 @@ The kernel interface functions are as follows:
This is used to abort a call if it's still in an abortable state. The
abort code specified will be placed in the ABORT message sent.
(*) Intercept received RxRPC messages.
(#) Intercept received RxRPC messages::
typedef void (*rxrpc_interceptor_t)(struct sock *sk,
unsigned long user_call_ID,
......@@ -937,7 +950,8 @@ The kernel interface functions are as follows:
The skb->mark field indicates the type of message:
MARK MEANING
=============================== =======================================
Mark Meaning
=============================== =======================================
RXRPC_SKB_MARK_DATA Data message
RXRPC_SKB_MARK_FINAL_ACK Final ACK received for an incoming call
......@@ -946,6 +960,7 @@ The kernel interface functions are as follows:
RXRPC_SKB_MARK_NET_ERROR Network error detected
RXRPC_SKB_MARK_LOCAL_ERROR Local error encountered
RXRPC_SKB_MARK_NEW_CALL New incoming call awaiting acceptance
=============================== =======================================
The remote abort message can be probed with rxrpc_kernel_get_abort_code().
The two error messages can be probed with rxrpc_kernel_get_error_number().
......@@ -961,7 +976,7 @@ The kernel interface functions are as follows:
is possible to get extra refs on all types of message for later freeing,
but this may pin the state of a call until the message is finally freed.
(*) Accept an incoming call.
(#) Accept an incoming call::
struct rxrpc_call *
rxrpc_kernel_accept_call(struct socket *sock,
......@@ -975,7 +990,7 @@ The kernel interface functions are as follows:
returned. The caller now holds a reference on this and it must be
properly ended.
(*) Reject an incoming call.
(#) Reject an incoming call::
int rxrpc_kernel_reject_call(struct socket *sock);
......@@ -984,21 +999,21 @@ The kernel interface functions are as follows:
Other errors may be returned if the call had been aborted (-ECONNABORTED)
or had timed out (-ETIME).
(*) Allocate a null key for doing anonymous security.
(#) Allocate a null key for doing anonymous security::
struct key *rxrpc_get_null_key(const char *keyname);
This is used to allocate a null RxRPC key that can be used to indicate
anonymous security for a particular domain.
(*) Get the peer address of a call.
(#) Get the peer address of a call::
void rxrpc_kernel_get_peer(struct socket *sock, struct rxrpc_call *call,
struct sockaddr_rxrpc *_srx);
This is used to find the remote peer address of a call.
(*) Set the total transmit data size on a call.
(#) Set the total transmit data size on a call::
void rxrpc_kernel_set_tx_length(struct socket *sock,
struct rxrpc_call *call,
......@@ -1009,14 +1024,14 @@ The kernel interface functions are as follows:
size should be set when the call is begun. tx_total_len may not be less
than zero.
(*) Get call RTT.
(#) Get call RTT::
u64 rxrpc_kernel_get_rtt(struct socket *sock, struct rxrpc_call *call);
Get the RTT time to the peer in use by a call. The value returned is in
nanoseconds.
(*) Check call still alive.
(#) Check call still alive::
bool rxrpc_kernel_check_life(struct socket *sock,
struct rxrpc_call *call,
......@@ -1024,7 +1039,7 @@ The kernel interface functions are as follows:
void rxrpc_kernel_probe_life(struct socket *sock,
struct rxrpc_call *call);
The first function passes back in *_life a number that is updated when
The first function passes back in ``*_life`` a number that is updated when
ACKs are received from the peer (notably including PING RESPONSE ACKs
which we can elicit by sending PING ACKs to see if the call still exists
on the server). The caller should compare the numbers of two calls to see
......@@ -1040,7 +1055,7 @@ The kernel interface functions are as follows:
first function to change. Note that this must be called in TASK_RUNNING
state.
(*) Get reply timestamp.
(#) Get reply timestamp::
bool rxrpc_kernel_get_reply_time(struct socket *sock,
struct rxrpc_call *call,
......@@ -1048,10 +1063,10 @@ The kernel interface functions are as follows:
This allows the timestamp on the first DATA packet of the reply of a
client call to be queried, provided that it is still in the Rx ring. If
successful, the timestamp will be stored into *_ts and true will be
successful, the timestamp will be stored into ``*_ts`` and true will be
returned; false will be returned otherwise.
(*) Get remote client epoch.
(#) Get remote client epoch::
u32 rxrpc_kernel_get_epoch(struct socket *sock,
struct rxrpc_call *call)
......@@ -1065,7 +1080,7 @@ The kernel interface functions are as follows:
This value can be used to determine if the remote client has been
restarted as it shouldn't change otherwise.
(*) Set the maxmimum lifespan on a call.
(#) Set the maxmimum lifespan on a call::
void rxrpc_kernel_set_max_life(struct socket *sock,
struct rxrpc_call *call,
......@@ -1076,14 +1091,13 @@ The kernel interface functions are as follows:
aborted and -ETIME or -ETIMEDOUT will be returned.
=======================
CONFIGURABLE PARAMETERS
Configurable Parameters
=======================
The RxRPC protocol driver has a number of configurable parameters that can be
adjusted through sysctls in /proc/net/rxrpc/:
(*) req_ack_delay
(#) req_ack_delay
The amount of time in milliseconds after receiving a packet with the
request-ack flag set before we honour the flag and actually send the
......@@ -1093,60 +1107,60 @@ adjusted through sysctls in /proc/net/rxrpc/:
reception window is full (to a maximum of 255 packets), so delaying the
ACK permits several packets to be ACK'd in one go.
(*) soft_ack_delay
(#) soft_ack_delay
The amount of time in milliseconds after receiving a new packet before we
generate a soft-ACK to tell the sender that it doesn't need to resend.
(*) idle_ack_delay
(#) idle_ack_delay
The amount of time in milliseconds after all the packets currently in the
received queue have been consumed before we generate a hard-ACK to tell
the sender it can free its buffers, assuming no other reason occurs that
we would send an ACK.
(*) resend_timeout
(#) resend_timeout
The amount of time in milliseconds after transmitting a packet before we
transmit it again, assuming no ACK is received from the receiver telling
us they got it.
(*) max_call_lifetime
(#) max_call_lifetime
The maximum amount of time in seconds that a call may be in progress
before we preemptively kill it.
(*) dead_call_expiry
(#) dead_call_expiry
The amount of time in seconds before we remove a dead call from the call
list. Dead calls are kept around for a little while for the purpose of
repeating ACK and ABORT packets.
(*) connection_expiry
(#) connection_expiry
The amount of time in seconds after a connection was last used before we
remove it from the connection list. While a connection is in existence,
it serves as a placeholder for negotiated security; when it is deleted,
the security must be renegotiated.
(*) transport_expiry
(#) transport_expiry
The amount of time in seconds after a transport was last used before we
remove it from the transport list. While a transport is in existence, it
serves to anchor the peer data and keeps the connection ID counter.
(*) rxrpc_rx_window_size
(#) rxrpc_rx_window_size
The size of the receive window in packets. This is the maximum number of
unconsumed received packets we're willing to hold in memory for any
particular call.
(*) rxrpc_rx_mtu
(#) rxrpc_rx_mtu
The maximum packet MTU size that we're willing to receive in bytes. This
indicates to the peer whether we're willing to accept jumbo packets.
(*) rxrpc_rx_jumbo_max
(#) rxrpc_rx_jumbo_max
The maximum number of packets that we're willing to accept in a jumbo
packet. Non-terminal packets in a jumbo packet must contain a four byte
......
Linux Kernel SCTP
.. SPDX-License-Identifier: GPL-2.0
=================
Linux Kernel SCTP
=================
This is the current BETA release of the Linux Kernel SCTP reference
implementation.
implementation.
SCTP (Stream Control Transmission Protocol) is a IP based, message oriented,
reliable transport protocol, with congestion control, support for
transparent multi-homing, and multiple ordered streams of messages.
RFC2960 defines the core protocol. The IETF SIGTRAN working group originally
developed the SCTP protocol and later handed the protocol over to the
Transport Area (TSVWG) working group for the continued evolvement of SCTP as a
general purpose transport.
developed the SCTP protocol and later handed the protocol over to the
Transport Area (TSVWG) working group for the continued evolvement of SCTP as a
general purpose transport.
See the IETF website (http://www.ietf.org) for further documents on SCTP.
See http://www.ietf.org/rfc/rfc2960.txt
See the IETF website (http://www.ietf.org) for further documents on SCTP.
See http://www.ietf.org/rfc/rfc2960.txt
The initial project goal is to create an Linux kernel reference implementation
of SCTP that is RFC 2960 compliant and provides an programming interface
referred to as the UDP-style API of the Sockets Extensions for SCTP, as
proposed in IETF Internet-Drafts.
of SCTP that is RFC 2960 compliant and provides an programming interface
referred to as the UDP-style API of the Sockets Extensions for SCTP, as
proposed in IETF Internet-Drafts.
Caveats:
Caveats
=======
-lksctp can be built as statically or as a module. However, be aware that
module removal of lksctp is not yet a safe activity.
- lksctp can be built as statically or as a module. However, be aware that
module removal of lksctp is not yet a safe activity.
-There is tentative support for IPv6, but most work has gone towards
implementation and testing lksctp on IPv4.
- There is tentative support for IPv6, but most work has gone towards
implementation and testing lksctp on IPv4.
For more information, please visit the lksctp project website:
http://www.sf.net/projects/lksctp
Or contact the lksctp developers through the mailing list:
<linux-sctp@vger.kernel.org>
.. SPDX-License-Identifier: GPL-2.0
=================
LSM/SeLinux secid
=================
flowi structure:
The secid member in the flow structure is used in LSMs (e.g. SELinux) to indicate
......
.. SPDX-License-Identifier: GPL-2.0
====================
Seg6 Sysfs variables
====================
/proc/sys/net/conf/<iface>/seg6_* variables:
============================================
seg6_enabled - BOOL
Accept or drop SR-enabled IPv6 packets on this interface.
Relevant packets are those with SRH present and DA = local.
0 - disabled (default)
not 0 - enabled
* 0 - disabled (default)
* not 0 - enabled
seg6_require_hmac - INTEGER
Define HMAC policy for ingress SR-enabled packets on this interface.
-1 - Ignore HMAC field
0 - Accept SR packets without HMAC, validate SR packets with HMAC
1 - Drop SR packets without HMAC, validate SR packets with HMAC
* -1 - Ignore HMAC field
* 0 - Accept SR packets without HMAC, validate SR packets with HMAC
* 1 - Drop SR packets without HMAC, validate SR packets with HMAC
Default is 0.
(C)Copyright 1998-2000 SysKonnect,
===========================================================================
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
========================
SysKonnect driver - SKFP
========================
|copy| Copyright 1998-2000 SysKonnect,
skfp.txt created 11-May-2000
Readme File for skfp.o v2.06
This file contains
(1) OVERVIEW
(2) SUPPORTED ADAPTERS
(3) GENERAL INFORMATION
(4) INSTALLATION
(5) INCLUSION OF THE ADAPTER IN SYSTEM START
(6) TROUBLESHOOTING
(7) FUNCTION OF THE ADAPTER LEDS
(8) HISTORY
.. This file contains
===========================================================================
(1) OVERVIEW
(2) SUPPORTED ADAPTERS
(3) GENERAL INFORMATION
(4) INSTALLATION
(5) INCLUSION OF THE ADAPTER IN SYSTEM START
(6) TROUBLESHOOTING
(7) FUNCTION OF THE ADAPTER LEDS
(8) HISTORY
(1) OVERVIEW
============
1. Overview
===========
This README explains how to use the driver 'skfp' for Linux with your
network adapter.
Chapter 2: Contains a list of all network adapters that are supported by
this driver.
this driver.
Chapter 3: Gives some general information.
Chapter 3:
Gives some general information.
Chapter 4: Describes common problems and solutions.
......@@ -37,14 +43,13 @@ Chapter 5: Shows the changed functionality of the adapter LEDs.
Chapter 6: History of development.
***
(2) SUPPORTED ADAPTERS
======================
2. Supported adapters
=====================
The network driver 'skfp' supports the following network adapters:
SysKonnect adapters:
- SK-5521 (SK-NET FDDI-UP)
- SK-5522 (SK-NET FDDI-UP DAS)
- SK-5541 (SK-NET FDDI-FP)
......@@ -55,157 +60,187 @@ SysKonnect adapters:
- SK-5841 (SK-NET FDDI-FP64)
- SK-5843 (SK-NET FDDI-LP64)
- SK-5844 (SK-NET FDDI-LP64 DAS)
Compaq adapters (not tested):
- Netelligent 100 FDDI DAS Fibre SC
- Netelligent 100 FDDI SAS Fibre SC
- Netelligent 100 FDDI DAS UTP
- Netelligent 100 FDDI SAS UTP
- Netelligent 100 FDDI SAS Fibre MIC
***
(3) GENERAL INFORMATION
=======================
3. General Information
======================
From v2.01 on, the driver is integrated in the linux kernel sources.
Therefore, the installation is the same as for any other adapter
supported by the kernel.
Refer to the manual of your distribution about the installation
of network adapters.
Makes my life much easier :-)
***
Makes my life much easier :-)
(4) TROUBLESHOOTING
===================
4. Troubleshooting
==================
If you run into problems during installation, check those items:
Problem: The FDDI adapter cannot be found by the driver.
Reason: Look in /proc/pci for the following entry:
'FDDI network controller: SysKonnect SK-FDDI-PCI ...'
Problem:
The FDDI adapter cannot be found by the driver.
Reason:
Look in /proc/pci for the following entry:
'FDDI network controller: SysKonnect SK-FDDI-PCI ...'
If this entry exists, then the FDDI adapter has been
found by the system and should be able to be used.
If this entry does not exist or if the file '/proc/pci'
is not there, then you may have a hardware problem or PCI
support may not be enabled in your kernel.
The adapter can be checked using the diagnostic program
which is available from the SysKonnect web site:
www.syskonnect.de
Some COMPAQ machines have a problem with PCI under
Linux. This is described in the 'PCI howto' document
(included in some distributions or available from the
www, e.g. at 'www.linux.org') and no workaround is available.
Problem: You want to use your computer as a router between
multiple IP subnetworks (using multiple adapters), but
Problem:
You want to use your computer as a router between
multiple IP subnetworks (using multiple adapters), but
you cannot reach computers in other subnetworks.
Reason: Either the router's kernel is not configured for IP
Reason:
Either the router's kernel is not configured for IP
forwarding or there is a problem with the routing table
and gateway configuration in at least one of the
computers.
If your problem is not listed here, please contact our
technical support for help.
You can send email to:
linux@syskonnect.de
technical support for help.
You can send email to: linux@syskonnect.de
When contacting our technical support,
please ensure that the following information is available:
- System Manufacturer and Model
- Boards in your system
- Distribution
- Kernel version
***
(5) FUNCTION OF THE ADAPTER LEDS
================================
The functionality of the LED's on the FDDI network adapters was
changed in SMT version v2.82. With this new SMT version, the yellow
LED works as a ring operational indicator. An active yellow LED
indicates that the ring is down. The green LED on the adapter now
works as a link indicator where an active GREEN LED indicates that
the respective port has a physical connection.
5. Function of the Adapter LEDs
===============================
With versions of SMT prior to v2.82 a ring up was indicated if the
yellow LED was off while the green LED(s) showed the connection
status of the adapter. During a ring down the green LED was off and
the yellow LED was on.
The functionality of the LED's on the FDDI network adapters was
changed in SMT version v2.82. With this new SMT version, the yellow
LED works as a ring operational indicator. An active yellow LED
indicates that the ring is down. The green LED on the adapter now
works as a link indicator where an active GREEN LED indicates that
the respective port has a physical connection.
All implementations indicate that a driver is not loaded if
all LEDs are off.
With versions of SMT prior to v2.82 a ring up was indicated if the
yellow LED was off while the green LED(s) showed the connection
status of the adapter. During a ring down the green LED was off and
the yellow LED was on.
***
All implementations indicate that a driver is not loaded if
all LEDs are off.
(6) HISTORY
===========
6. History
==========
v2.06 (20000511) (In-Kernel version)
New features:
- 64 bit support
- new pci dma interface
- in kernel 2.3.99
v2.05 (20000217) (In-Kernel version)
New features:
- Changes for 2.3.45 kernel
v2.04 (20000207) (Standalone version)
New features:
- Added rx/tx byte counter
v2.03 (20000111) (Standalone version)
Problems fixed:
- Fixed printk statements from v2.02
v2.02 (991215) (Standalone version)
Problems fixed:
- Removed unnecessary output
- Fixed path for "printver.sh" in makefile
v2.01 (991122) (In-Kernel version)
New features:
- Integration in Linux kernel sources
- Support for memory mapped I/O.
v2.00 (991112)
New features:
- Full source released under GPL
v1.05 (991023)
Problems fixed:
- Compilation with kernel version 2.2.13 failed
v1.04 (990427)
Changes:
- New SMT module included, changing LED functionality
Problems fixed:
- Synchronization on SMP machines was buggy
v1.03 (990325)
Problems fixed:
- Interrupt routing on SMP machines could be incorrect
v1.02 (990310)
New features:
- Support for kernel versions 2.2.x added
- Kernel patch instead of private duplicate of kernel functions
v1.01 (980812)
Problems fixed:
Connection hangup with telnet
Slow telnet connection
v1.00 beta 01 (980507)
New features:
None.
Problems fixed:
None.
Known limitations:
- tar archive instead of standard package format (rpm).
- tar archive instead of standard package format (rpm).
- FDDI statistic is empty.
- not tested with 2.1.xx kernels
- integration in kernel not tested
......@@ -216,5 +251,3 @@ v1.00 beta 01 (980507)
- does not work on some COMPAQ machines. See the PCI howto
document for details about this problem.
- data corruption with kernel versions below 2.0.33.
*** End of information file ***
.. SPDX-License-Identifier: GPL-2.0
=========================
Stream Parser (strparser)
=========================
Introduction
============
......@@ -34,8 +38,10 @@ that is called when a full message has been completed.
Functions
=========
strp_init(struct strparser *strp, struct sock *sk,
const struct strp_callbacks *cb)
::
strp_init(struct strparser *strp, struct sock *sk,
const struct strp_callbacks *cb)
Called to initialize a stream parser. strp is a struct of type
strparser that is allocated by the upper layer. sk is the TCP
......@@ -43,31 +49,41 @@ strp_init(struct strparser *strp, struct sock *sk,
callback mode; in general mode this is set to NULL. Callbacks
are called by the stream parser (the callbacks are listed below).
void strp_pause(struct strparser *strp)
::
void strp_pause(struct strparser *strp)
Temporarily pause a stream parser. Message parsing is suspended
and no new messages are delivered to the upper layer.
void strp_unpause(struct strparser *strp)
::
void strp_unpause(struct strparser *strp)
Unpause a paused stream parser.
void strp_stop(struct strparser *strp);
::
void strp_stop(struct strparser *strp);
strp_stop is called to completely stop stream parser operations.
This is called internally when the stream parser encounters an
error, and it is called from the upper layer to stop parsing
operations.
void strp_done(struct strparser *strp);
::
void strp_done(struct strparser *strp);
strp_done is called to release any resources held by the stream
parser instance. This must be called after the stream processor
has been stopped.
int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
unsigned int orig_offset, size_t orig_len,
size_t max_msg_size, long timeo)
::
int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
unsigned int orig_offset, size_t orig_len,
size_t max_msg_size, long timeo)
strp_process is called in general mode for a stream parser to
parse an sk_buff. The number of bytes processed or a negative
......@@ -75,7 +91,9 @@ int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
consume the sk_buff. max_msg_size is maximum size the stream
parser will parse. timeo is timeout for completing a message.
void strp_data_ready(struct strparser *strp);
::
void strp_data_ready(struct strparser *strp);
The upper layer calls strp_tcp_data_ready when data is ready on
the lower socket for strparser to process. This should be called
......@@ -83,7 +101,9 @@ void strp_data_ready(struct strparser *strp);
maximum messages size is the limit of the receive socket
buffer and message timeout is the receive timeout for the socket.
void strp_check_rcv(struct strparser *strp);
::
void strp_check_rcv(struct strparser *strp);
strp_check_rcv is called to check for new messages on the socket.
This is normally called at initialization of a stream parser
......@@ -94,7 +114,9 @@ Callbacks
There are six callbacks:
int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
::
int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
parse_msg is called to determine the length of the next message
in the stream. The upper layer must implement this function. It
......@@ -107,14 +129,16 @@ int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
The return values of this function are:
>0 : indicates length of successfully parsed message
0 : indicates more data must be received to parse the message
-ESTRPIPE : current message should not be processed by the
kernel, return control of the socket to userspace which
can proceed to read the messages itself
other < 0 : Error in parsing, give control back to userspace
assuming that synchronization is lost and the stream
is unrecoverable (application expected to close TCP socket)
========= ===========================================================
>0 indicates length of successfully parsed message
0 indicates more data must be received to parse the message
-ESTRPIPE current message should not be processed by the
kernel, return control of the socket to userspace which
can proceed to read the messages itself
other < 0 Error in parsing, give control back to userspace
assuming that synchronization is lost and the stream
is unrecoverable (application expected to close TCP socket)
========= ===========================================================
In the case that an error is returned (return value is less than
zero) and the parser is in receive callback mode, then it will set
......@@ -123,7 +147,9 @@ int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
the current message, then the error set on the attached socket is
ENODATA since the stream is unrecoverable in that case.
void (*lock)(struct strparser *strp)
::
void (*lock)(struct strparser *strp)
The lock callback is called to lock the strp structure when
the strparser is performing an asynchronous operation (such as
......@@ -131,14 +157,18 @@ void (*lock)(struct strparser *strp)
function is to lock_sock for the associated socket. In general
mode the callback must be set appropriately.
void (*unlock)(struct strparser *strp)
::
void (*unlock)(struct strparser *strp)
The unlock callback is called to release the lock obtained
by the lock callback. In receive callback mode the default
function is release_sock for the associated socket. In general
mode the callback must be set appropriately.
void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
::
void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
rcv_msg is called when a full message has been received and
is queued. The callee must consume the sk_buff; it can
......@@ -152,7 +182,9 @@ void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
the length of the message. skb->len - offset may be greater
then full_len since strparser does not trim the skb.
int (*read_sock_done)(struct strparser *strp, int err);
::
int (*read_sock_done)(struct strparser *strp, int err);
read_sock_done is called when the stream parser is done reading
the TCP socket in receive callback mode. The stream parser may
......@@ -160,7 +192,9 @@ int (*read_sock_done)(struct strparser *strp, int err);
to occur when exiting the loop. If the callback is not set (NULL
in strp_init) a default function is used.
void (*abort_parser)(struct strparser *strp, int err);
::
void (*abort_parser)(struct strparser *strp, int err);
This function is called when stream parser encounters an error
in parsing. The default function stops the stream parser and
......@@ -204,4 +238,3 @@ Author
======
Tom Herbert (tom@quantonium.net)
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
===============================================
Ethernet switch device driver model (switchdev)
===============================================
Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
Copyright (c) 2014-2015 Scott Feldman <sfeldma@gmail.com>
Copyright |copy| 2014 Jiri Pirko <jiri@resnulli.us>
Copyright |copy| 2014-2015 Scott Feldman <sfeldma@gmail.com>
The Ethernet switch device driver model (switchdev) is an in-kernel driver
......@@ -12,53 +18,57 @@ Figure 1 is a block diagram showing the components of the switchdev model for
an example setup using a data-center-class switch ASIC chip. Other setups
with SR-IOV or soft switches, such as OVS, are possible.
::
User-space tools
User-space tools
user space |
+-------------------------------------------------------------------+
kernel | Netlink
|
+--------------+-------------------------------+
| Network stack |
| (Linux) |
| |
+----------------------------------------------+
sw1p2 sw1p4 sw1p6
sw1p1 + sw1p3 + sw1p5 + eth1
+ | + | + | +
| | | | | | |
+--+----+----+----+----+----+---+ +-----+-----+
| Switch driver | | mgmt |
| (this document) | | driver |
| | | |
+--------------+----------------+ +-----------+
|
|
+--------------+-------------------------------+
| Network stack |
| (Linux) |
| |
+----------------------------------------------+
sw1p2 sw1p4 sw1p6
sw1p1 + sw1p3 + sw1p5 + eth1
+ | + | + | +
| | | | | | |
+--+----+----+----+----+----+---+ +-----+-----+
| Switch driver | | mgmt |
| (this document) | | driver |
| | | |
+--------------+----------------+ +-----------+
|
kernel | HW bus (eg PCI)
+-------------------------------------------------------------------+
hardware |
+--------------+----------------+
| Switch device (sw1) |
| +----+ +--------+
| | v offloaded data path | mgmt port
| | | |
+--|----|----+----+----+----+---+
| | | | | |
+ + + + + +
p1 p2 p3 p4 p5 p6
+--------------+----------------+
| Switch device (sw1) |
| +----+ +--------+
| | v offloaded data path | mgmt port
| | | |
+--|----|----+----+----+----+---+
| | | | | |
+ + + + + +
p1 p2 p3 p4 p5 p6
front-panel ports
front-panel ports
Fig 1.
Fig 1.
Include Files
-------------
#include <linux/netdevice.h>
#include <net/switchdev.h>
::
#include <linux/netdevice.h>
#include <net/switchdev.h>
Configuration
......@@ -114,10 +124,10 @@ Using port PHYS name (ndo_get_phys_port_name) for the key is particularly
useful for dynamically-named ports where the device names its ports based on
external configuration. For example, if a physical 40G port is split logically
into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
name for each port using port PHYS name. The udev rule would be:
name for each port using port PHYS name. The udev rule would be::
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}"
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}"
Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y
is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0
......@@ -173,7 +183,7 @@ Static FDB Entries
The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump
to support static FDB entries installed to the device. Static bridge FDB
entries are installed, for example, using iproute2 bridge cmd:
entries are installed, for example, using iproute2 bridge cmd::
bridge fdb add ADDR dev DEV [vlan VID] [self]
......@@ -185,7 +195,7 @@ XXX: what should be done if offloading this rule to hardware fails (for
example, due to full capacity in hardware tables) ?
Note: by default, the bridge does not filter on VLAN and only bridges untagged
traffic. To enable VLAN support, turn on VLAN filtering:
traffic. To enable VLAN support, turn on VLAN filtering::
echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering
......@@ -194,7 +204,7 @@ Notification of Learned/Forgotten Source MAC/VLANs
The switch device will learn/forget source MAC address/VLAN on ingress packets
and notify the switch driver of the mac/vlan/port tuples. The switch driver,
in turn, will notify the bridge driver using the switchdev notifier call:
in turn, will notify the bridge driver using the switchdev notifier call::
err = call_switchdev_notifiers(val, dev, info, extack);
......@@ -202,7 +212,7 @@ Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
forgetting, and info points to a struct switchdev_notifier_fdb_info. On
SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the
bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge
command will label these entries "offload":
command will label these entries "offload"::
$ bridge fdb
52:54:00:12:35:01 dev sw1p1 master br0 permanent
......@@ -219,11 +229,11 @@ command will label these entries "offload":
01:00:5e:00:00:01 dev br0 self permanent
33:33:ff:12:35:01 dev br0 self permanent
Learning on the port should be disabled on the bridge using the bridge command:
Learning on the port should be disabled on the bridge using the bridge command::
bridge link set dev DEV learning off
Learning on the device port should be enabled, as well as learning_sync:
Learning on the device port should be enabled, as well as learning_sync::
bridge link set dev DEV learning on self
bridge link set dev DEV learning_sync on self
......@@ -314,12 +324,16 @@ forwards the packet to the matching FIB entry's nexthop(s) egress ports.
To program the device, the driver has to register a FIB notifier handler
using register_fib_notifier. The following events are available:
FIB_EVENT_ENTRY_ADD: used for both adding a new FIB entry to the device,
or modifying an existing entry on the device.
FIB_EVENT_ENTRY_DEL: used for removing a FIB entry
FIB_EVENT_RULE_ADD, FIB_EVENT_RULE_DEL: used to propagate FIB rule changes
FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass:
=================== ===================================================
FIB_EVENT_ENTRY_ADD used for both adding a new FIB entry to the device,
or modifying an existing entry on the device.
FIB_EVENT_ENTRY_DEL used for removing a FIB entry
FIB_EVENT_RULE_ADD,
FIB_EVENT_RULE_DEL used to propagate FIB rule changes
=================== ===================================================
FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass::
struct fib_entry_notifier_info {
struct fib_notifier_info info; /* must be first */
......@@ -332,12 +346,12 @@ FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass:
u32 nlflags;
};
to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The *fi
structure holds details on the route and route's nexthops. *dev is one of the
port netdevs mentioned in the route's next hop list.
to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The ``*fi``
structure holds details on the route and route's nexthops. ``*dev`` is one
of the port netdevs mentioned in the route's next hop list.
Routes offloaded to the device are labeled with "offload" in the ip route
listing:
listing::
$ ip route show
default via 192.168.0.2 dev eth0
......
.. SPDX-License-Identifier: GPL-2.0
================================
TC Actions - Environmental Rules
================================
The "environmental" rules for authors of any new tc actions are:
1) If you stealeth or borroweth any packet thou shalt be branching
from the righteous path and thou shalt cloneth.
from the righteous path and thou shalt cloneth.
For example if your action queues a packet to be processed later,
or intentionally branches by redirecting a packet, then you need to
clone the packet.
For example if your action queues a packet to be processed later,
or intentionally branches by redirecting a packet, then you need to
clone the packet.
2) If you munge any packet thou shalt call pskb_expand_head in the case
someone else is referencing the skb. After that you "own" the skb.
someone else is referencing the skb. After that you "own" the skb.
3) Dropping packets you don't own is a no-no. You simply return
TC_ACT_SHOT to the caller and they will drop it.
TC_ACT_SHOT to the caller and they will drop it.
The "environmental" rules for callers of actions (qdiscs etc) are:
*) Thou art responsible for freeing anything returned as being
TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is
returned, then all is great and you don't need to do anything.
#) Thou art responsible for freeing anything returned as being
TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is
returned, then all is great and you don't need to do anything.
Post on netdev if something is unclear.
.. SPDX-License-Identifier: GPL-2.0
====================
Thin-streams and TCP
====================
A wide range of Internet-based services that use reliable transport
protocols display what we call thin-stream properties. This means
that the application sends data with such a low rate that the
......@@ -42,6 +46,7 @@ References
==========
More information on the modifications, as well as a wide range of
experimental data can be found here:
"Improving latency for interactive, thin-stream applications over
reliable transport"
http://simula.no/research/nd/publications/Simula.nd.477/simula_pdf_file
.. SPDX-License-Identifier: GPL-2.0
====
Team
====
Team devices are driven from userspace via libteam library which is here:
https://github.com/jpirko/libteam
.. SPDX-License-Identifier: GPL-2.0
============
Timestamping
============
1. Control Interfaces
=====================
The interfaces for receiving network packages timestamps are:
* SO_TIMESTAMP
SO_TIMESTAMP
Generates a timestamp for each incoming packet in (not necessarily
monotonic) system time. Reports the timestamp via recvmsg() in a
control message in usec resolution.
......@@ -13,7 +20,7 @@ The interfaces for receiving network packages timestamps are:
SO_TIMESTAMP_OLD and in struct __kernel_sock_timeval for
SO_TIMESTAMP_NEW options respectively.
* SO_TIMESTAMPNS
SO_TIMESTAMPNS
Same timestamping mechanism as SO_TIMESTAMP, but reports the
timestamp as struct timespec in nsec resolution.
SO_TIMESTAMPNS is defined as SO_TIMESTAMPNS_NEW or SO_TIMESTAMPNS_OLD
......@@ -22,17 +29,18 @@ The interfaces for receiving network packages timestamps are:
and in struct __kernel_timespec for SO_TIMESTAMPNS_NEW options
respectively.
* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
Only for multicast:approximate transmit timestamp obtained by
reading the looped packet receive timestamp.
* SO_TIMESTAMPING
SO_TIMESTAMPING
Generates timestamps on reception, transmission or both. Supports
multiple timestamp sources, including hardware. Supports generating
timestamps for stream sockets.
1.1 SO_TIMESTAMP (also SO_TIMESTAMP_OLD and SO_TIMESTAMP_NEW):
1.1 SO_TIMESTAMP (also SO_TIMESTAMP_OLD and SO_TIMESTAMP_NEW)
-------------------------------------------------------------
This socket option enables timestamping of datagrams on the reception
path. Because the destination socket, if any, is not known early in
......@@ -59,10 +67,11 @@ struct __kernel_timespec format.
SO_TIMESTAMPNS_OLD returns incorrect timestamps after the year 2038
on 32 bit machines.
1.3 SO_TIMESTAMPING (also SO_TIMESTAMPING_OLD and SO_TIMESTAMPING_NEW):
1.3 SO_TIMESTAMPING (also SO_TIMESTAMPING_OLD and SO_TIMESTAMPING_NEW)
----------------------------------------------------------------------
Supports multiple types of timestamp requests. As a result, this
socket option takes a bitmap of flags, not a boolean. In
socket option takes a bitmap of flags, not a boolean. In::
err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
......@@ -76,6 +85,7 @@ be enabled for individual sendmsg calls using cmsg (1.3.4).
1.3.1 Timestamp Generation
^^^^^^^^^^^^^^^^^^^^^^^^^^
Some bits are requests to the stack to try to generate timestamps. Any
combination of them is valid. Changes to these bits apply to newly
......@@ -106,7 +116,6 @@ SOF_TIMESTAMPING_TX_SOFTWARE:
require driver support and may not be available for all devices.
This flag can be enabled via both socket options and control messages.
SOF_TIMESTAMPING_TX_SCHED:
Request tx timestamps prior to entering the packet scheduler. Kernel
transmit latency is, if long, often dominated by queuing delay. The
......@@ -132,6 +141,7 @@ SOF_TIMESTAMPING_TX_ACK:
1.3.2 Timestamp Reporting
^^^^^^^^^^^^^^^^^^^^^^^^^
The other three bits control which timestamps will be reported in a
generated control message. Changes to the bits take immediate
......@@ -151,11 +161,11 @@ SOF_TIMESTAMPING_RAW_HARDWARE:
1.3.3 Timestamp Options
^^^^^^^^^^^^^^^^^^^^^^^
The interface supports the options
SOF_TIMESTAMPING_OPT_ID:
Generate a unique identifier along with each packet. A process can
have multiple concurrent timestamping requests outstanding. Packets
can be reordered in the transmit path, for instance in the packet
......@@ -183,7 +193,6 @@ SOF_TIMESTAMPING_OPT_ID:
SOF_TIMESTAMPING_OPT_CMSG:
Support recv() cmsg for all timestamped packets. Control messages
are already supported unconditionally on all packets with receive
timestamps and on IPv6 packets with transmit timestamp. This option
......@@ -193,7 +202,6 @@ SOF_TIMESTAMPING_OPT_CMSG:
SOF_TIMESTAMPING_OPT_TSONLY:
Applies to transmit timestamps only. Makes the kernel return the
timestamp as a cmsg alongside an empty packet, as opposed to
alongside the original packet. This reduces the amount of memory
......@@ -202,7 +210,6 @@ SOF_TIMESTAMPING_OPT_TSONLY:
This option disables SOF_TIMESTAMPING_OPT_CMSG.
SOF_TIMESTAMPING_OPT_STATS:
Optional stats that are obtained along with the transmit timestamps.
It must be used together with SOF_TIMESTAMPING_OPT_TSONLY. When the
transmit timestamp is available, the stats are available in a
......@@ -213,7 +220,6 @@ SOF_TIMESTAMPING_OPT_STATS:
data was limited by peer's receiver window.
SOF_TIMESTAMPING_OPT_PKTINFO:
Enable the SCM_TIMESTAMPING_PKTINFO control message for incoming
packets with hardware timestamps. The message contains struct
scm_ts_pktinfo, which supplies the index of the real interface which
......@@ -223,7 +229,6 @@ SOF_TIMESTAMPING_OPT_PKTINFO:
other fields, but they are reserved and undefined.
SOF_TIMESTAMPING_OPT_TX_SWHW:
Request both hardware and software timestamps for outgoing packets
when SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE
are enabled at the same time. If both timestamps are generated,
......@@ -242,12 +247,13 @@ combined with SOF_TIMESTAMPING_OPT_TSONLY.
1.3.4. Enabling timestamps via control messages
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In addition to socket options, timestamp generation can be requested
per write via cmsg, only for SOF_TIMESTAMPING_TX_* (see Section 1.3.1).
Using this feature, applications can sample timestamps per sendmsg()
without paying the overhead of enabling and disabling timestamps via
setsockopt:
setsockopt::
struct msghdr *msg;
...
......@@ -264,7 +270,7 @@ The SOF_TIMESTAMPING_TX_* flags set via cmsg will override
the SOF_TIMESTAMPING_TX_* flags set via setsockopt.
Moreover, applications must still enable timestamp reporting via
setsockopt to receive timestamps:
setsockopt to receive timestamps::
__u32 val = SOF_TIMESTAMPING_SOFTWARE |
SOF_TIMESTAMPING_OPT_ID /* or any other flag */;
......@@ -272,6 +278,7 @@ setsockopt to receive timestamps:
1.4 Bytestream Timestamps
-------------------------
The SO_TIMESTAMPING interface supports timestamping of bytes in a
bytestream. Each request is interpreted as a request for when the
......@@ -331,6 +338,7 @@ unusual.
2 Data Interfaces
==================
Timestamps are read using the ancillary data feature of recvmsg().
See `man 3 cmsg` for details of this interface. The socket manual
......@@ -339,20 +347,21 @@ SO_TIMESTAMP and SO_TIMESTAMPNS records can be retrieved.
2.1 SCM_TIMESTAMPING records
----------------------------
These timestamps are returned in a control message with cmsg_level
SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type
For SO_TIMESTAMPING_OLD:
For SO_TIMESTAMPING_OLD::
struct scm_timestamping {
struct timespec ts[3];
};
struct scm_timestamping {
struct timespec ts[3];
};
For SO_TIMESTAMPING_NEW:
For SO_TIMESTAMPING_NEW::
struct scm_timestamping64 {
struct __kernel_timespec ts[3];
struct scm_timestamping64 {
struct __kernel_timespec ts[3];
Always use SO_TIMESTAMPING_NEW timestamp to always get timestamp in
struct scm_timestamping64 format.
......@@ -377,6 +386,7 @@ in ts[0] when a real software timestamp is missing. This happens also
on hardware transmit timestamps.
2.1.1 Transmit timestamps with MSG_ERRQUEUE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For transmit timestamps the outgoing packet is looped back to the
socket's error queue with the send timestamp(s) attached. A process
......@@ -393,6 +403,7 @@ embeds the struct scm_timestamping.
2.1.1.2 Timestamp types
~~~~~~~~~~~~~~~~~~~~~~~
The semantics of the three struct timespec are defined by field
ee_info in the extended error structure. It contains a value of
......@@ -408,6 +419,7 @@ case the timestamp is stored in ts[0].
2.1.1.3 Fragmentation
~~~~~~~~~~~~~~~~~~~~~
Fragmentation of outgoing datagrams is rare, but is possible, e.g., by
explicitly disabling PMTU discovery. If an outgoing packet is fragmented,
......@@ -416,6 +428,7 @@ socket.
2.1.1.4 Packet Payload
~~~~~~~~~~~~~~~~~~~~~~
The calling application is often not interested in receiving the whole
packet payload that it passed to the stack originally: the socket
......@@ -427,6 +440,7 @@ however, the full packet is queued, taking up budget from SO_RCVBUF.
2.1.1.5 Blocking Read
~~~~~~~~~~~~~~~~~~~~~
Reading from the error queue is always a non-blocking operation. To
block waiting on a timestamp, use poll or select. poll() will return
......@@ -436,6 +450,7 @@ ignored on request. See also `man 2 poll`.
2.1.2 Receive timestamps
^^^^^^^^^^^^^^^^^^^^^^^^
On reception, there is no reason to read from the socket error queue.
The SCM_TIMESTAMPING ancillary data is sent along with the packet data
......@@ -447,16 +462,17 @@ is again deprecated and ts[2] holds a hardware timestamp if set.
3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP
=======================================================================
Hardware time stamping must also be initialized for each device driver
that is expected to do hardware time stamping. The parameter is defined in
include/uapi/linux/net_tstamp.h as:
include/uapi/linux/net_tstamp.h as::
struct hwtstamp_config {
int flags; /* no flags defined right now, must be zero */
int tx_type; /* HWTSTAMP_TX_* */
int rx_filter; /* HWTSTAMP_FILTER_* */
};
struct hwtstamp_config {
int flags; /* no flags defined right now, must be zero */
int tx_type; /* HWTSTAMP_TX_* */
int rx_filter; /* HWTSTAMP_FILTER_* */
};
Desired behavior is passed into the kernel and to a specific device by
calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
......@@ -487,44 +503,47 @@ Any process can read the actual configuration by passing this
structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has
not been implemented in all drivers.
/* possible values for hwtstamp_config->tx_type */
enum {
/*
* no outgoing packet will need hardware time stamping;
* should a packet arrive which asks for it, no hardware
* time stamping will be done
*/
HWTSTAMP_TX_OFF,
/*
* enables hardware time stamping for outgoing packets;
* the sender of the packet decides which are to be
* time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
* before sending the packet
*/
HWTSTAMP_TX_ON,
};
/* possible values for hwtstamp_config->rx_filter */
enum {
/* time stamp no incoming packet at all */
HWTSTAMP_FILTER_NONE,
/* time stamp any incoming packet */
HWTSTAMP_FILTER_ALL,
/* return value: time stamp all packets requested plus some others */
HWTSTAMP_FILTER_SOME,
/* PTP v1, UDP, any kind of event packet */
HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
/* for the complete list of values, please check
* the include file include/uapi/linux/net_tstamp.h
*/
};
::
/* possible values for hwtstamp_config->tx_type */
enum {
/*
* no outgoing packet will need hardware time stamping;
* should a packet arrive which asks for it, no hardware
* time stamping will be done
*/
HWTSTAMP_TX_OFF,
/*
* enables hardware time stamping for outgoing packets;
* the sender of the packet decides which are to be
* time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
* before sending the packet
*/
HWTSTAMP_TX_ON,
};
/* possible values for hwtstamp_config->rx_filter */
enum {
/* time stamp no incoming packet at all */
HWTSTAMP_FILTER_NONE,
/* time stamp any incoming packet */
HWTSTAMP_FILTER_ALL,
/* return value: time stamp all packets requested plus some others */
HWTSTAMP_FILTER_SOME,
/* PTP v1, UDP, any kind of event packet */
HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
/* for the complete list of values, please check
* the include file include/uapi/linux/net_tstamp.h
*/
};
3.1 Hardware Timestamping Implementation: Device Drivers
--------------------------------------------------------
A driver which supports hardware time stamping must support the
SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
......@@ -533,22 +552,23 @@ should also support SIOCGHWTSTAMP.
Time stamps for received packets must be stored in the skb. To get a pointer
to the shared time stamp structure of the skb call skb_hwtstamps(). Then
set the time stamps in the structure:
set the time stamps in the structure::
struct skb_shared_hwtstamps {
/* hardware time stamp transformed into duration
* since arbitrary point in time
*/
ktime_t hwtstamp;
};
struct skb_shared_hwtstamps {
/* hardware time stamp transformed into duration
* since arbitrary point in time
*/
ktime_t hwtstamp;
};
Time stamps for outgoing packets are to be generated as follows:
- In hard_start_xmit(), check if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)
is set no-zero. If yes, then the driver is expected to do hardware time
stamping.
- If this is possible for the skb and requested, then declare
that the driver is doing the time stamping by setting the flag
SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags , e.g. with
SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags , e.g. with::
skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
......
.. SPDX-License-Identifier: GPL-2.0
=========================
Transparent proxy support
=========================
......@@ -11,39 +14,39 @@ From Linux 4.18 transparent proxy support is also available in nf_tables.
================================
The idea is that you identify packets with destination address matching a local
socket on your box, set the packet mark to a certain value:
socket on your box, set the packet mark to a certain value::
# iptables -t mangle -N DIVERT
# iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
# iptables -t mangle -A DIVERT -j MARK --set-mark 1
# iptables -t mangle -A DIVERT -j ACCEPT
# iptables -t mangle -N DIVERT
# iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
# iptables -t mangle -A DIVERT -j MARK --set-mark 1
# iptables -t mangle -A DIVERT -j ACCEPT
Alternatively you can do this in nft with the following commands:
Alternatively you can do this in nft with the following commands::
# nft add table filter
# nft add chain filter divert "{ type filter hook prerouting priority -150; }"
# nft add rule filter divert meta l4proto tcp socket transparent 1 meta mark set 1 accept
# nft add table filter
# nft add chain filter divert "{ type filter hook prerouting priority -150; }"
# nft add rule filter divert meta l4proto tcp socket transparent 1 meta mark set 1 accept
And then match on that value using policy routing to have those packets
delivered locally:
delivered locally::
# ip rule add fwmark 1 lookup 100
# ip route add local 0.0.0.0/0 dev lo table 100
# ip rule add fwmark 1 lookup 100
# ip route add local 0.0.0.0/0 dev lo table 100
Because of certain restrictions in the IPv4 routing output code you'll have to
modify your application to allow it to send datagrams _from_ non-local IP
addresses. All you have to do is enable the (SOL_IP, IP_TRANSPARENT) socket
option before calling bind:
fd = socket(AF_INET, SOCK_STREAM, 0);
/* - 8< -*/
int value = 1;
setsockopt(fd, SOL_IP, IP_TRANSPARENT, &value, sizeof(value));
/* - 8< -*/
name.sin_family = AF_INET;
name.sin_port = htons(0xCAFE);
name.sin_addr.s_addr = htonl(0xDEADBEEF);
bind(fd, &name, sizeof(name));
option before calling bind::
fd = socket(AF_INET, SOCK_STREAM, 0);
/* - 8< -*/
int value = 1;
setsockopt(fd, SOL_IP, IP_TRANSPARENT, &value, sizeof(value));
/* - 8< -*/
name.sin_family = AF_INET;
name.sin_port = htons(0xCAFE);
name.sin_addr.s_addr = htonl(0xDEADBEEF);
bind(fd, &name, sizeof(name));
A trivial patch for netcat is available here:
http://people.netfilter.org/hidden/tproxy/netcat-ip_transparent-support.patch
......@@ -61,10 +64,10 @@ be able to find out the original destination address. Even in case of TCP
getting the original destination address is racy.)
The 'TPROXY' target provides similar functionality without relying on NAT. Simply
add rules like this to the iptables ruleset above:
add rules like this to the iptables ruleset above::
# iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \
--tproxy-mark 0x1/0x1 --on-port 50080
# iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \
--tproxy-mark 0x1/0x1 --on-port 50080
Or the following rule to nft:
......@@ -82,10 +85,12 @@ nf_tables implementation.
====================================
To use tproxy you'll need to have the following modules compiled for iptables:
- NETFILTER_XT_MATCH_SOCKET
- NETFILTER_XT_TARGET_TPROXY
Or the floowing modules for nf_tables:
- NFT_SOCKET
- NFT_TPROXY
......
......@@ -193,7 +193,7 @@ W: https://wireless.wiki.kernel.org/
T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git
T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git
F: Documentation/driver-api/80211/cfg80211.rst
F: Documentation/networking/regulatory.txt
F: Documentation/networking/regulatory.rst
F: include/linux/ieee80211.h
F: include/net/cfg80211.h
F: include/net/ieee80211_radiotap.h
......@@ -9515,7 +9515,7 @@ F: drivers/soc/lantiq
LAPB module
L: linux-x25@vger.kernel.org
S: Orphan
F: Documentation/networking/lapb-module.txt
F: Documentation/networking/lapb-module.rst
F: include/*/lapb.h
F: net/lapb/
......@@ -10079,7 +10079,7 @@ S: Maintained
W: https://wireless.wiki.kernel.org/
T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git
T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git
F: Documentation/networking/mac80211-injection.txt
F: Documentation/networking/mac80211-injection.rst
F: Documentation/networking/mac80211_hwsim/mac80211_hwsim.rst
F: drivers/net/wireless/mac80211_hwsim.[ch]
F: include/net/mac80211.h
......@@ -13262,7 +13262,7 @@ F: drivers/input/joystick/pxrc.c
PHONET PROTOCOL
M: Remi Denis-Courmont <courmisch@gmail.com>
S: Supported
F: Documentation/networking/phonet.txt
F: Documentation/networking/phonet.rst
F: include/linux/phonet.h
F: include/net/phonet/
F: include/uapi/linux/phonet.h
......@@ -14219,7 +14219,7 @@ L: linux-rdma@vger.kernel.org
L: rds-devel@oss.oracle.com (moderated for non-subscribers)
S: Supported
W: https://oss.oracle.com/projects/rds/
F: Documentation/networking/rds.txt
F: Documentation/networking/rds.rst
F: net/rds/
RDT - RESOURCE ALLOCATION
......@@ -14593,7 +14593,7 @@ M: David Howells <dhowells@redhat.com>
L: linux-afs@lists.infradead.org
S: Supported
W: https://www.infradead.org/~dhowells/kafs/
F: Documentation/networking/rxrpc.txt
F: Documentation/networking/rxrpc.rst
F: include/keys/rxrpc-type.h
F: include/net/af_rxrpc.h
F: include/trace/events/rxrpc.h
......@@ -14999,7 +14999,7 @@ M: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
L: linux-sctp@vger.kernel.org
S: Maintained
W: http://lksctp.sourceforge.net
F: Documentation/networking/sctp.txt
F: Documentation/networking/sctp.rst
F: include/linux/sctp.h
F: include/net/sctp/
F: include/uapi/linux/sctp.h
......
......@@ -302,7 +302,7 @@ config NETCONSOLE
tristate "Network console logging support"
---help---
If you want to log kernel messages over the network, enable this.
See <file:Documentation/networking/netconsole.txt> for details.
See <file:Documentation/networking/netconsole.rst> for details.
config NETCONSOLE_DYNAMIC
bool "Dynamic reconfiguration of logging targets"
......@@ -312,7 +312,7 @@ config NETCONSOLE_DYNAMIC
This option enables the ability to dynamically reconfigure target
parameters (interface, IP addresses, port numbers, MAC addresses)
at runtime through a userspace interface exported using configfs.
See <file:Documentation/networking/netconsole.txt> for details.
See <file:Documentation/networking/netconsole.rst> for details.
config NETPOLL
def_bool NETCONSOLE
......
......@@ -48,7 +48,7 @@ config LTPC
If you are in doubt, this card is the one with the 65C02 chip on it.
You also need version 1.3.3 or later of the netatalk package.
This driver is experimental, which means that it may not work.
See the file <file:Documentation/networking/ltpc.txt>.
See the file <file:Documentation/networking/ltpc.rst>.
config COPS
tristate "COPS LocalTalk PC support"
......
......@@ -1150,7 +1150,7 @@ static irqreturn_t gelic_card_interrupt(int irq, void *ptr)
* gelic_net_poll_controller - artificial interrupt for netconsole etc.
* @netdev: interface device structure
*
* see Documentation/networking/netconsole.txt
* see Documentation/networking/netconsole.rst
*/
void gelic_net_poll_controller(struct net_device *netdev)
{
......
......@@ -1615,7 +1615,7 @@ spider_net_interrupt(int irq, void *ptr)
* spider_net_poll_controller - artificial interrupt for netconsole etc.
* @netdev: interface device structure
*
* see Documentation/networking/netconsole.txt
* see Documentation/networking/netconsole.rst
*/
static void
spider_net_poll_controller(struct net_device *netdev)
......
......@@ -77,7 +77,7 @@ config SKFP
- Netelligent 100 FDDI SAS UTP
- Netelligent 100 FDDI SAS Fibre MIC
Read <file:Documentation/networking/skfp.txt> for information about
Read <file:Documentation/networking/skfp.rst> for information about
the driver.
Questions concerning this driver can be addressed to:
......
......@@ -21,7 +21,7 @@ config PLIP
bits at a time (mode 0) or with special PLIP cables, to be used on
bidirectional parallel ports only, which can transmit 8 bits at a
time (mode 1); you can find the wiring of these cables in
<file:Documentation/networking/PLIP.txt>. The cables can be up to
<file:Documentation/networking/plip.rst>. The cables can be up to
15m long. Mode 0 works also if one of the machines runs DOS/Windows
and has some PLIP software installed, e.g. the Crynwr PLIP packet
driver (<http://oak.oakland.edu/simtel.net/msdos/pktdrvr-pre.html>)
......
......@@ -57,7 +57,7 @@ config PCMCIA_RAYCS
---help---
Say Y here if you intend to attach an Aviator/Raytheon PCMCIA
(PC-card) wireless Ethernet networking card to your computer.
Please read the file <file:Documentation/networking/ray_cs.txt> for
Please read the file <file:Documentation/networking/ray_cs.rst> for
details.
To compile this driver as a module, choose M here: the module will be
......
......@@ -79,7 +79,7 @@ The DPSW can have ports connected to DPNIs or to PHYs via DPMACs.
For a more detailed description of the Ethernet switch device driver model
see:
Documentation/networking/switchdev.txt
Documentation/networking/switchdev.rst
Creating an Ethernet Switch
===========================
......
......@@ -89,7 +89,7 @@ enum {
* Add your fresh new feature above and remember to update
* netdev_features_strings[] in net/core/ethtool.c and maybe
* some feature mask #defines below. Please also describe it
* in Documentation/networking/netdev-features.txt.
* in Documentation/networking/netdev-features.rst.
*/
/**/NETDEV_FEATURE_COUNT
......
......@@ -5211,7 +5211,7 @@ u32 ieee80211_mandatory_rates(struct ieee80211_supported_band *sband,
* Radiotap parsing functions -- for controlled injection support
*
* Implemented in net/wireless/radiotap.c
* Documentation in Documentation/networking/radiotap-headers.txt
* Documentation in Documentation/networking/radiotap-headers.rst
*/
struct radiotap_align_size {
......
......@@ -36,7 +36,7 @@ struct sock_extended_err {
*
* The timestamping interfaces SO_TIMESTAMPING, MSG_TSTAMP_*
* communicate network timestamps by passing this struct in a cmsg with
* recvmsg(). See Documentation/networking/timestamping.txt for details.
* recvmsg(). See Documentation/networking/timestamping.rst for details.
* User space sees a timespec definition that matches either
* __kernel_timespec or __kernel_old_timespec, in the kernel we
* require two structure definitions to provide both.
......
......@@ -344,7 +344,7 @@ config NET_PKTGEN
what was just said, you don't need it: say N.
Documentation on how to use the packet generator can be found
at <file:Documentation/networking/pktgen.txt>.
at <file:Documentation/networking/pktgen.rst>.
To compile this code as a module, choose M here: the
module will be called pktgen.
......
......@@ -56,7 +56,7 @@
* Integrated to 2.5.x 021029 --Lucio Maciel (luciomaciel@zipmail.com.br)
*
* 021124 Finished major redesign and rewrite for new functionality.
* See Documentation/networking/pktgen.txt for how to use this.
* See Documentation/networking/pktgen.rst for how to use this.
*
* The new operation:
* For each CPU one thread/process is created at start. This process checks
......
......@@ -15,7 +15,7 @@ config LAPB
currently supports LAPB only over Ethernet connections. If you want
to use LAPB connections over Ethernet, say Y here and to "LAPB over
Ethernet driver" below. Read
<file:Documentation/networking/lapb-module.txt> for technical
<file:Documentation/networking/lapb-module.rst> for technical
details.
To compile this driver as a module, choose M here: the
......
......@@ -2144,7 +2144,7 @@ static bool ieee80211_parse_tx_radiotap(struct ieee80211_local *local,
/*
* Please update the file
* Documentation/networking/mac80211-injection.txt
* Documentation/networking/mac80211-injection.rst
* when parsing new fields here.
*/
......
......@@ -1043,7 +1043,7 @@ config NETFILTER_XT_TARGET_TPROXY
on Netfilter connection tracking and NAT, unlike REDIRECT.
For it to work you will have to configure certain iptables rules
and use policy routing. For more information on how to set it up
see Documentation/networking/tproxy.txt.
see Documentation/networking/tproxy.rst.
To compile it as a module, choose M here. If unsure, say N.
......
......@@ -18,7 +18,7 @@ config AF_RXRPC
This module at the moment only supports client operations and is
currently incomplete.
See Documentation/networking/rxrpc.txt.
See Documentation/networking/rxrpc.rst.
config AF_RXRPC_IPV6
bool "IPv6 support for RxRPC"
......@@ -41,7 +41,7 @@ config AF_RXRPC_DEBUG
help
Say Y here to make runtime controllable debugging messages appear.
See Documentation/networking/rxrpc.txt.
See Documentation/networking/rxrpc.rst.
config RXKAD
......@@ -56,4 +56,4 @@ config RXKAD
Provide kerberos 4 and AFS kaserver security handling for AF_RXRPC
through the use of the key retention service.
See Documentation/networking/rxrpc.txt.
See Documentation/networking/rxrpc.rst.
......@@ -21,7 +21,7 @@ static const unsigned long max_jiffies = MAX_JIFFY_OFFSET;
/*
* RxRPC operating parameters.
*
* See Documentation/networking/rxrpc.txt and the variable definitions for more
* See Documentation/networking/rxrpc.rst and the variable definitions for more
* information on the individual parameters.
*/
static struct ctl_table rxrpc_sysctl_table[] = {
......
......@@ -90,7 +90,7 @@ static const struct ieee80211_radiotap_namespace radiotap_ns = {
* iterator.this_arg for type "type" safely on all arches.
*
* Example code:
* See Documentation/networking/radiotap-headers.txt
* See Documentation/networking/radiotap-headers.rst
*/
int ieee80211_radiotap_iterator_init(
......
......@@ -3,7 +3,7 @@ Sample and benchmark scripts for pktgen (packet generator)
This directory contains some pktgen sample and benchmark scripts, that
can easily be copied and adjusted for your own use-case.
General doc is located in kernel: Documentation/networking/pktgen.txt
General doc is located in kernel: Documentation/networking/pktgen.rst
Helper include files
====================
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment