1. 28 Dec, 2019 19 commits
    • ethtool: set link settings with LINKINFO_SET request · a53f3d41
      Michal Kubecek authored
      Implement LINKINFO_SET netlink request to set link settings queried by
      LINKINFO_GET message.
      
      Only physical port, phy MDIO address and MDI(-X) control can be set;
      attempts to modify MDI(-X) status or transceiver are rejected.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: provide link settings with LINKINFO_GET request · 459e0b81
      Michal Kubecek authored
      Implement LINKINFO_GET netlink request to get basic link settings provided
      by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl commands.
      
      This request provides settings not directly related to autonegotiation and
      link mode selection: physical port, phy MDIO address, MDI(-X) status,
      MDI(-X) control and transceiver.
      
      LINKINFO_GET request can be used with NLM_F_DUMP (without device
      identification) to request the information for all devices in current
      network namespace providing the data.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: provide string sets with STRSET_GET request · 71921690
      Michal Kubecek authored
      Requests the contents of one or more string sets, i.e. indexed arrays of
      strings; this information is provided by ETHTOOL_GSSET_INFO and
      ETHTOOL_GSTRINGS commands of the ioctl interface. Unlike the ioctl interface,
      all information can be retrieved with one request and multiple string sets
      can be requested at once.
      
      There are three types of requests:
      
        - no NLM_F_DUMP, no device: get "global" string sets
        - no NLM_F_DUMP, with device: get string sets related to the device
        - NLM_F_DUMP, no device: get device related string sets for all devices
      
      Client can request either all string sets of given type (global or device
      related) or only specific sets. With ETHTOOL_A_STRSET_COUNTS flag set, only
      set sizes (numbers of strings) are returned.
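
The three request types listed above can be sketched as a small classifier. This is only an illustration: the enum and function names are hypothetical and not taken from the kernel, and the fourth combination (NLM_F_DUMP together with a device) is not described in the message, so the sketch merely flags it as unlisted.

```c
#include <assert.h>
#include <stdbool.h>

/* Scope names are hypothetical; they do not exist in the kernel. */
enum strset_scope {
	STRSET_SCOPE_GLOBAL,       /* no NLM_F_DUMP, no device */
	STRSET_SCOPE_ONE_DEVICE,   /* no NLM_F_DUMP, with device */
	STRSET_SCOPE_ALL_DEVICES,  /* NLM_F_DUMP, no device */
	STRSET_SCOPE_UNLISTED      /* NLM_F_DUMP with a device: not in the list */
};

/* Map the two request properties onto the request types from the message. */
enum strset_scope strset_classify(bool is_dump, bool has_device)
{
	if (is_dump)
		return has_device ? STRSET_SCOPE_UNLISTED : STRSET_SCOPE_ALL_DEVICES;
	return has_device ? STRSET_SCOPE_ONE_DEVICE : STRSET_SCOPE_GLOBAL;
}
```

For example, strset_classify(false, true) yields STRSET_SCOPE_ONE_DEVICE, the "string sets related to the device" case.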
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: default handlers for GET requests · 728480f1
      Michal Kubecek authored
      A significant part of GET request processing is common to most request
      types, but unfortunately it cannot be easily separated from type-specific
      code, as we need to alternate between common actions (parsing the common
      request header, allocating the message, filling netlink/genetlink headers
      etc.) and specific actions (querying the device, composing the reply). The
      processing also happens in three different situations: "do" request, "dump"
      request and notification, each doing things in a slightly different way.
      
      The request specific code is implemented in four or five callbacks defined
      in an instance of struct get_request_ops:
      
        parse_request() - parse incoming message
        prepare_data()  - retrieve data from driver or NIC
        reply_size()    - estimate reply message size
        fill_reply()    - compose reply message
        cleanup_data()  - (optional) clean up additional data
      
      Other members of struct get_request_ops describe the data structure holding
      information from client request and data used to compose the message. The
      default handlers ethnl_default_doit(), ethnl_default_dumpit(),
      ethnl_default_start() and ethnl_default_done() can be then used in genl_ops
      handler. Notification handler will be introduced in a later patch.
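
A rough userspace sketch of how a common "do" path might alternate between generic steps and the callbacks named above. The callback names follow the message; the struct name, all signatures, the toy demo_* request type and the error handling are assumptions made for illustration, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct get_request_ops (signatures assumed). */
struct get_ops_sketch {
	int  (*parse_request)(void *req, const char *msg);
	int  (*prepare_data)(const void *req, void *data);
	int  (*reply_size)(const void *req, const void *data);
	int  (*fill_reply)(char *buf, int len, const void *req, const void *data);
	void (*cleanup_data)(void *data);	/* optional, may be NULL */
};

/* Common path: returns bytes written to buf, or a negative error. */
int default_doit_sketch(const struct get_ops_sketch *ops, const char *msg,
			void *req, void *data, char *buf, int buflen)
{
	int ret, len;

	assert(ops && ops->parse_request && ops->prepare_data &&
	       ops->reply_size && ops->fill_reply);

	ret = ops->parse_request(req, msg);
	if (ret < 0)
		return ret;
	/* In this sketch, cleanup runs on every path once prepare_data()
	 * has been attempted. */
	ret = ops->prepare_data(req, data);
	if (ret < 0)
		goto out;
	len = ops->reply_size(req, data);
	if (len < 0 || len > buflen) {
		ret = -1;
		goto out;
	}
	ret = ops->fill_reply(buf, len, req, data);
out:
	if (ops->cleanup_data)
		ops->cleanup_data(data);
	return ret;
}

/* Toy request type, used only to exercise the sketch. */
static int demo_parse(void *req, const char *msg)
{ *(int *)req = msg[0]; return 0; }
static int demo_prepare(const void *req, void *data)
{ *(int *)data = *(const int *)req + 1; return 0; }
static int demo_size(const void *req, const void *data)
{ (void)req; (void)data; return 1; }
static int demo_fill(char *buf, int len, const void *req, const void *data)
{ (void)req; buf[0] = (char)*(const int *)data; return len; }

const struct get_ops_sketch demo_ops = {
	.parse_request = demo_parse,
	.prepare_data  = demo_prepare,
	.reply_size    = demo_size,
	.fill_reply    = demo_fill,
	.cleanup_data  = NULL,
};
```

Calling default_doit_sketch(&demo_ops, "a", &req, &data, buf, sizeof(buf)) runs all four mandatory callbacks in order and skips the optional cleanup.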
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: support for netlink notifications · 6b08d6c1
      Michal Kubecek authored
      Add infrastructure for ethtool netlink notifications. There is only one
      multicast group "monitor" which is used to notify userspace about changes
      and actions performed. Notification messages (types using suffix _NTF)
      share the format with replies to GET requests.
      
      Notifications are supposed to be broadcast on every configuration change,
      whether it is done using the netlink interface or the ioctl one. Netlink SET
      requests only trigger a notification if some data is actually changed.
      
      To trigger an ethtool notification, both ethtool netlink and external code
      use ethtool_notify() helper. This helper requires RTNL to be held and may
      sleep. Handlers sending messages for specific notification message types
      are registered in ethnl_notify_handlers array. As notifications can be
      triggered from other code, ethnl_ok flag is used to prevent an attempt to
      send notification before genetlink family is registered.
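
The handler table and the readiness flag described above might look roughly like this in a userspace model. All identifiers here are hypothetical stand-ins (family_registered plays the role of ethnl_ok, notify_handlers that of ethnl_notify_handlers); locking and the real message construction are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NTF_CMD_MAX 4	/* hypothetical number of message types */

typedef void (*notify_handler_t)(int ifindex, unsigned int cmd);

static notify_handler_t notify_handlers[NTF_CMD_MAX + 1]; /* indexed by type */
static bool family_registered;	/* plays the role of the ethnl_ok flag */
int notifications_sent;

static void count_handler(int ifindex, unsigned int cmd)
{
	(void)ifindex;
	(void)cmd;
	notifications_sent++;
}

/* Model of family registration: install handlers, then open the gate. */
void register_family_sketch(void)
{
	assert(!family_registered);	/* registering twice is a bug here */
	notify_handlers[1] = count_handler;
	family_registered = true;
}

/* Returns true if a handler was invoked for this message type. */
bool notify_sketch(int ifindex, unsigned int cmd)
{
	if (!family_registered)
		return false;	/* too early: family not registered yet */
	if (cmd > NTF_CMD_MAX || !notify_handlers[cmd])
		return false;	/* no handler for this message type */
	notify_handlers[cmd](ifindex, cmd);
	return true;
}
```

A call to notify_sketch() before register_family_sketch() is silently dropped, which is the behavior the flag is meant to guarantee for callers outside the netlink code.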
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: netlink bitset handling · 10b518d4
      Michal Kubecek authored
      The ethtool netlink code uses a common framework for passing arbitrary
      length bit sets to allow future extensions. A bitset can be a list (only
      one bitmap) or can consist of a value and mask pair (used e.g. when a client
      wants to modify only some bits). A bitset can use one of two formats:
      verbose (bit by bit) or compact.
      
      Verbose format consists of bitset size (number of bits), list flag and
      an array of bit nests, telling which bits are part of the list or which
      bits are in the mask and which of them are to be set. In requests, bits
      can be identified by index (position) or by name. In replies, kernel
      provides both index and name. Verbose format is suitable for "one shot"
      applications like standard ethtool command as it avoids the need to
      either keep bit names (e.g. link modes) in sync with kernel or having to
      add an extra roundtrip for string set request (e.g. for private flags).
      
      Compact format uses one (list) or two (value/mask) arrays of 32-bit
      words to store the bitmap(s). It is more suitable for long running
      applications (ethtool in monitor mode or network management daemons)
      which can retrieve the names once and then pass only compact bitmaps to
      save space.
      
      Userspace requests can use either format; ETHTOOL_FLAG_COMPACT_BITSETS
      flag in request header tells kernel which format to use in reply.
      Notifications always use compact format.
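
To make the value/mask form of the compact format concrete, here is a minimal sketch that applies such a pair to a bitmap stored as 32-bit words. The function name is hypothetical; it only illustrates the semantics (bits outside the mask are left alone) and is not the kernel helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Apply a value/mask pair: only bits set in mask[] are taken from value[].
 * Returns true if dst[] was actually modified. */
bool bitmap32_apply(uint32_t *dst, const uint32_t *value,
		    const uint32_t *mask, size_t nwords)
{
	bool changed = false;
	size_t i;

	assert(dst && value && mask);
	for (i = 0; i < nwords; i++) {
		uint32_t updated = (dst[i] & ~mask[i]) | (value[i] & mask[i]);

		if (updated != dst[i]) {
			dst[i] = updated;
			changed = true;
		}
	}
	return changed;
}
```

The boolean result is convenient for a caller that, like the SET requests described earlier in this series, only wants to act when something really changed.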
      
      As some code uses arrays of unsigned long for internal representation and
      some arrays of u32 (or even a single u32), two sets of parse/compose
      helpers are introduced. To avoid code duplication, helpers for unsigned
      long arrays are implemented as wrappers around helpers for u32 arrays.
      There are two reasons for this choice: (1) u32 arrays are more frequent in
      ethtool code and (2) an unsigned long array can always be interpreted as a
      u32 array on little endian 64-bit and all 32-bit architectures while we
      would need special handling for odd number of u32 words in the opposite
      direction.
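
The mapping between the two representations can be illustrated with a by-value conversion, which gives the same result regardless of byte order. Note that this is a different technique from the wrappers described above (which reinterpret the array); the function name is hypothetical and the sketch is only meant to show which u32 word each bit must end up in.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Copy the low nbits of an unsigned long bitmap into 32-bit words.
 * Bit n of the bitmap lands in bit (n % 32) of dst[n / 32]. */
void ulong_bitmap_to_u32(uint32_t *dst, const unsigned long *src, size_t nbits)
{
	size_t nwords = (nbits + 31) / 32;
	size_t i;

	assert(dst && src);
	for (i = 0; i < nwords; i++) {
		size_t bit = i * 32;
		unsigned long w = src[bit / BITS_PER_ULONG];

		dst[i] = (uint32_t)(w >> (bit % BITS_PER_ULONG));
	}
	/* Clear padding bits in the last word. */
	if (nbits % 32)
		dst[nwords - 1] &= (UINT32_C(1) << (nbits % 32)) - 1;
}
```

On a 64-bit machine each unsigned long yields two u32 words, low half first; an in-place reinterpretation only matches this layout on little endian, which is the asymmetry the message refers to.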
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: helper functions for netlink interface · 041b1c5d
      Michal Kubecek authored
      Add common request/reply header definition and helpers to parse request
      header and fill reply header. Provide ethnl_update_* helpers to update
      structure members from request attributes (to be used for *_SET requests).
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: introduce ethtool netlink interface · 2b4a8990
      Michal Kubecek authored
      Basic genetlink and init infrastructure for the netlink interface, register
      genetlink family "ethtool". Add CONFIG_ETHTOOL_NETLINK Kconfig option to
      make the build optional. Add initial overall interface description into
      Documentation/networking/ethtool-netlink.rst, further patches will add more
      detailed information.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: do trace_sctp_probe after SACK validation and check · 356b23c0
      Kevin Kou authored
      The function sctp_sf_eat_sack_6_2 first performs the Verification
      Tag validation, Chunk length validation, Bogus check, and also
      the detection of out-of-order SACK based on RFC2960
      Section 6.2, and only then performs the further
      processing of SACK. The trace_sctp_probe is currently triggered before
      the above necessary validation and checks.
      
      This patch moves trace_sctp_probe after the chunk sanity
      tests, but keeps tracing if the SACK received is out of order,
      because an out-of-order SACK is valuable for congestion control
      debugging.
      
      v1->v2:
       - keep doing SCTP trace if the SACK is out of order, as per Marcelo's
         suggestion.
      v2->v3:
       - regenerate the patch, as v2 was generated on top of v1, and add the
         'net-next' tag to the new one, as per Marcelo's comments.
      Signed-off-by: Kevin Kou <qdkevin.kou@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mv88e6xxx: Add serdes Rx statistics · 0df95287
      Nikita Yushchenko authored
      If packet checker is enabled in the serdes, then Rx counter registers
      start working, and no side effects have been detected.
      
      This patch enables packet checker automatically when powering serdes on,
      and exposes Rx counter registers via ethtool statistics interface.
      
      Code partially based on an older attempt by Andrew Lunn.
      Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ena: remove set but not used variable 'rx_ring' · cad451dd
      YueHaibing authored
      drivers/net/ethernet/amazon/ena/ena_netdev.c: In function ena_xdp_xmit_buff:
      drivers/net/ethernet/amazon/ena/ena_netdev.c:316:19: warning:
       variable rx_ring set but not used [-Wunused-but-set-variable]
      
      commit 548c4940 ("net: ena: Implement XDP_TX action")
      left behind this unused variable.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: qca: ar9331: drop pointless static qualifier in ar9331_sw_mbus_init · c8f957df
      Mao Wenan authored
      There is no need to make the variable 'mbus' static,
      since a new value is always assigned before it is used.
      Signed-off-by: Mao Wenan <maowenan@huawei.com>
      Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ppp: Remove redundant BUG_ON() check in ppp_pernet · 8a3f44a0
      Xu Wang authored
      Passing NULL to ppp_pernet causes a crash via BUG_ON.
      Dereferencing net in net_generic() also has the same effect.
      This patch removes the redundant BUG_ON check on the same parameter.
      Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp_cubic-various-fixes' · 36a78867
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp_cubic: various fixes
      
      This patch series converts tcp_cubic to usec clock resolution
      for Hystart logic.
      
      This makes Hystart more relevant for data-center flows.
      Prior to this series, Hystart was not kicking in, or was
      kicking in without good reason, since the 1ms clock was too coarse.
      
      Last patch also fixes an issue with Hystart vs TCP pacing.
      
      v2: removed a last-minute debug chunk from last patch
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_cubic: make Hystart aware of pacing · ede656e8
      Eric Dumazet authored
      For years we disabled Hystart ACK train detection at Google
      because it was fooled by TCP pacing.
      
      ACK train detection uses a simple heuristic, detecting if
      we receive ACK past half the RTT, to exit slow start before
      hitting the bottleneck and experiencing massive drops.
      
      But pacing by design might delay packets up to RTT/2,
      so we need to tweak the Hystart logic to be aware of this
      extra delay.
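
The heuristic and the pacing adjustment can be sketched as follows. The message does not spell out the exact threshold used by the patch; this sketch assumes the simplest reading, namely that the RTT/2 threshold is raised by the RTT/2 that pacing may add, i.e. to a full minimum RTT. Function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the current ACK train has lasted long enough to leave
 * slow start. All times are in microseconds. */
bool ack_train_detected(uint32_t now_us, uint32_t round_start_us,
			uint32_t delay_min_us, bool pacing_enabled)
{
	uint32_t threshold = delay_min_us;

	/* Without pacing, the heuristic fires at half the minimum RTT.
	 * With pacing, packets may already be spread over up to RTT/2,
	 * so (as an assumption) the full minimum RTT is used instead. */
	if (!pacing_enabled)
		threshold >>= 1;

	return (int32_t)(now_us - round_start_us) > (int32_t)threshold;
}
```

With delay_min_us = 1000 and 600 usec elapsed since the start of the round, the train is detected without pacing but not with it.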
      
      Tested:
       Added a 100 usec delay at receiver.
      
      Before:
      nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
         9117
         7057
         9553
         8300
         7030
         6849
         9533
        10126
         6876
         8473
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       1230               0.0
      
      After :
      nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
         9845
        10103
        10866
        11096
        11936
        11487
        11773
        12188
        11066
        11894
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       6462               0.0
      
      Disabling Hystart ACK Train detection gives similar numbers
      
      echo 2 >/sys/module/tcp_cubic/parameters/hystart_detect
      nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        11173
        10954
        12455
        10627
        11578
        11583
        11222
        10880
        10665
        11366
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_cubic: tweak Hystart detection for short RTT flows · 42f3a8aa
      Eric Dumazet authored
      After switching ca->delay_min to usec resolution, we exit
      slow start prematurely for very low RTT flows, setting
      snd_ssthresh to 20.
      
      The reason is that delay_min is fed with RTT of small packet
      trains. Then as cwnd is increased, TCP sends bigger TSO packets.
      
      LRO/GRO aggregation and/or interrupt mitigation strategies
      on receiver tend to inflate RTT samples.
      
      Fix this by adding to delay_min the expected delay of
      two TSO packets, given current pacing rate.
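
As a worked illustration of "the expected delay of two TSO packets, given current pacing rate", the arithmetic could look like this. The function names, the saturation at UINT32_MAX and the choice of TSO size are assumptions of the sketch; the patch itself may use different constants or caps.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL

/* Time in usec needed to send two TSO packets of tso_bytes each at
 * pacing_rate bytes per second. A zero rate yields no allowance. */
uint32_t two_tso_delay_us(uint64_t pacing_rate, uint32_t tso_bytes)
{
	uint64_t us;

	assert(tso_bytes > 0);
	if (pacing_rate == 0)
		return 0;
	us = 2ULL * tso_bytes * USEC_PER_SEC / pacing_rate;
	return us > UINT32_MAX ? UINT32_MAX : (uint32_t)us;
}

/* delay_min plus the allowance, saturating instead of wrapping. */
uint32_t adjusted_delay_min_us(uint32_t delay_min_us, uint64_t pacing_rate,
			       uint32_t tso_bytes)
{
	uint64_t sum = (uint64_t)delay_min_us +
		       two_tso_delay_us(pacing_rate, tso_bytes);

	return sum > UINT32_MAX ? UINT32_MAX : (uint32_t)sum;
}
```

At a pacing rate of 1.25e9 bytes/s (10 Gbit/s) and 64 KB TSO packets the allowance is about 104 usec, which is large compared to a data-center RTT and explains why it matters for short RTT flows.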
      
      Tested:
      
      Sender uses pfifo_fast qdisc
      
      Before :
      $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        11348
        11707
        11562
        11428
        11773
        11534
         9878
        11693
        10597
        10968
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       200                0.0
      
      After :
      $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        14877
        14517
        15797
        18466
        17376
        14833
        17558
        17933
        16039
        18059
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       1670               0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_cubic: switch bictcp_clock() to usec resolution · cff04e2d
      Eric Dumazet authored
      Current 1ms clock feeds ca->round_start, ca->delay_min,
      ca->last_ack.
      
      This is quite problematic for data-center flows, where delay_min
      is way below 1 ms.
      
      This means Hystart Train detection triggers every time jiffies value
      is updated, since "((s32)(now - ca->round_start) > ca->delay_min >> 4)"
      expression becomes true.
      
      This kind of random behavior can be solved by reusing the existing
      usec timestamp that TCP keeps in tp->tcp_mstamp
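
A small numeric sketch of why the quoted expression misbehaves with a 1 ms clock. It assumes delay_min is stored in clock units scaled by 8 (inferred from the ">> 4", which then means half the RTT); that scaling and the helper names are assumptions of this sketch, not statements about the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The train test quoted above; all arguments use the clock's own unit,
 * with delay_min assumed to be scaled by 8. */
bool train_test(uint32_t now, uint32_t round_start, uint32_t delay_min)
{
	return (int32_t)(now - round_start) > (int32_t)(delay_min >> 4);
}

/* Scale an RTT given in usec into delay_min for a clock with the given
 * number of usec per tick (1000 for a 1 ms clock, 1 for a usec clock). */
uint32_t scaled_delay_min(uint32_t rtt_us, uint32_t usec_per_tick)
{
	assert(usec_per_tick > 0);
	return (uint32_t)(((uint64_t)rtt_us << 3) / usec_per_tick);
}
```

For a 100 usec RTT, the 1 ms clock gives a threshold of zero, so the test becomes true as soon as the clock advances by one tick; with a usec clock the threshold is 50 usec and the test depends on the actual elapsed time.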
      
      Note that a followup patch will tweak things a bit, because
      during slow start, GRO aggregation on receivers naturally
      increases the RTT as TSO packets gradually come to ~64KB size.
      
      To recap, right after this patch CUBIC Hystart train detection
      is more aggressive, since short RTT flows might exit slow start at
      cwnd = 20, instead of being possibly unbounded.
      
      Following patch will address this problem.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_cubic: remove one conditional from hystart_update() · 35821fc2
      Eric Dumazet authored
      If we initialize ca->curr_rtt to ~0U, we do not need to test
      for zero value in hystart_update()
      
      We only read ca->curr_rtt if at least HYSTART_MIN_SAMPLES have
      been processed, and thus ca->curr_rtt will have a sane value.
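
The idea can be shown in isolation: with the minimum initialized to the largest unsigned value, tracking the smallest sample needs a single comparison. The struct and function names below are hypothetical; only the ~0U sentinel comes from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rtt_tracker {
	uint32_t curr_rtt;	/* smallest sample seen in this round */
	uint32_t samples;	/* number of samples processed */
};

void rtt_reset(struct rtt_tracker *t)
{
	assert(t != NULL);
	t->curr_rtt = ~0U;	/* sentinel: every real sample is smaller */
	t->samples = 0;
}

/* With the sentinel, a plain minimum is enough; no "still zero?" test. */
void rtt_sample(struct rtt_tracker *t, uint32_t rtt)
{
	assert(t != NULL);
	if (rtt < t->curr_rtt)
		t->curr_rtt = rtt;
	t->samples++;
}
```

The sentinel is only safe because, as the message notes, curr_rtt is read after a minimum number of samples has been processed.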
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_cubic: optimize hystart_update() · 473900a5
      Eric Dumazet authored
      We do not care which bit in ca->found is set.
      
      We avoid accessing hystart and hystart_detect unless really needed,
      possibly avoiding one cache line miss.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 27 Dec, 2019 2 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 2bbc078f
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2019-12-27
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      We've added 127 non-merge commits during the last 17 day(s) which contain
      a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).
      
      There are three merge conflicts. Conflicts and resolutions look as follows:
      
      1) Merge conflict in net/bpf/test_run.c:
      
      There was a tree-wide cleanup c593642c ("treewide: Use sizeof_field() macro")
      which gets in the way with b590cb5f ("bpf: Switch to offsetofend in
      BPF_PROG_TEST_RUN"):
      
        <<<<<<< HEAD
                if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
                                   sizeof_field(struct __sk_buff, priority),
        =======
                if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
        >>>>>>> 7c8dce4b
      
      There are a few occasions that look similar to this. Always take the chunk with
      offsetofend(). Note that there is one where the fields differ in here:
      
        <<<<<<< HEAD
                if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
                                   sizeof_field(struct __sk_buff, tstamp),
        =======
                if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
        >>>>>>> 7c8dce4b
      
      Just take the one with offsetofend() /and/ gso_segs. Latter is correct due to
      850a88cc ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").
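
For readers unfamiliar with the two spellings in this conflict, the following userspace sketch shows that offsetofend() and offsetof() + sizeof_field() give the same offset for the same field, so the resolution only matters where the field names differ. The macros are simplified stand-ins written for this example and struct demo_skb is a made-up layout, not struct __sk_buff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel macros of the same names. */
#define sizeof_field(TYPE, MEMBER) sizeof(((TYPE *)0)->MEMBER)
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof_field(TYPE, MEMBER))

/* Made-up layout, for illustration only. */
struct demo_skb {
	uint32_t len;
	uint32_t priority;
	uint64_t tstamp;
	uint32_t gso_segs;
};

/* Offset of the first byte after 'priority', checked both ways. */
size_t demo_end_of_priority(void)
{
	size_t end = offsetofend(struct demo_skb, priority);

	assert(end == offsetof(struct demo_skb, priority) +
		      sizeof_field(struct demo_skb, priority));
	return end;
}
```

In this made-up layout the end of tstamp and the end of gso_segs are different offsets, which mirrors why the second conflict hunk needs the gso_segs variant and not just the offsetofend() spelling.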
      
      2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:
      
      (I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)
      
        <<<<<<< HEAD
                if (is_13b_check(off, insn))
                        return -1;
                emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
        =======
                emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
        >>>>>>> 7c8dce4b
      
      Result should look like:
      
                emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);
      
      3) Merge conflict in arch/riscv/include/asm/pgtable.h:
      
        <<<<<<< HEAD
        =======
        #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
        #define VMALLOC_END      (PAGE_OFFSET - 1)
        #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
      
        #define BPF_JIT_REGION_SIZE     (SZ_128M)
        #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
        #define BPF_JIT_REGION_END      (VMALLOC_END)
      
        /*
         * Roughly size the vmemmap space to be large enough to fit enough
         * struct pages to map half the virtual address space. Then
         * position vmemmap directly below the VMALLOC region.
         */
        #define VMEMMAP_SHIFT \
                (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
        #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
        #define VMEMMAP_END     (VMALLOC_START - 1)
        #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)
      
        #define vmemmap         ((struct page *)VMEMMAP_START)
      
        >>>>>>> 7c8dce4b
      
      Only take the BPF_* defines from there and move them higher up in the
      same file. Remove the rest from the chunk. The VMALLOC_* etc defines
      got moved via 01f52e16 ("riscv: define vmemmap before pfn_to_page
      calls"). Result:
      
        [...]
        #define __S101  PAGE_READ_EXEC
        #define __S110  PAGE_SHARED_EXEC
        #define __S111  PAGE_SHARED_EXEC
      
        #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
        #define VMALLOC_END      (PAGE_OFFSET - 1)
        #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
      
        #define BPF_JIT_REGION_SIZE     (SZ_128M)
        #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
        #define BPF_JIT_REGION_END      (VMALLOC_END)
      
        /*
         * Roughly size the vmemmap space to be large enough to fit enough
         * struct pages to map half the virtual address space. Then
         * position vmemmap directly below the VMALLOC region.
         */
        #define VMEMMAP_SHIFT \
                (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
        #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
        #define VMEMMAP_END     (VMALLOC_START - 1)
        #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)
      
        [...]
      
      Let me know if there are any other issues.
      
      Anyway, the main changes are:
      
      1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific
         to a provided BPF object file. This provides an alternative, simplified API
         compared to standard libbpf interaction. Also, add libbpf extern variable
         resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.
      
      2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by
         generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also,
         add various BPF riscv JIT improvements, from Björn Töpel.
      
      3) Extend bpftool to allow matching BPF programs and maps by name,
         from Paul Chaignon.
      
      4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI
         flag for allowing updates without service interruption, from Andrey Ignatov.
      
      5) Cleanup and simplification of ring access functions for AF_XDP with a
         bonus of 0-5% performance improvement, from Magnus Karlsson.
      
      6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of
         audit support for BPF, from Daniel Borkmann and latter with Jiri Olsa.
      
      7) Move and extend test_select_reuseport into BPF program tests under
         BPF selftests, from Jakub Sitnicki.
      
      8) Various BPF sample improvements for xdpsock for customizing parameters
         to set up and benchmark AF_XDP, from Jay Jayatheerthan.
      
      9) Improve libbpf to provide a ulimit hint on permission denied errors.
         Also change XDP sample programs to attach in driver mode by default,
         from Toke Høiland-Jørgensen.
      
      10) Extend BPF test infrastructure to allow changing skb mark from tc BPF
          programs, from Nikita V. Shirokov.
      
      11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.
      
      12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after
          libbpf conversion, from Jesper Dangaard Brouer.
      
      13) Minor misc improvements from various others.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpftool: Make skeleton C code compilable with C++ compiler · 7c8dce4b
      Andrii Nakryiko authored
      When auto-generated BPF skeleton C code is included from a C++ application, it
      triggers a compilation error due to void * being implicitly cast to whatever
      target pointer type. This is supported by C, but not C++. To solve this
      problem, add explicit casts, where necessary.
      
      To ensure issues like this are captured going forward, add skeleton usage in
      test_cpp test.
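
A minimal example of the kind of cast involved. The struct and function names are hypothetical and unrelated to the generated skeleton; the point is only that the explicit cast on the allocation result is redundant in C but required for the same source to compile as C++.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical structure standing in for a generated skeleton. */
struct demo_skeleton {
	int map_cnt;
	int prog_cnt;
};

/* calloc() returns void *. C converts it implicitly; C++ does not, so the
 * explicit cast keeps this function valid in both languages. */
struct demo_skeleton *demo_skeleton_new(void)
{
	struct demo_skeleton *s;

	s = (struct demo_skeleton *)calloc(1, sizeof(*s));
	return s;	/* NULL on allocation failure */
}

/* Example use; returns 0 on success, -1 if allocation failed. */
int demo_skeleton_selfcheck(void)
{
	struct demo_skeleton *s = demo_skeleton_new();

	if (!s)
		return -1;
	assert(s->map_cnt == 0 && s->prog_cnt == 0);	/* calloc zero-fills */
	free(s);
	return 0;
}
```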
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191226210253.3132060-1-andriin@fb.com
  3. 26 Dec, 2019 19 commits
    • Merge branch 's390-qeth-next' · 9e41fbf3
      David S. Miller authored
      Julian Wiedmann says:
      
      ====================
      s390/qeth: updates 2019-12-23
      
      please apply the following patch series for qeth to your net-next tree.
      
      This reworks the RX code to use napi_gro_frags() when building non-linear
      skbs, along with some consolidation and cleanups.
      
      Happy holidays - and many thanks for all the effort & support over the past
      year, to both Jakub and you. It's much appreciated.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/qeth: remove QETH_RX_PULL_LEN · 8ca8559f
      Julian Wiedmann authored
      Since commit f677fcb9 ("s390/qeth: ensure linear access to packet headers"),
      the CQ-specific skbs are allocated with a slightly bigger linear part
      than necessary. Shrink it down to the maximum that's needed by
      qeth_extract_skb().
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/qeth: use napi_gro_frags() for SG skbs · dcdcf867
      Julian Wiedmann authored
      For non-linear packets, get the skb for attaching the page fragments
      from napi_get_frags() so that it can be recycled during GRO.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/qeth: consolidate RX code · c04b116a
      Julian Wiedmann authored
      To reduce the path length and levels of indirection, move the RX
      processing from the sub-drivers into the core.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_packet: refactoring code for prb_calc_retire_blk_tmo · 0914d2bb
      Mao Wenan authored
      If __ethtool_get_link_ksettings() fails and returns a
      non-zero value, prb_calc_retire_blk_tmo() should return
      DEFAULT_PRB_RETIRE_TOV first.
      
      This patch refactors the code to make it more readable.
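
The early-return shape described above might look like the following userspace sketch. Only the structure (bail out with the default as soon as the query fails) reflects the message; the function name, the value of the default constant and the timeout arithmetic after the checks are placeholder assumptions.

```c
#include <assert.h>
#include <limits.h>

#define DEFAULT_TOV_SKETCH 8u	/* stand-in for DEFAULT_PRB_RETIRE_TOV; value assumed */

/* err:        result of the link settings query (0 on success)
 * speed_mbps: link speed reported by the query
 * blk_bytes:  block size in bytes */
unsigned int retire_tmo_sketch(int err, unsigned int speed_mbps,
			       unsigned int blk_bytes)
{
	unsigned int mbits, div;

	if (err)
		return DEFAULT_TOV_SKETCH;	/* query failed: bail out first */
	if (speed_mbps < 1000)
		return DEFAULT_TOV_SKETCH;	/* slow link: default is fine */

	assert(blk_bytes <= UINT_MAX / 8);	/* keep the multiply in range */
	div = speed_mbps / 1000;
	mbits = (blk_bytes * 8u) / (1024u * 1024u);
	return mbits / div + 1;			/* placeholder arithmetic */
}
```

Handling the failure before any computation keeps the main path free of nested conditions, which is the readability gain the message is after.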
      Signed-off-by: Mao Wenan <maowenan@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen-netback: support dynamic unbind/bind · 9476654b
      Paul Durrant authored
      By re-attaching RX, TX, and CTL rings during connect() rather than
      assuming they are freshly allocated (i.e. assuming the counters are zero),
      and avoiding forcing state to Closed in netback_remove() it is possible
      for vif instances to be unbound and re-bound from and to (respectively) a
      running guest.
      
      Dynamic unbind/bind is a highly useful feature for a backend module as it
      allows it to be unloaded and re-loaded (i.e. updated) without requiring
      domUs to be halted.
      
      This has been tested by running iperf as a server in the test VM and
      then running a client against it in a continuous loop, whilst also
      running:
      
      while true;
        do echo vif-$DOMID-$VIF >unbind;
        echo down;
        rmmod xen-netback;
        echo unloaded;
        modprobe xen-netback;
        cd $(pwd);
        brctl addif xenbr0 vif$DOMID.$VIF;
        ip link set vif$DOMID.$VIF up;
        echo up;
        sleep 5;
        done
      
      in dom0 from /sys/bus/xen-backend/drivers/vif to continuously unbind,
      unload, re-load, re-bind and re-plumb the backend.
      
      Clearly a performance drop was seen but no TCP connection resets were
      observed during this test and moreover a parallel SSH connection into the
      guest remained perfectly usable throughout.
      Signed-off-by: Paul Durrant <pdurrant@amazon.com>
      Reviewed-by: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9476654b
    • David S. Miller's avatar
      Merge branch 'RTL8211F-RGMII-RX-TX-delay-configuration-improvements' · 8d347992
      David S. Miller authored
      Martin Blumenstingl says:
      
      ====================
      RTL8211F: RGMII RX/TX delay configuration improvements
      
      In discussion with Andrew [0] we figured out that it would be best to
      make the RX delay of the RTL8211F PHY configurable (just like the TX
      delay is already configurable).
      
      While here I took the opportunity to add some logging to the TX delay
      configuration as well.
      
      There is no public documentation for the RX and TX delay registers.
      I received this information a while ago (and created this RFC patch
      back then: [1]). Realtek gave me permission to take the information
      from the datasheet extracts, phrase it in my own words and publish
      that (I am not allowed to publish the datasheet extracts).
      
      I have tested these patches on two boards:
      - Amlogic Meson8b Odroid-C1
      - Amlogic GXM Khadas VIM2
      Both still behave as before these changes (iperf3 speeds are the same
      in both directions: RX and TX), which is expected because they are
      currently using phy-mode = "rgmii" with the RX delay not being generated
      by the PHY.
      
      [0] https://patchwork.ozlabs.org/patch/1215313/
      [1] https://patchwork.ozlabs.org/patch/843946/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8d347992
    • Martin Blumenstingl's avatar
      net: phy: realtek: add support for configuring the RX delay on RTL8211F · 1b3047b5
      Martin Blumenstingl authored
      On RTL8211F the RX and TX delays (2ns) can be configured in two ways:
      - pin strapping (RXD1 for the TX delay and RXD0 for the RX delay, LOW
        means "off" and HIGH means "on") which is read during PHY reset
      - using software to configure the TX and RX delay registers
      
      So far only the configuration using pin strapping has been supported.
      Add support for enabling or disabling the RGMII RX delay based on the
      phy-mode to be able to get the RX delay into a known state. This is
      important because the RX delay has to be coordinated between the PHY,
      MAC and the PCB design (trace length). With an invalid RX delay applied
      (for example if both PHY and MAC add a 2ns RX delay) Ethernet may not
      work at all.
      
      Also add debug logging when configuring the RX delay (just like the TX
      delay) because this is a common source of problems.
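
The relationship between phy-mode and the PHY-generated delays can be sketched as below. This is a user-space illustration under stated assumptions: the enum and function names are hypothetical, and the mapping follows the common convention for the four RGMII phy-mode strings rather than the driver's actual register code.

```c
#include <stdbool.h>

/* Hypothetical mirror of the four RGMII phy-mode variants. */
enum rgmii_mode {
	RGMII,      /* "rgmii": the PHY adds no delay */
	RGMII_ID,   /* "rgmii-id": the PHY adds both RX and TX delay */
	RGMII_RXID, /* "rgmii-rxid": the PHY adds the RX delay only */
	RGMII_TXID, /* "rgmii-txid": the PHY adds the TX delay only */
};

struct rgmii_delays {
	bool rx;
	bool tx;
};

/* Which delays the PHY should generate for a given phy-mode. */
static struct rgmii_delays rgmii_phy_delays(enum rgmii_mode mode)
{
	struct rgmii_delays d = { false, false };

	switch (mode) {
	case RGMII_ID:
		d.rx = true;
		d.tx = true;
		break;
	case RGMII_RXID:
		d.rx = true;
		break;
	case RGMII_TXID:
		d.tx = true;
		break;
	case RGMII:
		break;
	}
	return d;
}

/* True when both the PHY and the MAC would add the RX delay. */
static bool rx_delay_conflict(enum rgmii_mode mode, bool mac_adds_rx_delay)
{
	return rgmii_phy_delays(mode).rx && mac_adds_rx_delay;
}
```

A check like rx_delay_conflict() corresponds to the failure case named in the commit message, where PHY and MAC each add a 2ns RX delay.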
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1b3047b5
    • Martin Blumenstingl's avatar
      net: phy: realtek: add logging for the RGMII TX delay configuration · 3aec743d
      Martin Blumenstingl authored
      RGMII requires a delay of 2ns between the data and the clock signal.
      There are at least three ways this can happen. One possibility is by
      having the PHY generate this delay.
      This is a common source of problems (for example with slow TX speeds or
      packet loss when sending data). The TX delay configuration of the
      RTL8211F PHY can be set either by pin-strapping the RXD1 pin (HIGH
      means enabled, LOW means disabled) or through configuring a paged
      register. The setting from the RXD1 pin is also reflected in the
      register.
      
      Add debug logging to the TX delay configuration on RTL8211F so it's
      easier to spot these issues (for example if the TX delay is enabled for
      both the RTL8211F PHY and the MAC).
      This is especially helpful because there is no public datasheet for the
      RTL8211F PHY available with all the RX/TX delay specifics.
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3aec743d
    • David S. Miller's avatar
      Merge branch 'mlxsw-spectrum_router-Cleanups' · 1f4f16fa
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: spectrum_router: Cleanups
      
      This patch set removes from mlxsw code that is no longer necessary after
      the simplification of the IPv4 and IPv6 route offload API.
      
      The patches eliminate unnecessary code by taking advantage of the fact
      that mlxsw no longer needs to maintain a list of identical routes,
      following recent changes in route offload API.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f4f16fa
    • Ido Schimmel's avatar
      mlxsw: spectrum_router: Remove FIB entry list from FIB node · 7c4a7ec8
      Ido Schimmel authored
      As explained in previous patches, the driver no longer needs to maintain
      a list of identical FIB entries (i.e., same {tb_id, prefix, prefix
      length}) and therefore each FIB node can only store one FIB entry.
      
      Remove the FIB entry list and simplify the code.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c4a7ec8
    • Ido Schimmel's avatar
      mlxsw: spectrum_router: Consolidate identical functions · b04720ae
      Ido Schimmel authored
      After the last patch mlxsw_sp_fib{4,6}_node_entry_link() and
      mlxsw_sp_fib{4,6}_node_entry_unlink() are identical and can therefore be
      consolidated into the same common function.
      
      Perform the consolidation.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b04720ae
    • Ido Schimmel's avatar
      mlxsw: spectrum_router: Make route creation and destruction symmetric · 0705297e
      Ido Schimmel authored
      Host routes that perform decapsulation of IP in IP tunnels have a
      special adjacency entry linked to them. This entry stores information
      such as the expected underlay source IP. When the route is deleted this
      entry needs to be freed.
      
      The allocation of the adjacency entry happens in
      mlxsw_sp_fib4_entry_type_set(), but it is freed in
      mlxsw_sp_fib4_node_entry_unlink().
      
      Create a new function - mlxsw_sp_fib4_entry_type_unset() - and free the
      adjacency entry there.
      
      This will allow us to consolidate mlxsw_sp_fib{4,6}_node_entry_unlink()
      in the next patch.
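
The symmetry argued for above can be sketched as a set/unset pair in which the function that allocates the extra state is mirrored by one that frees it. This is a user-space illustration only: every type and function name here is hypothetical and stands in for the mlxsw structures, which are not shown in the commit message.

```c
#include <stdlib.h>

/* Hypothetical entry types; only the decap type carries extra state. */
enum entry_type {
	ENTRY_LOCAL,
	ENTRY_IPIP_DECAP,
};

/* Stand-in for the special adjacency entry of a decap route. */
struct decap_info {
	unsigned int underlay_src_ip;
};

struct fib_entry {
	enum entry_type type;
	struct decap_info *decap;
};

/* Sets the entry type and allocates the decap state when it is needed. */
static int fib_entry_type_set(struct fib_entry *e, enum entry_type type,
			      unsigned int underlay_src_ip)
{
	e->type = type;
	e->decap = NULL;
	if (type != ENTRY_IPIP_DECAP)
		return 0;

	e->decap = malloc(sizeof(*e->decap));
	if (!e->decap)
		return -1;
	e->decap->underlay_src_ip = underlay_src_ip;
	return 0;
}

/* Mirror of fib_entry_type_set(): frees exactly what the setter allocated. */
static void fib_entry_type_unset(struct fib_entry *e)
{
	if (e->type == ENTRY_IPIP_DECAP) {
		free(e->decap);
		e->decap = NULL;
	}
}
```

With the free moved into the unset function, the unlink path no longer needs any type-specific cleanup, which is what makes a shared IPv4/IPv6 unlink helper possible.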
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0705297e
    • Ido Schimmel's avatar
      mlxsw: spectrum_router: Eliminate dead code · 0d2fb5aa
      Ido Schimmel authored
      Since the driver no longer maintains a list of identical routes there is
      no route to promote when a route is deleted.
      
      Remove the code that took care of it.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d2fb5aa
    • Ido Schimmel's avatar
      mlxsw: spectrum_router: Remove unnecessary checks · 231c8d2b
      Ido Schimmel authored
      Now that the networking stack takes care of only notifying the routes of
      interest, we do not need to maintain a list of identical routes.
      
      Remove the check that tests if the route is the first route in the FIB
      node.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      231c8d2b
    • Andy Roulin's avatar
      bonding: rename AD_STATE_* to LACP_STATE_* · c1e46990
      Andy Roulin authored
      As the LACP actor/partner state is now part of the uapi, rename the
      3ad state defines with LACP prefix. The LACP prefix is preferred over
      BOND_3AD as the LACP standard moved to 802.1AX.
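
As an illustration of how user space might consume the renamed defines, the sketch below decodes an actor or partner state byte into flag names. The macro names follow the LACP_STATE_* pattern from the commit title, but the exact names and values used here are assumptions and should be checked against the uapi bonding header; lacp_state_format() is a hypothetical helper.

```c
#include <stdio.h>
#include <string.h>

/* Assumed names and values; verify against the uapi header before use. */
#define LACP_STATE_LACP_ACTIVITY   0x01
#define LACP_STATE_LACP_TIMEOUT    0x02
#define LACP_STATE_AGGREGATION     0x04
#define LACP_STATE_SYNCHRONIZATION 0x08
#define LACP_STATE_COLLECTING      0x10
#define LACP_STATE_DISTRIBUTING    0x20
#define LACP_STATE_DEFAULTED       0x40
#define LACP_STATE_EXPIRED         0x80

struct lacp_bit {
	unsigned char mask;
	const char *name;
};

static const struct lacp_bit lacp_bits[] = {
	{ LACP_STATE_LACP_ACTIVITY,   "activity" },
	{ LACP_STATE_LACP_TIMEOUT,    "timeout" },
	{ LACP_STATE_AGGREGATION,     "aggregation" },
	{ LACP_STATE_SYNCHRONIZATION, "synchronization" },
	{ LACP_STATE_COLLECTING,      "collecting" },
	{ LACP_STATE_DISTRIBUTING,    "distributing" },
	{ LACP_STATE_DEFAULTED,       "defaulted" },
	{ LACP_STATE_EXPIRED,         "expired" },
};

/* Writes a comma-separated list of set flags; returns -1 if buf is too small. */
static int lacp_state_format(unsigned char state, char *buf, size_t len)
{
	size_t used = 0;
	size_t i;

	if (len == 0)
		return -1;
	buf[0] = '\0';

	for (i = 0; i < sizeof(lacp_bits) / sizeof(lacp_bits[0]); i++) {
		int n;

		if (!(state & lacp_bits[i].mask))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "," : "", lacp_bits[i].name);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```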
      
      Fixes: 826f66b3 ("bonding: move 802.3ad port state flags to uapi")
      Signed-off-by: Andy Roulin <aroulin@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c1e46990
    • Kevin Kou's avatar
      sctp: move trace_sctp_probe_path into sctp_outq_sack · f643ee29
      Kevin Kou authored
      The original patch that introduced the "SCTP ACK tracking trace event"
      feature was committed on Dec. 20, 2017. It replaced jprobe usage
      with trace events and introduced two of them:
      TRACE_EVENT(sctp_probe) and TRACE_EVENT(sctp_probe_path).
      The original patch intended to trigger trace_sctp_probe_path from
      within TRACE_EVENT(sctp_probe), as in the code below:
      
      +TRACE_EVENT(sctp_probe,
      +
      +	TP_PROTO(const struct sctp_endpoint *ep,
      +		 const struct sctp_association *asoc,
      +		 struct sctp_chunk *chunk),
      +
      +	TP_ARGS(ep, asoc, chunk),
      +
      +	TP_STRUCT__entry(
      +		__field(__u64, asoc)
      +		__field(__u32, mark)
      +		__field(__u16, bind_port)
      +		__field(__u16, peer_port)
      +		__field(__u32, pathmtu)
      +		__field(__u32, rwnd)
      +		__field(__u16, unack_data)
      +	),
      +
      +	TP_fast_assign(
      +		struct sk_buff *skb = chunk->skb;
      +
      +		__entry->asoc = (unsigned long)asoc;
      +		__entry->mark = skb->mark;
      +		__entry->bind_port = ep->base.bind_addr.port;
      +		__entry->peer_port = asoc->peer.port;
      +		__entry->pathmtu = asoc->pathmtu;
      +		__entry->rwnd = asoc->peer.rwnd;
      +		__entry->unack_data = asoc->unack_data;
      +
      +		if (trace_sctp_probe_path_enabled()) {
      +			struct sctp_transport *sp;
      +
      +			list_for_each_entry(sp, &asoc->peer.transport_addr_list,
      +					    transports) {
      +				trace_sctp_probe_path(sp, asoc);
      +			}
      +		}
      +	),
      
      But when I tested it, this did not work: trace_sctp_probe_path
      produced no output. I eventually found that there is a trace buffer lock
      operation (trace_event_buffer_reserve) in include/trace/trace_events.h:
      
      static notrace void							\
      trace_event_raw_event_##call(void *__data, proto)			\
      {									\
      	struct trace_event_file *trace_file = __data;			\
      	struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
      	struct trace_event_buffer fbuffer;				\
      	struct trace_event_raw_##call *entry;				\
      	int __data_size;						\
      									\
      	if (trace_trigger_soft_disabled(trace_file))			\
      		return;							\
      									\
      	__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
      									\
      	entry = trace_event_buffer_reserve(&fbuffer, trace_file,	\
      				 sizeof(*entry) + __data_size);		\
      									\
      	if (!entry)							\
      		return;							\
      									\
      	tstruct								\
      									\
      	{ assign; }							\
      									\
      	trace_event_buffer_commit(&fbuffer);				\
      }
      
      trace_sctp_probe_path produces no output because it is called from the
      TP_fast_assign part of TRACE_EVENT(sctp_probe), which the macro
      expansion places ( { assign; } ) after the call to
      trace_event_buffer_reserve():
      
              entry = trace_event_buffer_reserve(&fbuffer, trace_file,        \
                                       sizeof(*entry) + __data_size);         \
                                                                              \
              if (!entry)                                                     \
                      return;                                                 \
                                                                              \
              tstruct                                                         \
                                                                              \
              { assign; }                                                     \
      
      As a result, trace_sctp_probe_path cannot acquire the trace_event_buffer
      and produces no output; in other words, nesting tracepoint entry functions
      is not allowed. The function call flow is:
      
      trace_sctp_probe()
      -> trace_event_raw_event_sctp_probe()
       -> lock buffer
       -> trace_sctp_probe_path()
         -> trace_event_raw_event_sctp_probe_path()  --nested
         -> buffer has been locked and return no output.
      
      This patch removes trace_sctp_probe_path from the TP_fast_assign
      part of TRACE_EVENT(sctp_probe) to avoid nesting entry functions,
      and triggers trace_sctp_probe_path in sctp_outq_sack instead.
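
The nesting problem can be modelled in user space with a buffer that refuses a second reservation while one is outstanding. This is a simplified sketch of the behaviour as the commit message describes it, not of the real ring buffer: struct trace_buf and all function names are hypothetical.

```c
#include <stdbool.h>

/* Toy model of a trace buffer that allows one outstanding reservation. */
struct trace_buf {
	bool reserved;  /* a reservation is currently held */
	int committed;  /* number of events written so far */
};

static bool buf_reserve(struct trace_buf *b)
{
	if (b->reserved)
		return false;
	b->reserved = true;
	return true;
}

static void buf_commit(struct trace_buf *b)
{
	b->committed++;
	b->reserved = false;
}

/* Models trace_sctp_probe_path(): silently dropped if reserve fails. */
static void emit_probe_path(struct trace_buf *b)
{
	if (!buf_reserve(b))
		return;
	buf_commit(b);
}

/* Old shape: the per-path events fire inside the outer event's assign step. */
static void emit_probe_nested(struct trace_buf *b, int npaths)
{
	int i;

	if (!buf_reserve(b))
		return;
	for (i = 0; i < npaths; i++)
		emit_probe_path(b); /* buffer already reserved: no output */
	buf_commit(b);
}

/* New shape: the outer event completes first, then each path event fires. */
static void emit_probe_flat(struct trace_buf *b, int npaths)
{
	int i;

	if (buf_reserve(b))
		buf_commit(b);
	for (i = 0; i < npaths; i++)
		emit_probe_path(b);
}
```

In this model the nested variant records only the outer event, while the flat variant records the outer event plus one event per path.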
      
      After this patch, you can enable both events individually:
        # cd /sys/kernel/debug/tracing
        # echo 1 > events/sctp/sctp_probe/enable
        # echo 1 > events/sctp/sctp_probe_path/enable
      
      Or, you can enable all the events under sctp:
      
        # echo 1 > events/sctp/enable
      Signed-off-by: Kevin Kou <qdkevin.kou@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f643ee29
    • Hechao Li's avatar
      bpf: Print error message for bpftool cgroup show · 1162f844
      Hechao Li authored
      Currently, when bpftool cgroup show <path> has an error, no error
      message is printed. This is confusing because the user may think the
      result is empty.
      
      Before the change:
      
      $ bpftool cgroup show /sys/fs/cgroup
      ID       AttachType      AttachFlags     Name
      $ echo $?
      255
      
      After the change:
      $ ./bpftool cgroup show /sys/fs/cgroup
      Error: can't query bpf programs attached to /sys/fs/cgroup: Operation
      not permitted
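
The pattern behind this fix, reporting the errno text instead of returning silently, can be sketched as follows. This is not bpftool's code: query_attached_progs() is a hypothetical stand-in for the real query, and the message is formatted into a caller-supplied buffer so the sketch stays self-contained.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the query; fails with EPERM when not allowed. */
static int query_attached_progs(const char *path, int allowed)
{
	(void)path;
	if (!allowed) {
		errno = EPERM;
		return -1;
	}
	return 0;
}

/* Returns 0 on success; on failure returns -1 and fills err with a message. */
static int show_cgroup(const char *path, int allowed, char *err, size_t len)
{
	if (query_attached_progs(path, allowed) < 0) {
		/* Report why the query failed rather than printing nothing. */
		snprintf(err, len,
			 "Error: can't query bpf programs attached to %s: %s",
			 path, strerror(errno));
		return -1;
	}
	err[0] = '\0';
	return 0;
}
```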
      
      v2: Rename check_query_cgroup_progs to cgroup_has_attached_progs
      Signed-off-by: Hechao Li <hechaol@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191224011742.3714301-1-hechaol@fb.com
      1162f844
    • Andrii Nakryiko's avatar
      libbpf: Support CO-RE relocations for LDX/ST/STX instructions · 8ab9da57
      Andrii Nakryiko authored
      Clang patch [0] enables emitting relocatable generic ALU/ALU64 instructions
      (i.e., shifts and arithmetic operations), as well as generic load/store
      instructions. The former ones are already supported by libbpf as is. This
      patch adds further support for load/store instructions. The relocatable field
      offset is encoded in the BPF instruction's 16-bit offset section and is adjusted
      by libbpf based on target kernel BTF.
      
      These Clang changes and corresponding libbpf changes allow for more succinct
      generated BPF code by encoding relocatable field reads as a single
      ST/LDX/STX instruction. It also enables relocatable access to BPF context.
      Previously, if context struct (e.g., __sk_buff) was accessed with CO-RE
      relocations (e.g., due to preserve_access_index attribute), it would be
      rejected by BPF verifier due to modified context pointer dereference. With
      Clang patch, such context accesses are both relocatable and have a fixed
      offset from the point of view of BPF verifier.
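
The offset adjustment described above can be sketched as patching the 16-bit off field of a load/store instruction. This is a minimal sketch, not libbpf's implementation: struct insn is a local mirror of the BPF instruction layout, relocate_field_offset() is a hypothetical helper, and the validation rules (matching the recorded local offset, fitting into 16 bits) are assumptions about what a careful relocation step would check.

```c
#include <stdint.h>

/* Local mirror of the BPF instruction layout, for illustration only. */
struct insn {
	uint8_t code;
	uint8_t dst_reg:4;
	uint8_t src_reg:4;
	int16_t off;
	int32_t imm;
};

/* Instruction class is held in the low three bits of the opcode. */
#define INSN_CLASS(code) ((code) & 0x07)
#define CLASS_LDX 0x01
#define CLASS_ST  0x02
#define CLASS_STX 0x03

/*
 * Rewrites insn->off from the field offset seen at compile time (local_off)
 * to the offset found in the target kernel (target_off).
 * Returns 0 on success and -1 if the instruction cannot be relocated.
 */
static int relocate_field_offset(struct insn *insn, uint32_t local_off,
				 uint32_t target_off)
{
	uint8_t cls = INSN_CLASS(insn->code);

	/* Only load/store classes carry a field offset in 'off'. */
	if (cls != CLASS_LDX && cls != CLASS_ST && cls != CLASS_STX)
		return -1;

	/* The instruction must still hold the offset the relocation recorded. */
	if (insn->off < 0 || (uint32_t)insn->off != local_off)
		return -1;

	/* The new offset has to fit into the signed 16-bit field. */
	if (target_off > INT16_MAX)
		return -1;

	insn->off = (int16_t)target_off;
	return 0;
}
```

Because only the off field changes, the access remains a single instruction with a fixed offset, which matches the commit's point about what the verifier sees.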
      
        [0] https://reviews.llvm.org/D71790
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20191223180305.86417-1-andriin@fb.com
      8ab9da57