1. 03 Jun, 2019 14 commits
    • r8169: remove struct jumbo_ops · 485bb1b3
      Heiner Kallweit authored
      The jumbo_ops are used in just one place, so we can simplify the code
      and avoid the penalty of indirect calls in times of retpoline.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      485bb1b3
    • r8169: remove struct mdio_ops · 5f950523
      Heiner Kallweit authored
      The mdio_ops are used in just one place, so we can simplify the code
      and avoid the penalty of indirect calls in times of retpoline.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5f950523
    • r8169: improve r8169_csum_workaround · 0b12c73a
      Heiner Kallweit authored
      Use helper skb_is_gso() and simplify access to tx_dropped.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b12c73a
    • net: ethernet: improve eth_platform_get_mac_address · db4bad07
      Heiner Kallweit authored
      pci_device_to_OF_node(to_pci_dev(dev)) is the same as dev->of_node,
      so we can simplify the code. In addition add an empty line before
      the return statement.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      db4bad07
    • Merge branch 'ifa_list-RCU' · feb3cf2e
      David S. Miller authored
      Florian Westphal says:
      
      ====================
      net: add rcu annotations for ifa_list
      
      v3: fix typo in patch1 commit message
          All other patches are unchanged.
      v2: remove ifa_list iteration in afs instead of conversion
      
      Eric Dumazet reported the following problem:
      
        It looks that unless RTNL is held, accessing ifa_list needs proper RCU
        protection.  indev->ifa_list can be changed under us by another cpu
        (which owns RTNL) [..]
      
        A proper rcu_dereference() with an happy sparse support would require
        adding __rcu attribute.
      
      This patch series does that: add __rcu to the ifa_list pointers.
      That makes sparse complain, so the series also adds the required
      rcu_assign_pointer/dereference helpers where needed.
      
      All patches except the last one are preparation work.
      Two new macros are introduced for in_ifaddr walks.
      
      Last patch adds the __rcu annotations and the assign_pointer/dereference
      helper calls.
      
      This patch is a bit large, but I found no better way -- other
      approaches (annotate-first or add helpers-first) all result in
      mid-series sparse warnings.
      
      This series is submitted vs. net-next rather than net for several
      reasons:
      
      1. It's (mostly) compile-tested only
      2. 3rd patch changes behaviour wrt. secondary addresses
         (see changelog)
      3. The problem has existed for a very long time (since 2004), so it
         doesn't seem to be urgent to fix this -- rcu use to free ifa_list
         predates the git era.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      feb3cf2e
    • net: ipv4: provide __rcu annotation for ifa_list · 2638eb8b
      Florian Westphal authored
      ifa_list is protected by rcu, yet code doesn't reflect this.
      
      Add the __rcu annotations and fix up all places that are now reported by
      sparse.
      
      I've done this in the same commit to not add intermediate patches that
      result in new warnings.
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2638eb8b
    • drivers: use in_dev_for_each_ifa_rtnl/rcu · cb8f1478
      Florian Westphal authored
      Like previous patches, use the new iterator macros to avoid sparse
      warnings once proper __rcu annotations are added.
      
      Compile tested only.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cb8f1478
    • net: use new in_dev_ifa iterators · cd5a411d
      Florian Westphal authored
      Use in_dev_for_each_ifa_rcu/rtnl instead.
      This prevents sparse warnings once proper __rcu annotations are added.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd5a411d
    • netfilter: use in_dev_for_each_ifa_rcu · b8d19572
      Florian Westphal authored
      Netfilter hooks are always running under rcu read lock, use
      the new iterator macro so sparse won't complain once we add
      proper __rcu annotations.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b8d19572
    • devinet: use in_dev_for_each_ifa_rcu in more places · d519e870
      Florian Westphal authored
      This also replaces spots that used for_primary_ifa().
      
      for_primary_ifa() aborts the loop on the first secondary address seen.
      
      Replace it with either the rcu or rtnl variant of in_dev_for_each_ifa(),
      but two places will now also consider secondary addresses:
      inet_addr_onlink() and inet_ifa_byprefix().
      
      I do not understand why they should ignore secondary addresses.
      
      Why would a secondary address not be considered 'on link'?
      When matching a prefix, why ignore a matching secondary address?
      
      Other places get converted as well, but gain "->flags & SECONDARY" check.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d519e870
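
The difference between the old and new walks described in d519e870 can be modelled in a few lines. This is a toy Python sketch, not kernel code: the `Ifa` class and the three generator functions are hypothetical stand-ins for struct in_ifaddr, for_primary_ifa() and in_dev_for_each_ifa_rcu/rtnl(), and it assumes that primary addresses are ordered before secondary ones in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ifa:
    """Hypothetical stand-in for struct in_ifaddr (illustration only)."""
    address: str
    secondary: bool = False


def for_primary_ifa(ifa_list):
    """Model of the old walk: abort on the first secondary address seen."""
    for ifa in ifa_list:
        if ifa.secondary:
            break
        yield ifa


def for_each_ifa(ifa_list):
    """Model of the new iterators: visit every address on the list."""
    yield from ifa_list


def for_each_ifa_skip_secondary(ifa_list):
    """Model of a converted call site that gained a SECONDARY flag check."""
    for ifa in for_each_ifa(ifa_list):
        if ifa.secondary:
            continue
        yield ifa


# Assumption: primaries come first, secondaries follow.
ifa_list = [
    Ifa("192.0.2.1"),
    Ifa("198.51.100.1"),
    Ifa("192.0.2.2", secondary=True),
]

old_walk = [ifa.address for ifa in for_primary_ifa(ifa_list)]
new_walk = [ifa.address for ifa in for_each_ifa(ifa_list)]
filtered_walk = [ifa.address for ifa in for_each_ifa_skip_secondary(ifa_list)]
```

Under that ordering assumption, `filtered_walk` matches `old_walk`, while `new_walk` additionally visits the secondary address, which is the behaviour change the commit message calls out for inet_addr_onlink() and inet_ifa_byprefix().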
    • net: inetdevice: provide replacement iterators for in_ifaddr walk · ef11db33
      Florian Westphal authored
      The ifa_list is protected either by rcu or rtnl lock, but the
      current iterators do not account for this.
      
      This adds two iterators as replacement, a later patch in
      the series will update them with the needed rcu/rtnl_dereference calls.
      
      It's not done in this patch yet to avoid sparse warnings -- the fields
      lack the proper __rcu annotation.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef11db33
    • afs: do not send list of client addresses · 35ebfc22
      Florian Westphal authored
      David Howells says:
        I'm told that there's not really any point populating the list.
        Current OpenAFS ignores it, as does AuriStor - and IBM AFS 3.6 will
        do the right thing.
        The list is actually useless as it's the client's view of the world,
        not the servers, so if there's any NAT in the way its contents are
        invalid.  Further, it doesn't support IPv6 addresses.
      
        On that basis, feel free to make it an empty list and remove all the
        interface enumeration.
      
      V1 of this patch reworked the function to use a new helper for the
      ifa_list iteration to avoid sparse warnings once the proper __rcu
      annotations get added in struct in_device later.
      
      But, in light of the above, just remove afs_get_ipv4_interfaces.
      
      Compile tested only.
      
      Cc: David Howells <dhowells@redhat.com>
      Cc: linux-afs@lists.infradead.org
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Tested-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35ebfc22
    • qed: remove redundant assignment to rc · b9f88982
      Colin Ian King authored
      The variable rc is assigned with a value that is never read and
      it is re-assigned a new value later on.  The assignment is redundant
      and can be removed.
      
      Addresses-Coverity: ("Unused value")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9f88982
    • Merge tag 'isdn-removal' of https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground · 8a7e8ff8
      David S. Miller authored
      Arnd Bergmann says:
      
      ====================
      isdn: deprecate non-mISDN drivers
      
      When isdn4linux came up in the context of another patch series, I
      remembered that we had discussed removing it a while ago.
      
      It turns out that the suggestion from Karsten Keil was to remove I4L
      in 2018 after the last public ISDN networks are shut down. This has
      happened now (with a very small number of exceptions), so I guess it's
      time to try again.
      
      We currently have three ISDN stacks in the kernel: the original
      isdn4linux (with the hisax driver), the newer CAPI (with four drivers),
      and finally the mISDN stack (supporting roughly the same hardware as
      hisax).
      
      As far as I can tell, anyone using ISDN with mainline kernel drivers in
      the past few years uses mISDN, and this is typically used for voice-only
      PBX installations that don't require a public network.
      
      The older stacks support additional features for data networks, but those
      typically make no sense any more if there is no network to connect to.
      
      My proposal for this time is to kill off isdn4linux entirely, as it seems
      to have been unusable for quite a while. This code has been abandoned
      for many years and it does cause problems for treewide maintenance as
      it tends to do everything that we try to stop doing.
      Birger Harzenetter mentioned that he is still using i4l in order to
      make use of the 'divert' feature that is not part of mISDN, but has
      otherwise moved on to mISDN for normal operation, like apparently
      everyone else.
      
      CAPI in turn is not quite as obsolete, but two of the drivers (avm
      and hysdn) don't seem to be used at all, while another one (gigaset)
      will stop being maintained as Paul Bolle is no longer able to
      test it after the network gets shut down in September.
      All three are now moved into drivers/staging to let others speak
      up in case there are remaining users.
      This leaves Bluetooth CMTP as the only remaining user of CAPI, but
      Marcel Holtmann wishes to keep maintaining it.
      
      For the discussion on version 1, see [2]
      Unfortunately, Karsten Keil as the maintainer has not participated in
      the discussion.
      
            Arnd
      
      [1] https://patchwork.kernel.org/patch/8484861/#17900371
      [2] https://listserv.isdn4linux.de/pipermail/isdn4linux/2019-April/thread.html
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8a7e8ff8
  2. 02 Jun, 2019 4 commits
  3. 01 Jun, 2019 5 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · c1e9e01d
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS updates for net-next
      
      The following patchset contains Netfilter/IPVS updates for net-next:
      
      1) Add UDP tunnel support for ICMP errors in IPVS.
      
      Julian Anastasov says:
      
      This patchset is a followup to the commit that adds UDP/GUE tunnel:
      "ipvs: allow tunneling with gue encapsulation".
      
      What we do is to put tunnel real servers in hash table (patch 1),
      add function to lookup tunnels (patch 2) and use it to strip the
      embedded tunnel headers from ICMP errors (patch 3).
      
      2) Extend xt_owner to match for supplementary groups, from
         Lukasz Pawelczyk.
      
      3) Remove unused oif field in flow_offload_tuple object, from
         Taehee Yoo.
      
      4) Release basechain counters from workqueue to skip synchronize_rcu()
         call. From Florian Westphal.
      
      5) Replace skb_make_writable() by skb_ensure_writable(). Patchset
         from Florian Westphal.
      
      6) Checksum support for gue encapsulation in IPVS, from Jacky Hu.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c1e9e01d
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 0462eaac
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf-next 2019-05-31
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      Lots of exciting new features in the first PR of this development cycle!
      The main changes are:
      
      1) misc verifier improvements, from Alexei.
      
      2) bpftool can now convert btf to valid C, from Andrii.
      
      3) verifier can insert explicit ZEXT insn when requested by 32-bit JITs.
         This feature greatly improves BPF speed on 32-bit architectures. From Jiong.
      
      4) cgroups will now auto-detach bpf programs. This fixes the issue of
         thousands of bpf programs getting stuck in dying cgroups. From Roman.
      
      5) new bpf_send_signal() helper, from Yonghong.
      
      6) cgroup inet skb programs can signal CN to the stack, from Lawrence.
      
      7) miscellaneous cleanups, from many developers.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0462eaac
    • selftests/bpf: measure RTT from xdp using xdping · cd538502
      Alan Maguire authored
      xdping allows us to get latency estimates from XDP.  Output looks
      like this:
      
      ./xdping -I eth4 192.168.55.8
      Setting up XDP for eth4, please wait...
      XDP setup disrupts network connectivity, hit Ctrl+C to quit
      
      Normal ping RTT data
      [Ignore final RTT; it is distorted by XDP using the reply]
      PING 192.168.55.8 (192.168.55.8) from 192.168.55.7 eth4: 56(84) bytes of data.
      64 bytes from 192.168.55.8: icmp_seq=1 ttl=64 time=0.302 ms
      64 bytes from 192.168.55.8: icmp_seq=2 ttl=64 time=0.208 ms
      64 bytes from 192.168.55.8: icmp_seq=3 ttl=64 time=0.163 ms
      64 bytes from 192.168.55.8: icmp_seq=8 ttl=64 time=0.275 ms
      
      4 packets transmitted, 4 received, 0% packet loss, time 3079ms
      rtt min/avg/max/mdev = 0.163/0.237/0.302/0.054 ms
      
      XDP RTT data:
      64 bytes from 192.168.55.8: icmp_seq=5 ttl=64 time=0.02808 ms
      64 bytes from 192.168.55.8: icmp_seq=6 ttl=64 time=0.02804 ms
      64 bytes from 192.168.55.8: icmp_seq=7 ttl=64 time=0.02815 ms
      64 bytes from 192.168.55.8: icmp_seq=8 ttl=64 time=0.02805 ms
      
      The xdping program loads the associated xdping_kern.o BPF program
      and attaches it to the specified interface.  If run in client
      mode (the default), it will add a map entry keyed by the
      target IP address; this map will store RTT measurements, current
      sequence number etc.  Finally in client mode the ping command
      is executed, and the xdping BPF program will use the last ICMP
      reply, reformulate it as an ICMP request with the next sequence
      number and XDP_TX it.  After the reply to that request is received
      we can measure RTT and repeat until the desired number of
      measurements is made.  This is why the sequence numbers in the
      normal ping are 1, 2, 3 and 8.  We XDP_TX a modified version
      of ICMP reply 4 and keep doing this until we get the 4 replies
      we need; hence the networking stack only sees reply 8, where
      we have XDP_PASSed it upstream since we are done.
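
The sequence-number behaviour described above can be reproduced with a small model. This is only a Python sketch of the bookkeeping, not the xdping BPF program; `simulate_xdping`, `ping_count` and `xdp_count` are hypothetical names, and both counts are assumed to be 4 to match the sample output.

```python
def simulate_xdping(ping_count=4, xdp_count=4):
    """Return (sequence numbers the stack sees, sequence numbers timed in XDP)."""
    # Replies before the last ping reply reach the networking stack untouched.
    seen_by_stack = list(range(1, ping_count))
    xdp_measured = []

    # The last ping reply is intercepted instead of being passed up.
    seq = ping_count
    while len(xdp_measured) < xdp_count:
        # Reformulate the reply as a request with the next sequence number
        # and XDP_TX it; the reply to that request yields one RTT sample.
        seq += 1
        xdp_measured.append(seq)

    # Once enough samples are collected, the final reply is XDP_PASSed.
    seen_by_stack.append(seq)
    return seen_by_stack, xdp_measured


stack_seqs, xdp_seqs = simulate_xdping()
```

With the assumed counts, `stack_seqs` is 1, 2, 3, 8 and `xdp_seqs` is 5, 6, 7, 8, which lines up with the two listings in the sample output.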
      
      In server mode (-s), xdping simply takes ICMP requests and replies
      to them in XDP rather than passing the request up to the networking
      stack.  No map entry is required.
      
      xdping can be run in native XDP mode (the default, or specified
      via -N) or in skb mode (-S).
      
      A test program test_xdping.sh exercises some of these options.
      
      Note that native XDP does not seem to XDP_TX for veths, hence -N
      is not tested.  Looking at the code, it looks like XDP_TX is
      supported so I'm not sure if that's expected.  Running xdping in
      native mode for ixgbe as both client and server works fine.
      
      Changes since v4
      
      - close fds on cleanup (Song Liu)
      
      Changes since v3
      
      - fixed seq to be __be16 (Song Liu)
      - fixed fd checks in xdping.c (Song Liu)
      
      Changes since v2
      
      - updated commit message to explain why seq number of last
        ICMP reply is 8 not 4 (Song Liu)
      - updated types of seq number, raddr and eliminated csum variable
        in xdpclient/xdpserver functions as it was not needed (Song Liu)
      - added XDPING_DEFAULT_COUNT definition and usage specification of
        default/max counts (Song Liu)
      
      Changes since v1
       - moved from RFC to PATCH
       - removed unused variable in ipv4_csum() (Song Liu)
       - refactored ICMP checks into icmp_check() function called by client
         and server programs and reworked client and server programs due
         to lack of shared code (Song Liu)
       - added checks to ensure that SKB and native mode are not requested
         together (Song Liu)
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      cd538502
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 33aae282
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2019-05-31
      
      This series contains updates to the iavf driver.
      
      Nathan Chancellor converts the use of gnu_printf to printf.
      
      Aleksandr modifies the driver to limit the number of RSS queues to the
      number of online CPUs in order to avoid creating misconfigured RSS
      queues.
      
      Gustavo A. R. Silva converts a couple of instances where sizeof() can be
      replaced with struct_size().
      
      Alice makes the remaining changes to the iavf driver to clean up all the
      old "i40evf" references in the driver to iavf, including the file names
      that still contained the old driver reference.  There were no functional
      changes made, just cosmetic ones to reduce any confusion going forward now
      that the iavf driver is the virtual function driver for both i40e and
      ice drivers.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33aae282
    • bpf: doc: update answer for 32-bit subregister question · c231c22a
      Jiong Wang authored
      There has been quite a bit of progress around the two steps mentioned in the
      answer to the following question:
      
        Q: BPF 32-bit subregister requirements
      
      This patch updates the answer to reflect what has been done.
      
      v2:
       - Add missing full stop. (Song Liu)
       - Minor tweak on one sentence. (Song Liu)
      
      v1:
       - Integrated rephrase from Quentin and Jakub
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c231c22a
  4. 31 May, 2019 17 commits
    • Merge branch 'map-charge-cleanup' · d168286d
      Alexei Starovoitov authored
      Roman Gushchin says:
      
      ====================
      During my work on memcg-based memory accounting for bpf maps
      I've done some cleanups and refactorings of the existing
      memlock rlimit-based code. It makes it more robust, unifies
      size to pages conversion, size checks and corresponding error
      codes. Also it adds coverage for cgroup local storage and
      socket local storage maps.
      
      It looks like some preliminary work on the mm side might be
      required to start working on the memcg-based accounting,
      so I'm sending these patches as a separate patchset.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d168286d
    • bpf: move memory size checks to bpf_map_charge_init() · c85d6913
      Roman Gushchin authored
      Most bpf map types do similar checks and bytes to pages
      conversion during memory allocation and charging.
      
      Let's unify these checks by moving them into bpf_map_charge_init().
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c85d6913
    • bpf: rework memlock-based memory accounting for maps · b936ca64
      Roman Gushchin authored
      In order to unify the existing memlock charging code with the
      memcg-based memory accounting, which will be added later, let's
      rework the current scheme.
      
      Currently the following design is used:
        1) .alloc() callback optionally checks if the allocation will likely
           succeed using bpf_map_precharge_memlock()
        2) .alloc() performs actual allocations
        3) .alloc() callback calculates map cost and sets map.memory.pages
        4) map_create() calls bpf_map_init_memlock() which sets map.memory.user
           and performs actual charging; in case of failure the map is
           destroyed
        <map is in use>
        1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which
           performs uncharge and releases the user
        2) .map_free() callback releases the memory
      
      The scheme can be simplified and made more robust:
        1) .alloc() calculates map cost and calls bpf_map_charge_init()
        2) bpf_map_charge_init() sets map.memory.user and performs actual
          charge
        3) .alloc() performs actual allocations
        <map is in use>
        1) .map_free() callback releases the memory
        2) bpf_map_charge_finish() performs uncharge and releases the user
      
      The new scheme also allows reusing the bpf_map_charge_init()/finish()
      functions for memcg-based accounting. Because charges are performed
      before actual allocations and uncharges after freeing the memory,
      no bogus memory pressure can be created.
      
      In cases when the map structure is not available (e.g. it's not
      created yet, or is already destroyed), on-stack bpf_map_memory
      structure is used. The charge can be transferred with the
      bpf_map_charge_move() function.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b936ca64
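
The charge-before-allocate ordering of the reworked scheme in b936ca64 can be sketched as follows. This is a toy Python model rather than the kernel implementation: `User`, `MapMemory`, `charge_init`, `charge_finish`, `map_alloc` and `map_free` are hypothetical names loosely mirroring the functions named in the commit message, and the 4096-byte page size and the use of `PermissionError` for a failed charge are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class User:
    """Hypothetical per-user memlock accounting state."""
    def __init__(self, limit_pages):
        self.limit_pages = limit_pages
        self.locked_pages = 0


class MapMemory:
    """Toy stand-in for struct bpf_map_memory: who is charged, and how much."""
    def __init__(self):
        self.user = None
        self.pages = 0


def charge_init(mem, user, size_bytes):
    """Convert bytes to pages, check the limit, then charge the user."""
    pages = (size_bytes + PAGE_SIZE - 1) // PAGE_SIZE
    if user.locked_pages + pages > user.limit_pages:
        raise PermissionError("memlock limit exceeded")
    user.locked_pages += pages
    mem.user, mem.pages = user, pages


def charge_finish(mem):
    """Uncharge and release the user."""
    mem.user.locked_pages -= mem.pages
    mem.user, mem.pages = None, 0


def map_alloc(user, size_bytes):
    mem = MapMemory()
    charge_init(mem, user, size_bytes)   # charge first ...
    storage = bytearray(size_bytes)      # ... allocate only if the charge succeeded
    return mem, storage


def map_free(mem, storage):
    storage.clear()                      # release the memory first ...
    charge_finish(mem)                   # ... then uncharge


user = User(limit_pages=4)
mem, storage = map_alloc(user, 10000)    # 10000 bytes round up to 3 pages
charged_after_alloc = user.locked_pages

try:
    map_alloc(user, 2 * PAGE_SIZE)       # would need 2 more pages, only 1 left
    second_alloc_failed = False
except PermissionError:
    second_alloc_failed = True

map_free(mem, storage)
charged_after_free = user.locked_pages
```

Because the charge happens before any allocation, a rejected request never allocates memory, and because the uncharge happens after the free, the accounted total never drops below what is actually held.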
    • bpf: group memory related fields in struct bpf_map_memory · 3539b96e
      Roman Gushchin authored
      Group "user" and "pages" fields of bpf_map into the bpf_map_memory
      structure. Later it can be extended with "memcg" and other related
      information.
      
      The main reason for such a change (besides cosmetics) is to pass
      bpf_map_memory structure to charging functions before the actual
      allocation of bpf_map.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3539b96e
    • bpf: add memlock precharge for socket local storage · d50836cd
      Roman Gushchin authored
      Socket local storage maps lack the memlock precharge check,
      which is performed before the memory allocation for
      most other bpf map types.
      
      Let's add it in order to unify all map types.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d50836cd
    • bpf: add memlock precharge check for cgroup_local_storage · ffc8b144
      Roman Gushchin authored
      Cgroup local storage maps lack the memlock precharge check,
      which is performed before the memory allocation for
      most other bpf map types.
      
      Let's add it in order to unify all map types.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ffc8b144
    • Merge branch 'propagate-cn-to-tcp' · 576240cf
      Alexei Starovoitov authored
      Lawrence Brakmo says:
      
      ====================
      This patchset adds support for propagating congestion notifications (cn)
      to TCP from cgroup inet skb egress BPF programs.
      
      Current cgroup skb BPF programs cannot trigger TCP congestion window
      reductions, even when they drop a packet. This patch-set adds support
      for cgroup skb BPF programs to send congestion notifications in the
      return value when the packets are TCP packets. Rather than the
      current 1 for keeping the packet and 0 for dropping it, they can
      now return:
          NET_XMIT_SUCCESS    (0)    - continue with packet output
          NET_XMIT_DROP       (1)    - drop packet and do cn
          NET_XMIT_CN         (2)    - continue with packet output and do cn
          -EPERM                     - drop packet
      
      Finally, HBM programs are modified to collect and return more
      statistics.
      
      There has been some discussion regarding the best place to manage
      bandwidths. Some believe this should be done in the qdisc where it can
      also be managed with a BPF program. We believe there are advantages
      for doing it with a BPF program in the cgroup/skb callback. For example,
      it reduces overheads in the cases where there is one primary workload and
      one or more secondary workloads, where each workload is running on its
      own cgroupv2. In this scenario, we only need to throttle the secondary
      workloads and there is no overhead for the primary workload since there
      will be no BPF program attached to its cgroup.
      
      Regardless, we agree that this mechanism should not penalize those that
      are not using it. We tested this by doing 1 byte req/reply RPCs over
      loopback. Each test consists of 30 sec of back-to-back 1 byte RPCs.
      Each test was repeated 50 times with a 1 minute delay between each set
      of 10. We then calculated the average RPCs/sec over the 50 tests. We
      compare upstream with upstream + patchset and no BPF program as well
      as upstream + patchset and a BPF program that just returns ALLOW_PKT.
      Here are the results:
      
      upstream                           80937 RPCs/sec
      upstream + patches, no BPF program 80894 RPCs/sec
      upstream + patches, BPF program    80634 RPCs/sec
      
      These numbers indicate that there is no penalty for these patches.
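
For reference, the relative slowdown implied by the three figures above can be worked out directly. The snippet below is plain arithmetic on the reported RPC rates; the variable names are illustrative.

```python
baseline = 80937  # upstream, RPCs/sec
patched = {
    "upstream + patches, no BPF program": 80894,
    "upstream + patches, BPF program": 80634,
}

# Relative slowdown versus upstream, in percent.
overhead_pct = {
    name: 100.0 * (baseline - rate) / baseline
    for name, rate in patched.items()
}

for name, pct in overhead_pct.items():
    print(f"{name}: {pct:.2f}% slower than upstream")
```

Both differences come out well under half a percent.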
      
      The use of congestion notifications improves the performance of HBM when
      using Cubic. Without congestion notifications, Cubic will not decrease its
      cwnd and HBM will need to drop a large percentage of the packets.
      
      The following results are obtained for rate limits of 1Gbps,
      between two servers using netperf, and only one flow. We also show how
      reducing the max delayed ACK timer can improve the performance when
      using Cubic.
      
      Command used was:
        ./do_hbm_test.sh -l -D --stats -N -r=<rate> [--no_cn] [dctcp] \
                         -s=<server running netserver>
        where:
           <rate>   is 1000
           --no_cn  specifies no cwr notifications
           dctcp    uses dctcp
      
                            Cubic                      DCTCP
      Lim, DA       Mbps  cwnd  cred  drops   Mbps  cwnd  cred  drops
      ----------    ----  ----  ----  -----   ----  ----  ----  -----
      1G, 40          35   462  -320  67%      995     1  -212  0.05%
      1G, 40, cn     736     9   -78  0.07%    995     1  -212  0.05%
      1G,  5, cn     941     2  -189  0.13%    995     1  -212  0.05%
      
      Notes:
        --no_cn has no effect with DCTCP
        Lim = rate limit
        DA = maximum delayed ACK timer
        cred = credit in packets
        drops = % packets dropped
      
      v1->v2: Ensures that only BPF_CGROUP_INET_EGRESS can return values 2 and 3
              New egress values apply to all protocols, not just TCP
              Cleaned up patch 4, Update BPF_CGROUP_RUN_PROG_INET_EGRESS callers
              Removed changes to __tcp_transmit_skb (patch 5), no longer needed
              Removed sample use of EDT
      v2->v3: Removed the probe timer related changes
      v3->v4: Replaced preempt_enable_no_resched() by preempt_enable()
              in BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() macro
      ====================
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      576240cf
    • bpf: Add more stats to HBM · d58c6f72
      brakmo authored
      Adds more stats to HBM, including average cwnd and rtt of all TCP
      flows, percentage of packets that are ECN CE marked and distribution
      of return values.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d58c6f72
    • bpf: Add cn support to hbm_out_kern.c · ffd81558
      brakmo authored
      Update hbm_out_kern.c to support returning cn notifications.
      Also updates relevant files to allow disabling cn notifications.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ffd81558
    • bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls · 956fe219
      brakmo authored
      Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning
      congestion notifications from the BPF programs.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      956fe219
    • bpf: Update __cgroup_bpf_run_filter_skb with cn · e7a3160d
      brakmo authored
      For egress packets, __cgroup_bpf_run_filter_skb() will now call
      BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() instead of PROG_CGROUP_RUN_ARRAY()
      in order to propagate congestion notification (cn) requests to TCP
      callers.
      
      For egress packets, this function can return:
         NET_XMIT_SUCCESS    (0)    - continue with packet output
         NET_XMIT_DROP       (1)    - drop packet and notify TCP to call cwr
         NET_XMIT_CN         (2)    - continue with packet output and notify TCP
                                      to call cwr
         -EPERM                     - drop packet
      
      For ingress packets, this function will return -EPERM if any attached
      program was found and if it returned != 1 during execution. Otherwise 0
      is returned.
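
      The return values listed above can be read as two independent decisions:
      whether the packet continues to output, and whether TCP is asked to call
      cwr. The sketch below is a user-space illustration only; the struct and
      both function names (egress_decode, ingress_result) are hypothetical and
      are not kernel APIs. The NET_XMIT_* values are copied from the table in
      this commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* Values as quoted in the commit message above. */
enum {
    NET_XMIT_SUCCESS = 0,
    NET_XMIT_DROP    = 1,
    NET_XMIT_CN      = 2,
};

/* Hypothetical type: what a caller should do with an egress packet. */
struct egress_action {
    bool transmit;   /* continue with packet output */
    bool notify_cwr; /* notify TCP to call cwr */
};

/* Hypothetical helper: decode the egress return value into two flags. */
struct egress_action egress_decode(int ret)
{
    struct egress_action act = { false, false };

    switch (ret) {
    case NET_XMIT_SUCCESS:
        act.transmit = true;
        break;
    case NET_XMIT_DROP:
        act.notify_cwr = true;
        break;
    case NET_XMIT_CN:
        act.transmit = true;
        act.notify_cwr = true;
        break;
    default:
        /* -EPERM (or any other error): drop, no notification */
        break;
    }
    return act;
}

/*
 * Hypothetical helper for the ingress case described above: anything
 * other than 1 from the attached programs becomes -EPERM.
 */
int ingress_result(unsigned int prog_ret)
{
    return prog_ret == 1 ? 0 : -EPERM;
}
```

      For example, egress_decode(NET_XMIT_CN) sets both flags, while
      egress_decode(-EPERM) leaves both clear.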
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e7a3160d
    • bpf: cgroup inet skb programs can return 0 to 3 · 5cf1e914
      brakmo authored
      Allows cgroup inet skb programs to return values in the range [0, 3].
      The second bit is used to determine if congestion occurred and the higher
      level protocol should decrease its rate, e.g. TCP would call tcp_enter_cwr().
      
      The bpf_prog must set expected_attach_type to BPF_CGROUP_INET_EGRESS
      at load time if it uses the new return values (i.e. 2 or 3).
      
      The expected_attach_type is currently not enforced for
      BPF_PROG_TYPE_CGROUP_SKB. For example, a bpf_prog whose
      expected_attach_type is set to BPF_CGROUP_INET_EGRESS can currently
      attach to BPF_CGROUP_INET_INGRESS. Blindly enforcing expected_attach_type
      would break backward compatibility.
      
      This patch adds an enforce_expected_attach_type bit so that the
      expected_attach_type is enforced only when the program uses the new
      return values.
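
      The conditional enforcement described above might be sketched as follows.
      This is a simplified user-space model, not the kernel code: struct prog,
      the two enum constants, both function names, the max_ret parameter (the
      highest value the program may return) and the use of -EINVAL as the
      error code are all assumptions made for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the BPF_CGROUP_INET_* attach types. */
enum attach_type { CGROUP_INET_INGRESS, CGROUP_INET_EGRESS };

/* Hypothetical, minimal model of a loaded program. */
struct prog {
    enum attach_type expected_attach_type;
    bool enforce_expected_attach_type;
};

/*
 * Load-time check (sketch): values 2 and 3 are only accepted when the
 * program declared the egress attach type, and using them sets the
 * enforce bit. Programs limited to 0/1 are left untouched.
 */
int prog_load_check(struct prog *p, unsigned int max_ret)
{
    if (max_ret > 3)
        return -EINVAL;
    if (max_ret > 1) {
        if (p->expected_attach_type != CGROUP_INET_EGRESS)
            return -EINVAL;
        p->enforce_expected_attach_type = true;
    }
    return 0;
}

/*
 * Attach-time check (sketch): a mismatch is only an error when the
 * enforce bit is set, so older programs keep attaching as before.
 */
int prog_attach_check(const struct prog *p, enum attach_type t)
{
    if (p->enforce_expected_attach_type && p->expected_attach_type != t)
        return -EINVAL;
    return 0;
}
```

      Under this model, a program that only returns 0 or 1 can still attach to
      either direction regardless of its expected_attach_type, which is the
      backward-compatible behavior the commit message asks for.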
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      5cf1e914
    • bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY · 1f52f6c0
      brakmo authored
      Create new macro BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() to be used by
      __cgroup_bpf_run_filter_skb for EGRESS BPF progs so BPF programs can
      request cwr for TCP packets.
      
      Current cgroup skb programs can only return 0 or 1 (0 to drop the
      packet). This macro changes the behavior so the low order bit
      indicates whether the packet should be dropped (0) or not (1)
      and the next bit is used for congestion notification (cn).
      
      Hence, new allowed return values of CGROUP EGRESS BPF programs are:
        0: drop packet
        1: keep packet
        2: drop packet and call cwr
        3: keep packet and call cwr
      
      This macro then converts the combined result to one of the NET_XMIT
      values, or to -EPERM, which drops the packet with no cn:
        NET_XMIT_SUCCESS (0)  skb should be transmitted (no cn)
        NET_XMIT_DROP    (1)  skb should be dropped and cwr called
        NET_XMIT_CN      (2)  skb should be transmitted and cwr called
        -EPERM                skb should be dropped (no cn)
      
      Note that when more than one BPF program is called, the packet is
      dropped if at least one of the programs requests that it be dropped,
      and there is cn if at least one program returns cn.
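
      The combination rule above can be illustrated with a small user-space
      sketch. It is not the macro itself: egress_combine is a hypothetical
      function that takes the raw return values (0 to 3) of the programs as
      an array, whereas the real macro runs the programs in place. The
      NET_XMIT_* values are the ones quoted in this series.

```c
#include <errno.h>
#include <stddef.h>

/* Values as quoted in this patch series. */
enum {
    NET_XMIT_SUCCESS = 0,
    NET_XMIT_DROP    = 1,
    NET_XMIT_CN      = 2,
};

/*
 * Hypothetical helper: fold the raw return values of several egress
 * programs into a single NET_XMIT value or -EPERM.
 *   bit 0 of each value: 1 = keep the packet, 0 = drop it
 *   bit 1 of each value: congestion notification (cn) requested
 */
int egress_combine(const unsigned int *rets, size_t n)
{
    unsigned int keep = 1; /* stays 1 only if every program keeps the packet */
    unsigned int cn = 0;   /* non-zero once any program requests cn */

    for (size_t i = 0; i < n; i++) {
        keep &= rets[i] & 1;
        cn |= rets[i] & 2;
    }

    if (keep)
        return cn ? NET_XMIT_CN : NET_XMIT_SUCCESS;
    return cn ? NET_XMIT_DROP : -EPERM;
}
```

      With two programs returning 1 and 3, the packet is kept and cn is
      requested, so the result is NET_XMIT_CN. With 2 and 1, the first
      program asks for a drop with cn, so the result is NET_XMIT_DROP.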
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1f52f6c0
    • xen-netback: remove redundant assignment to err · 587a7126
      Colin Ian King authored
      The variable err is assigned with the value -ENOMEM that is never
      read and it is re-assigned a new value later on.  The assignment is
      redundant and can be removed.
      
      Addresses-Coverity: ("Unused value")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: Wei Liu <wei.liu2@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      587a7126
    • nexthop: remove redundant assignment to err · 6f43e525
      Colin Ian King authored
      The variable err is initialized with a value that is never read
      and err is reassigned a few statements later. This initialization
      is redundant and can be removed.
      
      Addresses-Coverity: ("Unused value")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6f43e525
    • Merge branch 'phylink-sfp-updates' · 6912378d
      David S. Miller authored
      Russell King says:
      
      ====================
      phylink/sfp updates
      
      This is a series of updates to phylink and sfp:
      
      - Remove an unused net device argument from the phylink MII ioctl
        emulation code.
      
      - add support for using interrupts when using a GPIO for link status
        tracking, rather than polling it at one-second intervals.  This
        avoids waking up the CPU every second.
      
      - add support to the MII ioctl API to read and write Clause 45 PHY
        registers.  I don't know how desirable this is for mainline, but I
        have used this facility extensively to investigate the Marvell
        88x3310 PHY.  A recent example was debugging the recently reported
        PHY-without-firmware problem.
      
      - add mandatory attach/detach methods for the upstream side of sfp
        bus code, which will allow us to remove the "netdev" structure from
        the SFP layers.
      
      - remove the "netdev" structure from the SFP upstream registration
        calls, which simplifies PHY to SFP links.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6912378d
    • net: sfp: remove sfp-bus use of netdevs · 54f70b3b
      Russell King authored
      The sfp-bus code no longer has any use for the network device
      structure, so remove its use.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      54f70b3b