1. 09 Nov, 2018 40 commits
    • Michał Mirosław's avatar
      net/core: use __vlan_hwaccel helpers · b1817524
      Michał Mirosław authored
      This removes assumptions about VLAN_TAG_PRESENT bit.
      Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Michał Mirosław's avatar
      cxgb4: use __vlan_hwaccel helpers · 35c4a95d
      Michał Mirosław authored
      Use __vlan_hwaccel_put_tag() to set vlan tag and proto fields.
      Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Cong Wang's avatar
      net: move __skb_checksum_complete*() to skbuff.c · 49f8e832
      Cong Wang authored
      __skb_checksum_complete_head() and __skb_checksum_complete()
      are both declared in skbuff.h, they fit better in skbuff.c
      than datagram.c.
      
      Cc: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'net-ethernet-ti-cpsw-fix-vlan-mcast' · d1cb9273
      David S. Miller authored
      Ivan Khoronzhuk says:
      
      ====================
      net: ethernet: ti: cpsw: fix vlan mcast
      
      The cpsw holds separate mcast entries for vlan entries. At the moment
      the driver adds only non-vlan mcast addresses, omitting vlan/mcast
      entries. As a result, mcast for vlans doesn't work. This could be fixed
      by adding the same mcast entries for every created vlan, but this patch
      series takes a more sophisticated approach and creates mcast entries
      only for the vlans that really require them. Generic functions from
      this series can be reused for fixing vlan and macvlan unicast.
      
      A simple example of the ALE table before and after this series, with
      the same mcast entries for vlan 100 as for the real device (reserved
      vlan 2), and one mcast address, 01:1b:19:00:00:00, only for vlan 100.
      
      <---- Before this patchset ---->
      vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5, mem_list = 0x5
      mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
      ucast, vid = 2, addr = 74:da:ea:47:7d:9d, persistant, port_num = 0x0
      vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0, mem_list = 0x7
      mcast, vid = 2, addr = 33:33:00:00:00:01, port_mask = 0x1
      mcast, vid = 2, addr = 01:00:5e:00:00:01, port_mask = 0x1
      vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3, mem_list = 0x3
      mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
      ucast, vid = 1, addr = 74:da:ea:47:7d:9c, persistant, port_num = 0x0
      mcast, vid = 1, addr = 33:33:00:00:00:01, port_mask = 0x1
      mcast, vid = 1, addr = 01:00:5e:00:00:01, port_mask = 0x1
      mcast, vid = 2, addr = 01:80:c2:00:00:00, port_mask = 0x1
      mcast, vid = 2, addr = 01:80:c2:00:00:03, port_mask = 0x1
      mcast, vid = 2, addr = 01:80:c2:00:00:0e, port_mask = 0x1
      mcast, vid = 1, addr = 01:80:c2:00:00:00, port_mask = 0x1
      mcast, vid = 1, addr = 01:80:c2:00:00:03, port_mask = 0x1
      mcast, vid = 1, addr = 01:80:c2:00:00:0e, port_mask = 0x1
      mcast, vid = 2, addr = 33:33:ff:47:7d:9d, port_mask = 0x1
      mcast, vid = 2, addr = 33:33:00:00:00:fb, port_mask = 0x1
      mcast, vid = 2, addr = 33:33:00:01:00:03, port_mask = 0x1
      mcast, vid = 1, addr = 33:33:ff:47:7d:9c, port_mask = 0x1
      mcast, vid = 1, addr = 33:33:00:00:00:fb, port_mask = 0x1
      mcast, vid = 1, addr = 33:33:00:01:00:03, port_mask = 0x1
      mcast, vid = 1, addr = 01:00:5e:00:00:fb, port_mask = 0x1
      mcast, vid = 1, addr = 01:00:5e:00:00:fc, port_mask = 0x1
      vlan , vid = 100, untag_force = 0x0, reg_mcast = 0x5, mem_list = 0x5
      ucast, vid = 100, addr = 74:da:ea:47:7d:9d, persistant, port_num = 0x0
      mcast, vid = 100, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
      mcast, vid = 2, addr = 01:1b:19:00:00:00, port_mask = 0x1
      			 ^^^
       Here the mcast entry (ptpl2) should be added only for vlan 100,
       but it is added for reserved vlan 2 instead, which is not enough.
      
      <---- After this patchset ---->
      vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5, mem_list = 0x5
      mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
      ucast, vid = 2, addr = 74:da:ea:47:7d:9d, persistant, port_num = 0x0
      vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0, mem_list = 0x7
      mcast, vid = 2, addr = 33:33:00:00:00:01, port_mask = 0x1
      mcast, vid = 2, addr = 01:00:5e:00:00:01, port_mask = 0x1
      vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3, mem_list = 0x3
      mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
      ucast, vid = 1, addr = 74:da:ea:47:7d:9c, persistant, port_num = 0x0
      mcast, vid = 1, addr = 33:33:00:00:00:01, port_mask = 0x1
      mcast, vid = 1, addr = 01:00:5e:00:00:01, port_mask = 0x1
      mcast, vid = 2, addr = 01:80:c2:00:00:00, port_mask = 0x1
      mcast, vid = 2, addr = 01:80:c2:00:00:03, port_mask = 0x1
      mcast, vid = 2, addr = 01:80:c2:00:00:0e, port_mask = 0x1
      mcast, vid = 1, addr = 01:80:c2:00:00:00, port_mask = 0x1
      mcast, vid = 1, addr = 01:80:c2:00:00:03, port_mask = 0x1
      mcast, vid = 1, addr = 01:80:c2:00:00:0e, port_mask = 0x1
      mcast, vid = 2, addr = 33:33:ff:47:7d:9d, port_mask = 0x1
      mcast, vid = 1, addr = 33:33:ff:47:7d:9c, port_mask = 0x1
      mcast, vid = 2, addr = 33:33:00:00:00:fb, port_mask = 0x1
      mcast, vid = 2, addr = 33:33:00:01:00:03, port_mask = 0x1
      mcast, vid = 1, addr = 33:33:00:00:00:fb, port_mask = 0x1
      mcast, vid = 1, addr = 33:33:00:01:00:03, port_mask = 0x1
      vlan , vid = 100, untag_force = 0x0, reg_mcast = 0x5, mem_list = 0x5
      ucast, vid = 100, addr = 74:da:ea:47:7d:9d, persistant, port_num = 0x0
      mcast, vid = 100, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
      mcast, vid = 100, addr = 33:33:00:00:00:01, port_mask = 0x1
      mcast, vid = 100, addr = 01:00:5e:00:00:01, port_mask = 0x1
      mcast, vid = 100, addr = 33:33:ff:47:7d:9d, port_mask = 0x1
      mcast, vid = 100, addr = 01:80:c2:00:00:00, port_mask = 0x1
      mcast, vid = 100, addr = 01:80:c2:00:00:03, port_mask = 0x1
      mcast, vid = 100, addr = 01:80:c2:00:00:0e, port_mask = 0x1
      mcast, vid = 100, addr = 33:33:00:00:00:fb, port_mask = 0x1
      mcast, vid = 100, addr = 33:33:00:01:00:03, port_mask = 0x1
      mcast, vid = 100, addr = 01:1b:19:00:00:00, port_mask = 0x1
      			 ^^^
          Here the mcast entry (ptpl2) is added only for vlan 100,
          as it should be.
      
      Based on net-next/master
      
      v2..v1:
        net: ethernet: ti: cpsw: fix vlan mcast
      	- removed limit for legacy switch cpsw mode
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Ivan Khoronzhuk's avatar
      net: ethernet: ti: cpsw: fix vlan configuration while down/up · 00fe4712
      Ivan Khoronzhuk authored
      The vlan configuration is not restored after an interface down/up
      sequence (with dual-emac, on both interfaces). Tested on am572x EVM.
      
      Steps to check:
      ~# ip link add link eth1 name eth1.100 type vlan id 100
      ~# ifconfig eth0 down
      ~# ifconfig eth1 down
      
      Try to remove vid and observe warning:
      ~# ip link del eth1.100
      [  739.526757] net eth1: removing vlanid 100 from vlan filter
      [  739.533322] failed to kill vid 0081/100 for device eth1
      
      This patch fixes it by restoring only the vlan ALE entries; all other
      unicast/multicast entries are restored by the system calling the
      rx_mode ndo.
      Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Ivan Khoronzhuk's avatar
      net: ethernet: ti: cpsw: fix vlan mcast · 15180eca
      Ivan Khoronzhuk authored
      At the moment, mcast addresses are added for the real device only
      (reserved vlans in dual-emac mode), even if a mcast address was added
      for some vlan only. Thus the ALE doesn't have the corresponding vlan
      mcast entries after a vlan socket has joined a multicast group. So the
      ALE drops vlan frames with mcast addresses intended for vlans and can
      potentially receive mcast frames for the base ndev. That's not correct.
      Fix it by creating only the vlan/mcast entries that were requested. The
      patch doesn't use any additional lists and is based on the device mc
      address list and the cpsw ALE table entries.
      Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Ivan Khoronzhuk's avatar
      net: 8021q: vlan_core: allow use list of vlans for real device · 960abf68
      Ivan Khoronzhuk authored
      It's redundant for drivers to hold a list of vlans when exactly the
      same list exists in the vlan core. In most cases the list is needed
      only to traverse the vlan devices and their vids and to sync some
      settings with h/w, so add an API to simplify this.
      
      At least some of these drivers can also benefit:
      grep "for_each.*vid" -r drivers/net/ethernet/
      
      drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:
      drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c:
      drivers/net/ethernet/qlogic/qlge/qlge_main.c:
      drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c:
      drivers/net/ethernet/via/via-rhine.c:
      drivers/net/ethernet/via/via-velocity.c:
      drivers/net/ethernet/intel/igb/igb_main.c:
      drivers/net/ethernet/intel/ice/ice_main.c:
      drivers/net/ethernet/intel/e1000/e1000_main.c:
      drivers/net/ethernet/intel/i40e/i40e_main.c:
      drivers/net/ethernet/intel/e1000e/netdev.c:
      drivers/net/ethernet/intel/igbvf/netdev.c:
      drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:
      drivers/net/ethernet/intel/ixgb/ixgb_main.c:
      drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:
      drivers/net/ethernet/amd/xgbe/xgbe-dev.c:
      drivers/net/ethernet/emulex/benet/be_main.c:
      drivers/net/ethernet/neterion/vxge/vxge-main.c:
      drivers/net/ethernet/adaptec/starfire.c:
      drivers/net/ethernet/brocade/bna/bnad.c:
      Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
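The idea of letting drivers walk the core's vlan list instead of keeping their own copy can be sketched as a toy model. This is not kernel code: `VlanCore`, `for_each_vlan` and `restore_hw_filters` are hypothetical names chosen for illustration, and the real API added by the patch may look different.

```python
from typing import Callable, Dict, List


class VlanCore:
    """Toy stand-in for the vlan core's per-device list of vlan ids."""

    def __init__(self) -> None:
        self._vids: Dict[str, List[int]] = {}

    def add_vlan(self, real_dev: str, vid: int) -> None:
        vids = self._vids.setdefault(real_dev, [])
        if vid not in vids:
            vids.append(vid)

    def for_each_vlan(self, real_dev: str, action: Callable[[int], int]) -> int:
        """Call action(vid) per vlan; stop at the first non-zero result."""
        for vid in self._vids.get(real_dev, []):
            err = action(vid)
            if err:
                return err
        return 0


def restore_hw_filters(core: VlanCore, real_dev: str) -> List[int]:
    """Hypothetical driver helper: re-program every vid into the hardware."""
    programmed: List[int] = []

    def program(vid: int) -> int:
        programmed.append(vid)  # a real driver would write a h/w filter here
        return 0

    core.for_each_vlan(real_dev, program)
    return programmed
```

With such a helper a driver needs no private vlan list: it asks the core to replay the vids whenever the hardware state has to be rebuilt.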
    • Ivan Khoronzhuk's avatar
      net: core: dev_addr_lists: add auxiliary func to handle reference address updates · e7946760
      Ivan Khoronzhuk authored
      To avoid updating the whole table and instead only add or remove
      individual addresses, an auxiliary function named __hw_addr_sync_dev()
      exists. It lets the end driver do nothing when nothing has changed, and
      add/remove an address only when it is first added or last removed. But
      it doesn't cover cases where an address of the real device or a vlan is
      reused by other vlans or vlan/macvlan devices.
      
      To handle events where an address is reused/unreused, the patch adds a
      new auxiliary routine, __hw_addr_ref_sync_dev(). It allows doing
      nothing when nothing has changed and doing updates only for an address
      being added/reused/deleted/unreused. Thus, cloned address changes for
      vlans can be mirrored in the table. The function is mutually exclusive
      with __hw_addr_sync_dev(). It's the responsibility of the end driver to
      identify the address's vlan device, if it needs to.
      Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
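The difference between "first added / last removed" syncing and reference-aware syncing can be illustrated with a simplified model. It does not reproduce the kernel's bookkeeping; `AddrList`, `ref_sync` and the callback signatures are assumptions made for this sketch. Callbacks fire only for addresses whose user count changed since the previous sync.

```python
from typing import Callable, Dict

Callback = Callable[[str, int], None]


class AddrList:
    """Toy address list tracking how many users share each address."""

    def __init__(self) -> None:
        self.users: Dict[str, int] = {}   # address -> current user count
        self.synced: Dict[str, int] = {}  # address -> count last reported

    def add(self, addr: str) -> None:
        self.users[addr] = self.users.get(addr, 0) + 1

    def remove(self, addr: str) -> None:
        if self.users.get(addr, 0) > 0:
            self.users[addr] -= 1

    def ref_sync(self, sync: Callback, unsync: Callback) -> None:
        # First report addresses that lost users (unreused or deleted).
        for addr in list(self.users):
            cur = self.users[addr]
            old = self.synced.get(addr, 0)
            if cur == 0 and old == 0:
                del self.users[addr]  # added and removed between syncs
            elif cur < old:
                unsync(addr, cur)
                if cur == 0:
                    del self.users[addr]
                    del self.synced[addr]
                else:
                    self.synced[addr] = cur
        # Then report addresses that gained users (added or reused).
        for addr, cur in self.users.items():
            if cur > self.synced.get(addr, 0):
                sync(addr, cur)
                self.synced[addr] = cur
```

An unchanged list produces no callbacks at all, which is the "do nothing when nothing has changed" property described above.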
    • Edward Cree's avatar
      sfc: use the new __netdev_tx_sent_queue BQL optimisation · 29e12207
      Edward Cree authored
      As added in 3e59020a ("net: bql: add __netdev_tx_sent_queue()"); see
      that commit for the performance rationale.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'net-Remove-VLAN_TAG_PRESENT-from-drivers' · eb4149c9
      David S. Miller authored
      Michał Mirosław says:
      
      ====================
      net: Remove VLAN_TAG_PRESENT from drivers
      
      This series removes VLAN_TAG_PRESENT use from network drivers in
      preparation for removing its special meaning.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Michał Mirosław's avatar
      OVS: remove use of VLAN_TAG_PRESENT · 9df46aef
      Michał Mirosław authored
      This is a minimal change to allow removal of VLAN_TAG_PRESENT.
      It leaves OVS unable to use the CFI bit, as fixing this would need
      deeper surgery involving the userspace interface.
      Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Michał Mirosław's avatar
      cnic: remove use of VLAN_TAG_PRESENT · f723a1a2
      Michał Mirosław authored
      This just removes VLAN_TAG_PRESENT use.  VLAN TCI=0 special meaning is
      deeply embedded in the driver code and so is left as is.
      Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Ilias Apalodimas's avatar
      net: socionext: refactor netsec_alloc_dring() · 0d404a61
      Ilias Apalodimas authored
      Return -ENOMEM directly instead of assigning it to a variable.
      Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Ilias Apalodimas's avatar
      net: socionext: different approach on DMA · 4acb20b4
      Ilias Apalodimas authored
      Current driver dynamically allocates an skb and maps it as DMA Rx
      buffer. In order to prepare for upcoming XDP changes, let's introduce a
      different allocation scheme.
      Buffers are allocated dynamically and mapped into hardware.
      During the Rx operation the driver uses build_skb() to produce the
      necessary buffers for the network stack.
      This change increases performance ~15% on 64b packets with smmu disabled
      and ~5% with smmu enabled
      Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Stefan Wahren's avatar
      net: qca_spi: Add available buffer space verification · 026b907d
      Stefan Wahren authored
      Interference on the SPI line could distort the response reporting the
      available buffer space. So we should at least check that the
      response doesn't exceed the maximum available buffer space.
      In the error case, increment a new error counter and retry later.
      This behavior avoids buffer errors in the QCA7000, which
      result in an unnecessary chip reset including packet loss.
      Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
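The plausibility check described above can be sketched as follows. This is an illustrative model, not the driver's code: `MAX_BUF_LEN`, `Stats`, `buf_avail_err` and `check_available` are hypothetical names, and the limit value is an assumption.

```python
from typing import Optional

MAX_BUF_LEN = 3163  # assumed hardware buffer size in bytes (illustrative)


class Stats:
    """Holds the error counter bumped on implausible readings."""

    def __init__(self) -> None:
        self.buf_avail_err = 0


def check_available(reported: int, stats: Stats,
                    max_len: int = MAX_BUF_LEN) -> Optional[int]:
    """Return the usable byte count, or None if the reading is implausible.

    A value above the physical buffer size can only come from a corrupted
    SPI transfer, so it is counted and the caller should retry later.
    """
    if reported > max_len:
        stats.buf_avail_err += 1
        return None
    return reported
```

Rejecting the reading and retrying is cheaper than acting on a bogus value, which would overflow the chip's buffer and force a reset.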
    • David Barmann's avatar
      sock: Reset dst when changing sk_mark via setsockopt · 50254256
      David Barmann authored
      When setting the SO_MARK socket option, if the mark changes, the dst
      needs to be reset so that a new route lookup is performed.
      
      This fixes the case where an application wants to change routing by
      setting a new sk_mark.  If this is done after some packets have already
      been sent, the dst is already cached and the new mark has no effect.
      Signed-off-by: David Barmann <david.barmann@stackpath.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
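Why the cached dst has to be dropped when the mark changes can be shown with a small model. `Sock`, `set_mark`, `send` and the route names are hypothetical; the sketch only mimics the caching behaviour, not the real socket code.

```python
from typing import Callable, Optional


class Sock:
    """Toy socket that caches the route chosen for its current mark."""

    def __init__(self, route_lookup: Callable[[int], str]) -> None:
        self._route_lookup = route_lookup
        self.mark = 0
        self._dst: Optional[str] = None

    def set_mark(self, mark: int) -> None:
        if mark != self.mark:
            self.mark = mark
            self._dst = None  # drop the cached route; next send looks it up again

    def send(self) -> str:
        if self._dst is None:
            self._dst = self._route_lookup(self.mark)
        return self._dst
```

Without the reset in `set_mark`, every later `send` would keep returning the route selected for the old mark. Setting the same mark again leaves the cache untouched, so no extra lookup is performed.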
    • David S. Miller's avatar
      Merge branch 's390-qeth-next' · 52358cb5
      David S. Miller authored
      Julian Wiedmann says:
      
      ====================
      s390/qeth: updates 2018-11-08
      
      please apply the following qeth patches to net-next.
      
      The first patch allows one more device type to query the FW for a MAC address,
      the others are all basically just removal of duplicated or unused code.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Julian Wiedmann's avatar
      s390/qeth: don't process hsuid in qeth_l3_setup_netdev() · ded9da1f
      Julian Wiedmann authored
      qeth_l3_setup_netdev() checks if the hsuid attribute is set on the qeth
      device, and propagates it to the net_device. In the past this was needed
      to pick up any hsuid that was set before allocation of the net_device.
      
      With commit d3d1b205 ("s390/qeth: allocate netdevice early") this
      is no longer necessary, qeth_l3_dev_hsuid_store() always stores the
      hsuid straight into dev->perm_addr.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Julian Wiedmann's avatar
      s390/qeth: remove unused fallback in Layer3's MAC code · 9168f5ae
      Julian Wiedmann authored
      If the CREATE ADDR sent by qeth_l3_iqd_read_initial_mac() fails, its
      callback sets a random MAC address on the net_device. The error then
      propagates back, and qeth_l3_setup_netdev() bails out without
      registering the net_device.
      
      Any subsequent call to qeth_l3_setup_netdev() will then attempt a fresh
      CREATE ADDR which either 1) also fails, or 2) sets a proper MAC address
      on the net_device. Consequently, the net_device will never be registered
      with a random MAC and we can drop the fallback code.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Julian Wiedmann's avatar
      s390/qeth: remove two IPA command helpers · 4fa55fa9
      Julian Wiedmann authored
      qeth_l3_send_ipa_arp_cmd() is merely a wrapper around
      qeth_send_control_data() now. So push the length adjustment into
      QETH_SETASS_BASE_LEN, and remove the wrapper. While at it, also remove
      some redundant 0-initializations.
      
      qeth_send_setassparms() requires that callers prepare their command
      parameters, so that they can be copied into the parameter area in one
      go. Skip the indirection, and just let callers set up the command
      themselves.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Julian Wiedmann's avatar
      s390/qeth: replace open-coded cmd setup · 605c9d5f
      Julian Wiedmann authored
      Call qeth_prepare_ipa_cmd() during setup of a new IPA cmd buffer, so
      that it is used for all commands. Thus ARP and SNMP requests don't have
      to do their own initialization.
      
      This will now also set the proper MPC protocol version for SNMP requests
      on L2 devices.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Julian Wiedmann's avatar
      s390/qeth: remove card list · d7d18da1
      Julian Wiedmann authored
      Re-implement the card-by-RDEV lookup by using device model concepts, and
      remove the now redundant list of all qeth card instances in the system.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Julian Wiedmann's avatar
      s390/qeth: unify transmit code · 81ec5439
      Julian Wiedmann authored
      Since commit 82bf5c08 ("s390/qeth: add support for IPv6 TSO"),
      qeth_xmit() also knows how to build TSO packets and is practically
      identical to qeth_l3_xmit().
      Convert qeth_l3_xmit() into a thin wrapper that merely strips the
      L2 header off a packet, and calls qeth_xmit() for the actual
      TX processing.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Julian Wiedmann's avatar
      s390/qeth: handle af_iucv skbs in qeth_l3_fill_header() · 5a541f6d
      Julian Wiedmann authored
      Filling the HW header from one single function will make it easier to
      rip out all the duplicated transmit code in qeth_l3_xmit(). On top, this
      saves one conditional branch in the TSO path.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Julian Wiedmann's avatar
      s390/qeth: utilize virtual MAC for Layer2 OSD devices · b144b99f
      Julian Wiedmann authored
      By default, READ MAC on a Layer2 OSD device returns the adapter's
      burnt-in MAC address. Given the default scenario of many virtual devices
      on the same adapter, qeth can't make any use of this address and
      therefore skips the READ MAC call for this device type.
      
      But in some configurations, the READ MAC command for a Layer2 OSD device
      actually returns a pre-provisioned, virtual MAC address. So enable the
      READ MAC code to detect this situation, and let the L2 subdriver
      call READ MAC for OSD devices.
      
      This also removes the QETH_LAYER2_MAC_READ flag, which protects L2
      devices against calling READ MAC multiple times. Instead protect the
      whole call to qeth_l2_request_initial_mac().
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Li RongQing's avatar
      openvswitch: remove BUG_ON from get_dpdev · 04087d9a
      Li RongQing authored
      If local is a NULL pointer, the following access to local's dev
      will trigger a panic anyway, which is the same as BUG_ON.
      Signed-off-by: Li RongQing <lirongqing@baidu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'ICMP-error-handling-for-UDP-tunnels' · 20da4ef9
      David S. Miller authored
      Stefano Brivio says:
      
      ====================
      ICMP error handling for UDP tunnels
      
      This series introduces ICMP error handling for UDP tunnels and
      encapsulations and related selftests. We need to handle ICMP errors to
      support PMTU discovery and route redirection -- this support is entirely
      missing right now:
      
      - patch 1/11 adds a socket lookup for UDP tunnels that use, by design,
        the same destination port on both endpoints -- i.e. VXLAN and GENEVE
      - patches 2/11 to 7/11 are specific to VXLAN and GENEVE
      - patches 8/11 and 9/11 add infrastructure for lookup of encapsulations
        where sent packets cannot be matched via receiving socket lookup, i.e.
        FoU and GUE
      - patches 10/11 and 11/11 are specific to FoU and GUE
      
      v2: changes are listed in the single patches
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Stefano Brivio's avatar
      selftests: pmtu: Introduce FoU and GUE PMTU exceptions tests · 56fd865f
      Stefano Brivio authored
      Introduce eight tests, for FoU and GUE, with IPv4 and IPv6 payload,
      on IPv4 and IPv6 transport, that check that PMTU exceptions are created
      with the right value when exceeding the MTU on a link of the path.
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Stefano Brivio's avatar
      fou, fou6: ICMP error handlers for FoU and GUE · b8a51b38
      Stefano Brivio authored
      As the destination port in FoU and GUE receiving sockets doesn't
      necessarily match the remote destination port, we can't associate errors
      to the encapsulating tunnels with a socket lookup -- we need to blindly
      try them instead. This means we don't even know if we are handling errors
      for FoU or GUE without digging into the packets.
      
      Hence, implement a single handler for both, one for IPv4 and one for IPv6,
      that will check whether the packet that generated the ICMP error used a
      direct IP encapsulation or if it had a GUE header, and send the error to
      the matching protocol handler, if any.
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Stefano Brivio's avatar
      udp: Support for error handlers of tunnels with arbitrary destination port · e7cc0824
      Stefano Brivio authored
      ICMP error handling is currently not possible for UDP tunnels not
      employing a receiving socket with local destination port matching the
      remote one, because we have no way to look them up.
      
      Add an err_handler tunnel encapsulation operation that can be exported by
      tunnels in order to pass the error to the protocol implementing the
      encapsulation. We can't easily use a lookup function as we did for VXLAN
      and GENEVE, as protocol error handlers, which would be in turn called by
      implementations of this new operation, handle the errors themselves,
      together with the tunnel lookup.
      
      Without a socket, we can't be sure which encapsulation error handler is
      the appropriate one: encapsulation handlers (the ones for FoU and GUE
      introduced in the next patch, e.g.) will need to check the new error codes
      returned by protocol handlers to figure out if errors match the given
      encapsulation, and, in turn, report this error back, so that we can try
      all of them in __udp{4,6}_lib_err_encap_no_sk() until we have a match.
      
      v2:
      - Name all arguments in err_handler prototypes (David Miller)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
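The "try every encapsulation handler until one matches" scheme described above can be sketched like this. It is a model under stated assumptions: the handler names, the packet dictionary and the use of -ENOENT as the "no match" code are illustrative and not taken from the patch.

```python
import errno
from typing import Callable, Dict, Iterable

Handler = Callable[[Dict[str, str]], int]


def encap_a_err(pkt: Dict[str, str]) -> int:
    """Hypothetical handler: 0 if the error belongs to encapsulation A."""
    return 0 if pkt.get("encap") == "a" else -errno.ENOENT


def encap_b_err(pkt: Dict[str, str]) -> int:
    """Hypothetical handler: 0 if the error belongs to encapsulation B."""
    return 0 if pkt.get("encap") == "b" else -errno.ENOENT


def err_encap_no_sk(pkt: Dict[str, str], handlers: Iterable[Handler]) -> int:
    """Without a socket to consult, offer the error to each handler in turn."""
    for handler in handlers:
        if handler(pkt) == 0:
            return 0  # this encapsulation recognised and handled the error
    return -errno.ENOENT  # nobody claimed it; the caller ignores the error
```

This is why the handlers need a return value at all: a void handler could not tell the caller whether the error matched its encapsulation.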
    • Stefano Brivio's avatar
      net: Convert protocol error handlers from void to int · 32bbd879
      Stefano Brivio authored
      We'll need this to handle ICMP errors for tunnels without a sending socket
      (i.e. FoU and GUE). There, we might have to look up different types of IP
      tunnels, registered as network protocols, before we get a match, so we
      want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
      inet_protos and inet6_protos. These error codes will be used in the next
      patch.
      
      For consistency, return sensible error codes in protocol error handlers
      whenever handlers can't handle errors because, even if valid, they don't
      match a protocol or any of its states.
      
      This has no effect on existing error handling paths.
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Stefano Brivio's avatar
      selftests: pmtu: Introduce tests for IPv4/IPv6 over GENEVE over IPv4/IPv6 · ce733661
      Stefano Brivio authored
      Use a router between endpoints, implemented via namespaces, set a low MTU
      between router and destination endpoint, exceed it and check PMTU value in
      route exceptions.
      
      v2:
      - Introduce IPv4 tests right away, if iproute2 doesn't support the 'df'
        link option they will be skipped (David Ahern)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Stefano Brivio's avatar
      geneve: Allow configuration of DF behaviour · a025fb5f
      Stefano Brivio authored
      draft-ietf-nvo3-geneve-08 says:
      
         It is strongly RECOMMENDED that Path MTU Discovery ([RFC1191],
         [RFC1981]) be used by setting the DF bit in the IP header when Geneve
         packets are transmitted over IPv4 (this is the default with IPv6).
      
      Now that ICMP error handling is working for GENEVE, we can comply with
      this recommendation.
      
      Make this configurable, though, to avoid breaking existing setups. By
      default, DF won't be set. It can be set or inherited from inner IPv4
      packets. If it's configured to be inherited and we are encapsulating IPv6,
      it will be set.
      
      This only applies to non-lwt tunnels: if an external control plane is
      used, tunnel key will still control the DF flag.
      
      v2:
      - DF behaviour configuration only applies for non-lwt tunnels, apply DF
        setting only if (!geneve->collect_md) in geneve_xmit_skb()
        (Stephen Hemminger)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
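The DF policy described above (unset by default, optionally set, or inherited from the inner packet) can be written down as a small decision function. The mode strings and the function name `outer_df` are assumptions for this sketch; it covers only the non-lwt case, where the device configuration decides.

```python
def outer_df(mode: str, inner_is_ipv4: bool, inner_df: bool = False) -> bool:
    """Decide whether the outer IPv4 header carries the DF bit.

    mode is one of "unset" (default), "set" or "inherit".
    """
    if mode == "unset":
        return False
    if mode == "set":
        return True
    if mode == "inherit":
        # Inner IPv4: copy its DF bit. Inner IPv6 has no DF bit and is
        # never fragmented by routers, so the outer header always sets DF.
        return inner_df if inner_is_ipv4 else True
    raise ValueError(f"unknown DF mode: {mode!r}")
```

For example, `outer_df("inherit", inner_is_ipv4=False)` yields True, matching the rule that encapsulated IPv6 always sets DF when inheritance is configured.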
    • Stefano Brivio's avatar
      geneve: ICMP error lookup handler · a0796644
      Stefano Brivio authored
      Export an encap_err_lookup() operation to match an ICMP error against a
      valid VNI.
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Stefano Brivio's avatar
      selftests: pmtu: Introduce tests for IPv4/IPv6 over VXLAN over IPv4/IPv6 · 58288879
      Stefano Brivio authored
      Use a router between endpoints, implemented via namespaces, set a low MTU
      between router and destination endpoint, exceed it and check PMTU value in
      route exceptions.
      
      v2:
      - Change all occurrences of VxLAN to VXLAN (Jiri Benc)
      - Introduce IPv4 tests right away, if iproute2 doesn't support the 'df'
        link option they will be skipped (David Ahern)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Stefano Brivio's avatar
      vxlan: Allow configuration of DF behaviour · b4d30697
      Stefano Brivio authored
      Allow users to set the IPv4 DF bit in outgoing packets, or to inherit its
      value from the IPv4 inner header. If the encapsulated protocol is IPv6 and
      DF is configured to be inherited, always set it.
      
      For IPv4, inheriting DF from the inner header was probably intended from
      the very beginning judging by the comment to vxlan_xmit(), but it wasn't
      actually implemented -- also because it would have done more harm than
      good, without handling for ICMP Fragmentation Needed messages.
      
      According to RFC 7348, "Path MTU discovery MAY be used". An expired RFC
      draft, draft-saum-nvo3-pmtud-over-vxlan-05, whose purpose was to describe
      PMTUD implementation, says that "is a MUST that Vxlan gateways [...]
      SHOULD set the DF-bit [...]", whatever that means.
      
      Given this background, the only sane option is probably to let the user
      decide, and keep the current behaviour as default.
      
      This only applies to non-lwt tunnels: if an external control plane is
      used, tunnel key will still control the DF flag.
      
      v2:
      - DF behaviour configuration only applies for non-lwt tunnels, move DF
        setting to if (!info) block in vxlan_xmit_one() (Stephen Hemminger)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Stefano Brivio's avatar
      vxlan: ICMP error lookup handler · c3a43b9f
      Stefano Brivio authored
      Export an encap_err_lookup() operation to match an ICMP error against a
      valid VNI.
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Stefano Brivio's avatar
      udp: Handle ICMP errors for tunnels with same destination port on both endpoints · a36e185e
      Stefano Brivio authored
      For both IPv4 and IPv6, if we can't match errors to a socket, try
      tunnels before ignoring them. Look up a socket with the original source
      and destination ports as found in the UDP packet inside the ICMP payload,
      this will work for tunnels that force the same destination port for both
      endpoints, i.e. VXLAN and GENEVE.
      
      Actually, lwtunnels could break this assumption if they are configured by
      an external control plane to have different destination ports on the
      endpoints: in this case, we won't be able to trace ICMP messages back to
      them.
      
      For IPv6 redirect messages, call ip6_redirect() directly with the output
      interface argument set to the interface we received the packet from (as
      it's the very interface we should build the exception on), otherwise the
      new nexthop will be rejected. There's no such need for IPv4.
      
      Tunnels can now export an encap_err_lookup() operation that indicates a
      match. Pass the packet to the lookup function, and if the tunnel driver
      reports a matching association, continue with regular ICMP error handling.
      
      v2:
      - Added newline between network and transport header sets in
        __udp{4,6}_lib_err_encap() (David Miller)
      - Removed redundant skb_reset_network_header(skb); in
        __udp4_lib_err_encap()
      - Removed redundant reassignment of iph in __udp4_lib_err_encap()
        (Sabrina Dubroca)
      - Edited comment to __udp{4,6}_lib_err_encap() to reflect the fact this
        won't work with lwtunnels configured to use asymmetric ports. By the way,
        it's VXLAN, not VxLAN (Jiri Benc)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
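The fallback lookup described above can be modelled roughly as follows. This is a heavily simplified sketch: it keys sockets by local port only and ignores addresses, namespaces and interfaces. `UdpSock`, `find_sock_for_icmp_error` and the `inner` dictionary layout are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class UdpSock:
    local_port: int
    # Tunnel sockets may export a lookup that says whether an ICMP error
    # refers to one of their tunnels (for example by checking the VNI).
    encap_err_lookup: Optional[Callable[[dict], bool]] = None


def find_sock_for_icmp_error(socks: Dict[int, UdpSock],
                             inner: dict) -> Optional[UdpSock]:
    """Find the socket an ICMP error belongs to.

    inner describes the UDP packet quoted inside the ICMP payload.
    """
    # Regular case: the sender is bound to the quoted source port.
    sk = socks.get(inner["sport"])
    if sk is not None:
        return sk
    # Tunnel case: both endpoints use the same destination port, so the
    # local tunnel socket is bound to the quoted destination port.
    tun = socks.get(inner["dport"])
    if tun is not None and tun.encap_err_lookup is not None:
        if tun.encap_err_lookup(inner):
            return tun
    return None  # no match: the error is ignored
```

The second step works only when both endpoints share the destination port, which is why tunnels configured with asymmetric ports cannot be traced back this way.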