1. 09 Nov, 2016 30 commits
  2. 08 Nov, 2016 5 commits
    • Soheil Hassas Yeganeh's avatar
      sock: do not set sk_err in sock_dequeue_err_skb · f5f99309
      Soheil Hassas Yeganeh authored
      Do not set sk_err when dequeuing errors from the error queue.
      Doing so results in:
      a) Bugs: By overwriting existing sk_err values, it possibly
         hides legitimate errors. It is also incorrect when local
         errors are queued with ip_local_error. That happens in the
         context of a system call, which already returns the error
         code.
      b) Inconsistent behavior: When there are pending errors on
         the error queue, sk_err is sometimes 0 (e.g., for
         the first timestamp on the error queue) and sometimes
         set to an error code (after dequeuing the first
         timestamp).
      c) Suboptimality: Setting sk_err to ENOMSG on simple
         TX timestamps can abort parallel reads and writes.
      
      Removing this line doesn't break userspace. This is because
      userspace code cannot rely on sk_err for detecting whether
      there is something on the error queue. Except for ICMP messages
      received for UDP and RAW, sk_err is not set at enqueue time,
      and as a result sk_err can be 0 while there are plenty of
      errors on the error queue.
      
      For ICMP packets in UDP and RAW, sk_err is set when they are
      enqueued on the error queue, but that does not result in aborting
      reads and writes. For such cases, sk_err is only readable via
      getsockopt(SO_ERROR) which will reset the value of sk_err on
      its own. More importantly, prior to this patch,
      recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e.,
      sk_err is set by sock_dequeue_err_skb without atomic ops or
      locks) which can store 0 in sk_err even when we have ICMP
      messages pending. Removing this line from sock_dequeue_err_skb
      eliminates that race.
      Signed-off-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f5f99309
    • David S. Miller's avatar
      Merge branch 'IFF_NO_QUEUE-semantics' · 5f7f7502
      David S. Miller authored
      Jesper Dangaard Brouer says:
      
      ====================
      qdisc and tx_queue_len cleanups for IFF_NO_QUEUE devices
      
      This patchset is a cleanup for IFF_NO_QUEUE devices.  It will
      hopefully help userspace get a more consistent behavior when attaching
      qdisc to such virtual devices.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5f7f7502
    • Jesper Dangaard Brouer's avatar
      qdisc: catch misconfig of attaching qdisc to tx_queue_len zero device · 84c46dd8
      Jesper Dangaard Brouer authored
      It is a clear misconfiguration to attach a qdisc to a device with
      tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred,
      htb, plug and sfb) inherit/copy this value as their queue length.
      
      Why should the kernel catch such a misconfiguration?  Because prior to
      introducing the IFF_NO_QUEUE device flag, userspace found a loophole
      in the qdisc config system that allowed them to achieve the equivalent
      of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from
      a device.  The loophole on older kernels is setting tx_queue_len=0,
      *prior* to device qdisc init (the config time is significant, simply
      setting tx_queue_len=0 doesn't trigger the loophole).
      
      This loophole is currently used by Docker[1] to get better performance
      and scalability out of the veth device.  The Docker developers were
      warned[1] that they needed to adjust the tx_queue_len if ever
      attaching a qdisc.  The OpenShift project didn't remember this warning
      and attached a qdisc, this were caught and fixed in[2].
      
      [1] https://github.com/docker/libcontainer/pull/193
      [2] https://github.com/openshift/origin/pull/11126
      
      Instead of fixing every userspace program that used this loophole, and
      forgot to reset the tx_queue_len, prior to attaching a qdisc.  Let's
      catch the misconfiguration on the kernel side.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      84c46dd8
    • Jesper Dangaard Brouer's avatar
      net/qdisc: IFF_NO_QUEUE drivers should use consistent TX queue len · 11597084
      Jesper Dangaard Brouer authored
      The flag IFF_NO_QUEUE marks virtual device drivers that doesn't need a
      default qdisc attached, given they will be backed by physical device,
      that already have a qdisc attached for pushback.
      
      It is still supported to attach a qdisc to a IFF_NO_QUEUE device, as
      this can be useful for difference policy reasons (e.g. bandwidth
      limiting containers).  For this to work, the tx_queue_len need to have
      a sane value, because some qdiscs inherit/copy the tx_queue_len
      (namely, pfifo, bfifo, gred, htb, plug and sfb).
      
      Commit a813104d ("IFF_NO_QUEUE: Fix for drivers not calling
      ether_setup()") caught situations where some drivers didn't initialize
      tx_queue_len.  The problem with the commit was choosing 1 as the
      fallback value.
      
      A qdisc queue length of 1 causes more harm than good, because it
      creates hard to debug situations for userspace. It gives userspace a
      false sense of a working config after attaching a qdisc.  As low
      volume traffic (that doesn't activate the qdisc policy) works,
      like ping, while traffic that e.g. needs shaping cannot reach the
      configured policy levels, given the queue length is too small.
      
      This patch change the value to DEFAULT_TX_QUEUE_LEN, given other
      IFF_NO_QUEUE devices (that call ether_setup()) also use this value.
      
      Fixes: a813104d ("IFF_NO_QUEUE: Fix for drivers not calling ether_setup()")
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      11597084
    • Jesper Dangaard Brouer's avatar
      net: make default TX queue length a defined constant · d0a81f67
      Jesper Dangaard Brouer authored
      The default TX queue length of Ethernet devices have been a magic
      constant of 1000, ever since the initial git import.
      
      Looking back in historical trees[1][2] the value used to be 100,
      with the same comment "Ethernet wants good queues". The commit[3]
      that changed this from 100 to 1000 didn't describe why, but from
      conversations with Robert Olsson it seems that it was changed
      when Ethernet devices went from 100Mbit/s to 1Gbit/s, because the
      link speed increased x10 the queue size were also adjusted.  This
      value later caused much heartache for the bufferbloat community.
      
      This patch merely moves the value into a defined constant.
      
      [1] https://git.kernel.org/cgit/linux/kernel/git/davem/netdev-vger-cvs.git/
      [2] https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/
      [3] https://git.kernel.org/tglx/history/c/98921832c232Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d0a81f67
  3. 07 Nov, 2016 5 commits
    • David S. Miller's avatar
      Merge branch 'udp-fwd-mem-sched-on-dequeue' · fc13fd39
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      udp: do fwd memory scheduling on dequeue
      
      After commit 850cbadd ("udp: use it's own memory accounting schema"),
      the udp code needs to acquire twice the receive queue spinlock on dequeue.
      
      This patch series remove the need for the second lock at skb free time,
      moving the udp memory scheduling inside the dequeue operation; the skb
      destructor field is not used anymore and an additional sk argument is added
      to ip_cmsg_recv_offset() to cope with null skb->sk after dequeue.
      
      Many thanks to Eric Dumazed for suggesting pretty all much the above.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fc13fd39
    • Paolo Abeni's avatar
      udp: do fwd memory scheduling on dequeue · 7c13f97f
      Paolo Abeni authored
      A new argument is added to __skb_recv_datagram to provide
      an explicit skb destructor, invoked under the receive queue
      lock.
      The UDP protocol uses such argument to perform memory
      reclaiming on dequeue, so that the UDP protocol does not
      set anymore skb->desctructor.
      Instead explicit memory reclaiming is performed at close() time and
      when skbs are removed from the receive queue.
      The in kernel UDP protocol users now need to call a
      skb_recv_udp() variant instead of skb_recv_datagram() to
      properly perform memory accounting on dequeue.
      
      Overall, this allows acquiring only once the receive queue
      lock on dequeue.
      
      Tested using pktgen with random src port, 64 bytes packet,
      wire-speed on a 10G link as sender and udp_sink as the receiver,
      using an l4 tuple rxhash to stress the contention, and one or more
      udp_sink instances with reuseport.
      
      nr sinks	vanilla		patched
      1		440		560
      3		2150		2300
      6		3650		3800
      9		4450		4600
      12		6250		6450
      
      v1 -> v2:
       - do rmem and allocated memory scheduling under the receive lock
       - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
       - avoid the typdef for the dequeue callback
      Suggested-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7c13f97f
    • Paolo Abeni's avatar
      net/sock: add an explicit sk argument for ip_cmsg_recv_offset() · ad959036
      Paolo Abeni authored
      So that we can use it even after orphaining the skbuff.
      Suggested-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ad959036
    • David Ahern's avatar
      net: Update raw socket bind to consider l3 domain · cd2c0f45
      David Ahern authored
      Binding a raw socket to a local address fails if the socket is bound
      to an L3 domain:
      
          $ vrf-test  -s -l 10.100.1.2 -R -I red
          error binding socket: 99: Cannot assign requested address
      
      Update raw_bind to look consider if sk_bound_dev_if is bound to an L3
      domain and use inet_addr_type_table to lookup the address.
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cd2c0f45
    • David S. Miller's avatar
      Merge branch 'ns2-amac' · 0ca6e000
      David S. Miller authored
      Jon Mason says:
      
      ====================
      add NS2 support to bgmac
      
      Changes in v6:
      * Use a common bgmac_phy_connect_direct (per Rafal Milecki)
      * Rebased on latest net-next
      * Added Reviewed-by to the relevant patches
      
      Changes in v5:
      * Change a pr_err to netdev_err (per Scott Branden)
      * Reword the lane swap binding documentation (per Andrew Lunn)
      
      Changes in v4:
      * Actually send out the lane swap binding doc patch (Per Scott Branden)
      * Remove unused #define (Per Andrew Lunn)
      
      Changes in v3:
      * Clean-up the bgmac DT binding doc (per Rob Herring)
      * Document the lane swap binding and make it generic (Per Andrew Lunn)
      
      Changes in v2:
      * Remove the PHY power-on (per Andrew Lunn)
      * Misc PHY clean-ups regarding comments and #defines (per Andrew Lunn)
        This results on none of the original PHY code from Vikas being
        present.  So, I'm removing him as an author and giving him
        "Inspired-by" credit.
      * Move PHY lane swapping to PHY driver (per Andrew Lunn and Florian
        Fainelli)
      * Remove bgmac sleep (per Florian Fainelli)
      * Re-add bgmac chip reset (per Florian Fainelli and Ray Jui)
      * Rebased on latest net-next
      * Added patch for bcm54xx_auxctl_read, which is used in the BCM54810
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0ca6e000