1. 24 Apr, 2018 24 commits
    • Merge branch 'ipconfig-NTP-server-support-bug-fixes-documentation-improvements' · bc0fbc66
      David S. Miller authored
      Chris Novakovic says:
      
      ====================
      ipconfig: NTP server support, bug fixes, documentation improvements
      
      This series (against net-next) makes various improvements to ipconfig:
      
       - Patch #1 correctly documents the behaviour of parameter 4 in the
         "ip=" and "nfsaddrs=" command line parameters.
       - Patch #2 tidies up the printk()s for reporting configured name
         servers.
       - Patch #3 fixes a bug in autoconfiguration via BOOTP whereby the IP
         addresses of IEN-116 name servers are requested from the BOOTP
         server, rather than those of DNS name servers.
       - Patch #4 requests the number of DNS servers specified by
         CONF_NAMESERVERS_MAX when autoconfiguring via BOOTP, rather than
         hardcoding it to 2.
       - Patch #5 fully documents the contents and format of /proc/net/pnp in
         Documentation/filesystems/nfs/nfsroot.txt.
       - Patch #6 fixes a bug whereby bogus information is written to
         /proc/net/pnp when ipconfig is not used.
       - Patch #7 creates a new procfs directory for ipconfig-related
         configuration reports at /proc/net/ipconfig.
       - Patch #8 allows for NTP servers to be configured (manually on the
         kernel command line or automatically via DHCP), enabling systems with
         an NFS root filesystem to synchronise their clock before mounting
         their root filesystem. NTP server IP addresses are written to
         /proc/net/ipconfig/ntp_servers.
      
      Changes from v1:
      
       - David requested that a new directory /proc/net/ipconfig be created to
         contain ipconfig-related configuration reports, which is implemented
         in the new patch #7. NTP server IPs are now written to this directory
         instead of /proc/net/ntp in the new patch #8.
       - Cong and David both requested that the modification to CREDITS be
         dropped. This patch has been removed from the series.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_servers · c04d2cb2
      Chris Novakovic authored
      Distributed filesystems are most effective when the server and client
      clocks are synchronised. Embedded devices often use NFS for their
      root filesystem but typically do not contain an RTC, so the clocks of
      the NFS server and the embedded device will be out-of-sync when the root
      filesystem is mounted (and may not be synchronised until late in the
      boot process).
      
      Extend ipconfig with the ability to export IP addresses of NTP servers
      it discovers to /proc/net/ipconfig/ntp_servers. They can be supplied as
      follows:
      
       - If ipconfig is configured manually via the "ip=" or "nfsaddrs="
         kernel command line parameters, one NTP server can be specified in
         the new "<ntp0-ip>" parameter.
       - If ipconfig is autoconfigured via DHCP, request DHCP option 42 in
         the DHCPDISCOVER message, and record the IP addresses of up to three
         NTP servers sent by the responding DHCP server in the subsequent
         DHCPOFFER message.
      
      ipconfig will only write the NTP server IP addresses it discovers to
      /proc/net/ipconfig/ntp_servers, one per line (in the order received from
      the DHCP server, if DHCP autoconfiguration is used); making use of these
      NTP servers is the responsibility of a user space process (e.g. an
      initrd/initram script that invokes an NTP client before mounting an NFS
      root filesystem).
      Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
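
As an illustration of the user space side described above, the following is a minimal sketch of how an initrd helper might read the one-address-per-line format. It assumes IPv4 addresses only; the function name `parse_ntp_servers` and the sample addresses are hypothetical and not part of the patch.

```python
import ipaddress


def parse_ntp_servers(text):
    """Return the addresses listed one per line, in file order.

    Hypothetical helper: blank lines are skipped, and a malformed
    address raises ValueError from the ipaddress module.
    """
    servers = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        servers.append(ipaddress.IPv4Address(line))
    return servers


if __name__ == "__main__":
    # On a real system the text would come from the file named in the
    # commit message: /proc/net/ipconfig/ntp_servers
    sample = "10.0.0.1\n10.0.0.2\n"
    for server in parse_ntp_servers(sample):
        print(server)
```

Order is preserved because the commit message states that addresses are written in the order received from the DHCP server, so a caller could treat the first entry as the preferred server.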
    • ipconfig: Create /proc/net/ipconfig directory · 4d019b3f
      Chris Novakovic authored
      To allow ipconfig to report IP configuration details to user space
      processes without cluttering /proc/net, create a new subdirectory
      /proc/net/ipconfig. All files containing IP configuration details should
      be written to this directory.
      Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipconfig: Correctly initialise ic_nameservers · 300eec7c
      Chris Novakovic authored
      ic_nameservers, which stores the list of name servers discovered by
      ipconfig, is initialised (i.e. has all of its elements set to NONE, or
      0xffffffff) by ic_nameservers_predef() in the following scenarios:
      
       - before the "ip=" and "nfsaddrs=" kernel command line parameters are
         parsed (in ip_auto_config_setup());
       - before autoconfiguring via DHCP or BOOTP (in ic_bootp_init()), in
         order to clear any values that may have been set after parsing "ip="
         or "nfsaddrs=" and are no longer needed.
      
      This means that ic_nameservers_predef() is not called when neither "ip="
      nor "nfsaddrs=" is specified on the kernel command line. In this
      scenario, every element in ic_nameservers remains set to 0x00000000,
      which is indistinguishable from ANY and causes pnp_seq_show() to write
      the following (bogus) information to /proc/net/pnp:
      
        #MANUAL
        nameserver 0.0.0.0
        nameserver 0.0.0.0
        nameserver 0.0.0.0
      
      This is potentially problematic for systems that blindly link
      /etc/resolv.conf to /proc/net/pnp.
      
      Ensure that ic_nameservers is also initialised when neither "ip=" nor
      "nfsaddrs=" is specified by calling ic_nameservers_predef() in
      ip_auto_config(), but only when ip_auto_config_setup() was not called
      earlier. This causes the following to be written to /proc/net/pnp, and
      is consistent with what gets written when ipconfig is configured
      manually but no name servers are specified on the kernel command line:
      
        #MANUAL
      Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
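
To show why the bogus entries matter to consumers of /proc/net/pnp, here is a hedged sketch of a user space reader that ignores 0.0.0.0 name servers. It relies only on the "nameserver <ip>" line format quoted in the commit message; the function name `usable_nameservers` is hypothetical.

```python
def usable_nameservers(pnp_text):
    """Collect name server addresses from pnp-style text.

    Lines that are not of the form "nameserver <ip>" are ignored, as are
    entries equal to 0.0.0.0 (the bogus value described above).
    """
    servers = []
    for line in pnp_text.splitlines():
        fields = line.split()
        if len(fields) != 2 or fields[0] != "nameserver":
            continue
        if fields[1] == "0.0.0.0":
            continue
        servers.append(fields[1])
    return servers


if __name__ == "__main__":
    before_fix = "#MANUAL\nnameserver 0.0.0.0\nnameserver 0.0.0.0\nnameserver 0.0.0.0\n"
    after_fix = "#MANUAL\n"
    # Both inputs yield no usable name servers with this reader.
    print(usable_nameservers(before_fix), usable_nameservers(after_fix))
```

A resolver that reads the file directly has no such filter, which is why the fix is made in the kernel rather than left to each consumer.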
    • ipconfig: Document /proc/net/pnp · 8b0b37c5
      Chris Novakovic authored
      Fully document the format used by the /proc/net/pnp file written by
      ipconfig, explain where its values originate from, and clarify that the
      tertiary name server IP and DNS domain name are only written to the file
      when autoconfiguration is used.
      Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipconfig: BOOTP: Request CONF_NAMESERVERS_MAX name servers · de1fa15b
      Chris Novakovic authored
      When ipconfig is autoconfigured via BOOTP, the request packet
      initialised by ic_bootp_init_ext() always allocates 8 bytes for the name
      server option, limiting the BOOTP server to responding with at most 2
      name servers even though ipconfig in fact supports an arbitrary number
      of name servers (as defined by CONF_NAMESERVERS_MAX, which is currently
      3).
      
      Only request name servers in the request packet if CONF_NAMESERVERS_MAX
      is positive (to comply with [1, §3.8]), and allocate enough space in the
      packet for CONF_NAMESERVERS_MAX name servers to indicate the maximum
      number we can accept in response.
      
      [1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions":
          https://tools.ietf.org/rfc/rfc2132.txt
      Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
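
The sizing rule above can be sketched outside the kernel as follows. This is an illustration, not the kernel code: it assumes the option is encoded as a tag byte, a length byte, and a zero-filled payload of 4 bytes per server, with tag 6 being the Domain Name Server option from RFC 2132 §3.8. The function name `dns_request_option` is hypothetical.

```python
CONF_NAMESERVERS_MAX = 3  # current value according to the commit message
TAG_DOMAIN_NAME_SERVER = 6  # RFC 2132, section 3.8


def dns_request_option(max_servers=CONF_NAMESERVERS_MAX):
    """Build a placeholder name server option for a request packet.

    Returns an empty byte string when max_servers is not positive, so the
    option is omitted entirely instead of being sent with zero length.
    """
    if max_servers <= 0:
        return b""
    length = 4 * max_servers  # one IPv4 address per server
    # bytes() raises ValueError if length does not fit in a single byte.
    return bytes([TAG_DOMAIN_NAME_SERVER, length]) + bytes(length)


if __name__ == "__main__":
    option = dns_request_option()
    print(len(option), option.hex())
```

With the value 3, the payload grows from the previous fixed 8 bytes to 12 bytes, which is what signals to the server that three addresses can be accepted.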
    • ipconfig: BOOTP: Don't request IEN-116 name servers · 4e1a8af2
      Chris Novakovic authored
      When ipconfig is autoconfigured via BOOTP, the request packet
      initialised by ic_bootp_init_ext() allocates 8 bytes for tag 5 ("Name
      Server" [1, §3.7]), but tag 5 in the response isn't processed by
      ic_do_bootp_ext(). Instead, allocate the 8 bytes to tag 6 ("Domain Name
      Server" [1, §3.8]), which is processed by ic_do_bootp_ext(), and appears
      to have been the intended tag to request.
      
      This won't cause any breakage for existing users, as tag 5 responses
      provided by BOOTP servers weren't being processed anyway.
      
      [1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions":
          https://tools.ietf.org/rfc/rfc2132.txt
      Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
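
The difference between the two tags can be illustrated with a small parser for the tag/length/value layout that RFC 2132 defines for vendor extensions. This is a sketch under stated assumptions, not a port of ic_do_bootp_ext(): it handles only pad (0), end (255), and tag 6, and the function name `dns_servers_from_options` is hypothetical.

```python
import ipaddress

TAG_PAD = 0
TAG_END = 255
TAG_DOMAIN_NAME_SERVER = 6  # tag 5 (IEN-116 "Name Server") is skipped like any other


def dns_servers_from_options(options):
    """Extract tag 6 addresses from a bytes object of BOOTP options."""
    servers = []
    i = 0
    while i < len(options):
        tag = options[i]
        if tag == TAG_END:
            break
        if tag == TAG_PAD:  # pad has no length byte
            i += 1
            continue
        if i + 1 >= len(options):
            break  # truncated: tag without a length byte
        length = options[i + 1]
        data = options[i + 2:i + 2 + length]
        if len(data) < length:
            break  # truncated payload
        if tag == TAG_DOMAIN_NAME_SERVER:
            # Only whole 4-byte groups are treated as addresses.
            for off in range(0, length - length % 4, 4):
                servers.append(str(ipaddress.IPv4Address(data[off:off + 4])))
        i += 2 + length
    return servers


if __name__ == "__main__":
    reply = bytes([5, 4, 192, 0, 2, 1,
                   6, 8, 10, 0, 0, 1, 10, 0, 0, 2,
                   255])
    print(dns_servers_from_options(reply))
```

In the sample reply, the tag 5 address is walked over but never recorded, which mirrors the behaviour the commit message describes for responses before this change.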
    • ipconfig: Tidy up reporting of name servers · e18bdc83
      Chris Novakovic authored
      Commit 5e953778 ("ipconfig: add
      nameserver IPs to kernel-parameter ip=") adds the IP addresses of
      discovered name servers to the summary printed by ipconfig when
      configuration is complete. It appears the intention in ip_auto_config()
      was to print the name servers on a new line (especially given the
      spacing and lack of comma before "nameserver0="), but they're actually
      printed on the same line as the NFS root filesystem configuration
      summary:
      
        [    0.686186] IP-Config: Complete:
        [    0.686226]      device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1
        [    0.686328]      host=test, domain=example.com, nis-domain=(none)
        [    0.686386]      bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath=     nameserver0=10.0.0.1
      
      This makes it harder to read and parse ipconfig's output. Instead, print
      the name servers on a separate line:
      
        [    0.791250] IP-Config: Complete:
        [    0.791289]      device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1
        [    0.791407]      host=test, domain=example.com, nis-domain=(none)
        [    0.791475]      bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath=
        [    0.791476]      nameserver0=10.0.0.1
      Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipconfig: Document setting of NIS domain name · 660de409
      Chris Novakovic authored
      ip_auto_config_setup() is responsible for parsing the "ip=" and "nfsaddrs="
      kernel parameters. If a "." character is found in parameter 4 (the
      client's hostname), everything before the first "." is used as the
      hostname, and everything after it is used as the NIS domain name (but
      not necessarily the DNS domain name).
      
      Document this behaviour in Documentation/filesystems/nfs/nfsroot.txt,
      as it is not made explicit.
      Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
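
The splitting rule for parameter 4 can be sketched as follows. This only mirrors the behaviour described in the commit message (split at the first "."); the function name `split_hostname` is hypothetical, and returning None when no "." is present is a choice made for this sketch.

```python
def split_hostname(param):
    """Split parameter 4 into (hostname, NIS domain name).

    Everything before the first "." is the hostname; everything after it
    is the NIS domain name. Without a ".", no NIS domain name is derived.
    """
    host, _, nis_domain = param.partition(".")
    return host, (nis_domain or None)


if __name__ == "__main__":
    print(split_hostname("test.example.com"))
    print(split_hostname("test"))
```

Note that only the first "." is significant, so "test.example.com" yields the NIS domain name "example.com" and not "com".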
    • Merge branch 'rhash-cleanups' · 5cb5ce33
      David S. Miller authored
      NeilBrown says:
      
      ====================
      A few rhashtables cleanups
      
      2 patches fix documentation
      1 fixes a bug in rhashtable_walk_start()
      1 improves rhashtable_walk stability.
      
      All reviewed and Acked.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rhashtable: improve rhashtable_walk stability when stop/start used. · 5d240a89
      NeilBrown authored
      When a walk of an rhashtable is interrupted with rhashtable_walk_stop()
      and then rhashtable_walk_start(), the location to restart from is based
      on a 'skip' count in the current hash chain, and this can be incorrect
      if insertions or deletions have happened.  This does not happen when
      the walk is not stopped and started as iter->p is a placeholder which
      is safe to use while holding the RCU read lock.
      
      In rhashtable_walk_start() we can revalidate that 'p' is still in the
      same hash chain.  If it isn't then the current method is still used.
      
      With this patch, if a rhashtable walker ensures that the current
      object remains in the table over a stop/start period (possibly by
      elevating the reference count if that is sufficient), it can be sure
      that a walk will not miss objects that were in the hashtable for the
      whole time of the walk.
      
      rhashtable_walk_start() may not find the object even though it is
      still in the hashtable if a rehash has moved it to a new table.  In
      this case it will (eventually) get -EAGAIN and will need to proceed
      through the whole table again to be sure to see everything at least
      once.
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rhashtable: reset iter when rhashtable_walk_start sees new table · b41cc04b
      NeilBrown authored
      The documentation claims that when rhashtable_walk_start_check()
      detects a resize event, it will rewind back to the beginning
      of the table.  This is not true.  We need to set ->slot and
      ->skip to be zero for it to be true.
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rhashtable: Revise incorrect comment on r{hl, hash}table_walk_enter() · 82266e98
      NeilBrown authored
      Neither rhashtable_walk_enter() nor rhltable_walk_enter() sleeps, though
      they do take a spinlock without irq protection.
      So revise the comments to accurately state the contexts in which
      these functions can be called.
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rhashtable: remove outdated comments about grow_decision etc · 0c6f69a5
      NeilBrown authored
      grow_decision and shrink_decision no longer exist, so remove
      the remaining references to them.
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: md5: only call tp->af_specific->md5_lookup() for md5 sockets · 8c2320e8
      Eric Dumazet authored
      RETPOLINE made calls to tp->af_specific->md5_lookup() quite expensive,
      given they have no result.
      We can omit the calls for sockets that have no md5 keys.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "net: init sk_cookie for inet socket" · a06ac0d6
      Yafang Shao authored
      This reverts commit c6849a3a ("net: init sk_cookie for inet socket").
      
      Per discussion with Eric, updating sock_net(sk)->cookie_gen invalidates
      the whole cache line. As this cache line is shared by all CPUs, that
      may cause a significant performance hit.
      
      Below is the data from Eric:
      "Performance is reduced from ~5 Mpps to ~3.8 Mpps with 16 RX queues on
      my host" when running a synflood test.
      
      Revert it to prevent cache line false sharing.
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-DIM-tx' · 8399743a
      David S. Miller authored
      Tal Gilboa says:
      
      ====================
      Introduce adaptive TX interrupt moderation to net DIM
      
      Net DIM is a library designed for dynamic interrupt moderation. It was
      implemented and optimized with receive side interrupts in mind, since these
      are usually the CPU expensive ones. This patch-set introduces adaptive transmit
      interrupt moderation to net DIM, complete with a usage in the mlx5e driver.
      Using adaptive TX behavior reduces the interrupt rate in multiple scenarios.
      Furthermore, it is essential for increasing bandwidth in cases where payload
      aggregation is required.
      
      v3: Remove "inline" from functions in .c files (requested by DaveM). Revert
      adding the "enabled" field to struct net_dim and apply mlx5e structural
      suggestions (suggested by SaeedM).
      
      v2: Rebase over proper tree.
      
      v1: Fix compilation issues due to missed function renaming.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Enable adaptive-TX moderation · cbce4f44
      Tal Gilboa authored
      Add support for adaptive TX moderation. This greatly reduces the TX
      interrupt rate and increases bandwidth, mostly TCP bandwidth on ARM
      (see below). There is a slight degradation for single-stream TCP with
      very large message sizes (x86): in this case, any moderation on
      transmitted packets reduces bandwidth because the TCP output limit is hit.
      Since this is a synthetic case, the change is still worth doing.
      
      Performance improvement (ConnectX-4Lx 40GbE, ARM)
      TCP 64B bandwidth with 1-50 streams increased 6-35%.
      TCP 64B bandwidth with 100-500 streams increased 20-70%.
      
      Performance improvement (ConnectX-5 100GbE, x86)
      Bandwidth: increased up to 40% (1024B with 10s of streams).
      Interrupt rate: reduced up to 50% (1024B with 1000s of streams).
      
      Performance degradation (ConnectX-5 100GbE, x86)
      Bandwidth: up to 10% decrease for single-stream TCP (1MB message size,
      from 51Gb/s to 47Gb/s).
      Signed-off-by: Tal Gilboa <talgi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/dim: Support adaptive TX moderation · 623ad755
      Tal Gilboa authored
      Interrupt moderation for TX traffic requires different profiles than RX
      interrupt moderation. The main goal here is to reduce interrupt rate and
      allow better payload aggregation by keeping SKBs in the TX queue a bit
      longer. Ping-pong behavior would get a profile with a short timer, so
      latency wouldn't increase for these scenarios. There might be a slight
      degradation in bandwidth for single stream with large message sizes, since
      net.ipv4.tcp_limit_output_bytes is limiting the allowed TX traffic, but
      with many streams performance is always improved.
      Signed-off-by: Tal Gilboa <talgi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/dim: Rename *_get_profile() functions to *_get_rx_moderation() · 026a807c
      Tal Gilboa authored
      Preparation for introducing adaptive TX to net DIM.
      Signed-off-by: Tal Gilboa <talgi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vhost_net: use packet weight for rx handler, too · db688c24
      Paolo Abeni authored
      Similar to commit a2ac9990 ("vhost-net: set packet weight of
      tx polling to 2 * vq size"), we need a packet-based limit for
      handle_rx, too - otherwise, under rx flood with small packets,
      tx can be delayed for a very long time, even without busypolling.
      
      The pkt limit applied to handle_rx must be the same applied by
      handle_tx, or we will get unfair scheduling between rx and tx.
      Tying such limit to the queue length makes it less effective for
      large queue length values and can introduce large process
      scheduler latencies, so a constant value is used, like
      the existing bytes limit.
      
      The selected limit has been validated with PVP[1] performance
      test with different queue sizes:
      
      queue size		256	512	1024
      
      baseline		366	354	362
      weight 128		715	723	670
      weight 256		740	745	733
      weight 512		600	460	583
      weight 1024		423	427	418
      
      A packet weight of 256 gives peak performance in all the
      tested scenarios.
      
      No measurable regression in unidirectional performance tests has
      been detected.
      
      [1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
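
The idea of bounding one handler pass by both bytes and packets can be sketched in isolation. This is a simplified model, not the vhost_net code: the queue is a deque of packet sizes, the function name `drain_rx` is hypothetical, the packet weight of 256 is taken from the table above, and the byte weight of 0x80000 is an assumption made for the example.

```python
from collections import deque

BYTE_WEIGHT = 0x80000  # assumed value for the existing bytes limit
PKT_WEIGHT = 256       # packet weight that performed best in the table


def drain_rx(queue, byte_weight=BYTE_WEIGHT, pkt_weight=PKT_WEIGHT):
    """Process packets until the queue is empty or a limit is reached.

    Returns (packets, total_bytes) handled in this pass. Anything left in
    the queue waits for the next pass, giving the other direction a turn.
    """
    packets = 0
    total_bytes = 0
    while queue:
        total_bytes += queue.popleft()
        packets += 1
        if total_bytes >= byte_weight or packets >= pkt_weight:
            break
    return packets, total_bytes


if __name__ == "__main__":
    flood = deque([64] * 1000)  # many small packets
    print(drain_rx(flood), len(flood))
```

With 64-byte packets the byte limit alone would allow thousands of packets per pass, so the packet limit is what ends the pass early in a small-packet flood.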
    • net: fib_rules: fix l3mdev netlink attr processing · 9c20b937
      Roopa Prabhu authored
      Fixes: b16fb418 ("net: fib_rules: add extack support")
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: net: update .gitignore with missing test · b300fcf8
      Anders Roxell authored
      Fixes: 192dc405 ("selftests: net: add tcp_mmap program")
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dca: make function dca_common_get_tag static · 064223c1
      Colin Ian King authored
      Function dca_common_get_tag is local to the source and does not need to be
      in global scope, so make it static.
      
      Cleans up sparse warning:
      drivers/dca/dca-core.c:273:4: warning: symbol 'dca_common_get_tag' was
      not declared. Should it be static?
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 23 Apr, 2018 15 commits
  3. 21 Apr, 2018 1 commit