1. 30 Jun, 2019 1 commit
    • Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 11697cfc
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2019-06-28
      
      This series contains a smorgasbord of updates to many of the Intel
      drivers.
      
      Gustavo A. R. Silva updates the ice and iavf drivers to use the
      struct_size() helper where possible.
      
      Miguel increases the pause and refresh time for flow control in the
      e1000e driver during reset for certain devices.
      
      Dann Frazier fixes a potential NULL pointer dereference in ixgbe driver
      when using non-IPSec enabled devices.
      
      Colin Ian King fixes a potential overflow during a shift in the ixgbe
      driver.  Also fixes a potential NULL pointer dereference in the iavf
      driver by adding a check.
      
      Venkatesh Srinivas converts the e1000 driver to use dma_wmb() instead of
      wmb() for doorbell writes to avoid SFENCEs in the transmit and receive
      paths.
      
      Arjan updates the e1000e driver to improve boot time by over 100 msec by
      reducing the usleep ranges during system startup.
      
      Artem updates the igb driver's ethtool register dump: first he prepares
      the dump for future register additions, then he adds the RR2DCDELAY
      register to it.  When dealing with time-sensitive networks, this
      register is helpful in determining your latency from the device to the
      ring.
      
      Alex fixes the ixgbevf driver to use the current cached link state,
      rather than trying to re-check the value from the PF.
      
      Harshitha adds support for MACVLAN offloads in i40e by using channels as
      MACVLAN interfaces.
      
      Detlev Casanova updates the e1000e driver to use delayed work instead of
      timers to run the watchdog.
      
      Vitaly fixes an issue in e1000e, where when disconnecting and
      reconnecting the physical cable connection, the NIC enters a DMoff
      state.  This state causes a mismatch in link and duplexing, so check the
      PCIm function state and perform a PHY reset when in this state to
      resolve the issue.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 29 Jun, 2019 10 commits
  3. 28 Jun, 2019 29 commits
    • e1000e: PCIm function state support · def4ec6d
      Vitaly Lifshits authored
      Since commit 5d8682588605 ("[misc] mei: me: allow runtime
      pm for platform with D0i3"), disconnecting the cable and
      reconnecting it makes the NIC enter the DMoff state. This caused
      a wrong link indication and a duplex mismatch. This bug is described in:
      https://bugzilla.redhat.com/show_bug.cgi?id=1689436
      
      Checking PCIm function state and performing PHY reset after a
      timeout in watchdog task solves this issue.
      Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
      Acked-by: Sasha Neftin <sasha.neftin@intel.com>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • e1000e: Make watchdog use delayed work · 59653e64
      Detlev Casanova authored
      Use delayed work instead of timers to run the watchdog of the e1000e
      driver.
      
      Simplify the code with one less middle function.
      Signed-off-by: Detlev Casanova <detlev.casanova@gmail.com>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Add macvlan support on i40e · 1d8d80b4
      Harshitha Ramamurthy authored
      This patch enables macvlan offloads for i40e. The idea is to use
      channels as macvlan interfaces. The channels are VSIs of
      type VMDQ. When the first macvlan is created, the maximum number of
      channels possible are created. From then on, as a macvlan interface
      is created, a macvlan filter is added to these already created
      channels (VSIs).
      
      This patch utilizes subordinate device traffic classes to make queue
      groups(channels) available for an upper device like a macvlan.
      
      Steps to configure macvlan offloads:
      1. ethtool -K ethx l2-fwd-offload on
      2. ip link add link ethx name macvlan1 type macvlan
      3. ip addr add <address> dev macvlan1
      4. ip link set macvlan1 up
      Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • ixgbevf: Use cached link state instead of re-reading the value for ethtool · 1e1b0c65
      Alexander Duyck authored
      Change the ethtool link settings call to just read the cached state out of
      the adapter structure instead of trying to recheck the value from the PF.
      Doing this should prevent excessive reading of the mailbox.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: "Guilherme G. Piccoli" <gpiccoli@canonical.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • iavf: fix dereference of null rx_buffer pointer · 9fe06a51
      Colin Ian King authored
      A recent commit efa14c39 ("iavf: allow null RX descriptors") added
      a null pointer sanity check on rx_buffer, however, rx_buffer is being
      dereferenced before that check, which implies a null pointer dereference
      bug can potentially occur.  Fix this by dereferencing rx_buffer only
      after the null pointer check.
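
      The ordering described above can be sketched as follows. This is a
      minimal illustration, not the iavf code: struct rx_buf_sketch and
      take_page_bias() are hypothetical names chosen for the example.

```c
#include <stddef.h>

/* Hypothetical stand-in for a receive buffer; not the iavf structure. */
struct rx_buf_sketch {
    int pagecnt_bias;
};

/*
 * Returns the new bias, or -1 when no buffer was supplied.
 * The NULL test comes before any member access, so a missing
 * buffer can never be dereferenced.
 */
int take_page_bias(struct rx_buf_sketch *buf)
{
    if (buf == NULL)
        return -1;

    buf->pagecnt_bias--;
    return buf->pagecnt_bias;
}
```

      Moving the decrement above the NULL test would reintroduce the
      pattern that Coverity flags as "Dereference before null check".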
      
      Addresses-Coverity: ("Dereference before null check")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • igb: add RR2DCDELAY to ethtool registers dump · cd502a7f
      Artem Bityutskiy authored
      This patch adds the RR2DCDELAY register to the ethtool registers dump.
      RR2DCDELAY exists on I210 and I211 Intel Gigabit Ethernet chips and it stands
      for "Read Request To Data Completion Delay". Here is how this register is
      described in the I210 datasheet:
      
      "This field captures the maximum PCIe split time in 16 ns units, which is the
      maximum delay between the read request to the first data completion. This is
      giving an estimation of the PCIe round trip time."
      
      In other words, whenever I210 reads from the host memory (e.g., fetches a
      descriptor from the ring), the chip measures every PCI DMA read transaction and
      captures the maximum value. So it ends up containing the longest DMA
      transaction time.
      
      This register is very useful for troubleshooting and research purposes. If you
      are dealing with time-sensitive networks, this register can help you get
      an idea of your "I210-to-ring" latency. This helps answer questions like
      "should I have PCIe ASPM enabled?" or "should I enable deep C-states?" on
      my system.
      
      It is safe to read this register at any point; reading it has no effect on
      the I210 chip functionality.
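
      As a worked example of the 16 ns unit quoted from the datasheet, the
      sketch below converts a field value to nanoseconds. It assumes the
      field has already been extracted from the register dump; the helper
      name and the constant are illustrative and not part of the igb driver.

```c
#include <stdint.h>

/* The datasheet quote gives the unit: one count is 16 ns. */
#define RR2DCDELAY_UNIT_NS 16u

/* Convert an already-extracted RR2DCDELAY field value to nanoseconds. */
uint64_t rr2dcdelay_to_ns(uint32_t field_value)
{
    /* Widen before multiplying so large counts cannot wrap in 32 bits. */
    return (uint64_t)field_value * RR2DCDELAY_UNIT_NS;
}
```

      For instance, a field value of 125 corresponds to 125 * 16 = 2000 ns,
      i.e. a worst-case read round trip of 2 microseconds.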
      Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • igb: minor ethool regdump amendment · 9379b399
      Artem Bityutskiy authored
      This patch has no functional impact and it is just a preparation
      for the following patch. It removes an early return from the
      'igb_get_regs()' function by moving the 82576-only registers
      dump into an "if" block. With this preparation, we can dump more
      non-82576 registers at the end of this function.
      Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • iavf: Fix up debug print macro · 75051ce4
      Jeff Kirsher authored
      This aligns the iavf_debug() macro with the other Intel drivers.
      
      Add the bus number, bus_id field to i40e_bus_info so output shows
      each physical port(i.e func) in following format:
        [[[[<domain>]:]<bus>]:][<slot>][.[<func>]]
      Domains are numbered 0 to ffff, buses 0-ff, slots 0-1f and
      functions 0-7.
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
    • e1000e: Reduce boot time by tightening sleep ranges · ab6973ae
      Arjan van de Ven authored
      The e1000e driver is a great user of the usleep_range() API,
      and has nice ranges that in principle help power management.
      
      However the ranges that are used only during system startup are
      very long (and can add easily 100 msec to the boot time) while
      the power savings of such long ranges is irrelevant due to the
      one-off, boot only, nature of these functions.
      
      This patch shrinks some of the longest ranges
      (while still using a power-friendly 1 msec range); this saves
      100msec+ of boot time on my BDW NUCs.
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • iavf: use struct_size() helper · af07adbb
      Gustavo A. R. Silva authored
      Make use of the struct_size() helper instead of an open-coded version
      in order to avoid any potential type mistakes, in particular in the
      context in which this code is being used.
      
      So, replace code of the following form:
      
      sizeof(struct virtchnl_ether_addr_list) + (count * sizeof(struct virtchnl_ether_addr))
      
      with:
      
      struct_size(veal, list, count)
      
      and so on...
      
      This code was detected with the help of Coccinelle.
      Signed-off-by: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • e1000: Use dma_wmb() instead of wmb() before doorbell writes · 583cf7be
      Venkatesh Srinivas authored
      e1000 writes to doorbells to post transmit descriptors and fill the
      receive ring. After writing descriptors to memory but before
      writing to doorbells, use dma_wmb() rather than wmb(). wmb() is more
      heavyweight than necessary for a device to see descriptor writes.
      
      On x86, this avoids SFENCEs before doorbell writes in both the
      Tx and Rx paths. On ARM, this converts DSB ST -> DMB OSHST.
      
      Tested: 82576EB / x86; QEMU (qemu emulates an 8257x)
      Signed-off-by: Venkatesh Srinivas <venkateshs@google.com>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • ixgbe: fix potential u32 overflow on shift · b97c0b52
      Colin Ian King authored
      The u32 variable rem is being shifted using u32 arithmetic however
      it is being passed to div_u64 that expects the expression to be a u64.
      The 32 bit shift may potentially overflow, so cast rem to a u64 before
      shifting to avoid this.  Also remove comment about overflow.
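
      The difference between shifting before and after widening can be shown
      in a small userspace sketch. The function names are illustrative, the
      code assumes a platform where int is 32 bits wide and a shift count
      below 32, and it is not taken from the ixgbe driver.

```c
#include <stdint.h>

/* Shift performed in 32-bit arithmetic: high bits are lost before widening. */
uint64_t shift_as_u32(uint32_t rem, unsigned int shift)
{
    return rem << shift;
}

/* Cast first, so the shift itself is carried out in 64 bits. */
uint64_t shift_as_u64(uint32_t rem, unsigned int shift)
{
    return (uint64_t)rem << shift;
}
```

      With rem = 0x80000001 and a shift of 4, the first helper yields 0x10
      because the top bit is discarded, while the second yields 0x800000010.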
      
      Addresses-Coverity: ("Unintentional integer overflow")
      Fixes: cd458320 ("ixgbe: implement support for SDP/PPS output on X550 hardware")
      Fixes: 68d9676f ("ixgbe: fix PTP SDP pin setup on X540 hardware")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • ixgbe: Avoid NULL pointer dereference with VF on non-IPsec hw · 92924064
      Dann Frazier authored
      An ipsec structure will not be allocated if the hardware does not support
      offload. Fixes the following Oops:
      
      [  191.045452] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
      [  191.054232] Mem abort info:
      [  191.057014]   ESR = 0x96000004
      [  191.060057]   Exception class = DABT (current EL), IL = 32 bits
      [  191.065963]   SET = 0, FnV = 0
      [  191.069004]   EA = 0, S1PTW = 0
      [  191.072132] Data abort info:
      [  191.074999]   ISV = 0, ISS = 0x00000004
      [  191.078822]   CM = 0, WnR = 0
      [  191.081780] user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000043d9e467
      [  191.088382] [0000000000000000] pgd=0000000000000000
      [  191.093252] Internal error: Oops: 96000004 [#1] SMP
      [  191.098119] Modules linked in: vhost_net vhost tap vfio_pci vfio_virqfd vfio_iommu_type1 vfio xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter devlink ebtables ip6table_filter ip6_tables iptable_filter bpfilter ipmi_ssif nls_iso8859_1 input_leds joydev ipmi_si hns_roce_hw_v2 ipmi_devintf hns_roce ipmi_msghandler cppc_cpufreq sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 ses enclosure btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c raid1 raid0 multipath linear ixgbevf hibmc_drm ttm
      [  191.168607]  drm_kms_helper aes_ce_blk aes_ce_cipher syscopyarea crct10dif_ce sysfillrect ghash_ce qla2xxx sysimgblt sha2_ce sha256_arm64 hisi_sas_v3_hw fb_sys_fops sha1_ce uas nvme_fc mpt3sas ixgbe drm hisi_sas_main nvme_fabrics usb_storage hclge scsi_transport_fc ahci libsas hnae3 raid_class libahci xfrm_algo scsi_transport_sas mdio aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
      [  191.202952] CPU: 94 PID: 0 Comm: swapper/94 Not tainted 4.19.0-rc1+ #11
      [  191.209553] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.20.01 04/26/2019
      [  191.218064] pstate: 20400089 (nzCv daIf +PAN -UAO)
      [  191.222873] pc : ixgbe_ipsec_vf_clear+0x60/0xd0 [ixgbe]
      [  191.228093] lr : ixgbe_msg_task+0x2d0/0x1088 [ixgbe]
      [  191.233044] sp : ffff000009b3bcd0
      [  191.236346] x29: ffff000009b3bcd0 x28: 0000000000000000
      [  191.241647] x27: ffff000009628000 x26: 0000000000000000
      [  191.246946] x25: ffff803f652d7600 x24: 0000000000000004
      [  191.252246] x23: ffff803f6a718900 x22: 0000000000000000
      [  191.257546] x21: 0000000000000000 x20: 0000000000000000
      [  191.262845] x19: 0000000000000000 x18: 0000000000000000
      [  191.268144] x17: 0000000000000000 x16: 0000000000000000
      [  191.273443] x15: 0000000000000000 x14: 0000000100000026
      [  191.278742] x13: 0000000100000025 x12: ffff8a5f7fbe0df0
      [  191.284042] x11: 000000010000000b x10: 0000000000000040
      [  191.289341] x9 : 0000000000001100 x8 : ffff803f6a824fd8
      [  191.294640] x7 : ffff803f6a825098 x6 : 0000000000000001
      [  191.299939] x5 : ffff000000f0ffc0 x4 : 0000000000000000
      [  191.305238] x3 : ffff000028c00000 x2 : ffff803f652d7600
      [  191.310538] x1 : 0000000000000000 x0 : ffff000000f205f0
      [  191.315838] Process swapper/94 (pid: 0, stack limit = 0x00000000addfed5a)
      [  191.322613] Call trace:
      [  191.325055]  ixgbe_ipsec_vf_clear+0x60/0xd0 [ixgbe]
      [  191.329927]  ixgbe_msg_task+0x2d0/0x1088 [ixgbe]
      [  191.334536]  ixgbe_msix_other+0x274/0x330 [ixgbe]
      [  191.339233]  __handle_irq_event_percpu+0x78/0x270
      [  191.343924]  handle_irq_event_percpu+0x40/0x98
      [  191.348355]  handle_irq_event+0x50/0xa8
      [  191.352180]  handle_fasteoi_irq+0xbc/0x148
      [  191.356263]  generic_handle_irq+0x34/0x50
      [  191.360259]  __handle_domain_irq+0x68/0xc0
      [  191.364343]  gic_handle_irq+0x84/0x180
      [  191.368079]  el1_irq+0xe8/0x180
      [  191.371208]  arch_cpu_idle+0x30/0x1a8
      [  191.374860]  do_idle+0x1dc/0x2a0
      [  191.378077]  cpu_startup_entry+0x2c/0x30
      [  191.381988]  secondary_start_kernel+0x150/0x1e0
      [  191.386506] Code: 6b15003f 54000320 f1404a9f 54000060 (79400260)
      
      Fixes: eda0333a ("ixgbe: add VF IPsec management")
      Signed-off-by: Dann Frazier <dann.frazier@canonical.com>
      Acked-by: Shannon Nelson <snelson@pensando.io>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • Miguel Bernal Marin's avatar
    • ice: Use struct_size() helper · 89f6a305
      Gustavo A. R. Silva authored
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct foo {
          int stuff;
          struct boo entry[];
      };
      
      size = sizeof(struct foo) + count * sizeof(struct boo);
      instance = alloc(size, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
      size = struct_size(instance, entry, count);
      
      This code was detected with the help of Coccinelle.
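
      A userspace analogue of the calculation may help illustrate the idea.
      The sketch below is an assumption-laden approximation, not the kernel
      macro: flex_size_sketch() and STRUCT_SIZE_SKETCH() are hypothetical
      names, and struct boo is given a single illustrative member. It
      computes the header size plus count elements and saturates at SIZE_MAX
      instead of wrapping around.

```c
#include <stddef.h>
#include <stdint.h>

struct boo {
    int value;
};

struct foo {
    int stuff;
    struct boo entry[];   /* flexible array member */
};

/* header + count * elem, saturating at SIZE_MAX instead of wrapping. */
size_t flex_size_sketch(size_t header, size_t elem, size_t count)
{
    if (elem != 0 && count > (SIZE_MAX - header) / elem)
        return SIZE_MAX;
    return header + count * elem;
}

/* Sizes come from the pointer's type; sizeof does not evaluate p. */
#define STRUCT_SIZE_SKETCH(p, member, count) \
    flex_size_sketch(sizeof(*(p)), sizeof((p)->member[0]), (count))
```

      Because the element type is derived from the pointer itself, the
      caller cannot accidentally multiply by the size of the wrong type, and
      an oversized count produces a request that an allocator will refuse
      rather than a silently undersized buffer.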
      Signed-off-by: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • Merge branch 'net-sched-Add-txtime-assist-support-for-taprio' · 0a7960c7
      David S. Miller authored
      Vedang Patel says:
      
      ====================
      net/sched: Add txtime-assist support for taprio.
      
      Changes in v6:
      - Use _BITUL() instead of BIT() in UAPI for etf. (patch #1)
      - Fix a bug reported by kbuild test bot in length_to_duration(). (patch #6)
      - Remove an unused function (get_cycle_start()). (Patch #6)
      
      Changes in v5:
      - Commit message improved for the igb patch (patch #1).
      - Fixed typo in commit message for etf patch (patch #2).
      
      Changes in v4:
      - Remove inline directive from functions in foo.c.
      - Fix spacing in pkt_sched.h (for etf patch).
      
      Changes in v3:
      - Simplify implementation for taprio flags.
      - txtime_delay can only be set if txtime-assist mode is enabled.
      - txtime_delay and flags will only be visible in tc output if set by user.
      - Minor changes in error reporting.
      
      Changes in v2:
      - Txtime-offload has now been renamed to txtime-assist mode.
      - Renamed the offload parameter to flags.
      - Removed the code which introduced the hardware offloading functionality.
      
      Original Cover letter (with above changes included)
      --------------------------------------------------
      
      Currently, we are seeing packets being transmitted outside their
      timeslices. We can confirm that the packets are being dequeued at the right
      time. So, the delay is induced after the packet is dequeued, because
      taprio, without any offloading, has no control of when a packet is actually
      transmitted.
      
      In order to solve this, we are making use of the txtime feature provided by
      ETF qdisc. Hardware offloading needs to be supported by the ETF qdisc in
      order to take advantage of this feature. The taprio qdisc will assign
      txtime (in skb->tstamp) for all the packets which do not have the txtime
      allocated via the SO_TXTIME socket option. For the packets which already
      have SO_TXTIME set, taprio will validate whether the packet will be
      transmitted in the correct interval.
      
      In order to support this, the following parameters have been added:
      - flags (taprio): This is added in order to support different offloading
        modes which will be added in the future.
      - txtime-delay (taprio): This indicates the minimum time it will take for
        the packet to hit the wire after it reaches taprio_enqueue(). This is
        useful in determining whether we can transmit the packet in the remaining
        time if the gate corresponding to the packet is currently open.
      - skip_sock_check (ETF): ETF currently drops any packet which does not have
        the SO_TXTIME socket option set. This check can be skipped by specifying
        this option.
      
      Following is an example configuration:
      
      tc qdisc replace dev $IFACE parent root handle 100 taprio \
          num_tc 3 \
          map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
          queues 1@0 1@0 1@0 \
          base-time $BASE_TIME \
          sched-entry S 01 300000 \
          sched-entry S 02 300000 \
          sched-entry S 04 400000 \
          flags 0x1 \
          txtime-delay 200000 \
          clockid CLOCK_TAI
      
      tc qdisc replace dev $IFACE parent 100:1 etf \
          offload delta 200000 clockid CLOCK_TAI skip_sock_check
      
      Here, the "flags" parameter is indicating that the txtime-assist mode is
      enabled. Also, all the traffic classes have been assigned the same queue.
      This is to prevent the traffic classes in the lower priority queues from
      getting starved. Note that this configuration is specific to the i210
      ethernet card. Other network cards where the hardware queues are given the
      same priority, might be able to utilize more than one queue.
      
      Following are some of the other highlights of the series:
      - Fix a bug where hardware timestamping and SO_TXTIME options cannot be
        used together. (Patch 1)
      - Introduces the skip_sock_check option.  (Patch 2)
      - Make TxTime assist mode work with TCP packets (Patch 7).
      
      The following changes are recommended to be done in order to get the best
      performance from taprio in this mode:
      ip link set dev enp1s0 mtu 1514
      ethtool -K eth0 gso off
      ethtool -K eth0 tso off
      ethtool --set-eee eth0 eee off
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • taprio: Adjust timestamps for TCP packets · 54002066
      Vedang Patel authored
      When the taprio qdisc is running in "txtime-assist" mode, it will
      set the launchtime value (in skb->tstamp) for all the packets which do
      not have the SO_TXTIME socket option. But, the TCP packets already have
      this value set and it indicates the earliest departure time represented
      in CLOCK_MONOTONIC clock.
      
      We need to respect the timestamp set by the TCP subsystem. So, convert
      this time to the clock which taprio is using and ensure that the packet
      is not transmitted before the deadline set by TCP.
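
      The rule described above can be sketched in two steps: translate the
      TCP timestamp into taprio's clock, then take the later of the two
      candidate times. This is a simplified illustration; the function names
      and the single offset_ns parameter (the assumed difference between
      taprio's clock and CLOCK_MONOTONIC) are hypothetical and do not come
      from the taprio source.

```c
#include <stdint.h>

/* Translate a CLOCK_MONOTONIC timestamp into the qdisc's clock domain. */
int64_t mono_to_qdisc_clock(int64_t mono_ns, int64_t offset_ns)
{
    return mono_ns + offset_ns;
}

/* Never launch earlier than the departure time TCP asked for. */
int64_t respect_tcp_edt(int64_t taprio_txtime_ns, int64_t tcp_edt_mono_ns,
                        int64_t offset_ns)
{
    int64_t edt_ns = mono_to_qdisc_clock(tcp_edt_mono_ns, offset_ns);

    return taprio_txtime_ns > edt_ns ? taprio_txtime_ns : edt_ns;
}
```

      If taprio's own choice is already later than the converted TCP
      timestamp it is kept; otherwise the packet is held back until the
      time TCP requested.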
      Signed-off-by: Vedang Patel <vedang.patel@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • taprio: make clock reference conversions easier · 7ede7b03
      Vedang Patel authored
      Later in this series we will need to transform from
      CLOCK_MONOTONIC (used in TCP) to the clock reference used in TAPRIO.
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: Vedang Patel <vedang.patel@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • taprio: Add support for txtime-assist mode · 4cfd5779
      Vedang Patel authored
      Currently, we are seeing non-critical packets being transmitted outside of
      their timeslice. We can confirm that the packets are being dequeued at the
      right time. So, the delay is induced in the hardware side.  The most likely
      reason is the hardware queues are starving the lower priority queues.
      
      In order to improve the performance of taprio, we will be making use of the
      txtime feature provided by the ETF qdisc. For all the packets which do not
      have the SO_TXTIME option set, taprio will set the transmit timestamp (set
      in skb->tstamp) in this mode. TAPrio Qdisc will ensure that the transmit
      time for the packet is set to when the gate is open. If SO_TXTIME is set,
      the TAPrio qdisc will validate whether the timestamp (in skb->tstamp)
      occurs when the gate corresponding to skb's traffic class is open.
      
      The following two parameters are added to support this mode:
      - flags: used to enable txtime-assist mode. Will also be used to enable
        other modes (like hardware offloading) later.
      - txtime-delay: This indicates the minimum time it will take for the packet
        to hit the wire. This is useful in determining whether we can transmit
        the packet in the remaining time if the gate corresponding to the packet is
        currently open.
      
      An example configuration for enabling txtime-assist:
      
      tc qdisc replace dev eth0 parent root handle 100 taprio \
            num_tc 3 \
            map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
            queues 1@0 1@0 1@0 \
            base-time 1558653424279842568 \
            sched-entry S 01 300000 \
            sched-entry S 02 300000 \
            sched-entry S 04 400000 \
            flags 0x1 \
            txtime-delay 40000 \
            clockid CLOCK_TAI
      
      tc qdisc replace dev $IFACE parent 100:1 etf skip_sock_check \
            offload delta 200000 clockid CLOCK_TAI
      
      Note that all the traffic classes are mapped to the same queue.  This is
      only possible in taprio when txtime-assist is enabled. Also, note that the
      ETF Qdisc is enabled with offload mode set.
      
      In this mode, if the packet's traffic class is open and the complete packet
      can be transmitted, taprio will try to transmit the packet immediately.
      This will be done by setting skb->tstamp to current_time + the time delta
      indicated in the txtime-delay parameter. This parameter indicates the time
      taken (in software) for packet to reach the network adapter.
      
      If the packet cannot be transmitted in the current interval or if the
      packet's traffic class is not currently transmitting, the skb->tstamp is
      set to the next available timestamp value. This is tracked in the
      next_launchtime parameter in the struct sched_entry.
      
      The behaviour w.r.t admin and oper schedules is not changed from what is
      present in software mode.
      
      The transmit time is already known in advance. So, we do not need the HR
      timers to advance the schedule and wakeup the dequeue side of taprio.  So,
      HR timer won't be run when this mode is enabled.
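
      The launch-time decision described above can be approximated by the
      following sketch. It is a simplification under stated assumptions:
      struct entry_sketch, its fields and pick_txtime() are hypothetical
      names, and the real qdisc tracks considerably more state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of one schedule entry for a packet's traffic class. */
struct entry_sketch {
    bool    gate_open;          /* is the class's gate open right now?   */
    int64_t close_time_ns;      /* when the current interval ends        */
    int64_t next_launchtime_ns; /* next slot in which the class may send */
};

int64_t pick_txtime(const struct entry_sketch *e, int64_t now_ns,
                    int64_t txtime_delay_ns, int64_t pkt_duration_ns)
{
    int64_t earliest_ns = now_ns + txtime_delay_ns;

    /* Gate open and the whole packet fits: send as soon as possible. */
    if (e->gate_open && earliest_ns + pkt_duration_ns <= e->close_time_ns)
        return earliest_ns;

    /* Otherwise defer to the next available launch time. */
    return e->next_launchtime_ns;
}
```

      The returned value plays the role of skb->tstamp: either the current
      time plus txtime-delay, or the next launch time recorded for the
      entry.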
      Signed-off-by: Vedang Patel <vedang.patel@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • taprio: Remove inline directive · 566af331
      Vedang Patel authored
      Remove inline directive from length_to_duration(). We will let the compiler
      make the decisions.
      Signed-off-by: Vedang Patel <vedang.patel@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • taprio: calculate cycle_time when schedule is installed · 037be037
      Vedang Patel authored
      cycle time for a particular schedule is calculated only when it is first
      installed. So, it makes sense to just calculate it once right after the
      'cycle_time' parameter has been parsed and store it in cycle_time.
      Signed-off-by: Vedang Patel <vedang.patel@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • etf: Add skip_sock_check · d14d2b20
      Vedang Patel authored
      Currently, etf expects a socket with SO_TXTIME option set for each packet
      it encounters. So, it will drop all other packets. But, in the future
      commits we are planning to add functionality where tstamp value will be set
      by another qdisc. Also, some packets which are generated from within the
      kernel (e.g. ICMP packets) do not have any socket associated with them.
      
      So, this commit adds support for skip_sock_check. When this option is set,
      etf will skip checking for a socket and other associated options for all
      skbs.
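
      The effect of the option can be modelled with a small predicate. This
      sketch covers only the socket check described above; struct pkt_sketch
      and etf_sock_check_passes() are hypothetical names, and any other
      validation etf performs on the timestamp is deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical summary of what the socket check looks at. */
struct pkt_sketch {
    bool has_socket;     /* false for e.g. kernel-generated ICMP packets */
    bool so_txtime_set;  /* socket enabled SO_TXTIME                     */
};

bool etf_sock_check_passes(const struct pkt_sketch *p, bool skip_sock_check)
{
    /* With the option set, the socket is not inspected at all. */
    if (skip_sock_check)
        return true;

    return p->has_socket && p->so_txtime_set;
}
```

      Without the option, a packet with no socket is rejected; with it, the
      same packet passes this stage and can carry a timestamp assigned by
      another qdisc.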
      Signed-off-by: Vedang Patel <vedang.patel@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • etf: Don't use BIT() in UAPI headers. · 9903c8dc
      Vedang Patel authored
      The BIT() macro isn't exported as part of the UAPI interface. So, the
      compile-test to ensure they are self contained fails. So, use _BITUL()
      instead.
      Signed-off-by: Vedang Patel <vedang.patel@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • igb: clear out skb->tstamp after reading the txtime · 1e08511d
      Vedang Patel authored
      If a packet which is utilizing the launchtime feature (via SO_TXTIME socket
      option) also requests the hardware transmit timestamp, the hardware
      timestamp is not delivered to the userspace. This is because the value in
      skb->tstamp is mistaken as the software timestamp.
      
      Applications, like ptp4l, request a hardware timestamp by setting the
      SOF_TIMESTAMPING_TX_HARDWARE socket option. Whenever a new timestamp is
      detected by the driver (this work is done in igb_ptp_tx_work() which calls
      igb_ptp_tx_hwtstamps() in igb_ptp.c[1]), it will queue the timestamp in the
      ERR_QUEUE for the userspace to read. When the userspace is ready, it will
      issue a recvmsg() call to collect this timestamp.  The problem is in this
      recvmsg() call. If the skb->tstamp is not cleared out, it will be
      interpreted as a software timestamp and the hardware tx timestamp will not
      be successfully sent to the userspace. Look at skb_is_swtx_tstamp() and its
      caller __sock_recv_timestamp() in net/socket.c for more details.
      Signed-off-by: Vedang Patel <vedang.patel@intel.com>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1e08511d
    • David S. Miller's avatar
      Merge branch 'mirred-recurse' · 8747d82d
      David S. Miller authored
      John Hurley says:
      
      ====================
      Track recursive calls in TC act_mirred
      
      These patches aim to prevent act_mirred causing stack overflow events from
      recursively calling packet xmit or receive functions. Such events can
      occur with poor TC configuration that causes packets to travel in loops
      within the system.
      
      Florian Westphal advises that a recursion crash and packets looping are
      separate issues and should be treated as such. David Miller further points
      out that pcpu counters cannot track the precise skb context required to
      detect loops. Hence these patches are not aimed at detecting packet loops
      but at preventing stack overflows arising from such loops.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8747d82d
    • John Hurley's avatar
      net: sched: protect against stack overflow in TC act_mirred · e2ca070f
      John Hurley authored
      TC hooks allow the application of filters and actions to packets at both
      ingress and egress of the network stack. It is possible, with poor
      configuration, that this can produce loops whereby an ingress hook calls
      a mirred egress action that has an egress hook that redirects back to
      the first ingress etc. The TC core classifier protects against loops when
      doing reclassifies but there is no protection against a packet looping
      between multiple hooks and recursively calling act_mirred. This can lead
      to stack overflow panics.
      
      Add a per-CPU counter to act_mirred that is incremented for each recursive
      call of the action function when processing a packet. If the limit is
      exceeded, the packet is dropped and the CPU counter is reset.
      
      Note that this patch does not protect against loops in TC datapaths. Its
      aim is to prevent stack overflow kernel panics that can be a consequence
      of such loops.
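
      A userspace sketch of the idea, under stated assumptions: thread-local
      storage stands in for a per-CPU variable, the limit of 4 is an assumed
      value, and every name is hypothetical. The counter here is decremented
      on each exit, so it returns to zero once the call chain unwinds.

```c
#define MODEL_RECURSION_LIMIT 4   /* assumed value, for illustration */

enum model_verdict { MODEL_OK, MODEL_DROP };

/* Thread-local storage stands in for a per-CPU variable here. */
static _Thread_local unsigned int model_rec_level;

/*
 * Model of an action that redirects the packet back into itself
 * `hops` more times, as a misconfigured hook loop would.
 */
static enum model_verdict model_mirred_act(unsigned int hops)
{
	enum model_verdict ret = MODEL_OK;

	/* Count this entry; refuse to go deeper once past the limit. */
	if (++model_rec_level > MODEL_RECURSION_LIMIT) {
		model_rec_level--;
		return MODEL_DROP;
	}

	if (hops > 0)
		ret = model_mirred_act(hops - 1);

	/* Undo this entry's increment on the way out. */
	model_rec_level--;
	return ret;
}
```

      However long the configured loop is, the nesting depth in this model
      never exceeds the limit, which bounds stack usage. The loop itself is
      not detected, only cut short.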
      Signed-off-by: John Hurley <john.hurley@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e2ca070f
    • John Hurley's avatar
      net: sched: refactor reinsert action · 720f22fe
      John Hurley authored
      The TC_ACT_REINSERT return type was added as an in-kernel only option to
      allow a packet ingress or egress redirect. This is used to avoid
      unnecessary skb clones in situations where they are not required. If a TC
      hook returns this code then the packet is 'reinserted' and no skb consume
      is carried out as no clone took place.
      
      This return type is only used in act_mirred. Rather than have the reinsert
      called from the main datapath, call it directly in act_mirred. Instead of
      returning TC_ACT_REINSERT, change the type to the new TC_ACT_CONSUMED
      which tells the caller that the packet has been stolen by another process
      and that no consume call is required.
      
      Moving all redirect calls to the act_mirred code is in preparation for
      tracking recursion created by act_mirred.
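
      The ownership rule can be sketched as follows. This is a simplified
      model with hypothetical names, not the kernel datapath: the action
      performs the reinsert itself and the caller only needs to know that it
      no longer owns the packet.

```c
#include <stdbool.h>

/* Hypothetical verdicts; the real TC_ACT_* values are not reproduced here. */
enum model_verdict {
	MODEL_ACT_OK,
	MODEL_ACT_SHOT,
	MODEL_ACT_CONSUMED,   /* models TC_ACT_CONSUMED */
};

struct model_skb {
	bool reinserted;   /* handed on by the action itself */
	bool freed;        /* freed by the caller */
};

/* The action performs the reinsert itself, then reports CONSUMED. */
static enum model_verdict model_mirred_redirect(struct model_skb *skb)
{
	skb->reinserted = true;
	return MODEL_ACT_CONSUMED;
}

/* Returns true if the caller still owns the skb and should keep going. */
static bool model_datapath_handle(struct model_skb *skb, enum model_verdict v)
{
	switch (v) {
	case MODEL_ACT_CONSUMED:
		return false;          /* skb is gone: no consume call needed */
	case MODEL_ACT_SHOT:
		skb->freed = true;     /* caller drops the packet */
		return false;
	default:
		return true;           /* continue normal processing */
	}
}
```

      Keeping the reinsert inside the action means the datapath caller has a
      single, generic case to handle, and all redirects pass through one
      place where recursion can later be counted.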
      Signed-off-by: John Hurley <john.hurley@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      720f22fe
    • Christian Brauner's avatar
      ipv4: enable route flushing in network namespaces · 5cdda5f1
      Christian Brauner authored
      Tools such as vpnc try to flush routes when run inside network
      namespaces by writing 1 into /proc/sys/net/ipv4/route/flush. This
      currently does not work because flush is not enabled in non-initial
      network namespaces.
      Since routes are per network namespace it is safe to enable
      /proc/sys/net/ipv4/route/flush in there.
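
      For reference, the write that such tools perform can be sketched as
      below. The helper name is hypothetical, and the path is a parameter so
      the function can be tried on an ordinary file.

```c
#include <stdio.h>

/*
 * Write "1" to a sysctl-style file. Returns 0 on success, -1 on failure.
 * The path is a parameter so the helper can be tried on an ordinary file.
 */
static int write_flush_trigger(const char *path)
{
	FILE *f = fopen(path, "w");

	if (f == NULL)
		return -1;
	if (fputs("1\n", f) == EOF) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes the buffer, so a failed write shows up here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

      Inside a network namespace, a caller would pass
      "/proc/sys/net/ipv4/route/flush" and would typically need sufficient
      privileges in that namespace. Before this change the file is not
      available there, so the helper would return -1.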
      
      Link: https://github.com/lxc/lxd/issues/4257
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5cdda5f1
    • David S. Miller's avatar
      Merge tag 'batadv-next-for-davem-20190627v2' of git://git.open-mesh.org/linux-merge · 65dc5416
      David S. Miller authored
      Simon Wunderlich says:
      
      ====================
      This feature/cleanup patchset includes the following patches:
      
       - bump version strings, by Simon Wunderlich
      
       - fix includes for _MAX constants, atomic functions and fwdecls,
         by Sven Eckelmann (3 patches)
      
       - shorten multicast tt/tvlv worker spinlock section, by Linus Luessing
      
       - routable multicast preparations: implement MAC multicast filtering,
         by Linus Luessing (2 patches, David Miller's comments integrated)
      
       - remove return value checks for debugfs_create, by Greg Kroah-Hartman
      
       - add routable multicast optimizations, by Linus Luessing (2 patches)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      65dc5416