1. 01 May, 2020 21 commits
    • net: phy: DP83TC811: Fix WoL in config init to be disabled · 6c599044
      Dan Murphy authored
      The WoL feature should be disabled when config_init is called and the
      feature should be turned on or off when set_wol is called.
      
      In addition, update the calls that modify the registers to use the
      set_bit and clear_bit function calls.
      
      Fixes: 6d749428788b ("net: phy: DP83TC811: Introduce support for the
      DP83TC811 phy")
      Signed-off-by: Dan Murphy <dmurphy@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c599044
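The behaviour described above (WoL off after config_init, toggled only by set_wol, with register updates done through set/clear-bit helpers) can be sketched in user space as follows. This is an illustrative model only: the register layout, the bit position, and every `fake_`/`reg_` name are assumptions, not the real DP83TC811 definitions or the kernel PHY API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit position, not the real DP83TC811 register layout. */
#define FAKE_WOL_EN (1u << 7)

/* Stand-in for the PHY: a single 16-bit WoL configuration register. */
struct fake_phy {
    uint16_t wol_cfg;
};

/* Read-modify-write helpers that touch only the bits in 'mask'. */
static void reg_set_bits(uint16_t *reg, uint16_t mask)
{
    *reg |= mask;
}

static void reg_clear_bits(uint16_t *reg, uint16_t mask)
{
    *reg &= (uint16_t)~mask;
}

/* config_init always leaves WoL disabled, whatever state it found. */
static void fake_config_init(struct fake_phy *phy)
{
    reg_clear_bits(&phy->wol_cfg, FAKE_WOL_EN);
}

/* set_wol is the only place that turns the feature on or off. */
static void fake_set_wol(struct fake_phy *phy, bool enable)
{
    if (enable)
        reg_set_bits(&phy->wol_cfg, FAKE_WOL_EN);
    else
        reg_clear_bits(&phy->wol_cfg, FAKE_WOL_EN);
}
```

Using mask-based helpers instead of writing whole register values keeps unrelated bits in the register untouched.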
    • net: phy: DP83822: Fix WoL in config init to be disabled · 600ac36b
      Dan Murphy authored
      The WoL feature should be disabled when config_init is called and the
      feature should be turned on or off when set_wol is called.
      
      In addition, update the calls that modify the registers to use the
      set_bit and clear_bit function calls.
      
      Fixes: 3b427751a9d0 ("net: phy: DP83822 initial driver submission")
      Signed-off-by: Dan Murphy <dmurphy@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      600ac36b
    • ipv6: Use global sernum for dst validation with nexthop objects · 8f34e53b
      David Ahern authored
      Nik reported a bug with pcpu dst cache when nexthop objects are
      used illustrated by the following:
          $ ip netns add foo
          $ ip -netns foo li set lo up
          $ ip -netns foo addr add 2001:db8:11::1/128 dev lo
          $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1
          $ ip li add veth1 type veth peer name veth2
          $ ip li set veth1 up
          $ ip addr add 2001:db8:10::1/64 dev veth1
          $ ip li set dev veth2 netns foo
          $ ip -netns foo li set veth2 up
          $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2
          $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1
          $ ip -6 route add 2001:db8:11::1/128 nhid 100
      
          Create a pcpu entry on cpu 0:
          $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
      
          Re-add the route entry:
          $ ip -6 ro del 2001:db8:11::1
          $ ip -6 route add 2001:db8:11::1/128 nhid 100
      
          Route get on cpu 0 returns the stale pcpu:
          $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
          RTNETLINK answers: Network is unreachable
      
          While cpu 1 works:
          $ taskset -a -c 1 ip -6 route get 2001:db8:11::1
          2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium
      
      Conversion of FIB entries to work with external nexthop objects
      missed an important difference between IPv4 and IPv6 - how dst
      entries are invalidated when the FIB changes. IPv4 has a per-network
      namespace generation id (rt_genid) that is bumped on changes to the FIB.
      Checking if a dst_entry is still valid means comparing rt_genid in the
      rtable to the current value of rt_genid for the namespace.
      
      IPv6 also has a per network namespace counter, fib6_sernum, but the
      count is saved per fib6_node. With the per-node counter only dst_entries
      based on fib entries under the node are invalidated when changes are
      made to the routes - limiting the scope of invalidations. IPv6 uses a
      reference in the rt6_info, 'from', to track the corresponding fib entry
      used to create the dst_entry. When validating a dst_entry, the 'from'
      is used to backtrack to the fib6_node and check the sernum of it to the
      cookie passed to the dst_check operation.
      
      With the inline format (nexthop definition inline with the fib6_info),
      dst_entries cached in the fib6_nh have a 1:1 correlation between fib
      entries, nexthop data and dst_entries. With external nexthops, IPv6
      looks more like IPv4 which means multiple fib entries across disparate
      fib6_nodes can all reference the same fib6_nh. That means validation
      of dst_entries based on external nexthops needs to use the IPv4 format
      - the per-network namespace counter.
      
      Add sernum to rt6_info and set it when creating a pcpu dst entry. Update
      rt6_get_cookie to return sernum if it is set and update dst_check for
      IPv6 to look for sernum being set and base the check on it if so. Finally,
      rt6_get_pcpu_route needs to validate the cached entry before returning
      a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and
      __mkroute_output for IPv4).
      
      This problem only affects routes using the new, external nexthops.
      
      Thanks to the kbuild test robot for catching the IS_ENABLED needed
      around rt_genid_ipv6 before I sent this out.
      
      Fixes: 5b98324e ("ipv6: Allow routes to use nexthop objects")
      Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8f34e53b
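The two validation schemes described in this commit message can be contrasted with a small user-space model. It is a simplified sketch, not the kernel code: `fake_node`, `fake_dst`, `dst_get_cookie`, and `dst_still_valid` are hypothetical stand-ins for fib6_node, rt6_info, rt6_get_cookie, and the IPv6 dst_check logic, and a sernum of 0 is assumed to mean "not set".

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a fib6_node carrying its own serial number. */
struct fake_node {
    uint32_t sernum;
};

/* Stand-in for an rt6_info-style dst entry. */
struct fake_dst {
    const struct fake_node *from_node; /* node reached through 'from' */
    uint32_t sernum;                   /* nonzero: snapshot of the global counter */
};

/* Cookie handed out at creation: global snapshot if set, else node sernum. */
static uint32_t dst_get_cookie(const struct fake_dst *dst)
{
    if (dst->sernum)
        return dst->sernum;
    return dst->from_node ? dst->from_node->sernum : 0;
}

/*
 * Validation: entries built from external nexthops compare their snapshot
 * with the current per-namespace counter (the IPv4-like scheme); inline
 * entries compare the node's sernum with the cookie.
 */
static bool dst_still_valid(const struct fake_dst *dst, uint32_t cookie,
                            uint32_t global_sernum)
{
    if (dst->sernum)
        return dst->sernum == global_sernum;
    return dst->from_node && dst->from_node->sernum == cookie;
}
```

In this model, any bump of the global counter invalidates an external-nexthop entry even when the node it was created under is unchanged, which is the property the per-node check could not provide.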
    • cxgb4: fix EOTID leak when disabling TC-MQPRIO offload · 69422a7e
      Rahul Lakkireddy authored
      Under heavy load, the EOTID termination FLOWC request fails to get
      enqueued to the end of the Tx ring due to lack of credits. This
      results in an EOTID leak.
      
      When disabling TC-MQPRIO offload, the link is already brought down
      to clean up EOTIDs. So, flush any pending enqueued skbs that can't be
      sent out on the wire, to make room for the FLOWC request. Also, move
      the FLOWC descriptor consumption logic closer to when the FLOWC request
      is actually posted to hardware.
      
      Fixes: 0e395b3c ("cxgb4: add FLOWC based QoS offload")
      Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      69422a7e
    • stmmac: intel: Fix kernel crash due to wrong error path · ab1c637c
      Andy Shevchenko authored
      Unfortunately, ->probe() may sometimes fail. The commit b9663b7c
      ("net: stmmac: Enable SERDES power up/down sequence")
      broke the error handling, which leads to:
      
      [   12.811311] ------------[ cut here ]------------
      [   12.811993] kernel BUG at net/core/dev.c:9937!
      
      Fix this with a properly crafted error path.
      
      Fixes: b9663b7c ("net: stmmac: Enable SERDES power up/down sequence")
      Cc: Voon Weifeng <weifeng.voon@intel.com>
      Cc: Ong Boon Leong <boon.leong.ong@intel.com>
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ab1c637c
    • mlxsw: spectrum_acl_tcam: Position vchunk in a vregion list properly · 6ef4889f
      Jiri Pirko authored
      Vregion helpers to get min and max priority depend on the correct
      ordering of vchunks in the vregion list. However, the current code
      always adds a new chunk to the end of the list, no matter what the
      priority is. Fix this by finding the correct place in the list and
      putting the vchunk there.
      
      Fixes: 22a67766 ("mlxsw: spectrum: Introduce ACL core with simple TCAM implementation")
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ef4889f
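The idea of inserting at the correct position, so that min/max helpers can simply read the ends of the list, can be sketched as below. This is a generic sorted insert on a singly linked list, not the mlxsw code; the `vchunk` struct, the ascending order, and the helper names are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical, stripped-down chunk: just a priority and a link. */
struct vchunk {
    unsigned int priority;
    struct vchunk *next;
};

/* Insert so the list stays sorted by ascending priority. */
static void vchunk_insert_sorted(struct vchunk **head, struct vchunk *new_chunk)
{
    struct vchunk **pos = head;

    /* Walk past every chunk whose priority is not larger than ours. */
    while (*pos && (*pos)->priority <= new_chunk->priority)
        pos = &(*pos)->next;

    new_chunk->next = *pos;
    *pos = new_chunk;
}

/* With a sorted list, the minimum is the first element... */
static unsigned int vregion_min_priority(const struct vchunk *head)
{
    return head->priority;
}

/* ...and the maximum is the last one. Both assume a non-empty list. */
static unsigned int vregion_max_priority(const struct vchunk *head)
{
    while (head->next)
        head = head->next;
    return head->priority;
}
```

If chunks were always appended at the tail instead, the two helpers would report whatever happened to be inserted first and last rather than the true extremes.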
    • tunnel: Propagate ECT(1) when decapsulating as recommended by RFC6040 · b7237487
      Toke Høiland-Jørgensen authored
      RFC 6040 recommends propagating an ECT(1) mark from an outer tunnel header
      to the inner header if that inner header is already marked as ECT(0). When
      RFC 6040 decapsulation was implemented, this case of propagation was not
      added. This simply appears to be an oversight, so let's fix that.
      
      Fixes: eccc1bb8 ("tunnel: drop packet if ECN present with not-ECT")
      Reported-by: Bob Briscoe <ietf@bobbriscoe.net>
      Reported-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
      Cc: Dave Taht <dave.taht@gmail.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b7237487
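The decapsulation rule, including the ECT(1) propagation case added by this commit, can be written as a small pure function. This is a sketch of the RFC 6040 decapsulation table as understood here, not the kernel's helper; the enum, the `ECN_DROP` sentinel, and the function name are illustrative assumptions.

```c
/* ECN codepoints as encoded in the two low bits of the TOS/traffic class. */
enum ecn {
    ECN_NOT_ECT = 0,
    ECN_ECT_1   = 1,
    ECN_ECT_0   = 2,
    ECN_CE      = 3,
};

/* Hypothetical sentinel meaning "drop the packet". */
#define ECN_DROP (-1)

/* Returns the ECN field for the forwarded inner header, or ECN_DROP. */
static int ecn_decapsulate(enum ecn outer, enum ecn inner)
{
    /* A not-ECT inner packet must never be marked: drop on outer CE. */
    if (inner == ECN_NOT_ECT)
        return outer == ECN_CE ? ECN_DROP : ECN_NOT_ECT;

    /* Congestion experienced on the outer header always propagates. */
    if (outer == ECN_CE)
        return ECN_CE;

    /* The case this commit adds: outer ECT(1) over inner ECT(0). */
    if (outer == ECN_ECT_1 && inner == ECN_ECT_0)
        return ECN_ECT_1;

    /* Otherwise the inner marking is kept as is. */
    return inner;
}
```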
    • net: macb: Fix runtime PM refcounting · 0ce205d4
      Andy Shevchenko authored
      The commit e6a41c23 ("net: macb: ensure interface is not suspended
      on at91rm9200"), while trying to fix an issue, introduced a
      refcounting regression, because in the error case the refcounter
      must be balanced. Fix it by calling pm_runtime_put_noidle() in the
      error case.
      
      While here, fix the same mistake in a couple of other places.
      
      Fixes: e6a41c23 ("net: macb: ensure interface is not suspended on at91rm9200")
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: Claudiu Beznea <claudiu.beznea@microchip.com>
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ce205d4
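The refcounting rule behind this fix is that pm_runtime_get_sync() raises the usage counter even when the resume fails, so the error path has to drop it again. The user-space model below illustrates that pattern only; `fake_dev` and all `fake_` functions are hypothetical stand-ins, not the kernel runtime PM API or the macb driver code.

```c
/* Stand-in for a device with a runtime PM usage counter. */
struct fake_dev {
    int usage_count;
    int resume_error; /* 0 on success, negative errno to simulate failure */
};

/* Models get_sync: the counter is incremented even if resume fails. */
static int fake_pm_runtime_get_sync(struct fake_dev *dev)
{
    dev->usage_count++;
    return dev->resume_error;
}

/* Models put_noidle: drop the reference without triggering idle handling. */
static void fake_pm_runtime_put_noidle(struct fake_dev *dev)
{
    dev->usage_count--;
}

/* A balanced access helper: every exit path releases the reference. */
static int fake_hw_access(struct fake_dev *dev)
{
    int status = fake_pm_runtime_get_sync(dev);

    if (status < 0) {
        /* Without this put, the counter would leak on the error path. */
        fake_pm_runtime_put_noidle(dev);
        return status;
    }

    /* ... hardware access would happen here ... */

    fake_pm_runtime_put_noidle(dev);
    return 0;
}
```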
    • net: moxa: Fix a potential double 'free_irq()' · ee8d2267
      Christophe JAILLET authored
      Should an irq requested with 'devm_request_irq' be released explicitly,
      it should be done by 'devm_free_irq()', not 'free_irq()'.
      
      Fixes: 6c821bd9 ("net: Add MOXA ART SoCs ethernet driver")
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee8d2267
    • net: macsec: preserve ingress frame ordering · ab046a5d
      Scott Dial authored
      MACsec decryption always occurs in a softirq context. Since
      the FPU may not be usable in the softirq context, the call to
      decrypt may be scheduled on the cryptd work queue. The cryptd
      work queue does not provide ordering guarantees. Therefore,
      preserving order requires masking out ASYNC implementations
      of gcm(aes).
      
      For instance, an Intel CPU with AES-NI makes available the
      generic-gcm-aesni driver from the aesni_intel module to
      implement gcm(aes). However, this implementation requires
      the FPU, so it is not always available to use from a softirq
      context, and will fall back to the cryptd work queue, which
      does not preserve frame ordering. With this change, such a
      system would select gcm_base(ctr(aes-aesni),ghash-generic).
      While the aes-aesni implementation prefers to use the FPU, it
      will fall back to the aes-asm implementation if the FPU is unavailable.
      
      By using a synchronous version of gcm(aes), the decryption
      will complete before returning from crypto_aead_decrypt().
      Therefore, the macsec_decrypt_done() callback will be called
      before returning from macsec_decrypt(). Thus, the order of
      calls to macsec_post_decrypt() for the frames is preserved.
      
      While it's presumable that the pure AES-NI version of gcm(aes)
      is more performant, the hybrid solution is capable of gigabit
      speeds on modest hardware. Regardless, preserving the order
      of frames is paramount for many network protocols (e.g.,
      triggering TCP retries). Within the MACsec driver itself, the
      replay protection is tripped by the out-of-order frames, and
      can cause frames to be dropped.
      
      This bug has been present in this code since it was added in
      v4.6, however it may not have been noticed since not all CPUs
      have FPU offload available. Additionally, the bug manifests
      as occasional out-of-order packets that are easily
      misattributed to other network phenomena.
      
      When this code was added in v4.6, the crypto/gcm.c code did
      not restrict selection of the ghash function based on the
      ASYNC flag. For instance, x86 CPUs with PCLMULQDQ would
      select the ghash-clmulni driver instead of ghash-generic,
      which submits to the cryptd work queue if the FPU is busy.
      However, this bug was corrected in v4.8 by commit b30bdfa8, and
      was backported all the way back to the v3.14 stable branch, so
      this patch should be applicable back to the v4.6 stable branch.
      Signed-off-by: Scott Dial <scott@scottdial.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ab046a5d
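"Masking out ASYNC implementations" relies on the type/mask convention of algorithm lookup: a candidate is skipped when its flags differ from the requested type in any bit covered by the mask. The toy selector below models that convention under stated assumptions; the flag value, the priorities, and all `fake_` names are made up for illustration and are not the kernel crypto API.

```c
#include <stddef.h>

/* Hypothetical flag bit marking an asynchronous implementation. */
#define FAKE_ALG_ASYNC 0x80u

/* Stand-in for a registered algorithm implementation. */
struct fake_alg {
    const char *driver;
    unsigned int flags;
    int priority;
};

/*
 * Pick the highest-priority implementation whose flags agree with 'type'
 * on every bit set in 'mask'. With type = 0 and mask = FAKE_ALG_ASYNC,
 * asynchronous implementations are never selected.
 */
static const struct fake_alg *fake_pick_alg(const struct fake_alg *algs,
                                            size_t count,
                                            unsigned int type,
                                            unsigned int mask)
{
    const struct fake_alg *best = NULL;

    for (size_t i = 0; i < count; i++) {
        if ((algs[i].flags ^ type) & mask)
            continue; /* differs in a bit the caller cares about */
        if (!best || algs[i].priority > best->priority)
            best = &algs[i];
    }
    return best;
}
```

With an empty mask the asynchronous, higher-priority entry wins; with the ASYNC bit in the mask the selector falls through to the best synchronous one, which matches the behaviour the commit message describes.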
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · b6f875a8
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Do not update the UDP checksum when it's zero, from Guillaume Nault.
      
      2) Fix return of local variable in nf_osf, from Arnd Bergmann.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6f875a8
    • Merge branch 'net-ipa-three-bug-fixes' · c778980a
      David S. Miller authored
      Alex Elder says:
      
      ====================
      net: ipa: three bug fixes
      
      This series fixes three bugs in the Qualcomm IPA code.  The third
      adds a missing error code initialization step.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c778980a
    • net: ipa: zero return code before issuing generic EE command · 0b1ba18a
      Alex Elder authored
      Zero the result code stored in a field of the scratch 0 register
      before issuing a generic EE command.  This just guarantees that
      the value we read later was actually written as a result of the
      command.
      
      Also add the definitions of two more possible result codes that can
      be returned when issuing flow control enable or disable commands:
        INCORRECT_CHANNEL_STATE - channel must be in started state
        INCORRECT_DIRECTION - flow control is only valid for TX channels
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b1ba18a
    • net: ipa: fix an error message in gsi_channel_init_one() · 0721999f
      Alex Elder authored
      An error message about limiting the number of TREs used prints the
      wrong value.  Fix this bug.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0721999f
    • net: ipa: fix a bug in ipa_endpoint_stop() · 713b6ebb
      Alex Elder authored
      In ipa_endpoint_stop(), for TX endpoints we set the number of retries
      to 0.  When we break out of the loop, retries being 0 means we return
      EIO rather than the value of ret (which should be 0).
      
      Fix this by using a non-zero retry count for both RX and TX
      channels, and just break out of the loop after calling
      gsi_channel_stop() for TX channels.  This way only RX channels
      will retry, and the retry count will be non-zero at the end
      for TX channels (so the proper value gets returned).
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      713b6ebb
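The loop shape described in this commit message can be reproduced in a small user-space model. It mirrors only what the message states (a retry counter that doubles as the "did we give up" indicator); `fake_channel_stop`, `FAKE_STOP_RX_RETRIES`, and both `endpoint_stop_*` functions are hypothetical and are not the IPA driver code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical retry budget for RX endpoints. */
#define FAKE_STOP_RX_RETRIES 10

/* Test hook: the value the simulated channel stop returns. */
static int fake_stop_result;

static int fake_channel_stop(void)
{
    return fake_stop_result;
}

/* Before the fix: TX starts with zero retries, so success looks like EIO. */
static int endpoint_stop_buggy(bool tx)
{
    unsigned int retries = tx ? 0 : FAKE_STOP_RX_RETRIES;
    int ret;

    do {
        ret = fake_channel_stop();
        if (ret != -EAGAIN)
            break;
    } while (retries--);

    return retries ? ret : -EIO;
}

/* After the fix: same non-zero count for both, TX just never loops. */
static int endpoint_stop_fixed(bool tx)
{
    unsigned int retries = FAKE_STOP_RX_RETRIES;
    int ret;

    do {
        ret = fake_channel_stop();
        if (ret != -EAGAIN || tx)
            break;
    } while (retries--);

    return retries ? ret : -EIO;
}
```

With `fake_stop_result` set to 0, the buggy variant reports -EIO for a TX endpoint even though the stop succeeded, while the fixed variant returns 0 for both directions.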
    • Merge branch 'ionic-fw-upgrade-bug-fixes' · de04604e
      David S. Miller authored
      Shannon Nelson says:
      
      ====================
      ionic: fw upgrade bug fixes
      
      These patches address issues found in additional internal
      fw-upgrade testing.
      
      v2:
       - replaced extra state flag with postponing first link check
       - added device reset patch
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de04604e
    • ionic: add device reset to fw upgrade down · 6bc977fa
      Shannon Nelson authored
      Doing a device reset addresses an obscure FW timing issue in
      the FW upgrade process.
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6bc977fa
    • ionic: refresh devinfo after fw-upgrade · 1d53aedc
      Shannon Nelson authored
      Make sure we can report the new FW version after a
      fw-upgrade has finished by re-reading the device's
      fw version information.
      
      Fixes: c672412f ("ionic: remove lifs on fw reset")
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1d53aedc
    • ionic: no link check until after probe · 16f3fd3d
      Shannon Nelson authored
      Don't bother with the link check during probe; let
      the watchdog notice the first link-up.  This allows
      probe to finish cleanly without any interruptions
      from overexcited user programs opening the device
      as soon as it is registered.
      
      Fixes: c672412f ("ionic: remove lifs on fw reset")
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      16f3fd3d
    • dp83640: reverse arguments to list_add_tail · 86530837
      Julia Lawall authored
      In this code, it appears that phyter_clocks is a list head, based on
      the previous list_for_each, and that clock->list is intended to be a
      list element, given that it has just been initialized in
      dp83640_clock_init.  Accordingly, switch the arguments to
      list_add_tail, which takes the list head as the second argument.
      
      Fixes: cb646e2b ("ptp: Added a clock driver for the National Semiconductor PHYTER.")
      Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86530837
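Why the argument order matters can be seen with a minimal user-space re-creation of the circular doubly linked list. The three functions below follow the well-known shape of the kernel list helpers but are written here from scratch as a sketch; `list_count` is an extra helper added only for illustration.

```c
/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list (or a freshly initialized element) points to itself. */
static void INIT_LIST_HEAD(struct list_head *list)
{
    list->next = list;
    list->prev = list;
}

/* Insert new_entry just before head, i.e. at the tail of head's list. */
static void list_add_tail(struct list_head *new_entry, struct list_head *head)
{
    struct list_head *prev = head->prev;

    new_entry->next = head;
    new_entry->prev = prev;
    prev->next = new_entry;
    head->prev = new_entry;
}

/* Number of entries reachable from head, not counting head itself. */
static int list_count(const struct list_head *head)
{
    int n = 0;

    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

In this model, calling `list_add_tail(&head, &element)` with the arguments swapped re-links the head into the element's own one-entry list each time, so entries added earlier are no longer reachable from the head.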
    • net: bridge: vlan: Add a schedule point during VLAN processing · 7979457b
      Ido Schimmel authored
      User space can request to delete a range of VLANs from a bridge slave in
      one netlink request. For each deleted VLAN the FDB needs to be traversed
      in order to flush all the affected entries.
      
      If a large range of VLANs is deleted and the number of FDB entries is
      large or the FDB lock is contended, it is possible for the kernel to
      loop through the deleted VLANs for a long time. In case preemption is
      disabled, this can result in a soft lockup.
      
      Fix this by adding a schedule point after each VLAN is deleted to yield
      the CPU, if needed. This is safe because the VLANs are traversed in
      process context.
      
      Fixes: bdced7ef ("bridge: support for multiple vlans and vlan ranges in setlink and dellink requests")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Tested-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7979457b
  2. 30 Apr, 2020 16 commits
    • ibmvnic: Skip fatal error reset after passive init · f9c6cea0
      Juliet Kim authored
      During an MTU change, the following events may happen.
      Client-driven CRQ initialization fails because the partner's CRQ is
      closed, causing the client to enqueue a reset task for FATAL_ERROR.
      Then passive (server-driven) CRQ initialization succeeds, causing the
      client to release the CRQ and enqueue a reset task for failover. If the passive
      CRQ initialization occurs before the FATAL reset task is processed,
      the FATAL error reset task would try to access a CRQ message queue
      that was freed, causing an oops. The problem may be most likely to
      occur during DLPAR add vNIC with a non-default MTU, because the DLPAR
      process will automatically issue a change MTU request.
      
      Fix this by not processing fatal error reset if CRQ is passively
      initialized after client-driven CRQ initialization fails.
      Signed-off-by: Juliet Kim <julietk@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f9c6cea0
    • Merge tag 'mlx5-fixes-2020-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 81d6bc44
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      Mellanox, mlx5 fixes 2020-04-29
      
      This series introduces some fixes to mlx5 driver.
      
      Please pull and let me know if there is any problem.
      
      v2:
       - Dropped the ktls patch, Tariq has to check if it is fixable in the stack
      
      For -stable v4.12
       ('net/mlx5: Fix forced completion access non initialized command entry')
       ('net/mlx5: Fix command entry leak in Internal Error State')
      
      For -stable v5.4
       ('net/mlx5: DR, On creation set CQ's arm_db member to right value')
      
      For -stable v5.6
       ('net/mlx5e: Fix q counters on uplink representors')
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      81d6bc44
    • mptcp: fix uninitialized value access · ac2b47fb
      Paolo Abeni authored
      tcp_v{4,6}_syn_recv_sock() set 'own_req' only when returning
      a non-NULL 'child', so check 'own_req' only if child is
      available to avoid a harmless UBSAN splat.
      
      v1 -> v2:
       - reference the correct hash
      
      Fixes: 4c8941de ("mptcp: avoid flipping mp_capable field in syn_recv_sock()")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac2b47fb
    • Merge branch 'mptcp-fix-incoming-options-parsing' · 8c755953
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      mptcp: fix incoming options parsing
      
      This series addresses a serious issue in MPTCP option parsing.
      
      This is bigger than the usual -net change, but I was unable to find a
      working, sane, smaller fix.
      
      The core change is inside patch 2/5, which moves MPTCP options parsing from
      the TCP code into existing MPTCP hooks and cleans the MPTCP options status on
      each processed packet.
      
      The patch 1/5 is a needed pre-requisite, and patches 3,4,5 are smaller,
      related fixes.
      
      v1 -> v2:
       - cleaned-up patch 1/5
       - rebased on top of current -net
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8c755953
    • mptcp: initialize the data_fin field for mpc packets · a77895db
      Paolo Abeni authored
      When parsing MPC+data packets we set the dss field, so
      we must also initialize the data_fin, or we can find a stray
      value there.
      
      Fixes: 9a19371b ("mptcp: fix data_fin handing in RX path")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a77895db
    • mptcp: fix 'use_ack' option access. · 5a91e32b
      Paolo Abeni authored
      The mentioned RX option field is initialized only for DSS
      packets, so we must access it only if 'dss' is set too, or
      the subflow will end up in a bad state, leading to
      RFC violations.
      
      Fixes: d22f4988 ("mptcp: process MP_CAPABLE data option")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5a91e32b
    • mptcp: avoid a WARN on bad input. · d6085fe1
      Paolo Abeni authored
      Syzkaller has found a way to trigger the WARN_ON_ONCE condition
      in check_fully_established().
      
      The root cause is a legitimate fallback-to-TCP scenario, so replace
      the WARN with a plain message on a stricter condition.
      
      Fixes: f296234c ("mptcp: Add handling of incoming MP_JOIN requests")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d6085fe1
    • mptcp: move option parsing into mptcp_incoming_options() · cfde141e
      Paolo Abeni authored
      The mptcp_options_received structure carries several per-packet
      flags (mp_capable, mp_join, etc.). Such fields must be cleared on
      each packet, even on dropped ones or on packets not carrying any
      MPTCP options, but the current mptcp code clears them only on TCP
      option reset.
      
      In several races/corner cases we end up with stray bits in
      incoming options, leading to WARN_ON splats, e.g.:
      
      [  171.164906] Bad mapping: ssn=32714 map_seq=1 map_data_len=32713
      [  171.165006] WARNING: CPU: 1 PID: 5026 at net/mptcp/subflow.c:533 warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
      [  171.167632] Modules linked in: ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel geneve ip6_udp_tunnel udp_tunnel macsec macvtap tap ipvlan macvlan 8021q garp mrp xfrm_interface veth netdevsim nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun binfmt_misc intel_rapl_msr intel_rapl_common rfkill kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_console ata_generic virtio_blk virtio_net net_failover failover ata_piix libata
      [  171.199464] CPU: 1 PID: 5026 Comm: repro Not tainted 5.7.0-rc1.mptcp_f227fdf5d388+ #95
      [  171.200886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
      [  171.202546] RIP: 0010:warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
      [  171.206537] Code: c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 8b 55 3c 44 89 e6 48 c7 c7 20 51 13 95 e8 37 8b 22 fe <0f> 0b 48 83 c4 08 5b 5d 41 5c c3 89 4c 24 04 e8 db d6 94 fe 8b 4c
      [  171.220473] RSP: 0018:ffffc90000150560 EFLAGS: 00010282
      [  171.221639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [  171.223108] RDX: 0000000000000000 RSI: 0000000000000008 RDI: fffff5200002a09e
      [  171.224388] RBP: ffff8880aa6e3c00 R08: 0000000000000001 R09: fffffbfff2ec9955
      [  171.225706] R10: ffffffff9764caa7 R11: fffffbfff2ec9954 R12: 0000000000007fca
      [  171.227211] R13: ffff8881066f4a7f R14: ffff8880aa6e3c00 R15: 0000000000000020
      [  171.228460] FS:  00007f8623719740(0000) GS:ffff88810be00000(0000) knlGS:0000000000000000
      [  171.230065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  171.231303] CR2: 00007ffdab190a50 CR3: 00000001038ea006 CR4: 0000000000160ee0
      [  171.232586] Call Trace:
      [  171.233109]  <IRQ>
      [  171.233531] get_mapping_status (linux-mptcp/net/mptcp/subflow.c:691)
      [  171.234371] mptcp_subflow_data_available (linux-mptcp/net/mptcp/subflow.c:736 linux-mptcp/net/mptcp/subflow.c:832)
      [  171.238181] subflow_state_change (linux-mptcp/net/mptcp/subflow.c:1085 (discriminator 1))
      [  171.239066] tcp_fin (linux-mptcp/net/ipv4/tcp_input.c:4217)
      [  171.240123] tcp_data_queue (linux-mptcp/./include/linux/compiler.h:199 linux-mptcp/net/ipv4/tcp_input.c:4822)
      [  171.245083] tcp_rcv_established (linux-mptcp/./include/linux/skbuff.h:1785 linux-mptcp/./include/net/tcp.h:1774 linux-mptcp/./include/net/tcp.h:1847 linux-mptcp/net/ipv4/tcp_input.c:5238 linux-mptcp/net/ipv4/tcp_input.c:5730)
      [  171.254089] tcp_v4_rcv (linux-mptcp/./include/linux/spinlock.h:393 linux-mptcp/net/ipv4/tcp_ipv4.c:2009)
      [  171.258969] ip_protocol_deliver_rcu (linux-mptcp/net/ipv4/ip_input.c:204 (discriminator 1))
      [  171.260214] ip_local_deliver_finish (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/ipv4/ip_input.c:232)
      [  171.261389] ip_local_deliver (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:252)
      [  171.265884] ip_rcv (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:539)
      [  171.273666] process_backlog (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/core/dev.c:6135)
      [  171.275328] net_rx_action (linux-mptcp/net/core/dev.c:6572 linux-mptcp/net/core/dev.c:6640)
      [  171.280472] __do_softirq (linux-mptcp/./arch/x86/include/asm/jump_label.h:25 linux-mptcp/./include/linux/jump_label.h:200 linux-mptcp/./include/trace/events/irq.h:142 linux-mptcp/kernel/softirq.c:293)
      [  171.281379] do_softirq_own_stack (linux-mptcp/arch/x86/entry/entry_64.S:1083)
      [  171.282358]  </IRQ>
      
      We could address the issue by explicitly clearing the relevant fields
      in several places - tcp_parse_option, tcp_fast_parse_options,
      possibly others.
      
      Instead we move the MPTCP option parsing into the already existing
      mptcp ingress hook, so that we need to clear the fields in a single
      place.
      
      This allows us to drop an MPTCP hook from the TCP code and
      remove the quite large mptcp_options_received from the tcp_sock
      struct. On the flip side, the MPTCP sockets will traverse the
      option space twice (in tcp_parse_option() and in
      mptcp_incoming_options()). That looks acceptable: we already
      do that for syn and 3rd ack packets, plain TCP sockets will
      benefit from it, and even MPTCP sockets will experience better
      code locality, reducing the jumps between TCP and MPTCP code.
      
      v1 -> v2:
       - rebased on current '-net' tree
      
      Fixes: 648ef4b8 ("mptcp: Implement MPTCP receive path")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cfde141e
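The "clear in a single place" approach can be illustrated with a toy per-packet parser. This is only a sketch of the pattern: the structs are drastically reduced, and `fake_packet`, `fake_options_received`, and `fake_incoming_options` are hypothetical names, not the MPTCP data structures or hook.

```c
#include <stdbool.h>
#include <string.h>

/* Reduced stand-in for the per-packet option flags. */
struct fake_options_received {
    bool mp_capable;
    bool mp_join;
    bool dss;
};

/* Stand-in for a received packet and the options it carries. */
struct fake_packet {
    bool carries_mp_capable;
    bool carries_dss;
};

/*
 * Single ingress hook: the flags are reset unconditionally before parsing,
 * so nothing from the previous packet can survive, even when the current
 * packet carries no options at all.
 */
static void fake_incoming_options(const struct fake_packet *pkt,
                                  struct fake_options_received *opts)
{
    memset(opts, 0, sizeof(*opts));

    if (pkt->carries_mp_capable)
        opts->mp_capable = true;
    if (pkt->carries_dss)
        opts->dss = true;
}
```

Because the reset happens at the top of the only parsing entry point, no other code path needs to remember to clear the flags.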
    • mptcp: consolidate synack processing. · 263e1201
      Paolo Abeni authored
      Currently the MPTCP code uses 2 hooks to process syn-ack
      packets, mptcp_rcv_synsent() and the sk_rx_dst_set()
      callback.
      
      We can drop the first, moving the relevant code into the
      latter, reducing the hooking into the TCP code. This is
      also needed by the next patch.
      
      v1 -> v2:
       - use local tcp sock ptr instead of casting the sk variable
         several times - DaveM
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      263e1201
    • Roi Dayan's avatar
      net/mlx5e: Fix q counters on uplink representors · 67b38de6
      Roi Dayan authored
      The q counters must be allocated before init_rx, which needs them
      when creating the rq.
      
      Fixes: 8520fa57 ("net/mlx5e: Create q counters on uplink representors")
      Signed-off-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      67b38de6
    • Moshe Shemesh's avatar
      net/mlx5: Fix command entry leak in Internal Error State · cece6f43
      Moshe Shemesh authored
      Processing commands in cmd_work_handler() while already in Internal
      Error State results in an entry leak, since the handler processes a
      forced completion without ringing the doorbell. The forced completion
      doesn't release the entry and the event completion will never arrive,
      so the entry should be released.
      
      Fixes: 73dd3a48 ("net/mlx5: Avoid using pending command interface slots")
      Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      cece6f43
    • Moshe Shemesh's avatar
      net/mlx5: Fix forced completion access non initialized command entry · f3cb3ceb
      Moshe Shemesh authored
      mlx5_cmd_flush() triggers forced completions on all valid command
      entries. When triggered by an async event such as fast teardown, it
      can happen at any stage of the command, including command
      initialization. In that case the forced completion can run on an
      uninitialized command entry.
      
      Setting MLX5_CMD_ENT_STATE_PENDING_COMP only after the command entry
      is initialized ensures that forced completion is handled only if the
      command entry is initialized.
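
The ordering described above can be illustrated with a small userspace sketch. It is not the mlx5 code: `struct toy_cmd_ent`, its fields and both helpers are hypothetical names, the flag only stands in for MLX5_CMD_ENT_STATE_PENDING_COMP, and the real driver has to deal with concurrency that this single-threaded sketch ignores.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified command entry; not the mlx5 structure. */
struct toy_cmd_ent {
    bool pending_comp;   /* stands in for MLX5_CMD_ENT_STATE_PENDING_COMP */
    int *status;         /* NULL until the entry is initialized */
    int status_storage;
};

/* Initialize every field first, and only then mark the entry pending. */
static void toy_cmd_ent_init(struct toy_cmd_ent *ent)
{
    ent->status_storage = 0;
    ent->status = &ent->status_storage;
    ent->pending_comp = true;   /* set last, after initialization */
}

/*
 * Forced completion only touches entries that are marked pending, so an
 * entry that is still being set up is skipped instead of dereferenced.
 * Returns true if the entry was completed.
 */
static bool toy_cmd_ent_force_complete(struct toy_cmd_ent *ent)
{
    if (!ent->pending_comp)
        return false;
    *ent->status = -5;          /* arbitrary error code for the sketch */
    ent->pending_comp = false;
    return true;
}
```

Because the flag is the last thing written during setup, a flush that arrives early sees the entry as not pending and leaves it alone.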
      
      Fixes: 73dd3a48 ("net/mlx5: Avoid using pending command interface slots")
      Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      f3cb3ceb
    • Erez Shitrit's avatar
      net/mlx5: DR, On creation set CQ's arm_db member to right value · 8075411d
      Erez Shitrit authored
      In polling mode, set the arm_db member to a value that avoids CQ
      event recovery by the HW.
      Otherwise we might get an event without a completion function.
      In addition, an empty completion function was added to protect
      against unexpected events.
      
      Fixes: 297ccceb ("net/mlx5: DR, Expose an internal API to issue RDMA operations")
      Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Alex Vesker <valex@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      8075411d
    • Parav Pandit's avatar
      net/mlx5: E-switch, Fix mutex init order · f8d1edda
      Parav Pandit authored
      In the cited patch, the mutex is initialized after it is used.
      The call trace below is observed.
      Fix the order to initialize the mutex early enough.
      Similarly, follow the mirror sequence during cleanup.
      
      kernel: DEBUG_LOCKS_WARN_ON(lock->magic != lock)
      kernel: WARNING: CPU: 5 PID: 45916 at kernel/locking/mutex.c:938
      __mutex_lock+0x7d6/0x8a0
      kernel: Call Trace:
      kernel: ? esw_vport_tbl_get+0x3b/0x250 [mlx5_core]
      kernel: ? mark_held_locks+0x55/0x70
      kernel: ? __slab_free+0x274/0x400
      kernel: ? lockdep_hardirqs_on+0x140/0x1d0
      kernel: esw_vport_tbl_get+0x3b/0x250 [mlx5_core]
      kernel: ? mlx5_esw_chains_create_fdb_prio+0xa57/0xc20 [mlx5_core]
      kernel: mlx5_esw_vport_tbl_get+0x88/0xf0 [mlx5_core]
      kernel: mlx5_esw_chains_create+0x2f3/0x3e0 [mlx5_core]
      kernel: esw_create_offloads_fdb_tables+0x11d/0x580 [mlx5_core]
      kernel: esw_offloads_enable+0x26d/0x540 [mlx5_core]
      kernel: mlx5_eswitch_enable_locked+0x155/0x860 [mlx5_core]
      kernel: mlx5_devlink_eswitch_mode_set+0x1af/0x320 [mlx5_core]
      kernel: devlink_nl_cmd_eswitch_set_doit+0x41/0xb0
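
The init-before-use rule and the mirrored cleanup can be sketched in userspace C. Everything here is hypothetical: `toy_mutex` is a toy lock with a self-pointer loosely modelled on the `lock->magic != lock` check in the trace, and the `toy_esw_*` helpers are not the mlx5 functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy lock with a debug self-pointer, set only by toy_mutex_init(). */
struct toy_mutex {
    struct toy_mutex *magic;
    bool locked;
};

static void toy_mutex_init(struct toy_mutex *m)
{
    m->magic = m;
    m->locked = false;
}

static void toy_mutex_destroy(struct toy_mutex *m)
{
    m->magic = NULL;
}

/* Returns false where a debug build would warn: lock used before init. */
static bool toy_mutex_lock(struct toy_mutex *m)
{
    if (m->magic != m)
        return false;
    m->locked = true;
    return true;
}

static void toy_mutex_unlock(struct toy_mutex *m)
{
    m->locked = false;
}

struct toy_esw {
    struct toy_mutex lock;
    int tables;
};

/* Table creation takes the lock, so the lock must already be initialized. */
static bool toy_tables_create(struct toy_esw *esw)
{
    if (!toy_mutex_lock(&esw->lock))
        return false;
    esw->tables = 1;
    toy_mutex_unlock(&esw->lock);
    return true;
}

static void toy_tables_destroy(struct toy_esw *esw)
{
    if (toy_mutex_lock(&esw->lock)) {
        esw->tables = 0;
        toy_mutex_unlock(&esw->lock);
    }
}

/* Setup: initialize the lock first, then create what depends on it. */
static bool toy_esw_enable(struct toy_esw *esw)
{
    toy_mutex_init(&esw->lock);
    if (!toy_tables_create(esw)) {
        toy_mutex_destroy(&esw->lock);
        return false;
    }
    return true;
}

/* Cleanup mirrors setup: destroy the tables, then the lock. */
static void toy_esw_disable(struct toy_esw *esw)
{
    toy_tables_destroy(esw);
    toy_mutex_destroy(&esw->lock);
}
```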
      
      Fixes: 96e32687 ("net/mlx5e: Eswitch, Use per vport tables for mirroring")
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Eli Cohen <eli@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      f8d1edda
    • Parav Pandit's avatar
      net/mlx5: E-switch, Fix printing wrong error value · e9864539
      Parav Pandit authored
      When mlx5_modify_header_alloc() fails, the current error log prints 0
      instead of the error value returned.
      
      Fix by printing correct error value returned by
      mlx5_modify_header_alloc().
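
One common way for a log to print 0 instead of the real error is to format the message before the error code has been extracted from the returned pointer. The sketch below assumes that pattern; it is not the driver code. The `toy_*` helpers are userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(), and `toy_alloc()` is a hypothetical allocator that always fails with -22.

```c
#include <stdint.h>
#include <stdio.h>

/* Userspace stand-ins for the kernel error-pointer helpers. */
#define TOY_MAX_ERRNO 4095

static inline void *toy_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static inline long toy_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

static inline int toy_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Hypothetical allocator that always fails with -22. */
static void *toy_alloc(void)
{
    return toy_err_ptr(-22);
}

/* Buggy order: the message is formatted while err still holds 0. */
static int toy_setup_buggy(char *log, size_t len)
{
    int err = 0;
    void *hdr = toy_alloc();

    if (toy_is_err(hdr)) {
        snprintf(log, len, "failed, err: %d", err);
        err = (int)toy_ptr_err(hdr);
        return err;
    }
    return 0;
}

/* Fixed order: extract the error first, then log it. */
static int toy_setup_fixed(char *log, size_t len)
{
    int err = 0;
    void *hdr = toy_alloc();

    if (toy_is_err(hdr)) {
        err = (int)toy_ptr_err(hdr);
        snprintf(log, len, "failed, err: %d", err);
        return err;
    }
    return 0;
}
```

Both variants return the same error to the caller; only the logged value differs, which is why this kind of bug is easy to miss.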
      
      Fixes: 6724e66b ("net/mlx5: E-Switch, Get reg_c1 value on miss")
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      e9864539
    • Parav Pandit's avatar
      net/mlx5: E-switch, Fix error unwinding flow for steering init failure · 79949985
      Parav Pandit authored
      Error unwinding is done incorrectly in the cited commit.
      When steering init fails, there is no need to perform steering cleanup.
      When a vport error occurs, the error cleanup should mirror the setup
      routine, i.e. perform steering cleanup before metadata cleanup.
      
      This avoids the call trace below, caused by accessing objects that
      were left uninitialized because steering_init() failed.
      
      Call trace:
      mlx5_cmd_modify_header_alloc:805:(pid 21128): too many modify header
      actions 1, max supported 0
      E-Switch: Failed to create restore mod header
      
      BUG: kernel NULL pointer dereference, address: 00000000000000d0
      [  677.263079]  mlx5_destroy_flow_group+0x13/0x80 [mlx5_core]
      [  677.268921]  esw_offloads_steering_cleanup+0x51/0xf0 [mlx5_core]
      [  677.275281]  esw_offloads_enable+0x1a5/0x800 [mlx5_core]
      [  677.280949]  mlx5_eswitch_enable_locked+0x155/0x860 [mlx5_core]
      [  677.287227]  mlx5_devlink_eswitch_mode_set+0x1af/0x320
      [  677.293741]  devlink_nl_cmd_eswitch_set_doit+0x41/0xb0
      [  677.299217]  genl_rcv_msg+0x1eb/0x430
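
The unwinding rule above (skip the cleanup of a step that failed, and undo the completed steps in reverse order) is commonly written with goto labels. The following is a minimal userspace sketch under assumed names: the `toy_*` functions, the `fail_*` switches and the recorded cleanup order are all hypothetical, and the metadata, steering, vports step order is inferred from the message rather than taken from the driver.

```c
#include <stdbool.h>

enum { TOY_METADATA = 1, TOY_STEERING = 2 };

/* Hypothetical state that records which cleanups ran, and in what order. */
struct toy_state {
    bool metadata;
    bool steering;
    int cleanup_order[3];
    int n;
};

static int toy_metadata_init(struct toy_state *s)
{
    s->metadata = true;
    return 0;
}

static void toy_metadata_cleanup(struct toy_state *s)
{
    s->metadata = false;
    s->cleanup_order[s->n++] = TOY_METADATA;
}

static int toy_steering_init(struct toy_state *s, bool fail)
{
    if (fail)
        return -12;     /* nothing was set up, so nothing to clean up */
    s->steering = true;
    return 0;
}

static void toy_steering_cleanup(struct toy_state *s)
{
    s->steering = false;
    s->cleanup_order[s->n++] = TOY_STEERING;
}

static int toy_vports_load(bool fail)
{
    return fail ? -5 : 0;
}

/* Each label undoes only the steps that completed before the failure. */
static int toy_offloads_enable(struct toy_state *s, bool fail_steering,
                               bool fail_vports)
{
    int err;

    err = toy_metadata_init(s);
    if (err)
        return err;

    err = toy_steering_init(s, fail_steering);
    if (err)
        goto err_steering_init;

    err = toy_vports_load(fail_vports);
    if (err)
        goto err_vports;

    return 0;

err_vports:
    toy_steering_cleanup(s);    /* steering is cleaned up first */
err_steering_init:
    toy_metadata_cleanup(s);    /* then metadata, mirroring the setup */
    return err;
}
```

When steering init fails, only the metadata cleanup runs; when the vport step fails, steering cleanup runs before metadata cleanup.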
      
      Fixes: 7983a675 ("net/mlx5: E-Switch, Enable chains only if regs loopback is enabled")
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      79949985
  3. 29 Apr, 2020 3 commits