1. 24 Jan, 2022 9 commits
    • net/smc: Transitional solution for clcsock race issue · c0bf3d8a
      Wen Gu authored
      We encountered a crash in smc_setsockopt() and it is caused by
      accessing smc->clcsock after clcsock was released.
      
       BUG: kernel NULL pointer dereference, address: 0000000000000020
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP PTI
       CPU: 1 PID: 50309 Comm: nginx Kdump: loaded Tainted: G E     5.16.0-rc4+ #53
       RIP: 0010:smc_setsockopt+0x59/0x280 [smc]
       Call Trace:
        <TASK>
        __sys_setsockopt+0xfc/0x190
        __x64_sys_setsockopt+0x20/0x30
        do_syscall_64+0x34/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f16ba83918e
        </TASK>
      
      This patch tries to fix it by holding clcsock_release_lock and
      checking whether clcsock has already been released before access.
      
      To prevent a crash for the same reason in smc_getsockopt() or
      smc_switch_to_fallback(), this patch also checks smc->clcsock
      in them. The caller of smc_switch_to_fallback() determines
      whether the fallback succeeded from its return value.
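
A minimal Python sketch of the check-under-lock pattern described above. It is a user-space model, not the kernel code: the `SmcSock` class, the dict standing in for clcsock, and the `-EBADF` return value are assumptions made for illustration.

```python
import errno
import threading


class SmcSock:
    """Hypothetical stand-in for struct smc_sock (illustration only)."""

    def __init__(self, clcsock):
        self.clcsock = clcsock  # a dict plays the role of the internal TCP socket
        self.clcsock_release_lock = threading.Lock()

    def release_clcsock(self):
        # The release path clears the pointer while holding the lock.
        with self.clcsock_release_lock:
            self.clcsock = None

    def setsockopt(self, name, value):
        # Hold the lock across both the check and the access, so the
        # socket cannot be released between the two steps.
        with self.clcsock_release_lock:
            if self.clcsock is None:
                return -errno.EBADF  # assumed error code for "already released"
            self.clcsock[name] = value
            return 0
```

Without the lock, the `is None` check and the access could be separated by a concurrent release, which is the kind of race the crash above points to.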
      
      Fixes: fd57770d ("net/smc: wait for pending work before clcsock release_sock")
      Link: https://lore.kernel.org/lkml/5dd7ffd1-28e2-24cc-9442-1defec27375e@linux.ibm.com/T/
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ibmvnic: remove unused ->wait_capability · 3a5d9db7
      Sukadev Bhattiprolu authored
      With the previous bug fix, the ->wait_capability flag is no longer needed
      and can be removed.
      
      Fixes: 249168ad ("ibmvnic: Make CRQ interrupt tasklet wait for all capabilities crqs")
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
      Reviewed-by: Dany Madden <drt@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ibmvnic: don't spin in tasklet · 48079e7f
      Sukadev Bhattiprolu authored
      ibmvnic_tasklet() continuously spins waiting for responses to all
      capability requests. It does this to avoid encountering an error
      during initialization of the vnic. However, if there is a bug in the
      VIOS and we do not receive a response to one or more queries, the
      tasklet ends up spinning continuously, leading to hard lockups.
      
      If we fail to receive a message from the VIOS, it is reasonable to
      time out the login attempt rather than spin indefinitely in the tasklet.
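
The idea of bounding the wait can be sketched in Python as follows. This is only an illustrative model: `wait_for_login`, the `threading.Event` standing in for the VIOS response, and the `-ETIMEDOUT` return value are hypothetical and not taken from the driver.

```python
import errno
import threading


def wait_for_login(response_arrived: threading.Event, timeout_s: float) -> int:
    """Wait a bounded time for a response instead of spinning forever.

    Returns 0 if the response arrived, or -ETIMEDOUT (an assumed error
    code) if it did not arrive within timeout_s seconds.
    """
    if response_arrived.wait(timeout_s):
        return 0
    return -errno.ETIMEDOUT
```

A missing response then surfaces as an error the caller can handle, rather than as a CPU stuck in a loop.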
      
      Fixes: 249168ad ("ibmvnic: Make CRQ interrupt tasklet wait for all capabilities crqs")
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
      Reviewed-by: Dany Madden <drt@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ibmvnic: init ->running_cap_crqs early · 151b6a5c
      Sukadev Bhattiprolu authored
      We use ->running_cap_crqs to determine when the ibmvnic_tasklet() should
      send out the next protocol message type, i.e., when we get back responses
      to all our QUERY_CAPABILITY CRQs we send out REQUEST_CAPABILITY CRQs.
      Similarly, when we get responses to all the REQUEST_CAPABILITY CRQs, we
      send out the QUERY_IP_OFFLOAD CRQ.
      
      We currently increment ->running_cap_crqs as we send out each CRQ and
      have the ibmvnic_tasklet() send out the next message type when this
      running_cap_crqs count drops to 0.
      
      This assumes that all the CRQs of the current type were sent out before
      the count drops to 0. However it is possible that we send out say 6 CRQs,
      get preempted and receive all the 6 responses before we send out the
      remaining CRQs. This can result in ->running_cap_crqs count dropping to
      zero before all messages of the current type were sent and we end up
      sending the next protocol message too early.
      
      Instead initialize the ->running_cap_crqs upfront so the tasklet will
      only send the next protocol message after all responses are received.
      
      Use the cap_reqs local variable to also detect any discrepancy (either
      now or in future) in the number of capability requests we actually send.
      
      Currently only send_query_cap() is affected by this behavior (of sending
      the next message early) since it is called from the worker thread (during
      reset) and from the application thread (during ->ndo_open()), and both can
      be preempted. send_request_cap() is only called from the tasklet, which
      processes CRQ responses sequentially, so it is not affected. But to
      maintain the existing symmetry with send_query_cap() we update
      send_request_cap() also.
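
The race and the fix can be illustrated with a small deterministic Python simulation. It is a sketch under stated assumptions, not driver code: the `Adapter` class, `simulate`, the total of 10 CRQs, and the single preemption point are all hypothetical; only the "6 CRQs sent before preemption" figure comes from the description above.

```python
class Adapter:
    """Hypothetical model of the counters involved (illustration only)."""

    def __init__(self, total):
        self.total = total
        self.running_cap_crqs = 0
        self.sent = 0
        self.answered = 0
        self.next_stage_at = None  # CRQs sent when the next stage was triggered

    def handle_response(self):
        # Models the tasklet: one response in, move on when the count hits 0.
        self.answered += 1
        self.running_cap_crqs -= 1
        if self.running_cap_crqs == 0 and self.next_stage_at is None:
            self.next_stage_at = self.sent


def simulate(total, preempt_after, init_upfront):
    adapter = Adapter(total)
    if init_upfront:
        adapter.running_cap_crqs = total  # new scheme: set the count first
    for _ in range(total):
        if not init_upfront:
            adapter.running_cap_crqs += 1  # old scheme: count as we send
        adapter.sent += 1
        if adapter.sent == preempt_after:
            # Sender is preempted; every response so far arrives now.
            while adapter.answered < adapter.sent:
                adapter.handle_response()
    while adapter.answered < adapter.sent:
        adapter.handle_response()
    return adapter.next_stage_at


early = simulate(total=10, preempt_after=6, init_upfront=False)
fixed = simulate(total=10, preempt_after=6, init_upfront=True)
```

In this model the old scheme triggers the next stage after only 6 of the 10 CRQs were sent, while the upfront initialization triggers it after all 10.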
      
      Fixes: 249168ad ("ibmvnic: Make CRQ interrupt tasklet wait for all capabilities crqs")
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
      Reviewed-by: Dany Madden <drt@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ibmvnic: Allow extra failures before disabling · db9f0e8b
      Sukadev Bhattiprolu authored
      If auto-priority-failover (APF) is enabled and there are at least two
      backing devices of different priorities, some resets (fail-over,
      change-param, etc.) can cause at least two back-to-back failovers: a
      failover from the high-priority backing device to the lower-priority one,
      and then back to the higher-priority one if that is still functional.
      
      Depending on the timing of the two failovers it is possible to trigger
      a "hard" reset and for the hard reset to fail due to failovers. When this
      occurs, the driver assumes that the network is unstable and disables the
      VNIC for a 60-second "settling time". This in turn can cause the ethtool
      command to fail with "No such device" while the vnic automatically recovers
      a little while later.
      
      Given that it's possible to have two back to back failures, allow for extra
      failures before disabling the vnic for the settling time.
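
A minimal Python sketch of tolerating a few consecutive failures before backing off. The `ResetTracker` class and the threshold of 3 are assumptions for illustration; only the 60-second settling time is taken from the description above.

```python
MAX_HARD_RESET_FAILURES = 3  # hypothetical threshold, not taken from the driver
SETTLING_TIME_S = 60         # settling time mentioned in the commit message


class ResetTracker:
    """Counts consecutive hard-reset failures (illustration only)."""

    def __init__(self):
        self.consecutive_failures = 0

    def on_hard_reset(self, succeeded: bool) -> int:
        """Return how many seconds the device should stay disabled."""
        if succeeded:
            self.consecutive_failures = 0
            return 0
        self.consecutive_failures += 1
        if self.consecutive_failures >= MAX_HARD_RESET_FAILURES:
            self.consecutive_failures = 0
            return SETTLING_TIME_S
        return 0  # tolerate this failure and let the next reset proceed
```

With such a counter, two back-to-back failures caused by failovers do not by themselves disable the device.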
      
      Fixes: f15fde9d ("ibmvnic: delay next reset if hard reset fails")
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
      Reviewed-by: Dany Madden <drt@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: fix ip option filtering for locally generated fragments · 27a8caa5
      Jakub Kicinski authored
      During IP fragmentation we sanitize IP options. This means overwriting
      options which should not be copied with NOPs. Only the first fragment
      has the original, full options.
      
      ip_fraglist_prepare() copies the IP header and options from previous
      fragment to the next one. Commit 19c3401a ("net: ipv4: place control
      buffer handling away from fragmentation iterators") moved sanitizing
      options before ip_fraglist_prepare() which means options are sanitized
      and then overwritten again with the old values.
      
      Fixing this is not enough, however, nor did the sanitization work
      prior to the aforementioned commit.
      
      ip_options_fragment() (which does the sanitization) uses ipcb->opt.optlen
      for the length of the options. ipcb->opt of fragments is not populated
      (it's 0), only the head skb has the state properly built. So even when
      called at the right time ip_options_fragment() does nothing. This seems
      to date back all the way to v2.5.44, when the fast path for pre-fragmented
      skbs was introduced. Prior to that ip_options_build() would have been
      called for every fragment (in fact ever since v2.5.44 the fragmentation
      handling in ip_options_build() has been dead code, I'll clean it up in
      -next).
      
      In the original patch (see Link) caixf mentions fixing the handling
      for fragments other than the second one, but I'm not sure how _any_
      fragment could have had their options sanitized with the code
      as it stood.
      
      Tested with Python (MTU on lo lowered to 1000 to force fragmentation):
      
        import socket
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_OPTIONS,
                     bytearray([7,4,5,192, 20|0x80,4,1,0]))
        s.sendto(b'1'*2000, ('127.0.0.1', 1234))
      
      Before:
      
      IP (tos 0x0, ttl 64, id 1053, offset 0, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
          localhost.36500 > localhost.search-agent: UDP, length 2000
      IP (tos 0x0, ttl 64, id 1053, offset 968, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
          localhost > localhost: udp
      IP (tos 0x0, ttl 64, id 1053, offset 1936, flags [none], proto UDP (17), length 100, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
          localhost > localhost: udp
      
      After:
      
      IP (tos 0x0, ttl 96, id 42549, offset 0, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
          localhost.51607 > localhost.search-agent: UDP, bad length 2000 > 960
      IP (tos 0x0, ttl 96, id 42549, offset 968, flags [+], proto UDP (17), length 996, options (NOP,NOP,NOP,NOP,RA value 256))
          localhost > localhost: udp
      IP (tos 0x0, ttl 96, id 42549, offset 1936, flags [none], proto UDP (17), length 100, options (NOP,NOP,NOP,NOP,RA value 256))
          localhost > localhost: udp
      
      RA (20 | 0x80) is now copied as expected, RR (7) is "NOPed out".
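
The sanitization step itself can be sketched in Python. This is a user-space illustration of the rule "options without the copy flag (0x80) are overwritten with NOPs", not the kernel function; `sanitize_options_for_fragment` is a hypothetical name.

```python
IPOPT_END = 0      # end of option list
IPOPT_NOOP = 1     # no-operation, single byte
IPOPT_COPY = 0x80  # "copied" flag in the option type byte


def sanitize_options_for_fragment(options: bytes) -> bytes:
    """Return a copy of the options with non-copied options NOPed out."""
    out = bytearray(options)
    i = 0
    while i < len(out):
        opt_type = out[i]
        if opt_type == IPOPT_END:
            break
        if opt_type == IPOPT_NOOP:
            i += 1
            continue
        if i + 1 >= len(out):
            break  # truncated option, stop parsing
        optlen = out[i + 1]
        if optlen < 2 or i + optlen > len(out):
            break  # malformed length, stop parsing
        if not (opt_type & IPOPT_COPY):
            out[i:i + optlen] = bytes([IPOPT_NOOP]) * optlen
        i += optlen
    return bytes(out)


# Same option bytes as in the test script above.
opts = bytes([7, 4, 5, 192, 20 | 0x80, 4, 1, 0])
print(sanitize_options_for_fragment(opts).hex(" "))  # 01 01 01 01 94 04 01 00
```

The four RR bytes become NOPs and the RA option is kept, which matches the options shown for the second and third fragments in the "After" capture.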
      
      Link: https://lore.kernel.org/netdev/20220107080559.122713-1-ooppublic@163.com/
      Fixes: 19c3401a ("net: ipv4: place control buffer handling away from fragmentation iterators")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: caixf <ooppublic@163.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-procfs: show net devices bound packet types · 1d10f8a1
      Jianguo Wu authored
      After commit 7866a621 ("dev: add per net_device packet type chains"),
      packet types that are bound to a specific net device no longer appear in
      /proc/net/ptype. This patch fixes the regression.
      
      Run "tcpdump -i ens192 udp -nns0" before and after applying this patch:
      
      Before:
        [root@localhost ~]# cat /proc/net/ptype
        Type Device      Function
        0800          ip_rcv
        0806          arp_rcv
        86dd          ipv6_rcv
      
      After:
        [root@localhost ~]# cat /proc/net/ptype
        Type Device      Function
        ALL  ens192   tpacket_rcv
        0800          ip_rcv
        0806          arp_rcv
        86dd          ipv6_rcv
      
      v1 -> v2:
        - fix the regression rather than adding new /proc API as
          suggested by Stephen Hemminger.
      
      Fixes: 7866a621 ("dev: add per net_device packet type chains")
      Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: use rcu_dereference_rtnl when get bonding active slave · aa603467
      Hangbin Liu authored
      bond_option_active_slave_get_rcu() should not be used under rtnl_mutex as
      it uses rcu_dereference(). Replace that with rcu_dereference_rtnl() so this
      function can also be used in an rtnl-protected context.
      
      With this update, we can remove the rcu_read_lock/unlock in
      bonding .ndo_eth_ioctl and .get_ts_info.
      
      Reported-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Fixes: 94dd016a ("bond: pass get_ts_info and SIOC[SG]HWTSTAMP ioctl to active device")
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sfp: ignore disabled SFP node · 2148927e
      Marek Behún authored
      Commit ce0aa27f ("sfp: add sfp-bus to bridge between network devices
      and sfp cages") added code which finds the SFP bus DT node even if the
      node is disabled with status = "disabled". Because of this, when phylink
      is created, it ends up with a non-null .sfp_bus member, even though the
      SFP module is not probed (because the node is disabled).
      
      We need to ignore a disabled SFP bus node.
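
A minimal Python sketch of the availability check implied above. It models a device tree node as a dict of properties; `node_is_available` and `find_sfp_bus` are hypothetical helpers, and treating a missing status or a status of "okay"/"ok" as enabled follows the usual devicetree convention.

```python
from typing import Optional


def node_is_available(props: dict) -> bool:
    """A node counts as enabled if it has no status or status is okay/ok."""
    status = props.get("status")
    return status is None or status in ("okay", "ok")


def find_sfp_bus(sfp_node: Optional[dict]) -> Optional[dict]:
    """Return the SFP node only if it exists and is not disabled."""
    if sfp_node is None or not node_is_available(sfp_node):
        return None  # behave as if no SFP bus were referenced at all
    return sfp_node
```

With this check, a node carrying status = "disabled" yields no bus, so the consumer never ends up holding a reference to an SFP bus that will not be probed.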
      
      Fixes: ce0aa27f ("sfp: add sfp-bus to bridge between network devices and sfp cages")
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Cc: stable@vger.kernel.org # 2203cbf2 ("net: sfp: move fwnode parsing into sfp-bus layer")
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 22 Jan, 2022 3 commits
  3. 21 Jan, 2022 21 commits
  4. 20 Jan, 2022 7 commits