1. 23 Mar, 2022 2 commits
    • ice: don't allow to run ice_send_event_to_aux() in atomic ctx · 5a315693
      Alexander Lobakin authored
      ice_send_event_to_aux() eventually descends to mutex_lock()
      (-> might_sleep()), so it must not be called from non-task
      context. However, at least two fixes have already been needed for
      bug splats caused by this function being called from atomic
      context.
      To make the emergency landings softer, bail out early when executed
      in non-task context, emitting a warn splat only once. This way we
      trade some events being potentially lost for system stability and
      avoid any related hangs and crashes.
      
      Fixes: 348048e7 ("ice: Implement iidc operations")
      Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Tested-by: Michal Kubiak <michal.kubiak@intel.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Acked-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
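The early bail-out described in the commit above can be sketched in userspace C. This is only an illustration under stated assumptions, not the driver's code: `in_task_ctx`, `warn_once_if()`, `send_event_to_aux()` and the two counters are hypothetical stand-ins for the kernel's context check and warn-once machinery.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for "are we running in task context?". */
static bool in_task_ctx = true;

static int events_delivered;
static int warnings_emitted;

/* Emit a warning the first time cond is true; stay silent afterwards.
 * Always returns cond so it can be used directly in an if (). */
static bool warn_once_if(bool cond)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		warnings_emitted++;
		fprintf(stderr, "WARNING: event sender called from non-task context\n");
	}
	return cond;
}

/* Returns true if the event was delivered, false if it was dropped. */
static bool send_event_to_aux(int event)
{
	/* Bail out before anything that could sleep is reached. */
	if (warn_once_if(!in_task_ctx))
		return false;

	(void)event;        /* a real sender would take a sleeping lock here */
	events_delivered++;
	return true;
}
```

The trade-off matches the commit text: an event raised from the wrong context is lost, but the caller never reaches a sleeping lock.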
    • ice: fix 'scheduling while atomic' on aux critical err interrupt · 32d53c0a
      Alexander Lobakin authored
      There's a kernel BUG splat on processing aux critical error
      interrupts in ice_misc_intr():
      
      [ 2100.917085] BUG: scheduling while atomic: swapper/15/0/0x00010000
      ...
      [ 2101.060770] Call Trace:
      [ 2101.063229]  <IRQ>
      [ 2101.065252]  dump_stack+0x41/0x60
      [ 2101.068587]  __schedule_bug.cold.100+0x4c/0x58
      [ 2101.073060]  __schedule+0x6a4/0x830
      [ 2101.076570]  schedule+0x35/0xa0
      [ 2101.079727]  schedule_preempt_disabled+0xa/0x10
      [ 2101.084284]  __mutex_lock.isra.7+0x310/0x420
      [ 2101.088580]  ? ice_misc_intr+0x201/0x2e0 [ice]
      [ 2101.093078]  ice_send_event_to_aux+0x25/0x70 [ice]
      [ 2101.097921]  ice_misc_intr+0x220/0x2e0 [ice]
      [ 2101.102232]  __handle_irq_event_percpu+0x40/0x180
      [ 2101.106965]  handle_irq_event_percpu+0x30/0x80
      [ 2101.111434]  handle_irq_event+0x36/0x53
      [ 2101.115292]  handle_edge_irq+0x82/0x190
      [ 2101.119148]  handle_irq+0x1c/0x30
      [ 2101.122480]  do_IRQ+0x49/0xd0
      [ 2101.125465]  common_interrupt+0xf/0xf
      [ 2101.129146]  </IRQ>
      ...
      
      As Andrew correctly mentioned previously[0], the following call
      ladder happens:
      
      ice_misc_intr() <- hardirq
        ice_send_event_to_aux()
          device_lock()
            mutex_lock()
              might_sleep()
                might_resched() <- oops
      
      Add a new PF state bit which indicates that an aux critical error
      occurred and serve it in ice_service_task() in process context.
      The new ice_pf::oicr_err_reg is read-write in both hardirq and
      process contexts, but just 3 bits of non-critical data probably
      aren't worth explicit synchronization (and they're even in the same
      byte [31:24]).
      
      [0] https://lore.kernel.org/all/YeSRUVmrdmlUXHDn@lunn.ch
      
      Fixes: 348048e7 ("ice: Implement iidc operations")
      Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Tested-by: Michal Kubiak <michal.kubiak@intel.com>
      Acked-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
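The "flag it in hardirq, handle it in process context" pattern from the commit above can be sketched as follows. All names (`pf_sketch`, `misc_intr_sketch()`, `service_task_sketch()`) are hypothetical, and the sketch uses C11 atomics for simplicity, whereas the commit explicitly chooses not to synchronize the stashed register bits.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical miniature of the PF state; illustrative names only. */
struct pf_sketch {
	atomic_bool aux_err_pending;    /* "an aux critical error occurred" bit */
	_Atomic uint32_t oicr_err_reg;  /* error bits stashed by the IRQ handler */
	uint32_t last_reported;         /* bits the service task handled last */
	int events_sent;                /* how many notifications were issued */
};

/* Hard-IRQ side: must not sleep, so it only records the bits and flags them. */
static void misc_intr_sketch(struct pf_sketch *pf, uint32_t oicr_bits)
{
	atomic_fetch_or(&pf->oicr_err_reg, oicr_bits);
	atomic_store(&pf->aux_err_pending, true);
}

/* Process-context side: free to take sleeping locks while reporting. */
static void service_task_sketch(struct pf_sketch *pf)
{
	/* Nothing to do unless the IRQ handler flagged an error. */
	if (!atomic_exchange(&pf->aux_err_pending, false))
		return;

	pf->last_reported = atomic_exchange(&pf->oicr_err_reg, 0);
	pf->events_sent++;  /* stand-in for notifying the aux driver */
}
```

Several interrupts that arrive before the service task runs collapse into one notification carrying the union of their error bits.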
  2. 22 Mar, 2022 4 commits
  3. 21 Mar, 2022 5 commits
    • tcp: ensure PMTU updates are processed during fastopen · ed0c99dc
      Jakub Kicinski authored
      tp->rx_opt.mss_clamp is not yet populated during TFO send, so we
      raise it to the local MSS. tp->mss_cache is not updated, however:
      
      tcp_v6_connect():
        tp->rx_opt.mss_clamp = IPV6_MIN_MTU - headers;
        tcp_connect():
           tcp_connect_init():
             tp->mss_cache = min(mtu, tp->rx_opt.mss_clamp)
           tcp_send_syn_data():
             tp->rx_opt.mss_clamp = tp->advmss
      
      After recent fixes to ICMPv6 PTB handling we started dropping
      PMTU updates higher than tp->mss_cache. Because of the stale
      tp->mss_cache value PMTU updates during TFO are always dropped.
      
      Thanks to Wei for helping zero in on the problem and the fix!
      
      Fixes: c7bb4b89 ("ipv6: tcp: drop silly ICMPv6 packet too big messages")
      Reported-by: Andre Nash <alnash@fb.com>
      Reported-by: Neil Spring <ntspring@fb.com>
      Reviewed-by: Wei Wang <weiwan@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220321165957.1769954-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
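A toy model may help show why the stale cache in the commit above swallows PMTU updates. Everything here is an assumption made for illustration: the struct, the function names, the fixed 60-byte header overhead, and in particular the "resync after raising the clamp" step, which is one plausible reading of the fix rather than something the commit text spells out.

```c
#include <stdbool.h>

/* Assumed fixed overhead for the toy model: 40 (IPv6) + 20 (TCP) bytes. */
#define HDR_OVERHEAD        60
#define IPV6_MIN_MTU_SKETCH 1280

struct tp_sketch {
	int mss_clamp;  /* plays the role of tp->rx_opt.mss_clamp */
	int mss_cache;  /* plays the role of tp->mss_cache */
	int advmss;     /* local MSS */
	int path_mtu;   /* MTU currently assumed for the path */
};

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Recompute the cached MSS from the current path MTU and clamp. */
static void sync_mss(struct tp_sketch *tp)
{
	tp->mss_cache = min_int(tp->path_mtu - HDR_OVERHEAD, tp->mss_clamp);
}

/* Mimics the call sequence quoted in the commit message. */
static void connect_with_tfo(struct tp_sketch *tp, bool resync_after_clamp)
{
	tp->mss_clamp = IPV6_MIN_MTU_SKETCH - HDR_OVERHEAD;
	sync_mss(tp);                /* tcp_connect_init() step */
	tp->mss_clamp = tp->advmss;  /* tcp_send_syn_data() step */
	if (resync_after_clamp)
		sync_mss(tp);        /* hypothetical fix: refresh the cache */
}

/* Packet-too-big handler: updates that would not lower the cached MSS
 * are dropped. Returns true if the update was applied. */
static bool handle_ptb(struct tp_sketch *tp, int new_mtu)
{
	if (new_mtu - HDR_OVERHEAD >= tp->mss_cache)
		return false;
	tp->path_mtu = new_mtu;
	sync_mss(tp);
	return true;
}
```

With a local MSS of 1440 and no resync, the cache stays at 1220, so a packet-too-big message advertising an MTU of 1400 is rejected even though the connection is really sending larger segments.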
    • net: bcmgenet: Use stronger register read/writes to assure ordering · 8d3ea3d4
      Jeremy Linton authored
      GCC12 appears to be much smarter about its dependency tracking and is
      aware that the relaxed variants are just normal loads and stores and
      this is causing problems like:
      
      [  210.074549] ------------[ cut here ]------------
      [  210.079223] NETDEV WATCHDOG: enabcm6e4ei0 (bcmgenet): transmit queue 1 timed out
      [  210.086717] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:529 dev_watchdog+0x234/0x240
      [  210.095044] Modules linked in: genet(E) nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat]
      [  210.146561] ACPI CPPC: PCC check channel failed for ss: 0. ret=-110
      [  210.146927] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E     5.17.0-rc7G12+ #58
      [  210.153226] CPPC Cpufreq:cppc_scale_freq_workfn: failed to read perf counters
      [  210.161349] Hardware name: Raspberry Pi Foundation Raspberry Pi 4 Model B/Raspberry Pi 4 Model B, BIOS EDK2-DEV 02/08/2022
      [  210.161353] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [  210.161358] pc : dev_watchdog+0x234/0x240
      [  210.161364] lr : dev_watchdog+0x234/0x240
      [  210.161368] sp : ffff8000080a3a40
      [  210.161370] x29: ffff8000080a3a40 x28: ffffcd425af87000 x27: ffff8000080a3b20
      [  210.205150] x26: ffffcd425aa00000 x25: 0000000000000001 x24: ffffcd425af8ec08
      [  210.212321] x23: 0000000000000100 x22: ffffcd425af87000 x21: ffff55b142688000
      [  210.219491] x20: 0000000000000001 x19: ffff55b1426884c8 x18: ffffffffffffffff
      [  210.226661] x17: 64656d6974203120 x16: 0000000000000001 x15: 6d736e617274203a
      [  210.233831] x14: 2974656e65676d63 x13: ffffcd4259c300d8 x12: ffffcd425b07d5f0
      [  210.241001] x11: 00000000ffffffff x10: ffffcd425b07d5f0 x9 : ffffcd4258bdad9c
      [  210.248171] x8 : 00000000ffffdfff x7 : 000000000000003f x6 : 0000000000000000
      [  210.255341] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000001000
      [  210.262511] x2 : 0000000000001000 x1 : 0000000000000005 x0 : 0000000000000044
      [  210.269682] Call trace:
      [  210.272133]  dev_watchdog+0x234/0x240
      [  210.275811]  call_timer_fn+0x3c/0x15c
      [  210.279489]  __run_timers.part.0+0x288/0x310
      [  210.283777]  run_timer_softirq+0x48/0x80
      [  210.287716]  __do_softirq+0x128/0x360
      [  210.291392]  __irq_exit_rcu+0x138/0x140
      [  210.295243]  irq_exit_rcu+0x1c/0x30
      [  210.298745]  el1_interrupt+0x38/0x54
      [  210.302334]  el1h_64_irq_handler+0x18/0x24
      [  210.306445]  el1h_64_irq+0x7c/0x80
      [  210.309857]  arch_cpu_idle+0x18/0x2c
      [  210.313445]  default_idle_call+0x4c/0x140
      [  210.317470]  cpuidle_idle_call+0x14c/0x1a0
      [  210.321584]  do_idle+0xb0/0x100
      [  210.324737]  cpu_startup_entry+0x30/0x8c
      [  210.328675]  secondary_start_kernel+0xe4/0x110
      [  210.333138]  __secondary_switched+0x94/0x98
      
      The assumption when these were relaxed seems to be that device memory
      would be mapped non-reordering, and that other constructs
      (spinlocks, etc.) would provide the barriers to ensure that packet data
      and in-memory rings/queues were ordered with respect to device
      register reads/writes. This itself seems a bit sketchy, but the real
      problem with GCC12 is that it is moving the actual reads/writes around
      at will as though they were independent operations when in truth they
      are not, but the compiler can't know that. When looking at the
      assembly dumps for many of these routines it's possible to see very
      clean, but not strictly in-program-order, operations occurring, as the
      compiler would be free to do if these weren't actually register
      read/write operations.
      
      It's possible to suppress the timeout with a liberal bit of dma_mb()'s
      sprinkled around, but the device still seems unable to reliably
      send/receive data. A better plan is to use the safer readl/writel
      everywhere.
      
      This partially reverts an older commit, which notes the use of
      the relaxed variants for performance reasons. I would suggest that
      any performance problems with this commit be addressed by relaxing only
      the performance-critical code paths after ensuring proper barriers.
      
      Fixes: 69d2ea9c ("net: bcmgenet: Use correct I/O accessors")
      Reported-by: Peter Robinson <pbrobinson@gmail.com>
      Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
      Acked-by: Peter Robinson <pbrobinson@gmail.com>
      Tested-by: Peter Robinson <pbrobinson@gmail.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20220310045358.224350-1-jeremy.linton@arm.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'ax25-fixes' · ed32641e
      David S. Miller authored
      Duoming Zhou says:
      
      ====================
      Fix refcount leak and NPD bugs in ax25
      
      The first patch fixes refcount leak in ax25 that could cause
      ax25-ex-connected-session-now-listening-state-bug.
      
      The second patch fixes NPD bugs in ax25 timers.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ax25: Fix NULL pointer dereferences in ax25 timers · fc6d01ff
      Duoming Zhou authored
      The previous commit 7ec02f5a ("ax25: fix NPD bug in ax25_disconnect")
      moved ax25_disconnect() under lock_sock() in order to prevent NPD bugs. But
      there are race conditions that may lead to NULL pointer dereferences in
      ax25_heartbeat_expiry(), ax25_t1timer_expiry(), ax25_t2timer_expiry(),
      ax25_t3timer_expiry() and ax25_idletimer_expiry(), when we use
      ax25_kill_by_device() to detach the ax25 device.
      
      One of the race conditions that cause null pointer dereferences can be
      shown as below:
      
            (Thread 1)                    |      (Thread 2)
      ax25_connect()                      |
       ax25_std_establish_data_link()     |
        ax25_start_t1timer()              |
         mod_timer(&ax25->t1timer,..)     |
                                          | ax25_kill_by_device()
         (wait a time)                    |  ...
                                          |  s->ax25_dev = NULL; //(1)
         ax25_t1timer_expiry()            |
          ax25->ax25_dev->values[..] //(2)|  ...
           ...                            |
      
      We set ax25_cb->ax25_dev to NULL at position (1) and dereference
      the NULL pointer at position (2).
      
      The corresponding fail log is shown below:
      ===============================================================
      BUG: kernel NULL pointer dereference, address: 0000000000000050
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.17.0-rc6-00794-g45690b7d0
      RIP: 0010:ax25_t1timer_expiry+0x12/0x40
      ...
      Call Trace:
       call_timer_fn+0x21/0x120
       __run_timers.part.0+0x1ca/0x250
       run_timer_softirq+0x2c/0x60
       __do_softirq+0xef/0x2f3
       irq_exit_rcu+0xb6/0x100
       sysvec_apic_timer_interrupt+0xa2/0xd0
      ...
      
      This patch moves ax25_disconnect() before s->ax25_dev = NULL
      and uses del_timer_sync() to delete timers in ax25_disconnect().
      If ax25_disconnect() is called by ax25_kill_by_device() or
      ax25->ax25_dev is NULL, the reason in ax25_disconnect() will be
      equal to ENETUNREACH, and it will wait for all timers to stop before
      we set s->ax25_dev to NULL in ax25_kill_by_device().
      
      Fixes: 7ec02f5a ("ax25: fix NPD bug in ax25_disconnect")
      Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ax25: Fix refcount leaks caused by ax25_cb_del() · 9fd75b66
      Duoming Zhou authored
      The previous commit d01ffb9e ("ax25: add refcount in ax25_dev to
      avoid UAF bugs") and commit feef318c ("ax25: fix UAF bugs of
      net_device caused by rebinding operation") increase the refcounts of
      ax25_dev and net_device in ax25_bind() and decrease the matching refcounts
      in ax25_kill_by_device() in order to prevent UAF bugs, but there are
      reference count leaks.
      
      The root cause of refcount leaks is shown below:
      
           (Thread 1)                      |      (Thread 2)
      ax25_bind()                          |
       ...                                 |
       ax25_addr_ax25dev()                 |
        ax25_dev_hold()   //(1)            |
        ...                                |
       dev_hold_track()   //(2)            |
       ...                                 | ax25_destroy_socket()
                                           |  ax25_cb_del()
                                           |   ...
                                           |   hlist_del_init() //(3)
                                           |
                                           |
           (Thread 3)                      |
      ax25_kill_by_device()                |
       ...                                 |
       ax25_for_each(s, &ax25_list) {      |
        if (s->ax25_dev == ax25_dev) //(4) |
         ...                               |
      
      First, ax25_bind() increases the refcount of ax25_dev at
      position (1) and the refcount of net_device at position (2).
      Then ax25_cb_del(), invoked by ax25_destroy_socket(), deletes the
      ax25_cb from the hlist at position (3) before ax25_kill_by_device() runs.
      Finally, the refcount decrements in ax25_kill_by_device() are never
      executed, because no s->ax25_dev equals ax25_dev at position (4).
      
      This patch adds refcount decrements to ax25_release() and uses
      lock_sock() for synchronization. If the refcounts are decremented in
      ax25_release(), the decrements in ax25_kill_by_device() will not be
      executed, and vice versa.
      
      Fixes: d01ffb9e ("ax25: add refcount in ax25_dev to avoid UAF bugs")
      Fixes: 87563a04 ("ax25: fix reference count leaks of ax25_dev")
      Fixes: feef318c ("ax25: fix UAF bugs of net_device caused by rebinding operation")
      Reported-by: Thomas Osterried <thomas@osterried.de>
      Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 19 Mar, 2022 1 commit
    • af_netlink: Fix shift out of bounds in group mask calculation · 0caf6d99
      Petr Machata authored
      When a netlink message is received, netlink_recvmsg() fills in the address
      of the sender. One of the fields is the 32-bit bitfield nl_groups, which
      carries the multicast group on which the message was received. The least
      significant bit corresponds to group 1, and therefore the highest group
      that the field can represent is 32. Above that, the UB sanitizer flags the
      out-of-bounds shift attempts.
      
      Which bits end up being set in such a case is implementation defined, but
      it's either going to be a wrong non-zero value, or zero, which is at least
      not misleading. Make the latter choice deterministic by always setting the
      mask to 0 for higher-numbered multicast groups.
      
      To get information about membership in groups > 32, userspace is expected
      to use nl_pktinfo control messages[0], which are enabled by NETLINK_PKTINFO
      socket option.
      [0] https://lwn.net/Articles/147608/
      
      The way to trigger this issue is e.g. through monitoring the BRVLAN group:
      
      	# bridge monitor vlan &
      	# ip link add name br type bridge
      
      Which produces the following citation:
      
      	UBSAN: shift-out-of-bounds in net/netlink/af_netlink.c:162:19
      	shift exponent 32 is too large for 32-bit type 'int'
      
      Fixes: f7fa9b10 ("[NETLINK]: Support dynamic number of multicast groups per netlink family")
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Link: https://lore.kernel.org/r/2bef6aabf201d1fc16cca139a744700cff9dcb04.1647527635.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
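The mask calculation described in the commit above can be sketched in a few lines of standalone C. `group_mask_sketch()` is a hypothetical name chosen for this illustration; it follows the behaviour the message describes (group 1 maps to the least significant bit, groups above 32 yield 0) without claiming to be the kernel's exact helper.

```c
#include <stdint.h>

/* Map a 1-based multicast group number to its bit in a 32-bit mask.
 * Group 0 and groups above 32 have no representable bit, so the result
 * is 0 instead of an out-of-range shift. */
static uint32_t group_mask_sketch(uint32_t group)
{
	if (group == 0 || group > 32)
		return 0;
	return UINT32_C(1) << (group - 1);
}
```

Using an unsigned constant for the shifted value also keeps the group-32 case well defined, since shifting into the sign bit of a signed `int` is itself problematic.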
  5. 18 Mar, 2022 10 commits
    • mptcp: Fix crash due to tcp_tsorted_anchor was initialized before release skb · 3ef3905a
      Yonglong Li authored
      Got a crash when running an mptcp pressure test:
      
      ===========================================================================
      dst_release: dst:ffffa06ce6e5c058 refcnt:-1
      kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
      BUG: unable to handle kernel paging request at ffffa06ce6e5c058
      PGD 190a01067 P4D 190a01067 PUD 43fffb067 PMD 22e403063 PTE 8000000226e5c063
      Oops: 0011 [#1] SMP PTI
      CPU: 7 PID: 7823 Comm: kworker/7:0 Kdump: loaded Tainted: G            E
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.2.1 04/01/2014
      Call Trace:
       ? skb_release_head_state+0x68/0x100
       ? skb_release_all+0xe/0x30
       ? kfree_skb+0x32/0xa0
       ? mptcp_sendmsg_frag+0x57e/0x750
       ? __mptcp_retrans+0x21b/0x3c0
       ? __switch_to_asm+0x35/0x70
       ? mptcp_worker+0x25e/0x320
       ? process_one_work+0x1a7/0x360
       ? worker_thread+0x30/0x390
       ? create_worker+0x1a0/0x1a0
       ? kthread+0x112/0x130
       ? kthread_flush_work_fn+0x10/0x10
       ? ret_from_fork+0x35/0x40
      ===========================================================================
      
      In __mptcp_alloc_tx_skb() the skb is allocated and
      skb->tcp_tsorted_anchor is initialized. Under memory pressure,
      sk_wmem_schedule() returns false and the skb is then released with
      kfree_skb(). In this case skb->_skb_refdst is not NULL, because
      _skb_refdst and tcp_tsorted_anchor share the same memory, so
      kfree_skb() tries to release a dst and crashes.
      
      Fixes: f70cad10 ("mptcp: stop relying on tcp_tx_skb_cache")
      Reviewed-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Link: https://lore.kernel.org/r/20220317220953.426024-1-mathew.j.martineau@linux.intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
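The aliasing problem in the commit above is easy to reproduce in miniature. The structures below are hypothetical and heavily simplified: they only assume, as the message states, that the dst reference and the list anchor share storage. Initializing the anchor only after the memory check succeeds is one way to avoid the problem; the text above does not say whether the actual patch does exactly that.

```c
#include <stdint.h>

struct list_head_sketch {
	struct list_head_sketch *next, *prev;
};

/* Simplified skb: the dst reference and the list anchor overlap. */
struct skb_sketch {
	union {
		uintptr_t refdst;                        /* like _skb_refdst */
		struct list_head_sketch tsorted_anchor;  /* like tcp_tsorted_anchor */
	};
};

/* An empty list head points at itself, so it is never all-zero. */
static void init_list_head(struct list_head_sketch *l)
{
	l->next = l;
	l->prev = l;
}

/* What a release path would conclude about holding a dst reference. */
static int looks_like_has_dst(const struct skb_sketch *skb)
{
	return skb->refdst != 0;
}

/* Returns 1 if freeing the skb on the failure path would try to drop a
 * dst that was never attached. */
static int alloc_fails_with_bogus_dst(int init_anchor_early)
{
	struct skb_sketch skb = { 0 };

	if (init_anchor_early)
		init_list_head(&skb.tsorted_anchor);

	/* Assume the memory-accounting check fails here and the skb is freed. */
	return looks_like_has_dst(&skb);
}
```

In this sketch, `alloc_fails_with_bogus_dst(1)` reports the bogus reference while `alloc_fails_with_bogus_dst(0)` does not.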
    • Merge branch 'ipv4-handle-tos-and-scope-properly-for-icmp-redirects-and-pmtu-updates' · 03e2777c
      Jakub Kicinski authored
      Guillaume Nault says:
      
      ====================
      ipv4: Handle TOS and scope properly for ICMP redirects and PMTU updates
      
      ICMPv4 PMTU and redirect handlers didn't properly initialise the
      struct flowi4 they used for route lookups:
      
        * ECN bits sometimes weren't cleared from ->flowi4_tos.
        * The RTO_ONLINK flag wasn't taken into account for ->flowi4_scope.
      
      In some special cases, this resulted in ICMP redirects and PMTU updates
      not being taken into account because fib_lookup() couldn't retrieve the
      correct route.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1647519748.git.gnault@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftest: net: Test IPv4 PMTU exceptions with DSCP and ECN · ec730c3e
      Guillaume Nault authored
      Add two tests to pmtu.sh, for verifying that PMTU exceptions get
      properly created for routes that don't belong to the main table.
      
      A fib-rule based on the packet's DSCP field is used to jump to the
      correct table. ECN shouldn't interfere with this process, so each test
      has two components: one that only sets DSCP and one that sets both DSCP
      and ECN.
      
      One of the tests triggers PMTU exceptions using ICMP Echo Requests, the
      other using UDP packets (to test different handlers in the kernel).
      
      A few adjustments are necessary in the rest of the script to allow
      policy routing scenarios:
      
        * Add global variable rt_table that allows setup_routing_*() to
          add routes to a specific routing table. By default rt_table is set
          to "main", so existing tests don't need to be modified.
      
        * Another global variable, policy_mark, is used to define which
          dsfield value is used for policy routing. This variable has no
          effect on tests that don't use policy routing.
      
        * The UDP version of the test uses socat, so cleanup() now also needs
          to kill socat PIDs.
      
        * route_get_dst_pmtu_from_exception() and route_get_dst_exception()
          now take an optional third argument specifying the dsfield. If
          not specified, 0 is used, so existing users don't need to be
          modified.
      
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv4: Fix route lookups when handling ICMP redirects and PMTU updates · 544b4dd5
      Guillaume Nault authored
      The PMTU update and ICMP redirect helper functions initialise their fl4
      variable with either __build_flow_key() or build_sk_flow_key(). These
      initialisation functions always set ->flowi4_scope with
      RT_SCOPE_UNIVERSE and might set the ECN bits of ->flowi4_tos. This is
      not a problem when the route lookup is later done via
      ip_route_output_key_hash(), which properly clears the ECN bits from
      ->flowi4_tos and initialises ->flowi4_scope based on the RTO_ONLINK
      flag. However, some helpers call fib_lookup() directly, without
      sanitising the tos and scope fields, so the route lookup can fail and,
      as a result, the ICMP redirect or PMTU update isn't taken into
      account.
      
      Fix this by extracting the ->flowi4_tos and ->flowi4_scope sanitisation
      code into ip_rt_fix_tos(), then use this function in handlers that call
      fib_lookup() directly.
      
      Note 1: We can't sanitise ->flowi4_tos and ->flowi4_scope in a central
      place (like __build_flow_key() or flowi4_init_output()), because
      ip_route_output_key_hash() expects non-sanitised values. When called
      with sanitised values, it can erroneously overwrite RT_SCOPE_LINK with
      RT_SCOPE_UNIVERSE in ->flowi4_scope. Therefore we have to be careful to
      sanitise the values only for those paths that don't call
      ip_route_output_key_hash().
      
      Note 2: The problem is mostly about sanitising ->flowi4_tos. Having
      ->flowi4_scope initialised with RT_SCOPE_UNIVERSE instead of
      RT_SCOPE_LINK probably wasn't really a problem: sockets with the
      SOCK_LOCALROUTE flag set (those that'd result in RTO_ONLINK being set)
      normally shouldn't receive ICMP redirects or PMTU updates.
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions.")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
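The sanitisation step described in the commit above can be sketched as a small standalone function. The struct, the function name and the `_SKETCH` constants are hypothetical; the numeric values are written from memory of the kernel headers and should be treated as assumptions. The sketch also assumes that the on-link request travels as the lowest bit of the tos field.

```c
#include <stdint.h>

/* Assumed values, mirrored from memory of the kernel headers. */
#define IPTOS_RT_MASK_SKETCH     0x1C  /* TOS bits with the two ECN bits removed */
#define RTO_ONLINK_SKETCH        0x01  /* "on-link" flag carried in the tos field */
#define RT_SCOPE_UNIVERSE_SKETCH 0
#define RT_SCOPE_LINK_SKETCH     253

/* Hypothetical two-field stand-in for struct flowi4. */
struct flow_sketch {
	uint8_t tos;
	uint8_t scope;
};

/* Derive the scope from the on-link flag, then strip the flag and the
 * ECN bits so a direct FIB lookup sees a clean TOS value. */
static void fix_tos_sketch(struct flow_sketch *fl)
{
	uint8_t tos = fl->tos;

	fl->scope = (tos & RTO_ONLINK_SKETCH) ? RT_SCOPE_LINK_SKETCH
					      : RT_SCOPE_UNIVERSE_SKETCH;
	fl->tos = tos & IPTOS_RT_MASK_SKETCH;
}
```

For example, a tos of 0x12 (a TOS bit plus one ECN bit) becomes 0x10 with universe scope, while 0x1D becomes 0x1C with link scope.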
    • Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 6bd0c76b
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2022-03-18
      
      We've added 2 non-merge commits during the last 18 day(s) which contain
      a total of 2 files changed, 50 insertions(+), 20 deletions(-).
      
      The main changes are:
      
      1) Fix a race in XSK socket teardown code that can lead to a NULL pointer
         dereference, from Magnus.
      
      2) Small MAINTAINERS doc update to remove Lorenz from sockmap, from Lorenz.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        xsk: Fix race at socket teardown
        bpf: Remove Lorenz Bauer from L7 BPF maintainers
      ====================
      
      Link: https://lore.kernel.org/r/20220318152418.28638-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'af_unix-OOB-fixes' · 9905eed4
      David S. Miller authored
      Kuniyuki Iwashima says:
      
      ====================
      af_unix: Fix some OOB implementation.
      
      This series fixes some data-races and adds a missing feature around the
      commit 314001f0 ("af_unix: Add OOB support").
      
      Changelog:
        - v3:
          - Add the first patch
      
        - v2: https://lore.kernel.org/netdev/20220315054801.72035-1-kuniyu@amazon.co.jp/
          - Add READ_ONCE() to avoid a race reported by KCSAN (Eric)
          - Add IS_ENABLED(CONFIG_AF_UNIX_OOB) (Shoaib)
      
        - v1: https://lore.kernel.org/netdev/20220314052110.53634-1-kuniyu@amazon.co.jp/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_unix: Support POLLPRI for OOB. · d9a232d4
      Kuniyuki Iwashima authored
      The commit 314001f0 ("af_unix: Add OOB support") introduced OOB for
      AF_UNIX, but it lacks some changes for POLLPRI.  Let's add the missing
      piece.
      
      In the selftest, normal datagrams are sent followed by OOB data, so this
      commit replaces `POLLIN | POLLPRI` with just `POLLPRI` in the first test
      case.
      
      Fixes: 314001f0 ("af_unix: Add OOB support")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_unix: Fix some data-races around unix_sk(sk)->oob_skb. · e82025c6
      Kuniyuki Iwashima authored
      Out-of-band data automatically places a "mark" showing where in the
      sequence the out-of-band data would have been.  If the out-of-band data
      implies cancelling everything sent so far, the "mark" is helpful to flush
      them.  When the socket's read pointer reaches the "mark", the SIOCATMARK
      ioctl() sets the arg `atmark` to a non-zero value.
      
      The out-of-band data is queued in sk->sk_receive_queue as well as ordinary
      data and also saved in unix_sk(sk)->oob_skb.  It can be used to test if the
      head of the receive queue is the out-of-band data meaning the socket is at
      the "mark".
      
      While testing that, unix_ioctl() reads unix_sk(sk)->oob_skb locklessly.
      Thus, all accesses to oob_skb need some basic protection to avoid
      load/store tearing which KCSAN detects when these are called concurrently:
      
        - ioctl(fd_a, SIOCATMARK, &atmark, sizeof(atmark))
        - send(fd_b_connected_to_a, buf, sizeof(buf), MSG_OOB)
      
      BUG: KCSAN: data-race in unix_ioctl / unix_stream_sendmsg
      
      write to 0xffff888003d9cff0 of 8 bytes by task 175 on cpu 1:
       unix_stream_sendmsg (net/unix/af_unix.c:2087 net/unix/af_unix.c:2191)
       sock_sendmsg (net/socket.c:705 net/socket.c:725)
       __sys_sendto (net/socket.c:2040)
       __x64_sys_sendto (net/socket.c:2048)
       do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
      
      read to 0xffff888003d9cff0 of 8 bytes by task 176 on cpu 0:
       unix_ioctl (net/unix/af_unix.c:3101 (discriminator 1))
       sock_do_ioctl (net/socket.c:1128)
       sock_ioctl (net/socket.c:1242)
       __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:874 fs/ioctl.c:860 fs/ioctl.c:860)
       do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
      
      value changed: 0xffff888003da0c00 -> 0xffff888003da0d00
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 176 Comm: unix_race_oob_i Not tainted 5.17.0-rc5-59529-g83dc4c2a #12
      Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014
      
      Fixes: 314001f0 ("af_unix: Add OOB support")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
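The lockless "are we at the mark?" test from the commit above can be approximated in userspace with C11 relaxed atomics, which play a role similar to the kernel's READ_ONCE()/WRITE_ONCE() here: each access is a single, untorn load or store. The types and function names are hypothetical and the receive queue is reduced to its head pointer.

```c
#include <stdatomic.h>
#include <stddef.h>

struct skb_stub {
	int len;
};

/* Hypothetical reduction of the socket state relevant to the mark test. */
struct unix_sock_sketch {
	_Atomic(struct skb_stub *) oob_skb;  /* latest OOB skb, or NULL */
	struct skb_stub *queue_head;         /* head of the receive queue */
};

/* Writer side (send with MSG_OOB): publish the pointer in one store. */
static void publish_oob(struct unix_sock_sketch *u, struct skb_stub *skb)
{
	atomic_store_explicit(&u->oob_skb, skb, memory_order_relaxed);
}

/* Reader side (SIOCATMARK): take one untorn snapshot of the pointer,
 * then compare it with the head of the receive queue. */
static int at_mark(struct unix_sock_sketch *u)
{
	struct skb_stub *oob =
		atomic_load_explicit(&u->oob_skb, memory_order_relaxed);

	return oob != NULL && oob == u->queue_head;
}
```

Relaxed ordering is enough for this purpose because the reader only needs a consistent pointer value, not ordering against other memory.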
    • ibmvnic: fix race between xmit and reset · 4219196d
      Sukadev Bhattiprolu authored
      There is a race between reset and the transmit paths that can lead to
      ibmvnic_xmit() accessing an scrq after it has been freed in the reset
      path. It can result in a crash like:
      
      	Kernel attempted to read user page (0) - exploit attempt? (uid: 0)
      	BUG: Kernel NULL pointer dereference on read at 0x00000000
      	Faulting instruction address: 0xc0080000016189f8
      	Oops: Kernel access of bad area, sig: 11 [#1]
      	...
      	NIP [c0080000016189f8] ibmvnic_xmit+0x60/0xb60 [ibmvnic]
      	LR [c000000000c0046c] dev_hard_start_xmit+0x11c/0x280
      	Call Trace:
      	[c008000001618f08] ibmvnic_xmit+0x570/0xb60 [ibmvnic] (unreliable)
      	[c000000000c0046c] dev_hard_start_xmit+0x11c/0x280
      	[c000000000c9cfcc] sch_direct_xmit+0xec/0x330
      	[c000000000bfe640] __dev_xmit_skb+0x3a0/0x9d0
      	[c000000000c00ad4] __dev_queue_xmit+0x394/0x730
      	[c008000002db813c] __bond_start_xmit+0x254/0x450 [bonding]
      	[c008000002db8378] bond_start_xmit+0x40/0xc0 [bonding]
      	[c000000000c0046c] dev_hard_start_xmit+0x11c/0x280
      	[c000000000c00ca4] __dev_queue_xmit+0x564/0x730
      	[c000000000cf97e0] neigh_hh_output+0xd0/0x180
      	[c000000000cfa69c] ip_finish_output2+0x31c/0x5c0
      	[c000000000cfd244] __ip_queue_xmit+0x194/0x4f0
      	[c000000000d2a3c4] __tcp_transmit_skb+0x434/0x9b0
      	[c000000000d2d1e0] __tcp_retransmit_skb+0x1d0/0x6a0
      	[c000000000d2d984] tcp_retransmit_skb+0x34/0x130
      	[c000000000d310e8] tcp_retransmit_timer+0x388/0x6d0
      	[c000000000d315ec] tcp_write_timer_handler+0x1bc/0x330
      	[c000000000d317bc] tcp_write_timer+0x5c/0x200
      	[c000000000243270] call_timer_fn+0x50/0x1c0
      	[c000000000243704] __run_timers.part.0+0x324/0x460
      	[c000000000243894] run_timer_softirq+0x54/0xa0
      	[c000000000ea713c] __do_softirq+0x15c/0x3e0
      	[c000000000166258] __irq_exit_rcu+0x158/0x190
      	[c000000000166420] irq_exit+0x20/0x40
      	[c00000000002853c] timer_interrupt+0x14c/0x2b0
      	[c000000000009a00] decrementer_common_virt+0x210/0x220
      	--- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x2c
      
      The immediate cause of the crash is the access of tx_scrq in the following
      snippet during a reset, where the tx_scrq can be either NULL or an address
      that will soon be invalid:
      
      	ibmvnic_xmit()
      	{
      		...
      		tx_scrq = adapter->tx_scrq[queue_num];
      		txq = netdev_get_tx_queue(netdev, queue_num);
      		ind_bufp = &tx_scrq->ind_buf;
      
      		if (test_bit(0, &adapter->resetting)) {
      		...
      	}
      
      But beyond that, the call to ibmvnic_xmit() itself is not safe during a
      reset, and the reset path attempts to prevent it by stopping the queue in
      ibmvnic_cleanup(). However, just after the queue is stopped, an in-flight
      ibmvnic_complete_tx() can restart the queue even as the reset is
      progressing.
      
      Once the queue has been restarted, we can get a call to ibmvnic_xmit(),
      which can then access the bad tx_scrq (or other fields).
      
      We cannot, however, simply have ibmvnic_complete_tx() check the ->resetting
      bit and skip starting the queue. This can race at the "back-end" of a good
      reset which just restarted the queue but has not cleared the ->resetting
      bit yet. If we skip restarting the queue because ->resetting is still set,
      the queue would remain stopped indefinitely, potentially leading to
      transmit timeouts.
      
      IOW ->resetting is too broad for this purpose. Instead use a new flag
      that indicates whether or not the queues are active. Only the open/
      reset paths control when the queues are active. ibmvnic_complete_tx()
      and others wake up the queue only if the queue is marked active.
      
      So we will have:
      	A. reset/open thread in ibmvnic_cleanup() and __ibmvnic_open()
      
      		->resetting = true
      		->tx_queues_active = false
      		disable tx queues
      		...
      		->tx_queues_active = true
      		start tx queues
      
      	B. Tx interrupt in ibmvnic_complete_tx():
      
      		if (->tx_queues_active)
      			netif_wake_subqueue();
      
      To ensure that ->tx_queues_active and the state of the queues are
      consistent, we need a lock which:
      
      	- must also be taken in the interrupt path (ibmvnic_complete_tx())
      	- can be shared across the multiple queues in the adapter without
      	  serializing them
      
      Use rcu_read_lock() and have the reset thread synchronize_rcu() after
      updating the ->tx_queues_active state.
      
      While here, consolidate a few boolean fields in ibmvnic_adapter for
      better alignment.
      
      Based on discussions with Brian King and Dany Madden.
      
      Fixes: 7ed5b31f ("net/ibmvnic: prevent more than one thread from running in reset")
      Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4219196d
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · 4fa331b4
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Fix PPPoE and QinQ with flowtable inet family.
      
      2) Missing register validation in nf_tables.
      
      3) Initialize registers to avoid stack memleak to userspace.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4fa331b4
  6. 17 Mar, 2022 16 commits
  7. 16 Mar, 2022 2 commits