1. 25 Oct, 2022 33 commits
    • net: dev: Convert sa_data to flexible array in struct sockaddr · b5f0de6d
      Kees Cook authored
      One of the worst offenders of "fake flexible arrays" is struct sockaddr,
      as it is the classic example of why GCC and Clang have been traditionally
      forced to treat all trailing arrays as fake flexible arrays: in the
      distant misty past, sa_data became too small, and code started just
      treating it as a flexible array, even though it was fixed-size. The
      special case by the compiler is specifically that sizeof(sa->sa_data)
      and FORTIFY_SOURCE (which uses __builtin_object_size(sa->sa_data, 1))
      do not agree (14 and -1 respectively), which makes FORTIFY_SOURCE treat
      it as a flexible array.
      
      However, the coming -fstrict-flex-arrays compiler flag will remove
      these special cases so that FORTIFY_SOURCE can gain coverage over all
      the trailing arrays in the kernel that are _not_ supposed to be treated
      as a flexible array. To deal with this change, convert sa_data to a true
      flexible array. To keep the structure size the same, move sa_data into
      a union with a newly introduced sa_data_min with the original size. The
      result is that FORTIFY_SOURCE can continue to have no idea how large
      sa_data may actually be, but anything using sizeof(sa->sa_data) must
      switch to sizeof(sa->sa_data_min).
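
The layout described above can be sketched in standalone GNU C. This is an illustration only, not the kernel definition: the type name `sockaddr_sketch`, the `__empty` member and the helper function are assumptions made for the example. The empty-struct wrapper is there because C does not allow a bare flexible array directly inside a union, so the sketch relies on GCC/Clang extensions.

```c
#include <stddef.h>

/* Illustrative stand-in for struct sockaddr; not the kernel definition. */
struct sockaddr_sketch {
	unsigned short sa_family;
	union {
		char sa_data_min[14];	/* keeps the historical 14-byte size */
		struct {
			struct { } __empty;	/* GNU C: lets a flex array sit in a union */
			char sa_data[];		/* true flexible array, size unknown */
		};
	};
};

/* Code that used sizeof(sa->sa_data) would switch to sa_data_min. */
static size_t sockaddr_sketch_data_len(void)
{
	return sizeof(((struct sockaddr_sketch *)0)->sa_data_min);
}
```

Because both union members start at the same offset, the overall structure size is unchanged while the compiler no longer sees a fixed bound on `sa_data`.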
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Dylan Yudaken <dylany@fb.com>
      Cc: Yajun Deng <yajun.deng@linux.dev>
      Cc: Petr Machata <petrm@nvidia.com>
      Cc: Hangbin Liu <liuhangbin@gmail.com>
      Cc: Leon Romanovsky <leon@kernel.org>
      Cc: syzbot <syzkaller@googlegroups.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20221018095503.never.671-kees@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bnx2: Use kmalloc_size_roundup() to match ksize() usage · d6dd5080
      Kees Cook authored
      Round up allocations with kmalloc_size_roundup() so that build_skb()'s
      use of ksize() is always accurate and no special handling of the memory
      is needed by KASAN, UBSAN_BOUNDS, or FORTIFY_SOURCE.
      
      Cc: Rasesh Mody <rmody@marvell.com>
      Cc: GR-Linux-NIC-Dev@marvell.com
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20221022021004.gonna.489-kees@kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'mptcp-socket-option-updates' · 6459838a
      Paolo Abeni authored
      Mat Martineau says:
      
      ====================
      mptcp: Socket option updates
      
      Patches 1 and 3 refactor a recent socket option helper function for more
      generic use, and make use of it in a couple of places.
      
      Patch 2 adds TCP_FASTOPEN_NO_COOKIE functionality to MPTCP sockets,
      similar to TCP_FASTOPEN_CONNECT support recently added in v6.1.
      ====================
      
      Link: https://lore.kernel.org/r/20221022004505.160988-1-mathew.j.martineau@linux.intel.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • mptcp: sockopt: use new helper for TCP_DEFER_ACCEPT · caea6467
      Matthieu Baerts authored
      mptcp_setsockopt_sol_tcp_defer() was doing the same thing as
      mptcp_setsockopt_first_sf_only() except for the returned code in case of
      error.
      
      Ignoring the error is needed to mimic how TCP_DEFER_ACCEPT is handled
      when used with "plain" TCP sockets.
      
      The specific function for TCP_DEFER_ACCEPT can be replaced by the new
      mptcp_setsockopt_first_sf_only() helper and errors can be ignored to
      stay compatible with TCP. A bit of cleanup.
      Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • mptcp: add TCP_FASTOPEN_NO_COOKIE support · e64d4deb
      Matthieu Baerts authored
      The goal of this socket option is to configure MPTCP + TFO without
      cookie per socket.
      
      It was already possible to enable TFO without a cookie per netns by
      setting the net.ipv4.tcp_fastopen sysctl knob to the right value, and
      per route by setting the 'fastopen_no_cookie' option. This patch adds
      per-socket support, as is already possible with TCP thanks to the
      TCP_FASTOPEN_NO_COOKIE socket option.
      
      The only thing to do here is to relay the request to the first subflow
      like it is already done for TCP_FASTOPEN_CONNECT.
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • mptcp: sockopt: make 'tcp_fastopen_connect' generic · d3d42904
      Matthieu Baerts authored
      There are other socket options that need to act only on the first
      subflow, e.g. all TCP_FASTOPEN* socket options.
      
      This is similar to the getsockopt version.
      
      In the next commit, this new mptcp_setsockopt_first_sf_only() helper is
      used by another option.
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'soreuseport-fix-broken-so_incoming_cpu' · 818a2604
      Paolo Abeni authored
      Kuniyuki Iwashima says:
      
      ====================
      soreuseport: Fix broken SO_INCOMING_CPU.
      
      setsockopt(SO_INCOMING_CPU) for UDP/TCP is broken since 4.5/4.6 due to
      these commits:
      
        * e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
        * c125e80b ("soreuseport: fast reuseport TCP socket selection")
      
      These commits introduced the O(1) socket selection algorithm and removed
      the O(n) iteration over the list, but they ignore the score calculated by
      compute_score().  As a result, they caused two misbehaviours:
      
        * Unconnected sockets receive packets sent to connected sockets
        * SO_INCOMING_CPU does not work
      
      The former is fixed by commit acdcecc6 ("udp: correct reuseport
      selection with connected sockets").  This series fixes the latter and
      adds some tests for SO_INCOMING_CPU.
      ====================
      
      Link: https://lore.kernel.org/r/20221021204435.4259-1-kuniyu@amazon.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • selftest: Add test for SO_INCOMING_CPU. · 6df96146
      Kuniyuki Iwashima authored
      Some highly optimised applications use SO_INCOMING_CPU to make them
      efficient, but to avoid slowing down they never checked with getsockopt()
      that it was working correctly.  As a result, no one noticed it had been
      broken for years, so it's a good time to add a test to catch future
      regressions.
      
      The test does
      
        1) Create $(nproc) TCP listeners associated with each CPU.
      
        2) Create 32 child sockets for each listener by calling
           sched_setaffinity() for each CPU.
      
        3) Check if accept()ed sockets' sk_incoming_cpu matches
           listener's one.
      
      If we see -EAGAIN, SO_INCOMING_CPU is broken.  However, we might not see
      any error even if it is broken; the kernel could miraculously distribute
      all SYNs to the correct listeners.  To make that unlikely, we must increase
      the number of clients and CPUs to some extent, so the test requires
      $(nproc) >= 2 and creates at least 64 sockets.
      
      Test:
        $ nproc
        96
        $ ./so_incoming_cpu
      
      Before the previous patch:
      
        # Starting 12 tests from 5 test cases.
        #  RUN           so_incoming_cpu.before_reuseport.test1 ...
        # so_incoming_cpu.c:191:test1:Expected cpu (5) == i (0)
        # test1: Test terminated by assertion
        #          FAIL  so_incoming_cpu.before_reuseport.test1
        not ok 1 so_incoming_cpu.before_reuseport.test1
        ...
        # FAILED: 0 / 12 tests passed.
        # Totals: pass:0 fail:12 xfail:0 xpass:0 skip:0 error:0
      
      After:
      
        # Starting 12 tests from 5 test cases.
        #  RUN           so_incoming_cpu.before_reuseport.test1 ...
        # so_incoming_cpu.c:199:test1:SO_INCOMING_CPU is very likely to be working correctly with 3072 sockets.
        #            OK  so_incoming_cpu.before_reuseport.test1
        ok 1 so_incoming_cpu.before_reuseport.test1
        ...
        # PASSED: 12 / 12 tests passed.
        # Totals: pass:12 fail:0 xfail:0 xpass:0 skip:0 error:0
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • soreuseport: Fix socket selection for SO_INCOMING_CPU. · b261eda8
      Kuniyuki Iwashima authored
      Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work
      with setsockopt(SO_REUSEPORT) since v4.6.
      
      With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could
      build a highly efficient server application.
      
      setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener
      or UDP socket, and then incoming packets processed on the CPU will
      likely be distributed to the socket.  Technically, a socket could
      even receive packets handled on another CPU if no sockets in the
      reuseport group have the same CPU receiving the flow.
      
      The logic exists in compute_score() so that a socket will get a higher
      score if it has the same CPU as the flow.  However, the score has been
      ignored since the two blamed commits, which introduced a faster socket
      selection algorithm for SO_REUSEPORT.
      
      This patch introduces a counter of sockets with SO_INCOMING_CPU in
      a reuseport group to check if we should iterate all sockets to find
      a proper one.  We increment the counter when
      
        * calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT
      
        * enabling SO_INCOMING_CPU if the socket is in a reuseport group
      
      Also, we decrement it when
      
        * detaching a socket out of the group to apply SO_INCOMING_CPU to
          migrated TCP requests
      
        * disabling SO_INCOMING_CPU if the socket is in a reuseport group
      
      When the counter reaches 0, we can get back to the O(1) selection
      algorithm.
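
The counter-gated choice between the two algorithms can be modelled in a few lines of userspace C. This is a simplified sketch, not the kernel implementation: the struct, the function names and the modulo-based slot choice are all assumptions for illustration, and the model assumes a non-empty group of at most `SKETCH_MAX_SOCKS` sockets.

```c
#include <stddef.h>

#define SKETCH_MAX_SOCKS 8

/* Simplified userspace model of a reuseport group; not kernel code. */
struct group_sketch {
	int cpu[SKETCH_MAX_SOCKS];	/* per-socket SO_INCOMING_CPU, -1 if unset */
	size_t num_socks;
	unsigned int incoming_cpu;	/* how many sockets have a CPU set */
};

/* Add a socket; keep the counter in step with sockets that set a CPU. */
static void group_add(struct group_sketch *g, int cpu)
{
	g->cpu[g->num_socks++] = cpu;
	if (cpu >= 0)
		g->incoming_cpu++;
}

/* Pick a socket index for a flow with hash `hash` processed on `cur_cpu`. */
static size_t group_select(const struct group_sketch *g, unsigned int hash,
			   int cur_cpu)
{
	size_t start = hash % g->num_socks;
	size_t i;

	if (!g->incoming_cpu)
		return start;		/* O(1): no socket asked for a CPU */

	/* O(n): walk the group from the hashed slot looking for a CPU match. */
	for (i = 0; i < g->num_socks; i++) {
		size_t idx = (start + i) % g->num_socks;

		if (g->cpu[idx] == cur_cpu)
			return idx;
	}
	return start;			/* no match: fall back to the hashed slot */
}
```

With the counter at zero the model never enters the loop, which mirrors the point that the non-SO_INCOMING_CPU case keeps its O(1) cost.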
      
      The overall changes are negligible for the non-SO_INCOMING_CPU case,
      and the only notable thing is that we have to update sk_incoming_cpu
      under reuseport_lock.  Otherwise, the race prevents transitioning to
      the O(n) algorithm and results in the wrong socket selection.
      
       cpu1 (setsockopt)               cpu2 (listen)
      +-----------------+             +-------------+
      
      lock_sock(sk1)                  lock_sock(sk2)
      
      reuseport_update_incoming_cpu(sk1, val)
      .
      |  /* set CPU as 0 */
      |- WRITE_ONCE(sk1->sk_incoming_cpu, val)
      |
      |                               spin_lock_bh(&reuseport_lock)
      |                               reuseport_grow(sk2, reuse)
      |                               .
      |                               |- more_socks_size = reuse->max_socks * 2U;
      |                               |- if (more_socks_size > U16_MAX &&
      |                               |       reuse->num_closed_socks)
      |                               |  .
      |                               |  |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL);
      |                               |  `- __reuseport_detach_closed_sock(sk1, reuse)
      |                               |     .
      |                               |     `- reuseport_put_incoming_cpu(sk1, reuse)
      |                               |        .
      |                               |        |  /* Read shutdown()ed sk1's sk_incoming_cpu
      |                               |        |   * without lock_sock().
      |                               |        |   */
      |                               |        `- if (sk1->sk_incoming_cpu >= 0)
      |                               |           .
      |                               |           |  /* decrement not-yet-incremented
      |                               |           |   * count, which is never incremented.
      |                               |           |   */
      |                               |           `- __reuseport_put_incoming_cpu(reuse);
      |                               |
      |                               `- spin_unlock_bh(&reuseport_lock)
      |
      |- spin_lock_bh(&reuseport_lock)
      |
      |- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...)
      |- if (!reuse)
      |  .
      |  |  /* Cannot increment reuse->incoming_cpu. */
      |  `- goto out;
      |
      `- spin_unlock_bh(&reuseport_lock)
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Fixes: c125e80b ("soreuseport: fast reuseport TCP socket selection")
      Reported-by: Kazuho Oku <kazuhooku@gmail.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'net-ipa-validation-cleanup' · 71920a77
      Paolo Abeni authored
      Alex Elder says:
      
      ====================
      net: ipa: validation cleanup
      
      This series gathers a set of IPA driver cleanups, mostly involving
      code that ensures certain things are known to be correct *early*
      (either at build or initialization time), so they can be assumed good
      during normal operation.
      
      The first removes three constant symbols, by making a (reasonable)
      assumption that a routing table consists of entries for the modem
      followed by entries for the AP, with no unused entries between them.
      
      The second removes two checks that are redundant (they verify the
      sizes of two memory regions are in range, which will have been done
      earlier for all regions).
      
      The third adds some new checks to routing and filter tables that
      can be done at "init time" (without requiring any access to IPA
      hardware).
      
      The fourth moves a check that routing and filter table addresses can
      be encoded within certain IPA immediate commands, so it's performed
      earlier; the checks can be done without touching IPA hardware.  The
      fifth moves some other command-related checks earlier, for the same
      reason.
      
      The sixth removes the definition of ipa_table_valid(), because what it
      does has become redundant.  Finally, the last patch moves two more
      validation calls so they're done very early in the probe process.
      This will be required by some upcoming patches, which will record
      the size of the routing and filter tables at this time so they're
      available for subsequent initialization.
      ====================
      
      Link: https://lore.kernel.org/r/20221021191340.4187935-1-elder@linaro.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: ipa: check table memory regions earlier · 73da9cac
      Alex Elder authored
      Verify that the sizes of the routing and filter table memory regions
      are valid as part of memory initialization, rather than waiting for
      table initialization.  The main reason to do this is that upcoming
      patches use these memory region sizes to determine the number of
      entries in these tables, and we'll want to know these sizes are good
      sooner.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: ipa: kill ipa_table_valid() · 39ad8152
      Alex Elder authored
      What ipa_table_valid() (and ipa_table_valid_one(), which it calls)
      does is ensure that the memory regions that hold routing and filter
      tables have reasonable size.  Specifically, it checks that the size
      of a region is sufficient (or rather, exactly the right size) to
      hold the maximum number of entries supported by the driver.  (There
      is an additional check that's erroneous, but in practice it is never
      reached.)
      
      Recently ipa_table_mem_valid() was added, which is called by
      ipa_table_init().  That function verifies that all table memory
      regions are of sufficient size, and requires hashed tables to have
      zero size if hashing is not supported.  It only ensures the filter
      table is large enough to hold the number of endpoints that support
      filtering, but that is adequate.
      
      Therefore everything that ipa_table_valid() does is redundant, so
      get rid of it.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: ipa: introduce ipa_cmd_init() · 7fd10a2a
      Alex Elder authored
      Currently, ipa_cmd_data_valid() is called by ipa_mem_config().
      Nothing it does requires access to hardware though, so it can be
      done during the init phase of IPA driver startup.
      
      Create a new function ipa_cmd_init(), whose purpose is to do early
      initialization related to IPA immediate commands.  It will call the
      build-time validation function, then will make the two calls made
      previously by ipa_cmd_data_valid().  This makes ipa_cmd_data_valid()
      unnecessary, so get rid of it.
      
      Rename ipa_cmd_header_valid() to be ipa_cmd_header_init_local_valid(),
      so its name is clearer about which IPA immediate command it is
      associated with.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: ipa: verify table sizes fit in commands early · 5444b0ea
      Alex Elder authored
      We currently verify the table size and offset fit in the immediate
      command fields that must encode them in ipa_table_valid_one().  We
      can now make this check earlier, in ipa_table_mem_valid().
      
      The non-hashed IPv4 filter and route tables will always exist, and
      their sizes will match the IPv6 tables, as well as the hashed tables
      (if supported).  So it's sufficient to verify the offset and size of
      the IPv4 non-hashed tables fit into these fields.
      
      Rename the function ipa_cmd_table_init_valid(), to reinforce that
      it is the TABLE_INIT immediate command fields we're checking.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: ipa: validate IPA table memory earlier · cf139196
      Alex Elder authored
      Add checks in ipa_table_init() to ensure the memory regions defined
      for IPA filter and routing tables are valid.
      
      For routing tables, the checks ensure:
        - The non-hashed IPv4 and IPv6 routing tables are defined
        - The non-hashed IPv4 and IPv6 routing tables are the same size
        - The number of entries in the non-hashed IPv4 routing table is enough
          to hold the number of entries available to the modem, plus at least
          one usable by the AP.
      
      For filter tables, the checks ensure:
        - The non-hashed IPv4 and IPv6 filter tables are defined
        - The non-hashed IPv4 and IPv6 filter tables are the same size
        - The number of entries in the non-hashed IPv4 filter table is enough
          to hold the endpoint bitmap, plus an entry for each defined
          endpoint that supports filtering.
      
      In addition, for both routing and filter tables:
        - If hashing isn't supported (IPA v4.2), hashed tables are zero size
        - If hashing *is* supported, all hashed tables are the same size as
          their non-hashed counterparts.
      
      When validating the size of routing tables, require the AP to have
      at least one entry (in addition to those used by the modem).
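
The checks listed above can be condensed into one predicate, sketched here under stated assumptions: the struct, its field names and `table_sizes_valid()` are hypothetical and do not come from the IPA driver, and sizes are expressed as entry counts rather than bytes to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one table family (routing or filter). */
struct table_sizes_sketch {
	size_t v4, v6;			/* non-hashed table sizes, in entries */
	size_t v4_hashed, v6_hashed;	/* hashed table sizes, in entries */
};

/* `min_entries` is the per-family minimum: modem entries plus one for the
 * AP (routing), or the endpoint bitmap plus filtering endpoints (filter). */
static bool table_sizes_valid(const struct table_sizes_sketch *t,
			      size_t min_entries, bool hash_supported)
{
	if (!t->v4 || !t->v6)		/* non-hashed tables must be defined */
		return false;
	if (t->v4 != t->v6)		/* ...and be the same size */
		return false;
	if (t->v4 < min_entries)	/* ...and hold the required entries */
		return false;

	if (!hash_supported)		/* e.g. IPA v4.2: hashed tables are empty */
		return !t->v4_hashed && !t->v6_hashed;

	/* Hashed tables mirror their non-hashed counterparts. */
	return t->v4_hashed == t->v4 && t->v6_hashed == t->v6;
}
```

Since nothing here touches hardware, a check of this shape can run at init time, which is the point the commit makes.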
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: ipa: remove two memory region checks · 2554322b
      Alex Elder authored
      There's no need to ensure table memory regions fit within the
      IPA-local memory range.  And there's no need to ensure the modem
      header memory region is in range either.  These are verified for all
      memory regions in ipa_mem_size_valid(), once we have settled on the
      size of IPA memory.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: ipa: kill two constant symbols · fb4014ac
      Alex Elder authored
      The entries in each IPA routing table are divided between the modem
      and the AP.  The modem always gets some number of entries located at
      the base of the table; the AP gets all those that follow.
      
      There's no reason to think the modem will use anything different
      from the first entries in a routing table, so:
        - Get rid of IPA_ROUTE_MODEM_MIN (just assume it's 0)
        - Get rid of IPA_ROUTE_AP_MIN (just assume it's IPA_ROUTE_MODEM_COUNT)
      And finally:
        - Open-code IPA_ROUTE_AP_COUNT and remove its definition
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'extend-action-skbedit-to-rx-queue-mapping' · 34802d06
      Paolo Abeni authored
      Amritha Nambiar says:
      
      ====================
      Extend action skbedit to RX queue mapping
      
      Based on the discussion on
      https://lore.kernel.org/netdev/166260012413.81018.8010396115034847972.stgit@anambiarhost.jf.intel.com/ ,
      the following series extends skbedit tc action to RX queue mapping.
      Currently, skbedit action in tc allows overriding of transmit queue.
      Extending this ability of skbedit action supports the selection of
      receive queue for incoming packets. On the receive side, this action
      is supported only in hardware, so the skip_sw flag is enforced.
      
      Enabled ice driver to offload this type of filter into the hardware
      for accepting packets to the device's receive queue.
      ====================
      
      Link: https://lore.kernel.org/r/166633888716.52141.3425659377117969638.stgit@anambiarhost.jf.intel.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Documentation: networking: TC queue based filtering · d5ae8ecf
      Amritha Nambiar authored
      Add tc-queue-filters.rst with notes on TC filters for
      selecting a set of queues and/or a queue.
      Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • ice: Enable RX queue selection using skbedit action · 143b86f3
      Amritha Nambiar authored
      This patch uses TC skbedit queue_mapping action to support
      forwarding packets to a device queue. Such filters with action
      forward to queue will be the highest priority switch filter in
      HW.
      Example:
      $ tc filter add dev ens4f0 protocol ip ingress flower\
        dst_ip 192.168.1.12 ip_proto tcp dst_port 5001\
        action skbedit queue_mapping 5 skip_sw
      
      The above command adds an ingress filter; incoming packets
      that match it will be accepted into queue 5. The queue
      number is in decimal format.
      
      Refactored ice_add_tc_flower_adv_fltr() to consolidate code with
      action FWD_TO_VSI and FWD_TO_QUEUE.
      Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • act_skbedit: skbedit queue mapping for receive queue · 4a6a676f
      Amritha Nambiar authored
      Add support for skbedit queue mapping action on receive
      side. This is supported only in hardware, so the skip_sw
      flag is enforced. This enables offloading filters for
      receive queue selection in the hardware using the
      skbedit action. Traffic arrives on the Rx queue requested
      in the skbedit action parameter. A new tc action flag
      TCA_ACT_FLAGS_AT_INGRESS is introduced to identify the
      traffic direction the action queue_mapping is requested
      on during filter addition. This is used to disallow
      offloading the skbedit queue mapping action on transmit
      side.
      
      Example:
      $ tc filter add dev $IFACE ingress protocol ip flower dst_ip $DST_IP\
       action skbedit queue_mapping $rxq_id skip_sw
      Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'net-sfp-improve-high-power-module-implementation' · 6143eca3
      Jakub Kicinski authored
      Russell King says:
      
      ====================
      net: sfp: improve high power module implementation
      
      This series aims to improve the power level switching between standard
      level 1 and the higher power levels.
      
      The first patch updates the DT binding documentation to include the
      minimum and default of 1W, which is the base level that every SFP cage
      must support. Hence, it makes sense to document this in the binding.
      
      The second patch enforces a minimum of 1W when parsing the firmware
      description, and optimises the code for that case; there's no need to
      check for SFF8472 compliance since we will not need to touch the
      A2h registers.
      
      Patch 3 validates that the module supports SFF-8472 rev 10.2 before
      checking for power level 2 - rev 10.2 is where support for power
      levels was introduced, so if the module doesn't support this revision,
      it doesn't support power levels. Setting the power level 2 declaration
      bit is likely to be spurious.
      
      Patch 4 does the same for power level 3, except this was introduced in
      SFF-8472 rev 11.9. The revision code was never updated, so we use the
      rev 11.4 to signify this.
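
The revision gating described for patches 3 and 4 can be sketched as a small pure function. The enum values are hypothetical ordered stand-ins, not the real SFF-8472 compliance codes, and the function name is an assumption; the 1W, 1.5W and 2W figures are the power levels named elsewhere in this series.

```c
#include <stdbool.h>

/* Hypothetical, ordered stand-ins for the SFF-8472 compliance codes. */
enum sff8472_rev_sketch { REV_NONE, REV_10_2, REV_11_4 };

/* Maximum module power in milliwatts, ignoring declaration bits that
 * predate the revision which introduced the corresponding level. */
static unsigned int module_power_mw(enum sff8472_rev_sketch rev,
				    bool declares_level2, bool declares_level3)
{
	if (declares_level3 && rev >= REV_11_4)
		return 2000;	/* power level 3 */
	if (declares_level2 && rev >= REV_10_2)
		return 1500;	/* power level 2 */
	return 1000;		/* power level 1: every SFP cage supports 1W */
}
```

A module that sets a declaration bit but reports an older revision falls through to 1W in this model, which is the "spurious indication" case the patches aim to avoid.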
      
      Patch 5 cleans up the code - rather than using BIT(0), we now use a
      properly named value for the power level select bit.
      
      Patch 6 introduces a read-modify-write helper.
      
      Patch 7 gets rid of the DM7052 hack (which sets a power level
      declaration bit but is not compatible with SFF-8472 rev 10.2, and
      the module does not implement the A2h I2C address.)
      
      Series tested with my DM7052.
      
      v2: update sff,sfp.yaml with Rob's feedback
      ====================
      
      Andrew's review tags from v1.
      
      Link: https://lore.kernel.org/r/Y0%2F7dAB8OU3jrbz6@shell.armlinux.org.uk
      Link: https://lore.kernel.org/r/Y1K17UtfFopACIi2@shell.armlinux.org.uk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sfp: get rid of DM7052 hack when enabling high power · bd1432f6
      Russell King (Oracle) authored
      Since we no longer mis-detect high-power mode with the DM7052 module,
      we no longer need the hack in sfp_module_enable_high_power(), and can
      now switch this to use sfp_modify_u8().
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sfp: add sfp_modify_u8() helper · a3c536fc
      Russell King (Oracle) authored
      Add a helper to modify bits in a single byte in memory space, and use
      it when updating the soft tx-disable flag in the module.
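
A read-modify-write helper of this kind generally follows the pattern below. This is a userspace sketch over a plain byte array, not the driver's sfp_modify_u8(): the array, the function name `modify_u8_sketch` and its signature are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the module's byte-addressable memory space. */
static uint8_t module_mem_sketch[256];

/* Update only the bits selected by `mask` at `addr`, leaving the rest.
 * Returns 0 on success or -1 if `addr` is out of range. */
static int modify_u8_sketch(size_t addr, uint8_t mask, uint8_t val)
{
	uint8_t old, new_val;

	if (addr >= sizeof(module_mem_sketch))
		return -1;

	old = module_mem_sketch[addr];			/* read */
	new_val = (old & ~mask) | (val & mask);		/* modify */
	if (new_val != old)
		module_mem_sketch[addr] = new_val;	/* write back if changed */
	return 0;
}
```

Setting or clearing a single flag then becomes one call with a one-bit mask, instead of an open-coded read, bit operation and write at each call site.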
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sfp: provide a definition for the power level select bit · 39890049
      Russell King (Oracle) authored
      Provide a named definition for the power level select bit in the
      extended status register, rather than using BIT(0) in the code.
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sfp: ignore power level 3 prior to SFF-8472 Rev 11.4 · f8810ca7
      Russell King (Oracle) authored
      Power level 3 was included in SFF-8472 revision 11.9, but this does
      not have a compliance code. Use revision 11.4 as the minimum
      compliance level instead.
      
      This should avoid any spurious indication of 2W modules.
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sfp: ignore power level 2 prior to SFF-8472 Rev 10.2 · 18cc659e
      Russell King (Oracle) authored
      Power level 2 was introduced by SFF-8472 revision 10.2. Ignore
      the power declaration bit for modules that are not compliant with
      at least this revision.
      
      This should remove any spurious indication of 1.5W modules.
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      18cc659e
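The revision gating described in the two power-level commits (level 2 honoured only from Rev 10.2, level 3 only from Rev 11.4) can be sketched roughly as below. The enum values, flag names, and function name are hypothetical and chosen for illustration; only the ordering of the revisions and the 1.5W and 2W figures come from the commit messages.

```c
#include <assert.h>

/* Illustrative compliance levels; only their relative ordering matters. */
enum compliance_rev {
    REV_UNSPECIFIED = 0,
    REV_9_3,
    REV_9_5,
    REV_10_2,
    REV_10_4,
    REV_11_0,
    REV_11_3,
    REV_11_4,
};

/* Hypothetical option flags for the two power declaration bits. */
#define OPT_POWER_LEVEL2 (1u << 0)
#define OPT_POWER_LEVEL3 (1u << 1)

/* Return the declared module power in mW, ignoring bits that the
 * module's compliance revision could not have defined. */
unsigned int module_power_mw(enum compliance_rev rev, unsigned int options)
{
    unsigned int power_mw = 1000;   /* power level 1: up to 1W */

    if (rev >= REV_10_2 && (options & OPT_POWER_LEVEL2))
        power_mw = 1500;
    if (rev >= REV_11_4 && (options & OPT_POWER_LEVEL3))
        power_mw = 2000;

    return power_mw;
}
```

With this shape, a module reporting an older revision but carrying stray option bits is treated as a 1W module instead of producing a spurious 1.5W or 2W indication.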
    • net: sfp: check firmware provided max power · 02eaf5a7
      Russell King (Oracle) authored
      Check that the firmware provided maximum power is at least 1W, which
      is the minimum power level for any SFP module.
      
      Now that we enforce the minimum of 1W, we can exit early from
      sfp_module_parse_power() if the module power is 1W or less.
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      02eaf5a7
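A minimal sketch of the two checks this commit describes follows. It assumes that a too-small firmware value is raised to 1W instead of being rejected, which the commit message does not spell out, and the function and macro names are hypothetical.

```c
#include <assert.h>

#define MIN_CAGE_POWER_MW 1000u     /* hypothetical name for the 1W floor */

/* Host power budget: anything below 1W from firmware is raised to 1W
 * (assumption: clamp, not reject). */
unsigned int host_max_power_mw(unsigned int fw_max_power_mw)
{
    if (fw_max_power_mw < MIN_CAGE_POWER_MW)
        return MIN_CAGE_POWER_MW;
    return fw_max_power_mw;
}

/* With the 1W floor enforced, a module needing 1W or less always fits,
 * so further power parsing can be skipped. Returns 1 to exit early. */
int can_exit_early(unsigned int module_power_mw)
{
    return module_power_mw <= MIN_CAGE_POWER_MW;
}
```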
    • dt-bindings: net: sff,sfp: update binding · a272bcb9
      Russell King (Oracle) authored
      Add a minimum and default for the maximum-power-milliwatt option;
      module power levels were originally up to 1W, so this is the default
      and the minimum power level we can have for a functional SFP cage.
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a272bcb9
    • Merge branch 'bnxt_en-driver-updates' · 1b3d6ecd
      Jakub Kicinski authored
      Michael Chan says:
      
      ====================
      bnxt_en: Driver updates
      
      This patchset adds .get_module_eeprom_by_page() support and adds
      an NVRAM resize step to allow larger firmware images to be flashed
      to older firmware.
      ====================
      
      Link: https://lore.kernel.org/r/1666334243-23866-1-git-send-email-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1b3d6ecd
    • bnxt_en: check and resize NVRAM UPDATE entry before flashing · 45034224
      Vikas Gupta authored
      Resize of the UPDATE entry is required if the image to
      be flashed is larger than the available space. Add this step,
      otherwise flashing larger firmware images by ethtool or devlink
      may fail.
      Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
      Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      45034224
    • bnxt_en: add .get_module_eeprom_by_page() support · 7ef3d390
      Vikas Gupta authored
      Add support for the .get_module_eeprom_by_page() callback, which
      implements a generic solution for module EEPROM access.
      
      v3: Add bnxt_get_module_status() to get a more specific extack error
          string.
          Return -EINVAL from bnxt_get_module_eeprom_by_page() when we
          don't want to fall back to the old method.
      v2: Simplification suggested by Ido Schimmel
      
      Link: https://lore.kernel.org/netdev/YzVJ%2FvKJugoz15yV@shredder/
      Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7ef3d390
    • bnxt_en: Update firmware interface to 1.10.2.118 · 84a911db
      Michael Chan authored
      The main changes are PTM timestamp support, CMIS EEPROM support, and
      asymmetric CoS queues support.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      84a911db
  2. 24 Oct, 2022 7 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 96917bb3
      Jakub Kicinski authored
      include/linux/net.h
        a5ef058d ("net: introduce and use custom sockopt socket flag")
        e993ffe3 ("net: flag sockets supporting msghdr originated zerocopy")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      96917bb3
    • Merge tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 337a0a0b
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bpf.
      
        The net-memcg fix stands out, the rest is very run-of-the-mill. Maybe
        I'm biased.
      
        Current release - regressions:
      
         - eth: fman: re-expose location of the MAC address to userspace,
           apparently some udev scripts depended on the exact value
      
        Current release - new code bugs:
      
         - bpf:
             - wait for busy refill_work when destroying bpf memory allocator
             - allow bpf_user_ringbuf_drain() callbacks to return 1
             - fix dispatcher patchable function entry to 5 bytes nop
      
        Previous releases - regressions:
      
         - net-memcg: avoid stalls when under memory pressure
      
         - tcp: fix indefinite deferral of RTO with SACK reneging
      
         - tipc: fix a null-ptr-deref in tipc_topsrv_accept
      
         - eth: macb: specify PHY PM management done by MAC
      
         - tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
      
        Previous releases - always broken:
      
         - eth: amd-xgbe: SFP fixes and compatibility improvements
      
        Misc:
      
         - docs: netdev: offer performance feedback to contributors"
      
      * tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
        net-memcg: avoid stalls when under memory pressure
        tcp: fix indefinite deferral of RTO with SACK reneging
        tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
        net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY
        net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed
        docs: netdev: offer performance feedback to contributors
        kcm: annotate data-races around kcm->rx_wait
        kcm: annotate data-races around kcm->rx_psock
        net: fman: Use physical address for userspace interfaces
        net/mlx5e: Cleanup MACsec uninitialization routine
        atlantic: fix deadlock at aq_nic_stop
        nfp: only clean `sp_indiff` when application firmware is unloaded
        amd-xgbe: add the bit rate quirk for Molex cables
        amd-xgbe: fix the SFP compliance codes check for DAC cables
        amd-xgbe: enable PLL_CTL for fixed PHY modes only
        amd-xgbe: use enums for mailbox cmd and sub_cmds
        amd-xgbe: Yellow carp devices do not need rrc
        bpf: Use __llist_del_all() whenever possbile during memory draining
        bpf: Wait for busy refill_work when destroying bpf memory allocator
        MAINTAINERS: add keyword match on PTP
        ...
      337a0a0b
    • Merge tag 'rcu-urgent.2022.10.20a' of... · f6602a97
      Linus Torvalds authored
      Merge tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
      
      Pull RCU fix from Paul McKenney:
       "Fix a regression caused by commit bf95b2bc ("rcu: Switch polled
        grace-period APIs to ->gp_seq_polled"), which could incorrectly leave
        interrupts enabled after an early-boot call to synchronize_rcu().
      
        Such synchronize_rcu() calls must acquire leaf rcu_node locks in order
        to properly interact with polled grace periods, but the code did not
        take into account the possibility of synchronize_rcu() being invoked
        from the portion of the boot sequence during which interrupts are
        disabled.
      
        This commit therefore switches the lock acquisition and release from
        irq to irqsave/irqrestore"
      
      * tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
        rcu: Keep synchronize_rcu() from enabling irqs in early boot
      f6602a97
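The difference between the irq and irqsave/irqrestore lock variants that this fix relies on can be shown with a small userspace simulation. Nothing here is kernel code: `irqs_enabled` is a plain variable standing in for the CPU interrupt state, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated interrupt-enable state; a real kernel reads CPU state. */
bool irqs_enabled = true;

/* "irq" variants: disable on lock, unconditionally enable on unlock. */
void sim_lock_irq(void)   { irqs_enabled = false; }
void sim_unlock_irq(void) { irqs_enabled = true; }

/* "irqsave" variants: remember the prior state and restore exactly it. */
void sim_lock_irqsave(bool *flags)
{
    *flags = irqs_enabled;
    irqs_enabled = false;
}

void sim_unlock_irqrestore(bool flags) { irqs_enabled = flags; }

/* Run an empty critical section with the irq pair, starting from
 * `initial`, and report the interrupt state afterwards. */
bool state_after_irq_pair(bool initial)
{
    irqs_enabled = initial;
    sim_lock_irq();
    sim_unlock_irq();
    return irqs_enabled;
}

/* Same, using the irqsave/irqrestore pair. */
bool state_after_irqsave_pair(bool initial)
{
    bool flags;

    irqs_enabled = initial;
    sim_lock_irqsave(&flags);
    sim_unlock_irqrestore(flags);
    return irqs_enabled;
}
```

Starting with interrupts disabled, as in early boot, the irq pair leaves them enabled while the irqsave pair leaves them disabled, which is the behaviour the pull message describes.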
    • Merge tag 'linux-kselftest-kunit-fixes-6.1-rc3' of... · 2a91e897
      Linus Torvalds authored
      Merge tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull KUnit fixes from Shuah Khan:
       "One single fix to update alloc_string_stream() callers to check for
        IS_ERR() instead of NULL to be in sync with alloc_string_stream()
        returning an ERR_PTR()"
      
      * tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        kunit: update NULL vs IS_ERR() tests
      2a91e897
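Why a NULL check misses an ERR_PTR()-style failure can be seen in a short userspace sketch. The helpers below imitate the idea of encoding a negative errno in a pointer value; they are simplified stand-ins written for this example, not the kernel macros, and `alloc_stream` is a hypothetical allocator, not alloc_string_stream().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer near the top of the
 * address space. */
void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* True when the pointer value falls in the reserved error range. */
int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Recover the errno value from an error pointer. */
long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Hypothetical allocator that reports failure as an encoded errno. */
void *alloc_stream(int fail)
{
    static int stream;

    return fail ? err_ptr(-ENOMEM) : (void *)&stream;
}
```

A failed `alloc_stream()` call returns a non-NULL pointer, so a caller testing only for NULL would go on to dereference an invalid address; testing with `is_err()` catches it.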
    • Merge tag 'linux-kselftest-fixes-6.1-rc3' of... · 21c92498
      Linus Torvalds authored
      Merge tag 'linux-kselftest-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull Kselftest fixes from Shuah Khan:
      
       - futex, intel_pstate, kexec build fixes
      
       - ftrace dynamic_events dependency check fix
      
       - memory-hotplug fix to remove redundant warning from test report
      
      * tag 'linux-kselftest-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        selftests/ftrace: fix dynamic_events dependency check
        selftests/memory-hotplug: Remove the redundant warning information
        selftests/kexec: fix build for ARCH=x86_64
        selftests/intel_pstate: fix build for ARCH=x86_64
        selftests/futex: fix build for clang
      21c92498
    • Merge tag 'pinctrl-v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 74d5b415
      Linus Torvalds authored
      Pull pin control fixes from Linus Walleij:
      
       - Fix typos in UART1 and MMC in the Ingenic driver
      
       - A really well researched glitch bug fix to the Qualcomm driver that
         was tracked down and fixed by Doug Anderson from Chromium. Hats off
         for this one!
      
       - Revert two patches on the Xilinx ZynqMP driver: this needs a proper
         solution making use of firmware version information to adapt to
         different firmware releases
      
       - Fix interrupt triggers in the Ocelot driver
      
      * tag 'pinctrl-v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: ocelot: Fix incorrect trigger of the interrupt.
        Revert "dt-bindings: pinctrl-zynqmp: Add output-enable configuration"
        Revert "pinctrl: pinctrl-zynqmp: Add support for output-enable and bias-high-impedance"
        pinctrl: qcom: Avoid glitching lines when we first mux to output
        pinctrl: Ingenic: JZ4755 bug fixes
      74d5b415
    • net-memcg: avoid stalls when under memory pressure · 720ca52b
      Jakub Kicinski authored
      As Shakeel explains the commit under Fixes had the unintended
      side-effect of no longer pre-loading the cached memory allowance.
      Even tho we previously dropped the first packet received when
      over memory limit - the consecutive ones would get thru by using
      the cache. The charging was happening in batches of 128kB, so
      we'd let in 128kB (truesize) worth of packets per one drop.
      
      After the change we no longer force charge, there will be no
      cache filling side effects. This causes significant drops and
      connection stalls for workloads which use a lot of page cache,
      since we can't reclaim page cache under GFP_NOWAIT.
      
      Some of the latency can be recovered by improving SACK reneg
      handling but nowhere near enough to get back to the pre-5.15
      performance (the application I'm experimenting with still
      sees 5-10x worse latency).
      
      Apply the suggested workaround of using GFP_ATOMIC. We will now
      be more permissive than previously as we'll drop _no_ packets
      in softirq when under pressure. But I can't think of any good
      and simple way to address that within networking.
      
      Link: https://lore.kernel.org/all/20221012163300.795e7b86@kernel.org/
      Suggested-by: Shakeel Butt <shakeelb@google.com>
      Fixes: 4b1327be ("net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()")
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Link: https://lore.kernel.org/r/20221021160304.1362511-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      720ca52b