1. 06 Feb, 2023 19 commits
  2. 04 Feb, 2023 21 commits
    • Merge branch 'net-smc-parallelism' · 042b7858
      David S. Miller authored
      D. Wythe says:
      
      ====================
      net/smc: optimize the parallelism of SMC-R connections
      
This patch set attempts to optimize the parallelism of SMC-R connections,
mainly to reduce unnecessary blocking on locks, and to fix exceptions that
occur after those optimizations.
      
According to the Off-CPU graph, the SMC workers' off-CPU time looks like this:
      
      smc_close_passive_work                  (1.09%)
              smcr_buf_unuse                  (1.08%)
                      smc_llc_flow_initiate   (1.02%)
      
      smc_listen_work                         (48.17%)
              __mutex_lock.isra.11            (47.96%)
      
An ideal SMC-R connection process should only block on network IO
events, but it is quite clear that SMC-R connections currently spend
most of their time queued on locks.
      
The goal of this patchset is to achieve our ideal situation, where waiting
on network IO events accounts for the majority of the connection lifetime.
      
      There are three big locks here:
      
      1. smc_client_lgr_pending & smc_server_lgr_pending
      
      2. llc_conf_mutex
      
      3. rmbs_lock & sndbufs_lock
      
      And an implementation issue:
      
1. confirm/delete rkey messages cannot be sent concurrently, even
though the protocol does allow it.
      
Unfortunately, the above problems together affect the parallelism of
SMC-R connections. If any of them is left unsolved, our goal cannot
be achieved.
      
After this patch set, we can get a quite ideal off-CPU graph, as
follows:
      
      smc_close_passive_work                                  (41.58%)
              smcr_buf_unuse                                  (41.57%)
                      smc_llc_do_delete_rkey                  (41.57%)
      
      smc_listen_work                                         (39.10%)
              smc_clc_wait_msg                                (13.18%)
                tcp_recvmsg_locked                      (13.18%)
              smc_listen_find_device                          (25.87%)
                      smcr_lgr_reg_rmbs                       (25.87%)
                              smc_llc_do_confirm_rkey         (25.87%)
      
We can see that most of the waiting time is now spent waiting for
network IO events. This also yields a certain performance improvement
in our short-lived connection wrk/nginx benchmark test:
      
      +--------------+------+------+-------+--------+------+--------+
      |conns/qps     |c4    | c8   |  c16  |  c32   | c64  |  c200  |
      +--------------+------+------+-------+--------+------+--------+
      |SMC-R before  |9.7k  | 10k  |  10k  |  9.9k  | 9.1k |  8.9k  |
      +--------------+------+------+-------+--------+------+--------+
      |SMC-R now     |13k   | 19k  |  18k  |  16k   | 15k  |  12k   |
      +--------------+------+------+-------+--------+------+--------+
      |TCP           |15k   | 35k  |  51k  |  80k   | 100k |  162k  |
      +--------------+------+------+-------+--------+------+--------+
      
The reason the benefit is less obvious as the number of connections
increases is the workqueue. If we change the workqueue to UNBOUND, we can
obtain at least a 4-5x performance improvement, reaching up to half
of TCP. However, this is not an elegant solution; optimizing it properly
will be much more complicated. In any case, we will submit the relevant
optimization patches as soon as possible.
      
Please note that the premise here is that the lock-related problems
must be solved first; otherwise, no matter how we optimize the workqueue,
there won't be much improvement.
      
      Because there are a lot of related changes to the code, if you have
      any questions or suggestions, please let me know.
      
      Thanks
      D. Wythe
      
      v1 -> v2:
      
      1. Fix panic in SMC-D scenario
2. Fix lnkc-related hashfn calculation exception caused by operator
precedence
      3. Only wake up one connection if the lnk is not active
      4. Delete obsolete unlock logic in smc_listen_work()
      5. PATCH format, do Reverse Christmas tree
      6. PATCH format, change all xxx_lnk_xxx function to xxx_link_xxx
7. PATCH format, add correct Fixes tags to the bugfix patches
8. PATCH format, fix some spelling errors
9. PATCH format, rename slow to do_slow
      
      v2 -> v3:
      
1. Add SMC-D support; remove the concept of link cluster, since SMC-D has
no link at all. Replace it with an lgr decision maker, which provides
suggestions to SMC-D and SMC-R on whether to create a new link group.
      
      2. Fix the corruption problem described by PATCH 'fix application
      data exception' on SMC-D.
      
      v3 -> v4:
      
1. Fix panic caused by an uninitialized map.
      
      v4 -> v5:
      
1. Make SMC-D buf creation serial to avoid a potential error
2. Add a flag to synchronize the success of the first contact
with the readiness of the link group, including SMC-D and SMC-R.
3. Fix a possible reference count leak in smc_llc_flow_start().
4. Reorder the patches, moving the bugfix patches ahead.
      
      v5 -> v6:
      
1. Separate the bugfix patches to make them independent.
      2. Merge patch 'fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending'
      with patch 'remove locks smc_client_lgr_pending and smc_server_lgr_pending'
3. Format code styles, including alignment and reverse Christmas tree
style.
      4. Fix a possible memory leak in smc_llc_rmt_delete_rkey()
      and smc_llc_rmt_conf_rkey().
      
      v6 -> v7:
      
1. Discard the patch attempting to remove the global locks
2. Discard the patch attempting to make the confirm/delete rkey process
concurrent
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore · aff7bfed
      D. Wythe authored
It's clear that rmbs_lock and sndbufs_lock aim to protect the
rmbs list and the sndbufs list.

During connection establishment, smc_buf_get_slot() is always
invoked, and it only performs read semantics on the rmbs list and
sndbufs list.

Based on the above considerations, we replace the mutex with an
rw_semaphore. Only smc_buf_get_slot() uses down_read(), allowing it
to run concurrently; all other parts use down_write() to keep
exclusive semantics.
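
      A minimal sketch of the resulting pattern (simplified from the shape
      of smc_buf_get_slot() in the SMC sources; treat the exact signature
      as illustrative):

          /* readers: buffer slot lookup may now run concurrently */
          static struct smc_buf_desc *smc_buf_get_slot(struct rw_semaphore *lock,
                                                       struct list_head *buf_list)
          {
                  struct smc_buf_desc *buf_slot;

                  down_read(lock);
                  list_for_each_entry(buf_slot, buf_list, list) {
                          /* atomically claim a free, reusable buffer */
                          if (cmpxchg(&buf_slot->used, 0, 1) == 0) {
                                  up_read(lock);
                                  return buf_slot;
                          }
                  }
                  up_read(lock);
                  return NULL;
          }

          /* writers: list modification keeps exclusive semantics */
          down_write(lock);
          list_add(&buf_desc->list, buf_list);
          up_write(lock);
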
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs() · 4da68744
      D. Wythe authored
Unlike smc_buf_create() and smcr_buf_unuse(), smcr_lgr_reg_rmbs() is
exclusive when the assigned rmb_desc is not yet registered, although it
can execute in parallel when the assigned rmb_desc is already registered
and only performs read semantics on it. Hence, we cannot simply replace
it with a read semaphore.

The idea here is: if the assigned rmb_desc is already registered, use
the read semaphore to protect the critical section; if it is not yet
registered, keep using the write semaphore to preserve exclusivity.

Thanks to the reusability of rmb_desc, this allows us to execute
in parallel in most cases.
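
      A sketch of that fast/slow split (is_reg_mr is the per-link
      registration flag in the SMC sources; the snippet is simplified, and
      the real code must re-check the state after taking the write lock):

          bool do_slow = !rmb_desc->is_reg_mr[link->link_idx];

          if (do_slow)
                  down_write(&lgr->llc_conf_mutex); /* first registration */
          else
                  down_read(&lgr->llc_conf_mutex);  /* already registered */

          /* ... register the MR and/or confirm the rkey on active links ... */

          if (do_slow)
                  up_write(&lgr->llc_conf_mutex);
          else
                  up_read(&lgr->llc_conf_mutex);
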
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: use read semaphores to reduce unnecessary blocking in smc_buf_create() & smcr_buf_unuse() · f6421014
      D. Wythe authored
The following is part of the Off-CPU graph during frequent SMC-R
short-lived connection processing:
      
      process_one_work				(51.19%)
      smc_close_passive_work			(28.36%)
      	smcr_buf_unuse				(28.34%)
      	rwsem_down_write_slowpath		(28.22%)
      
      smc_listen_work				(22.83%)
      	smc_clc_wait_msg			(1.84%)
      	smc_buf_create				(20.45%)
      		smcr_buf_map_usable_links
      		rwsem_down_write_slowpath	(20.43%)
      	smcr_lgr_reg_rmbs			(0.53%)
      		rwsem_down_write_slowpath	(0.43%)
      		smc_llc_do_confirm_rkey		(0.08%)
      
We can clearly see that during connection establishment, connections
wait not on IO but on llc_conf_mutex.
      
More importantly, the core critical sections (smcr_buf_unuse() &
smc_buf_create()) only perform read semantics on the links, so we can
easily replace the lock with a read semaphore.
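
      A sketch of the change in the buffer-mapping path (shape follows
      smcr_buf_map_usable_links(), simplified):

          static int smcr_buf_map_usable_links(struct smc_link_group *lgr,
                                               struct smc_buf_desc *buf_desc,
                                               bool is_rmb)
          {
                  int i, rc = 0;

                  /* still serialized against link add/delete, but buffer
                   * mapping only reads the link array, so readers may
                   * proceed in parallel */
                  down_read(&lgr->llc_conf_mutex);  /* was: down_write() */
                  for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
                          struct smc_link *lnk = &lgr->lnk[i];

                          if (!smc_link_usable(lnk))
                                  continue;
                          if (smcr_buf_map_link(buf_desc, is_rmb, lnk)) {
                                  rc = -ENOMEM;
                                  break;
                          }
                  }
                  up_read(&lgr->llc_conf_mutex);
                  return rc;
          }
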
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: llc_conf_mutex refactor, replace it with rw_semaphore · b5dd4d69
      D. Wythe authored
llc_conf_mutex was used to protect links and link-related configurations
in the same link group, for example, adding or deleting links. However,
in most cases, the protected critical section has only read semantics
and no write semantics at all, such as obtaining a usable link or an
available rmb_desc.

This patch does simple code refactoring: replace the mutex with an
rw_semaphore, mutex_lock() with down_write(), and mutex_unlock() with
up_write().

Theoretically, this replacement is equivalent, but after this patch
we can distinguish lock granularity according to the different semantics
of the critical sections.
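
      The replacement itself is mechanical; a sketch (member name from the
      SMC sources, surrounding code elided):

          #include <linux/rwsem.h>

          struct smc_link_group {
                  /* ... other members elided ... */
                  struct rw_semaphore     llc_conf_mutex; /* was: struct mutex */
          };

          init_rwsem(&lgr->llc_conf_mutex);       /* was: mutex_init()   */

          down_write(&lgr->llc_conf_mutex);       /* was: mutex_lock()   */
          /* add or delete links, change link state */
          up_write(&lgr->llc_conf_mutex);         /* was: mutex_unlock() */
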
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'updates-to-enetc-txq-management' · 88c940cc
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      Updates to ENETC TXQ management
      
      The set ensures that the number of TXQs given by enetc to the network
      stack (mqprio or TX hashing) + the number of TXQs given to XDP never
      exceeds the number of available TXQs.
      
      These are the first 4 patches of series "[v5,net-next,00/17] ENETC
      mqprio/taprio cleanup" from here:
      https://patchwork.kernel.org/project/netdevbpf/cover/20230202003621.2679603-1-vladimir.oltean@nxp.com/
      
There is no change in these patches compared to that series. I split
them off because this contains a fix for net-next and it would be good
if it could go in quickly. I also did it to reduce the patch count of
that other series, in case I need to respin it again.
      ====================
      
      Link: https://lore.kernel.org/r/20230203001116.3814809-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: enetc: ensure we always have a minimum number of TXQs for stack · 800db2d1
      Vladimir Oltean authored
      Currently it can happen that an mqprio qdisc is installed with num_tc 8,
      and this will reserve 8 (out of 8) TXQs for the network stack. Then we
      can attach an XDP program, and this will crop 2 TXQs, leaving just 6 for
      mqprio. That's not what the user requested, and we should fail it.
      
      On the other hand, if mqprio isn't requested, we still give the 8 TXQs
      to the network stack (with hashing among a single traffic class), but
      then, cropping 2 TXQs for XDP is fine, because the user didn't
      explicitly ask for any number of TXQs, so no expectations are violated.
      
Simply put, the logic that mqprio should impose a minimum number of TXQs
for the network stack never existed. Let's say (more or less arbitrarily) that
      without mqprio, the driver expects a minimum number of TXQs equal to the
      number of CPUs (on NXP LS1028A, that is either 1, or 2). And with mqprio,
      mqprio gives the minimum required number of TXQs.
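
      A sketch of the resulting check (the helper and the error message
      approximate the patch's intent rather than quoting it):

          static int enetc_check_stack_tx_queues(struct enetc_ndev_priv *priv,
                                                 struct netlink_ext_ack *extack)
          {
                  int num_stack_tx_queues = enetc_num_stack_tx_queues(priv);

                  /* min_num_stack_tx_queues: num_possible_cpus() without
                   * mqprio, or whatever the mqprio config requires */
                  if (num_stack_tx_queues < priv->min_num_stack_tx_queues) {
                          NL_SET_ERR_MSG_MOD(extack,
                                             "Too few TX queues left for the stack");
                          return -EBUSY;
                  }
                  return 0;
          }
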
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: enetc: recalculate num_real_tx_queues when XDP program attaches · 4ea1dd74
      Vladimir Oltean authored
      Since the blamed net-next commit, enetc_setup_xdp_prog() no longer goes
      through enetc_open(), and therefore, the function which was supposed to
      detect whether a BPF program exists (in order to crop some TX queues
      from network stack usage), enetc_num_stack_tx_queues(), no longer gets
      called.
      
      We can move the netif_set_real_num_rx_queues() call to enetc_alloc_msix()
      (probe time), since it is a runtime invariant. We can do the same thing
      with netif_set_real_num_tx_queues(), and let enetc_reconfigure_xdp_cb()
      explicitly recalculate and change the number of stack TX queues.
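
      A sketch of the recalculation in the XDP reconfigure callback
      (simplified from the intent described above):

          static int enetc_reconfigure_xdp_cb(struct enetc_ndev_priv *priv,
                                              void *ctx)
          {
                  struct bpf_prog *prog = ctx;
                  int num_stack_tx_queues;
                  int err;

                  /* runs with the netdev down, via enetc_reconfigure() */
                  priv->xdp_prog = prog;

                  num_stack_tx_queues = enetc_num_stack_tx_queues(priv);
                  err = netif_set_real_num_tx_queues(priv->ndev,
                                                     num_stack_tx_queues);
                  if (err)
                          return err;

                  /* ... propagate prog to the RX rings ... */
                  return 0;
          }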
      
      Fixes: c33bfaf9 ("net: enetc: set up XDP program under enetc_reconfigure()")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: enetc: allow the enetc_reconfigure() callback to fail · 46a0ecf9
      Vladimir Oltean authored
      enetc_reconfigure() was modified in commit c33bfaf9 ("net: enetc:
      set up XDP program under enetc_reconfigure()") to take an optional
      callback that runs while the netdev is down, but this callback currently
      cannot fail.
      
      Code up the error handling so that the interface is restarted with the
      old resources if the callback fails.
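
      A sketch of the error-handling shape (the teardown/restart helpers
      here are hypothetical; the point is that a callback failure still
      restarts the interface with the old resources):

          static int enetc_reconfigure(struct enetc_ndev_priv *priv,
                                       bool extended,
                                       int (*cb)(struct enetc_ndev_priv *priv,
                                                 void *ctx),
                                       void *ctx)
          {
                  struct net_device *ndev = priv->ndev;
                  int err = 0;

                  if (!netif_running(ndev))
                          return cb ? cb(priv, ctx) : 0;

                  enetc_tear_down(priv);          /* hypothetical helper */

                  if (cb)
                          err = cb(priv, ctx);

                  /* fall through even on failure: bring the interface back
                   * up with the old resources so it stays usable */
                  enetc_bring_up(priv, extended); /* hypothetical helper */
                  return err;
          }
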
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: enetc: simplify enetc_num_stack_tx_queues() · 1c81a9b3
      Vladimir Oltean authored
      We keep a pointer to the xdp_prog in the private netdev structure as
      well; what's replicated per RX ring is done so just for more convenient
      access from the NAPI poll procedure.
      
      Simplify enetc_num_stack_tx_queues() by looking at priv->xdp_prog rather
      than iterating through the information replicated per RX ring.
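
      A sketch of the simplified helper (close to the described result):

          static int enetc_num_stack_tx_queues(struct enetc_ndev_priv *priv)
          {
                  int num_tx_rings = priv->num_tx_rings;

                  /* was: scan every RX ring for its replicated xdp.prog */
                  if (priv->xdp_prog)
                          num_tx_rings -= num_possible_cpus();

                  return num_tx_rings;
          }
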
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'raw-add-drop-reasons-and-use-another-hash-function' · 8788260e
      Jakub Kicinski authored
      Eric Dumazet says:
      
      ====================
      raw: add drop reasons and use another hash function
      
The first two patches add drop reasons to the raw input processing.

The last patch spreads RAW sockets across the shared hash tables
to avoid long hash buckets in some cases.
      ====================
      
      Link: https://lore.kernel.org/r/20230202094100.3083177-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • raw: use net_hash_mix() in hash function · 6579f5ba
      Eric Dumazet authored
      Some applications seem to rely on RAW sockets.
      
      If they use private netns, we can avoid piling all RAW
      sockets bound to a given protocol into a single bucket.
      
      Also place (struct raw_hashinfo).lock into its own
      cache line to limit false sharing.
      
An alternative would be to have per-netns hash tables, but this seems
too expensive for most netns, where RAW sockets are not used.
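
      A sketch of both changes (constants and names approximate):

          #include <linux/hash.h>
          #include <net/netns/hash.h>

          #define RAW_HTABLE_LOG  8
          #define RAW_HTABLE_SIZE (1U << RAW_HTABLE_LOG)

          struct raw_hashinfo {
                  spinlock_t lock;
                  /* own cache line: the write-side lock no longer shares a
                   * line with buckets that are read on every lookup */
                  struct hlist_head ht[RAW_HTABLE_SIZE] ____cacheline_aligned;
          };

          static inline u32 raw_hashfunc(const struct net *net, u32 proto)
          {
                  /* was: a proto-only bucket, shared across every netns */
                  return hash_32(net_hash_mix(net) ^ proto, RAW_HTABLE_LOG);
          }
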
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv4: raw: add drop reasons · 42186e6c
      Eric Dumazet authored
      Use existing helpers and drop reason codes for RAW input path.
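
      A sketch of the pattern (close to the shape of raw_rcv_skb(),
      simplified; the reason-aware helpers report why a packet was
      dropped instead of an anonymous kfree_skb()):

          static int raw_rcv_skb(struct sock *sk, struct sk_buff *skb)
          {
                  enum skb_drop_reason reason;

                  if (sock_queue_rcv_skb_reason(sk, skb, &reason) < 0) {
                          /* the drop now shows up attributed in tracing
                           * and drop monitor */
                          kfree_skb_reason(skb, reason);
                          return NET_RX_DROP;
                  }
                  return NET_RX_SUCCESS;
          }
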
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv6: raw: add drop reasons · 8d8ebd77
      Eric Dumazet authored
      Use existing helpers and drop reason codes for RAW input path.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'devlink-move-devlink-dev-code-to-a-separate-file' · dfefcb0c
      Jakub Kicinski authored
      Moshe Shemesh says:
      
      ====================
      devlink: Move devlink dev code to a separate file
      
This patchset moves code from the file leftover.c to the new file dev.c.
About 1.3K lines are moved, covering most of the devlink dev object
callbacks and functionality: reload, eswitch, info, flash and
selftest.
      ====================
      
      Link: https://lore.kernel.org/r/1675349226-284034-1-git-send-email-moshe@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: Move devlink dev selftest code to dev · 7c976c7c
      Moshe Shemesh authored
      Move devlink dev selftest callbacks and related code from leftover.c to
      file dev.c. No functional change in this patch.
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: Move devlink_info_req struct to be local · ec4a0ce9
      Moshe Shemesh authored
      As all users of the struct devlink_info_req are already in dev.c, move
      this struct from devl_internal.c to be local in dev.c.
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: Move devlink dev flash code to dev · a13aab66
      Moshe Shemesh authored
      Move devlink dev flash callbacks, helpers and other related code from
      leftover.c to dev.c. No functional change in this patch.
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: Move devlink dev info code to dev · d60191c4
      Moshe Shemesh authored
      Move devlink dev info callbacks, related drivers helpers functions and
      other related code from leftover.c to dev.c. No functional change in
      this patch.
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: Move devlink dev eswitch code to dev · af2f8c1f
      Moshe Shemesh authored
      Move devlink dev eswitch callbacks and related code from leftover.c to
      file dev.c. No functional change in this patch.
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: Move devlink dev reload code to dev · c6ed7d6e
      Moshe Shemesh authored
      Move devlink dev reload callback and related code from leftover.c to
      file dev.c. No functional change in this patch.
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>