1. 08 Aug, 2023 25 commits
    • workqueue: Implement non-strict affinity scope for unbound workqueues · 8639eceb
      Tejun Heo authored
      An unbound workqueue can be served by multiple worker_pools to improve
      locality. The segmentation is achieved by grouping CPUs into pods. By
      default, the cache boundaries according to cpus_share_cache() define how
      the CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs
      and the system has two L3 caches. The workqueue would be mapped to two
      worker_pools, each serving one L3 cache domain.
      
      While this improves locality, because the pod boundaries are strict, it
      limits the total bandwidth a given issuer can consume. For example, let's
      say there is a thread pinned to a CPU issuing enough work items to saturate
      the whole machine. With the machine segmented into two pods, no matter how
      many work items it issues, it can only use half of the CPUs on the system.
      
      While this limitation has existed for a very long time, it wasn't very
      pronounced because the affinity grouping used to be always by NUMA nodes.
      With cache boundaries as the default and support for even finer grained
      scopes (smt and cpu), it is now a much more pressing problem.
      
      This patch implements non-strict affinity scope where the pod boundaries
      aren't enforced strictly. Going back to the previous example, the workqueue
      would still be mapped to two worker_pools; however, the affinity enforcement
      would be soft. The workers in both pools would have their cpus_allowed set
      to the whole machine thus allowing the scheduler to migrate them anywhere on
      the machine. However, whenever an idle worker is woken up, the workqueue
      code asks the scheduler to bring back the task within the pod if the worker
      is outside. That is, work items start executing within their affinity scope
      but can be migrated outside as the scheduler sees fit. This removes the hard
      cap on utilization while maintaining the benefits of affinity scopes.
      
      After the earlier ->__pod_cpumask changes, the implementation is pretty
      simple. When non-strict, which is the new default:
      
      * pool_allowed_cpus() returns @pool->attrs->cpumask instead of
        ->__pod_cpumask so that the workers are allowed to run on any CPU that
        the associated workqueues allow.
      
      * If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
        the field to a CPU within the pod.
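
      As a rough illustration of the two points above, the following Python
      model (not kernel code) treats cpumasks as sets. PoolAttrs,
      pick_wake_cpu() and the choice of the lowest-numbered pod CPU are
      assumptions made for this sketch; only pool_allowed_cpus(), ->cpumask,
      ->__pod_cpumask and affn_strict are names taken from this series.

```python
from dataclasses import dataclass


@dataclass
class PoolAttrs:
    cpumask: frozenset        # CPUs the associated workqueues allow
    pod_cpumask: frozenset    # the pod's subset (models ->__pod_cpumask)
    affn_strict: bool = False # non-strict is the new default


def pool_allowed_cpus(attrs: PoolAttrs) -> frozenset:
    """CPUs that the pool's workers are allowed to run on."""
    return attrs.pod_cpumask if attrs.affn_strict else attrs.cpumask


def pick_wake_cpu(attrs: PoolAttrs, wake_cpu: int) -> int:
    """Model of the wake-up nudge: pull an idle worker back into its pod."""
    if attrs.affn_strict or wake_cpu in attrs.pod_cpumask:
        return wake_cpu
    # Assumption: any CPU in the pod would do; the lowest keeps this deterministic.
    return min(attrs.pod_cpumask)
```

      With an 8-CPU machine split into two 4-CPU pods, a non-strict pool for
      the first pod may run anywhere, but a worker last seen on CPU 6 would be
      steered back to a CPU in 0-3 at wake-up.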
      
      This would be the first use of task_struct->wake_cpu outside scheduler
      proper, so it isn't clear whether this would be acceptable. However, other
      methods of migrating tasks are significantly more expensive and are likely
      prohibitively so if we want to do this on every work item. This needs
      discussion with scheduler folks.
      
      There is also a race window where setting ->wake_cpu wouldn't be effective
      as the target task is still on CPU. However, the window is pretty small and
      this being a best-effort optimization, it doesn't seem to warrant more
      complexity at the moment.
      
      While the non-strict cache affinity scopes seem to be the best option, the
      performance picture interacts with the affinity scope and is a bit
      complicated to fully discuss in this patch, so the behavior is made easily
      selectable through wqattrs and sysfs and the next patch will add
      documentation to discuss performance implications.
      
      v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
    • workqueue: Add workqueue_attrs->__pod_cpumask · 9546b29e
      Tejun Heo authored
      workqueue_attrs has two uses:
      
      * to specify the required unbound workqueue properties by users
      
      * to match worker_pool's properties to workqueues by core code
      
      For example, if the user wants to restrict a workqueue to run only on CPUs 0
      and 2, and the two CPUs are in different affinity scopes, the workqueue's
      attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
      associated with two worker_pools, one with attrs->cpumask containing just
      CPU 0 and the other CPU 2.
      
      Workqueue wants to support non-strict affinity scopes where work items are
      started in their matching affinity scopes but the scheduler is free to
      migrate them outside the starting scopes, which can enable utilizing the
      whole machine while maintaining most of the locality benefits from affinity
      scopes.
      
      To enable that, a worker_pool needs to distinguish the strict affinity that it
      has to follow (because that's the restriction coming from the user) and the
      soft affinity that it wants to apply when dispatching work items. Note that
      two worker_pools with different soft dispatching requirements have to be
      separate; otherwise, for example, we'd be ping-ponging worker threads across
      NUMA boundaries constantly.
      
      This patch adds workqueue_attrs->__pod_cpumask. The new field is double
      underscored as it's only used internally to distinguish worker_pools. A
      worker_pool's ->cpumask is now always the same as the online subset of
      allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
      subset of that ->cpumask. Going back to the example above, both worker_pools
      would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
      would contain 0 while the other's 2.
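
      The relationship between the two masks can be sketched in Python with
      sets standing in for cpumasks. pod_attrs() and the pod layout used
      below are hypothetical and only mirror the CPUs 0 and 2 example; CPU
      online filtering is left out.

```python
def pod_attrs(wq_cpumask, pods):
    """Return one (cpumask, pod_cpumask) pair per pod the workqueue touches."""
    wq_cpumask = frozenset(wq_cpumask)
    result = []
    for pod in pods:
        pod_cpumask = wq_cpumask & frozenset(pod)
        if pod_cpumask:
            # Every pool keeps the full workqueue cpumask; only the pod
            # subset differs between pools of the same workqueue.
            result.append((wq_cpumask, pod_cpumask))
    return result
```

      For pods {0, 1} and {2, 3} and a workqueue limited to CPUs 0 and 2, this
      yields two entries that share the same cpumask {0, 2} while their pod
      subsets are {0} and {2}.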
      
      * pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
        that the pool's workers must stay within. This is currently always
        ->__pod_cpumask as all boundaries are still strict.
      
      * As a workqueue_attrs can now track both the associated workqueues' cpumask
        and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
        out-argument. Drop @cpumask and instead store the result in
        ->__pod_cpumask.
      
      * The above also simplifies apply_wqattrs_prepare() as the same
        workqueue_attrs can be used to create all pods associated with a
        workqueue. tmp_attrs is dropped.
      
      * wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
        update is needed instead of only comparing ->cpumask so that
        ->__pod_cpumask is compared too. It could directly compare ->__pod_cpumask
        but the code is easier to understand and more robust this way.
      
      The only user-visible behavior change is that two workqueues with different
      cpumasks no longer can share worker_pools even when their pod subsets
      coincide. Going back to the example, let's say there's another workqueue
      with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
      to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
      the same cpumask as the first pod of the earlier example and would have
      shared the same worker_pool but that's no longer the case after this patch.
      The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
      wouldn't match.
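
      A small Python model may make the sharing change easier to see.
      PoolTable is a hypothetical stand-in for the unbound pool lookup; the
      assumption is simply that a pool is reused only when both masks match.

```python
class PoolTable:
    """Hypothetical pool lookup keyed by (cpumask, pod_cpumask)."""

    def __init__(self):
        self._pools = {}

    def get_pool(self, wq_cpumask, pod_cpus):
        cpumask = frozenset(wq_cpumask)
        pod_cpumask = cpumask & frozenset(pod_cpus)
        key = (cpumask, pod_cpumask)
        # Reuse an existing pool only when both masks are identical.
        if key not in self._pools:
            self._pools[key] = {"cpumask": cpumask, "pod_cpumask": pod_cpumask}
        return self._pools[key]
```

      Looking up the pod {0, 1} for workqueues with cpumasks {0, 2} and
      {0, 2, 3} gives the same pod subset {0} in both cases, yet two distinct
      pools, because the full cpumasks differ.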
      
      While this is necessary to support non-strict affinity scopes, there can be
      further optimizations to maintain sharing among strict affinity scopes.
      However, non-strict affinity scopes are going to be preferable for most use
      cases and we don't see a very diverse mixture of unbound workqueue cpumasks
      anyway, so the additional overhead doesn't seem to justify the extra
      complexity.
      
      v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
            to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
            using wqattrs_equal() for comparison instead.
      
          - Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
            a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Factor out need_more_worker() check and worker wake-up · 0219a352
      Tejun Heo authored
      Checking need_more_worker() and calling wake_up_worker() is a repeated
      pattern. Let's add kick_pool(), which checks need_more_worker() and
      open-codes wake_up_worker(), and replace wake_up_worker() uses. The following
      conversions aren't one-to-one:
      
      * __queue_work() was using __need_more_worker() because it knows that
        pool->worklist isn't empty. Switching to kick_pool() adds an extra
        list_empty() test.
      
      * create_worker() always needs to wake up the newly minted worker whether
        there's more work to do or not to avoid triggering hung task check on the
        new task. Keep the current wake_up_process() and still add kick_pool().
        This may lead to an extra wakeup which isn't harmful.
      
      * pwq_adjust_max_active() was explicitly checking whether it needs to wake
        up a worker or not to avoid spurious wakeups. As kick_pool() only wakes up
        a worker when necessary, this explicit check is no longer necessary and
        dropped.
      
      * unbind_workers() now calls kick_pool() instead of wake_up_worker() adding
        a need_more_worker() test. This avoids spurious wakeups and shouldn't
        break anything.
      
      wake_up_worker() is dropped as kick_pool() replaces all its users. After
      this patch, all paths that wake up a non-rescuer worker to initiate work
      item execution use kick_pool(). This will enable future changes to improve
      locality.
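
      The combined check-and-wake pattern can be modelled in a few lines of
      Python. The Pool fields, the woken list and the boolean return value
      are simplifications assumed for this sketch; in the kernel the wake-up
      is a wake_up_process() call on the first idle worker's task.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Pool:
    worklist: deque = field(default_factory=deque)    # pending work items
    idle_workers: list = field(default_factory=list)  # idle workers, first is next
    nr_running: int = 0                               # workers currently running
    woken: list = field(default_factory=list)         # wake-ups recorded by the model


def need_more_worker(pool: Pool) -> bool:
    """There is pending work and nobody is running to pick it up."""
    return bool(pool.worklist) and pool.nr_running == 0


def kick_pool(pool: Pool) -> bool:
    """Wake the first idle worker if the pool needs one."""
    if not need_more_worker(pool) or not pool.idle_workers:
        return False
    # The worker leaves the idle list on its own once it starts running.
    pool.woken.append(pool.idle_workers[0])
    return True
```

      Because the need_more_worker() test lives inside kick_pool(), callers
      such as the unbind path get spurious-wakeup avoidance without repeating
      the check themselves.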
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Factor out work to worker assignment and collision handling · 873eaca6
      Tejun Heo authored
      The two work execution paths in worker_thread() and rescuer_thread() use
      move_linked_works() to claim work items from @pool->worklist. Once claimed,
      process_scheduled_works() is called which invokes process_one_work() on each
      work item. process_one_work() then uses find_worker_executing_work() to
      detect and handle collisions - situations where the work item to be executed
      is still running on another worker.
      
      This works fine, but, to improve work execution locality, we want to
      establish work to worker association earlier and know for sure that the
      worker is going to execute the work once assigned, which requires performing
      collision handling earlier while trying to assign the work item to the
      worker.
      
      This patch introduces assign_work() which assigns a work item to a worker
      using move_linked_works() and then performs collision handling. As collision
      handling is handled earlier, process_one_work() no longer needs to worry
      about them.
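
      A simplified Python model of the idea follows. It ignores linked work
      items and locking, and the exact ordering of the collision check
      relative to the list move is an assumption of the sketch rather than a
      statement about the patch. Worker and the list-based scheduled queue
      are illustrative stand-ins.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Worker:
    name: str
    current_work: Optional[str] = None             # work item being executed
    scheduled: list = field(default_factory=list)  # items claimed by this worker


def find_worker_executing_work(workers, work):
    """Return the worker currently executing @work, if any."""
    for candidate in workers:
        if candidate.current_work == work:
            return candidate
    return None


def assign_work(workers, work, worker):
    """Assign @work to @worker unless another worker is still running it."""
    collision = find_worker_executing_work(workers, work)
    if collision is not None:
        # Defer to the worker already running this item to avoid reentrancy.
        collision.scheduled.append(work)
        return False
    worker.scheduled.append(work)
    return True
```

      A True return means the calling worker is certain to execute the item,
      which is the guarantee the later locality changes rely on.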
      
      After this patch, collision checks for linked work items are skipped,
      which should be fine as they can't be queued multiple times concurrently.
      For work items running from rescuers, the timing of collision handling may
      change but the invariant that the work items go through collision handling
      before starting execution does not.
      
      This patch shouldn't cause noticeable behavior changes, especially given
      that worker_thread() behavior remains the same.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Add multiple affinity scopes and interface to select them · 63c5484e
      Tejun Heo authored
      Add three more affinity scopes - WQ_AFFN_CPU, SMT and CACHE - and make CACHE
      the default. The code changes to actually add the additional scopes are
      trivial.
      
      Also add module parameter "workqueue.default_affinity_scope" to override the
      default scope and "affinity_scope" sysfs file to configure it per workqueue.
      wq_dump.py and documentation are updated accordingly.
      
      This enables significant flexibility in configuring how unbound workqueues
      behave. If affinity scope is set to "cpu", it'll behave close to a per-cpu
      workqueue. On the other hand, "system" removes all locality boundaries.
      
      Many modern machines often have multiple L3 caches while being mostly
      uniform in terms of memory access. Thus, workqueue's previous behavior of
      spreading work items in each NUMA node had negative performance implications
      from unnecessarily crossing L3 boundaries between issue and execution.
      However, picking a finer grained affinity scope also has a downside in that
      an issuer in one group can't utilize CPUs in other groups.
      
      While dependent on the specifics of workload, there's usually a noticeable
      penalty in crossing L3 boundaries, so let's default to CACHE. This issue
      will be further addressed and documented with examples in future patches.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Modularize wq_pod_type initialization · 025e1684
      Tejun Heo authored
      While wq_pod_type[] can now group CPUs in any arbitrary way, WQ_AFFN_NUMA
      init is hard coded into workqueue_init_topology(). This patch modularizes
      the init path by introducing init_pod_type() which takes a callback to
      determine whether two CPUs should share a pod as an argument.
      
      init_pod_type() first scans the CPU combinations testing for sharing to
      assign consecutive pod IDs and initialize pod_type->cpu_pod[]. Once
      ->cpu_pod[] is determined, ->pod_cpus[] and ->pod_node[] are initialized
      accordingly. WQ_AFFN_NUMA is now initialized by calling init_pod_type() with
      cpus_share_numa() which tests whether two CPUs belong to the same NUMA node.
      
      This patch may change the pod ID assigned to each NUMA node but that
      shouldn't cause any behavior changes as the NUMA node to use for allocations
      is tracked separately in pod_type->pod_node[]. This makes adding new
      affinity types pretty easy.
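
      The scan described above can be approximated in Python as follows. This
      is a model under simplifying assumptions: CPUs are plain integers in a
      list, the callback is any two-argument predicate, and ->pod_node[]
      initialization is omitted.

```python
def init_pod_type(cpus, cpus_share_pod):
    """Group CPUs into pods using a pairwise sharing callback.

    Returns (cpu_pod, pod_cpus): the CPU -> pod ID mapping and the list of
    member CPU sets indexed by pod ID.
    """
    cpu_pod = {}
    nr_pods = 0
    for cur in cpus:
        for pre in cpus:
            if pre == cur:
                # No earlier CPU shares a pod with @cur: open a new pod.
                cpu_pod[cur] = nr_pods
                nr_pods += 1
                break
            if cpus_share_pod(cur, pre):
                # Join the pod of the first earlier CPU that shares with @cur.
                cpu_pod[cur] = cpu_pod[pre]
                break
    pod_cpus = [set() for _ in range(nr_pods)]
    for cpu, pod in cpu_pod.items():
        pod_cpus[pod].add(cpu)
    return cpu_pod, pod_cpus
```

      Supporting another scope then amounts to passing a different predicate,
      for instance one that compares a hypothetical per-CPU cache ID instead
      of the NUMA node.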
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration · 7f7dc377
      Tejun Heo authored
      Lack of visibility has always been a pain point for workqueues. While the
      recently added wq_monitor.py improved the situation, it's still difficult to
      understand what worker pools are active in the system, how workqueues map to
      them and why. The lack of visibility into how workqueues are configured is
      going to become more noticeable as workqueue improves locality awareness and
      provides more mechanisms to customize locality related behaviors.
      
      Now that the basic framework for more flexible locality support is in place,
      this is a good time to improve the situation. This patch adds
      tools/workqueue/wq_dump.py which prints out the topology configuration,
      worker pools and how workqueues are mapped to pools. Read the command's help
      message for more details.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Generalize unbound CPU pods · 84193c07
      Tejun Heo authored
      While renamed to pod, the code still assumes that the pods are defined by
      NUMA boundaries. Let's generalize it:
      
      * workqueue_attrs->affn_scope is added. Each enum represents the type of
        boundaries that define the pods. There are currently two scopes -
        WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before
        - one pod per NUMA node. The latter defines one global pod across the
        whole system.
      
      * struct wq_pod_type is added which describes how pods are configured for
        each affinity scope. For each pod, it lists the member CPUs and the
        preferred NUMA node for memory allocations. The reverse mapping from CPU
        to pod is also available.
      
      * wq_pod_enabled is dropped. Pod is now always enabled. The previously
        disabled behavior is now implemented through WQ_AFFN_SYSTEM.
      
      * get_unbound_pool() wants to determine the NUMA node to allocate memory
        from for the new pool. The variables are renamed from node to pod but the
        logic still assumes they're one and the same. Clearly distinguish them -
        walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's
        NUMA node.
      
      * wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA
        node. Take @cpu instead and determine the cpumask to use from the pod_type
        matching @attrs.
      
      * apply_wqattrs_prepare() is updated to return ERR_PTR() on error instead of
        NULL so that it can indicate -EINVAL on invalid affinity scopes.
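
      The pod description and the NUMA node lookup for a new pool could be
      modelled roughly like this. PodType mirrors the fields named above;
      node_for_cpumask() and the use of a subset test to find the "matching"
      pod are assumptions of the sketch.

```python
from dataclasses import dataclass

NUMA_NO_NODE = -1  # same sentinel value the kernel uses for "no specific node"


@dataclass
class PodType:
    pod_cpus: list  # pod ID -> set of member CPUs
    pod_node: list  # pod ID -> preferred NUMA node for allocations
    cpu_pod: dict   # CPU -> pod ID (reverse mapping)


def node_for_cpumask(numa_pods: PodType, cpumask) -> int:
    """Pick the NUMA node for a pool by walking the NUMA-scope pods."""
    cpumask = set(cpumask)
    for pod, cpus in enumerate(numa_pods.pod_cpus):
        # Assumption: a pod matches when it fully contains the pool's CPUs.
        if cpumask <= cpus:
            return numa_pods.pod_node[pod]
    return NUMA_NO_NODE
```

      Keeping pod IDs and NUMA node IDs separate is what allows a pool that
      belongs to, say, a cache-scope pod to still allocate from the right
      node.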
      
      This patch allows CPUs to be grouped into pods however desired per type.
      While this patch causes some internal behavior changes, nothing material
      should change for workqueue users.
      
      v2: Trigger WARN_ON_ONCE() in wqattrs_pod_type() if affn_scope is
          WQ_AFFN_NR_TYPES which indicates that the function is called with a
          worker_pool's attrs instead of a workqueue's.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Factor out clearing of workqueue-only attrs fields · 5de7a03c
      Tejun Heo authored
      workqueue_attrs can be used for both workqueues and worker_pools. However,
      some fields, currently only ->ordered, only apply to workqueues and should
      be cleared to the default / invalid values.
      
      Currently, an unbound workqueue explicitly clears attrs->ordered in
      get_unbound_pool() after copying the source workqueue attrs, while per-cpu
      workqueues rely on the fact that zeroing on allocation gives us the desired
      default value for pool->attrs->ordered.
      
      This is fragile. Let's add wqattrs_clear_for_pool() which clears
      attrs->ordered and is called from both init_worker_pool() and
      get_unbound_pool(). This will ease adding more workqueue-only attrs fields.
      
      In get_unbound_pool(), pool->node initialization is moved upwards for
      readability. This shouldn't cause any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod() · 0f36ee24
      Tejun Heo authored
      For an unbound pool, multiple cpumasks are involved.
      
      U: The user-specified cpumask (may be filtered with cpu_possible_mask).
      
      A: The actual cpumask filtered by wq_unbound_cpumask. If the filtering
         leaves no CPU, wq_unbound_cpumask is used.
      
      P: Per-pod subsets of #A.
      
      wq->attrs stores #U, wq->dfl_pwq->pool->attrs->cpumask #A, and
      wq->cpu_pwq[CPU]->pool->attrs->cpumask #P.
      
      wq_update_pod() is called to update per-pod pwq's during CPU hotplug. To
      calculate the new #P for each workqueue, it needs to call
      wq_calc_pod_cpumask() with @attrs that contains #A. Currently,
      wq_update_pod() achieves this by calling wq_calc_pod_cpumask() with
      wq->dfl_pwq->pool->attrs.
      
      This is rather fragile because we're calling wq_calc_pod_cpumask() with
      @attrs of a worker_pool rather than the workqueue's actual attrs when what
      we want to calculate is the workqueue's cpumask on the pod. While this works
      fine currently, future changes will add fields which are used differently
      between workqueues and worker_pools and this subtlety will bite us.
      
      This patch factors out #U -> #A calculation from apply_wqattrs_prepare()
      into wqattrs_actualize_cpumask() and updates wq_update_pod() to copy
      wq->unbound_attrs and use the new helper to obtain #A freshly instead of
      abusing wq->dfl_pwq->pool->attrs.
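
      The #U -> #A -> #P chain can be written out in Python with sets as
      cpumasks. The function names below are hypothetical shorthands, and the
      per-pod step is reduced to a plain intersection; any additional
      filtering or fallback the kernel applies at that stage is not modelled.

```python
def actualize_cpumask(user_cpumask, unbound_cpumask):
    """#U -> #A: filter by wq_unbound_cpumask, falling back to it if empty."""
    actual = set(user_cpumask) & set(unbound_cpumask)
    return actual if actual else set(unbound_cpumask)


def calc_pod_cpumask(actual_cpumask, pod_cpus):
    """#A -> #P: the subset of the actual cpumask that lies within one pod."""
    return set(actual_cpumask) & set(pod_cpus)
```

      For example, #U = {0, 1, 2, 3} filtered by wq_unbound_cpumask =
      {2, 3, 4, 5} gives #A = {2, 3}, and for a pod covering CPUs {3, 4} the
      resulting #P is {3}.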
      
      This shouldn't cause any behavior changes in the current code.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Reference: http://lkml.kernel.org/r/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com
    • workqueue: Initialize unbound CPU pods later in the boot · 2930155b
      Tejun Heo authored
      During boot, to initialize unbound CPU pods, wq_pod_init() was called from
      workqueue_init(). This is early enough for NUMA nodes to be set up but
      before SMP is brought up and CPU topology information is populated.
      
      Workqueue is in the process of improving CPU locality for unbound workqueues
      and will need access to topology information during pod init. This adds a
      new init function workqueue_init_topology() which is called after CPU
      topology information is available and replaces wq_pod_init().
      
      As unbound CPU pods are now initialized after workqueues are activated, we
      need to revisit the workqueues to apply the pod configuration. Workqueues
      which are created before workqueue_init_topology() are set up so that they
      always use the default worker pool. After pods are set up in
      workqueue_init_topology(), wq_update_pod() is called on all existing
      workqueues to update the pool associations accordingly.
      
      Note that wq_update_pod_attrs_buf allocation is moved to
      workqueue_init_early(). This isn't necessary right now but enables further
      generalization of pod handling in the future.
      
      This patch changes the initialization sequence but the end result should be
      the same.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Move wq_pod_init() below workqueue_init() · a86feae6
      Tejun Heo authored
      wq_pod_init() is called from workqueue_init() and responsible for
      initializing unbound CPU pods according to NUMA node. Workqueue is in the
      process of improving affinity awareness and wants to use other topology
      information to initialize unbound CPU pods; however, unlike NUMA nodes,
      other topology information isn't yet available in workqueue_init().
      
      The next patch will introduce a later stage init function for workqueue
      which will be responsible for initializing unbound CPU pods. Relocate
      wq_pod_init() below workqueue_init() where the new init function is going to
      be located so that the diff can show the content differences.
      
      Just a relocation. No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Rename NUMA related names to use pod instead · fef59c9c
      Tejun Heo authored
      Workqueue is in the process of improving CPU affinity awareness. It will
      become more flexible and won't be tied to NUMA node boundaries. This patch
      renames all NUMA related names in workqueue.c to use "pod" instead.
      
      While "pod" isn't a very common term, it's short and captures the grouping of
      CPUs well enough. These names are only going to be used within workqueue
      implementation proper, so the specific naming doesn't matter that much.
      
      * wq_numa_possible_cpumask -> wq_pod_cpus
      
      * wq_numa_enabled -> wq_pod_enabled
      
      * wq_update_unbound_numa_attrs_buf -> wq_update_pod_attrs_buf
      
      * workqueue_select_cpu_near -> select_numa_node_cpu
      
        This rename is different from others. The function is only used by
        queue_work_node() and specifically tries to find a CPU in the specified
        NUMA node. As workqueue affinity will become more flexible and untied from
        NUMA, this function's name should specifically describe that it's for
        NUMA.
      
      * wq_calc_node_cpumask -> wq_calc_pod_cpumask
      
      * wq_update_unbound_numa -> wq_update_pod
      
      * wq_numa_init -> wq_pod_init
      
      * node -> pod in local variables
      
      Only renames. No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Rename workqueue_attrs->no_numa to ->ordered · af73f5c9
      Tejun Heo authored
      With the recent removal of NUMA related module param and sysfs knob,
      workqueue_attrs->no_numa is now only used to implement ordered workqueues.
      Let's rename the field so that it's less confusing especially with the
      planned CPU affinity awareness improvements.
      
      Just a rename. No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Make unbound workqueues to use per-cpu pool_workqueues · 636b927e
      Tejun Heo authored
      A pwq (pool_workqueue) represents an association between a workqueue and a
      worker_pool. When a work item is queued, the workqueue selects the pwq to
      use, which in turn determines the pool, and queues the work item to the pool
      through the pwq. pwq is also what implements the maximum concurrency limit -
      @max_active.
      
      As a per-cpu workqueue should be associated with a different worker_pool on
      each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
      However, unbound workqueues were sharing a pwq within each NUMA node by
      default. The sharing has several downsides:
      
      * Because @max_active is per-pwq, the meaning of @max_active changes
        depending on the machine configuration and whether workqueue NUMA locality
        support is enabled.
      
      * Makes per-cpu and unbound code deviate.
      
      * Gets in the way of making workqueue CPU locality awareness more flexible.
      
      This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
      workqueues do by making the following changes:
      
      * wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
        just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
        workqueues.
      
      * numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
        the specified pwq to the target CPU's wq->cpu_pwq.
      
      * apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
        unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
        This makes the return value of wq_calc_node_cpumask() unnecessary. It now
        returns void.
      
      * @max_active now means the same thing for both per-cpu and unbound
        workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
        documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
        used in workqueue implementation and will be removed later.
      
      * All unbound pwq operations which used to be per-numa-node are now per-cpu.
      
      For most unbound workqueue users, this shouldn't cause noticeable changes.
      Work item issue and completion will be a small bit faster, flush_workqueue()
      would become a bit more expensive, and the total concurrency limit would
      likely become higher. All @max_active==1 use cases are currently being
      audited for conversion into alloc_ordered_workqueue() and they shouldn't be
      affected once the audit and conversion is complete.
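
      To get a feel for how the total concurrency limit shifts, a
      back-of-the-envelope Python calculation is shown below. It assumes a
      non-ordered workqueue allowed on every CPU and every pwq being
      saturated; both helper names are made up for the illustration.

```python
def total_max_active(max_active, nr_pwqs):
    """Upper bound on concurrently active work items across all pwqs."""
    return max_active * nr_pwqs


def compare_limits(max_active, cpu_to_node):
    """Return the (before, after) bounds for a given CPU -> NUMA node map."""
    nr_nodes = len(set(cpu_to_node.values()))
    nr_cpus = len(cpu_to_node)
    before = total_max_active(max_active, nr_nodes)  # one pwq per NUMA node
    after = total_max_active(max_active, nr_cpus)    # one pwq per CPU
    return before, after
```

      On a hypothetical 8-CPU, 2-node machine with @max_active of 16, the
      bound moves from 32 to 128, which is why @max_active==1 users that
      depend on strict ordering need alloc_ordered_workqueue().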
      
      One area where the behavior change may be more noticeable is
      workqueue_congested() as the reported congestion state is now per CPU
      instead of NUMA node. There are only two users of this interface -
      drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
      cc'd. Inputs on the behavior change would be very much appreciated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Leon Romanovsky <leon@kernel.org>
      Cc: Karsten Graul <kgraul@linux.ibm.com>
      Cc: Wenjia Zhang <wenjia@linux.ibm.com>
      Cc: Jan Karcher <jaka@linux.ibm.com>
    • workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug · 4cbfd3de
      Tejun Heo authored
      When a CPU went online or offline, wq_update_unbound_numa() was called only
      on the CPU which was going up or down. This works fine because all CPUs on
      the same NUMA node share the same pool_workqueue slot - one CPU updating it
      updates it for everyone in the node.
      
      However, future changes will make each CPU use a separate pool_workqueue
      even when they're sharing the same worker_pool, which requires updating
      pool_workqueue's for all CPUs which may be sharing the same pool_workqueue
      on hotplug.
      
      To accommodate the planned changes, this patch updates
      workqueue_on/offline_cpu() so that they call wq_update_unbound_numa() for
      all CPUs sharing the same NUMA node as the CPU going up or down. In the
      current code, the second+ calls would be noops and there shouldn't be any
      behavior changes.
      
      * As wq_update_unbound_numa() is now called on multiple CPUs per each
        hotplug event, @cpu is renamed to @hotplug_cpu and another @cpu argument
        is added. The former indicates the CPU being hot[un]plugged and the latter
        the CPU whose pool_workqueue is being updated.
      
      * In wq_update_unbound_numa(), cpu_off is renamed to off_cpu for consistency
        with the new @hotplug_cpu.
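
      The new calling pattern can be sketched in Python as below. The
      cpu_to_node map, cpus_to_update() and the two-argument update callback
      are assumptions standing in for the kernel's topology data and for
      wq_update_unbound_numa().

```python
def cpus_to_update(hotplug_cpu, cpu_to_node):
    """All CPUs that share a NUMA node with the CPU going up or down."""
    node = cpu_to_node[hotplug_cpu]
    return sorted(cpu for cpu, n in cpu_to_node.items() if n == node)


def on_hotplug(hotplug_cpu, cpu_to_node, update):
    """Call update(cpu, hotplug_cpu) for every CPU in the affected node."""
    for cpu in cpus_to_update(hotplug_cpu, cpu_to_node):
        # @cpu is the slot being refreshed, @hotplug_cpu the one that changed.
        update(cpu, hotplug_cpu)
```

      While all CPUs of a node share one pool_workqueue slot, every call
      after the first changes nothing, matching the note that the extra
      calls are noops for now.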
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Make per-cpu pool_workqueues allocated and released like unbound ones · 687a9aa5
      Tejun Heo authored
      Currently, all per-cpu pwq's (pool_workqueue's) are allocated directly
      through a per-cpu allocation and thus, unlike unbound workqueues, not
      reference counted. This difference in lifetime management between the two
      types is a bit confusing.
      
      Unbound workqueues are currently accessed through wq->numa_pwq_tbl[] which
      isn't suitable for the planned CPU locality related improvements. The plan
      is to unify pwq handling across per-cpu and unbound workqueues so that
      they're always accessed through wq->cpu_pwq.
      
      In preparation, this patch has per-cpu pwq's allocated, reference
      counted and released the same way as unbound pwq's. wq->cpu_pwq now holds
      pointers to pwq's instead of containing them directly.
      
      pwq_unbound_release_workfn() is renamed to pwq_release_workfn() as it's now
      also used for per-cpu work items.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Use a kthread_worker to release pool_workqueues · 967b494e
      Tejun Heo authored
      pool_workqueue release path is currently bounced to system_wq; however, this
      is a bit tricky because this bouncing occurs while holding a pool lock and
      thus risks causing an A-A deadlock. This is currently addressed by the
      fact that only unbound workqueues use this bouncing path and system_wq is a
      per-cpu workqueue.
      
      While this works, it's brittle and requires a work-around like setting the
      lockdep subclass for the lock of unbound pools. Besides, future changes will
      use the bouncing path for per-cpu workqueues too, making the current approach
      unusable.
      
      Let's just use a dedicated kthread_worker to untangle the dependency. This
      is just one more kthread for all workqueues and makes the pwq release logic
      simpler and more robust.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      967b494e
    • Tejun Heo's avatar
      workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numa · fcecfa8f
      Tejun Heo authored
      Unbound workqueue CPU affinity is going to receive an overhaul and the NUMA
      specific knobs won't make sense anymore. Remove them. Also, the pool_ids
      knob was used for debugging and not really meaningful given that there is no
      visibility into the pools associated with those IDs. Remove it too. A future
      patch will improve overall visibility.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      fcecfa8f
    • Tejun Heo's avatar
      workqueue: Relocate worker and work management functions · 797e8345
      Tejun Heo authored
      Collect first_idle_worker(), worker_enter/leave_idle(),
      find_worker_executing_work(), move_linked_works() and wake_up_worker() into
      one place. These functions will later be used to implement higher level
      worker management logic.
      
      No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      797e8345
    • Tejun Heo's avatar
      workqueue: Rename wq->cpu_pwqs to wq->cpu_pwq · ee1ceef7
      Tejun Heo authored
      wq->cpu_pwqs is a percpu variable carrying one pointer to a pool_workqueue.
      The field name being plural is unusual and confusing. Rename it to singular.
      
      This patch doesn't cause any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      ee1ceef7
    • Tejun Heo's avatar
      workqueue: Not all work insertion needs to wake up a worker · fe089f87
      Tejun Heo authored
      insert_work() always tried to wake up a worker; however, the only time it
      needs to try to wake up a worker is when a new active work item is queued.
      When a work item goes on the inactive list or a flush work item is queued,
      there's no reason to try to wake up a worker.
      
      This patch moves the worker wakeup logic out of insert_work() and places it
      in the active new work item queueing path in __queue_work().
      
      While at it:
      
      * __queue_work() is dereferencing pwq->pool repeatedly. Add local variable
        pool.
      
      * Every caller of insert_work() calls debug_work_activate(). Consolidate the
        invocations into insert_work().
      
      * In __queue_work() pool->watchdog_ts update is relocated slightly. This is
        to better accommodate future changes.
      
      This makes wakeups more precise and will help the planned change to assign
      work items to workers before waking them up. No behavior changes intended.
      
      v2: WARN_ON_ONCE(pool != last_pool) added in __queue_work() to clarify as
          suggested by Lai.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      fe089f87
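The wakeup rule from fe089f87 can be modelled with a toy userspace sketch. Everything here is an assumption made for illustration: `toy_pwq`, its counters and the `toy_*` helpers are hypothetical and do not mirror the kernel's data structures or locking. The sketch only captures the split the message describes: insertion is a pure list operation, and a wakeup is attempted only on the path that queues a new active item.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a pool_workqueue; all fields are illustrative. */
struct toy_pwq {
	int max_active;		/* limit on concurrently active work items */
	int nr_active;		/* items queued as active */
	int nr_inactive;	/* items parked on the inactive list */
	int nr_wakeups;		/* number of wakeup attempts */
};

/* Pure insertion: in this model it never wakes a worker. */
void toy_insert_work(int *list_len)
{
	(*list_len)++;
}

/* Stand-in for trying to wake an idle worker of the pool. */
void toy_wake_up_worker(struct toy_pwq *pwq)
{
	pwq->nr_wakeups++;
}

/*
 * Queue one work item. Returns true if it became active, in which case
 * a wakeup was attempted; inactive items do not trigger a wakeup.
 */
bool toy_queue_work(struct toy_pwq *pwq)
{
	assert(pwq->max_active > 0);

	if (pwq->nr_active < pwq->max_active) {
		toy_insert_work(&pwq->nr_active);
		toy_wake_up_worker(pwq);
		return true;
	}

	toy_insert_work(&pwq->nr_inactive);
	return false;
}
```

For example, with `max_active` set to 2, queueing three items in this model attempts two wakeups and parks the third item without one.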
    • Tejun Heo's avatar
      workqueue: Cleanups around process_scheduled_works() · c0ab017d
      Tejun Heo authored
      * Drop the trivial optimization in worker_thread() where it bypasses calling
        process_scheduled_works() if the first work item isn't linked. This is a
        mostly pointless micro optimization and gets in the way of improving the
        work processing path.
      
      * Consolidate pool->watchdog_ts updates in the two callers into
        process_scheduled_works().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      c0ab017d
    • Tejun Heo's avatar
      workqueue: Drop the special locking rule for worker->flags and worker_pool->flags · bc8b50c2
      Tejun Heo authored
      worker->flags used to be accessed from scheduler hooks without grabbing
      pool->lock for concurrency management. This is no longer true since
      6d25be57 ("sched/core, workqueues: Distangle worker accounting from rq
      lock"). Also, it's unclear why worker_pool->flags was using the "X" rule.
      All relevant users are accessing it under the pool lock.
      
      Let's drop the special "X" rule and use the "L" rule for these flag fields
      instead. While at it, replace the CONTEXT comment with
      lockdep_assert_held().
      
      This allows worker_set/clr_flags() to be used from a context which isn't the
      worker itself. This will be used later to implement assigning work items to
      workers before waking them up so that workqueue can have better control over
      which worker executes which work item on which CPU.
      
      The only actual changes are sanity checks. There shouldn't be any visible
      behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      bc8b50c2
    • Tejun Heo's avatar
      workqueue: Merge branch 'for-6.5-fixes' into for-6.6 · 87437656
      Tejun Heo authored
      The unbound workqueue execution locality improvement patchset is about to be
      applied, which will cause merge conflicts with changes in for-6.5-fixes.
      Let's avoid future merge conflicts by pulling in for-6.5-fixes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      87437656
  2. 07 Aug, 2023 1 commit
  3. 25 Jul, 2023 1 commit
    • Tejun Heo's avatar
      workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000 · aa6fde93
      Tejun Heo authored
      wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
      Once detected, they're excluded from concurrency management to prevent them
      from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
      enabled, repeat offenders are also reported so that the code can be updated.
      
      The default threshold is 10ms which is long enough to do a fair bit of work on
      modern CPUs while short enough to be usually not noticeable. This
      unfortunately leads to a lot of, arguably spurious, detections on very slow
      CPUs. Using the same threshold across CPUs whose performance levels may be
      apart by multiple orders of magnitude doesn't make a whole lot of sense.
      
      This patch scales up wq_cpu_intensive_thresh_us up to 1 second when BogoMIPS
      is below 4000. This is obviously very inaccurate but it doesn't have to be
      accurate to be useful. The mechanism is still useful when the threshold is
      fully scaled up and the benefits of reports are usually shared with everyone
      regardless of who's reporting, so as long as there is a sufficient number of
      fast machines reporting, we don't lose much.
      
      Some (or is it all?) ARM CPUs systematically report significantly lower
      BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
      are, it's at least a missed opportunity and it probably would be a good idea
      to teach workqueue about it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      aa6fde93
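The scaling described in aa6fde93 can be sketched as a standalone function. The commit message gives only the endpoints (a 10ms default, a 1 second cap, and scaling when BogoMIPS is below 4000), so the inverse-proportional formula below is an assumption, and `scaled_thresh_us()` is a hypothetical helper rather than a kernel function.

```c
#include <assert.h>

#define USEC_PER_MSEC	1000UL
#define USEC_PER_SEC	1000000UL

/*
 * Hypothetical helper: return a CPU-intensive threshold in microseconds
 * for a machine with the given BogoMIPS value. Assumes the threshold
 * grows in inverse proportion to BogoMIPS below 4000, capped at 1 second.
 */
unsigned long scaled_thresh_us(unsigned long bogomips)
{
	unsigned long thresh = 10 * USEC_PER_MSEC;	/* 10ms default */

	if (bogomips == 0)
		bogomips = 1;	/* guard against division by zero */

	if (bogomips < 4000) {
		thresh = thresh * 4000 / bogomips;
		if (thresh > USEC_PER_SEC)
			thresh = USEC_PER_SEC;
	}

	/* The result always stays between the default and the cap. */
	assert(thresh >= 10 * USEC_PER_MSEC && thresh <= USEC_PER_SEC);
	return thresh;
}
```

Under this assumption a machine reporting 2000 BogoMIPS would get a 20ms threshold, and anything at or below 40 BogoMIPS would be fully scaled up to 1 second.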
  4. 11 Jul, 2023 1 commit
  5. 10 Jul, 2023 4 commits
  6. 09 Jul, 2023 8 commits