1. 21 Feb, 2024 7 commits
  2. 16 Feb, 2024 1 commit
    • workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK · fd0a68a2
      Tejun Heo authored
      2f34d733 ("workqueue: Fix queue_work_on() with BH workqueues") added
      irq_work usage to workqueue; however, it turns out irq_work is actually
      optional and the change breaks the build on configurations which don't have
      CONFIG_IRQ_WORK enabled.
      
      Fix build by making workqueue use irq_work only when CONFIG_SMP and enabling
      CONFIG_IRQ_WORK when CONFIG_SMP is set. It's reasonable to argue that it may
      be better to just always enable it. However, this still saves a small bit of
      memory for tiny UP configs and is also the least amount of change, so, for now,
      let's keep it conditional.
      
      Verified to do the right thing for x86_64 allnoconfig and defconfig, and
      aarch64 allnoconfig, allnoconfig + printk disable (SMP but nothing selects
      IRQ_WORK) and a modified aarch64 Kconfig where !SMP and nothing selects
      IRQ_WORK.
      
      v2: `depends on SMP` leads to Kconfig warnings when CONFIG_IRQ_WORK is
          selected by something else when !CONFIG_SMP. Use `def_bool y if SMP`
          instead.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Tested-by: Anders Roxell <anders.roxell@linaro.org>
      Fixes: 2f34d733 ("workqueue: Fix queue_work_on() with BH workqueues")
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      fd0a68a2
  3. 14 Feb, 2024 1 commit
    • workqueue: Fix queue_work_on() with BH workqueues · 2f34d733
      Tejun Heo authored
      When queue_work_on() is used to queue a BH work item on a remote CPU, the
      work item is queued on that CPU but kick_pool() raises softirq on the local
      CPU. This leads to stalls as the work item won't be executed until something
      else on the remote CPU schedules a BH work item or tasklet locally.
      
      Fix it by bouncing the softirq raise to the target CPU using per-cpu irq_work.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 4cb1ef64 ("workqueue: Implement BH workqueues to eventually replace tasklets")
      2f34d733
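The stall described in the queue_work_on() fix above can be illustrated with a small user-space simulation. This is only a sketch, not kernel code: `struct sim_system`, `kick_local()`, `kick_target()` and `cpu_stalled()` are hypothetical names, and the real fix delivers the kick through a per-cpu irq_work rather than by setting a flag directly.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_NR_CPUS 4

/* Simulated per-CPU state: queued BH work items and a "softirq raised" flag. */
struct sim_system {
	int  queued[SIM_NR_CPUS];
	bool softirq_pending[SIM_NR_CPUS];
};

/* Buggy behavior: the item lands on @target_cpu but the softirq is raised locally. */
void kick_local(struct sim_system *s, int local_cpu, int target_cpu)
{
	assert(local_cpu >= 0 && local_cpu < SIM_NR_CPUS);
	assert(target_cpu >= 0 && target_cpu < SIM_NR_CPUS);
	s->queued[target_cpu]++;
	s->softirq_pending[local_cpu] = true;
}

/* Fixed behavior: the softirq raise is bounced to @target_cpu. */
void kick_target(struct sim_system *s, int local_cpu, int target_cpu)
{
	(void)local_cpu;	/* the local CPU only triggers the bounce */
	assert(target_cpu >= 0 && target_cpu < SIM_NR_CPUS);
	s->queued[target_cpu]++;
	s->softirq_pending[target_cpu] = true;
}

/* A CPU is stalled if it has queued BH work but nothing will run it. */
bool cpu_stalled(const struct sim_system *s, int cpu)
{
	assert(cpu >= 0 && cpu < SIM_NR_CPUS);
	return s->queued[cpu] > 0 && !s->softirq_pending[cpu];
}
```

Queueing from CPU 0 onto CPU 2 with `kick_local()` leaves CPU 2 stalled, while `kick_target()` does not.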
  4. 09 Feb, 2024 3 commits
  5. 08 Feb, 2024 4 commits
    • workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask · 49584bb8
      Waiman Long authored
      Commit 85f0ab43 ("kernel/workqueue: Bind rescuer to unbound
      cpumask for WQ_UNBOUND") modified init_rescuer() to bind rescuer of
      an unbound workqueue to the cpumask in wq->unbound_attrs. However
      unbound_attrs->cpumask's of all workqueues are initialized to
      cpu_possible_mask and will only be changed if it has the WQ_SYSFS flag
      to expose a cpumask sysfs file to be written by users. So this patch
      doesn't achieve what it is intended to do.
      
      If an unbound workqueue is created after wq_unbound_cpumask is modified
      and there is no more unbound cpumask update after that, the unbound
      rescuer will be bound to all CPUs unless the workqueue is created
      with the WQ_SYSFS flag and a user explicitly modified its cpumask
      sysfs file.  Fix this problem by binding directly to wq_unbound_cpumask
      in init_rescuer().
      
      Fixes: 85f0ab43 ("kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      49584bb8
    • kernel/workqueue: Let rescuers follow unbound wq cpumask changes · d64f2fa0
      Juri Lelli authored
      When workqueue cpumask changes are committed the associated rescuer (if
      one exists) affinity is not touched and this might be a problem down the
      line for isolated setups.
      
      Make sure rescuers affinity is updated every time a workqueue cpumask
      changes, so that rescuers can't break isolation.
      
       [longman: set_cpus_allowed_ptr() will block until the designated task
        is enqueued on an allowed CPU, no wake_up_process() needed. Also use
        the unbound_effective_cpumask() helper as suggested by Tejun.]
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      d64f2fa0
    • workqueue: Enable unbound cpumask update on ordered workqueues · 4c065dbc
      Waiman Long authored
      Ordered workqueues do not currently follow changes made to the
      global unbound cpumask because per-pool workqueue changes may break
      the ordering guarantee. IOW, a work function in an ordered workqueue
      may run on an isolated CPU.
      
      This patch enables ordered workqueues to follow changes made to the
      global unbound cpumask by temporarily plugging or suspending the newly allocated
      pool_workqueue from executing newly queued work items until the old
      pwq has been properly drained. For ordered workqueues, there should
      only be one pwq that is unplugged; the rest should be plugged.
      
      This enables ordered workqueues to follow the unbound cpumask changes
      like other unbound workqueues at the expense of some delay in execution
      of work functions during the transition period.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      4c065dbc
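The plugging scheme described for ordered workqueues above can be sketched as a user-space model. This is an illustration under stated assumptions, not the kernel implementation: `struct sim_ordered_wq` and all `sim_*` functions are hypothetical, pwqs are kept in an array ordered oldest to newest, and new work is always queued on the newest pwq.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_MAX_PWQS 4

struct sim_pwq {
	int  nr_in_flight;	/* queued or running work items */
	bool plugged;		/* true: must not start executing yet */
};

/* pwqs[0] is the oldest pwq, pwqs[nr_pwqs - 1] the newest. */
struct sim_ordered_wq {
	struct sim_pwq pwqs[SIM_MAX_PWQS];
	int nr_pwqs;
};

/* Install a new pwq, e.g. after a cpumask change. Returns its index. */
int sim_install_pwq(struct sim_ordered_wq *wq)
{
	int i, idx;

	assert(wq->nr_pwqs < SIM_MAX_PWQS);
	idx = wq->nr_pwqs++;
	wq->pwqs[idx].nr_in_flight = 0;
	wq->pwqs[idx].plugged = false;
	/* Plug the new pwq if any older pwq still has work to drain. */
	for (i = 0; i < idx; i++)
		if (wq->pwqs[i].nr_in_flight > 0)
			wq->pwqs[idx].plugged = true;
	return idx;
}

/* Queue one work item on the newest pwq. */
void sim_queue(struct sim_ordered_wq *wq)
{
	assert(wq->nr_pwqs > 0);
	wq->pwqs[wq->nr_pwqs - 1].nr_in_flight++;
}

/* A pwq may execute only if it has work and is not plugged. */
bool sim_can_run(const struct sim_ordered_wq *wq, int idx)
{
	return wq->pwqs[idx].nr_in_flight > 0 && !wq->pwqs[idx].plugged;
}

/* One work item on pwqs[idx] finished; unplug newer pwqs once drained. */
void sim_work_done(struct sim_ordered_wq *wq, int idx)
{
	int i;

	assert(sim_can_run(wq, idx));
	if (--wq->pwqs[idx].nr_in_flight > 0)
		return;
	for (i = idx + 1; i < wq->nr_pwqs; i++) {
		wq->pwqs[i].plugged = false;
		if (wq->pwqs[i].nr_in_flight > 0)
			break;	/* this pwq must drain before newer ones run */
	}
}
```

In this model at most one pwq with pending work is unplugged at any time, which is what preserves ordering across the transition at the cost of some delay.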
    • workqueue: Link pwq's into wq->pwqs from oldest to newest · 26fb7e3d
      Waiman Long authored
      Add a new pwq into the tail of wq->pwqs so that pwq iteration will
      start from the oldest pwq to the newest. This ordering will facilitate
      the inclusion of ordered workqueues in a wq_unbound_cpumask update.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      26fb7e3d
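The effect of linking at the tail, as described in the commit above, can be shown with a minimal singly linked list. This is a user-space sketch with hypothetical types (`struct sim_wq`, `struct sim_pwq`, `sim_link_pwq()`); the kernel uses its own list primitives on wq->pwqs.

```c
#include <assert.h>
#include <stddef.h>

struct sim_pwq {
	int id;
	struct sim_pwq *next;
};

struct sim_wq {
	struct sim_pwq *head;	/* oldest pwq */
	struct sim_pwq *tail;	/* newest pwq */
};

/* Append @pwq at the tail so that iteration from head runs oldest to newest. */
void sim_link_pwq(struct sim_wq *wq, struct sim_pwq *pwq)
{
	assert(wq && pwq);
	pwq->next = NULL;
	if (wq->tail)
		wq->tail->next = pwq;
	else
		wq->head = pwq;
	wq->tail = pwq;
}
```

Walking the list from `head` then visits pwqs in creation order, which is the property the ordered-workqueue cpumask update relies on.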
  6. 06 Feb, 2024 3 commits
    • Merge branch 'for-6.8-fixes' into for-6.9 · 40911d44
      Tejun Heo authored
      The for-6.8-fixes commit ae9cc8956944 ("Revert "workqueue: Override implicit
      ordered attribute in workqueue_apply_unbound_cpumask()") also fixes build for
      Signed-off-by: Tejun Heo <tj@kernel.org>
      40911d44
    • Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()" · aac8a595
      Tejun Heo authored
      This reverts commit ca10d851.
      
      The commit allowed workqueue_apply_unbound_cpumask() to clear __WQ_ORDERED
      on now removed implicitly ordered workqueues. This was incorrect in that
      system-wide config change shouldn't break ordering properties of all
      workqueues. The reason why apply_workqueue_attrs() path was allowed to do so
      was because it was targeting the specific workqueue - either the workqueue
      had WQ_SYSFS set or the workqueue user specifically tried to change
      max_active, both of which indicate that the workqueue doesn't need to be
      ordered.
      
      The implicitly ordered workqueue promotion was removed by the previous
      commit 3bc1e711 ("workqueue: Don't implicitly make UNBOUND workqueues w/
      @max_active==1 ordered"). However, it didn't update this path and broke
      build. Let's revert the commit which was incorrect in the first place which
      also fixes build.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 3bc1e711 ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered")
      Fixes: ca10d851 ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()")
      Cc: stable@vger.kernel.org # v6.6+
      Signed-off-by: Tejun Heo <tj@kernel.org>
      aac8a595
    • workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered · 3bc1e711
      Tejun Heo authored
      5c0338c6 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
      automatically promoted UNBOUND workqueues w/ @max_active==1 to ordered
      workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way
      to create ordered workqueues and the new NUMA support broke it. These
      problems can be subtle and the fact that they can only trigger on NUMA
      machines made them even more difficult to debug.
      
      However, overloading the UNBOUND allocation interface this way creates other
      issues. It's difficult to tell whether a given workqueue actually needs to
      be ordered and users that legitimately want a min concurrency level wq
      unexpectedly get an ordered one instead. With planned UNBOUND workqueue
      updates to improve execution locality and more prevalence of chiplet designs
      which can benefit from such improvements, this isn't a state we wanna be in
      forever.
      
      There aren't that many UNBOUND w/ @max_active==1 users in the tree and the
      preceding patches audited all and converted them to
      alloc_ordered_workqueue() as appropriate. This patch removes the implicit
      promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones.
      
      v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in
          apply_workqueue_attrs_locked() which spuriously triggers WARNING and
          fails workqueue creation. Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Link: https://lore.kernel.org/oe-lkp/202304251050.45a5df1f-oliver.sang@intel.com
      3bc1e711
  7. 05 Feb, 2024 3 commits
  8. 04 Feb, 2024 5 commits
    • workqueue: Implement BH workqueues to eventually replace tasklets · 4cb1ef64
      Tejun Heo authored
      The only generic interface to execute asynchronously in the BH context is
      tasklet; however, it's marked deprecated and has some design flaws such as
      the execution code accessing the tasklet item after the execution is
      complete which can lead to subtle use-after-free in certain usage scenarios
      and less-developed flush and cancel mechanisms.
      
      This patch implements BH workqueues which share the same semantics and
      features of regular workqueues but execute their work items in the softirq
      context. As there is always only one BH execution context per CPU, none of
      the concurrency management mechanisms applies and a BH workqueue can be
      thought of as a convenience wrapper around softirq.
      
      Except for the inability to sleep while executing and lack of max_active
      adjustments, BH workqueues and work items should behave the same as regular
      workqueues and work items.
      
      Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
      convert all tasklet users over to BH workqueues. Once the conversion is
      complete, tasklet can be removed and BH workqueues can directly take over
      the tasklet softirqs.
      
      system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
      tasklet, all existing tasklet users should be able to use the system BH
      workqueues without creating their own workqueues.
      
      v3: - Add missing interrupt.h include.
      
      v2: - Instead of using tasklets, hook directly into its softirq action
            functions - tasklet[_hi]_action(). This is slightly cheaper and closer
            to the eventual code structure we want to arrive at. Suggested by Lai.
      
          - Lai also pointed out several places which need NULL worker->task
            handling or can use clarification. Updated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
      Tested-by: Allen Pais <allen.lkml@gmail.com>
      Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
      4cb1ef64
    • workqueue: Factor out init_cpu_worker_pool() · 2fcdb1b4
      Tejun Heo authored
      Factor out init_cpu_worker_pool() from workqueue_init_early(). This is pure
      reorganization in preparation of BH workqueue support.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Tested-by: Allen Pais <allen.lkml@gmail.com>
      2fcdb1b4
    • workqueue: Update lock debugging code · c35aea39
      Tejun Heo authored
      These changes are in preparation of BH workqueue which will execute work
      items from BH context.
      
      - Update lock and RCU depth checks in process_one_work() so that it
        remembers and checks against the starting depths and prints out the depth
        changes.
      
      - Factor out lockdep annotations in the flush paths into
        touch_{wq|work}_lockdep_map(). The work->lockdep_map touching is moved
        from __flush_work() to its callee - start_flush_work(). This brings it
        closer to the wq counterpart and will allow testing the associated wq's
        flags which will be needed to support BH workqueues. This is not expected
        to cause any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Tested-by: Allen Pais <allen.lkml@gmail.com>
      c35aea39
    • workqueue: make wq_subsys const · d412ace1
      Ricardo B. Marliere authored
      Now that the driver core can properly handle constant struct bus_type,
      move the wq_subsys variable to be a constant structure as well,
      placing it into read-only memory which can not be modified at runtime.
      
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Suggested-and-reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      d412ace1
    • workqueue: Fix pwq->nr_in_flight corruption in try_to_grab_pending() · c70e1779
      Tejun Heo authored
      dd6c3c54 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work
      item handling") relocated pwq_dec_nr_in_flight() after
      set_work_pool_and_keep_pending(). However, the latter destroys information
      contained in work->data that's needed by pwq_dec_nr_in_flight() including
      the flush color. With flush color destroyed, flush_workqueue() can stall
      easily when mixed with cancel_work*() usages.
      
      This is easily triggered by running xfstests generic/001 test on xfs:
      
           INFO: task umount:6305 blocked for more than 122 seconds.
           ...
           task:umount          state:D stack:13008 pid:6305  tgid:6305  ppid:6301   flags:0x00004000
           Call Trace:
            <TASK>
            __schedule+0x2f6/0xa20
            schedule+0x36/0xb0
            schedule_timeout+0x20b/0x280
            wait_for_completion+0x8a/0x140
            __flush_workqueue+0x11a/0x3b0
            xfs_inodegc_flush+0x24/0xf0
            xfs_unmountfs+0x14/0x180
            xfs_fs_put_super+0x3d/0x90
            generic_shutdown_super+0x7c/0x160
            kill_block_super+0x1b/0x40
            xfs_kill_sb+0x12/0x30
            deactivate_locked_super+0x35/0x90
            deactivate_super+0x42/0x50
            cleanup_mnt+0x109/0x170
            __cleanup_mnt+0x12/0x20
            task_work_run+0x60/0x90
            syscall_exit_to_user_mode+0x146/0x150
            do_syscall_64+0x5d/0x110
            entry_SYSCALL_64_after_hwframe+0x6c/0x74
      
      Fix it by stashing work_data before calling set_work_pool_and_keep_pending()
      and using the stashed value for pwq_dec_nr_in_flight().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Chandan Babu R <chandanbabu@kernel.org>
      Link: http://lkml.kernel.org/r/87o7cxeehy.fsf@debian-BULLSEYE-live-builder-AMD64
      Fixes: dd6c3c54 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling")
      c70e1779
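The ordering problem fixed in try_to_grab_pending() above can be modeled in user space. This is a sketch only: the bit layout (`SIM_COLOR_SHIFT`, `SIM_COLOR_MASK`, pool ID shifted by 8) and all `sim_*` names are hypothetical and do not match the real work->data encoding. It shows why the value must be stashed before it is overwritten.

```c
#include <assert.h>

#define SIM_PENDING	0x1UL
#define SIM_COLOR_SHIFT	4
#define SIM_COLOR_MASK	0xfUL

struct sim_work {
	unsigned long data;	/* stand-in for work->data */
};

/* Extract the flush color encoded in a work data value. */
int sim_work_color(unsigned long work_data)
{
	return (int)((work_data >> SIM_COLOR_SHIFT) & SIM_COLOR_MASK);
}

/* Stand-in for set_work_pool_and_keep_pending(): overwrites the color bits. */
void sim_set_pool_and_keep_pending(struct sim_work *work, int pool_id)
{
	assert(pool_id >= 0);
	work->data = ((unsigned long)pool_id << 8) | SIM_PENDING;
}

/* Buggy order: the color is read after work->data was overwritten. */
int sim_grab_buggy(struct sim_work *work, int pool_id)
{
	sim_set_pool_and_keep_pending(work, pool_id);
	return sim_work_color(work->data);
}

/* Fixed order: stash the data first, then use the stashed value. */
int sim_grab_fixed(struct sim_work *work, int pool_id)
{
	unsigned long work_data = work->data;

	sim_set_pool_and_keep_pending(work, pool_id);
	return sim_work_color(work_data);
}
```

With a work item carrying color 5, the buggy order reports color 0 while the fixed order still reports 5, which is the information flush_workqueue() depends on.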
  9. 01 Feb, 2024 1 commit
  10. 31 Jan, 2024 2 commits
    • workqueue: Avoid premature init of wq->node_nr_active[].max · c5f8cd6c
      Tejun Heo authored
      System workqueues are allocated early during boot from
      workqueue_init_early(). While allocating unbound workqueues,
      wq_update_node_max_active() is invoked from apply_workqueue_attrs() and
      accesses NUMA topology to initialize wq->node_nr_active[].max.
      
      However, topology information may not be set up at this point.
      wq_update_node_max_active() is explicitly invoked from
      workqueue_init_topology() later when topology information is known to be
      available.
      
      This doesn't seem to crash anything but it's doing useless work with dubious
      data. Let's skip the premature and duplicate node_max_active updates by
      initializing the field to WQ_DFL_MIN_ACTIVE on allocation and making
      wq_update_node_max_active() noop until workqueue_init_topology().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      ---
       kernel/workqueue.c |    8 ++++++++
       1 file changed, 8 insertions(+)
      
      diff --git a/kernel/workqueue.c b/kernel/workqueue.c
      index 9221a4c57ae1..a65081ec6780 100644
      --- a/kernel/workqueue.c
      +++ b/kernel/workqueue.c
      @@ -386,6 +386,8 @@ static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
       	[WQ_AFFN_SYSTEM]		= "system",
       };
      
      +static bool wq_topo_initialized = false;
      +
       /*
        * Per-cpu work items which run for longer than the following threshold are
        * automatically considered CPU intensive and excluded from concurrency
      @@ -1510,6 +1512,9 @@ static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)
      
       	lockdep_assert_held(&wq->mutex);
      
      +	if (!wq_topo_initialized)
      +		return;
      +
       	if (!cpumask_test_cpu(off_cpu, effective))
       		off_cpu = -1;
      
      @@ -4356,6 +4361,7 @@ static void free_node_nr_active(struct wq_node_nr_active **nna_ar)
      
       static void init_node_nr_active(struct wq_node_nr_active *nna)
       {
      +	nna->max = WQ_DFL_MIN_ACTIVE;
       	atomic_set(&nna->nr, 0);
       	raw_spin_lock_init(&nna->lock);
       	INIT_LIST_HEAD(&nna->pending_pwqs);
      @@ -7400,6 +7406,8 @@ void __init workqueue_init_topology(void)
       	init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
       	init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
      
      +	wq_topo_initialized = true;
      +
       	mutex_lock(&wq_pool_mutex);
      
       	/*
      c5f8cd6c
    • workqueue: Don't call cpumask_test_cpu() with -1 CPU in wq_update_node_max_active() · 15930da4
      Tejun Heo authored
      For wq_update_node_max_active(), @off_cpu of -1 indicates that no CPU is
      going down. The function was incorrectly calling cpumask_test_cpu() with -1
      CPU leading to oopses like the following on some archs:
      
        Unable to handle kernel paging request at virtual address ffff0002100296e0
        ..
        pc : wq_update_node_max_active+0x50/0x1fc
        lr : wq_update_node_max_active+0x1f0/0x1fc
        ...
        Call trace:
          wq_update_node_max_active+0x50/0x1fc
          apply_wqattrs_commit+0xf0/0x114
          apply_workqueue_attrs_locked+0x58/0xa0
          alloc_workqueue+0x5ac/0x774
          workqueue_init_early+0x460/0x540
          start_kernel+0x258/0x684
          __primary_switched+0xb8/0xc0
        Code: 9100a273 35000d01 53067f00 d0016dc1 (f8607a60)
        ---[ end trace 0000000000000000 ]---
        Kernel panic - not syncing: Attempted to kill the idle task!
        ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
      
      Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Tested-by: Nathan Chancellor <nathan@kernel.org>
      Link: http://lkml.kernel.org/r/91eacde0-df99-4d5c-a980-91046f66e612@samsung.com
      Fixes: 5797b1c1 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
      15930da4
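The commit message above only says "Fix it", so the following is a guess at the shape of such a guard, written as a user-space sketch. `sim_cpumask_test_cpu()` and `sim_normalize_off_cpu()` are hypothetical stand-ins operating on a plain bitmask; the point is that the mask is only consulted when @off_cpu is a valid CPU number.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_NR_CPUS 8

/* Simplified stand-in for cpumask_test_cpu(): @cpu must be a valid index. */
bool sim_cpumask_test_cpu(int cpu, unsigned int mask)
{
	assert(cpu >= 0 && cpu < SIM_NR_CPUS);
	return (mask >> cpu) & 1u;
}

/*
 * Normalize @off_cpu: -1 means no CPU is going down. A CPU outside the
 * effective mask is irrelevant and is also mapped to -1. The mask test is
 * skipped entirely when @off_cpu is already -1.
 */
int sim_normalize_off_cpu(int off_cpu, unsigned int effective)
{
	if (off_cpu >= 0 && !sim_cpumask_test_cpu(off_cpu, effective))
		off_cpu = -1;
	return off_cpu;
}
```

Without the `off_cpu >= 0` check, the mask test would be invoked with -1, which is the out-of-range access behind the oops shown above.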
  11. 30 Jan, 2024 1 commit
    • workqueue: Avoid using isolated cpus' timers on queue_delayed_work · aae17ebb
      Leonardo Bras authored
      When __queue_delayed_work() is called, it chooses a cpu for handling the
      timer interrupt. As of today, it will pick either the cpu passed as
      parameter or the last cpu used for this.
      
      This is not good if a system does use CPU isolation, because it can take
      away some valuable cpu time to:
      1 - deal with the timer interrupt,
      2 - schedule-out the desired task,
      3 - queue work on a random workqueue, and
      4 - schedule the desired task back to the cpu.
      
      So to fix this, during __queue_delayed_work(), if cpu isolation is in
      place, pick a random non-isolated cpu to handle the timer interrupt.
      
      As an optimization, if the current cpu is not isolated, use it instead
      of looking for another candidate.
      Signed-off-by: Leonardo Bras <leobras@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      aae17ebb
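The CPU selection policy described in the commit above can be sketched in user-space C. All names here (`sim_pick_timer_cpu()`, `sim_is_housekeeping()`, `sim_any_housekeeping_cpu()`) are hypothetical, the housekeeping set is modeled as a plain bitmask, and "any non-isolated cpu" is simplified to the lowest-numbered one, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_NR_CPUS 8

/* True if @cpu is set in the @housekeeping bitmask (bit n = CPU n). */
bool sim_is_housekeeping(int cpu, unsigned int housekeeping)
{
	assert(cpu >= 0 && cpu < SIM_NR_CPUS);
	return (housekeeping >> cpu) & 1u;
}

/* Stand-in for picking any housekeeping CPU: here, the lowest-numbered one. */
int sim_any_housekeeping_cpu(unsigned int housekeeping)
{
	int cpu;

	for (cpu = 0; cpu < SIM_NR_CPUS; cpu++)
		if (sim_is_housekeeping(cpu, housekeeping))
			return cpu;
	return -1;	/* no housekeeping CPU configured */
}

/* Choose the CPU whose timer will handle the delayed work. */
int sim_pick_timer_cpu(int requested_cpu, int current_cpu,
		       bool isolation_enabled, unsigned int housekeeping)
{
	/* Without isolation, keep the previous behavior. */
	if (!isolation_enabled)
		return requested_cpu;
	/* Optimization: the current CPU is fine if it is not isolated. */
	if (sim_is_housekeeping(current_cpu, housekeeping))
		return current_cpu;
	return sim_any_housekeeping_cpu(housekeeping);
}
```

With CPUs 0 and 1 as housekeeping CPUs, a request made from isolated CPU 5 moves the timer to CPU 0, while a request made from CPU 1 stays on CPU 1.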
  12. 29 Jan, 2024 9 commits
    • tools/workqueue/wq_dump.py: Add node_nr/max_active dump · 07daa99b
      Tejun Heo authored
      Print out per-node nr/max_active numbers to improve visibility into
      node_nr_active operations.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      07daa99b
    • workqueue: Implement system-wide nr_active enforcement for unbound workqueues · 5797b1c1
      Tejun Heo authored
      A pool_workqueue (pwq) represents the connection between a workqueue and a
      worker_pool. One of the roles that a pwq plays is enforcement of the
      max_active concurrency limit. Before 636b927e ("workqueue: Make unbound
      workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
      for per-cpu workqueues and per each NUMA node for unbound workqueues, which
      was a natural result of per-cpu workqueues being served by per-cpu pools and
      unbound by per-NUMA pools.
      
      In terms of max_active enforcement, this was, while not perfect, workable.
      For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
      NUMA machines would get max_active that's multiplied by the number of nodes
      but didn't cause huge problems because NUMA machines are relatively rare and
      the node count is usually pretty low.
      
      However, cache layouts are more complex now and sharing a worker pool across
      a whole node didn't really work well for unbound workqueues. Thus, a series
      of commits culminating on 8639eceb ("workqueue: Make unbound workqueues
      to use per-cpu pool_workqueues") implemented more flexible affinity
      mechanism for unbound workqueues which enables using e.g. last-level-cache
      aligned pools. In the process, 636b927e ("workqueue: Make unbound
      workqueues to use per-cpu pool_workqueues") made unbound workqueues use
      per-cpu pwqs like per-cpu workqueues.
      
      While the change was necessary to enable more flexible affinity scopes, this
      came with the side effect of blowing up the effective max_active for unbound
      workqueues. Before, the effective max_active for unbound workqueues was
      multiplied by the number of nodes. After, by the number of CPUs.
      
      636b927e ("workqueue: Make unbound workqueues to use per-cpu
      pool_workqueues") claims that this should generally be okay. It is okay for
      users which self-regulate their concurrency level, which are the vast majority;
      however, there are enough use cases which actually depend on max_active to
      prevent the level of concurrency from going bonkers including several IO
      handling workqueues that can issue a work item for each in-flight IO. With
      targeted benchmarks, the misbehavior can easily be exposed as reported in
      http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
      
      Unfortunately, there is no way to express what these use cases need using
      per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
      to set max_active too low but as soon as we increase max_active a bit, we
      can end up with unreasonable number of in-flight work items when many CPUs
      issue IOs at the same time. ie. The acceptable lowest max_active is higher
      than the acceptable highest max_active.
      
      Ideally, max_active for an unbound workqueue should be system-wide so that
      the users can regulate the total level of concurrency regardless of node and
      cache layout. The reasons workqueue hasn't implemented that yet are:
      
      - Once max_active enforcement decouples from pool boundaries, chaining
        execution after a work item finishes requires inter-pool operations which
        would require lock dancing, which is nasty.
      
      - Sharing a single nr_active count across the whole system can be pretty
        expensive on NUMA machines.
      
      - Per-pwq enforcement had been more or less okay while we were using
        per-node pools.
      
      It looks like we no longer can avoid decoupling max_active enforcement from
      pool boundaries. This patch implements system-wide nr_active mechanism with
      the following design characteristics:
      
      - To avoid sharing a single counter across multiple nodes, the configured
        max_active is split across nodes according to the proportion of each
        workqueue's online effective CPUs per node. e.g. A node with twice more
        online effective CPUs will get twice higher portion of max_active.
      
      - Workqueue used to be able to process a chain of interdependent work items
        which is as long as max_active. We can't do this anymore as max_active is
        distributed across the nodes. Instead, a new parameter min_active is
        introduced which determines the minimum level of concurrency within a node
        regardless of how max_active distribution comes out to be.
      
        It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
        This can lead to higher effective max_active than configured and also
        deadlocks if a workqueue was depending on being able to handle chains of
        interdependent work items that are longer than 8.
      
        I believe these should be fine given that the number of CPUs in each NUMA
        node is usually higher than 8 and work item chain longer than 8 is pretty
        unlikely. However, if these assumptions turn out to be wrong, we'll need
        to add an interface to adjust min_active.
      
      - Each unbound wq has an array of struct wq_node_nr_active which tracks
        per-node nr_active. When its pwq wants to run a work item, it has to
        obtain the matching node's nr_active. If over the node's max_active, the
        pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
        the completion path round-robins the pending pwqs activating the first
        inactive work item of each, which involves some pool lock dancing and
        kicking other pools. It's not the simplest code but doesn't look too bad.
      
      v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
      
          - wq_adjust_max_active() is now protected by wq->mutex instead of
            wq_pool_mutex.
      
      v3: - wq_node_max_active() used to calculate per-node max_active on the fly
            based on system-wide CPU online states. Lai pointed out that this can
            lead to skewed distributions for workqueues with restricted cpumasks.
            Update the max_active distribution to use per-workqueue effective
            online CPU counts instead of system-wide and cache the calculation
            results in node_nr_active->max.
      
      v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
      Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
      Fixes: 636b927e ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
      Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
      5797b1c1
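The per-node split of max_active described in the commit above can be worked through with a short user-space sketch. The function names are hypothetical, and rounding the proportional share up before clamping it between min_active and max_active is an assumption of this sketch rather than something the message states.

```c
#include <assert.h>

#define SIM_DFL_MIN_ACTIVE 8	/* value of WQ_DFL_MIN_ACTIVE given in the message */

/* min_active is the smaller of max_active and the default minimum. */
int sim_min_active(int max_active)
{
	return max_active < SIM_DFL_MIN_ACTIVE ? max_active : SIM_DFL_MIN_ACTIVE;
}

/*
 * Per-node limit: max_active scaled by the node's share of the workqueue's
 * online effective CPUs, never below min_active and never above max_active.
 */
int sim_node_max_active(int max_active, int node_cpus, int total_cpus)
{
	int min_active = sim_min_active(max_active);
	int share;

	assert(max_active > 0 && total_cpus > 0);
	assert(node_cpus >= 0 && node_cpus <= total_cpus);

	/* Proportional share, rounded up (assumption). */
	share = (max_active * node_cpus + total_cpus - 1) / total_cpus;
	if (share < min_active)
		share = min_active;
	if (share > max_active)
		share = max_active;
	return share;
}
```

For max_active of 16 on a machine with a 4-CPU node and a 12-CPU node, the small node's proportional share of 4 is raised to min_active of 8 and the large node gets 12. The sum of 20 exceeds the configured 16, which illustrates how the effective limit can end up higher than configured.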
    • workqueue: Introduce struct wq_node_nr_active · 91ccc6e7
      Tejun Heo authored
      Currently, for both percpu and unbound workqueues, max_active applies
      per-cpu, which is a recent change for unbound workqueues. The change for
      unbound workqueues was a significant departure from the previous behavior of
      per-node application. It made some use cases create undesirable number of
      concurrent work items and left no good way of fixing them. To address the
      problem, workqueue is implementing a NUMA node segmented global nr_active
      mechanism, which will be explained further in the next patch.
      
      As a preparation, this patch introduces struct wq_node_nr_active. It's a
      data structure allocated for each workqueue and NUMA node pair and
      currently only tracks the workqueue's number of active work items on the
      node. This is split out from the next patch to make it easier to understand
      and review.
      
      Note that there is an extra wq_node_nr_active allocated for the invalid node
      nr_node_ids which is used to track nr_active for pools which don't have NUMA
      node associated such as the default fallback system-wide pool.
      
      This doesn't cause any behavior changes visible to userland yet. The next
      patch will expand to implement the control mechanism on top.
      
      v4: - Fixed out-of-bound access when freeing per-cpu workqueues.
      
      v3: - Use flexible array for wq->node_nr_active as suggested by Lai.
      
      v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
      
          - Lai pointed out that pwq_tryinc_nr_active() incorrectly dropped
            pwq->max_active check. Restored. As the next patch replaces the
            max_active enforcement mechanism, this doesn't change the end result.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
      91ccc6e7
    • workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling · dd6c3c54
      Tejun Heo authored
      The planned shared nr_active handling for unbound workqueues will make
      pwq_dec_nr_active() sometimes drop the pool lock temporarily to acquire
      other pool locks, which is necessary as retirement of an nr_active count
      from one pool may need to kick off an inactive work item in another pool.
      
      This patch moves the pwq_dec_nr_in_flight() call in try_to_grab_pending()
      to the end of work item handling so that work item state changes stay
      atomic. process_one_work(), which is the other user of
      pwq_dec_nr_in_flight(), already calls it at the end of work item handling.
      Comments are added to both call
      sites and pwq_dec_nr_in_flight().
      
      This shouldn't cause any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
      dd6c3c54
    • Tejun Heo's avatar
      workqueue: RCU protect wq->dfl_pwq and implement accessors for it · 9f66cff2
      Tejun Heo authored
      wq->cpu_pwq is RCU protected but wq->dfl_pwq isn't. This is okay because
      currently wq->dfl_pwq is only accessed to install it into wq->cpu_pwq,
      which doesn't require RCU access. However, we want to be able to access
      wq->dfl_pwq under RCU in the future to access its __pod_cpumask and the code
      can be made easier to read by making the two pwq fields behave in the same
      way.
      
      - Make wq->dfl_pwq RCU protected.
      
      - Add unbound_pwq_slot() and unbound_pwq() which can access both ->dfl_pwq
        and ->cpu_pwq. The former returns the double pointer that can be used to
        access and update the pwqs. The latter performs a locking check and
        dereferences the double pointer.
      
      - pwq accesses and updates are converted to use unbound_pwq[_slot]().
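
      The slot/accessor split listed above can be illustrated with a small
      userspace sketch. It is a simplified model, not the kernel code: RCU
      and the locking check are left out, NR_CPUS_SKETCH is a made-up
      constant, the per-CPU pointers are modeled as a plain array, and
      treating a negative cpu as "use the default pwq" is an assumption of
      this example.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for the example */

struct pool_workqueue {
	int id;
};

struct workqueue {
	struct pool_workqueue *dfl_pwq;
	struct pool_workqueue *cpu_pwq[NR_CPUS_SKETCH];
};

/*
 * Return the address of the pointer that holds the pwq for @cpu.
 * A negative @cpu selects the default pwq (assumption of this sketch).
 */
static struct pool_workqueue **unbound_pwq_slot(struct workqueue *wq, int cpu)
{
	assert(cpu < NR_CPUS_SKETCH);
	if (cpu >= 0)
		return &wq->cpu_pwq[cpu];
	return &wq->dfl_pwq;
}

/* Dereference the slot; the kernel version would also check locking. */
static struct pool_workqueue *unbound_pwq(struct workqueue *wq, int cpu)
{
	return *unbound_pwq_slot(wq, cpu);
}
```

      With both fields reachable through the same pair of helpers, an
      update is a store through unbound_pwq_slot() and a read is a call to
      unbound_pwq(), regardless of which field is involved.
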
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
      9f66cff2
    • Tejun Heo's avatar
      workqueue: Make wq_adjust_max_active() round-robin pwqs while activating · c5404d4e
      Tejun Heo authored
      wq_adjust_max_active() needs to activate work items after max_active is
      increased. Previously, it did that by visiting each pwq once and activating all
      that could be activated. While this makes sense with per-pwq nr_active,
      nr_active will be shared across multiple pwqs for unbound wqs. Then, we'd
      want to round-robin through pwqs to be fairer.
      
      In preparation, this patch makes wq_adjust_max_active() round-robin pwqs
      while activating. While the activation ordering changes, this shouldn't
      cause user-noticeable behavior changes.
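
      The round-robin idea can be sketched in userspace C as below. This is
      a toy model under stated assumptions, not the kernel implementation:
      struct pwq_sketch, the shared budget counter, and both function names
      are hypothetical, and inactive work items are reduced to a count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified pwq: work items are modeled as counters. */
struct pwq_sketch {
	int nr_active;
	int nr_inactive;
};

/* Activate one inactive item if there is one and shared budget remains. */
static bool activate_first_inactive(struct pwq_sketch *pwq, int *budget)
{
	if (!pwq->nr_inactive || *budget <= 0)
		return false;
	pwq->nr_inactive--;
	pwq->nr_active++;
	(*budget)--;
	return true;
}

/*
 * Keep making passes over all pwqs, activating at most one item per pwq
 * per pass, until a full pass activates nothing. Returns the number of
 * items activated.
 */
static int activate_round_robin(struct pwq_sketch *pwqs, size_t nr_pwqs,
				int budget)
{
	int total = 0;
	bool activated;

	do {
		activated = false;
		for (size_t i = 0; i < nr_pwqs; i++) {
			if (activate_first_inactive(&pwqs[i], &budget)) {
				activated = true;
				total++;
			}
		}
	} while (activated);

	return total;
}
```

      Compared with draining one pwq completely before moving on, this
      spreads a limited budget across pwqs instead of letting the first
      one visited consume all of it.
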
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
      c5404d4e
    • Tejun Heo's avatar
      workqueue: Move nr_active handling into helpers · 1c270b79
      Tejun Heo authored
      __queue_work(), pwq_dec_nr_in_flight() and wq_adjust_max_active() were
      open-coding nr_active handling, which is fine given that the operations are
      trivial. However, the planned unbound nr_active update will make them more
      complicated, so let's move them into helpers.
      
      - pwq_tryinc_nr_active() is added. It increments nr_active if under
        max_active limit and returns a boolean indicating whether inc was
        successful. Note that the function is structured to accommodate future
        changes. __queue_work() is updated to use the new helper.
      
      - pwq_activate_first_inactive() is updated to use pwq_tryinc_nr_active() and
        thus no longer assumes that nr_active is under max_active and returns a
        boolean to indicate whether a work item has been activated.
      
      - wq_adjust_max_active() no longer tests directly whether a work item can be
        activated. Instead, it's updated to use the return value of
        pwq_activate_first_inactive() to tell whether a work item has been
        activated.
      
      - nr_active decrement and activation of the first inactive work item are
        factored into pwq_dec_nr_active().
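
      How the helpers listed above fit together can be sketched with a
      userspace model. The function names mirror the ones in this message,
      but the bodies are simplified assumptions: struct pwq_model and
      queue_work_model() are hypothetical, inactive work items are reduced
      to a counter, and there is no locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified pwq: inactive work items are just counted. */
struct pwq_model {
	int nr_active;
	int max_active;
	int nr_inactive;
};

/* Take an nr_active count if under the max_active limit. */
static bool pwq_tryinc_nr_active(struct pwq_model *pwq)
{
	bool obtained = pwq->nr_active < pwq->max_active;

	if (obtained)
		pwq->nr_active++;
	return obtained;
}

/* Try to activate one inactive item; report whether that happened. */
static bool pwq_activate_first_inactive(struct pwq_model *pwq)
{
	if (!pwq->nr_inactive)
		return false;
	if (!pwq_tryinc_nr_active(pwq))
		return false;
	pwq->nr_inactive--;
	return true;
}

/* Retire one active item, then let a waiting item take its place. */
static void pwq_dec_nr_active(struct pwq_model *pwq)
{
	assert(pwq->nr_active > 0);
	pwq->nr_active--;
	pwq_activate_first_inactive(pwq);
}

/* Hypothetical queueing path: run now if possible, else park as inactive. */
static bool queue_work_model(struct pwq_model *pwq)
{
	if (pwq_tryinc_nr_active(pwq))
		return true;
	pwq->nr_inactive++;
	return false;
}
```

      In this model the max_active comparison lives in exactly one place,
      which is the property that makes it cheap to swap in a different
      enforcement mechanism later.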
      
      v3: - WARN_ON_ONCE(!WORK_STRUCT_INACTIVE) added to __pwq_activate_work() as
            now we're calling the function unconditionally from
            pwq_activate_first_inactive().
      
      v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
      1c270b79
    • Tejun Heo's avatar
      workqueue: Replace pwq_activate_inactive_work() with [__]pwq_activate_work() · 4c638030
      Tejun Heo authored
      To prepare for unbound nr_active handling improvements, move the work activation
      part of pwq_activate_inactive_work() into __pwq_activate_work() and add
      pwq_activate_work() which tests WORK_STRUCT_INACTIVE and updates nr_active.
      
      pwq_activate_first_inactive() and try_to_grab_pending() are updated to use
      pwq_activate_work(). The latter conversion is functionally identical. For
      the former, this conversion adds an unnecessary WORK_STRUCT_INACTIVE
      test. This is temporary and will be removed by the next patch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
      4c638030
    • Tejun Heo's avatar
      workqueue: Factor out pwq_is_empty() · afa87ce8
      Tejun Heo authored
      "!pwq->nr_active && list_empty(&pwq->inactive_works)" test is repeated
      multiple times. Let's factor it out into pwq_is_empty().
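
      A userspace sketch of the factored-out test is shown below. The
      condition is the one quoted above; the minimal list_head, the
      INIT_LIST_HEAD()/list_empty() helpers, and the trimmed struct
      pool_workqueue are stand-ins written for this example rather than
      the kernel's own definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the kernel's circular doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Trimmed to the two fields that the emptiness test looks at. */
struct pool_workqueue {
	int nr_active;
	struct list_head inactive_works;
};

/* True when the pwq has neither active nor inactive work items. */
static bool pwq_is_empty(struct pool_workqueue *pwq)
{
	return !pwq->nr_active && list_empty(&pwq->inactive_works);
}
```
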
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
      afa87ce8