1. 21 Jul, 2022 2 commits
  2. 13 Jul, 2022 2 commits
    • sched/core: Always flush pending blk_plug · 401e4963
      John Keeping authored
      With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
      normal priority tasks (SCHED_OTHER, nice level zero):
      
      	INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
      	      Not tainted 5.15.49-rt46 #1
      	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      	task:kworker/u8:0    state:D stack:    0 pid:    8 ppid:     2 flags:0x00000000
      	Workqueue: writeback wb_workfn (flush-7:0)
      	[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
      	[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
      	[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>] (rt_mutex_slowlock.constprop.0+0xac/0x174)
      	[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
      	[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
      	[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
      	[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
      	[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
      	[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
      	[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
      	[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
      	[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
      	[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
      	Exception stack(0xc10e3fb0 to 0xc10e3ff8)
      	3fa0:                                     00000000 00000000 00000000 00000000
      	3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
      	3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
      
      	INFO: task tar:2083 blocked for more than 491 seconds.
      	      Not tainted 5.15.49-rt46 #1
      	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      	task:tar             state:D stack:    0 pid: 2083 ppid:  2082 flags:0x00000000
      	[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
      	[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
      	[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
      	[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
      	[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
      	[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
      	[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
      	[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
      	[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
      	[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
      	[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
      	[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
      	Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
      	bfa0:                   01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
      	bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
      	bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc
      
      Here the kworker is waiting on msdos_sb_info::s_lock, which is held by
      tar. tar is in turn waiting for a buffer that is locked and waiting to
      be flushed, but that flush is plugged in the kworker.
      
      The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
      return false on !RT and thus the behaviour changes for RT.
      
      It seems that the intent here is to skip blk_flush_plug() in the case
      where a non-preemptible lock (such as a spinlock) has been converted to
      a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
      schedule flag.  But sched_submit_work() is only called from schedule()
      which is never called in this scenario, so the check can simply be
      deleted.
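As a rough illustration (not part of the commit), the change in behaviour can be modelled as a decision function. The `Task` class and both function names are hypothetical; the model only assumes what the message states: on RT a task blocked on a struct mutex counts as PI-blocked, and the old code skipped the plug flush in that case.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical stand-in for the tsk_is_pi_blocked() result.
    pi_blocked: bool
    # True when the task still has block requests queued in its plug.
    has_plugged_io: bool


def flushes_plug_before(task: Task) -> bool:
    """Old behaviour: return early, without flushing, when PI-blocked."""
    if task.pi_blocked:
        return False
    return task.has_plugged_io


def flushes_plug_after(task: Task) -> bool:
    """New behaviour: pending plugged I/O is always flushed before sleeping."""
    return task.has_plugged_io


# The kworker from the report: blocked on a mutex (an rtmutex on RT)
# while holding plugged writeback requests.
kworker = Task(pi_blocked=True, has_plugged_io=True)
print(flushes_plug_before(kworker), flushes_plug_after(kworker))
```

In this model the kworker never submits the I/O that tar is waiting for under the old rule, which matches the deadlock in the two traces above.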
      
      Looking at the history of the -rt patchset, in fact this change was
      present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
      of a larger patch [1] most of which was replaced by commit b4bfa3fc
      ("sched/core: Rework the __schedule() preempt argument").
      
      As described in [1]:
      
         The schedule process must distinguish between blocking on a regular
         sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
         and rwlock):
         - rwsem and mutex must flush block requests (blk_schedule_flush_plug())
           even if blocked on a lock. This can not deadlock because this also
           happens for non-RT.
           There should be a warning if the scheduling point is within a RCU read
           section.
      
         - spinlock and rwlock must not flush block requests. This will deadlock
           if the callback attempts to acquire a lock which is already acquired.
           Similarly to being preempted, there should be no warning if the
           scheduling point is within a RCU read section.
      
      and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
      issue.
      
      [1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches

      Signed-off-by: John Keeping <john@metanate.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com
    • sched/fair: fix case with reduced capacity CPU · c82a6962
      Vincent Guittot authored
      The capacity of a CPU available for CFS tasks can be reduced by other
      activities running on it. In such a case, it is worth trying to move
      CFS tasks to a CPU with more available capacity.
      
      The rework of the load balancer filtered out the case where the CPU is
      classified as fully busy but its capacity is reduced.
      
      Check whether the CPU's capacity is reduced while gathering load balance
      statistics, and classify it as group_misfit_task instead of
      group_fully_busy so we can try to move the load to another CPU.
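A minimal sketch of the classification idea, for illustration only. The function names, the threshold form and the default `imbalance_pct` of 117 are assumptions made for this example and are not taken from the patch itself.

```python
def capacity_is_reduced(cpu_capacity: int, cpu_capacity_orig: int,
                        imbalance_pct: int = 117) -> bool:
    """Assumed test: capacity left for CFS is noticeably below the original."""
    return cpu_capacity * imbalance_pct < cpu_capacity_orig * 100


def classify_busy_cpu(cpu_capacity: int, cpu_capacity_orig: int) -> str:
    """Classify a CPU that would otherwise be counted as fully busy."""
    if capacity_is_reduced(cpu_capacity, cpu_capacity_orig):
        # Treated like a misfit so the balancer may pull the task away.
        return "group_misfit_task"
    return "group_fully_busy"


# A CPU with 1024 original capacity, of which only 800 is left for CFS.
print(classify_busy_cpu(800, 1024))
# The same CPU with 900 left is within the assumed tolerance.
print(classify_busy_cpu(900, 1024))
```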

      Reported-by: David Chen <david.chen@nutanix.com>
      Reported-by: Zhang Qiao <zhangqiao22@huawei.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: David Chen <david.chen@nutanix.com>
      Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
      Link: https://lkml.kernel.org/r/20220708154401.21411-1-vincent.guittot@linaro.org
  3. 04 Jul, 2022 2 commits
  4. 28 Jun, 2022 14 commits
    • sched/fair: Remove the energy margin in feec() · b812fc97
      Vincent Donnefort authored
      find_energy_efficient_cpu() integrates a margin to protect tasks from
      bouncing back and forth from one CPU to another. This margin is set to
      6% of the total current energy estimated on the system. This, however,
      does not work for two reasons:
      
      1. The energy estimation is not a good absolute value:
      
      compute_energy() used in feec() is a good estimation for task placement as
      it allows comparing the energy with and without a task. The computed
      delta gives a good overview of the cost of a certain task placement.
      It does not, however, work as an absolute estimation of the total energy
      of the system. First, it adds the contribution to idle CPUs into the
      energy. Second, it mixes util_avg with util_est values. util_avg reflects
      the recent history of a CPU's usage; it says nothing about the current
      utilization. A system that has been quite busy in the recent past will
      hold a very high energy and therefore a high margin, preventing any task
      migration to a lower capacity CPU and wasting energy. It even creates a
      negative feedback loop: by holding the tasks on a less efficient CPU, the
      margin contributes to keeping the energy high.
      
      2. The margin handicaps small tasks:
      
      On a system where the workload is composed mostly of small tasks (which is
      often the case on Android), the overall energy will be high enough to
      create a margin none of those tasks can cross. On a Pixel4, a small
      utilization of 5% on all the CPUs creates a global estimated energy of 140
      joules, as per the Energy Model declaration of that same device. This
      means, after applying the 6% margin that any migration must save more than
      8 joules to happen. No task with a utilization lower than 40 would then be
      able to migrate away from the biggest CPU of the system.
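The arithmetic behind the Pixel4 example can be sketched as follows. The function names are hypothetical, and the margin is modelled as a plain 6% of the total estimated energy, as described in the text, rather than as whatever fixed-point form the kernel used.

```python
MARGIN_PCT = 6  # margin described in the text: 6% of total estimated energy


def migration_margin(total_energy: float) -> float:
    """Energy saving that a migration must exceed to be accepted."""
    return total_energy * MARGIN_PCT / 100


def migration_allowed(prev_delta: float, best_delta: float,
                      total_energy: float) -> bool:
    """True when moving the task saves more than the margin."""
    return (prev_delta - best_delta) > migration_margin(total_energy)


total = 140.0  # estimated energy from the example above
print(migration_margin(total))            # a little over 8
print(migration_allowed(10.0, 5.0, total))  # saving of 5 is below the margin
print(migration_allowed(10.0, 1.0, total))  # saving of 9 clears the margin
```

Because the margin scales with the whole system's estimated energy and not with the task, a small task can never produce a saving large enough to clear it.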
      
      The 6% of the overall system energy was brought by the following patch:
      
       (eb92692b sched/fair: Speed-up energy-aware wake-ups)
      
      It was previously 6% of the prev_cpu energy. Also, the following one
      made this margin value conditional on the clusters where the task fits:
      
       (8d4c97c1 sched/fair: Only compute base_energy_pd if necessary)
      
      We could simply revert that margin change to what it was, but the original
      version didn't have strong grounds either and, as demonstrated in (1.), the
      estimated energy isn't a good absolute value. Instead, remove it
      completely. This is made possible by recent changes that improved the
      fairness of energy estimation comparisons (sched/fair: Remove task_util
      from effective utilization in feec()) (PM: EM: Increase energy calculation
      precision) and stabilized task utilization (sched/fair: Decay task
      util_avg during migration).
      
      Without a margin, we could have feared bouncing between CPUs. But running
      LISA's eas_behaviour test coverage on three different platforms (Hikey960,
      RB-5 and DB-845) showed no issue.
      
      Removing the energy margin enables more energy-optimized placements for a
      more energy efficient system.

      Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Tested-by: Lukasz Luba <lukasz.luba@arm.com>
      Link: https://lkml.kernel.org/r/20220621090414.433602-8-vdonnefort@google.com
    • sched/fair: Remove task_util from effective utilization in feec() · 3e8c6c9a
      Vincent Donnefort authored
      The energy estimation in find_energy_efficient_cpu() (feec()) relies on
      the computation of the effective utilization for each CPU of a perf domain
      (PD). This effective utilization is then used as an estimation of the busy
      time for this PD. The function effective_cpu_util(), which gives this value,
      scales the utilization relative to IRQ pressure on the CPU to take into
      account that the IRQ time is hidden from the task clock. The IRQ scaling is
      as follows:
      
         effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util
      
      Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of
      the CPU and irq the IRQ avg time.
      
      If we now take as an example a task placement which doesn't raise the OPP
      on the candidate CPU, we can write the energy delta as:
      
        delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) -
                                   effective_cpu_util(cpu_util))
              = OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util
      
      We end up with an energy delta depending on the IRQ avg time, which is a
      problem: first, the time spent on IRQs by a CPU has no effect on the
      additional energy that would be consumed by a task. Second, we don't want
      to favour a CPU with a higher IRQ avg time value.
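The dependency on IRQ time can be checked numerically with the two formulas above. This is only a sketch: the function names mirror the symbols in the message, and the sample values (capacity 1024, utilizations of 200 and 100) are made up for the example.

```python
def effective_cpu_util(util: float, irq: float, cpu_cap: float) -> float:
    """IRQ scaling from the text: irq + (cpu_cap - irq)/cpu_cap * util."""
    return irq + (cpu_cap - irq) / cpu_cap * util


def energy_delta(opp_cost: float, cpu_cap: float, irq: float,
                 cpu_util: float, task_util: float) -> float:
    """Energy delta for a placement that does not raise the OPP."""
    with_task = effective_cpu_util(cpu_util + task_util, irq, cpu_cap)
    without_task = effective_cpu_util(cpu_util, irq, cpu_cap)
    return opp_cost / cpu_cap * (with_task - without_task)


# Same task, same CPU load, two different IRQ pressures (illustrative values).
quiet = energy_delta(opp_cost=1024, cpu_cap=1024, irq=0,
                     cpu_util=200, task_util=100)
noisy = energy_delta(opp_cost=1024, cpu_cap=1024, irq=256,
                     cpu_util=200, task_util=100)
print(quiet, noisy)
```

With these values the estimated cost of the same task shrinks as the IRQ time grows, so the CPU with more IRQ pressure looks cheaper, which is the bias the patch sets out to remove.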
      
      Nonetheless, we need to take the IRQ avg time into account. If a task
      placement raises the PD's frequency, it will increase the energy cost for
      the entire time where the CPU is busy. A solution is to only use
      effective_cpu_util() with the CPU contribution part. The task contribution
      is added separately and scaled according to prev_cpu's IRQ time.
      
      No change for the FREQUENCY_UTIL component of the energy estimation. We
      still want to get the actual frequency that would be selected after the
      task placement.

      Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Tested-by: Lukasz Luba <lukasz.luba@arm.com>
      Link: https://lkml.kernel.org/r/20220621090414.433602-7-vdonnefort@google.com
    • sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu() · 9b340131
      Dietmar Eggemann authored
      The Perf Domain (PD) cpumask (struct em_perf_domain.cpus) stays
      invariant after Energy Model creation, i.e. it is not updated after
      CPU hotplug operations.
      
      That's why the PD mask is used in conjunction with the cpu_online_mask
      (or Sched Domain cpumask). Thereby the cpu_online_mask is fetched
      multiple times (in compute_energy()) during a run-queue selection
      for a task.
      
      cpu_online_mask may change during this time which can lead to wrong
      energy calculations.
      
      To avoid this, use the select_rq_mask per-cpu cpumask to create a
      cpumask out of the PD cpumask and cpu_online_mask, and pass it
      through the function calls of the EAS run-queue selection path.
      
      The PD cpumask for max_spare_cap_cpu/compute_prev_delta selection
      (find_energy_efficient_cpu()) is now ANDed not only with the SD mask
      but also with the cpu_online_mask. This is fine since this cpumask
      has to be in sync with the one used for energy computation
      (compute_energy()).
      An exclusive cpuset setup with at least one asymmetric CPU capacity
      island (hence the additional AND with the SD cpumask) is the obvious
      exception here.
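The idea of computing the mask once and reusing it can be sketched with Python sets standing in for cpumasks. All names here are hypothetical; the sketch only shows why a single snapshot keeps the candidate selection and the energy computation consistent when the online mask changes in between.

```python
def snapshot_pd_mask(pd_cpus, online_cpus):
    """AND the (invariant) PD mask with the online mask exactly once."""
    return frozenset(pd_cpus) & frozenset(online_cpus)


pd_cpus = {0, 1, 2, 3}       # PD mask: not updated on hotplug
online_cpus = {0, 1, 2, 3}   # changes when CPUs go offline

# Take the snapshot at the start of run-queue selection.
cpus = snapshot_pd_mask(pd_cpus, online_cpus)

# A CPU goes offline in the middle of the selection.
online_cpus.discard(3)

# Later steps keep using the snapshot, so they agree with the earlier ones.
candidates = sorted(cpus)                 # stand-in for candidate selection
energy_cpus = sorted(cpus)                # stand-in for compute_energy()
refetched = sorted(pd_cpus & online_cpus) # what a second fetch would see
print(candidates, energy_cpus, refetched)
```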

      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Tested-by: Lukasz Luba <lukasz.luba@arm.com>
      Link: https://lkml.kernel.org/r/20220621090414.433602-6-vdonnefort@google.com
    • sched/fair: Rename select_idle_mask to select_rq_mask · ec4fc801
      Dietmar Eggemann authored
      On 21/06/2022 11:04, Vincent Donnefort wrote:
      > From: Dietmar Eggemann <dietmar.eggemann@arm.com>
      
      https://lkml.kernel.org/r/202206221253.ZVyGQvPX-lkp@intel.com discovered
      that this patch doesn't build anymore (on tip sched/core or linux-next)
      because of commit f5b2eeb4 ("sched/fair: Consider CPU affinity when
      allowing NUMA imbalance in find_idlest_group()").
      
      New version of [PATCH v11 4/7] sched/fair: Rename select_idle_mask to
      select_rq_mask below.
      
      -- >8 --
      
      Decouple the name of the per-cpu cpumask select_idle_mask from its usage
      in select_idle_[cpu/capacity]() of the CFS run-queue selection
      (select_task_rq_fair()).
      
      This is to support the reuse of this cpumask in the Energy Aware
      Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue
      selection.

      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Tested-by: Lukasz Luba <lukasz.luba@arm.com>
      Link: https://lkml.kernel.org/r/250691c7-0e2b-05ab-bedf-b245c11d9400@arm.com
    • sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() · bb447999
      Dietmar Eggemann authored
      effective_cpu_util() already has an `int cpu' parameter, which allows
      retrieving the CPU capacity scale factor (or maximum CPU capacity) inside
      this function via arch_scale_cpu_capacity(cpu).
      
      A lot of code calling effective_cpu_util() (or the shim
      sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call
      arch_scale_cpu_capacity() already.
      But not having to pass it into effective_cpu_util() will make the EAS
      wake-up code easier, especially when the maximum CPU capacity reduced
      by the thermal pressure is passed through the EAS wake-up functions.
      
      Due to the asymmetric CPU capacity support of arm/arm64 architectures,
      arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via
      per_cpu(cpu_scale, cpu) on such a system.
      On all other architectures it is a compile-time constant
      (SCHED_CAPACITY_SCALE).

      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      Tested-by: Lukasz Luba <lukasz.luba@arm.com>
      Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com
    • sched/fair: Decay task PELT values during wakeup migration · e2f3e35f
      Vincent Donnefort authored
      Before being migrated to a new CPU, a task sees its PELT values
      synchronized with rq last_update_time. Once done, that same task will also
      have its sched_avg last_update_time reset. This means the time between
      the migration and the last clock update will not be accounted for in
      util_avg and a discontinuity will appear. This issue is amplified by the
      PELT clock scaling. It currently takes one tick after the CPU goes idle
      for clock_pelt to catch up with clock_task.
      
      This is especially problematic for asymmetric CPU capacity systems which
      need stable util_avg signals for task placement and energy estimation.
      
      Ideally, this problem would be solved by updating the runqueue clocks
      before the migration. But that would require taking the runqueue lock
      which is quite expensive [1]. Instead estimate the missing time and update
      the task util_avg with that value.
      
      To that end, we need sched_clock_cpu(), but it is a costly function. Limit
      its usage to the case where the source CPU is idle, as we know this is when
      the clock is most at risk of being outdated.
      
      See the comment in migrate_se_pelt_lag() for more details about how the
      PELT value is estimated. Note, though, that this estimation doesn't take
      IRQ and Paravirt time into account.
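To give a feel for what "update the task util_avg with the missing time" means, here is a simplified decay model. It assumes the usual PELT geometry (1024 us periods, contribution halving every 32 periods) and a task that did not run during the gap; the function name is hypothetical and this is not the estimation performed by migrate_se_pelt_lag().

```python
PELT_PERIOD_US = 1024    # assumed PELT period length
HALF_LIFE_PERIODS = 32   # assumed: a contribution halves every 32 periods


def decay_util(util_avg: float, lag_us: int) -> float:
    """Decay util_avg over the time missed between the last update and now."""
    periods = lag_us // PELT_PERIOD_US
    return util_avg * 0.5 ** (periods / HALF_LIFE_PERIODS)


# A task with util_avg 400 whose last update is 32 periods old.
print(decay_util(400.0, 32 * PELT_PERIOD_US))
# No lag means no change.
print(decay_util(400.0, 0))
```

Without such a correction the migrated task would carry the stale, higher value to its new CPU, which is the discontinuity described above.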
      
      [1] https://lkml.kernel.org/r/20190709115759.10451-1-chris.redpath@arm.com

      Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Tested-by: Lukasz Luba <lukasz.luba@arm.com>
      Link: https://lkml.kernel.org/r/20220621090414.433602-3-vdonnefort@google.com
    • sched/fair: Provide u64 read for 32-bits arch helper · d05b4305
      Vincent Donnefort authored
      Introduce macro helpers u64_u32_{store,load}() to factor out lockless
      accesses to u64 variables on 32-bit architectures.
      
      Users are for now cfs_rq.min_vruntime and sched_avg.last_update_time. To
      accommodate the latter, where the copy lies outside of the structure
      (cfs_rq.last_update_time_copy instead of sched_avg.last_update_time_copy),
      use the _copy() version of those helpers.
      
      Those new helpers encapsulate smp_rmb() and smp_wmb() synchronization and
      therefore have a small penalty on 32-bit machines in set_task_rq_fair()
      and init_cfs_rq().

      Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Tested-by: Lukasz Luba <lukasz.luba@arm.com>
      Link: https://lkml.kernel.org/r/20220621090414.433602-2-vdonnefort@google.com
    • sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg · 70fb5ccf
      Chen Yu authored
      [Problem Statement]
      select_idle_cpu() might spend too much time searching for an idle CPU,
      when the system is overloaded.
      
      The following histogram is the time spent in select_idle_cpu(),
      when running 224 instances of netperf on a system with 112 CPUs
      per LLC domain:
      
      @usecs:
      [0]                  533 |                                                    |
      [1]                 5495 |                                                    |
      [2, 4)             12008 |                                                    |
      [4, 8)            239252 |                                                    |
      [8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
      [16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
      [32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
      [64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
      [128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
      [256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
      [512, 1K)        2600472 |@@@@@@@@@                                           |
      [1K, 2K)          927912 |@@@                                                 |
      [2K, 4K)          218720 |                                                    |
      [4K, 8K)           98161 |                                                    |
      [8K, 16K)          37722 |                                                    |
      [16K, 32K)          6715 |                                                    |
      [32K, 64K)           477 |                                                    |
      [64K, 128K)            7 |                                                    |
      
      netperf latency usecs:
      =======
      case            	load    	    Lat_99th	    std%
      TCP_RR          	thread-224	      257.39	(  0.21)
      
      The time spent in select_idle_cpu() is visible to netperf and might have a negative
      impact.
      
      [Symptom analysis]
      The patch [1] from Mel Gorman has been applied to track the efficiency
      of select_idle_sibling. The indicators are copied here:
      
      SIS Search Efficiency(se_eff%):
              A ratio expressed as a percentage of runqueues scanned versus
              idle CPUs found. A 100% efficiency indicates that the target,
              prev or recent CPU of a task was idle at wakeup. The lower the
              efficiency, the more runqueues were scanned before an idle CPU
              was found.
      
      SIS Domain Search Efficiency(dom_eff%):
              Similar, except only for the slower SIS
              path.
      
      SIS Fast Success Rate(fast_rate%):
              Percentage of SIS that used target, prev or
              recent CPUs.
      
      SIS Success rate(success_rate%):
              Percentage of scans that found an idle CPU.
      
      The test is based on Aubrey's schedtests tool, including netperf, hackbench,
      schbench and tbench.
      
      Test on vanilla kernel:
      schedstat_parse.py -f netperf_vanilla.log
      case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
      TCP_RR	   28 threads	     99.978	      18.535	      99.995	     100.000
      TCP_RR	   56 threads	     99.397	       5.671	      99.964	     100.000
      TCP_RR	   84 threads	     21.721	       6.818	      73.632	     100.000
      TCP_RR	  112 threads	     12.500	       5.533	      59.000	     100.000
      TCP_RR	  140 threads	      8.524	       4.535	      49.020	     100.000
      TCP_RR	  168 threads	      6.438	       3.945	      40.309	      99.999
      TCP_RR	  196 threads	      5.397	       3.718	      32.320	      99.982
      TCP_RR	  224 threads	      4.874	       3.661	      25.775	      99.767
      UDP_RR	   28 threads	     99.988	      17.704	      99.997	     100.000
      UDP_RR	   56 threads	     99.528	       5.977	      99.970	     100.000
      UDP_RR	   84 threads	     24.219	       6.992	      76.479	     100.000
      UDP_RR	  112 threads	     13.907	       5.706	      62.538	     100.000
      UDP_RR	  140 threads	      9.408	       4.699	      52.519	     100.000
      UDP_RR	  168 threads	      7.095	       4.077	      44.352	     100.000
      UDP_RR	  196 threads	      5.757	       3.775	      35.764	      99.991
      UDP_RR	  224 threads	      5.124	       3.704	      28.748	      99.860
      
      schedstat_parse.py -f schbench_vanilla.log
      (each group has 28 tasks)
      case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
      normal	   1   mthread	     99.152	       6.400	      99.941	     100.000
      normal	   2   mthreads	     97.844	       4.003	      99.908	     100.000
      normal	   3   mthreads	     96.395	       2.118	      99.917	      99.998
      normal	   4   mthreads	     55.288	       1.451	      98.615	      99.804
      normal	   5   mthreads	      7.004	       1.870	      45.597	      61.036
      normal	   6   mthreads	      3.354	       1.346	      20.777	      34.230
      normal	   7   mthreads	      2.183	       1.028	      11.257	      21.055
      normal	   8   mthreads	      1.653	       0.825	       7.849	      15.549
      
      schedstat_parse.py -f hackbench_vanilla.log
      (each group has 28 tasks)
      case			load	        se_eff%	    dom_eff%	  fast_rate%	success_rate%
      process-pipe	     1 group	         99.991	       7.692	      99.999	     100.000
      process-pipe	    2 groups	         99.934	       4.615	      99.997	     100.000
      process-pipe	    3 groups	         99.597	       3.198	      99.987	     100.000
      process-pipe	    4 groups	         98.378	       2.464	      99.958	     100.000
      process-pipe	    5 groups	         27.474	       3.653	      89.811	      99.800
      process-pipe	    6 groups	         20.201	       4.098	      82.763	      99.570
      process-pipe	    7 groups	         16.423	       4.156	      77.398	      99.316
      process-pipe	    8 groups	         13.165	       3.920	      72.232	      98.828
      process-sockets	     1 group	         99.977	       5.882	      99.999	     100.000
      process-sockets	    2 groups	         99.927	       5.505	      99.996	     100.000
      process-sockets	    3 groups	         99.397	       3.250	      99.980	     100.000
      process-sockets	    4 groups	         79.680	       4.258	      98.864	      99.998
      process-sockets	    5 groups	          7.673	       2.503	      63.659	      92.115
      process-sockets	    6 groups	          4.642	       1.584	      58.946	      88.048
      process-sockets	    7 groups	          3.493	       1.379	      49.816	      81.164
      process-sockets	    8 groups	          3.015	       1.407	      40.845	      75.500
      threads-pipe	     1 group	         99.997	       0.000	     100.000	     100.000
      threads-pipe	    2 groups	         99.894	       2.932	      99.997	     100.000
      threads-pipe	    3 groups	         99.611	       4.117	      99.983	     100.000
      threads-pipe	    4 groups	         97.703	       2.624	      99.937	     100.000
      threads-pipe	    5 groups	         22.919	       3.623	      87.150	      99.764
      threads-pipe	    6 groups	         18.016	       4.038	      80.491	      99.557
      threads-pipe	    7 groups	         14.663	       3.991	      75.239	      99.247
      threads-pipe	    8 groups	         12.242	       3.808	      70.651	      98.644
      threads-sockets	     1 group	         99.990	       6.667	      99.999	     100.000
      threads-sockets	    2 groups	         99.940	       5.114	      99.997	     100.000
      threads-sockets	    3 groups	         99.469	       4.115	      99.977	     100.000
      threads-sockets	    4 groups	         87.528	       4.038	      99.400	     100.000
      threads-sockets	    5 groups	          6.942	       2.398	      59.244	      88.337
      threads-sockets	    6 groups	          4.359	       1.954	      49.448	      87.860
      threads-sockets	    7 groups	          2.845	       1.345	      41.198	      77.102
      threads-sockets	    8 groups	          2.871	       1.404	      38.512	      74.312
      
      schedstat_parse.py -f tbench_vanilla.log
      case			load	      se_eff%	    dom_eff%	  fast_rate%	success_rate%
      loopback	  28 threads	       99.976	      18.369	      99.995	     100.000
      loopback	  56 threads	       99.222	       7.799	      99.934	     100.000
      loopback	  84 threads	       19.723	       6.819	      70.215	     100.000
      loopback	 112 threads	       11.283	       5.371	      55.371	      99.999
      loopback	 140 threads	        0.000	       0.000	       0.000	       0.000
      loopback	 168 threads	        0.000	       0.000	       0.000	       0.000
      loopback	 196 threads	        0.000	       0.000	       0.000	       0.000
      loopback	 224 threads	        0.000	       0.000	       0.000	       0.000
      
      According to the test above, if the system becomes busy, the
      SIS Search Efficiency (se_eff%) drops significantly. Although some
      benchmarks eventually find an idle CPU (success_rate% = 100%), it is
      doubtful whether it is worth searching the whole LLC domain.
      
      [Proposal]
      It would be ideal to have a crystal ball to answer this question:
      How many CPUs must a wakeup path walk down, before it can find an idle
      CPU? Many potential metrics could be used to predict the number.
      One candidate is the sum of util_avg in this LLC domain. The benefit
      of choosing util_avg is that it is a metric of accumulated historic
      activity, which seems to be smoother than instantaneous metrics
      (such as rq->nr_running). Besides, the sum of util_avg should predict
      the load of the LLC domain more precisely than SIS_PROP, which uses
      one CPU's idle time to estimate the idle time of the whole LLC
      domain.
      
      In summary, the lower the util_avg is, the more CPUs select_idle_cpu()
      should scan for an idle CPU, and vice versa. When the sum of util_avg
      in this LLC domain reaches 85% or above, the scan stops. The 85%
      threshold corresponds to the imbalance_pct (117) at which an LLC sched
      group is considered overloaded: 100 / 117 is roughly 85%.
      
      Introduce the quadratic function:
      
      y = SCHED_CAPACITY_SCALE - p * x^2
      and y'= y / SCHED_CAPACITY_SCALE
      
      x is the ratio of sum_util compared to the CPU capacity:
      x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
      y' is the ratio of CPUs to be scanned in the LLC domain,
      and the number of CPUs to scan is calculated by:
      
      nr_scan = llc_weight * y'
      
      The quadratic function is chosen because:
      [1] Compared to the linear function, it scans more aggressively when the
          sum_util is low.
      [2] Compared to the exponential function, it is easier to calculate.
      [3] It seems that there is no accurate mapping between the sum of util_avg
          and the number of CPUs to be scanned. Use heuristic scan for now.
      
      For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
      sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
      scan_nr   112  111  108  102  93  81  65   47   25    1    0 ...
      
      For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
      sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
      scan_nr    16   15   15   14  13  11   9    6    3    0    0 ...
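
      As a minimal sketch, the two tables above can be reproduced in Python,
      assuming (this is not stated in the text above) that p is derived from
      imbalance_pct = 117 so that y reaches zero near 85%, that x is kept in
      SCHED_CAPACITY_SCALE units, and that every division truncates to an
      integer. The function name nr_scan() is only illustrative.

```python
SCHED_CAPACITY_SCALE = 1024


def nr_scan(sum_util_pct, llc_weight, imbalance_pct=117):
    """Number of CPUs to scan for a given LLC utilization percentage."""
    # x: average per-CPU utilization in capacity units (0..1024).
    x = sum_util_pct * SCHED_CAPACITY_SCALE // 100
    # p * x^2, with p assumed to make y hit 0 at a ratio of 100 / imbalance_pct.
    tmp = x * x * imbalance_pct * imbalance_pct // (10000 * SCHED_CAPACITY_SCALE)
    # y = SCHED_CAPACITY_SCALE - p * x^2, clamped so it never goes negative.
    y = SCHED_CAPACITY_SCALE - min(tmp, SCHED_CAPACITY_SCALE)
    # nr_scan = llc_weight * y', where y' = y / SCHED_CAPACITY_SCALE.
    return llc_weight * y // SCHED_CAPACITY_SCALE


for pct in (0, 5, 15, 25, 35, 45, 55, 65, 75, 85, 86):
    print(pct, nr_scan(pct, 112), nr_scan(pct, 16))
```

      Under these assumptions the loop prints the same scan_nr values as the
      two tables, including the drop to 0 at 86% for 112 CPUs.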
      
      Furthermore, to minimize the overhead of calculating the metrics in
      select_idle_cpu(), borrow the statistics from periodic load balance.
      As mentioned by Abel, on a platform with 112 CPUs per LLC, the
      sum_util calculated by periodic load balance after 112 ms would
      decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay
      in reflecting the latest utilization. But it is a trade-off.
      Checking the util_avg in newidle load balance would be more frequent,
      but it adds overhead: multiple CPUs would read and write the per-LLC
      shared variable, which introduces cache contention. Tim also mentioned
      that the scheduler is allowed to be non-optimal for short-term
      variations, as long as it can adjust to a long-term trend in the load
      behavior.
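
      The 0.5 * 0.5 * 0.5 * 0.7 estimate can be checked with a short Python
      sketch, assuming the usual PELT half-life of 32 ms (an assumption, it is
      not stated above). 112 ms is then three full half-lives plus 16 ms. The
      helper name pelt_decay() is hypothetical.

```python
PELT_HALF_LIFE_MS = 32  # assumption: a PELT contribution halves every 32 ms


def pelt_decay(elapsed_ms):
    """Fraction of an old utilization contribution left after elapsed_ms."""
    return 0.5 ** (elapsed_ms / PELT_HALF_LIFE_MS)


# Remaining fraction after 112 ms, as a percentage.
print(round(pelt_decay(112) * 100, 2))
```

      The result is close to the 8.75% quoted above; the small difference
      comes from rounding the last half-period factor to 0.7.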
      
      When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan
      calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
      Mel suggested, SIS_UTIL should be enabled by default.
      
      This patch is based on util_avg, which is very sensitive to CPU
      frequency invariance. One known issue is that, when the max frequency
      has been clamped, the util_avg decays insanely fast while the CPU is
      idle. Commit addca285 ("cpufreq: intel_pstate: Handle no_turbo
      in frequency invariance") could be used to mitigate this symptom, by adjusting
      the arch_max_freq_ratio when turbo is disabled. But this issue is still
      not thoroughly fixed, because the current code is unaware of the user-specified
      max CPU frequency.
      
      [Test result]
      
      netperf and tbench were launched with 25% 50% 75% 100% 125% 150%
      175% 200% of the CPU number respectively. Hackbench and schbench were
      launched with 1, 2, 4, 8 groups. Each test lasts for 100 seconds and
      is repeated 3 times.
      
      The following is the benchmark result comparison between
      baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare%
      indicates better performance.
      
      Each netperf test is a:
      netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100
      netperf.throughput
      =======
      case            	load    	baseline(std%)	compare%( std%)
      TCP_RR          	28 threads	 1.00 (  0.34)	 -0.16 (  0.40)
      TCP_RR          	56 threads	 1.00 (  0.19)	 -0.02 (  0.20)
      TCP_RR          	84 threads	 1.00 (  0.39)	 -0.47 (  0.40)
      TCP_RR          	112 threads	 1.00 (  0.21)	 -0.66 (  0.22)
      TCP_RR          	140 threads	 1.00 (  0.19)	 -0.69 (  0.19)
      TCP_RR          	168 threads	 1.00 (  0.18)	 -0.48 (  0.18)
      TCP_RR          	196 threads	 1.00 (  0.16)	+194.70 ( 16.43)
      TCP_RR          	224 threads	 1.00 (  0.16)	+197.30 (  7.85)
      UDP_RR          	28 threads	 1.00 (  0.37)	 +0.35 (  0.33)
      UDP_RR          	56 threads	 1.00 ( 11.18)	 -0.32 (  0.21)
      UDP_RR          	84 threads	 1.00 (  1.46)	 -0.98 (  0.32)
      UDP_RR          	112 threads	 1.00 ( 28.85)	 -2.48 ( 19.61)
      UDP_RR          	140 threads	 1.00 (  0.70)	 -0.71 ( 14.04)
      UDP_RR          	168 threads	 1.00 ( 14.33)	 -0.26 ( 11.16)
      UDP_RR          	196 threads	 1.00 ( 12.92)	+186.92 ( 20.93)
      UDP_RR          	224 threads	 1.00 ( 11.74)	+196.79 ( 18.62)
      
      Taking the 224 threads case as an example, the changes in the SIS search
      metrics are shown below:
      
          vanilla                    patched
         4544492          +237.5%   15338634        sched_debug.cpu.sis_domain_search.avg
           38539        +39686.8%   15333634        sched_debug.cpu.sis_failed.avg
        128300000          -87.9%   15551326        sched_debug.cpu.sis_scanned.avg
         5842896          +162.7%   15347978        sched_debug.cpu.sis_search.avg
      
      There are 87.9% fewer CPU scans with the patch applied, which indicates
      lower overhead. Besides, with this patch applied, there is 13% less rq
      lock contention in
      perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
      .try_to_wake_up.default_wake_function.woken_wake_function.
      This might help explain the performance improvement, because this patch
      allows the waking task to remain on the previous CPU rather than grabbing
      other CPUs' locks.
      
      Each hackbench test is a:
      hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100
      hackbench.throughput
      =========
      case            	load    	baseline(std%)	compare%( std%)
      process-pipe    	1 group 	 1.00 (  1.29)	 +0.57 (  0.47)
      process-pipe    	2 groups 	 1.00 (  0.27)	 +0.77 (  0.81)
      process-pipe    	4 groups 	 1.00 (  0.26)	 +1.17 (  0.02)
      process-pipe    	8 groups 	 1.00 (  0.15)	 -4.79 (  0.02)
      process-sockets 	1 group 	 1.00 (  0.63)	 -0.92 (  0.13)
      process-sockets 	2 groups 	 1.00 (  0.03)	 -0.83 (  0.14)
      process-sockets 	4 groups 	 1.00 (  0.40)	 +5.20 (  0.26)
      process-sockets 	8 groups 	 1.00 (  0.04)	 +3.52 (  0.03)
      threads-pipe    	1 group 	 1.00 (  1.28)	 +0.07 (  0.14)
      threads-pipe    	2 groups 	 1.00 (  0.22)	 -0.49 (  0.74)
      threads-pipe    	4 groups 	 1.00 (  0.05)	 +1.88 (  0.13)
      threads-pipe    	8 groups 	 1.00 (  0.09)	 -4.90 (  0.06)
      threads-sockets 	1 group 	 1.00 (  0.25)	 -0.70 (  0.53)
      threads-sockets 	2 groups 	 1.00 (  0.10)	 -0.63 (  0.26)
      threads-sockets 	4 groups 	 1.00 (  0.19)	+11.92 (  0.24)
      threads-sockets 	8 groups 	 1.00 (  0.08)	 +4.31 (  0.11)
      
      Each tbench test is a:
      tbench -t 100 $job 127.0.0.1
      tbench.throughput
      ======
      case            	load    	baseline(std%)	compare%( std%)
      loopback        	28 threads	 1.00 (  0.06)	 -0.14 (  0.09)
      loopback        	56 threads	 1.00 (  0.03)	 -0.04 (  0.17)
      loopback        	84 threads	 1.00 (  0.05)	 +0.36 (  0.13)
      loopback        	112 threads	 1.00 (  0.03)	 +0.51 (  0.03)
      loopback        	140 threads	 1.00 (  0.02)	 -1.67 (  0.19)
      loopback        	168 threads	 1.00 (  0.38)	 +1.27 (  0.27)
      loopback        	196 threads	 1.00 (  0.11)	 +1.34 (  0.17)
      loopback        	224 threads	 1.00 (  0.11)	 +1.67 (  0.22)
      
      Each schbench test is a:
      schbench -m $job -t 28 -r 100 -s 30000 -c 30000
      schbench.latency_90%_us
      ========
      case            	load    	baseline(std%)	compare%( std%)
      normal          	1 mthread	 1.00 ( 31.22)	 -7.36 ( 20.25)*
      normal          	2 mthreads	 1.00 (  2.45)	 -0.48 (  1.79)
      normal          	4 mthreads	 1.00 (  1.69)	 +0.45 (  0.64)
      normal          	8 mthreads	 1.00 (  5.47)	 +9.81 ( 14.28)
      
      *Considering the standard deviation, this -7.36% regression might not be valid.
      
      Also, an OLTP workload with a commercial RDBMS has been tested, and there
      is no significant change.
      
      There were concerns that unbalanced tasks among CPUs would cause problems.
      For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are
      bound to CPU0~CPU6, while CPU7 is idle:
      
                CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
      util_avg  1024    1024    1024    1024    1024    1024    1024    0
      
      Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%,
      select_idle_cpu() will not scan, thus CPU7 is undetected during scan.
      But according to Mel, it is unlikely the CPU7 will be idle all the time
      because CPU7 could pull some tasks via CPU_NEWLY_IDLE.
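
      The 8-CPU example can be written as a small Python check. This is a
      simplified sketch: it only compares the utilization ratio against the
      100 / 117 threshold, and the name scan_allowed() is hypothetical. The
      quadratic nr_scan can already round down to 0 slightly below this
      threshold on small LLC domains.

```python
SCHED_CAPACITY_SCALE = 1024
IMBALANCE_PCT = 117  # the threshold ratio is 100 / 117, roughly 85%


def scan_allowed(util_avgs):
    """Return True if the LLC utilization ratio is below the ~85% threshold."""
    llc_weight = len(util_avgs)
    sum_util = sum(util_avgs)
    # Integer form of: sum_util / (llc_weight * SCALE) < 100 / IMBALANCE_PCT
    return sum_util * IMBALANCE_PCT < llc_weight * SCHED_CAPACITY_SCALE * 100


# Seven fully busy CPUs and one idle CPU: ratio 7/8 = 87.5%, so no scan.
print(scan_allowed([1024] * 7 + [0]))
```

      With one more idle CPU (six busy out of eight, 75%) the same check
      allows a scan again.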
      
      lkp (kernel test robot) has reported a regression on stress-ng.sock on a
      very busy system. According to the sched_debug statistics, it might be
      caused by SIS_UTIL terminating the scan earlier and choosing the previous
      CPU, which might introduce more context switches, especially involuntary
      preemption, and this impacts a busy stress-ng. This regression shows that
      not all benchmarks in every scenario benefit from the idle CPU scan limit,
      and it needs further investigation.
      
      Besides, there is a slight regression in hackbench's 16 groups case when the
      LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively
      in an LLC domain with 16 CPUs, because the cost to search for an idle one
      among 16 CPUs is negligible. The current patch aims to propose a generic
      solution and only considers the util_avg. Something like the below could
      be applied on top of the current patch to fulfill the requirement:
      
      	if (llc_weight <= 16)
      		nr_scan = nr_scan * 32 / llc_weight;
      
      For an LLC domain with 16 CPUs, nr_scan is doubled. The fewer CPUs the
      LLC domain has, the more nr_scan is expanded. This needs further
      investigation.
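
      The effect of this adjustment can be tried out with a short Python
      sketch of the same arithmetic. The function name expand_nr_scan() is
      hypothetical, and integer division is assumed as in the C snippet.

```python
def expand_nr_scan(nr_scan, llc_weight):
    """Scale nr_scan up for small LLC domains (16 CPUs or fewer)."""
    if llc_weight <= 16:
        nr_scan = nr_scan * 32 // llc_weight
    return nr_scan


# (nr_scan, llc_weight) pairs: doubled for 16 CPUs, 4x for 8, unchanged for 112.
for nr, weight in ((6, 16), (3, 8), (47, 112)):
    print(weight, nr, expand_nr_scan(nr, weight))
```

      Note that the expanded value can exceed llc_weight (16 becomes 32 in a
      16-CPU domain), so a caller would presumably still be bounded by the
      number of CPUs actually present in the domain.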
      
      There is also ongoing work[2] from Abel to filter out the busy CPUs during
      wakeup, to further speed up the idle CPU scan. It could be a follow-up
      optimization on top of this change.
      Suggested-by: Tim Chen <tim.c.chen@intel.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Yicong Yang <yangyicong@hisilicon.com>
      Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
      Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
      70fb5ccf
    • Christian Göttsche's avatar
      sched: only perform capability check on privileged operation · 700a7833
      Christian Göttsche authored
      sched_setattr(2) issues via kernel/sched/core.c:__sched_setscheduler()
      a CAP_SYS_NICE audit event unconditionally, even when the requested
      operation does not require that capability / is unprivileged, i.e. for
      reducing niceness.
      This is relevant in connection with SELinux, where a capability check
      results in a policy decision and by default a denial message on
      insufficient permission is issued.
      It can lead to three undesired cases:
        1. A denial message is generated, even in case the operation was an
           unprivileged one and thus the syscall succeeded, creating noise.
        2. To avoid the noise from 1. the policy writer adds a rule to ignore
           those denial messages, hiding future syscalls, where the task
           performs an actual privileged operation, leading to hidden limited
           functionality of that task.
        3. To avoid the noise from 1. the policy writer adds a rule to allow
           the task the capability CAP_SYS_NICE, while it does not need it,
           violating the principle of least privilege.
      
      Conduct the privileged/unprivileged categorization first and perform the
      capable test only if needed (and at most once).
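
      The ordering can be illustrated with a toy Python model. It is a
      sketch under simplifying assumptions, not the kernel logic: only the
      nice value is modelled, a request that does not lower the nice value is
      treated as the unprivileged case, and RLIMIT_NICE and real-time policies
      are ignored. Task, capable() and set_nice() are hypothetical names.

```python
class Task:
    def __init__(self, nice, has_cap_sys_nice):
        self.nice = nice
        self.has_cap_sys_nice = has_cap_sys_nice
        self.capable_checks = 0  # number of audited capability checks so far

    def capable(self):
        """Stand-in for a CAP_SYS_NICE check that may emit an audit event."""
        self.capable_checks += 1
        return self.has_cap_sys_nice


def set_nice(task, new_nice):
    # Step 1: categorize the request without touching the capability.
    needs_privilege = new_nice < task.nice
    # Step 2: run the capability check only for privileged requests.
    if needs_privilege and not task.capable():
        return False  # would be -EPERM
    task.nice = new_nice
    return True


task = Task(nice=0, has_cap_sys_nice=False)
print(set_nice(task, 5), task.capable_checks)
print(set_nice(task, 0), task.capable_checks)
```

      In this model the first call succeeds without any capability check,
      and only the second, privileged request triggers one.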
      Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20220615152505.310488-1-cgzones@googlemail.com
      700a7833
    • Zhang Qiao's avatar
      sched: Remove unused function group_first_cpu() · c64b551f
      Zhang Qiao authored
      As of commit afe06efd ("sched: Extend scheduler's asym packing")
      group_first_cpu() became an unused function, remove it.
      Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <vschneid@redhat.com>
      Link: https://lore.kernel.org/r/20220617181151.29980-3-zhangqiao22@huawei.com
      c64b551f
    • Michael Jeanson's avatar
      selftests/rseq: check if libc rseq support is registered · d1a997ba
      Michael Jeanson authored
      When checking for libc rseq support in the library constructor, don't
      only depend on the presence of the symbols; check that the registration
      was completed.

      This targets a scenario where the libc has rseq support but it is not
      wired for the current architecture in 'bits/rseq.h'; in that case we want
      to fall back to our internal registration mechanism.
      Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lore.kernel.org/r/20220614154830.1367382-4-mjeanson@efficios.com
      d1a997ba
    • Michael Jeanson's avatar
      selftests/rseq: riscv: fix 'literal-suffix' warning · d47c0cc9
      Michael Jeanson authored
      This header is also used in librseq where it can be included in C++
      code, so add a space between literals and string macros.
      Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lore.kernel.org/r/20220614154830.1367382-3-mjeanson@efficios.com
      d47c0cc9
    • Michael Jeanson's avatar
      selftests/rseq: riscv: use rseq_get_abi() helper · 4f339492
      Michael Jeanson authored
      Make the RISC-V rseq selftests compatible with glibc-2.35 by using the
      rseq_get_abi() helper.
      Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lore.kernel.org/r/20220614154830.1367382-2-mjeanson@efficios.com
      4f339492
  5. 13 Jun, 2022 10 commits
    • Tianchen Ding's avatar
      sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle · f3dd3f67
      Tianchen Ding authored
      Wakelist can help avoid cache bouncing and offload the overhead of waker
      cpu. So far, using wakelist within the same llc only happens on
      WF_ON_CPU, and this limitation could be removed to further improve
      wakeup performance.
      
      The commit 518cd623 ("sched: Only queue remote wakeups when
      crossing cache boundaries") disabled queuing tasks on wakelist when
      the cpus share llc. This is because, at that time, the scheduler must
      send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
      supports TIF_POLLING, so this is not a problem now when the wakee cpu is
      in idle polling.
      
      Benefits:
        Queuing the task on an idle cpu can help improve performance on the
        waker cpu and utilization on the wakee cpu, and further improve locality
        because the wakee cpu can handle its own rq. This patch helps improve rt
        on our real java workloads where wakeup happens frequently.
      
        Consider the normal condition (CPU0 and CPU1 share same llc)
        Before this patch:
      
               CPU0                                       CPU1
      
          select_task_rq()                                idle
          rq_lock(CPU1->rq)
          enqueue_task(CPU1->rq)
          notify CPU1 (by sending IPI or CPU1 polling)
      
                                                          resched()
      
        After this patch:
      
               CPU0                                       CPU1
      
          select_task_rq()                                idle
          add to wakelist of CPU1
          notify CPU1 (by sending IPI or CPU1 polling)
      
                                                          rq_lock(CPU1->rq)
                                                          enqueue_task(CPU1->rq)
                                                          resched()
      
        CPU0 can finish its work earlier: it only needs to put the task on the
        wakelist and return. Since CPU1 is idle, it handles its own runqueue
        data.
      
      This patch makes no difference with regard to IPIs.
        This patch only takes effect when the wakee cpu is:
        1) idle polling
        2) idle not polling
      
        For 1), there will be no IPI with or without this patch.
      
        For 2), there will always be an IPI before or after this patch.
        Before this patch: waker cpu will enqueue task and check preempt. Since
        "idle" will be sure to be preempted, waker cpu must send a resched IPI.
        After this patch: waker cpu will put the task to the wakelist of wakee
        cpu, and send an IPI.
      
      Benchmark:
      We've tested schbench, unixbench, and hackbench on both x86 and arm64.
      
      On x86 (Intel Xeon Platinum 8269CY):
        schbench -m 2 -t 8
      
          Latency percentiles (usec)              before        after
              50.0000th:                             8            6
              75.0000th:                            10            7
              90.0000th:                            11            8
              95.0000th:                            12            8
              *99.0000th:                           13           10
              99.5000th:                            15           11
              99.9000th:                            18           14
      
        Unixbench with full threads (104)
                                                  before        after
          Dhrystone 2 using register variables  3011862938    3009935994  -0.06%
          Double-Precision Whetstone              617119.3      617298.5   0.03%
          Execl Throughput                         27667.3       27627.3  -0.14%
          File Copy 1024 bufsize 2000 maxblocks   785871.4      784906.2  -0.12%
          File Copy 256 bufsize 500 maxblocks     210113.6      212635.4   1.20%
          File Copy 4096 bufsize 8000 maxblocks  2328862.2     2320529.1  -0.36%
          Pipe Throughput                      145535622.8   145323033.2  -0.15%
          Pipe-based Context Switching           3221686.4     3583975.4  11.25%
          Process Creation                        101347.1      103345.4   1.97%
          Shell Scripts (1 concurrent)            120193.5      123977.8   3.15%
          Shell Scripts (8 concurrent)             17233.4       17138.4  -0.55%
          System Call Overhead                   5300604.8     5312213.6   0.22%
      
        hackbench -g 1 -l 100000
                                                  before        after
          Time                                     3.246        2.251
      
      On arm64 (Ampere Altra):
        schbench -m 2 -t 8
      
          Latency percentiles (usec)              before        after
              50.0000th:                            14           10
              75.0000th:                            19           14
              90.0000th:                            22           16
              95.0000th:                            23           16
              *99.0000th:                           24           17
              99.5000th:                            24           17
              99.9000th:                            28           25
      
        Unixbench with full threads (80)
                                                  before        after
          Dhrystone 2 using register variables  3536194249    3537019613   0.02%
          Double-Precision Whetstone              629383.6      629431.6   0.01%
          Execl Throughput                         65920.5       65846.2  -0.11%
          File Copy 1024 bufsize 2000 maxblocks  1063722.8     1064026.8   0.03%
          File Copy 256 bufsize 500 maxblocks     322684.5      318724.5  -1.23%
          File Copy 4096 bufsize 8000 maxblocks  2348285.3     2328804.8  -0.83%
          Pipe Throughput                      133542875.3   131619389.8  -1.44%
          Pipe-based Context Switching           3215356.1     3576945.1  11.25%
          Process Creation                        108520.5      120184.6  10.75%
          Shell Scripts (1 concurrent)            122636.3        121888  -0.61%
          Shell Scripts (8 concurrent)             17462.1       17381.4  -0.46%
          System Call Overhead                   4429998.9     4435006.7   0.11%
      
        hackbench -g 1 -l 100000
                                                  before        after
          Time                                     4.217        2.916
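
      The percentage columns in the Unixbench tables, and the relative
      hackbench time reduction, can be recomputed with a small helper. This is
      only a sketch; pct_change() is a hypothetical name and the inputs are
      copied from the tables above.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# x86 Pipe-based Context Switching (higher is better).
print(round(pct_change(3221686.4, 3583975.4), 2))
# hackbench Time on x86 and arm64 (lower is better).
print(round(pct_change(3.246, 2.251), 2))
print(round(pct_change(4.217, 2.916), 2))
```

      For hackbench this works out to a run time roughly 31% shorter on both
      platforms.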
      
      Our patch improves schbench, hackbench and the Pipe-based Context
      Switching test of unixbench when idle cpus exist, and shows no obvious
      regression on the other unixbench tests. This can help improve rt in
      scenarios where wakeup happens frequently.
      Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <vschneid@redhat.com>
      Link: https://lore.kernel.org/r/20220608233412.327341-3-dtcccc@linux.alibaba.com
      f3dd3f67
    • Tianchen Ding's avatar
      sched: Fix the check of nr_running at queue wakelist · 28156108
      Tianchen Ding authored
      The commit 2ebb1771 ("sched/core: Offload wakee task activation if it
      the wakee is descheduling") checked rq->nr_running <= 1 to avoid task
      stacking when WF_ON_CPU.
      
      Per the ordering of writes to p->on_rq and p->on_cpu, observing p->on_cpu
      (WF_ON_CPU) in ttwu_queue_cond() implies !p->on_rq, IOW p has gone through
      the deactivate_task() in __schedule(), thus p has been accounted out of
      rq->nr_running. As such, the task being the only runnable task on the rq
      implies reading rq->nr_running == 0 at that point.
      
      The benchmark result is in [1].
      
      [1] https://lore.kernel.org/all/e34de686-4e85-bde1-9f3c-9bbc86b38627@linux.alibaba.com/
      Suggested-by: Valentin Schneider <vschneid@redhat.com>
      Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <vschneid@redhat.com>
      Link: https://lore.kernel.org/r/20220608233412.327341-2-dtcccc@linux.alibaba.com
      28156108
    • Josh Don's avatar
      sched: Allow newidle balancing to bail out of load_balance · 792b9f65
      Josh Don authored
      While doing newidle load balancing, it is possible for new tasks to
      arrive, such as with pending wakeups. newidle_balance() already accounts
      for this by exiting the sched_domain load_balance() iteration if it
      detects these cases. This is very important for minimizing wakeup
      latency.
      
      However, if we are already in load_balance(), we may stay there for a
      while before returning back to newidle_balance(). This is most
      exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
      very straightforward workaround to this is to adjust should_we_balance()
      to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
      detected.
      
      This was tested with the following reproduction:
      - two threads that take turns sleeping and waking each other up are
        affined to two cores
      - a large number of threads with 100% utilization are pinned to all
        other cores
      
      Without this patch, wakeup latency was ~120us for the pair of threads,
      almost entirely spent in load_balance(). With this patch, wakeup latency
      is ~6us.
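
      The bail-out condition can be sketched in Python as follows. This is a
      simplified model, not the kernel function: the fields nr_running and
      ttwu_pending are assumptions standing in for "new tasks are detected",
      and for other idle types the sketch just returns True where the real
      code performs further checks.

```python
from dataclasses import dataclass

CPU_NEWLY_IDLE = "newly_idle"
CPU_IDLE = "idle"


@dataclass
class RunQueue:
    nr_running: int = 0          # tasks already queued on this CPU
    ttwu_pending: bool = False   # a wakeup is pending for this CPU


def should_keep_balancing(idle_type, dst_rq):
    """Return False when a newidle balance should bail out early."""
    if idle_type == CPU_NEWLY_IDLE:
        # New work showed up: stop balancing so it can run right away.
        return dst_rq.nr_running == 0 and not dst_rq.ttwu_pending
    return True  # placeholder for the remaining checks


print(should_keep_balancing(CPU_NEWLY_IDLE, RunQueue(ttwu_pending=True)))
```

      The point of checking inside the balancing loop is that the newly idle
      CPU stops as soon as a wakeup arrives, instead of finishing a
      potentially long 'goto redo' iteration first.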
      Signed-off-by: Josh Don <joshdon@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20220609025515.2086253-1-joshdon@google.com
      792b9f65
    • Yajun Deng's avatar
      sched/deadline: Use proc_douintvec_minmax() limit minimum value · 2ed81e76
      Yajun Deng authored
      sysctl_sched_dl_period_max and sysctl_sched_dl_period_min are unsigned
      integers, but proc_dointvec() does not return an error even if a
      negative number is written.

      Use proc_douintvec_minmax() instead of proc_dointvec(). Add extra1 for
      sysctl_sched_dl_period_max and extra2 for sysctl_sched_dl_period_min.

      This is just an optimization to make the data and proc_handler in struct
      ctl_table match. The 'if (period < min || period > max)' check in
      __checkparam_dl() works fine even without this patch.
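
      The difference between the two handlers can be illustrated with a rough
      Python model. It is a sketch under assumptions: store_as_unsigned()
      mimics a signed value landing in a 32-bit unsigned variable, and
      parse_uint_minmax() mimics a handler that rejects out-of-range input.
      Both names, and the bounds used below, are illustrative only.

```python
UINT_MAX = 0xFFFFFFFF


def store_as_unsigned(value):
    """Reinterpret a signed 32-bit write as an unsigned int."""
    return value & UINT_MAX


def parse_uint_minmax(text, minimum, maximum):
    """Accept only non-negative values within [minimum, maximum]."""
    value = int(text)
    if value < 0 or value < minimum or value > maximum:
        raise ValueError("EINVAL")
    return value


# A signed handler would silently turn -1 into a huge unsigned period.
print(store_as_unsigned(-1))
# A range-checked unsigned handler accepts in-range values only.
print(parse_uint_minmax("1000", 100, 1 << 22))
```

      Writing "-1" through the range-checked path raises an error instead of
      storing a wrapped value.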
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
      Link: https://lore.kernel.org/r/20220607101807.249965-1-yajun.deng@linux.dev
      2ed81e76
    • Chengming Zhou's avatar
      sched/fair: Optimize and simplify rq leaf_cfs_rq_list · 51bf903b
      Chengming Zhou authored
      While doing bugfix backports and some test profiling, we noticed that
      the rq leaf_cfs_rq_list has two problems.

      1. cfs_rqs under a throttled subtree could be added to the list, which
         also puts their fully decayed ancestors on the list, even though
         this is not needed.

      2. #1 also makes the leaf_cfs_rq_list management complex and error prone.
         This is the list of related bugfixes so far:
      
         commit 31bc6aea ("sched/fair: Optimize update_blocked_averages()")
         commit fe61468b ("sched/fair: Fix enqueue_task_fair warning")
         commit b34cb07d ("sched/fair: Fix enqueue_task_fair() warning some more")
         commit 39f23ce0 ("sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list")
         commit 0258bdfa ("sched/fair: Fix unfairness caused by missing load decay")
         commit a7b359fc ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
         commit fdaba61e ("sched/fair: Ensure that the CFS parent is added after unthrottling")
         commit 2630cde2 ("sched/fair: Add ancestors of unthrottled undecayed cfs_rq")
      
      commit 31bc6aea ("sched/fair: Optimize update_blocked_averages()")
      deleted every cfs_rq under a throttled subtree from rq->leaf_cfs_rq_list,
      and deleted the throttled_hierarchy() test in update_blocked_averages(),
      which optimized update_blocked_averages().

      But the later bugfixes added cfs_rqs under a throttled subtree back to
      rq->leaf_cfs_rq_list, together with their fully decayed ancestors, for
      the integrity of rq->leaf_cfs_rq_list.

      This patch takes another approach: skip all cfs_rqs under a throttled
      hierarchy in list_add_leaf_cfs_rq(), so that cfs_rqs under a throttled
      subtree are kept off the leaf_cfs_rq_list completely.

      So we don't need to consider throttle-related things in
      enqueue_entity(), unthrottle_cfs_rq() and enqueue_task_fair(),
      which simplifies the code a lot. It also optimizes update_blocked_averages(),
      since cfs_rqs under a throttled hierarchy and their ancestors
      won't be on the leaf_cfs_rq_list.
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/20220601021848.76943-1-zhouchengming@bytedance.com
      51bf903b
    • K Prateek Nayak's avatar
      sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group() · f5b2eeb4
      K Prateek Nayak authored
      In the case of systems containing multiple LLCs per socket, like
      AMD Zen systems, users want to spread bandwidth hungry applications
      across multiple LLCs. Stream is one such representative workload where
      the best performance is obtained by limiting one stream thread per LLC.
      To ensure this, users are known to pin the tasks to a specific subset
      of the CPUs consisting of one CPU per LLC while running such bandwidth
      hungry tasks.
      
      Suppose we kickstart a multi-threaded task like stream with 8 threads
      using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
      server where each socket contains 128 CPUs
      (0-63,128-191 in one socket, 64-127,192-255 in another socket)
      
      Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
      
      Here each CPU in the list is from a different LLC and 4 of those LLCs
      are on one socket, while the other 4 are on another socket.
      
      Ideally we would prefer that each stream thread runs on a different
      CPU from the allowed list of CPUs. However, the current heuristics in
      find_idlest_group() do not allow this during the initial placement.
      
      Suppose the first socket (0-63,128-191) is our local group from which
      we are kickstarting the stream tasks. The first four stream threads
      will be placed in this socket. When it comes to placing the 5th
      thread, all the allowed CPUs from the local group (0,16,32,48)
      will already have been taken.
      
      However, the current scheduler code simply checks if the number of
      tasks in the local group is fewer than the allowed numa-imbalance
      threshold. This threshold was previously 25% of the NUMA domain span
      (in this case threshold = 32) but after v6 of Mel's patchset
      "Adjust NUMA imbalance for multiple LLCs" got merged in sched-tip,
      commit e496132e ("sched/fair: Adjust the allowed NUMA imbalance
      when SD_NUMA spans multiple LLCs"), it is now equal to the number of
      LLCs in the NUMA domain, for processors with multiple LLCs
      (in this case threshold = 8).
      
      For this example, the number of tasks will always be within threshold
      and thus all the 8 stream threads will be woken up on the first socket
      thereby resulting in sub-optimal performance.
      
      The following sched_wakeup_new tracepoint output shows the initial
      placement of tasks in the current tip/sched/core on the Zen3 machine:
      
      stream-5313    [016] d..2.   627.005036: sched_wakeup_new: comm=stream pid=5315 prio=120 target_cpu=032
      stream-5313    [016] d..2.   627.005086: sched_wakeup_new: comm=stream pid=5316 prio=120 target_cpu=048
      stream-5313    [016] d..2.   627.005141: sched_wakeup_new: comm=stream pid=5317 prio=120 target_cpu=000
      stream-5313    [016] d..2.   627.005183: sched_wakeup_new: comm=stream pid=5318 prio=120 target_cpu=016
      stream-5313    [016] d..2.   627.005218: sched_wakeup_new: comm=stream pid=5319 prio=120 target_cpu=016
      stream-5313    [016] d..2.   627.005256: sched_wakeup_new: comm=stream pid=5320 prio=120 target_cpu=016
      stream-5313    [016] d..2.   627.005295: sched_wakeup_new: comm=stream pid=5321 prio=120 target_cpu=016
      
      Once the first four threads are distributed among the allowed CPUs of
      socket one, the rest of the threads start piling on these same CPUs
      when clearly there are CPUs on the second socket that can be used.
      
      Following the initial pile up on a small number of CPUs, the
      load balancer eventually kicks in, but it takes a while to get to
      {4}{4}. Even {4}{4} isn't stable: we observe a bunch of ping-ponging
      between {4}{4} and {5}{3} before a stable state (1 Stream thread per
      allowed CPU) is reached much later and no more migration is required.
      
      We can detect this piling and avoid it by checking if the number of
      allowed CPUs in the local group is fewer than the number of tasks
      running in the local group, and use this information to spread the
      5th task out into the next socket (after all, the goal in this
      slowpath is to find the idlest group and the idlest CPU during the
      initial placement!).
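
One way to picture this check is the Python model below. It is a sketch under stated assumptions rather than the patch itself: the function names and the `allowed_local_cpus` parameter are hypothetical, and clamping the threshold to the number of allowed CPUs in the local group is just one simple way to express "more tasks than allowed CPUs" in a per-socket counting model.

```python
def stays_local_affinity_aware(local_running, threshold, allowed_local_cpus):
    # Assumption: model the affinity check by clamping the imbalance
    # threshold to the number of CPUs the task may use in the local group.
    limit = min(threshold, allowed_local_cpus)
    return local_running < limit


def place_threads_affinity_aware(nr_threads, threshold, allowed_local_cpus):
    # Place threads one at a time and count where each one lands.
    local = remote = 0
    for _ in range(nr_threads):
        if stays_local_affinity_aware(local, threshold, allowed_local_cpus):
            local += 1
        else:
            remote += 1
    return local, remote


# Pinned to 4 CPUs per socket, threshold = 8: the 5th thread moves out.
print(place_threads_affinity_aware(8, 8, 4))    # (4, 4)
# Unpinned (all 128 local CPUs allowed): behaviour is unchanged.
print(place_threads_affinity_aware(8, 8, 128))  # (8, 0)
```

With the clamp, the pinned case splits {4}{4} at placement time, while a task that may use every local CPU is treated exactly as before.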
      
      The following sched_wakeup_new tracepoint output shows the initial
      placement of tasks after adding this fix on the Zen3 machine:
      
      stream-4485    [016] d..2.   230.784046: sched_wakeup_new: comm=stream pid=4487 prio=120 target_cpu=032
      stream-4485    [016] d..2.   230.784123: sched_wakeup_new: comm=stream pid=4488 prio=120 target_cpu=048
      stream-4485    [016] d..2.   230.784167: sched_wakeup_new: comm=stream pid=4489 prio=120 target_cpu=000
      stream-4485    [016] d..2.   230.784222: sched_wakeup_new: comm=stream pid=4490 prio=120 target_cpu=112
      stream-4485    [016] d..2.   230.784271: sched_wakeup_new: comm=stream pid=4491 prio=120 target_cpu=096
      stream-4485    [016] d..2.   230.784322: sched_wakeup_new: comm=stream pid=4492 prio=120 target_cpu=080
      stream-4485    [016] d..2.   230.784368: sched_wakeup_new: comm=stream pid=4493 prio=120 target_cpu=064
      
      We see that threads are using all of the allowed CPUs and there is
      no pileup.
      
      No output is generated for tracepoint sched_migrate_task with this
      patch due to a perfect initial placement which removes the need
      for balancing later on - both across NUMA boundaries and within
      NUMA boundaries for stream.
      
      Following are the results from running 8 Stream threads with and
      without pinning on a dual socket Zen3 Machine (2 x 64C/128T):
      
      During the testing of this patch, the tip sched/core was at
      commit: 089c02ae "ftrace: Use preemption model accessors for trace
      header printout"
      
      Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8
      
                         5.18.0-rc1               5.18.0-rc1                5.18.0-rc1
                     tip sched/core           tip sched/core            tip sched/core
                       (no pinning)                + pinning              + this-patch
                                                                             + pinning
      
       Copy:   109364.74 (0.00 pct)     94220.50 (-13.84 pct)    158301.28 (44.74 pct)
      Scale:   109670.26 (0.00 pct)     90210.59 (-17.74 pct)    149525.64 (36.34 pct)
        Add:   129029.01 (0.00 pct)    101906.00 (-21.02 pct)    186658.17 (44.66 pct)
      Triad:   127260.05 (0.00 pct)    106051.36 (-16.66 pct)    184327.30 (44.84 pct)
      
      Pinning currently hurts performance compared to the unbound case on
      tip/sched/core. With the addition of this patch, we are able to
      outperform tip/sched/core by a good margin with pinning.
      
      Following are the results from running 16 Stream threads with and
      without pinning on a dual socket Ice Lake Machine (2 x 32C/64T):
      
      NUMA Topology of the Intel Ice Lake machine:
      Node 1: 0,2,4,6 ... 126 (Even numbers)
      Node 2: 1,3,5,7 ... 127 (Odd numbers)
      
      Pinning is done using: numactl -C 0-15 ./stream16
      
                         5.18.0-rc1               5.18.0-rc1                5.18.0-rc1
                     tip sched/core           tip sched/core            tip sched/core
                       (no pinning)                + pinning              + this-patch
                                                                             + pinning
      
       Copy:    85815.31 (0.00 pct)     149819.21 (74.58 pct)    156807.48 (82.72 pct)
      Scale:    64795.60 (0.00 pct)      97595.07 (50.61 pct)     99871.96 (54.13 pct)
        Add:    71340.68 (0.00 pct)     111549.10 (56.36 pct)    114598.33 (60.63 pct)
      Triad:    68890.97 (0.00 pct)     111635.16 (62.04 pct)    114589.24 (66.33 pct)
      
      On the Ice Lake machine, with a single LLC per socket, pinning across
      the two sockets reduces cache contention, which yields a large
      improvement in the pinned case that this patch improves further.
      Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Link: https://lkml.kernel.org/r/20220407111222.22649-1-kprateek.nayak@amd.com
      f5b2eeb4
    • Mel Gorman's avatar
      sched/numa: Adjust imb_numa_nr to a better approximation of memory channels · 026b98a9
      Mel Gorman authored
      For a single LLC per node, a NUMA imbalance is allowed up until 25%
      of CPUs sharing a node could be active. One intent of the cut-off is
      to avoid an imbalance of memory channels but there is no topological
      information based on active memory channels. Furthermore, there can
      be differences between nodes depending on the number of populated
      DIMMs.
      
      A cut-off of 25% was arbitrary but generally worked. It does have a
      severe corner case though: a parallel workload using 25% of all
      available CPUs can over-saturate memory channels. This can happen due
      to the initial forking of tasks that get pulled more to one node after
      early wakeups (e.g. a barrier synchronisation), which is not quickly
      corrected by the load balancer. The LB may fail to act quickly as the
      parallel tasks are considered to be poor migrate candidates due to
      locality or cache hotness.
      
      On a range of modern Intel CPUs, 12.5% appears to be a better cut-off
      assuming all memory channels are populated and is used as the new cut-off
      point. A minimum of 1 is specified to allow a communicating pair to
      remain local even for CPUs with low numbers of cores. Modern AMD CPUs
      have multiple LLCs and are not affected.
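
The arithmetic of the new cut-off can be sketched in Python as follows. This is an illustration of the rule stated above, not the kernel implementation: `numa_imbalance_cutoff`, `cpus_per_node` and `nr_llcs` are hypothetical names, and the multi-LLC branch simply returns the number of LLCs, as another commit message in this log describes for such processors.

```python
def numa_imbalance_cutoff(cpus_per_node, nr_llcs=1):
    if nr_llcs > 1:
        # Multiple LLCs per node: not affected by the 12.5% change.
        # Assumption: the cut-off is the number of LLCs in that case.
        return nr_llcs
    # Single LLC per node: 12.5% of the CPUs, but never less than 1 so
    # that a communicating pair can remain local on small machines.
    return max(1, cpus_per_node // 8)


print(numa_imbalance_cutoff(64))      # 8  (the old 25% rule gave 16)
print(numa_imbalance_cutoff(4))       # 1  (4 // 8 == 0, raised to the minimum)
print(numa_imbalance_cutoff(128, 8))  # 8  (multi-LLC case, unchanged)
```

The second call shows why the minimum matters: without it, a 4-CPU node would get a cut-off of 0.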
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Link: https://lore.kernel.org/r/20220520103519.1863-5-mgorman@techsingularity.net
      026b98a9
    • Mel Gorman's avatar
      sched/numa: Apply imbalance limitations consistently · cb29a5c1
      Mel Gorman authored
      The imbalance limitations are applied inconsistently at fork time
      and at runtime. At fork, a new task can remain local until there are
      too many running tasks even if the degree of imbalance is larger than
      NUMA_IMBALANCE_MIN, which is different to runtime. Secondly, the imbalance
      figure used during load balancing is different to the one used at NUMA
      placement. Load balancing uses the number of tasks that must move to
      restore balance whereas NUMA balancing uses the total imbalance.
      
      In combination, it is possible for a parallel workload that uses a small
      number of CPUs without applying scheduler policies to have very variable
      run-to-run performance.
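
The difference between the two imbalance figures can be shown with a short Python sketch. It is only an illustration of the two measures named above: the helper names are hypothetical, the value used for NUMA_IMBALANCE_MIN is an assumption made for the example, and halving the difference is a simplification of "the number of tasks that must move".

```python
NUMA_IMBALANCE_MIN = 2  # assumed value, for illustration only


def total_imbalance(src_running, dst_running):
    # Total imbalance: the difference in running tasks between two nodes.
    return abs(src_running - dst_running)


def tasks_to_move(src_running, dst_running):
    # Tasks that must move to even things out: half of the difference.
    return total_imbalance(src_running, dst_running) // 2


src, dst = 6, 2
print(total_imbalance(src, dst))                       # 4
print(tasks_to_move(src, dst))                         # 2
# The same state passes one test and fails the other.
print(tasks_to_move(src, dst) <= NUMA_IMBALANCE_MIN)   # True
print(total_imbalance(src, dst) <= NUMA_IMBALANCE_MIN) # False
```

When two code paths compare different figures against the same limit, they can disagree about whether a given state is acceptable, which is consistent with the variable run-to-run behaviour described above.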
      
      [lkp@intel.com: Fix build breakage for arc-allyesconfig]
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Link: https://lore.kernel.org/r/20220520103519.1863-4-mgorman@techsingularity.net
      cb29a5c1
    • Mel Gorman's avatar
      sched/numa: Do not swap tasks between nodes when spare capacity is available · 13ede331
      Mel Gorman authored
      If a destination node has spare capacity but there is an imbalance then
      two tasks are selected for swapping. If the tasks have no NUMA group
      or are within the same NUMA group, it's simply shuffling tasks around
      without having any impact on the compute imbalance. Instead, it's just
      punishing one task to help another.
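
The rule described above can be written as a small Python predicate. This is a sketch of the decision only, not the kernel code path: `should_consider_swap` and its parameters are hypothetical names, and a NUMA group is represented by any comparable value, with `None` standing for "no group".

```python
def should_consider_swap(dst_has_spare_capacity, cur_group, new_group):
    # Both tasks ungrouped (None == None) or both in the same NUMA group.
    same_or_no_group = cur_group == new_group
    if dst_has_spare_capacity and same_or_no_group:
        # Swapping would only shuffle tasks around without changing the
        # compute imbalance, so it is skipped.
        return False
    return True


print(should_consider_swap(True, None, None))   # False
print(should_consider_swap(True, "g1", "g1"))   # False
print(should_consider_swap(True, "g1", "g2"))   # True
print(should_consider_swap(False, "g1", "g1"))  # True
```

Only the combination of spare capacity on the destination and matching (or absent) groups rules the swap out; every other case is left to the existing logic.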
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Link: https://lore.kernel.org/r/20220520103519.1863-3-mgorman@techsingularity.net
      13ede331
    • Mel Gorman's avatar
      sched/numa: Initialise numa_migrate_retry · 70ce3ea9
      Mel Gorman authored
      On clone, numa_migrate_retry is inherited from the parent, which means
      that the first NUMA placement of a task is non-deterministic. This
      affects when load balancing recognises NUMA tasks and whether to
      migrate "regular", "remote" or "all" tasks between NUMA scheduler
      domains.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Link: https://lore.kernel.org/r/20220520103519.1863-2-mgorman@techsingularity.net
      70ce3ea9
  6. 12 Jun, 2022 10 commits