1. 04 Aug, 2021 2 commits
  2. 28 Jun, 2021 5 commits
  3. 24 Jun, 2021 5 commits
    • Beata Michalska's avatar
      sched/doc: Update the CPU capacity asymmetry bits · adf3c31e
      Beata Michalska authored
      Update the documentation bits referring to capacity aware scheduling
      with regards to newly introduced SD_ASYM_CPUCAPACITY_FULL sched_domain
      flag.
      Signed-off-by: default avatarBeata Michalska <beata.michalska@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarValentin Schneider <valentin.schneider@arm.com>
      Reviewed-by: default avatarDietmar Eggemann <dietmar.eggemann@arm.com>
      Link: https://lore.kernel.org/r/20210603140627.8409-4-beata.michalska@arm.com
      adf3c31e
    • Beata Michalska's avatar
      sched/topology: Rework CPU capacity asymmetry detection · c744dc4a
      Beata Michalska authored
      Currently the CPU capacity asymmetry detection, performed through
      asym_cpu_capacity_level, tries to identify the lowest topology level
      at which the highest CPU capacity is being observed, not necessarily
      finding the level at which all possible capacity values are visible
      to all CPUs, which might be bit problematic for some possible/valid
      asymmetric topologies i.e.:
      
      DIE      [                                ]
      MC       [                       ][       ]
      
      CPU       [0] [1] [2] [3] [4] [5]  [6] [7]
      Capacity  |.....| |.....| |.....|  |.....|
      	     L	     M       B        B
      
      Where:
       arch_scale_cpu_capacity(L) = 512
       arch_scale_cpu_capacity(M) = 871
       arch_scale_cpu_capacity(B) = 1024
      
      In this particular case, the asymmetric topology level will point
      at MC, as all possible CPU masks for that level do cover the CPU
      with the highest capacity. It will work just fine for the first
      cluster, not so much for the second one though (consider the
      find_energy_efficient_cpu which might end up attempting the energy
      aware wake-up for a domain that does not see any asymmetry at all)
      
      Rework the way the capacity asymmetry levels are being detected,
      allowing to point to the lowest topology level (for a given CPU), where
      full set of available CPU capacities is visible to all CPUs within given
      domain. As a result, the per-cpu sd_asym_cpucapacity might differ across
      the domains. This will have an impact on EAS wake-up placement in a way
      that it might see different range of CPUs to be considered, depending on
      the given current and target CPUs.
      
      Additionally, those levels, where any range of asymmetry (not
      necessarily full) is being detected will get identified as well.
      The selected asymmetric topology level will be denoted by
      SD_ASYM_CPUCAPACITY_FULL sched domain flag whereas the 'sub-levels'
      would receive the already used SD_ASYM_CPUCAPACITY flag. This allows
      maintaining the current behaviour for asymmetric topologies, with
      misfit migration operating correctly on lower levels, if applicable,
      as any asymmetry is enough to trigger the misfit migration.
      The logic there relies on the SD_ASYM_CPUCAPACITY flag and does not
      relate to the full asymmetry level denoted by the sd_asym_cpucapacity
      pointer.
      
      Detecting the CPU capacity asymmetry is being based on a set of
      available CPU capacities for all possible CPUs. This data is being
      generated upon init and updated once CPU topology changes are being
      detected (through arch_update_cpu_topology). As such, any changes
      to identified CPU capacities (like initializing cpufreq) need to be
      explicitly advertised by corresponding archs to trigger rebuilding
      the data.
      
      Additional -dflags- parameter, used when building sched domains, has
      been removed as well, as the asymmetry flags are now being set directly
      in sd_init.
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Suggested-by: default avatarValentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: default avatarBeata Michalska <beata.michalska@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarValentin Schneider <valentin.schneider@arm.com>
      Reviewed-by: default avatarDietmar Eggemann <dietmar.eggemann@arm.com>
      Tested-by: default avatarValentin Schneider <valentin.schneider@arm.com>
      Link: https://lore.kernel.org/r/20210603140627.8409-3-beata.michalska@arm.com
      c744dc4a
    • Beata Michalska's avatar
      sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag · 2309a05d
      Beata Michalska authored
      Introducing new, complementary to SD_ASYM_CPUCAPACITY, sched_domain
      topology flag, to distinguish between shed_domains where any CPU
      capacity asymmetry is detected (SD_ASYM_CPUCAPACITY) and ones where
      a full set of CPU capacities is visible to all domain members
      (SD_ASYM_CPUCAPACITY_FULL).
      
      With the distinction between full and partial CPU capacity asymmetry,
      brought in by the newly introduced flag, the scope of the original
      SD_ASYM_CPUCAPACITY flag gets shifted, still maintaining the existing
      behaviour when one is detected on a given sched domain, allowing
      misfit migrations within sched domains that do not observe full range
      of CPU capacities but still do have members with different capacity
      values. It loses though it's meaning when it comes to the lowest CPU
      asymmetry sched_domain level per-cpu pointer, which is to be now
      denoted by SD_ASYM_CPUCAPACITY_FULL flag.
      Signed-off-by: default avatarBeata Michalska <beata.michalska@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarValentin Schneider <valentin.schneider@arm.com>
      Reviewed-by: default avatarDietmar Eggemann <dietmar.eggemann@arm.com>
      Link: https://lore.kernel.org/r/20210603140627.8409-2-beata.michalska@arm.com
      2309a05d
    • Zhaoyang Huang's avatar
      psi: Fix race between psi_trigger_create/destroy · 8f91efd8
      Zhaoyang Huang authored
      Race detected between psi_trigger_destroy/create as shown below, which
      cause panic by accessing invalid psi_system->poll_wait->wait_queue_entry
      and psi_system->poll_timer->entry->next. Under this modification, the
      race window is removed by initialising poll_wait and poll_timer in
      group_init which are executed only once at beginning.
      
        psi_trigger_destroy()                   psi_trigger_create()
      
        mutex_lock(trigger_lock);
        rcu_assign_pointer(poll_task, NULL);
        mutex_unlock(trigger_lock);
      					  mutex_lock(trigger_lock);
      					  if (!rcu_access_pointer(group->poll_task)) {
      					    timer_setup(poll_timer, poll_timer_fn, 0);
      					    rcu_assign_pointer(poll_task, task);
      					  }
      					  mutex_unlock(trigger_lock);
      
        synchronize_rcu();
        del_timer_sync(poll_timer); <-- poll_timer has been reinitialized by
                                        psi_trigger_create()
      
      So, trigger_lock/RCU correctly protects destruction of
      group->poll_task but misses this race affecting poll_timer and
      poll_wait.
      
      Fixes: 461daba0 ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
      Co-developed-by: default avatarziwei.dai <ziwei.dai@unisoc.com>
      Signed-off-by: default avatarziwei.dai <ziwei.dai@unisoc.com>
      Co-developed-by: default avatarke.wang <ke.wang@unisoc.com>
      Signed-off-by: default avatarke.wang <ke.wang@unisoc.com>
      Signed-off-by: default avatarZhaoyang Huang <zhaoyang.huang@unisoc.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarSuren Baghdasaryan <surenb@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Link: https://lkml.kernel.org/r/1623371374-15664-1-git-send-email-huangzhaoyang@gmail.com
      8f91efd8
    • Huaixin Chang's avatar
      sched/fair: Introduce the burstable CFS controller · f4183717
      Huaixin Chang authored
      The CFS bandwidth controller limits CPU requests of a task group to
      quota during each period. However, parallel workloads might be bursty
      so that they get throttled even when their average utilization is under
      quota. And they are latency sensitive at the same time so that
      throttling them is undesired.
      
      We borrow time now against our future underrun, at the cost of increased
      interference against the other system users. All nicely bounded.
      
      Traditional (UP-EDF) bandwidth control is something like:
      
        (U = \Sum u_i) <= 1
      
      This guaranteeds both that every deadline is met and that the system is
      stable. After all, if U were > 1, then for every second of walltime,
      we'd have to run more than a second of program time, and obviously miss
      our deadline, but the next deadline will be further out still, there is
      never time to catch up, unbounded fail.
      
      This work observes that a workload doesn't always executes the full
      quota; this enables one to describe u_i as a statistical distribution.
      
      For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
      (the traditional WCET). This effectively allows u to be smaller,
      increasing the efficiency (we can pack more tasks in the system), but at
      the cost of missing deadlines when all the odds line up. However, it
      does maintain stability, since every overrun must be paired with an
      underrun as long as our x is above the average.
      
      That is, suppose we have 2 tasks, both specify a p(95) value, then we
      have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
      everything is good. At the same time we have a p(5)p(5) = 0.25% chance
      both tasks will exceed their quota at the same time (guaranteed deadline
      fail). Somewhere in between there's a threshold where one exceeds and
      the other doesn't underrun enough to compensate; this depends on the
      specific CDFs.
      
      At the same time, we can say that the worst case deadline miss, will be
      \Sum e_i; that is, there is a bounded tardiness (under the assumption
      that x+e is indeed WCET).
      
      The benefit of burst is seen when testing with schbench. Default value of
      kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used.
      
      	mkdir /sys/fs/cgroup/cpu/test
      	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
      	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
      	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
      
      	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
      
      The average CPU usage is at 80%. I run this for 10 times, and got long tail
      latency for 6 times and got throttled for 8 times.
      
      Tail latencies are shown below, and it wasn't the worst case.
      
      	Latency percentiles (usec)
      		50.0000th: 19872
      		75.0000th: 21344
      		90.0000th: 22176
      		95.0000th: 22496
      		*99.0000th: 22752
      		99.5000th: 22752
      		99.9000th: 22752
      		min=0, max=22727
      	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
      
      The interferenece when using burst is valued by the possibilities for
      missing the deadline and the average WCET. Test results showed that when
      there many cgroups or CPU is under utilized, the interference is
      limited. More details are shown in:
      https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/Co-developed-by: default avatarShanpei Chen <shanpeic@linux.alibaba.com>
      Signed-off-by: default avatarShanpei Chen <shanpeic@linux.alibaba.com>
      Co-developed-by: default avatarTianchen Ding <dtcccc@linux.alibaba.com>
      Signed-off-by: default avatarTianchen Ding <dtcccc@linux.alibaba.com>
      Signed-off-by: default avatarHuaixin Chang <changhuaixin@linux.alibaba.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarBen Segall <bsegall@google.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com
      f4183717
  4. 22 Jun, 2021 3 commits
    • Qais Yousef's avatar
      sched/uclamp: Fix uclamp_tg_restrict() · 0213b708
      Qais Yousef authored
      Now cpu.uclamp.min acts as a protection, we need to make sure that the
      uclamp request of the task is within the allowed range of the cgroup,
      that is it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and
      tg->uclamp[UCLAMP_MAX].
      
      As reported by Xuewen [1] we can have some corner cases where there's
      inversion between uclamp requested by task (p) and the uclamp values of
      the taskgroup it's attached to (tg). Following table demonstrates
      2 corner cases:
      
      	           |  p  |  tg  |  effective
      	-----------+-----+------+-----------
      	CASE 1
      	-----------+-----+------+-----------
      	uclamp_min | 60% | 0%   |  60%
      	-----------+-----+------+-----------
      	uclamp_max | 80% | 50%  |  50%
      	-----------+-----+------+-----------
      	CASE 2
      	-----------+-----+------+-----------
      	uclamp_min | 0%  | 30%  |  30%
      	-----------+-----+------+-----------
      	uclamp_max | 20% | 50%  |  20%
      	-----------+-----+------+-----------
      
      With this fix we get:
      
      	           |  p  |  tg  |  effective
      	-----------+-----+------+-----------
      	CASE 1
      	-----------+-----+------+-----------
      	uclamp_min | 60% | 0%   |  50%
      	-----------+-----+------+-----------
      	uclamp_max | 80% | 50%  |  50%
      	-----------+-----+------+-----------
      	CASE 2
      	-----------+-----+------+-----------
      	uclamp_min | 0%  | 30%  |  30%
      	-----------+-----+------+-----------
      	uclamp_max | 20% | 50%  |  30%
      	-----------+-----+------+-----------
      
      Additionally uclamp_update_active_tasks() must now unconditionally
      update both UCLAMP_MIN/MAX because changing the tg's UCLAMP_MAX for
      instance could have an impact on the effective UCLAMP_MIN of the tasks.
      
      	           |  p  |  tg  |  effective
      	-----------+-----+------+-----------
      	old
      	-----------+-----+------+-----------
      	uclamp_min | 60% | 0%   |  50%
      	-----------+-----+------+-----------
      	uclamp_max | 80% | 50%  |  50%
      	-----------+-----+------+-----------
      	*new*
      	-----------+-----+------+-----------
      	uclamp_min | 60% | 0%   | *60%*
      	-----------+-----+------+-----------
      	uclamp_max | 80% |*70%* | *70%*
      	-----------+-----+------+-----------
      
      [1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/
      
      Fixes: 0c18f2ec ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min")
      Reported-by: default avatarXuewen Yan <xuewen.yan94@gmail.com>
      Signed-off-by: default avatarQais Yousef <qais.yousef@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20210617165155.3774110-1-qais.yousef@arm.com
      0213b708
    • Vincent Donnefort's avatar
      sched/rt: Fix Deadline utilization tracking during policy change · d7d60709
      Vincent Donnefort authored
      DL keeps track of the utilization on a per-rq basis with the structure
      avg_dl. This utilization is updated during task_tick_dl(),
      put_prev_task_dl() and set_next_task_dl(). However, when the current
      running task changes its policy, set_next_task_dl() which would usually
      take care of updating the utilization when the rq starts running DL
      tasks, will not see a such change, leaving the avg_dl structure outdated.
      When that very same task will be dequeued later, put_prev_task_dl() will
      then update the utilization, based on a wrong last_update_time, leading to
      a huge spike in the DL utilization signal.
      
      The signal would eventually recover from this issue after few ms. Even
      if no DL tasks are run, avg_dl is also updated in
      __update_blocked_others(). But as the CPU capacity depends partly on the
      avg_dl, this issue has nonetheless a significant impact on the scheduler.
      
      Fix this issue by ensuring a load update when a running task changes
      its policy to DL.
      
      Fixes: 3727e0e1 ("sched/dl: Add dl_rq utilization tracking")
      Signed-off-by: default avatarVincent Donnefort <vincent.donnefort@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/1624271872-211872-3-git-send-email-vincent.donnefort@arm.com
      d7d60709
    • Vincent Donnefort's avatar
      sched/rt: Fix RT utilization tracking during policy change · fecfcbc2
      Vincent Donnefort authored
      RT keeps track of the utilization on a per-rq basis with the structure
      avg_rt. This utilization is updated during task_tick_rt(),
      put_prev_task_rt() and set_next_task_rt(). However, when the current
      running task changes its policy, set_next_task_rt() which would usually
      take care of updating the utilization when the rq starts running RT tasks,
      will not see a such change, leaving the avg_rt structure outdated. When
      that very same task will be dequeued later, put_prev_task_rt() will then
      update the utilization, based on a wrong last_update_time, leading to a
      huge spike in the RT utilization signal.
      
      The signal would eventually recover from this issue after few ms. Even if
      no RT tasks are run, avg_rt is also updated in __update_blocked_others().
      But as the CPU capacity depends partly on the avg_rt, this issue has
      nonetheless a significant impact on the scheduler.
      
      Fix this issue by ensuring a load update when a running task changes
      its policy to RT.
      
      Fixes: 371bf427 ("sched/rt: Add rt_rq utilization tracking")
      Signed-off-by: default avatarVincent Donnefort <vincent.donnefort@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/1624271872-211872-2-git-send-email-vincent.donnefort@arm.com
      fecfcbc2
  5. 18 Jun, 2021 8 commits
  6. 17 Jun, 2021 6 commits
    • Peter Zijlstra's avatar
      sched/fair: Age the average idle time · 94aafc3e
      Peter Zijlstra authored
      This is a partial forward-port of Peter Ziljstra's work first posted
      at:
      
         https://lore.kernel.org/lkml/20180530142236.667774973@infradead.org/
      
      Currently select_idle_cpu()'s proportional scheme uses the average idle
      time *for when we are idle*, that is temporally challenged.  When a CPU
      is not at all idle, we'll happily continue using whatever value we did
      see when the CPU goes idle. To fix this, introduce a separate average
      idle and age it (the existing value still makes sense for things like
      new-idle balancing, which happens when we do go idle).
      
      The overall goal is to not spend more time scanning for idle CPUs than
      we're idle for. Otherwise we're inhibiting work. This means that we need to
      consider the cost over all the wake-ups between consecutive idle periods.
      To track this, the scan cost is subtracted from the estimated average
      idle time.
      
      The impact of this patch is related to workloads that have domains that
      are fully busy or overloaded. Without the patch, the scan depth may be
      too high because a CPU is not reaching idle.
      
      Due to the nature of the patch, this is a regression magnet. It
      potentially wins when domains are almost fully busy or overloaded --
      at that point searches are likely to fail but idle is not being aged
      as CPUs are active so search depth is too large and useless. It will
      potentially show regressions when there are idle CPUs and a deep search is
      beneficial. This tbench result on a 2-socket broadwell machine partially
      illustates the problem
      
                                5.13.0-rc2             5.13.0-rc2
                                   vanilla     sched-avgidle-v1r5
      Hmean     1        445.02 (   0.00%)      451.36 *   1.42%*
      Hmean     2        830.69 (   0.00%)      846.03 *   1.85%*
      Hmean     4       1350.80 (   0.00%)     1505.56 *  11.46%*
      Hmean     8       2888.88 (   0.00%)     2586.40 * -10.47%*
      Hmean     16      5248.18 (   0.00%)     5305.26 *   1.09%*
      Hmean     32      8914.03 (   0.00%)     9191.35 *   3.11%*
      Hmean     64     10663.10 (   0.00%)    10192.65 *  -4.41%*
      Hmean     128    18043.89 (   0.00%)    18478.92 *   2.41%*
      Hmean     256    16530.89 (   0.00%)    17637.16 *   6.69%*
      Hmean     320    16451.13 (   0.00%)    17270.97 *   4.98%*
      
      Note that 8 was a regression point where a deeper search would have helped
      but it gains for high thread counts when searches are useless. Hackbench
      is a more extreme example although not perfect as the tasks idle rapidly
      
      hackbench-process-pipes
                                5.13.0-rc2             5.13.0-rc2
                                   vanilla     sched-avgidle-v1r5
      Amean     1        0.3950 (   0.00%)      0.3887 (   1.60%)
      Amean     4        0.9450 (   0.00%)      0.9677 (  -2.40%)
      Amean     7        1.4737 (   0.00%)      1.4890 (  -1.04%)
      Amean     12       2.3507 (   0.00%)      2.3360 *   0.62%*
      Amean     21       4.0807 (   0.00%)      4.0993 *  -0.46%*
      Amean     30       5.6820 (   0.00%)      5.7510 *  -1.21%*
      Amean     48       8.7913 (   0.00%)      8.7383 (   0.60%)
      Amean     79      14.3880 (   0.00%)     13.9343 *   3.15%*
      Amean     110     21.2233 (   0.00%)     19.4263 *   8.47%*
      Amean     141     28.2930 (   0.00%)     25.1003 *  11.28%*
      Amean     172     34.7570 (   0.00%)     30.7527 *  11.52%*
      Amean     203     41.0083 (   0.00%)     36.4267 *  11.17%*
      Amean     234     47.7133 (   0.00%)     42.0623 *  11.84%*
      Amean     265     53.0353 (   0.00%)     47.7720 *   9.92%*
      Amean     296     60.0170 (   0.00%)     53.4273 *  10.98%*
      Stddev    1        0.0052 (   0.00%)      0.0025 (  51.57%)
      Stddev    4        0.0357 (   0.00%)      0.0370 (  -3.75%)
      Stddev    7        0.0190 (   0.00%)      0.0298 ( -56.64%)
      Stddev    12       0.0064 (   0.00%)      0.0095 ( -48.38%)
      Stddev    21       0.0065 (   0.00%)      0.0097 ( -49.28%)
      Stddev    30       0.0185 (   0.00%)      0.0295 ( -59.54%)
      Stddev    48       0.0559 (   0.00%)      0.0168 (  69.92%)
      Stddev    79       0.1559 (   0.00%)      0.0278 (  82.17%)
      Stddev    110      1.1728 (   0.00%)      0.0532 (  95.47%)
      Stddev    141      0.7867 (   0.00%)      0.0968 (  87.69%)
      Stddev    172      1.0255 (   0.00%)      0.0420 (  95.91%)
      Stddev    203      0.8106 (   0.00%)      0.1384 (  82.92%)
      Stddev    234      1.1949 (   0.00%)      0.1328 (  88.89%)
      Stddev    265      0.9231 (   0.00%)      0.0820 (  91.11%)
      Stddev    296      1.0456 (   0.00%)      0.1327 (  87.31%)
      
      Again, higher thread counts benefit and the standard deviation
      shows that results are also a lot more stable when the idle
      time is aged.
      
      The patch potentially matters when a socket was multiple LLCs as the
      maximum search depth is lower. However, some of the test results were
      suspiciously good (e.g. specjbb2005 gaining 50% on a Zen1 machine) and
      other results were not dramatically different to other mcahines.
      
      Given the nature of the patch, Peter's full series is not being forward
      ported as each part should stand on its own. Preferably they would be
      merged at different times to reduce the risk of false bisections.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20210615111611.GH30378@techsingularity.net
      94aafc3e
    • Lukasz Luba's avatar
      sched/cpufreq: Consider reduced CPU capacity in energy calculation · 8f1b971b
      Lukasz Luba authored
      Energy Aware Scheduling (EAS) needs to predict the decisions made by
      SchedUtil. The map_util_freq() exists to do that.
      
      There are corner cases where the max allowed frequency might be reduced
      (due to thermal). SchedUtil as a CPUFreq governor, is aware of that
      but EAS is not. This patch aims to address it.
      
      SchedUtil stores the maximum allowed frequency in
      'sugov_policy::next_freq' field. EAS has to predict that value, which is
      the real used frequency. That value is made after a call to
      cpufreq_driver_resolve_freq() which clamps to the CPUFreq policy limits.
      In the existing code EAS is not able to predict that real frequency.
      This leads to energy estimation errors.
      
      To avoid wrong energy estimation in EAS (due to frequency miss prediction)
      make sure that the step which calculates Performance Domain frequency,
      is also aware of the allowed CPU capacity.
      
      Furthermore, modify map_util_freq() to not extend the frequency value.
      Instead, use map_util_perf() to extend the util value in both places:
      SchedUtil and EAS, but for EAS clamp it to max allowed CPU capacity.
      In the end, we achieve the same desirable behavior for both subsystems
      and alignment in regards to the real CPU frequency.
      Signed-off-by: default avatarLukasz Luba <lukasz.luba@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (For the schedutil part)
      Link: https://lore.kernel.org/r/20210614191238.23224-1-lukasz.luba@arm.com
      8f1b971b
    • Lukasz Luba's avatar
      sched/fair: Take thermal pressure into account while estimating energy · 489f1645
      Lukasz Luba authored
      Energy Aware Scheduling (EAS) needs to be able to predict the frequency
      requests made by the SchedUtil governor to properly estimate energy used
      in the future. It has to take into account CPUs utilization and forecast
      Performance Domain (PD) frequency. There is a corner case when the max
      allowed frequency might be reduced due to thermal. SchedUtil is aware of
      that reduced frequency, so it should be taken into account also in EAS
      estimations.
      
      SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of
      a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping
      to 'policy::max'. SchedUtil is responsible to respect that upper limit
      while setting the frequency through CPUFreq drivers. This effective
      frequency is stored internally in 'sugov_policy::next_freq' and EAS has
      to predict that value.
      
      In the existing code the raw value of arch_scale_cpu_capacity() is used
      for clamping the returned CPU utilization from effective_cpu_util().
      This patch fixes issue with too big single CPU utilization, by introducing
      clamping to the allowed CPU capacity. The allowed CPU capacity is a CPU
      capacity reduced by thermal pressure raw value.
      
      Thanks to knowledge about allowed CPU capacity, we don't get too big value
      for a single CPU utilization, which is then added to the util sum. The
      util sum is used as a source of information for estimating whole PD energy.
      To avoid wrong energy estimation in EAS (due to capped frequency), make
      sure that the calculation of util sum is aware of allowed CPU capacity.
      
      This thermal pressure might be visible in scenarios where the CPUs are not
      heavily loaded, but some other component (like GPU) drastically reduced
      available power budget and increased the SoC temperature. Thus, we still
      use EAS for task placement and CPUs are not over-utilized.
      Signed-off-by: default avatarLukasz Luba <lukasz.luba@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: default avatarDietmar Eggemann <dietmar.eggemann@arm.com>
      Link: https://lore.kernel.org/r/20210614191128.22735-1-lukasz.luba@arm.com
      489f1645
    • Lukasz Luba's avatar
      thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure · 2ad8ccc1
      Lukasz Luba authored
      The thermal pressure signal gives information to the scheduler about
      reduced CPU capacity due to thermal. It is based on a value stored in
      a per-cpu 'thermal_pressure' variable. The online CPUs will get the
      new value there, while the offline won't. Unfortunately, when the CPU
      is back online, the value read from per-cpu variable might be wrong
      (stale data).  This might affect the scheduler decisions, since it
      sees the CPU capacity differently than what is actually available.
      
      Fix it by making sure that all online+offline CPUs would get the
      proper value in their per-cpu variable when thermal framework sets
      capping.
      
      Fixes: f12e4f66 ("thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping")
      Signed-off-by: default avatarLukasz Luba <lukasz.luba@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarViresh Kumar <viresh.kumar@linaro.org>
      Link: https://lore.kernel.org/r/20210614191030.22241-1-lukasz.luba@arm.com
      2ad8ccc1
    • Dietmar Eggemann's avatar
      sched/fair: Return early from update_tg_cfs_load() if delta == 0 · 83c5e9d5
      Dietmar Eggemann authored
      In case the _avg delta is 0 there is no need to update se's _avg
      (level n) nor cfs_rq's _avg (level n-1). These values stay the same.
      
      Since cfs_rq's _avg isn't changed, i.e. no load is propagated down,
      cfs_rq's _sum should stay the same as well.
      
      So bail out after se's _sum has been updated.
      Signed-off-by: default avatarDietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/20210601083616.804229-1-dietmar.eggemann@arm.com
      83c5e9d5
    • Vincent Guittot's avatar
      sched/pelt: Check that *_avg are null when *_sum are · 9e077b52
      Vincent Guittot authored
      Check that we never break the rule that pelt's avg values are null if
      pelt's sum are.
      Signed-off-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarDietmar Eggemann <dietmar.eggemann@arm.com>
      Acked-by: default avatarOdin Ugedal <odin@uged.al>
      Link: https://lore.kernel.org/r/20210601155328.19487-1-vincent.guittot@linaro.org
      9e077b52
  7. 14 Jun, 2021 1 commit
    • Odin Ugedal's avatar
      sched/fair: Correctly insert cfs_rq's to list on unthrottle · a7b359fc
      Odin Ugedal authored
      Fix an issue where fairness is decreased since cfs_rq's can end up not
      being decayed properly. For two sibling control groups with the same
      priority, this can often lead to a load ratio of 99/1 (!!).
      
      This happens because when a cfs_rq is throttled, all the descendant
      cfs_rq's will be removed from the leaf list. When they initial cfs_rq
      is unthrottled, it will currently only re add descendant cfs_rq's if
      they have one or more entities enqueued. This is not a perfect
      heuristic.
      
      Instead, we insert all cfs_rq's that contain one or more enqueued
      entities, or it its load is not completely decayed.
      
      Can often lead to situations like this for equally weighted control
      groups:
      
        $ ps u -C stress
        USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
        root       10009 88.8  0.0   3676   100 pts/1    R+   11:04   0:13 stress --cpu 1
        root       10023  3.0  0.0   3676   104 pts/1    R+   11:04   0:00 stress --cpu 1
      
      Fixes: 31bc6aea ("sched/fair: Optimize update_blocked_averages()")
      [vingo: !SMP build fix]
      Signed-off-by: default avatarOdin Ugedal <odin@uged.al>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/20210612112815.61678-1-odin@uged.al
      a7b359fc
  8. 13 Jun, 2021 4 commits
    • Linus Torvalds's avatar
      Linux 5.13-rc6 · 009c9aa5
      Linus Torvalds authored
      009c9aa5
    • Linus Torvalds's avatar
      Merge tag 'perf-tools-fixes-for-v5.13-2021-06-13' of... · e4e45343
      Linus Torvalds authored
      Merge tag 'perf-tools-fixes-for-v5.13-2021-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
      
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Correct buffer copying when peeking events
      
       - Sync cpufeatures/disabled-features.h header with the kernel sources
      
      * tag 'perf-tools-fixes-for-v5.13-2021-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
        tools headers cpufeatures: Sync with the kernel sources
        perf session: Correct buffer copying when peeking events
      e4e45343
    • Linus Torvalds's avatar
      Merge tag 'nfs-for-5.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 960f0716
      Linus Torvalds authored
      Pull NFS client bugfixes from Trond Myklebust:
       "Highlights include:
      
        Stable fixes:
      
         - Fix use-after-free in nfs4_init_client()
      
        Bugfixes:
      
         - Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
      
         - Fix second deadlock in nfs4_evict_inode()
      
         - nfs4_proc_set_acl should not change the value of NFS_CAP_UIDGID_NOMAP
      
         - Fix setting of the NFS_CAP_SECURITY_LABEL capability"
      
      * tag 'nfs-for-5.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        NFSv4: Fix second deadlock in nfs4_evict_inode()
        NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
        NFS: FMODE_READ and friends are C macros, not enum types
        NFS: Fix a potential NULL dereference in nfs_get_client()
        NFS: Fix use-after-free in nfs4_init_client()
        NFS: Ensure the NFS_CAP_SECURITY_LABEL capability is set when appropriate
        NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error.
      960f0716
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 331a6edb
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Four reasonably small fixes to the core for scsi host allocation
        failure paths.
      
        The root problem is that we're not freeing the memory allocated by
        dev_set_name(), which involves a rejig of may of the free on error
        paths to do put_device() instead of kfree which, in turn, has several
        other knock on ramifications and inspection turned up a few other
        lurking bugs"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: core: Only put parent device if host state differs from SHOST_CREATED
        scsi: core: Put .shost_dev in failure path if host state changes to RUNNING
        scsi: core: Fix failure handling of scsi_add_host_with_dma()
        scsi: core: Fix error handling of scsi_host_alloc()
      331a6edb
  9. 12 Jun, 2021 6 commits
    • Linus Torvalds's avatar
      Merge tag 'riscv-for-linus-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 8ecfa36c
      Linus Torvalds authored
      Pull RISC-V fixes from Palmer Dabbelt:
      
       - A pair of XIP fixes: one to fix alternatives, and one to turn off the
         rest of the features that require code modification
      
       - A fix to a type that was causing some alternatives to break
      
       - A build fix for BUILTIN_DTB
      
      * tag 'riscv-for-linus-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: Fix BUILTIN_DTB for sifive and microchip soc
        riscv: alternative: fix typo in macro name
        riscv: code patching only works on !XIP_KERNEL
        riscv: xip: support runtime trap patching
      8ecfa36c
    • Feng Tang's avatar
      mm: relocate 'write_protect_seq' in struct mm_struct · 2e302543
      Feng Tang authored
      0day robot reported a 9.2% regression for will-it-scale mmap1 test
      case[1], caused by commit 57efa1fe ("mm/gup: prevent gup_fast from
      racing with COW during fork").
      
      Further debug shows the regression is due to that commit changes the
      offset of hot fields 'mmap_lock' inside structure 'mm_struct', thus some
      cache alignment changes.
      
      From the perf data, the contention for 'mmap_lock' is very severe and
      takes around 95% cpu cycles, and it is a rw_semaphore
      
              struct rw_semaphore {
                      atomic_long_t count;	/* 8 bytes */
                      atomic_long_t owner;	/* 8 bytes */
                      struct optimistic_spin_queue osq; /* spinner MCS lock */
                      ...
      
      Before commit 57efa1fe adds the 'write_protect_seq', it happens to
      have a very optimal cache alignment layout, as Linus explained:
      
       "and before the addition of the 'write_protect_seq' field, the
        mmap_sem was at offset 120 in 'struct mm_struct'.
      
        Which meant that count and owner were in two different cachelines,
        and then when you have contention and spend time in
        rwsem_down_write_slowpath(), this is probably *exactly* the kind
        of layout you want.
      
        Because first the rwsem_write_trylock() will do a cmpxchg on the
        first cacheline (for the optimistic fast-path), and then in the
        case of contention, rwsem_down_write_slowpath() will just access
        the second cacheline.
      
        Which is probably just optimal for a load that spends a lot of
        time contended - new waiters touch that first cacheline, and then
        they queue themselves up on the second cacheline."
      
      After the commit, the rw_semaphore is at offset 128, which means the
      'count' and 'owner' fields are now in the same cacheline, and causes
      more cache bouncing.
      
      Currently there are 3 "#ifdef CONFIG_XXX" before 'mmap_lock' which will
      affect its offset:
      
        CONFIG_MMU
        CONFIG_MEMBARRIER
        CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
      
      The layout above is on 64 bits system with 0day's default kernel config
      (similar to RHEL-8.3's config), in which all these 3 options are 'y'.
      And the layout can vary with different kernel configs.
      
      Relayouting a structure is usually a double-edged sword, as sometimes it
      can helps one case, but hurt other cases.  For this case, one solution
      is, as the newly added 'write_protect_seq' is a 4 bytes long seqcount_t
      (when CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4 bytes
      hole in 'mm_struct' will not change other fields' alignment, while
      restoring the regression.
      
      Link: https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/ [1]
      Reported-by: default avatarkernel test robot <oliver.sang@intel.com>
      Signed-off-by: default avatarFeng Tang <feng.tang@intel.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2e302543
    • Linus Torvalds's avatar
      Merge tag 'usb-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 43cb5d49
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are a number of tiny USB fixes for 5.13-rc6.
      
        There are more than I would normally like, but there's been a bunch of
        people banging on the gadget and dwc3 and typec code recently for I
        think an Android release, which has resulted in a number of small
        fixes. It's nice to see companies send fixes upstream for this type of
        work, a notable change from years ago.
      
        Anyway, fixes in here are:
      
         - usb-serial device id updates
      
         - usb-serial cp210x driver fixes for broken firmware versions
      
         - typec fixes for crazy charging devices and other reported problems
      
         - dwc3 fixes for reported problems found
      
         - gadget fixes for reported problems
      
         - tiny xhci fixes
      
         - other small fixes for reported issues.
      
         - revert of a problem fix found by linux-next testing
      
        All of these have passed 0-day and linux-next testing with no reported
        problems (the revert for the found linux-next build problem included)"
      
      * tag 'usb-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (44 commits)
        Revert "usb: gadget: fsl: Re-enable driver for ARM SoCs"
        usb: typec: mux: Fix copy-paste mistake in typec_mux_match
        usb: typec: ucsi: Clear PPM capability data in ucsi_init() error path
        usb: gadget: fsl: Re-enable driver for ARM SoCs
        usb: typec: wcove: Use LE to CPU conversion when accessing msg->header
        USB: serial: cp210x: fix CP2102N-A01 modem control
        USB: serial: cp210x: fix alternate function for CP2102N QFN20
        usb: misc: brcmstb-usb-pinmap: check return value after calling platform_get_resource()
        usb: dwc3: ep0: fix NULL pointer exception
        usb: gadget: eem: fix wrong eem header operation
        usb: typec: intel_pmc_mux: Put ACPI device using acpi_dev_put()
        usb: typec: intel_pmc_mux: Add missed error check for devm_ioremap_resource()
        usb: typec: intel_pmc_mux: Put fwnode in error case during ->probe()
        usb: typec: tcpm: Do not finish VDM AMS for retrying Responses
        usb: fix various gadget panics on 10gbps cabling
        usb: fix various gadgets null ptr deref on 10gbps cabling.
        usb: pci-quirks: disable D3cold on xhci suspend for s2idle on AMD Renoir
        usb: f_ncm: only first packet of aggregate needs to start timer
        USB: f_ncm: ncm_bitrate (speed) is unsigned
        MAINTAINERS: usb: add entry for isp1760
        ...
      43cb5d49
    • Linus Torvalds's avatar
      Merge tag 'tty-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · c46fe4aa
      Linus Torvalds authored
      Pull serial driver fix from Greg KH:
       "A single 8250_exar serial driver fix for a reported problem with a
        change that happened in 5.13-rc1.
      
        It has been in linux-next with no reported problems"
      
      * tag 'tty-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        serial: 8250_exar: Avoid NULL pointer dereference at ->exit()
      c46fe4aa
    • Linus Torvalds's avatar
      Merge tag 'staging-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 0d506588
      Linus Torvalds authored
      Pull staging driver fixes from Greg KH:
       "Two tiny staging driver fixes:
      
         - ralink-gdma driver authorship information fixed up
      
         - rtl8723bs driver fix for reported regression
      
        Both have been in linux-next for a while with no reported problems"
      
      * tag 'staging-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: ralink-gdma: Remove incorrect author information
        staging: rtl8723bs: Fix uninitialized variables
      0d506588
    • Linus Torvalds's avatar
      Merge tag 'driver-core-5.13-rc6' of... · 87a7f736
      Linus Torvalds authored
      Merge tag 'driver-core-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
      
      Pull driver core fix from Greg KH:
       "A single debugfs fix for 5.13-rc6, fixing a bug in
        debugfs_read_file_str() that showed up in 5.13-rc1.
      
        It has been in linux-next for a full week with no
        reported problems"
      
      * tag 'driver-core-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
        debugfs: Fix debugfs_read_file_str()
      87a7f736