Commit 5b77261c authored by Pierre Gondois's avatar Pierre Gondois Committed by Ingo Molnar

sched/topology: Remove the EM_MAX_COMPLEXITY limit

The Energy Aware Scheduler (EAS) estimates the energy consumption
of placing a task on different CPUs. The goal is to minimize this
energy consumption. Estimating the energy of different task placements
is increasingly complex with the size of the platform.

To avoid having a slow wake-up path, EAS is only enabled if this
complexity is low enough.

The current complexity limit was set in:

  b68a4c0d ("sched/topology: Disable EAS on inappropriate platforms")

... based on the first implementation of EAS, which was re-computing
the power of the whole platform for each task placement scenario, see:

  390031e4 ("sched/fair: Introduce an energy estimation helper function")

... but the complexity of EAS was reduced in:

  eb92692b ("sched/fair: Speed-up energy-aware wake-ups")

... and find_energy_efficient_cpu() (feec) algorithm was updated in:

  3e8c6c9a ("sched/fair: Remove task_util from effective utilization in feec()")

find_energy_efficient_cpu() (feec) is now doing:

	feec()
	\_ for_each_pd(pd) [0]
	  // get max_spare_cap_cpu and compute_prev_delta
	  \_ for_each_cpu(pd) [1]

	  \_ eenv_pd_busy_time(pd) [2]
		\_ for_each_cpu(pd)

	  // compute_energy(pd) without the task
	  \_ eenv_pd_max_util(pd, -1) [3.0]
	    \_ for_each_cpu(pd)
	  \_ em_cpu_energy(pd, -1)
	    \_ for_each_ps(pd)

	  // compute_energy(pd) with the task on prev_cpu
	  \_ eenv_pd_max_util(pd, prev_cpu) [3.1]
	    \_ for_each_cpu(pd)
	  \_ em_cpu_energy(pd, prev_cpu)
	    \_ for_each_ps(pd)

	  // compute_energy(pd) with the task on max_spare_cap_cpu
	  \_ eenv_pd_max_util(pd, max_spare_cap_cpu) [3.2]
	    \_ for_each_cpu(pd)
	  \_ em_cpu_energy(pd, max_spare_cap_cpu)
	    \_ for_each_ps(pd)

	[3.1] happens only once since prev_cpu is unique. With the same
	      definitions for nr_pd, nr_cpus and nr_ps, the complexity is of:

		nr_pd * (2 * [nr_cpus in pd] + 2 * ([nr_cpus in pd] + [nr_ps in pd]))
		+ ([nr_cpus in pd] + [nr_ps in pd])

		 [0]  * (     [1] + [2]      +       [3.0] + [3.2]                  )
		+ [3.1]

		= nr_pd * (4 * [nr_cpus in pd] + 2 * [nr_ps in pd])
		+ [nr_cpus in prev pd] + nr_ps

The complexity limit was set to 2048 in:

  b68a4c0d ("sched/topology: Disable EAS on inappropriate platforms")

... to make "EAS usable up to 16 CPUs with per-CPU DVFS and less than 8
performance states each". For the same platform, the complexity would
actually be of:

  16 * (4 + 2 * 7) + 1 + 7 = 296

Since the EAS complexity was greatly reduced since the limit was
introduced, bigger platforms can handle EAS.

For instance, a platform with 112 CPUs with 7 performance states
each would not reach it:

  112 * (4 + 2 * 7) + 1 + 7 = 2024

To reflect this improvement in the underlying EAS code, remove
the EAS complexity check.

Note that a limit on the number of CPUs still holds against
EM_MAX_NUM_CPUS to avoid overflows during the energy estimation.

[ mingo: Updates to the changelog. ]
Signed-off-by: default avatarPierre Gondois <Pierre.Gondois@arm.com>
Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
Reviewed-by: default avatarLukasz Luba <lukasz.luba@arm.com>
Reviewed-by: default avatarDietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20231009060037.170765-2-sshegde@linux.vnet.ibm.com
parent 7bc26384
...@@ -359,32 +359,9 @@ in milli-Watts or in an 'abstract scale'. ...@@ -359,32 +359,9 @@ in milli-Watts or in an 'abstract scale'.
6.3 - Energy Model complexity 6.3 - Energy Model complexity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The task wake-up path is very latency-sensitive. When the EM of a platform is EAS does not impose any complexity limit on the number of PDs/OPPs/CPUs but
too complex (too many CPUs, too many performance domains, too many performance restricts the number of CPUs to EM_MAX_NUM_CPUS to prevent overflows during
states, ...), the cost of using it in the wake-up path can become prohibitive. the energy estimation.
The energy-aware wake-up algorithm has a complexity of:
C = Nd * (Nc + Ns)
with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).
A complexity check is performed at the root domain level, when scheduling
domains are built. EAS will not start on a root domain if its C happens to be
higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
time of writing).
If you really want to use EAS but the complexity of your platform's Energy
Model is too high to be used with a single root domain, you're left with only
two possible options:
1. split your system into separate, smaller, root domains using exclusive
cpusets and enable EAS locally on each of them. This option has the
benefit to work out of the box but the drawback of preventing load
balance between root domains, which can result in an unbalanced system
overall;
2. submit patches to reduce the complexity of the EAS wake-up algorithm,
hence enabling it to cope with larger EMs in reasonable time.
6.4 - Schedutil governor 6.4 - Schedutil governor
......
...@@ -348,32 +348,13 @@ static void sched_energy_set(bool has_eas) ...@@ -348,32 +348,13 @@ static void sched_energy_set(bool has_eas)
* 1. an Energy Model (EM) is available; * 1. an Energy Model (EM) is available;
* 2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy. * 2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
* 3. no SMT is detected. * 3. no SMT is detected.
* 4. the EM complexity is low enough to keep scheduling overheads low; * 4. schedutil is driving the frequency of all CPUs of the rd;
* 5. schedutil is driving the frequency of all CPUs of the rd; * 5. frequency invariance support is present;
* 6. frequency invariance support is present;
*
* The complexity of the Energy Model is defined as:
*
* C = nr_pd * (nr_cpus + nr_ps)
*
* with parameters defined as:
* - nr_pd: the number of performance domains
* - nr_cpus: the number of CPUs
* - nr_ps: the sum of the number of performance states of all performance
* domains (for example, on a system with 2 performance domains,
* with 10 performance states each, nr_ps = 2 * 10 = 20).
*
* It is generally not a good idea to use such a model in the wake-up path on
* very complex platforms because of the associated scheduling overheads. The
* arbitrary constraint below prevents that. It makes EAS usable up to 16 CPUs
* with per-CPU DVFS and less than 8 performance states each, for example.
*/ */
#define EM_MAX_COMPLEXITY 2048
extern struct cpufreq_governor schedutil_gov; extern struct cpufreq_governor schedutil_gov;
static bool build_perf_domains(const struct cpumask *cpu_map) static bool build_perf_domains(const struct cpumask *cpu_map)
{ {
int i, nr_pd = 0, nr_ps = 0, nr_cpus = cpumask_weight(cpu_map); int i;
struct perf_domain *pd = NULL, *tmp; struct perf_domain *pd = NULL, *tmp;
int cpu = cpumask_first(cpu_map); int cpu = cpumask_first(cpu_map);
struct root_domain *rd = cpu_rq(cpu)->rd; struct root_domain *rd = cpu_rq(cpu)->rd;
...@@ -431,20 +412,6 @@ static bool build_perf_domains(const struct cpumask *cpu_map) ...@@ -431,20 +412,6 @@ static bool build_perf_domains(const struct cpumask *cpu_map)
goto free; goto free;
tmp->next = pd; tmp->next = pd;
pd = tmp; pd = tmp;
/*
* Count performance domains and performance states for the
* complexity check.
*/
nr_pd++;
nr_ps += em_pd_nr_perf_states(pd->em_pd);
}
/* Bail out if the Energy Model complexity is too high. */
if (nr_pd * (nr_ps + nr_cpus) > EM_MAX_COMPLEXITY) {
WARN(1, "rd %*pbl: Failed to start EAS, EM complexity is too high\n",
cpumask_pr_args(cpu_map));
goto free;
} }
perf_domain_debug(cpu_map, pd); perf_domain_debug(cpu_map, pd);
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment