Commit bfe8eb3b authored by Linus Torvalds's avatar Linus Torvalds

Merge tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Energy scheduling:

   - Consolidate how the max compute capacity is used in the scheduler
     and how we calculate the frequency for a level of utilization.

   - Rework interface between the scheduler and the schedutil governor

   - Simplify the util_est logic

  Deadline scheduler:

   - Work more towards reducing SCHED_DEADLINE starvation of low
     priority tasks (e.g., SCHED_OTHER) tasks when higher priority tasks
     monopolize CPU cycles, via the introduction of 'deadline servers'
     (nested/2-level scheduling).

     "Fair servers" to make use of this facility are not introduced yet.

  EEVDF:

   - Introduce O(1) fastpath for EEVDF task selection

  NUMA balancing:

   - Tune the NUMA-balancing vma scanning logic some more, to better
     distribute the probability of a particular vma getting scanned.

  Plus misc fixes, cleanups and updates"

* tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  sched/fair: Fix tg->load when offlining a CPU
  sched/fair: Remove unused 'next_buddy_marked' local variable in check_preempt_wakeup_fair()
  sched/fair: Use all little CPUs for CPU-bound workloads
  sched/fair: Simplify util_est
  sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)
  arm64/amu: Use capacity_ref_freq() to set AMU ratio
  cpufreq/cppc: Set the frequency used for computing the capacity
  cpufreq/cppc: Move and rename cppc_cpufreq_{perf_to_khz|khz_to_perf}()
  energy_model: Use a fixed reference frequency
  cpufreq/schedutil: Use a fixed reference frequency
  cpufreq: Use the fixed and coherent frequency for scaling capacity
  sched/topology: Add a new arch_scale_freq_ref() method
  freezer,sched: Clean saved_state when restoring it during thaw
  sched/fair: Update min_vruntime for reweight_entity() correctly
  sched/doc: Update documentation after renames and synchronize Chinese version
  sched/cpufreq: Rework iowait boost
  sched/cpufreq: Rework schedutil governor performance estimation
  sched/pelt: Avoid underestimation of task utilization
  sched/timers: Explain why idle task schedules out on remote timer enqueue
  sched/cpuidle: Comment about timers requirements VS idle handler
  ...
parents aac4de46 cdb3033e
......@@ -180,7 +180,7 @@ This is the (partial) list of the hooks:
compat_yield sysctl is turned on; in that case, it places the scheduling
entity at the right-most end of the red-black tree.
- check_preempt_curr(...)
- wakeup_preempt(...)
This function checks if a task that entered the runnable state should
preempt the currently running task.
......@@ -189,10 +189,10 @@ This is the (partial) list of the hooks:
This function chooses the most appropriate task eligible to run next.
- set_curr_task(...)
- set_next_task(...)
This function is called when a task changes its scheduling class or changes
its task group.
This function is called when a task changes its scheduling class, changes
its task group or is scheduled.
- task_tick(...)
......
......@@ -90,8 +90,8 @@ For more detail see:
- Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
UTIL_EST / UTIL_EST_FASTUP
==========================
UTIL_EST
========
Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
......@@ -99,8 +99,7 @@ though when running their expected utilization will be the same, they suffer a
To alleviate this (a default enabled option) UTIL_EST drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
filter to instantly increase and only decay on decrease.
highest. UTIL_EST filters to instantly increase and only decay on decrease.
A further runqueue wide sum (of runnable tasks) is maintained of:
......
......@@ -80,7 +80,7 @@ p->se.vruntime。一旦p->se.vruntime变得足够大,其它的任务将成为
CFS使用纳秒粒度的计时,不依赖于任何jiffies或HZ的细节。因此CFS并不像之前的调度器那样
有“时间片”的概念,也没有任何启发式的设计。唯一可调的参数(你需要打开CONFIG_SCHED_DEBUG)是:
/sys/kernel/debug/sched/min_granularity_ns
/sys/kernel/debug/sched/base_slice_ns
它可以用来将调度器从“桌面”模式(也就是低时延)调节为“服务器”(也就是高批处理)模式。
它的默认设置是适合桌面的工作负载。SCHED_BATCH也被CFS调度器模块处理。
......@@ -147,7 +147,7 @@ array)。
这个函数的行为基本上是出队,紧接着入队,除非compat_yield sysctl被开启。在那种情况下,
它将调度实体放在红黑树的最右端。
- check_preempt_curr(...)
- wakeup_preempt(...)
这个函数检查进入可运行状态的任务能否抢占当前正在运行的任务。
......@@ -155,9 +155,9 @@ array)。
这个函数选择接下来最适合运行的任务。
- set_curr_task(...)
- set_next_task(...)
这个函数在任务改变调度类或改变任务组时被调用。
这个函数在任务改变调度类,改变任务组时,或者任务被调度时被调用。
- task_tick(...)
......
......@@ -89,16 +89,15 @@ r_cpu被定义为当前CPU的最高性能水平与系统中任何其它CPU的最
- Documentation/translations/zh_CN/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
UTIL_EST / UTIL_EST_FASTUP
==========================
UTIL_EST
========
由于周期性任务的平均数在睡眠时会衰减,而在运行时其预期利用率会和睡眠前相同,
因此它们在再次运行后会面临(DVFS)的上涨。
为了缓解这个问题,(一个默认使能的编译选项)UTIL_EST驱动一个无限脉冲响应
(Infinite Impulse Response,IIR)的EWMA,“运行”值在出队时是最高的。
另一个默认使能的编译选项UTIL_EST_FASTUP修改了IIR滤波器,使其允许立即增加,
仅在利用率下降时衰减。
UTIL_EST滤波使其在遇到更高值时立刻增加,而遇到低值时会缓慢衰减。
进一步,运行队列的(可运行任务的)利用率之和由下式计算:
......
......@@ -13,6 +13,7 @@
#define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant
#define arch_scale_freq_ref topology_get_freq_ref
#endif
/* Replace task scheduler's default cpu-invariant accounting */
......
......@@ -23,6 +23,7 @@ void update_freq_counters_refs(void);
#define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant
#define arch_scale_freq_ref topology_get_freq_ref
#ifdef CONFIG_ACPI_CPPC_LIB
#define arch_init_invariance_cppc topology_init_cpu_capacity_cppc
......
......@@ -82,7 +82,12 @@ int __init parse_acpi_topology(void)
#undef pr_fmt
#define pr_fmt(fmt) "AMU: " fmt
static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale);
/*
* Ensure that amu_scale_freq_tick() will return SCHED_CAPACITY_SCALE until
* the CPU capacity and its associated frequency have been correctly
* initialized.
*/
static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale) = 1UL << (2 * SCHED_CAPACITY_SHIFT);
static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
static cpumask_var_t amu_fie_cpus;
......@@ -112,14 +117,14 @@ static inline bool freq_counters_valid(int cpu)
return true;
}
static int freq_inv_set_max_ratio(int cpu, u64 max_rate, u64 ref_rate)
void freq_inv_set_max_ratio(int cpu, u64 max_rate)
{
u64 ratio;
u64 ratio, ref_rate = arch_timer_get_rate();
if (unlikely(!max_rate || !ref_rate)) {
pr_debug("CPU%d: invalid maximum or reference frequency.\n",
WARN_ONCE(1, "CPU%d: invalid maximum or reference frequency.\n",
cpu);
return -EINVAL;
return;
}
/*
......@@ -139,12 +144,10 @@ static int freq_inv_set_max_ratio(int cpu, u64 max_rate, u64 ref_rate)
ratio = div64_u64(ratio, max_rate);
if (!ratio) {
WARN_ONCE(1, "Reference frequency too low.\n");
return -EINVAL;
return;
}
per_cpu(arch_max_freq_scale, cpu) = (unsigned long)ratio;
return 0;
WRITE_ONCE(per_cpu(arch_max_freq_scale, cpu), (unsigned long)ratio);
}
static void amu_scale_freq_tick(void)
......@@ -195,10 +198,7 @@ static void amu_fie_setup(const struct cpumask *cpus)
return;
for_each_cpu(cpu, cpus) {
if (!freq_counters_valid(cpu) ||
freq_inv_set_max_ratio(cpu,
cpufreq_get_hw_max_freq(cpu) * 1000ULL,
arch_timer_get_rate()))
if (!freq_counters_valid(cpu))
return;
}
......
......@@ -9,6 +9,7 @@
#define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant
#define arch_scale_freq_ref topology_get_freq_ref
/* Replace task scheduler's default cpu-invariant accounting */
#define arch_scale_cpu_capacity topology_get_cpu_scale
......
......@@ -39,6 +39,9 @@
#include <linux/rwsem.h>
#include <linux/wait.h>
#include <linux/topology.h>
#include <linux/dmi.h>
#include <linux/units.h>
#include <asm/unaligned.h>
#include <acpi/cppc_acpi.h>
......@@ -1760,3 +1763,104 @@ unsigned int cppc_get_transition_latency(int cpu_num)
return latency_ns;
}
EXPORT_SYMBOL_GPL(cppc_get_transition_latency);
/* Minimum struct length needed for the DMI processor entry we want */
#define DMI_ENTRY_PROCESSOR_MIN_LENGTH 48
/* Offset in the DMI processor structure for the max frequency */
#define DMI_PROCESSOR_MAX_SPEED 0x14
/* Callback function used to retrieve the max frequency from DMI */
static void cppc_find_dmi_mhz(const struct dmi_header *dm, void *private)
{
const u8 *dmi_data = (const u8 *)dm;
u16 *mhz = (u16 *)private;
if (dm->type == DMI_ENTRY_PROCESSOR &&
dm->length >= DMI_ENTRY_PROCESSOR_MIN_LENGTH) {
u16 val = (u16)get_unaligned((const u16 *)
(dmi_data + DMI_PROCESSOR_MAX_SPEED));
*mhz = val > *mhz ? val : *mhz;
}
}
/* Look up the max frequency in DMI */
static u64 cppc_get_dmi_max_khz(void)
{
u16 mhz = 0;
dmi_walk(cppc_find_dmi_mhz, &mhz);
/*
* Real stupid fallback value, just in case there is no
* actual value set.
*/
mhz = mhz ? mhz : 1;
return KHZ_PER_MHZ * mhz;
}
/*
* If CPPC lowest_freq and nominal_freq registers are exposed then we can
* use them to convert perf to freq and vice versa. The conversion is
* extrapolated as an affine function passing by the 2 points:
* - (Low perf, Low freq)
* - (Nominal perf, Nominal freq)
*/
unsigned int cppc_perf_to_khz(struct cppc_perf_caps *caps, unsigned int perf)
{
s64 retval, offset = 0;
static u64 max_khz;
u64 mul, div;
if (caps->lowest_freq && caps->nominal_freq) {
mul = caps->nominal_freq - caps->lowest_freq;
mul *= KHZ_PER_MHZ;
div = caps->nominal_perf - caps->lowest_perf;
offset = caps->nominal_freq * KHZ_PER_MHZ -
div64_u64(caps->nominal_perf * mul, div);
} else {
if (!max_khz)
max_khz = cppc_get_dmi_max_khz();
mul = max_khz;
div = caps->highest_perf;
}
retval = offset + div64_u64(perf * mul, div);
if (retval >= 0)
return retval;
return 0;
}
EXPORT_SYMBOL_GPL(cppc_perf_to_khz);
unsigned int cppc_khz_to_perf(struct cppc_perf_caps *caps, unsigned int freq)
{
s64 retval, offset = 0;
static u64 max_khz;
u64 mul, div;
if (caps->lowest_freq && caps->nominal_freq) {
mul = caps->nominal_perf - caps->lowest_perf;
div = caps->nominal_freq - caps->lowest_freq;
/*
* We don't need to convert to kHz for computing offset and can
* directly use nominal_freq and lowest_freq as the div64_u64
* will remove the frequency unit.
*/
offset = caps->nominal_perf -
div64_u64(caps->nominal_freq * mul, div);
/* But we need it for computing the perf level. */
div *= KHZ_PER_MHZ;
} else {
if (!max_khz)
max_khz = cppc_get_dmi_max_khz();
mul = caps->highest_perf;
div = max_khz;
}
retval = offset + div64_u64(freq * mul, div);
if (retval >= 0)
return retval;
return 0;
}
EXPORT_SYMBOL_GPL(cppc_khz_to_perf);
......@@ -19,6 +19,7 @@
#include <linux/init.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/units.h>
#define CREATE_TRACE_POINTS
#include <trace/events/thermal_pressure.h>
......@@ -26,7 +27,8 @@
static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
static struct cpumask scale_freq_counters_mask;
static bool scale_freq_invariant;
static DEFINE_PER_CPU(u32, freq_factor) = 1;
DEFINE_PER_CPU(unsigned long, capacity_freq_ref) = 1;
EXPORT_PER_CPU_SYMBOL_GPL(capacity_freq_ref);
static bool supports_scale_freq_counters(const struct cpumask *cpus)
{
......@@ -170,9 +172,9 @@ DEFINE_PER_CPU(unsigned long, thermal_pressure);
* operating on stale data when hot-plug is used for some CPUs. The
* @capped_freq reflects the currently allowed max CPUs frequency due to
* thermal capping. It might be also a boost frequency value, which is bigger
* than the internal 'freq_factor' max frequency. In such case the pressure
* value should simply be removed, since this is an indication that there is
* no thermal throttling. The @capped_freq must be provided in kHz.
* than the internal 'capacity_freq_ref' max frequency. In such case the
* pressure value should simply be removed, since this is an indication that
* there is no thermal throttling. The @capped_freq must be provided in kHz.
*/
void topology_update_thermal_pressure(const struct cpumask *cpus,
unsigned long capped_freq)
......@@ -183,10 +185,7 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
cpu = cpumask_first(cpus);
max_capacity = arch_scale_cpu_capacity(cpu);
max_freq = per_cpu(freq_factor, cpu);
/* Convert to MHz scale which is used in 'freq_factor' */
capped_freq /= 1000;
max_freq = arch_scale_freq_ref(cpu);
/*
* Handle properly the boost frequencies, which should simply clean
......@@ -279,13 +278,13 @@ void topology_normalize_cpu_scale(void)
capacity_scale = 1;
for_each_possible_cpu(cpu) {
capacity = raw_capacity[cpu] * per_cpu(freq_factor, cpu);
capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
capacity_scale = max(capacity, capacity_scale);
}
pr_debug("cpu_capacity: capacity_scale=%llu\n", capacity_scale);
for_each_possible_cpu(cpu) {
capacity = raw_capacity[cpu] * per_cpu(freq_factor, cpu);
capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT,
capacity_scale);
topology_set_cpu_scale(cpu, capacity);
......@@ -321,15 +320,15 @@ bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
cpu_node, raw_capacity[cpu]);
/*
* Update freq_factor for calculating early boot cpu capacities.
* Update capacity_freq_ref for calculating early boot CPU capacities.
* For non-clk CPU DVFS mechanism, there's no way to get the
* frequency value now, assuming they are running at the same
* frequency (by keeping the initial freq_factor value).
* frequency (by keeping the initial capacity_freq_ref value).
*/
cpu_clk = of_clk_get(cpu_node, 0);
if (!PTR_ERR_OR_ZERO(cpu_clk)) {
per_cpu(freq_factor, cpu) =
clk_get_rate(cpu_clk) / 1000;
per_cpu(capacity_freq_ref, cpu) =
clk_get_rate(cpu_clk) / HZ_PER_KHZ;
clk_put(cpu_clk);
}
} else {
......@@ -345,11 +344,16 @@ bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
return !ret;
}
void __weak freq_inv_set_max_ratio(int cpu, u64 max_rate)
{
}
#ifdef CONFIG_ACPI_CPPC_LIB
#include <acpi/cppc_acpi.h>
void topology_init_cpu_capacity_cppc(void)
{
u64 capacity, capacity_scale = 0;
struct cppc_perf_caps perf_caps;
int cpu;
......@@ -366,6 +370,10 @@ void topology_init_cpu_capacity_cppc(void)
(perf_caps.highest_perf >= perf_caps.nominal_perf) &&
(perf_caps.highest_perf >= perf_caps.lowest_perf)) {
raw_capacity[cpu] = perf_caps.highest_perf;
capacity_scale = max_t(u64, capacity_scale, raw_capacity[cpu]);
per_cpu(capacity_freq_ref, cpu) = cppc_perf_to_khz(&perf_caps, raw_capacity[cpu]);
pr_debug("cpu_capacity: CPU%d cpu_capacity=%u (raw).\n",
cpu, raw_capacity[cpu]);
continue;
......@@ -376,7 +384,18 @@ void topology_init_cpu_capacity_cppc(void)
goto exit;
}
topology_normalize_cpu_scale();
for_each_possible_cpu(cpu) {
freq_inv_set_max_ratio(cpu,
per_cpu(capacity_freq_ref, cpu) * HZ_PER_KHZ);
capacity = raw_capacity[cpu];
capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT,
capacity_scale);
topology_set_cpu_scale(cpu, capacity);
pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
cpu, topology_get_cpu_scale(cpu));
}
schedule_work(&update_topology_flags_work);
pr_debug("cpu_capacity: cpu_capacity initialization done\n");
......@@ -410,8 +429,11 @@ init_cpu_capacity_callback(struct notifier_block *nb,
cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);
for_each_cpu(cpu, policy->related_cpus)
per_cpu(freq_factor, cpu) = policy->cpuinfo.max_freq / 1000;
for_each_cpu(cpu, policy->related_cpus) {
per_cpu(capacity_freq_ref, cpu) = policy->cpuinfo.max_freq;
freq_inv_set_max_ratio(cpu,
per_cpu(capacity_freq_ref, cpu) * HZ_PER_KHZ);
}
if (cpumask_empty(cpus_to_visit)) {
topology_normalize_cpu_scale();
......
......@@ -16,7 +16,6 @@
#include <linux/delay.h>
#include <linux/cpu.h>
#include <linux/cpufreq.h>
#include <linux/dmi.h>
#include <linux/irq_work.h>
#include <linux/kthread.h>
#include <linux/time.h>
......@@ -27,12 +26,6 @@
#include <acpi/cppc_acpi.h>
/* Minimum struct length needed for the DMI processor entry we want */
#define DMI_ENTRY_PROCESSOR_MIN_LENGTH 48
/* Offset in the DMI processor structure for the max frequency */
#define DMI_PROCESSOR_MAX_SPEED 0x14
/*
* This list contains information parsed from per CPU ACPI _CPC and _PSD
* structures: e.g. the highest and lowest supported performance, capabilities,
......@@ -291,97 +284,9 @@ static inline void cppc_freq_invariance_exit(void)
}
#endif /* CONFIG_ACPI_CPPC_CPUFREQ_FIE */
/* Callback function used to retrieve the max frequency from DMI */
static void cppc_find_dmi_mhz(const struct dmi_header *dm, void *private)
{
const u8 *dmi_data = (const u8 *)dm;
u16 *mhz = (u16 *)private;
if (dm->type == DMI_ENTRY_PROCESSOR &&
dm->length >= DMI_ENTRY_PROCESSOR_MIN_LENGTH) {
u16 val = (u16)get_unaligned((const u16 *)
(dmi_data + DMI_PROCESSOR_MAX_SPEED));
*mhz = val > *mhz ? val : *mhz;
}
}
/* Look up the max frequency in DMI */
static u64 cppc_get_dmi_max_khz(void)
{
u16 mhz = 0;
dmi_walk(cppc_find_dmi_mhz, &mhz);
/*
* Real stupid fallback value, just in case there is no
* actual value set.
*/
mhz = mhz ? mhz : 1;
return (1000 * mhz);
}
/*
* If CPPC lowest_freq and nominal_freq registers are exposed then we can
* use them to convert perf to freq and vice versa. The conversion is
* extrapolated as an affine function passing by the 2 points:
* - (Low perf, Low freq)
* - (Nominal perf, Nominal perf)
*/
static unsigned int cppc_cpufreq_perf_to_khz(struct cppc_cpudata *cpu_data,
unsigned int perf)
{
struct cppc_perf_caps *caps = &cpu_data->perf_caps;
s64 retval, offset = 0;
static u64 max_khz;
u64 mul, div;
if (caps->lowest_freq && caps->nominal_freq) {
mul = caps->nominal_freq - caps->lowest_freq;
div = caps->nominal_perf - caps->lowest_perf;
offset = caps->nominal_freq - div64_u64(caps->nominal_perf * mul, div);
} else {
if (!max_khz)
max_khz = cppc_get_dmi_max_khz();
mul = max_khz;
div = caps->highest_perf;
}
retval = offset + div64_u64(perf * mul, div);
if (retval >= 0)
return retval;
return 0;
}
static unsigned int cppc_cpufreq_khz_to_perf(struct cppc_cpudata *cpu_data,
unsigned int freq)
{
struct cppc_perf_caps *caps = &cpu_data->perf_caps;
s64 retval, offset = 0;
static u64 max_khz;
u64 mul, div;
if (caps->lowest_freq && caps->nominal_freq) {
mul = caps->nominal_perf - caps->lowest_perf;
div = caps->nominal_freq - caps->lowest_freq;
offset = caps->nominal_perf - div64_u64(caps->nominal_freq * mul, div);
} else {
if (!max_khz)
max_khz = cppc_get_dmi_max_khz();
mul = caps->highest_perf;
div = max_khz;
}
retval = offset + div64_u64(freq * mul, div);
if (retval >= 0)
return retval;
return 0;
}
static int cppc_cpufreq_set_target(struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation)
{
struct cppc_cpudata *cpu_data = policy->driver_data;
unsigned int cpu = policy->cpu;
......@@ -389,7 +294,7 @@ static int cppc_cpufreq_set_target(struct cpufreq_policy *policy,
u32 desired_perf;
int ret = 0;
desired_perf = cppc_cpufreq_khz_to_perf(cpu_data, target_freq);
desired_perf = cppc_khz_to_perf(&cpu_data->perf_caps, target_freq);
/* Return if it is exactly the same perf */
if (desired_perf == cpu_data->perf_ctrls.desired_perf)
return ret;
......@@ -417,7 +322,7 @@ static unsigned int cppc_cpufreq_fast_switch(struct cpufreq_policy *policy,
u32 desired_perf;
int ret;
desired_perf = cppc_cpufreq_khz_to_perf(cpu_data, target_freq);
desired_perf = cppc_khz_to_perf(&cpu_data->perf_caps, target_freq);
cpu_data->perf_ctrls.desired_perf = desired_perf;
ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
......@@ -530,7 +435,7 @@ static int cppc_get_cpu_power(struct device *cpu_dev,
min_step = min_cap / CPPC_EM_CAP_STEP;
max_step = max_cap / CPPC_EM_CAP_STEP;
perf_prev = cppc_cpufreq_khz_to_perf(cpu_data, *KHz);
perf_prev = cppc_khz_to_perf(perf_caps, *KHz);
step = perf_prev / perf_step;
if (step > max_step)
......@@ -550,8 +455,8 @@ static int cppc_get_cpu_power(struct device *cpu_dev,
perf = step * perf_step;
}
*KHz = cppc_cpufreq_perf_to_khz(cpu_data, perf);
perf_check = cppc_cpufreq_khz_to_perf(cpu_data, *KHz);
*KHz = cppc_perf_to_khz(perf_caps, perf);
perf_check = cppc_khz_to_perf(perf_caps, *KHz);
step_check = perf_check / perf_step;
/*
......@@ -561,8 +466,8 @@ static int cppc_get_cpu_power(struct device *cpu_dev,
*/
while ((*KHz == prev_freq) || (step_check != step)) {
perf++;
*KHz = cppc_cpufreq_perf_to_khz(cpu_data, perf);
perf_check = cppc_cpufreq_khz_to_perf(cpu_data, *KHz);
*KHz = cppc_perf_to_khz(perf_caps, perf);
perf_check = cppc_khz_to_perf(perf_caps, *KHz);
step_check = perf_check / perf_step;
}
......@@ -591,7 +496,7 @@ static int cppc_get_cpu_cost(struct device *cpu_dev, unsigned long KHz,
perf_caps = &cpu_data->perf_caps;
max_cap = arch_scale_cpu_capacity(cpu_dev->id);
perf_prev = cppc_cpufreq_khz_to_perf(cpu_data, KHz);
perf_prev = cppc_khz_to_perf(perf_caps, KHz);
perf_step = CPPC_EM_CAP_STEP * perf_caps->highest_perf / max_cap;
step = perf_prev / perf_step;
......@@ -679,10 +584,6 @@ static struct cppc_cpudata *cppc_cpufreq_get_cpu_data(unsigned int cpu)
goto free_mask;
}
/* Convert the lowest and nominal freq from MHz to KHz */
cpu_data->perf_caps.lowest_freq *= 1000;
cpu_data->perf_caps.nominal_freq *= 1000;
list_add(&cpu_data->node, &cpu_data_list);
return cpu_data;
......@@ -724,20 +625,16 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
* Set min to lowest nonlinear perf to avoid any efficiency penalty (see
* Section 8.4.7.1.1.5 of ACPI 6.1 spec)
*/
policy->min = cppc_cpufreq_perf_to_khz(cpu_data,
caps->lowest_nonlinear_perf);
policy->max = cppc_cpufreq_perf_to_khz(cpu_data,
caps->nominal_perf);
policy->min = cppc_perf_to_khz(caps, caps->lowest_nonlinear_perf);
policy->max = cppc_perf_to_khz(caps, caps->nominal_perf);
/*
* Set cpuinfo.min_freq to Lowest to make the full range of performance
* available if userspace wants to use any perf between lowest & lowest
* nonlinear perf
*/
policy->cpuinfo.min_freq = cppc_cpufreq_perf_to_khz(cpu_data,
caps->lowest_perf);
policy->cpuinfo.max_freq = cppc_cpufreq_perf_to_khz(cpu_data,
caps->nominal_perf);
policy->cpuinfo.min_freq = cppc_perf_to_khz(caps, caps->lowest_perf);
policy->cpuinfo.max_freq = cppc_perf_to_khz(caps, caps->nominal_perf);
policy->transition_delay_us = cppc_cpufreq_get_transition_delay_us(cpu);
policy->shared_type = cpu_data->shared_type;
......@@ -773,7 +670,7 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
boost_supported = true;
/* Set policy->cur to max now. The governors will adjust later. */
policy->cur = cppc_cpufreq_perf_to_khz(cpu_data, caps->highest_perf);
policy->cur = cppc_perf_to_khz(caps, caps->highest_perf);
cpu_data->perf_ctrls.desired_perf = caps->highest_perf;
ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
......@@ -863,7 +760,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
delivered_perf = cppc_perf_from_fbctrs(cpu_data, &fb_ctrs_t0,
&fb_ctrs_t1);
return cppc_cpufreq_perf_to_khz(cpu_data, delivered_perf);
return cppc_perf_to_khz(&cpu_data->perf_caps, delivered_perf);
}
static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
......@@ -878,11 +775,9 @@ static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
}
if (state)
policy->max = cppc_cpufreq_perf_to_khz(cpu_data,
caps->highest_perf);
policy->max = cppc_perf_to_khz(caps, caps->highest_perf);
else
policy->max = cppc_cpufreq_perf_to_khz(cpu_data,
caps->nominal_perf);
policy->max = cppc_perf_to_khz(caps, caps->nominal_perf);
policy->cpuinfo.max_freq = policy->max;
ret = freq_qos_update_request(policy->max_freq_req, policy->max);
......@@ -937,7 +832,7 @@ static unsigned int hisi_cppc_cpufreq_get_rate(unsigned int cpu)
if (ret < 0)
return -EIO;
return cppc_cpufreq_perf_to_khz(cpu_data, desired_perf);
return cppc_perf_to_khz(&cpu_data->perf_caps, desired_perf);
}
static void cppc_check_hisi_workaround(void)
......
......@@ -454,7 +454,7 @@ void cpufreq_freq_transition_end(struct cpufreq_policy *policy,
arch_set_freq_scale(policy->related_cpus,
policy->cur,
policy->cpuinfo.max_freq);
arch_scale_freq_ref(policy->cpu));
spin_lock(&policy->transition_lock);
policy->transition_ongoing = false;
......@@ -2174,7 +2174,7 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
policy->cur = freq;
arch_set_freq_scale(policy->related_cpus, freq,
policy->cpuinfo.max_freq);
arch_scale_freq_ref(policy->cpu));
cpufreq_stats_record_transition(policy, freq);
if (trace_cpu_frequency_enabled()) {
......
......@@ -144,6 +144,8 @@ extern int cppc_set_perf(int cpu, struct cppc_perf_ctrls *perf_ctrls);
extern int cppc_set_enable(int cpu, bool enable);
extern int cppc_get_perf_caps(int cpu, struct cppc_perf_caps *caps);
extern bool cppc_perf_ctrs_in_pcc(void);
extern unsigned int cppc_perf_to_khz(struct cppc_perf_caps *caps, unsigned int perf);
extern unsigned int cppc_khz_to_perf(struct cppc_perf_caps *caps, unsigned int freq);
extern bool acpi_cpc_valid(void);
extern bool cppc_allow_fast_switch(void);
extern int acpi_get_psd_map(unsigned int cpu, struct cppc_cpudata *cpu_data);
......
......@@ -27,6 +27,13 @@ static inline unsigned long topology_get_cpu_scale(int cpu)
void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity);
DECLARE_PER_CPU(unsigned long, capacity_freq_ref);
static inline unsigned long topology_get_freq_ref(int cpu)
{
return per_cpu(capacity_freq_ref, cpu);
}
DECLARE_PER_CPU(unsigned long, arch_freq_scale);
static inline unsigned long topology_get_freq_scale(int cpu)
......@@ -92,6 +99,7 @@ void update_siblings_masks(unsigned int cpu);
void remove_cpu_topology(unsigned int cpuid);
void reset_cpu_topology(void);
int parse_acpi_topology(void);
void freq_inv_set_max_ratio(int cpu, u64 max_rate);
#endif
#endif /* _LINUX_ARCH_TOPOLOGY_H_ */
......@@ -1203,6 +1203,7 @@ void arch_set_freq_scale(const struct cpumask *cpus,
{
}
#endif
/* the following are really really optional */
extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs;
extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs;
......
......@@ -224,7 +224,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
unsigned long max_util, unsigned long sum_util,
unsigned long allowed_cpu_cap)
{
unsigned long freq, scale_cpu;
unsigned long freq, ref_freq, scale_cpu;
struct em_perf_state *ps;
int cpu;
......@@ -241,11 +241,10 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
*/
cpu = cpumask_first(to_cpumask(pd->cpus));
scale_cpu = arch_scale_cpu_capacity(cpu);
ps = &pd->table[pd->nr_perf_states - 1];
ref_freq = arch_scale_freq_ref(cpu);
max_util = map_util_perf(max_util);
max_util = min(max_util, allowed_cpu_cap);
freq = map_util_freq(max_util, ps->frequency, scale_cpu);
freq = map_util_freq(max_util, ref_freq, scale_cpu);
/*
* Find the lowest performance state of the Energy Model above the
......
......@@ -600,6 +600,9 @@ struct vma_numab_state {
*/
unsigned long pids_active[2];
/* MM scan sequence ID when scan first started after VMA creation */
int start_scan_seq;
/*
* MM scan sequence ID when the VMA was last completely scanned.
* A VMA is not eligible for scanning if prev_scan_seq == numa_scan_seq
......
......@@ -63,11 +63,13 @@ struct robust_list_head;
struct root_domain;
struct rq;
struct sched_attr;
struct sched_dl_entity;
struct seq_file;
struct sighand_struct;
struct signal_struct;
struct task_delay_info;
struct task_group;
struct task_struct;
struct user_event_mm;
/*
......@@ -413,42 +415,6 @@ struct load_weight {
u32 inv_weight;
};
/**
* struct util_est - Estimation utilization of FAIR tasks
* @enqueued: instantaneous estimated utilization of a task/cpu
* @ewma: the Exponential Weighted Moving Average (EWMA)
* utilization of a task
*
* Support data structure to track an Exponential Weighted Moving Average
* (EWMA) of a FAIR task's utilization. New samples are added to the moving
* average each time a task completes an activation. Sample's weight is chosen
* so that the EWMA will be relatively insensitive to transient changes to the
* task's workload.
*
* The enqueued attribute has a slightly different meaning for tasks and cpus:
* - task: the task's util_avg at last task dequeue time
* - cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
* Thus, the util_est.enqueued of a task represents the contribution on the
* estimated utilization of the CPU where that task is currently enqueued.
*
* Only for tasks we track a moving average of the past instantaneous
* estimated utilization. This allows to absorb sporadic drops in utilization
* of an otherwise almost periodic task.
*
* The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
* updates. When a task is dequeued, its util_est should not be updated if its
* util_avg has not been updated in the meantime.
* This information is mapped into the MSB bit of util_est.enqueued at dequeue
* time. Since max value of util_est.enqueued for a task is 1024 (PELT util_avg
* for a task) it is safe to use MSB.
*/
struct util_est {
unsigned int enqueued;
unsigned int ewma;
#define UTIL_EST_WEIGHT_SHIFT 2
#define UTIL_AVG_UNCHANGED 0x80000000
} __attribute__((__aligned__(sizeof(u64))));
/*
* The load/runnable/util_avg accumulates an infinite geometric series
* (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
......@@ -503,9 +469,20 @@ struct sched_avg {
unsigned long load_avg;
unsigned long runnable_avg;
unsigned long util_avg;
struct util_est util_est;
unsigned int util_est;
} ____cacheline_aligned;
/*
* The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
* updates. When a task is dequeued, its util_est should not be updated if its
* util_avg has not been updated in the meantime.
* This information is mapped into the MSB bit of util_est at dequeue time.
* Since max value of util_est for a task is 1024 (PELT util_avg for a task)
* it is safe to use MSB.
*/
#define UTIL_EST_WEIGHT_SHIFT 2
#define UTIL_AVG_UNCHANGED 0x80000000
struct sched_statistics {
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
......@@ -523,7 +500,7 @@ struct sched_statistics {
u64 block_max;
s64 sum_block_runtime;
u64 exec_max;
s64 exec_max;
u64 slice_max;
u64 nr_migrations_cold;
......@@ -553,7 +530,7 @@ struct sched_entity {
struct load_weight load;
struct rb_node run_node;
u64 deadline;
u64 min_deadline;
u64 min_vruntime;
struct list_head group_node;
unsigned int on_rq;
......@@ -607,6 +584,9 @@ struct sched_rt_entity {
#endif
} __randomize_layout;
typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
struct sched_dl_entity {
struct rb_node rb_node;
......@@ -654,6 +634,7 @@ struct sched_dl_entity {
unsigned int dl_yielded : 1;
unsigned int dl_non_contending : 1;
unsigned int dl_overrun : 1;
unsigned int dl_server : 1;
/*
* Bandwidth enforcement timer. Each -deadline task has its
......@@ -668,7 +649,20 @@ struct sched_dl_entity {
* timer is needed to decrease the active utilization at the correct
* time.
*/
struct hrtimer inactive_timer;
struct hrtimer inactive_timer;
/*
* Bits for DL-server functionality. Also see the comment near
* dl_server_update().
*
* @rq the runqueue this server is for
*
* @server_has_tasks() returns true if @server_pick return a
* runnable task.
*/
struct rq *rq;
dl_server_has_tasks_f server_has_tasks;
dl_server_pick_f server_pick;
#ifdef CONFIG_RT_MUTEXES
/*
......@@ -795,6 +789,7 @@ struct task_struct {
struct sched_entity se;
struct sched_rt_entity rt;
struct sched_dl_entity dl;
struct sched_dl_entity *dl_server;
const struct sched_class *sched_class;
#ifdef CONFIG_SCHED_CORE
......
......@@ -279,6 +279,14 @@ void arch_update_thermal_pressure(const struct cpumask *cpus,
{ }
#endif
#ifndef arch_scale_freq_ref
static __always_inline
unsigned int arch_scale_freq_ref(int cpu)
{
return 0;
}
#endif
static inline int task_node(const struct task_struct *p)
{
return cpu_to_node(task_cpu(p));
......
......@@ -493,33 +493,30 @@ DEFINE_EVENT_SCHEDSTAT(sched_stat_template, sched_stat_blocked,
*/
DECLARE_EVENT_CLASS(sched_stat_runtime,
TP_PROTO(struct task_struct *tsk, u64 runtime, u64 vruntime),
TP_PROTO(struct task_struct *tsk, u64 runtime),
TP_ARGS(tsk, __perf_count(runtime), vruntime),
TP_ARGS(tsk, __perf_count(runtime)),
TP_STRUCT__entry(
__array( char, comm, TASK_COMM_LEN )
__field( pid_t, pid )
__field( u64, runtime )
__field( u64, vruntime )
),
TP_fast_assign(
memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
__entry->pid = tsk->pid;
__entry->runtime = runtime;
__entry->vruntime = vruntime;
),
TP_printk("comm=%s pid=%d runtime=%Lu [ns] vruntime=%Lu [ns]",
TP_printk("comm=%s pid=%d runtime=%Lu [ns]",
__entry->comm, __entry->pid,
(unsigned long long)__entry->runtime,
(unsigned long long)__entry->vruntime)
(unsigned long long)__entry->runtime)
);
DEFINE_EVENT(sched_stat_runtime, sched_stat_runtime,
TP_PROTO(struct task_struct *tsk, u64 runtime, u64 vruntime),
TP_ARGS(tsk, runtime, vruntime));
TP_PROTO(struct task_struct *tsk, u64 runtime),
TP_ARGS(tsk, runtime));
/*
* Tracepoint for showing priority inheritance modifying a tasks
......
......@@ -187,6 +187,7 @@ static int __restore_freezer_state(struct task_struct *p, void *arg)
if (state != TASK_RUNNING) {
WRITE_ONCE(p->__state, state);
p->saved_state = TASK_RUNNING;
return 1;
}
......
......@@ -1131,6 +1131,28 @@ static void wake_up_idle_cpu(int cpu)
if (cpu == smp_processor_id())
return;
/*
* Set TIF_NEED_RESCHED and send an IPI if in the non-polling
* part of the idle loop. This forces an exit from the idle loop
* and a round trip to schedule(). Now this could be optimized
* because a simple new idle loop iteration is enough to
* re-evaluate the next tick. Provided some re-ordering of tick
* nohz functions that would need to follow TIF_NR_POLLING
* clearing:
*
* - On most archs, a simple fetch_or on ti::flags with a
* "0" value would be enough to know if an IPI needs to be sent.
*
* - x86 needs to perform a last need_resched() check between
* monitor and mwait which doesn't take timers into account.
* There a dedicated TIF_TIMER flag would be required to
* fetch_or here and be checked along with TIF_NEED_RESCHED
* before mwait().
*
* However, remote timer enqueue is not such a frequent event
* and testing of the above solutions didn't appear to report
* much benefits.
*/
if (set_nr_and_not_polling(rq->idle))
smp_send_reschedule(cpu);
else
......@@ -2124,12 +2146,14 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags)
enqueue_task(rq, p, flags);
p->on_rq = TASK_ON_RQ_QUEUED;
WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);
ASSERT_EXCLUSIVE_WRITER(p->on_rq);
}
void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
WRITE_ONCE(p->on_rq, (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING);
ASSERT_EXCLUSIVE_WRITER(p->on_rq);
dequeue_task(rq, p, flags);
}
......@@ -3795,6 +3819,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
rq->idle_stamp = 0;
}
#endif
p->dl_server = NULL;
}
/*
......@@ -4509,10 +4535,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
memset(&p->stats, 0, sizeof(p->stats));
#endif
RB_CLEAR_NODE(&p->dl.rb_node);
init_dl_task_timer(&p->dl);
init_dl_inactive_task_timer(&p->dl);
__dl_clear_params(p);
init_dl_entity(&p->dl);
INIT_LIST_HEAD(&p->rt.run_list);
p->rt.timeout = 0;
......@@ -6004,12 +6027,27 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
p = pick_next_task_idle(rq);
}
/*
* This is the fast path; it cannot be a DL server pick;
* therefore even if @p == @prev, ->dl_server must be NULL.
*/
if (p->dl_server)
p->dl_server = NULL;
return p;
}
restart:
put_prev_task_balance(rq, prev, rf);
/*
* We've updated @prev and no longer need the server link, clear it.
* Must be done before ->pick_next_task() because that can (re)set
* ->dl_server.
*/
if (prev->dl_server)
prev->dl_server = NULL;
for_each_class(class) {
p = class->pick_next_task(rq);
if (p)
......@@ -7429,18 +7467,13 @@ int sched_core_idle_cpu(int cpu)
* required to meet deadlines.
*/
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
enum cpu_util_type type,
struct task_struct *p)
unsigned long *min,
unsigned long *max)
{
unsigned long dl_util, util, irq, max;
unsigned long util, irq, scale;
struct rq *rq = cpu_rq(cpu);
max = arch_scale_cpu_capacity(cpu);
if (!uclamp_is_used() &&
type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
return max;
}
scale = arch_scale_cpu_capacity(cpu);
/*
* Early check to see if IRQ/steal time saturates the CPU, can be
......@@ -7448,45 +7481,49 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* update_irq_load_avg().
*/
irq = cpu_util_irq(rq);
if (unlikely(irq >= max))
return max;
if (unlikely(irq >= scale)) {
if (min)
*min = scale;
if (max)
*max = scale;
return scale;
}
if (min) {
/*
* The minimum utilization returns the highest level between:
* - the computed DL bandwidth needed with the IRQ pressure which
* steals time to the deadline task.
* - The minimum performance requirement for CFS and/or RT.
*/
*min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
/*
* When an RT task is runnable and uclamp is not used, we must
* ensure that the task will run at maximum compute capacity.
*/
if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt))
*min = max(*min, scale);
}
/*
* Because the time spend on RT/DL tasks is visible as 'lost' time to
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
*
* CFS and RT utilization can be boosted or capped, depending on
* utilization clamp constraints requested by currently RUNNABLE
* tasks.
* When there are no CFS RUNNABLE tasks, clamps are released and
* frequency will be gracefully reduced with the utilization decay.
*/
util = util_cfs + cpu_util_rt(rq);
if (type == FREQUENCY_UTIL)
util = uclamp_rq_util_with(rq, util, p);
dl_util = cpu_util_dl(rq);
util += cpu_util_dl(rq);
/*
* For frequency selection we do not make cpu_util_dl() a permanent part
* of this sum because we want to use cpu_bw_dl() later on, but we need
* to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
* that we select f_max when there is no idle time.
*
* NOTE: numerical errors or stop class might cause us to not quite hit
* saturation when we should -- something for later.
* The maximum hint is a soft bandwidth requirement, which can be lower
* than the actual utilization because of uclamp_max requirements.
*/
if (util + dl_util >= max)
return max;
if (max)
*max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
/*
* OTOH, for energy computation we need the estimated running time, so
* include util_dl and ignore dl_bw.
*/
if (type == ENERGY_UTIL)
util += dl_util;
if (util >= scale)
return scale;
/*
* There is still idle time; further improve the number by using the
......@@ -7497,28 +7534,15 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* U' = irq + --------- * U
* max
*/
util = scale_irq_capacity(util, irq, max);
util = scale_irq_capacity(util, irq, scale);
util += irq;
/*
* Bandwidth required by DEADLINE must always be granted while, for
* FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
* to gracefully reduce the frequency when no tasks show up for longer
* periods of time.
*
* Ideally we would like to set bw_dl as min/guaranteed freq and util +
* bw_dl as requested freq. However, cpufreq is not yet ready for such
* an interface. So, we only do the latter for now.
*/
if (type == FREQUENCY_UTIL)
util += cpu_bw_dl(rq);
return min(max, util);
return min(scale, util);
}
unsigned long sched_cpu_util(int cpu)
{
return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
}
#endif /* CONFIG_SMP */
......
......@@ -47,7 +47,7 @@ struct sugov_cpu {
u64 last_update;
unsigned long util;
unsigned long bw_dl;
unsigned long bw_min;
/* The field below is for single-CPU policies only: */
#ifdef CONFIG_NO_HZ_COMMON
......@@ -114,6 +114,28 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy)
}
}
/**
* get_capacity_ref_freq - get the reference frequency that has been used to
* correlate frequency and compute capacity for a given cpufreq policy. We use
* the CPU managing it for the arch_scale_freq_ref() call in the function.
* @policy: the cpufreq policy of the CPU in question.
*
* Return: the reference CPU frequency to compute a capacity.
*/
static __always_inline
unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
{
unsigned int freq = arch_scale_freq_ref(policy->cpu);
if (freq)
return freq;
if (arch_scale_freq_invariant())
return policy->cpuinfo.max_freq;
return policy->cur;
}
/**
* get_next_freq - Compute a new frequency for a given cpufreq policy.
* @sg_policy: schedutil policy object to compute the new frequency for.
......@@ -140,10 +162,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
unsigned long util, unsigned long max)
{
struct cpufreq_policy *policy = sg_policy->policy;
unsigned int freq = arch_scale_freq_invariant() ?
policy->cpuinfo.max_freq : policy->cur;
unsigned int freq;
util = map_util_perf(util);
freq = get_capacity_ref_freq(policy);
freq = map_util_freq(util, freq, max);
if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
......@@ -153,14 +174,31 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
return cpufreq_driver_resolve_freq(policy, freq);
}
static void sugov_get_util(struct sugov_cpu *sg_cpu)
unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long min,
unsigned long max)
{
/* Add dvfs headroom to actual utilization */
actual = map_util_perf(actual);
/* Actually we don't need to target the max performance */
if (actual < max)
max = actual;
/*
* Ensure at least minimum performance while providing more compute
* capacity when possible.
*/
return max(min, max);
}
static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
{
unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu);
struct rq *rq = cpu_rq(sg_cpu->cpu);
unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
sg_cpu->bw_dl = cpu_bw_dl(rq);
sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
FREQUENCY_UTIL, NULL);
util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
util = max(util, boost);
sg_cpu->bw_min = min;
sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
}
/**
......@@ -251,18 +289,16 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
* This mechanism is designed to boost high frequently IO waiting tasks, while
* being more conservative on tasks which does sporadic IO operations.
*/
static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
unsigned long max_cap)
{
unsigned long boost;
/* No boost currently required */
if (!sg_cpu->iowait_boost)
return;
return 0;
/* Reset boost if the CPU appears to have been idle enough */
if (sugov_iowait_reset(sg_cpu, time, false))
return;
return 0;
if (!sg_cpu->iowait_boost_pending) {
/*
......@@ -271,7 +307,7 @@ static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
sg_cpu->iowait_boost >>= 1;
if (sg_cpu->iowait_boost < IOWAIT_BOOST_MIN) {
sg_cpu->iowait_boost = 0;
return;
return 0;
}
}
......@@ -281,10 +317,7 @@ static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
* sg_cpu->util is already in capacity scale; convert iowait_boost
* into the same scale so we can compare.
*/
boost = (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL);
if (sg_cpu->util < boost)
sg_cpu->util = boost;
return (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
}
#ifdef CONFIG_NO_HZ_COMMON
......@@ -306,7 +339,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
*/
static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
{
if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min)
sg_cpu->sg_policy->limits_changed = true;
}
......@@ -314,6 +347,8 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
u64 time, unsigned long max_cap,
unsigned int flags)
{
unsigned long boost;
sugov_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
......@@ -322,8 +357,8 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
return false;
sugov_get_util(sg_cpu);
sugov_iowait_apply(sg_cpu, time, max_cap);
boost = sugov_iowait_apply(sg_cpu, time, max_cap);
sugov_get_util(sg_cpu, boost);
return true;
}
......@@ -407,8 +442,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
sg_cpu->util = prev_util;
cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
map_util_perf(sg_cpu->util), max_cap);
cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
sg_cpu->util, max_cap);
sg_cpu->sg_policy->last_freq_update_time = time;
}
......@@ -424,9 +459,10 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
for_each_cpu(j, policy->cpus) {
struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
unsigned long boost;
sugov_get_util(j_sg_cpu);
sugov_iowait_apply(j_sg_cpu, time, max_cap);
boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
sugov_get_util(j_sg_cpu, boost);
util = max(j_sg_cpu->util, util);
}
......
This diff is collapsed.
......@@ -628,8 +628,8 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{
s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, spread;
struct sched_entity *last, *first;
s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, left_deadline = -1, spread;
struct sched_entity *last, *first, *root;
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
......@@ -644,15 +644,20 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SPLIT_NS(cfs_rq->exec_clock));
raw_spin_rq_lock_irqsave(rq, flags);
root = __pick_root_entity(cfs_rq);
if (root)
left_vruntime = root->min_vruntime;
first = __pick_first_entity(cfs_rq);
if (first)
left_vruntime = first->vruntime;
left_deadline = first->deadline;
last = __pick_last_entity(cfs_rq);
if (last)
right_vruntime = last->vruntime;
min_vruntime = cfs_rq->min_vruntime;
raw_spin_rq_unlock_irqrestore(rq, flags);
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_deadline",
SPLIT_NS(left_deadline));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_vruntime",
SPLIT_NS(left_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "min_vruntime",
......@@ -679,8 +684,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.runnable_avg);
SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
cfs_rq->avg.util_avg);
SEQ_printf(m, " .%-30s: %u\n", "util_est_enqueued",
cfs_rq->avg.util_est.enqueued);
SEQ_printf(m, " .%-30s: %u\n", "util_est",
cfs_rq->avg.util_est);
SEQ_printf(m, " .%-30s: %ld\n", "removed.load_avg",
cfs_rq->removed.load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg",
......@@ -1070,8 +1075,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P(se.avg.runnable_avg);
P(se.avg.util_avg);
P(se.avg.last_update_time);
P(se.avg.util_est.ewma);
PM(se.avg.util_est.enqueued, ~UTIL_AVG_UNCHANGED);
PM(se.avg.util_est, ~UTIL_AVG_UNCHANGED);
#endif
#ifdef CONFIG_UCLAMP_TASK
__PS("uclamp.min", p->uclamp_req[UCLAMP_MIN].value);
......
This diff is collapsed.
......@@ -83,7 +83,6 @@ SCHED_FEAT(WA_BIAS, true)
* UtilEstimation. Use estimated CPU utilization.
*/
SCHED_FEAT(UTIL_EST, true)
SCHED_FEAT(UTIL_EST_FASTUP, true)
SCHED_FEAT(LATENCY_WARN, false)
......
......@@ -258,6 +258,36 @@ static void do_idle(void)
while (!need_resched()) {
rmb();
/*
* Interrupts shouldn't be re-enabled from that point on until
* the CPU sleeping instruction is reached. Otherwise an interrupt
* may fire and queue a timer that would be ignored until the CPU
* wakes from the sleeping instruction. And testing need_resched()
* doesn't tell about pending needed timer reprogram.
*
* Several cases to consider:
*
* - SLEEP-UNTIL-PENDING-INTERRUPT based instructions such as
* "wfi" or "mwait" are fine because they can be entered with
* interrupt disabled.
*
* - sti;mwait() couple is fine because the interrupts are
* re-enabled only upon the execution of mwait, leaving no gap
* in-between.
*
* - ROLLBACK based idle handlers with the sleeping instruction
* called with interrupts enabled are NOT fine. In this scheme
* when the interrupt detects it has interrupted an idle handler,
* it rolls back to its beginning which performs the
* need_resched() check before re-executing the sleeping
* instruction. This can leak a pending needed timer reprogram.
* If such a scheme is really mandatory due to the lack of an
* appropriate CPU sleeping instruction, then a FAST-FORWARD
* must instead be applied: when the interrupt detects it has
* interrupted an idle handler, it must resume to the end of
* this idle handler so that the generic idle loop is iterated
* again to reprogram the tick.
*/
local_irq_disable();
if (cpu_is_offline(cpu)) {
......
......@@ -52,13 +52,13 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
return;
/* Avoid store if the flag has been already reset */
enqueued = avg->util_est.enqueued;
enqueued = avg->util_est;
if (!(enqueued & UTIL_AVG_UNCHANGED))
return;
/* Reset flag to report util_avg has been updated */
enqueued &= ~UTIL_AVG_UNCHANGED;
WRITE_ONCE(avg->util_est.enqueued, enqueued);
WRITE_ONCE(avg->util_est, enqueued);
}
static inline u64 rq_clock_pelt(struct rq *rq)
......
......@@ -1002,24 +1002,15 @@ static void update_curr_rt(struct rq *rq)
{
struct task_struct *curr = rq->curr;
struct sched_rt_entity *rt_se = &curr->rt;
u64 delta_exec;
u64 now;
s64 delta_exec;
if (curr->sched_class != &rt_sched_class)
return;
now = rq_clock_task(rq);
delta_exec = now - curr->se.exec_start;
if (unlikely((s64)delta_exec <= 0))
delta_exec = update_curr_common(rq);
if (unlikely(delta_exec <= 0))
return;
schedstat_set(curr->stats.exec_max,
max(curr->stats.exec_max, delta_exec));
trace_sched_stat_runtime(curr, delta_exec, 0);
update_current_exec_runtime(curr, now, delta_exec);
if (!rt_bandwidth_enabled())
return;
......
......@@ -273,8 +273,6 @@ struct rt_bandwidth {
unsigned int rt_period_active;
};
void __dl_clear_params(struct task_struct *p);
static inline int dl_bandwidth_enabled(void)
{
return sysctl_sched_rt_runtime >= 0;
......@@ -315,6 +313,33 @@ extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *att
extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
extern int dl_bw_check_overflow(int cpu);
/*
* SCHED_DEADLINE supports servers (nested scheduling) with the following
* interface:
*
* dl_se::rq -- runqueue we belong to.
*
* dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
* server when it runs out of tasks to run.
*
* dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
* returns NULL.
*
* dl_server_update() -- called from update_curr_common(), propagates runtime
* to the server.
*
* dl_server_start()
* dl_server_stop() -- start/stop the server when it has (no) tasks.
*
* dl_server_init() -- initializes the server.
*/
extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
extern void dl_server_start(struct sched_dl_entity *dl_se);
extern void dl_server_stop(struct sched_dl_entity *dl_se);
extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_server_has_tasks_f has_tasks,
dl_server_pick_f pick);
#ifdef CONFIG_CGROUP_SCHED
struct cfs_rq;
......@@ -2179,6 +2204,10 @@ extern const u32 sched_prio_to_wmult[40];
* MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location
* in the runqueue.
*
* NOCLOCK - skip the update_rq_clock() (avoids double updates)
*
* MIGRATION - p->on_rq == TASK_ON_RQ_MIGRATING (used for DEADLINE)
*
* ENQUEUE_HEAD - place at front of runqueue (tail if not specified)
* ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
* ENQUEUE_MIGRATED - the task was migrated during wakeup
......@@ -2189,6 +2218,7 @@ extern const u32 sched_prio_to_wmult[40];
#define DEQUEUE_SAVE 0x02 /* Matches ENQUEUE_RESTORE */
#define DEQUEUE_MOVE 0x04 /* Matches ENQUEUE_MOVE */
#define DEQUEUE_NOCLOCK 0x08 /* Matches ENQUEUE_NOCLOCK */
#define DEQUEUE_MIGRATING 0x100 /* Matches ENQUEUE_MIGRATING */
#define ENQUEUE_WAKEUP 0x01
#define ENQUEUE_RESTORE 0x02
......@@ -2203,6 +2233,7 @@ extern const u32 sched_prio_to_wmult[40];
#define ENQUEUE_MIGRATED 0x00
#endif
#define ENQUEUE_INITIAL 0x80
#define ENQUEUE_MIGRATING 0x100
#define RETRY_TASK ((void *)-1UL)
......@@ -2212,6 +2243,8 @@ struct affinity_context {
unsigned int flags;
};
extern s64 update_curr_common(struct rq *rq);
struct sched_class {
#ifdef CONFIG_UCLAMP_TASK
......@@ -2425,8 +2458,7 @@ extern struct rt_bandwidth def_rt_bandwidth;
extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
extern void init_dl_entity(struct sched_dl_entity *dl_se);
#define BW_SHIFT 20
#define BW_UNIT (1 << BW_SHIFT)
......@@ -2822,6 +2854,7 @@ DEFINE_LOCK_GUARD_2(double_rq_lock, struct rq,
double_rq_lock(_T->lock, _T->lock2),
double_rq_unlock(_T->lock, _T->lock2))
extern struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq);
extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);
extern struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq);
......@@ -2961,24 +2994,14 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif
#ifdef CONFIG_SMP
/**
* enum cpu_util_type - CPU utilization type
* @FREQUENCY_UTIL: Utilization used to select frequency
* @ENERGY_UTIL: Utilization used during energy calculation
*
* The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
* need to be aggregated differently depending on the usage made of them. This
* enum is used within effective_cpu_util() to differentiate the types of
* utilization expected by the callers, and adjust the aggregation accordingly.
*/
enum cpu_util_type {
FREQUENCY_UTIL,
ENERGY_UTIL,
};
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
enum cpu_util_type type,
struct task_struct *p);
unsigned long *min,
unsigned long *max);
unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long min,
unsigned long max);
/*
* Verify the fitness of task @p to run on @cpu taking into account the
......@@ -3035,59 +3058,6 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
}
/**
* uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
* @rq: The rq to clamp against. Must not be NULL.
* @util: The util value to clamp.
* @p: The task to clamp against. Can be NULL if you want to clamp
* against @rq only.
*
* Clamps the passed @util to the max(@rq, @p) effective uclamp values.
*
* If sched_uclamp_used static key is disabled, then just return the util
* without any clamping since uclamp aggregation at the rq level in the fast
* path is disabled, rendering this operation a NOP.
*
* Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
* will return the correct effective uclamp value of the task even if the
* static key is disabled.
*/
static __always_inline
unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
struct task_struct *p)
{
unsigned long min_util = 0;
unsigned long max_util = 0;
if (!static_branch_likely(&sched_uclamp_used))
return util;
if (p) {
min_util = uclamp_eff_value(p, UCLAMP_MIN);
max_util = uclamp_eff_value(p, UCLAMP_MAX);
/*
* Ignore last runnable task's max clamp, as this task will
* reset it. Similarly, no need to read the rq's min clamp.
*/
if (uclamp_rq_is_idle(rq))
goto out;
}
min_util = max_t(unsigned long, min_util, uclamp_rq_get(rq, UCLAMP_MIN));
max_util = max_t(unsigned long, max_util, uclamp_rq_get(rq, UCLAMP_MAX));
out:
/*
* Since CPU's {min,max}_util clamps are MAX aggregated considering
* RUNNABLE tasks with _different_ clamps, we can end up with an
* inversion. Fix it now when the clamps are applied.
*/
if (unlikely(min_util >= max_util))
return min_util;
return clamp(util, min_util, max_util);
}
/* Is the rq being capped/throttled by uclamp_max? */
static inline bool uclamp_rq_is_capped(struct rq *rq)
{
......@@ -3125,13 +3095,6 @@ static inline unsigned long uclamp_eff_value(struct task_struct *p,
return SCHED_CAPACITY_SCALE;
}
static inline
unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
struct task_struct *p)
{
return util;
}
static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }
static inline bool uclamp_is_used(void)
......@@ -3261,16 +3224,6 @@ extern int sched_dynamic_mode(const char *str);
extern void sched_dynamic_update(int mode);
#endif
static inline void update_current_exec_runtime(struct task_struct *curr,
u64 now, u64 delta_exec)
{
curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
curr->se.exec_start = now;
cgroup_account_cputime(curr, delta_exec);
}
#ifdef CONFIG_SCHED_MM_CID
#define SCHED_MM_CID_PERIOD_NS (100ULL * 1000000) /* 100ms */
......
......@@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq)
static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
{
struct task_struct *curr = rq->curr;
u64 now, delta_exec;
now = rq_clock_task(rq);
delta_exec = now - curr->se.exec_start;
if (unlikely((s64)delta_exec < 0))
delta_exec = 0;
schedstat_set(curr->stats.exec_max,
max(curr->stats.exec_max, delta_exec));
update_current_exec_runtime(curr, now, delta_exec);
update_curr_common(rq);
}
/*
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment