Commit e4cbce4d authored by Linus Torvalds

Merge tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Improve uclamp performance by using a static key for the fast path

 - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
   better power efficiency of RT tasks on battery powered devices.
   (The default is to maximize performance & reduce RT latencies.)

 - Improve utime and stime tracking accuracy. The old code had a fixed
   boundary of error, which created larger and larger relative errors
   as the values became larger. This is now replaced with more precise
   arithmetic, using the new mul_u64_u64_div_u64() helper in math64.h.

 - Improve the deadline scheduler, such as making it capacity aware

 - Improve frequency-invariant scheduling

 - Misc cleanups in energy/power aware scheduling

 - Add sched_update_nr_running tracepoint to track changes to nr_running

 - Documentation additions and updates

 - Misc cleanups and smaller fixes

* tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst
  sched/doc: Document capacity aware scheduling
  sched: Document arch_scale_*_capacity()
  arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE
  Documentation/sysctl: Document uclamp sysctl knobs
  sched/uclamp: Add a new sysctl to control RT default boost value
  sched/uclamp: Fix a deadlock when enabling uclamp static key
  sched: Remove duplicated tick_nohz_full_enabled() check
  sched: Fix a typo in a comment
  sched/uclamp: Remove unnecessary mutex_init()
  arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE
  sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry
  arch_topology, sched/core: Cleanup thermal pressure definition
  trace/events/sched.h: fix duplicated word
  linux/sched/mm.h: drop duplicated words in comments
  smp: Fix a potential usage of stale nr_cpus
  sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal
  sched: nohz: stop passing around unused "ticks" parameter.
  sched: Better document ttwu()
  sched: Add a tracepoint to track rq->nr_running
  ...
parents b34133fe 949bcb81
@@ -1062,6 +1062,60 @@ Enables/disables scheduler statistics. Enabling this feature
 incurs a small amount of overhead in the scheduler but is
 useful for debugging and performance tuning.
+sched_util_clamp_min:
+=====================
+Max allowed *minimum* utilization.
+Default value is 1024, which is the maximum possible value.
+It means that any requested uclamp.min value cannot be greater than
+sched_util_clamp_min, i.e., it is restricted to the range
+[0:sched_util_clamp_min].
+sched_util_clamp_max:
+=====================
+Max allowed *maximum* utilization.
+Default value is 1024, which is the maximum possible value.
+It means that any requested uclamp.max value cannot be greater than
+sched_util_clamp_max, i.e., it is restricted to the range
+[0:sched_util_clamp_max].
+sched_util_clamp_min_rt_default:
+================================
+By default Linux is tuned for performance. Which means that RT tasks always run
+at the highest frequency and most capable (highest capacity) CPU (in
+heterogeneous systems).
+Uclamp achieves this by setting the requested uclamp.min of all RT tasks to
+1024 by default, which effectively boosts the tasks to run at the highest
+frequency and biases them to run on the biggest CPU.
+This knob allows admins to change the default behavior when uclamp is being
+used. In battery powered devices particularly, running at the maximum
+capacity and frequency will increase energy consumption and shorten the battery
+life.
+This knob is only effective for RT tasks which the user hasn't modified their
+requested uclamp.min value via sched_setattr() syscall.
+This knob will not escape the range constraint imposed by sched_util_clamp_min
+defined above.
+For example if
+	sched_util_clamp_min_rt_default = 800
+	sched_util_clamp_min = 600
+Then the boost will be clamped to 600 because 800 is outside of the permissible
+range of [0:600]. This could happen for instance if a powersave mode will
+restrict all boosts temporarily by modifying sched_util_clamp_min. As soon as
+this restriction is lifted, the requested sched_util_clamp_min_rt_default
+will take effect.
 seccomp
 =======
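The interaction between sched_util_clamp_min_rt_default and sched_util_clamp_min described in this hunk can be modelled in a few lines of userspace C. This is only a sketch of the documented behaviour, not the kernel implementation: the helper name effective_rt_boost() and the UCLAMP_SCALE constant are assumptions made for the example.

```c
#include <assert.h>

/* Assumed for this sketch: utilization values live in the range [0:1024]. */
#define UCLAMP_SCALE 1024u

/*
 * Hypothetical helper (not a kernel function): the boost applied to an RT
 * task that has not set its own uclamp.min, given the two sysctl values.
 */
static unsigned int effective_rt_boost(unsigned int rt_default,
                                       unsigned int clamp_min)
{
    assert(rt_default <= UCLAMP_SCALE);
    assert(clamp_min <= UCLAMP_SCALE);

    /* The RT default cannot escape the range [0:sched_util_clamp_min]. */
    return rt_default < clamp_min ? rt_default : clamp_min;
}
```

With the values from the documentation, effective_rt_boost(800, 600) yields 600; once sched_util_clamp_min is raised back to 1024, the same call yields the requested 800.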
......
@@ -12,6 +12,7 @@ Linux Scheduler
     sched-deadline
     sched-design-CFS
     sched-domains
+    sched-capacity
     sched-energy
     sched-nice-design
     sched-rt-group
......
This diff is collapsed.
@@ -331,16 +331,8 @@ asymmetric CPU topologies for now. This requirement is checked at run-time by
 looking for the presence of the SD_ASYM_CPUCAPACITY flag when the scheduling
 domains are built.
-The flag is set/cleared automatically by the scheduler topology code whenever
-there are CPUs with different capacities in a root domain. The capacities of
-CPUs are provided by arch-specific code through the arch_scale_cpu_capacity()
-callback. As an example, arm and arm64 share an implementation of this callback
-which uses a combination of CPUFreq data and device-tree bindings to compute the
-capacity of CPUs (see drivers/base/arch_topology.c for more details).
-So, in order to use EAS on your platform your architecture must implement the
-arch_scale_cpu_capacity() callback, and some of the CPUs must have a lower
-capacity than others.
+See Documentation/sched/sched-capacity.rst for requirements to be met for this
+flag to be set in the sched_domain hierarchy.
 Please note that EAS is not fundamentally incompatible with SMP, but no
 significant savings on SMP platforms have been observed yet. This restriction
......
@@ -16,8 +16,9 @@
 /* Enable topology flag updates */
 #define arch_update_cpu_topology topology_update_cpu_topology
-/* Replace task scheduler's default thermal pressure retrieve API */
+/* Replace task scheduler's default thermal pressure API */
 #define arch_scale_thermal_pressure topology_get_thermal_pressure
+#define arch_set_thermal_pressure topology_set_thermal_pressure
 #else
......
@@ -34,8 +34,9 @@ void topology_scale_freq_tick(void);
 /* Enable topology flag updates */
 #define arch_update_cpu_topology topology_update_cpu_topology
-/* Replace task scheduler's default thermal pressure retrieve API */
+/* Replace task scheduler's default thermal pressure API */
 #define arch_scale_thermal_pressure topology_get_thermal_pressure
+#define arch_set_thermal_pressure topology_set_thermal_pressure
 #include <asm-generic/topology.h>
......
@@ -74,16 +74,26 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
 #else
 # include <asm-generic/div64.h>
-static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
+/*
+ * Will generate an #DE when the result doesn't fit u64, could fix with an
+ * __ex_table[] entry when it becomes an issue.
+ */
+static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
 {
 	u64 q;
 	asm ("mulq %2; divq %3" : "=a" (q)
-				: "a" (a), "rm" ((u64)mul), "rm" ((u64)div)
+				: "a" (a), "rm" (mul), "rm" (div)
 				: "rdx");
 	return q;
 }
+#define mul_u64_u64_div_u64 mul_u64_u64_div_u64
+static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
+{
+	return mul_u64_u64_div_u64(a, mul, div);
+}
 #define mul_u64_u32_div mul_u64_u32_div
 #endif /* CONFIG_X86_32 */
......
@@ -193,7 +193,7 @@ static inline void sched_clear_itmt_support(void)
 }
 #endif /* CONFIG_SCHED_MC_PRIO */
-#ifdef CONFIG_SMP
+#if defined(CONFIG_SMP) && defined(CONFIG_X86_64)
 #include <asm/cpufeature.h>
 DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
......
@@ -56,6 +56,7 @@
 #include <linux/cpuidle.h>
 #include <linux/numa.h>
 #include <linux/pgtable.h>
+#include <linux/overflow.h>
 #include <asm/acpi.h>
 #include <asm/desc.h>
@@ -1777,6 +1778,7 @@ void native_play_dead(void)
 #endif
+#ifdef CONFIG_X86_64
 /*
  * APERF/MPERF frequency ratio computation.
  *
@@ -1975,6 +1977,7 @@ static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
 static bool intel_set_max_freq_ratio(void)
 {
 	u64 base_freq, turbo_freq;
+	u64 turbo_ratio;
 	if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
 		goto out;
@@ -2000,15 +2003,23 @@ static bool intel_set_max_freq_ratio(void)
 	/*
 	 * Some hypervisors advertise X86_FEATURE_APERFMPERF
 	 * but then fill all MSR's with zeroes.
+	 * Some CPUs have turbo boost but don't declare any turbo ratio
+	 * in MSR_TURBO_RATIO_LIMIT.
 	 */
-	if (!base_freq) {
-		pr_debug("Couldn't determine cpu base frequency, necessary for scale-invariant accounting.\n");
+	if (!base_freq || !turbo_freq) {
+		pr_debug("Couldn't determine cpu base or turbo frequency, necessary for scale-invariant accounting.\n");
 		return false;
 	}
-	arch_turbo_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
-					base_freq);
+	turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq);
+	if (!turbo_ratio) {
+		pr_debug("Non-zero turbo and base frequencies led to a 0 ratio.\n");
+		return false;
+	}
+	arch_turbo_freq_ratio = turbo_ratio;
 	arch_set_max_freq_ratio(turbo_disabled());
 	return true;
 }
@@ -2048,11 +2059,19 @@ static void init_freq_invariance(bool secondary)
 	}
 }
+static void disable_freq_invariance_workfn(struct work_struct *work)
+{
+	static_branch_disable(&arch_scale_freq_key);
+}
+static DECLARE_WORK(disable_freq_invariance_work,
+		    disable_freq_invariance_workfn);
 DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
 void arch_scale_freq_tick(void)
 {
-	u64 freq_scale;
+	u64 freq_scale = SCHED_CAPACITY_SCALE;
 	u64 aperf, mperf;
 	u64 acnt, mcnt;
@@ -2064,19 +2083,32 @@ void arch_scale_freq_tick(void)
 	acnt = aperf - this_cpu_read(arch_prev_aperf);
 	mcnt = mperf - this_cpu_read(arch_prev_mperf);
-	if (!mcnt)
-		return;
 	this_cpu_write(arch_prev_aperf, aperf);
 	this_cpu_write(arch_prev_mperf, mperf);
-	acnt <<= 2*SCHED_CAPACITY_SHIFT;
-	mcnt *= arch_max_freq_ratio;
+	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
+		goto error;
+	if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
+		goto error;
 	freq_scale = div64_u64(acnt, mcnt);
+	if (!freq_scale)
+		goto error;
 	if (freq_scale > SCHED_CAPACITY_SCALE)
 		freq_scale = SCHED_CAPACITY_SCALE;
 	this_cpu_write(arch_freq_scale, freq_scale);
+	return;
+error:
+	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
+	schedule_work(&disable_freq_invariance_work);
+}
+#else
+static inline void init_freq_invariance(bool secondary)
+{
 }
+#endif /* CONFIG_X86_64 */
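The arithmetic that the new arch_scale_freq_tick() guards can be tried out in userspace. The sketch below is an approximation under stated assumptions, not the kernel code: freq_scale_sketch() and the CAPACITY_* constants are names invented for the example, the shift check is written by hand instead of using check_shl_overflow(), and __builtin_mul_overflow() (a GCC/Clang builtin) stands in for check_mul_overflow().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAPACITY_SHIFT 10
#define CAPACITY_SCALE (1ULL << CAPACITY_SHIFT)

/*
 * Hypothetical helper modelling only the arithmetic:
 *   scale = (acnt << 20) / (mcnt * max_freq_ratio), capped at 1024.
 * Returns false whenever the result cannot be trusted, which is where the
 * kernel code jumps to its error label.
 */
static bool freq_scale_sketch(uint64_t acnt, uint64_t mcnt,
                              uint64_t max_freq_ratio, uint64_t *scale)
{
    uint64_t num, den;

    assert(scale != NULL);

    /* Shifting left by 20 must not drop any high bits of acnt. */
    if (acnt >> (64 - 2 * CAPACITY_SHIFT))
        return false;
    num = acnt << (2 * CAPACITY_SHIFT);

    /* The denominator must neither overflow nor be zero. */
    if (__builtin_mul_overflow(mcnt, max_freq_ratio, &den) || !den)
        return false;

    *scale = num / den;
    if (!*scale)
        return false;
    if (*scale > CAPACITY_SCALE)
        *scale = CAPACITY_SCALE;
    return true;
}
```

For instance, acnt = 1000, mcnt = 2000 and a ratio of 1024 give a scale of 512, i.e. the CPU ran at half of its maximum frequency over the sampled interval.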
@@ -54,6 +54,17 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
 	per_cpu(cpu_scale, cpu) = capacity;
 }
+DEFINE_PER_CPU(unsigned long, thermal_pressure);
+void topology_set_thermal_pressure(const struct cpumask *cpus,
+				   unsigned long th_pressure)
+{
+	int cpu;
+	for_each_cpu(cpu, cpus)
+		WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
+}
 static ssize_t cpu_capacity_show(struct device *dev,
 				 struct device_attribute *attr,
 				 char *buf)
......
@@ -12,6 +12,7 @@
 #include <linux/string.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/sched/isolation.h>
 #include <linux/cpu.h>
 #include <linux/pm_runtime.h>
 #include <linux/suspend.h>
@@ -333,6 +334,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 			  const struct pci_device_id *id)
 {
 	int error, node, cpu;
+	int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
 	struct drv_dev_and_id ddi = { drv, dev, id };
 	/*
@@ -353,7 +355,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	    pci_physfn_is_probed(dev))
 		cpu = nr_cpu_ids;
 	else
-		cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+		cpu = cpumask_any_and(cpumask_of_node(node),
+				      housekeeping_cpumask(hk_flags));
 	if (cpu < nr_cpu_ids)
 		error = work_on_cpu(cpu, local_pci_probe, &ddi);
......
@@ -109,12 +109,31 @@
 #endif
 /*
- * Align to a 32 byte boundary equal to the
- * alignment gcc 4.5 uses for a struct
+ * GCC 4.5 and later have a 32 bytes section alignment for structures.
+ * Except GCC 4.9, that feels the need to align on 64 bytes.
  */
+#if __GNUC__ == 4 && __GNUC_MINOR__ == 9
+#define STRUCT_ALIGNMENT 64
+#else
 #define STRUCT_ALIGNMENT 32
+#endif
 #define STRUCT_ALIGN() . = ALIGN(STRUCT_ALIGNMENT)
+/*
+ * The order of the sched class addresses are important, as they are
+ * used to determine the order of the priority of each sched class in
+ * relation to each other.
+ */
+#define SCHED_DATA \
+	STRUCT_ALIGN(); \
+	__begin_sched_classes = .; \
+	*(__idle_sched_class) \
+	*(__fair_sched_class) \
+	*(__rt_sched_class) \
+	*(__dl_sched_class) \
+	*(__stop_sched_class) \
+	__end_sched_classes = .;
 /* The actual configuration determine if the init/exit sections
  * are handled as text/data or they can be discarded (which
  * often happens at runtime)
@@ -389,6 +408,7 @@
 	.rodata : AT(ADDR(.rodata) - LOAD_OFFSET) { \
 		__start_rodata = .; \
 		*(.rodata) *(.rodata.*) \
+		SCHED_DATA \
 		RO_AFTER_INIT_DATA /* Read only after init */ \
 		. = ALIGN(8); \
 		__start___tracepoints_ptrs = .; \
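The SCHED_DATA comment says that the relative addresses of the sched classes encode their priority. A minimal userspace illustration of that idea follows. It is a sketch under assumptions: a plain array stands in for the contiguous linker section, the struct and helper names are invented for the example, and it assumes (as the idle-to-stop listing order suggests) that a higher address means a higher priority.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct sched_class; only the placement matters here. */
struct sched_class_sketch {
    const char *name;
};

/* The array plays the role of the linker section, lowest priority first. */
static const struct sched_class_sketch classes[] = {
    { "idle" }, { "fair" }, { "rt" }, { "dl" }, { "stop" },
};

enum { IDLE, FAIR, RT, DL, STOP, NR_CLASSES };

/* Priority rank derived purely from the object's address. */
static int class_rank(const struct sched_class_sketch *c)
{
    assert(c >= classes && c < classes + NR_CLASSES);
    return (int)(c - classes);
}

/* True when a outranks b; a single pointer comparison is enough. */
static bool class_above(const struct sched_class_sketch *a,
                        const struct sched_class_sketch *b)
{
    return a > b;
}
```

Because the ordering is fixed at link time, comparing two classes needs no priority field and no lookup table.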
......
@@ -39,7 +39,7 @@ static inline unsigned long topology_get_thermal_pressure(int cpu)
 	return per_cpu(thermal_pressure, cpu);
 }
-void arch_set_thermal_pressure(struct cpumask *cpus,
-			       unsigned long th_pressure);
+void topology_set_thermal_pressure(const struct cpumask *cpus,
+				   unsigned long th_pressure);
 struct cpu_topology {
......
@@ -263,6 +263,8 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
 }
 #endif /* mul_u64_u32_div */
+u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
 #define DIV64_U64_ROUND_UP(ll, d) \
 	({ u64 _tmp = (d); div64_u64((ll) + _tmp - 1, _tmp); })
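What mul_u64_u64_div_u64() computes, a * mul / div with a full-width intermediate product, can be sketched in userspace. The version below is not the kernel's generic implementation, which is in a part of the diff not shown here. It assumes a compiler that provides unsigned __int128 (GCC or Clang on a 64-bit target), and the _sketch suffix marks the function name as invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace sketch: multiply two 64-bit values into a 128-bit product, then
 * divide, so no precision is lost in the intermediate result.
 */
static uint64_t mul_u64_u64_div_u64_sketch(uint64_t a, uint64_t mul,
                                           uint64_t div)
{
    unsigned __int128 prod = (unsigned __int128)a * mul;

    assert(div != 0); /* dividing by zero is a caller bug */

    /* Unlike the x86 asm version, this truncates if the quotient
     * does not fit in 64 bits instead of raising #DE. */
    return (uint64_t)(prod / div);
}
```

A plain 64-bit expression such as (1ULL << 63) * 4 / 8 wraps to 0 in the multiplication, while the sketch returns the exact result 1ULL << 62.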
......
@@ -153,9 +153,10 @@ struct psi_group {
 	unsigned long avg[NR_PSI_STATES - 1][3];
 	/* Monitor work control */
-	atomic_t poll_scheduled;
-	struct kthread_worker __rcu *poll_kworker;
-	struct kthread_delayed_work poll_work;
+	struct task_struct __rcu *poll_task;
+	struct timer_list poll_timer;
+	wait_queue_head_t poll_wait;
+	atomic_t poll_wakeup;
 	/* Protects data used by the monitor */
 	struct mutex trigger_lock;
......
@@ -155,7 +155,7 @@ struct task_group;
  *
  *   for (;;) {
  *	set_current_state(TASK_UNINTERRUPTIBLE);
- *	if (!need_sleep)
+ *	if (CONDITION)
  *		break;
  *
  *	schedule();
@@ -163,16 +163,16 @@ struct task_group;
  *   __set_current_state(TASK_RUNNING);
  *
  * If the caller does not need such serialisation (because, for instance, the
- * condition test and condition change and wakeup are under the same lock) then
+ * CONDITION test and condition change and wakeup are under the same lock) then
  * use __set_current_state().
  *
  * The above is typically ordered against the wakeup, which does:
  *
- *   need_sleep = false;
+ *   CONDITION = 1;
  *   wake_up_state(p, TASK_UNINTERRUPTIBLE);
  *
- * where wake_up_state() executes a full memory barrier before accessing the
- * task state.
+ * where wake_up_state()/try_to_wake_up() executes a full memory barrier before
+ * accessing p->state.
  *
  * Wakeup will do: if (@state & p->state) p->state = TASK_RUNNING, that is,
  * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
@@ -375,7 +375,7 @@ struct util_est {
  * For cfs_rq, they are the aggregated values of all runnable and blocked
  * sched_entities.
  *
- * The load/runnable/util_avg doesn't direcly factor frequency scaling and CPU
+ * The load/runnable/util_avg doesn't directly factor frequency scaling and CPU
  * capacity scaling. The scaling is done through the rq_clock_pelt that is used
  * for computing those signals (see update_rq_clock_pelt())
  *
@@ -687,9 +687,15 @@ struct task_struct {
 	struct sched_dl_entity dl;
 #ifdef CONFIG_UCLAMP_TASK
-	/* Clamp values requested for a scheduling entity */
+	/*
+	 * Clamp values requested for a scheduling entity.
+	 * Must be updated with task_rq_lock() held.
+	 */
 	struct uclamp_se uclamp_req[UCLAMP_CNT];
-	/* Effective clamp values used for a scheduling entity */
+	/*
+	 * Effective clamp values used for a scheduling entity.
+	 * Must be updated with task_rq_lock() held.
+	 */
 	struct uclamp_se uclamp[UCLAMP_CNT];
 #endif
@@ -2039,6 +2045,7 @@ const struct sched_avg *sched_trace_rq_avg_dl(struct rq *rq);
 const struct sched_avg *sched_trace_rq_avg_irq(struct rq *rq);
 int sched_trace_rq_cpu(struct rq *rq);
+int sched_trace_rq_nr_running(struct rq *rq);
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
......
@@ -14,6 +14,7 @@ enum hk_flags {
 	HK_FLAG_DOMAIN = (1 << 5),
 	HK_FLAG_WQ = (1 << 6),
 	HK_FLAG_MANAGED_IRQ = (1 << 7),
+	HK_FLAG_KTHREAD = (1 << 8),
 };
 #ifdef CONFIG_CPU_ISOLATION
......
@@ -43,6 +43,6 @@ extern unsigned long calc_load_n(unsigned long load, unsigned long exp,
 #define LOAD_INT(x) ((x) >> FSHIFT)
 #define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-extern void calc_global_load(unsigned long ticks);
+extern void calc_global_load(void);
 #endif /* _LINUX_SCHED_LOADAVG_H */
@@ -23,7 +23,7 @@ extern struct mm_struct *mm_alloc(void);
  * will still exist later on and mmget_not_zero() has to be used before
  * accessing it.
  *
- * This is a preferred way to to pin @mm for a longer/unbounded amount
+ * This is a preferred way to pin @mm for a longer/unbounded amount
  * of time.
  *
  * Use mmdrop() to release the reference acquired by mmgrab().
@@ -49,8 +49,6 @@ static inline void mmdrop(struct mm_struct *mm)
 		__mmdrop(mm);
 }
-void mmdrop(struct mm_struct *mm);
 /*
  * This has to be called after a get_task_mm()/mmget_not_zero()
  * followed by taking the mmap_lock for writing before modifying the
@@ -234,7 +232,7 @@ static inline unsigned int memalloc_noio_save(void)
  * @flags: Flags to restore.
  *
  * Ends the implicit GFP_NOIO scope started by memalloc_noio_save function.
- * Always make sure that that the given flags is the return value from the
+ * Always make sure that the given flags is the return value from the
  * pairing memalloc_noio_save call.
  */
 static inline void memalloc_noio_restore(unsigned int flags)
@@ -265,7 +263,7 @@ static inline unsigned int memalloc_nofs_save(void)
  * @flags: Flags to restore.
  *
  * Ends the implicit GFP_NOFS scope started by memalloc_nofs_save function.
- * Always make sure that that the given flags is the return value from the
+ * Always make sure that the given flags is the return value from the
  * pairing memalloc_nofs_save call.
  */
 static inline void memalloc_nofs_restore(unsigned int flags)
......
@@ -61,9 +61,13 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
+extern unsigned int sysctl_sched_dl_period_max;
+extern unsigned int sysctl_sched_dl_period_min;
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
+extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
......
@@ -55,6 +55,7 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
 extern void init_idle(struct task_struct *idle, int cpu);
 extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
+extern void sched_post_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 void __noreturn do_task_dead(void);
......
@@ -217,6 +217,16 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 #endif /* !CONFIG_SMP */
 #ifndef arch_scale_cpu_capacity
+/**
+ * arch_scale_cpu_capacity - get the capacity scale factor of a given CPU.
+ * @cpu: the CPU in question.
+ *
+ * Return: the CPU scale factor normalized against SCHED_CAPACITY_SCALE, i.e.
+ *
+ *             max_perf(cpu)
+ *      ----------------------------- * SCHED_CAPACITY_SCALE
+ *      max(max_perf(c) : c \in CPUs)
+ */
 static __always_inline
 unsigned long arch_scale_cpu_capacity(int cpu)
 {
@@ -232,6 +242,13 @@ unsigned long arch_scale_thermal_pressure(int cpu)
 }
 #endif
+#ifndef arch_set_thermal_pressure
+static __always_inline
+void arch_set_thermal_pressure(const struct cpumask *cpus,
+			       unsigned long th_pressure)
+{ }
+#endif
 static inline int task_node(const struct task_struct *p)
 {
 	return cpu_to_node(task_cpu(p));
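The normalization formula in the new arch_scale_cpu_capacity() kernel-doc comment can be worked through with concrete numbers. The following is a userspace sketch only: cpu_capacity_sketch(), the max_perf array and the CAPACITY_SCALE constant are assumptions made for the example, and real architectures derive max_perf from their own data sources.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to match SCHED_CAPACITY_SCALE (1024) for this sketch. */
#define CAPACITY_SCALE 1024UL

/*
 * Hypothetical helper:
 *   max_perf[cpu] / max(max_perf[c] for all c) * CAPACITY_SCALE
 * The multiplication is done first to keep integer precision.
 */
static unsigned long cpu_capacity_sketch(const unsigned long *max_perf,
                                         size_t nr_cpus, size_t cpu)
{
    unsigned long max = 0;
    size_t i;

    assert(cpu < nr_cpus);
    for (i = 0; i < nr_cpus; i++)
        if (max_perf[i] > max)
            max = max_perf[i];
    assert(max != 0);

    return max_perf[cpu] * CAPACITY_SCALE / max;
}
```

On a hypothetical system with two little CPUs at max_perf 1000 and two big CPUs at max_perf 2000, the big CPUs get a capacity of 1024 and the little ones 512.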
......
@@ -91,7 +91,7 @@ DEFINE_EVENT(sched_wakeup_template, sched_waking,
 /*
  * Tracepoint called when the task is actually woken; p->state == TASK_RUNNNG.
- * It it not always called from the waking context.
+ * It is not always called from the waking context.
  */
 DEFINE_EVENT(sched_wakeup_template, sched_wakeup,
 	     TP_PROTO(struct task_struct *p),
@@ -634,6 +634,18 @@ DECLARE_TRACE(sched_overutilized_tp,
 	TP_PROTO(struct root_domain *rd, bool overutilized),
 	TP_ARGS(rd, overutilized));
+DECLARE_TRACE(sched_util_est_cfs_tp,
+	TP_PROTO(struct cfs_rq *cfs_rq),
+	TP_ARGS(cfs_rq));
+DECLARE_TRACE(sched_util_est_se_tp,
+	TP_PROTO(struct sched_entity *se),
+	TP_ARGS(se));
+DECLARE_TRACE(sched_update_nr_running_tp,
+	TP_PROTO(struct rq *rq, int change),
+	TP_ARGS(rq, change));
 #endif /* _TRACE_SCHED_H */
 /* This part must be outside protection */
......
@@ -492,8 +492,23 @@ config HAVE_SCHED_AVG_IRQ
 	depends on SMP
 config SCHED_THERMAL_PRESSURE
-	bool "Enable periodic averaging of thermal pressure"
+	bool
+	default y if ARM && ARM_CPU_TOPOLOGY
+	default y if ARM64
 	depends on SMP
+	depends on CPU_FREQ_THERMAL
+	help
+	  Select this option to enable thermal pressure accounting in the
+	  scheduler. Thermal pressure is the value conveyed to the scheduler
+	  that reflects the reduction in CPU compute capacity resulted from
+	  thermal throttling. Thermal throttling occurs when the performance of
+	  a CPU is capped due to high operating temperatures.
+	  If selected, the scheduler will be able to balance tasks accordingly,
+	  i.e. put less load on throttled CPUs than on non/less throttled ones.
+	  This requires the architecture to implement
+	  arch_set_thermal_pressure() and arch_get_thermal_pressure().
 config BSD_PROCESS_ACCT
 	bool "BSD Process Accounting"
......
@@ -2302,6 +2302,7 @@ static __latent_entropy struct task_struct *copy_process(
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
+	sched_post_fork(p);
 	cgroup_post_fork(p, args);
 	perf_event_fork(p);
......
@@ -27,6 +27,7 @@
 #include <linux/ptrace.h>
 #include <linux/uaccess.h>
 #include <linux/numa.h>
+#include <linux/sched/isolation.h>
 #include <trace/events/sched.h>
@@ -383,7 +384,8 @@ struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
 		 * The kernel thread should not inherit these properties.
 		 */
 		sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
-		set_cpus_allowed_ptr(task, cpu_all_mask);
+		set_cpus_allowed_ptr(task,
+				     housekeeping_cpumask(HK_FLAG_KTHREAD));
 	}
 	kfree(create);
 	return task;
@@ -608,7 +610,7 @@ int kthreadd(void *unused)
 	/* Setup a clean context for our children to inherit. */
 	set_task_comm(tsk, "kthreadd");
 	ignore_signals(tsk);
-	set_cpus_allowed_ptr(tsk, cpu_all_mask);
+	set_cpus_allowed_ptr(tsk, housekeeping_cpumask(HK_FLAG_KTHREAD));
 	set_mems_allowed(node_states[N_MEMORY]);
 	current->flags |= PF_NOFREEZE;
......
...@@ -121,6 +121,30 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p, ...@@ -121,6 +121,30 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
if (later_mask && if (later_mask &&
cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) { cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
unsigned long cap, max_cap = 0;
int cpu, max_cpu = -1;
if (!static_branch_unlikely(&sched_asym_cpucapacity))
return 1;
/* Ensure the capacity of the CPUs fits the task. */
for_each_cpu(cpu, later_mask) {
if (!dl_task_fits_capacity(p, cpu)) {
cpumask_clear_cpu(cpu, later_mask);
cap = capacity_orig_of(cpu);
if (cap > max_cap ||
(cpu == task_cpu(p) && cap == max_cap)) {
max_cap = cap;
max_cpu = cpu;
}
}
}
if (cpumask_empty(later_mask))
cpumask_set_cpu(max_cpu, later_mask);
return 1; return 1;
} else { } else {
int best_cpu = cpudl_maximum(cp); int best_cpu = cpudl_maximum(cp);
......
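The cpudl_find() hunk above filters later_mask down to CPUs whose capacity fits the task, and falls back to the largest non-fitting CPU when nothing fits. A minimal userspace sketch of that filtering follows. The bitmask representation, the `NR_CPUS_SKETCH` constant, the function name, and the plain `needed` threshold (standing in for dl_task_fits_capacity(), whose definition is not shown here) are all assumptions made for illustration.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 8	/* hypothetical CPU count for the sketch */

/*
 * Drop CPUs that are too small from later_mask. Among the dropped CPUs,
 * remember the one with the largest capacity, preferring the task's
 * current CPU on a tie. If no CPU fits, return only that fallback CPU.
 */
static uint32_t filter_by_capacity(uint32_t later_mask,
				   const unsigned long cap_of[NR_CPUS_SKETCH],
				   unsigned long needed, int task_cpu)
{
	unsigned long max_cap = 0;
	int max_cpu = -1;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		if (!(later_mask & (1u << cpu)))
			continue;
		if (cap_of[cpu] >= needed)	/* stand-in for the fit check */
			continue;

		later_mask &= ~(1u << cpu);
		if (cap_of[cpu] > max_cap ||
		    (cpu == task_cpu && cap_of[cpu] == max_cap)) {
			max_cap = cap_of[cpu];
			max_cpu = cpu;
		}
	}

	if (!later_mask && max_cpu >= 0)
		later_mask |= 1u << max_cpu;

	return later_mask;
}
```

With capacities {446, 446, 1024, 1024} and a requirement of 800, a mask covering all four CPUs shrinks to the two big ones, while a mask covering only the two little CPUs collapses to a single fallback CPU.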
...@@ -210,7 +210,7 @@ unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs, ...@@ -210,7 +210,7 @@ unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
unsigned long dl_util, util, irq; unsigned long dl_util, util, irq;
struct rq *rq = cpu_rq(cpu); struct rq *rq = cpu_rq(cpu);
if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) && if (!uclamp_is_used() &&
type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) { type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
return max; return max;
} }
......
...@@ -519,50 +519,6 @@ void account_idle_ticks(unsigned long ticks) ...@@ -519,50 +519,6 @@ void account_idle_ticks(unsigned long ticks)
account_idle_time(cputime); account_idle_time(cputime);
} }
/*
* Perform (stime * rtime) / total, but avoid multiplication overflow by
* losing precision when the numbers are big.
*/
static u64 scale_stime(u64 stime, u64 rtime, u64 total)
{
u64 scaled;
for (;;) {
/* Make sure "rtime" is the bigger of stime/rtime */
if (stime > rtime)
swap(rtime, stime);
/* Make sure 'total' fits in 32 bits */
if (total >> 32)
goto drop_precision;
/* Does rtime (and thus stime) fit in 32 bits? */
if (!(rtime >> 32))
break;
/* Can we just balance rtime/stime rather than dropping bits? */
if (stime >> 31)
goto drop_precision;
/* We can grow stime and shrink rtime and try to make them both fit */
stime <<= 1;
rtime >>= 1;
continue;
drop_precision:
/* We drop from rtime, it has more bits than stime */
rtime >>= 1;
total >>= 1;
}
/*
* Make sure gcc understands that this is a 32x32->64 multiply,
* followed by a 64/32->64 divide.
*/
scaled = div_u64((u64) (u32) stime * (u64) (u32) rtime, (u32)total);
return scaled;
}
/* /*
* Adjust tick based cputime random precision against scheduler runtime * Adjust tick based cputime random precision against scheduler runtime
* accounting. * accounting.
...@@ -622,7 +578,7 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev, ...@@ -622,7 +578,7 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
goto update; goto update;
} }
stime = scale_stime(stime, rtime, stime + utime); stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
update: update:
/* /*
......
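The cputime hunk above replaces the precision-dropping scale_stime() loop with a single call to mul_u64_u64_div_u64(). The sketch below shows the arithmetic that call is meant to provide, (a * b) / c with a wide intermediate product. It is a userspace stand-in, not the kernel helper: the `_sketch` names are hypothetical and the code relies on the GCC/Clang `unsigned __int128` extension.

```c
#include <assert.h>
#include <stdint.h>

/* Compute (a * b) / c without overflowing the 64-bit intermediate. */
static uint64_t mul_u64_u64_div_u64_sketch(uint64_t a, uint64_t b, uint64_t c)
{
	assert(c != 0);
	return (uint64_t)(((unsigned __int128)a * b) / c);
}

/*
 * Scale stime by rtime / (stime + utime), as in cputime_adjust().
 * The caller must ensure stime + utime is non-zero.
 */
static uint64_t scale_stime_sketch(uint64_t stime, uint64_t utime,
				   uint64_t rtime)
{
	return mul_u64_u64_div_u64_sketch(stime, rtime, stime + utime);
}
```

For example, with stime = 2^40, utime = 3 * 2^40 and rtime = 2^42, the product stime * rtime is 2^82, which does not fit in 64 bits, yet the scaled result is exactly 2^40.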
...@@ -54,15 +54,49 @@ static inline struct dl_bw *dl_bw_of(int i) ...@@ -54,15 +54,49 @@ static inline struct dl_bw *dl_bw_of(int i)
static inline int dl_bw_cpus(int i) static inline int dl_bw_cpus(int i)
{ {
struct root_domain *rd = cpu_rq(i)->rd; struct root_domain *rd = cpu_rq(i)->rd;
int cpus = 0; int cpus;
RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(), RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
"sched RCU must be held"); "sched RCU must be held");
if (cpumask_subset(rd->span, cpu_active_mask))
return cpumask_weight(rd->span);
cpus = 0;
for_each_cpu_and(i, rd->span, cpu_active_mask) for_each_cpu_and(i, rd->span, cpu_active_mask)
cpus++; cpus++;
return cpus; return cpus;
} }
static inline unsigned long __dl_bw_capacity(int i)
{
struct root_domain *rd = cpu_rq(i)->rd;
unsigned long cap = 0;
RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
"sched RCU must be held");
for_each_cpu_and(i, rd->span, cpu_active_mask)
cap += capacity_orig_of(i);
return cap;
}
/*
* XXX Fix: If 'rq->rd == def_root_domain' perform AC against capacity
* of the CPU the task is running on rather than rd's \Sum CPU capacity.
*/
static inline unsigned long dl_bw_capacity(int i)
{
if (!static_branch_unlikely(&sched_asym_cpucapacity) &&
capacity_orig_of(i) == SCHED_CAPACITY_SCALE) {
return dl_bw_cpus(i) << SCHED_CAPACITY_SHIFT;
} else {
return __dl_bw_capacity(i);
}
}
#else #else
static inline struct dl_bw *dl_bw_of(int i) static inline struct dl_bw *dl_bw_of(int i)
{ {
...@@ -73,6 +107,11 @@ static inline int dl_bw_cpus(int i) ...@@ -73,6 +107,11 @@ static inline int dl_bw_cpus(int i)
{ {
return 1; return 1;
} }
static inline unsigned long dl_bw_capacity(int i)
{
return SCHED_CAPACITY_SCALE;
}
#endif #endif
static inline static inline
...@@ -1098,7 +1137,7 @@ void init_dl_task_timer(struct sched_dl_entity *dl_se) ...@@ -1098,7 +1137,7 @@ void init_dl_task_timer(struct sched_dl_entity *dl_se)
* cannot use the runtime, and so it replenishes the task. This rule * cannot use the runtime, and so it replenishes the task. This rule
* works fine for implicit deadline tasks (deadline == period), and the * works fine for implicit deadline tasks (deadline == period), and the
* CBS was designed for implicit deadline tasks. However, a task with * CBS was designed for implicit deadline tasks. However, a task with
* constrained deadline (deadine < period) might be awakened after the * constrained deadline (deadline < period) might be awakened after the
* deadline, but before the next period. In this case, replenishing the * deadline, but before the next period. In this case, replenishing the
* task would allow it to run for runtime / deadline. As in this case * task would allow it to run for runtime / deadline. As in this case
* deadline < period, CBS enables a task to run for more than the * deadline < period, CBS enables a task to run for more than the
...@@ -1604,6 +1643,7 @@ static int ...@@ -1604,6 +1643,7 @@ static int
select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags) select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
{ {
struct task_struct *curr; struct task_struct *curr;
bool select_rq;
struct rq *rq; struct rq *rq;
if (sd_flag != SD_BALANCE_WAKE) if (sd_flag != SD_BALANCE_WAKE)
...@@ -1623,10 +1663,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags) ...@@ -1623,10 +1663,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
* other hand, if it has a shorter deadline, we * other hand, if it has a shorter deadline, we
* try to make it stay here, it might be important. * try to make it stay here, it might be important.
*/ */
if (unlikely(dl_task(curr)) && select_rq = unlikely(dl_task(curr)) &&
(curr->nr_cpus_allowed < 2 || (curr->nr_cpus_allowed < 2 ||
!dl_entity_preempt(&p->dl, &curr->dl)) && !dl_entity_preempt(&p->dl, &curr->dl)) &&
(p->nr_cpus_allowed > 1)) { p->nr_cpus_allowed > 1;
/*
* Take the capacity of the CPU into account to
* ensure it fits the requirement of the task.
*/
if (static_branch_unlikely(&sched_asym_cpucapacity))
select_rq |= !dl_task_fits_capacity(p, cpu);
if (select_rq) {
int target = find_later_rq(p); int target = find_later_rq(p);
if (target != -1 && if (target != -1 &&
...@@ -2430,8 +2479,8 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p, ...@@ -2430,8 +2479,8 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
} }
} }
const struct sched_class dl_sched_class = { const struct sched_class dl_sched_class
.next = &rt_sched_class, __attribute__((section("__dl_sched_class"))) = {
.enqueue_task = enqueue_task_dl, .enqueue_task = enqueue_task_dl,
.dequeue_task = dequeue_task_dl, .dequeue_task = dequeue_task_dl,
.yield_task = yield_task_dl, .yield_task = yield_task_dl,
...@@ -2551,11 +2600,12 @@ void sched_dl_do_global(void) ...@@ -2551,11 +2600,12 @@ void sched_dl_do_global(void)
int sched_dl_overflow(struct task_struct *p, int policy, int sched_dl_overflow(struct task_struct *p, int policy,
const struct sched_attr *attr) const struct sched_attr *attr)
{ {
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
u64 period = attr->sched_period ?: attr->sched_deadline; u64 period = attr->sched_period ?: attr->sched_deadline;
u64 runtime = attr->sched_runtime; u64 runtime = attr->sched_runtime;
u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0; u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
int cpus, err = -1; int cpus, err = -1, cpu = task_cpu(p);
struct dl_bw *dl_b = dl_bw_of(cpu);
unsigned long cap;
if (attr->sched_flags & SCHED_FLAG_SUGOV) if (attr->sched_flags & SCHED_FLAG_SUGOV)
return 0; return 0;
...@@ -2570,15 +2620,17 @@ int sched_dl_overflow(struct task_struct *p, int policy, ...@@ -2570,15 +2620,17 @@ int sched_dl_overflow(struct task_struct *p, int policy,
* allocated bandwidth of the container. * allocated bandwidth of the container.
*/ */
raw_spin_lock(&dl_b->lock); raw_spin_lock(&dl_b->lock);
cpus = dl_bw_cpus(task_cpu(p)); cpus = dl_bw_cpus(cpu);
cap = dl_bw_capacity(cpu);
if (dl_policy(policy) && !task_has_dl_policy(p) && if (dl_policy(policy) && !task_has_dl_policy(p) &&
!__dl_overflow(dl_b, cpus, 0, new_bw)) { !__dl_overflow(dl_b, cap, 0, new_bw)) {
if (hrtimer_active(&p->dl.inactive_timer)) if (hrtimer_active(&p->dl.inactive_timer))
__dl_sub(dl_b, p->dl.dl_bw, cpus); __dl_sub(dl_b, p->dl.dl_bw, cpus);
__dl_add(dl_b, new_bw, cpus); __dl_add(dl_b, new_bw, cpus);
err = 0; err = 0;
} else if (dl_policy(policy) && task_has_dl_policy(p) && } else if (dl_policy(policy) && task_has_dl_policy(p) &&
!__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) { !__dl_overflow(dl_b, cap, p->dl.dl_bw, new_bw)) {
/* /*
* XXX this is slightly incorrect: when the task * XXX this is slightly incorrect: when the task
* utilization decreases, we should delay the total * utilization decreases, we should delay the total
...@@ -2634,6 +2686,14 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr) ...@@ -2634,6 +2686,14 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
attr->sched_flags = dl_se->flags; attr->sched_flags = dl_se->flags;
} }
/*
* Default limits for DL period; on the top end we guard against small util
* tasks still getting ridiculously long effective runtimes, on the bottom end we
* guard against timer DoS.
*/
unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4 seconds */
unsigned int sysctl_sched_dl_period_min = 100; /* 100 us */
/* /*
* This function validates the new parameters of a -deadline task. * This function validates the new parameters of a -deadline task.
* We ask for the deadline not being zero, and greater or equal * We ask for the deadline not being zero, and greater or equal
...@@ -2646,6 +2706,8 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr) ...@@ -2646,6 +2706,8 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
*/ */
bool __checkparam_dl(const struct sched_attr *attr) bool __checkparam_dl(const struct sched_attr *attr)
{ {
u64 period, max, min;
/* special dl tasks don't actually use any parameter */ /* special dl tasks don't actually use any parameter */
if (attr->sched_flags & SCHED_FLAG_SUGOV) if (attr->sched_flags & SCHED_FLAG_SUGOV)
return true; return true;
...@@ -2669,12 +2731,21 @@ bool __checkparam_dl(const struct sched_attr *attr) ...@@ -2669,12 +2731,21 @@ bool __checkparam_dl(const struct sched_attr *attr)
attr->sched_period & (1ULL << 63)) attr->sched_period & (1ULL << 63))
return false; return false;
period = attr->sched_period;
if (!period)
period = attr->sched_deadline;
/* runtime <= deadline <= period (if period != 0) */ /* runtime <= deadline <= period (if period != 0) */
if ((attr->sched_period != 0 && if (period < attr->sched_deadline ||
attr->sched_period < attr->sched_deadline) ||
attr->sched_deadline < attr->sched_runtime) attr->sched_deadline < attr->sched_runtime)
return false; return false;
max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
if (period < min || period > max)
return false;
return true; return true;
} }
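The __checkparam_dl() hunks above add a lower and an upper bound on the effective period. A minimal sketch of the resulting ordering and range checks follows. The function and variable names are hypothetical, the two limits are plain globals rather than sysctls, and the high-bit and minimum-runtime checks of the real function are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Defaults taken from the hunk above, in microseconds. */
static unsigned int period_max_us = 1 << 22;	/* ~4 seconds */
static unsigned int period_min_us = 100;	/* 100 us */

/* All three parameters are in nanoseconds; period == 0 means "use deadline". */
static bool check_dl_params_sketch(uint64_t runtime, uint64_t deadline,
				   uint64_t period)
{
	uint64_t max, min;

	if (deadline == 0)
		return false;

	if (!period)
		period = deadline;

	/* runtime <= deadline <= period */
	if (period < deadline || deadline < runtime)
		return false;

	max = (uint64_t)period_max_us * NSEC_PER_USEC;
	min = (uint64_t)period_min_us * NSEC_PER_USEC;

	return period >= min && period <= max;
}
```

Under these defaults, a 1 ms runtime with a 10 ms deadline and no explicit period is accepted, while a 50 us period or a 5 s period is rejected.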
...@@ -2715,19 +2786,19 @@ bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr) ...@@ -2715,19 +2786,19 @@ bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
#ifdef CONFIG_SMP #ifdef CONFIG_SMP
int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed) int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed)
{ {
unsigned long flags, cap;
unsigned int dest_cpu; unsigned int dest_cpu;
struct dl_bw *dl_b; struct dl_bw *dl_b;
bool overflow; bool overflow;
int cpus, ret; int ret;
unsigned long flags;
dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed); dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed);
rcu_read_lock_sched(); rcu_read_lock_sched();
dl_b = dl_bw_of(dest_cpu); dl_b = dl_bw_of(dest_cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags); raw_spin_lock_irqsave(&dl_b->lock, flags);
cpus = dl_bw_cpus(dest_cpu); cap = dl_bw_capacity(dest_cpu);
overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw); overflow = __dl_overflow(dl_b, cap, 0, p->dl.dl_bw);
if (overflow) { if (overflow) {
ret = -EBUSY; ret = -EBUSY;
} else { } else {
...@@ -2737,6 +2808,8 @@ int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allo ...@@ -2737,6 +2808,8 @@ int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allo
* We will free resources in the source root_domain * We will free resources in the source root_domain
* later on (see set_cpus_allowed_dl()). * later on (see set_cpus_allowed_dl()).
*/ */
int cpus = dl_bw_cpus(dest_cpu);
__dl_add(dl_b, p->dl.dl_bw, cpus); __dl_add(dl_b, p->dl.dl_bw, cpus);
ret = 0; ret = 0;
} }
...@@ -2769,16 +2842,15 @@ int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, ...@@ -2769,16 +2842,15 @@ int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur,
bool dl_cpu_busy(unsigned int cpu) bool dl_cpu_busy(unsigned int cpu)
{ {
unsigned long flags; unsigned long flags, cap;
struct dl_bw *dl_b; struct dl_bw *dl_b;
bool overflow; bool overflow;
int cpus;
rcu_read_lock_sched(); rcu_read_lock_sched();
dl_b = dl_bw_of(cpu); dl_b = dl_bw_of(cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags); raw_spin_lock_irqsave(&dl_b->lock, flags);
cpus = dl_bw_cpus(cpu); cap = dl_bw_capacity(cpu);
overflow = __dl_overflow(dl_b, cpus, 0, 0); overflow = __dl_overflow(dl_b, cap, 0, 0);
raw_spin_unlock_irqrestore(&dl_b->lock, flags); raw_spin_unlock_irqrestore(&dl_b->lock, flags);
rcu_read_unlock_sched(); rcu_read_unlock_sched();
......
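The deadline.c hunks above switch admission control from a CPU count to a summed CPU capacity. The sketch below illustrates why that matters on an asymmetric system. The body of __dl_overflow() is not part of the hunks shown, so the comparison used here (per-CPU bandwidth scaled by total capacity against the requested total) is an assumption, as are the `_sketch` names and the `BW_SHIFT` fixed-point shift.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)
#define BW_SHIFT		20	/* assumed fixed-point shift for ratios */

/* runtime / period as a fixed-point fraction. */
static uint64_t to_ratio_sketch(uint64_t period, uint64_t runtime)
{
	assert(period != 0);
	return (runtime << BW_SHIFT) / period;
}

/* Sum of per-CPU capacities, in the spirit of __dl_bw_capacity(). */
static unsigned long sum_capacity_sketch(const unsigned long *cap, int n)
{
	unsigned long sum = 0;

	for (int i = 0; i < n; i++)
		sum += cap[i];
	return sum;
}

/* True if replacing old_bw with new_bw would exceed the allowed bandwidth. */
static bool dl_overflow_sketch(uint64_t bw, unsigned long cap,
			       uint64_t total_bw, uint64_t old_bw,
			       uint64_t new_bw)
{
	uint64_t limit = (bw * cap) >> SCHED_CAPACITY_SHIFT;

	return limit < total_bw - old_bw + new_bw;
}
```

With a 95% per-CPU limit, a request worth three full CPUs passes on four symmetric CPUs (total capacity 4096) but is rejected on a 2 big + 2 little system with capacities {1024, 1024, 446, 446}, even though both have four CPUs.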
...@@ -22,8 +22,6 @@ ...@@ -22,8 +22,6 @@
*/ */
#include "sched.h" #include "sched.h"
#include <trace/events/sched.h>
/* /*
* Targeted preemption latency for CPU-bound tasks: * Targeted preemption latency for CPU-bound tasks:
* *
...@@ -3094,7 +3092,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, ...@@ -3094,7 +3092,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
#ifdef CONFIG_SMP #ifdef CONFIG_SMP
do { do {
u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib; u32 divider = get_pelt_divider(&se->avg);
se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider); se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
} while (0); } while (0);
...@@ -3440,16 +3438,18 @@ static inline void ...@@ -3440,16 +3438,18 @@ static inline void
update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq) update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{ {
long delta = gcfs_rq->avg.util_avg - se->avg.util_avg; long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
/* u32 divider;
* cfs_rq->avg.period_contrib can be used for both cfs_rq and se.
* See ___update_load_avg() for details.
*/
u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
/* Nothing to update */ /* Nothing to update */
if (!delta) if (!delta)
return; return;
/*
* cfs_rq->avg.period_contrib can be used for both cfs_rq and se.
* See ___update_load_avg() for details.
*/
divider = get_pelt_divider(&cfs_rq->avg);
/* Set new sched_entity's utilization */ /* Set new sched_entity's utilization */
se->avg.util_avg = gcfs_rq->avg.util_avg; se->avg.util_avg = gcfs_rq->avg.util_avg;
se->avg.util_sum = se->avg.util_avg * divider; se->avg.util_sum = se->avg.util_avg * divider;
...@@ -3463,16 +3463,18 @@ static inline void ...@@ -3463,16 +3463,18 @@ static inline void
update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq) update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{ {
long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg; long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
/* u32 divider;
* cfs_rq->avg.period_contrib can be used for both cfs_rq and se.
* See ___update_load_avg() for details.
*/
u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
/* Nothing to update */ /* Nothing to update */
if (!delta) if (!delta)
return; return;
/*
* cfs_rq->avg.period_contrib can be used for both cfs_rq and se.
* See ___update_load_avg() for details.
*/
divider = get_pelt_divider(&cfs_rq->avg);
/* Set new sched_entity's runnable */ /* Set new sched_entity's runnable */
se->avg.runnable_avg = gcfs_rq->avg.runnable_avg; se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
se->avg.runnable_sum = se->avg.runnable_avg * divider; se->avg.runnable_sum = se->avg.runnable_avg * divider;
...@@ -3500,7 +3502,7 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq ...@@ -3500,7 +3502,7 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
* cfs_rq->avg.period_contrib can be used for both cfs_rq and se. * cfs_rq->avg.period_contrib can be used for both cfs_rq and se.
* See ___update_load_avg() for details. * See ___update_load_avg() for details.
*/ */
divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib; divider = get_pelt_divider(&cfs_rq->avg);
if (runnable_sum >= 0) { if (runnable_sum >= 0) {
/* /*
...@@ -3646,7 +3648,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) ...@@ -3646,7 +3648,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
if (cfs_rq->removed.nr) { if (cfs_rq->removed.nr) {
unsigned long r; unsigned long r;
u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib; u32 divider = get_pelt_divider(&cfs_rq->avg);
raw_spin_lock(&cfs_rq->removed.lock); raw_spin_lock(&cfs_rq->removed.lock);
swap(cfs_rq->removed.util_avg, removed_util); swap(cfs_rq->removed.util_avg, removed_util);
...@@ -3701,7 +3703,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s ...@@ -3701,7 +3703,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
* cfs_rq->avg.period_contrib can be used for both cfs_rq and se. * cfs_rq->avg.period_contrib can be used for both cfs_rq and se.
* See ___update_load_avg() for details. * See ___update_load_avg() for details.
*/ */
u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib; u32 divider = get_pelt_divider(&cfs_rq->avg);
/* /*
* When we attach the @se to the @cfs_rq, we must align the decay * When we attach the @se to the @cfs_rq, we must align the decay
...@@ -3922,6 +3924,8 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq, ...@@ -3922,6 +3924,8 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
enqueued = cfs_rq->avg.util_est.enqueued; enqueued = cfs_rq->avg.util_est.enqueued;
enqueued += _task_util_est(p); enqueued += _task_util_est(p);
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued); WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
trace_sched_util_est_cfs_tp(cfs_rq);
} }
/* /*
...@@ -3952,6 +3956,8 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) ...@@ -3952,6 +3956,8 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
ue.enqueued -= min_t(unsigned int, ue.enqueued, _task_util_est(p)); ue.enqueued -= min_t(unsigned int, ue.enqueued, _task_util_est(p));
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued); WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
trace_sched_util_est_cfs_tp(cfs_rq);
/* /*
* Skip update of task's estimated utilization when the task has not * Skip update of task's estimated utilization when the task has not
* yet completed an activation, e.g. being migrated. * yet completed an activation, e.g. being migrated.
...@@ -4017,6 +4023,8 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) ...@@ -4017,6 +4023,8 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
ue.ewma >>= UTIL_EST_WEIGHT_SHIFT; ue.ewma >>= UTIL_EST_WEIGHT_SHIFT;
done: done:
WRITE_ONCE(p->se.avg.util_est, ue); WRITE_ONCE(p->se.avg.util_est, ue);
trace_sched_util_est_se_tp(&p->se);
} }
static inline int task_fits_capacity(struct task_struct *p, long capacity) static inline int task_fits_capacity(struct task_struct *p, long capacity)
...@@ -5618,14 +5626,14 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) ...@@ -5618,14 +5626,14 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
} }
dequeue_throttle: /* At this point se is NULL and we are at root level*/
if (!se)
sub_nr_running(rq, 1); sub_nr_running(rq, 1);
/* balance early to pull high priority tasks */ /* balance early to pull high priority tasks */
if (unlikely(!was_sched_idle && sched_idle_rq(rq))) if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
rq->next_balance = jiffies; rq->next_balance = jiffies;
dequeue_throttle:
util_est_dequeue(&rq->cfs, p, task_sleep); util_est_dequeue(&rq->cfs, p, task_sleep);
hrtick_update(rq); hrtick_update(rq);
} }
...@@ -7161,7 +7169,7 @@ static void yield_task_fair(struct rq *rq) ...@@ -7161,7 +7169,7 @@ static void yield_task_fair(struct rq *rq)
set_skip_buddy(se); set_skip_buddy(se);
} }
static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preempt) static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
{ {
struct sched_entity *se = &p->se; struct sched_entity *se = &p->se;
...@@ -8049,7 +8057,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) ...@@ -8049,7 +8057,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
}; };
} }
static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu) static unsigned long scale_rt_capacity(int cpu)
{ {
struct rq *rq = cpu_rq(cpu); struct rq *rq = cpu_rq(cpu);
unsigned long max = arch_scale_cpu_capacity(cpu); unsigned long max = arch_scale_cpu_capacity(cpu);
...@@ -8081,7 +8089,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu) ...@@ -8081,7 +8089,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
static void update_cpu_capacity(struct sched_domain *sd, int cpu) static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{ {
unsigned long capacity = scale_rt_capacity(sd, cpu); unsigned long capacity = scale_rt_capacity(cpu);
struct sched_group *sdg = sd->groups; struct sched_group *sdg = sd->groups;
cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu); cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
...@@ -8703,8 +8711,14 @@ static bool update_pick_idlest(struct sched_group *idlest, ...@@ -8703,8 +8711,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
case group_has_spare: case group_has_spare:
/* Select group with most idle CPUs */ /* Select group with most idle CPUs */
if (idlest_sgs->idle_cpus >= sgs->idle_cpus) if (idlest_sgs->idle_cpus > sgs->idle_cpus)
return false;
/* Select group with lowest group_util */
if (idlest_sgs->idle_cpus == sgs->idle_cpus &&
idlest_sgs->group_util <= sgs->group_util)
return false; return false;
break; break;
} }
...@@ -10027,7 +10041,12 @@ static void kick_ilb(unsigned int flags) ...@@ -10027,7 +10041,12 @@ static void kick_ilb(unsigned int flags)
{ {
int ilb_cpu; int ilb_cpu;
nohz.next_balance++; /*
* Increase nohz.next_balance only if a full ilb is triggered but
* not if we only update stats.
*/
if (flags & NOHZ_BALANCE_KICK)
nohz.next_balance = jiffies+1;
ilb_cpu = find_new_ilb(); ilb_cpu = find_new_ilb();
...@@ -10348,6 +10367,14 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, ...@@ -10348,6 +10367,14 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
} }
} }
/*
* next_balance will be updated only when there is a need.
* When the CPU is attached to null domain for ex, it will not be
* updated.
*/
if (likely(update_next_balance))
nohz.next_balance = next_balance;
/* Newly idle CPU doesn't need an update */ /* Newly idle CPU doesn't need an update */
if (idle != CPU_NEWLY_IDLE) { if (idle != CPU_NEWLY_IDLE) {
update_blocked_averages(this_cpu); update_blocked_averages(this_cpu);
...@@ -10368,14 +10395,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, ...@@ -10368,14 +10395,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
if (has_blocked_load) if (has_blocked_load)
WRITE_ONCE(nohz.has_blocked, 1); WRITE_ONCE(nohz.has_blocked, 1);
/*
* next_balance will be updated only when there is a need.
* When the CPU is attached to null domain for ex, it will not be
* updated.
*/
if (likely(update_next_balance))
nohz.next_balance = next_balance;
return ret; return ret;
} }
...@@ -11118,8 +11137,8 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task ...@@ -11118,8 +11137,8 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
/* /*
* All the scheduling class methods: * All the scheduling class methods:
*/ */
const struct sched_class fair_sched_class = { const struct sched_class fair_sched_class
.next = &idle_sched_class, __attribute__((section("__fair_sched_class"))) = {
.enqueue_task = enqueue_task_fair, .enqueue_task = enqueue_task_fair,
.dequeue_task = dequeue_task_fair, .dequeue_task = dequeue_task_fair,
.yield_task = yield_task_fair, .yield_task = yield_task_fair,
...@@ -11292,3 +11311,9 @@ const struct cpumask *sched_trace_rd_span(struct root_domain *rd) ...@@ -11292,3 +11311,9 @@ const struct cpumask *sched_trace_rd_span(struct root_domain *rd)
#endif #endif
} }
EXPORT_SYMBOL_GPL(sched_trace_rd_span); EXPORT_SYMBOL_GPL(sched_trace_rd_span);
int sched_trace_rq_nr_running(struct rq *rq)
{
return rq ? rq->nr_running : -1;
}
EXPORT_SYMBOL_GPL(sched_trace_rq_nr_running);
...@@ -453,11 +453,6 @@ prio_changed_idle(struct rq *rq, struct task_struct *p, int oldprio) ...@@ -453,11 +453,6 @@ prio_changed_idle(struct rq *rq, struct task_struct *p, int oldprio)
BUG(); BUG();
} }
static unsigned int get_rr_interval_idle(struct rq *rq, struct task_struct *task)
{
return 0;
}
static void update_curr_idle(struct rq *rq) static void update_curr_idle(struct rq *rq)
{ {
} }
...@@ -465,8 +460,8 @@ static void update_curr_idle(struct rq *rq) ...@@ -465,8 +460,8 @@ static void update_curr_idle(struct rq *rq)
/* /*
* Simple, special scheduling class for the per-CPU idle tasks: * Simple, special scheduling class for the per-CPU idle tasks:
*/ */
const struct sched_class idle_sched_class = { const struct sched_class idle_sched_class
/* .next is NULL */ __attribute__((section("__idle_sched_class"))) = {
/* no enqueue/yield_task for idle tasks */ /* no enqueue/yield_task for idle tasks */
/* dequeue is not valid, we print a debug message there: */ /* dequeue is not valid, we print a debug message there: */
...@@ -486,8 +481,6 @@ const struct sched_class idle_sched_class = { ...@@ -486,8 +481,6 @@ const struct sched_class idle_sched_class = {
.task_tick = task_tick_idle, .task_tick = task_tick_idle,
.get_rr_interval = get_rr_interval_idle,
.prio_changed = prio_changed_idle, .prio_changed = prio_changed_idle,
.switched_to = switched_to_idle, .switched_to = switched_to_idle,
.update_curr = update_curr_idle, .update_curr = update_curr_idle,
......
...@@ -140,7 +140,8 @@ static int __init housekeeping_nohz_full_setup(char *str) ...@@ -140,7 +140,8 @@ static int __init housekeeping_nohz_full_setup(char *str)
{ {
unsigned int flags; unsigned int flags;
flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU | HK_FLAG_MISC; flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
HK_FLAG_MISC | HK_FLAG_KTHREAD;
return housekeeping_setup(str, flags); return housekeeping_setup(str, flags);
} }
......
...@@ -347,7 +347,7 @@ static inline void calc_global_nohz(void) { } ...@@ -347,7 +347,7 @@ static inline void calc_global_nohz(void) { }
* *
* Called from the global timer code. * Called from the global timer code.
*/ */
void calc_global_load(unsigned long ticks) void calc_global_load(void)
{ {
unsigned long sample_window; unsigned long sample_window;
long active, delta; long active, delta;
......
...@@ -28,8 +28,6 @@ ...@@ -28,8 +28,6 @@
#include "sched.h" #include "sched.h"
#include "pelt.h" #include "pelt.h"
#include <trace/events/sched.h>
/* /*
* Approximate: * Approximate:
* val * y^n, where y^32 ~= 0.5 (~1 scheduling period) * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
...@@ -83,8 +81,6 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3) ...@@ -83,8 +81,6 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
return c1 + c2 + c3; return c1 + c2 + c3;
} }
#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
/* /*
* Accumulate the three separate parts of the sum; d1 the remainder * Accumulate the three separate parts of the sum; d1 the remainder
* of the last (incomplete) period, d2 the span of full periods and d3 * of the last (incomplete) period, d2 the span of full periods and d3
...@@ -264,7 +260,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa, ...@@ -264,7 +260,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
static __always_inline void static __always_inline void
___update_load_avg(struct sched_avg *sa, unsigned long load) ___update_load_avg(struct sched_avg *sa, unsigned long load)
{ {
u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib; u32 divider = get_pelt_divider(sa);
/* /*
* Step 2: update *_avg. * Step 2: update *_avg.
......
...@@ -37,6 +37,11 @@ update_irq_load_avg(struct rq *rq, u64 running) ...@@ -37,6 +37,11 @@ update_irq_load_avg(struct rq *rq, u64 running)
} }
#endif #endif
static inline u32 get_pelt_divider(struct sched_avg *avg)
{
return LOAD_AVG_MAX - 1024 + avg->period_contrib;
}
/* /*
* When a task is dequeued, its estimated utilization should not be updated if * When a task is dequeued, its estimated utilization should not be updated if
* its util_avg has not been updated at least once. * its util_avg has not been updated at least once.
......
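The new get_pelt_divider() helper above centralizes the expression LOAD_AVG_MAX - 1024 + period_contrib that the fair.c and pelt.c hunks previously open-coded. A small sketch of how the divider relates a *_avg value to its *_sum follows. The struct is a cut-down stand-in for struct sched_avg, the `_sketch` names are hypothetical, and the LOAD_AVG_MAX value is assumed rather than taken from the hunks shown.

```c
#include <stdint.h>

#define LOAD_AVG_MAX 47742	/* assumed maximum of the PELT geometric series */

/* Cut-down stand-in for struct sched_avg. */
struct sched_avg_sketch {
	uint32_t	period_contrib;	/* progress into the current 1024 us window */
	uint64_t	util_sum;
	unsigned long	util_avg;
};

static uint32_t get_pelt_divider_sketch(const struct sched_avg_sketch *avg)
{
	return LOAD_AVG_MAX - 1024 + avg->period_contrib;
}

/* Set util_avg and keep util_sum consistent with it, as update_tg_cfs_util() does. */
static void set_util_avg_sketch(struct sched_avg_sketch *avg,
				unsigned long util_avg)
{
	avg->util_avg = util_avg;
	avg->util_sum = (uint64_t)util_avg * get_pelt_divider_sketch(avg);
}
```

The divider grows with period_contrib, so the same util_avg maps to a slightly larger util_sum later in the window.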
...@@ -190,7 +190,6 @@ static void group_init(struct psi_group *group) ...@@ -190,7 +190,6 @@ static void group_init(struct psi_group *group)
INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work); INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
mutex_init(&group->avgs_lock); mutex_init(&group->avgs_lock);
/* Init trigger-related members */ /* Init trigger-related members */
atomic_set(&group->poll_scheduled, 0);
mutex_init(&group->trigger_lock); mutex_init(&group->trigger_lock);
INIT_LIST_HEAD(&group->triggers); INIT_LIST_HEAD(&group->triggers);
memset(group->nr_triggers, 0, sizeof(group->nr_triggers)); memset(group->nr_triggers, 0, sizeof(group->nr_triggers));
...@@ -199,7 +198,7 @@ static void group_init(struct psi_group *group) ...@@ -199,7 +198,7 @@ static void group_init(struct psi_group *group)
memset(group->polling_total, 0, sizeof(group->polling_total)); memset(group->polling_total, 0, sizeof(group->polling_total));
group->polling_next_update = ULLONG_MAX; group->polling_next_update = ULLONG_MAX;
group->polling_until = 0; group->polling_until = 0;
rcu_assign_pointer(group->poll_kworker, NULL); rcu_assign_pointer(group->poll_task, NULL);
} }
void __init psi_init(void) void __init psi_init(void)
...@@ -547,47 +546,38 @@ static u64 update_triggers(struct psi_group *group, u64 now) ...@@ -547,47 +546,38 @@ static u64 update_triggers(struct psi_group *group, u64 now)
return now + group->poll_min_period; return now + group->poll_min_period;
} }
/* /* Schedule polling if it's not already scheduled. */
* Schedule polling if it's not already scheduled. It's safe to call even from
* hotpath because even though kthread_queue_delayed_work takes worker->lock
* spinlock that spinlock is never contended due to poll_scheduled atomic
* preventing such competition.
*/
static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay) static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay)
{ {
struct kthread_worker *kworker; struct task_struct *task;
/* Do not reschedule if already scheduled */ /*
if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) != 0) * Do not reschedule if already scheduled.
* Possible race with a timer scheduled after this check but before
* mod_timer below can be tolerated because group->polling_next_update
* will keep updates on schedule.
*/
if (timer_pending(&group->poll_timer))
return; return;
rcu_read_lock(); rcu_read_lock();
kworker = rcu_dereference(group->poll_kworker); task = rcu_dereference(group->poll_task);
/* /*
* kworker might be NULL in case psi_trigger_destroy races with * kworker might be NULL in case psi_trigger_destroy races with
* psi_task_change (hotpath) which can't use locks * psi_task_change (hotpath) which can't use locks
*/ */
if (likely(kworker)) if (likely(task))
kthread_queue_delayed_work(kworker, &group->poll_work, delay); mod_timer(&group->poll_timer, jiffies + delay);
else
atomic_set(&group->poll_scheduled, 0);
rcu_read_unlock(); rcu_read_unlock();
} }
static void psi_poll_work(struct kthread_work *work) static void psi_poll_work(struct psi_group *group)
{ {
struct kthread_delayed_work *dwork;
struct psi_group *group;
u32 changed_states; u32 changed_states;
u64 now; u64 now;
dwork = container_of(work, struct kthread_delayed_work, work);
group = container_of(dwork, struct psi_group, poll_work);
atomic_set(&group->poll_scheduled, 0);
mutex_lock(&group->trigger_lock); mutex_lock(&group->trigger_lock);
now = sched_clock(); now = sched_clock();
...@@ -623,6 +613,35 @@ static void psi_poll_work(struct kthread_work *work) ...@@ -623,6 +613,35 @@ static void psi_poll_work(struct kthread_work *work)
mutex_unlock(&group->trigger_lock); mutex_unlock(&group->trigger_lock);
} }
static int psi_poll_worker(void *data)
{
struct psi_group *group = (struct psi_group *)data;
struct sched_param param = {
.sched_priority = 1,
};
sched_setscheduler_nocheck(current, SCHED_FIFO, &param);
while (true) {
wait_event_interruptible(group->poll_wait,
atomic_cmpxchg(&group->poll_wakeup, 1, 0) ||
kthread_should_stop());
if (kthread_should_stop())
break;
psi_poll_work(group);
}
return 0;
}
static void poll_timer_fn(struct timer_list *t)
{
struct psi_group *group = from_timer(group, t, poll_timer);
atomic_set(&group->poll_wakeup, 1);
wake_up_interruptible(&group->poll_wait);
}
static void record_times(struct psi_group_cpu *groupc, int cpu, static void record_times(struct psi_group_cpu *groupc, int cpu,
bool memstall_tick) bool memstall_tick)
{ {
...@@ -1099,22 +1118,20 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, ...@@ -1099,22 +1118,20 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
mutex_lock(&group->trigger_lock); mutex_lock(&group->trigger_lock);
if (!rcu_access_pointer(group->poll_kworker)) { if (!rcu_access_pointer(group->poll_task)) {
struct sched_param param = { struct task_struct *task;
.sched_priority = 1,
};
struct kthread_worker *kworker;
kworker = kthread_create_worker(0, "psimon"); task = kthread_create(psi_poll_worker, group, "psimon");
if (IS_ERR(kworker)) { if (IS_ERR(task)) {
kfree(t); kfree(t);
mutex_unlock(&group->trigger_lock); mutex_unlock(&group->trigger_lock);
return ERR_CAST(kworker); return ERR_CAST(task);
} }
sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, &param); atomic_set(&group->poll_wakeup, 0);
kthread_init_delayed_work(&group->poll_work, init_waitqueue_head(&group->poll_wait);
psi_poll_work); wake_up_process(task);
rcu_assign_pointer(group->poll_kworker, kworker); timer_setup(&group->poll_timer, poll_timer_fn, 0);
rcu_assign_pointer(group->poll_task, task);
} }
list_add(&t->node, &group->triggers); list_add(&t->node, &group->triggers);
...@@ -1132,7 +1149,7 @@ static void psi_trigger_destroy(struct kref *ref) ...@@ -1132,7 +1149,7 @@ static void psi_trigger_destroy(struct kref *ref)
{ {
struct psi_trigger *t = container_of(ref, struct psi_trigger, refcount); struct psi_trigger *t = container_of(ref, struct psi_trigger, refcount);
struct psi_group *group = t->group; struct psi_group *group = t->group;
struct kthread_worker *kworker_to_destroy = NULL; struct task_struct *task_to_destroy = NULL;
if (static_branch_likely(&psi_disabled)) if (static_branch_likely(&psi_disabled))
return; return;
...@@ -1158,13 +1175,13 @@ static void psi_trigger_destroy(struct kref *ref) ...@@ -1158,13 +1175,13 @@ static void psi_trigger_destroy(struct kref *ref)
period = min(period, div_u64(tmp->win.size, period = min(period, div_u64(tmp->win.size,
UPDATES_PER_WINDOW)); UPDATES_PER_WINDOW));
group->poll_min_period = period; group->poll_min_period = period;
/* Destroy poll_kworker when the last trigger is destroyed */ /* Destroy poll_task when the last trigger is destroyed */
if (group->poll_states == 0) { if (group->poll_states == 0) {
group->polling_until = 0; group->polling_until = 0;
kworker_to_destroy = rcu_dereference_protected( task_to_destroy = rcu_dereference_protected(
group->poll_kworker, group->poll_task,
lockdep_is_held(&group->trigger_lock)); lockdep_is_held(&group->trigger_lock));
rcu_assign_pointer(group->poll_kworker, NULL); rcu_assign_pointer(group->poll_task, NULL);
} }
} }
...@@ -1172,25 +1189,23 @@ static void psi_trigger_destroy(struct kref *ref) ...@@ -1172,25 +1189,23 @@ static void psi_trigger_destroy(struct kref *ref)
/* /*
* Wait for both *trigger_ptr from psi_trigger_replace and * Wait for both *trigger_ptr from psi_trigger_replace and
* poll_kworker RCUs to complete their read-side critical sections * poll_task RCUs to complete their read-side critical sections
* before destroying the trigger and optionally the poll_kworker * before destroying the trigger and optionally the poll_task
*/ */
synchronize_rcu(); synchronize_rcu();
/* /*
* Destroy the kworker after releasing trigger_lock to prevent a * Destroy the kworker after releasing trigger_lock to prevent a
* deadlock while waiting for psi_poll_work to acquire trigger_lock * deadlock while waiting for psi_poll_work to acquire trigger_lock
*/ */
if (kworker_to_destroy) { if (task_to_destroy) {
/* /*
* After the RCU grace period has expired, the worker * After the RCU grace period has expired, the worker
* can no longer be found through group->poll_kworker. * can no longer be found through group->poll_task.
* But it might have been already scheduled before * But it might have been already scheduled before
* that - deschedule it cleanly before destroying it. * that - deschedule it cleanly before destroying it.
*/ */
kthread_cancel_delayed_work_sync(&group->poll_work); del_timer_sync(&group->poll_timer);
atomic_set(&group->poll_scheduled, 0); kthread_stop(task_to_destroy);
kthread_destroy_worker(kworker_to_destroy);
} }
kfree(t); kfree(t);
} }
......
...@@ -2429,8 +2429,8 @@ static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task) ...@@ -2429,8 +2429,8 @@ static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
return 0; return 0;
} }
const struct sched_class rt_sched_class = { const struct sched_class rt_sched_class
.next = &fair_sched_class, __attribute__((section("__rt_sched_class"))) = {
.enqueue_task = enqueue_task_rt, .enqueue_task = enqueue_task_rt,
.dequeue_task = dequeue_task_rt, .dequeue_task = dequeue_task_rt,
.yield_task = yield_task_rt, .yield_task = yield_task_rt,
......
...@@ -67,6 +67,7 @@ ...@@ -67,6 +67,7 @@
#include <linux/tsacct_kern.h> #include <linux/tsacct_kern.h>
#include <asm/tlb.h> #include <asm/tlb.h>
#include <asm-generic/vmlinux.lds.h>
#ifdef CONFIG_PARAVIRT #ifdef CONFIG_PARAVIRT
# include <asm/paravirt.h> # include <asm/paravirt.h>
...@@ -75,6 +76,8 @@ ...@@ -75,6 +76,8 @@
#include "cpupri.h" #include "cpupri.h"
#include "cpudeadline.h" #include "cpudeadline.h"
#include <trace/events/sched.h>
#ifdef CONFIG_SCHED_DEBUG #ifdef CONFIG_SCHED_DEBUG
# define SCHED_WARN_ON(x) WARN_ONCE(x, #x) # define SCHED_WARN_ON(x) WARN_ONCE(x, #x)
#else #else
...@@ -96,6 +99,7 @@ extern atomic_long_t calc_load_tasks; ...@@ -96,6 +99,7 @@ extern atomic_long_t calc_load_tasks;
extern void calc_global_load_tick(struct rq *this_rq); extern void calc_global_load_tick(struct rq *this_rq);
extern long calc_load_fold_active(struct rq *this_rq, long adjust); extern long calc_load_fold_active(struct rq *this_rq, long adjust);
extern void call_trace_sched_update_nr_running(struct rq *rq, int count);
/* /*
* Helpers for converting nanosecond timing to jiffy resolution * Helpers for converting nanosecond timing to jiffy resolution
*/ */
...@@ -310,11 +314,26 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus) ...@@ -310,11 +314,26 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
__dl_update(dl_b, -((s32)tsk_bw / cpus)); __dl_update(dl_b, -((s32)tsk_bw / cpus));
} }
static inline static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw) u64 old_bw, u64 new_bw)
{ {
return dl_b->bw != -1 && return dl_b->bw != -1 &&
dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw; cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
}
/*
* Verify the fitness of task @p to run on @cpu taking into account the
* CPU original capacity and the runtime/deadline ratio of the task.
*
* The function will return true if the CPU original capacity of the
* @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
* task and false otherwise.
*/
static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
{
unsigned long cap = arch_scale_cpu_capacity(cpu);
return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
} }
extern void init_dl_bw(struct dl_bw *dl_b); extern void init_dl_bw(struct dl_bw *dl_b);
...@@ -862,6 +881,8 @@ struct uclamp_rq { ...@@ -862,6 +881,8 @@ struct uclamp_rq {
unsigned int value; unsigned int value;
struct uclamp_bucket bucket[UCLAMP_BUCKETS]; struct uclamp_bucket bucket[UCLAMP_BUCKETS];
}; };
DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
#endif /* CONFIG_UCLAMP_TASK */ #endif /* CONFIG_UCLAMP_TASK */
/* /*
...@@ -1182,6 +1203,16 @@ struct rq_flags { ...@@ -1182,6 +1203,16 @@ struct rq_flags {
#endif #endif
}; };
/*
* Lockdep annotation that avoids accidental unlocks; it's like a
* sticky/continuous lockdep_assert_held().
*
* This avoids code that has access to 'struct rq *rq' (basically everything in
* the scheduler) from accidentally unlocking the rq if they do not also have a
* copy of the (on-stack) 'struct rq_flags rf'.
*
* Also see Documentation/locking/lockdep-design.rst.
*/
static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf) static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
{ {
rf->cookie = lockdep_pin_lock(&rq->lock); rf->cookie = lockdep_pin_lock(&rq->lock);
...@@ -1739,7 +1770,6 @@ extern const u32 sched_prio_to_wmult[40]; ...@@ -1739,7 +1770,6 @@ extern const u32 sched_prio_to_wmult[40];
#define RETRY_TASK ((void *)-1UL) #define RETRY_TASK ((void *)-1UL)
struct sched_class { struct sched_class {
const struct sched_class *next;
#ifdef CONFIG_UCLAMP_TASK #ifdef CONFIG_UCLAMP_TASK
int uclamp_enabled; int uclamp_enabled;
...@@ -1748,7 +1778,7 @@ struct sched_class { ...@@ -1748,7 +1778,7 @@ struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq); void (*yield_task) (struct rq *rq);
bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt); bool (*yield_to_task)(struct rq *rq, struct task_struct *p);
void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags); void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
...@@ -1796,7 +1826,7 @@ struct sched_class { ...@@ -1796,7 +1826,7 @@ struct sched_class {
#ifdef CONFIG_FAIR_GROUP_SCHED #ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_change_group)(struct task_struct *p, int type); void (*task_change_group)(struct task_struct *p, int type);
#endif #endif
}; } __aligned(STRUCT_ALIGNMENT); /* STRUCT_ALIGN(), vmlinux.lds.h */
static inline void put_prev_task(struct rq *rq, struct task_struct *prev) static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{ {
...@@ -1810,17 +1840,18 @@ static inline void set_next_task(struct rq *rq, struct task_struct *next) ...@@ -1810,17 +1840,18 @@ static inline void set_next_task(struct rq *rq, struct task_struct *next)
next->sched_class->set_next_task(rq, next, false); next->sched_class->set_next_task(rq, next, false);
} }
#ifdef CONFIG_SMP /* Defined in include/asm-generic/vmlinux.lds.h */
#define sched_class_highest (&stop_sched_class) extern struct sched_class __begin_sched_classes[];
#else extern struct sched_class __end_sched_classes[];
#define sched_class_highest (&dl_sched_class)
#endif #define sched_class_highest (__end_sched_classes - 1)
#define sched_class_lowest (__begin_sched_classes - 1)
#define for_class_range(class, _from, _to) \ #define for_class_range(class, _from, _to) \
for (class = (_from); class != (_to); class = class->next) for (class = (_from); class != (_to); class--)
#define for_each_class(class) \ #define for_each_class(class) \
for_class_range(class, sched_class_highest, NULL) for_class_range(class, sched_class_highest, sched_class_lowest)
extern const struct sched_class stop_sched_class; extern const struct sched_class stop_sched_class;
extern const struct sched_class dl_sched_class; extern const struct sched_class dl_sched_class;
...@@ -1930,12 +1961,7 @@ extern int __init sched_tick_offload_init(void); ...@@ -1930,12 +1961,7 @@ extern int __init sched_tick_offload_init(void);
*/ */
static inline void sched_update_tick_dependency(struct rq *rq) static inline void sched_update_tick_dependency(struct rq *rq)
{ {
int cpu; int cpu = cpu_of(rq);
if (!tick_nohz_full_enabled())
return;
cpu = cpu_of(rq);
if (!tick_nohz_full_cpu(cpu)) if (!tick_nohz_full_cpu(cpu))
return; return;
...@@ -1955,6 +1981,9 @@ static inline void add_nr_running(struct rq *rq, unsigned count) ...@@ -1955,6 +1981,9 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
unsigned prev_nr = rq->nr_running; unsigned prev_nr = rq->nr_running;
rq->nr_running = prev_nr + count; rq->nr_running = prev_nr + count;
if (trace_sched_update_nr_running_tp_enabled()) {
call_trace_sched_update_nr_running(rq, count);
}
#ifdef CONFIG_SMP #ifdef CONFIG_SMP
if (prev_nr < 2 && rq->nr_running >= 2) { if (prev_nr < 2 && rq->nr_running >= 2) {
...@@ -1969,6 +1998,10 @@ static inline void add_nr_running(struct rq *rq, unsigned count) ...@@ -1969,6 +1998,10 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
static inline void sub_nr_running(struct rq *rq, unsigned count) static inline void sub_nr_running(struct rq *rq, unsigned count)
{ {
rq->nr_running -= count; rq->nr_running -= count;
if (trace_sched_update_nr_running_tp_enabled()) {
call_trace_sched_update_nr_running(rq, count);
}
/* Check if we still need preemption */ /* Check if we still need preemption */
sched_update_tick_dependency(rq); sched_update_tick_dependency(rq);
} }
...@@ -2016,6 +2049,16 @@ void arch_scale_freq_tick(void) ...@@ -2016,6 +2049,16 @@ void arch_scale_freq_tick(void)
#endif #endif
#ifndef arch_scale_freq_capacity #ifndef arch_scale_freq_capacity
/**
* arch_scale_freq_capacity - get the frequency scale factor of a given CPU.
* @cpu: the CPU in question.
*
* Return: the frequency scale factor normalized against SCHED_CAPACITY_SCALE, i.e.
*
* f_curr
* ------ * SCHED_CAPACITY_SCALE
* f_max
*/
static __always_inline static __always_inline
unsigned long arch_scale_freq_capacity(int cpu) unsigned long arch_scale_freq_capacity(int cpu)
{ {
...@@ -2349,12 +2392,35 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {} ...@@ -2349,12 +2392,35 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#ifdef CONFIG_UCLAMP_TASK #ifdef CONFIG_UCLAMP_TASK
unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id); unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
/**
* uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
* @rq: The rq to clamp against. Must not be NULL.
* @util: The util value to clamp.
* @p: The task to clamp against. Can be NULL if you want to clamp
* against @rq only.
*
* Clamps the passed @util to the max(@rq, @p) effective uclamp values.
*
* If sched_uclamp_used static key is disabled, then just return the util
* without any clamping since uclamp aggregation at the rq level in the fast
* path is disabled, rendering this operation a NOP.
*
* Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
* will return the correct effective uclamp value of the task even if the
* static key is disabled.
*/
static __always_inline static __always_inline
unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util, unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
struct task_struct *p) struct task_struct *p)
{ {
unsigned long min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value); unsigned long min_util;
unsigned long max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value); unsigned long max_util;
if (!static_branch_likely(&sched_uclamp_used))
return util;
min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
if (p) { if (p) {
min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN)); min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
...@@ -2371,6 +2437,19 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util, ...@@ -2371,6 +2437,19 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
return clamp(util, min_util, max_util); return clamp(util, min_util, max_util);
} }
/*
* When uclamp is compiled in, the aggregation at rq level is 'turned off'
* by default in the fast path and only gets turned on once userspace performs
* an operation that requires it.
*
* Returns true if userspace opted-in to use uclamp and aggregation at rq level
* hence is active.
*/
static inline bool uclamp_is_used(void)
{
return static_branch_likely(&sched_uclamp_used);
}
#else /* CONFIG_UCLAMP_TASK */ #else /* CONFIG_UCLAMP_TASK */
static inline static inline
unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util, unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
...@@ -2378,6 +2457,11 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util, ...@@ -2378,6 +2457,11 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
{ {
return util; return util;
} }
static inline bool uclamp_is_used(void)
{
return false;
}
#endif /* CONFIG_UCLAMP_TASK */ #endif /* CONFIG_UCLAMP_TASK */
#ifdef arch_scale_freq_capacity #ifdef arch_scale_freq_capacity
......
...@@ -102,12 +102,6 @@ prio_changed_stop(struct rq *rq, struct task_struct *p, int oldprio) ...@@ -102,12 +102,6 @@ prio_changed_stop(struct rq *rq, struct task_struct *p, int oldprio)
BUG(); /* how!?, what priority? */ BUG(); /* how!?, what priority? */
} }
static unsigned int
get_rr_interval_stop(struct rq *rq, struct task_struct *task)
{
return 0;
}
static void update_curr_stop(struct rq *rq) static void update_curr_stop(struct rq *rq)
{ {
} }
...@@ -115,8 +109,8 @@ static void update_curr_stop(struct rq *rq) ...@@ -115,8 +109,8 @@ static void update_curr_stop(struct rq *rq)
/* /*
* Simple, special scheduling class for the per-CPU stop tasks: * Simple, special scheduling class for the per-CPU stop tasks:
*/ */
const struct sched_class stop_sched_class = { const struct sched_class stop_sched_class
.next = &dl_sched_class, __attribute__((section("__stop_sched_class"))) = {
.enqueue_task = enqueue_task_stop, .enqueue_task = enqueue_task_stop,
.dequeue_task = dequeue_task_stop, .dequeue_task = dequeue_task_stop,
...@@ -136,8 +130,6 @@ const struct sched_class stop_sched_class = { ...@@ -136,8 +130,6 @@ const struct sched_class stop_sched_class = {
.task_tick = task_tick_stop, .task_tick = task_tick_stop,
.get_rr_interval = get_rr_interval_stop,
.prio_changed = prio_changed_stop, .prio_changed = prio_changed_stop,
.switched_to = switched_to_stop, .switched_to = switched_to_stop,
.update_curr = update_curr_stop, .update_curr = update_curr_stop,
......
...@@ -1328,7 +1328,7 @@ sd_init(struct sched_domain_topology_level *tl, ...@@ -1328,7 +1328,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd_flags = (*tl->sd_flags)(); sd_flags = (*tl->sd_flags)();
if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS, if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
"wrong sd_flags in topology description\n")) "wrong sd_flags in topology description\n"))
sd_flags &= ~TOPOLOGY_SD_FLAGS; sd_flags &= TOPOLOGY_SD_FLAGS;
/* Apply detected topology flags */ /* Apply detected topology flags */
sd_flags |= dflags; sd_flags |= dflags;
......
...@@ -634,8 +634,7 @@ static int __init nrcpus(char *str) ...@@ -634,8 +634,7 @@ static int __init nrcpus(char *str)
{ {
int nr_cpus; int nr_cpus;
get_option(&str, &nr_cpus); if (get_option(&str, &nr_cpus) && nr_cpus > 0 && nr_cpus < nr_cpu_ids)
if (nr_cpus > 0 && nr_cpus < nr_cpu_ids)
nr_cpu_ids = nr_cpus; nr_cpu_ids = nr_cpus;
return 0; return 0;
......
...@@ -1779,6 +1779,20 @@ static struct ctl_table kern_table[] = { ...@@ -1779,6 +1779,20 @@ static struct ctl_table kern_table[] = {
.mode = 0644, .mode = 0644,
.proc_handler = sched_rt_handler, .proc_handler = sched_rt_handler,
}, },
{
.procname = "sched_deadline_period_max_us",
.data = &sysctl_sched_dl_period_max,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = proc_dointvec,
},
{
.procname = "sched_deadline_period_min_us",
.data = &sysctl_sched_dl_period_min,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = proc_dointvec,
},
{ {
.procname = "sched_rr_timeslice_ms", .procname = "sched_rr_timeslice_ms",
.data = &sysctl_sched_rr_timeslice, .data = &sysctl_sched_rr_timeslice,
...@@ -1801,6 +1815,13 @@ static struct ctl_table kern_table[] = { ...@@ -1801,6 +1815,13 @@ static struct ctl_table kern_table[] = {
.mode = 0644, .mode = 0644,
.proc_handler = sysctl_sched_uclamp_handler, .proc_handler = sysctl_sched_uclamp_handler,
}, },
{
.procname = "sched_util_clamp_min_rt_default",
.data = &sysctl_sched_uclamp_util_min_rt_default,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = sysctl_sched_uclamp_handler,
},
#endif #endif
#ifdef CONFIG_SCHED_AUTOGROUP #ifdef CONFIG_SCHED_AUTOGROUP
{ {
......
...@@ -2193,7 +2193,7 @@ EXPORT_SYMBOL(ktime_get_coarse_ts64); ...@@ -2193,7 +2193,7 @@ EXPORT_SYMBOL(ktime_get_coarse_ts64);
void do_timer(unsigned long ticks) void do_timer(unsigned long ticks)
{ {
jiffies_64 += ticks; jiffies_64 += ticks;
calc_global_load(ticks); calc_global_load();
} }
/** /**
......
...@@ -6,6 +6,7 @@ ...@@ -6,6 +6,7 @@
#include <linux/export.h> #include <linux/export.h>
#include <linux/memblock.h> #include <linux/memblock.h>
#include <linux/numa.h> #include <linux/numa.h>
#include <linux/sched/isolation.h>
/** /**
* cpumask_next - get the next cpu in a cpumask * cpumask_next - get the next cpu in a cpumask
...@@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask) ...@@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
*/ */
unsigned int cpumask_local_spread(unsigned int i, int node) unsigned int cpumask_local_spread(unsigned int i, int node)
{ {
int cpu; int cpu, hk_flags;
const struct cpumask *mask;
hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
mask = housekeeping_cpumask(hk_flags);
/* Wrap: we always want a cpu. */ /* Wrap: we always want a cpu. */
i %= num_online_cpus(); i %= cpumask_weight(mask);
if (node == NUMA_NO_NODE) { if (node == NUMA_NO_NODE) {
for_each_cpu(cpu, cpu_online_mask) for_each_cpu(cpu, mask) {
if (i-- == 0) if (i-- == 0)
return cpu; return cpu;
}
} else { } else {
/* NUMA first. */ /* NUMA first. */
for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask) for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
if (i-- == 0) if (i-- == 0)
return cpu; return cpu;
}
for_each_cpu(cpu, cpu_online_mask) { for_each_cpu(cpu, mask) {
/* Skip NUMA nodes, done above. */ /* Skip NUMA nodes, done above. */
if (cpumask_test_cpu(cpu, cpumask_of_node(node))) if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
continue; continue;
......
...@@ -190,3 +190,44 @@ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder) ...@@ -190,3 +190,44 @@ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
return __iter_div_u64_rem(dividend, divisor, remainder); return __iter_div_u64_rem(dividend, divisor, remainder);
} }
EXPORT_SYMBOL(iter_div_u64_rem); EXPORT_SYMBOL(iter_div_u64_rem);
#ifndef mul_u64_u64_div_u64
u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
{
u64 res = 0, div, rem;
int shift;
/* can a * b overflow ? */
if (ilog2(a) + ilog2(b) > 62) {
/*
* (b * a) / c is equal to
*
* (b / c) * a +
* (b % c) * a / c
*
* if nothing overflows. Can the 1st multiplication
* overflow? Yes, but we do not care: this can only
* happen if the end result can't fit in u64 anyway.
*
* So the code below does
*
* res = (b / c) * a;
* b = b % c;
*/
div = div64_u64_rem(b, c, &rem);
res = div * a;
b = rem;
shift = ilog2(a) + ilog2(b) - 62;
if (shift > 0) {
/* drop precision */
b >>= shift;
c >>= shift;
if (!c)
return res;
}
}
return res + div64_u64(a * b, c);
}
#endif
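Reviewer note: the decomposition described in the comment above can be tried out in userspace. The following is a minimal sketch only, not the kernel implementation. The names `mul_div_sketch` and `ilog2_u64` are hypothetical stand-ins, and `__builtin_clzll` assumes a GCC- or Clang-compatible compiler. As in the hunk above, the caller is assumed to pass a non-zero `c`.

```c
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for the kernel's ilog2().
 * Returns -1 for 0 so that a zero remainder never triggers a shift.
 */
static int ilog2_u64(uint64_t v)
{
	return v ? 63 - __builtin_clzll(v) : -1;
}

/* Sketch of a * b / c that avoids overflowing the intermediate product. */
static uint64_t mul_div_sketch(uint64_t a, uint64_t b, uint64_t c)
{
	uint64_t res = 0;
	int shift;

	/* Could a * b overflow 64 bits? */
	if (ilog2_u64(a) + ilog2_u64(b) > 62) {
		/* (b * a) / c == (b / c) * a + (b % c) * a / c */
		res = (b / c) * a;
		b %= c;

		shift = ilog2_u64(a) + ilog2_u64(b) - 62;
		if (shift > 0) {
			/* Trade precision for range, as the diff does. */
			b >>= shift;
			c >>= shift;
			if (!c)
				return res;
		}
	}

	return res + (a * b) / c;
}
```

With small operands the function reduces to a plain multiply and divide. With operands such as `a = b = c = 1 << 40` the quotient part `(b / c) * a` carries the whole result, and the remainder term contributes zero.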
...@@ -11,6 +11,7 @@ ...@@ -11,6 +11,7 @@
#include <linux/if_arp.h> #include <linux/if_arp.h>
#include <linux/slab.h> #include <linux/slab.h>
#include <linux/sched/signal.h> #include <linux/sched/signal.h>
#include <linux/sched/isolation.h>
#include <linux/nsproxy.h> #include <linux/nsproxy.h>
#include <net/sock.h> #include <net/sock.h>
#include <net/net_namespace.h> #include <net/net_namespace.h>
...@@ -741,7 +742,7 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue, ...@@ -741,7 +742,7 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
{ {
struct rps_map *old_map, *map; struct rps_map *old_map, *map;
cpumask_var_t mask; cpumask_var_t mask;
int err, cpu, i; int err, cpu, i, hk_flags;
static DEFINE_MUTEX(rps_map_mutex); static DEFINE_MUTEX(rps_map_mutex);
if (!capable(CAP_NET_ADMIN)) if (!capable(CAP_NET_ADMIN))
...@@ -756,6 +757,13 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue, ...@@ -756,6 +757,13 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
return err; return err;
} }
hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
cpumask_and(mask, mask, housekeeping_cpumask(hk_flags));
if (cpumask_empty(mask)) {
free_cpumask_var(mask);
return -EINVAL;
}
map = kzalloc(max_t(unsigned int, map = kzalloc(max_t(unsigned int,
RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES), RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
GFP_KERNEL); GFP_KERNEL);
......