Commit d6a3b247 authored by Mauro Carvalho Chehab, committed by Jonathan Corbet

docs: scheduler: convert docs to ReST and rename to *.rst

In order to prepare to add them to the Kernel API book,
convert the files to ReST format.

The conversion is actually:
  - add blank lines and indentation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
parent d2238840
...@@ -11,4 +11,4 @@ Description:
example would be, if User A has shares = 1024 and user
B has shares = 2048, User B will get twice the CPU
bandwidth user A will. For more details refer
Documentation/scheduler/sched-design-CFS.rst
================================================
Completions - "wait for completion" barrier APIs
================================================
...@@ -46,7 +47,7 @@ it has to wait for it.
To use completions you need to #include <linux/completion.h> and
create a static or dynamic variable of type 'struct completion',
which has only two fields::
    struct completion {
        unsigned int done;
...@@ -57,7 +58,7 @@ This provides the ->wait waitqueue to place tasks on for waiting (if any), and
the ->done completion flag for indicating whether it's completed or not.
Completions should be named to refer to the event that is being synchronized on.
A good example is::
    wait_for_completion(&early_console_added);
...@@ -81,7 +82,7 @@ have taken place, even if these wait functions return prematurely due to a timeout
or a signal triggering.
Initializing of dynamically allocated completion objects is done via a call to
init_completion()::
    init_completion(&dynamic_object->done);
...@@ -100,7 +101,8 @@ but be aware of other races.
For static declaration and initialization, macros are available.
For static (or global) declarations in file scope you can use
DECLARE_COMPLETION()::
    static DECLARE_COMPLETION(setup_done);
    DECLARE_COMPLETION(setup_done);
...@@ -111,7 +113,7 @@ initialized to 'not done' and doesn't require an init_completion() call.
When a completion is declared as a local variable within a function,
then the initialization should always use DECLARE_COMPLETION_ONSTACK()
explicitly, not just to make lockdep happy, but also to make it clear
that limited scope had been considered and is intentional::
    DECLARE_COMPLETION_ONSTACK(setup_done)
...@@ -140,11 +142,11 @@ Waiting for completions:
------------------------
For a thread to wait for some concurrent activity to finish, it
calls wait_for_completion() on the initialized completion structure::
    void wait_for_completion(struct completion *done)
A typical usage scenario is::
    CPU#1                           CPU#2
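The wait/complete handshake in the scenario above can be sketched in userspace (a Python model for illustration only; the kernel implementation uses waitqueues, spinlocks and scheduler task states, which this sketch replaces with a condition variable):

```python
import threading

class Completion:
    """Userspace sketch of 'struct completion' (simplified model)."""

    def __init__(self):
        self.done = 0                          # counts posted completions, like ->done
        self.cond = threading.Condition()      # stands in for the ->wait waitqueue

    def complete(self):
        with self.cond:
            self.done += 1
            self.cond.notify()                 # wake exactly one waiter

    def wait_for_completion(self):
        with self.cond:
            while self.done == 0:
                self.cond.wait()
            self.done -= 1                     # consume one posted completion

setup_done = Completion()
result = []

def worker():                                  # plays CPU#2: do the work, then signal
    result.append("initialized")
    setup_done.complete()

t = threading.Thread(target=worker)
t.start()
setup_done.wait_for_completion()               # plays CPU#1: blocks until signaled
t.join()
```

Note that the counter makes the handshake safe even if complete() runs before wait_for_completion() is called, which mirrors the kernel behaviour.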
...@@ -192,17 +194,17 @@ A common problem that occurs is to have unclean assignment of return types,
so take care to assign return-values to variables of the proper type.
Checking for the specific meaning of return values also has been found
to be quite inaccurate, e.g. constructs like::
    if (!wait_for_completion_interruptible_timeout(...))
... would execute the same code path for successful completion and for the
interrupted case - which is probably not what you want::
    int wait_for_completion_interruptible(struct completion *done)
This function marks the task TASK_INTERRUPTIBLE while it is waiting.
If a signal was received while waiting it will return -ERESTARTSYS; 0 otherwise::
    unsigned long wait_for_completion_timeout(struct completion *done, unsigned long timeout)
...@@ -214,7 +216,7 @@ Timeouts are preferably calculated with msecs_to_jiffies() or usecs_to_jiffies()
to make the code largely HZ-invariant.
If the returned timeout value is deliberately ignored a comment should probably explain
why (e.g. see drivers/mfd/wm8350-core.c wm8350_read_auxadc())::
    long wait_for_completion_interruptible_timeout(struct completion *done, unsigned long timeout)
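The msecs_to_jiffies() conversion recommended above can be approximated like this (assumptions: the HZ value is only illustrative, and the real kernel helper has HZ-specific fast paths and overflow handling that this sketch omits):

```python
HZ = 250  # assumption: one common CONFIG_HZ value; kernels use 100/250/300/1000

def msecs_to_jiffies(msecs):
    # Round up, so a timeout never expires earlier than requested.
    return (msecs * HZ + 999) // 1000
```

For example, at HZ=250 a 5 ms timeout rounds up to 2 jiffies rather than truncating to 1, which is why code written this way stays largely HZ-invariant.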
...@@ -225,14 +227,14 @@ jiffies if completion occurred.
Further variants include _killable which uses TASK_KILLABLE as the
designated tasks state and will return -ERESTARTSYS if it is interrupted,
or 0 if completion was achieved. There is a _timeout variant as well::
    long wait_for_completion_killable(struct completion *done)
    long wait_for_completion_killable_timeout(struct completion *done, unsigned long timeout)
The _io variants wait_for_completion_io() behave the same as the non-_io
variants, except for accounting waiting time as 'waiting on IO', which has
an impact on how the task is accounted in scheduling/IO stats::
    void wait_for_completion_io(struct completion *done)
    unsigned long wait_for_completion_io_timeout(struct completion *done, unsigned long timeout)
...@@ -243,11 +245,11 @@ Signaling completions:
A thread that wants to signal that the conditions for continuation have been
achieved calls complete() to signal exactly one of the waiters that it can
continue::
    void complete(struct completion *done)
... or calls complete_all() to signal all current and future waiters::
    void complete_all(struct completion *done)
...@@ -268,7 +270,7 @@ probably are a design bug.
Signaling completion from IRQ context is fine as it will appropriately
lock with spin_lock_irqsave()/spin_unlock_irqrestore() and it will never
sleep.
try_wait_for_completion()/completion_done():
...@@ -276,14 +278,14 @@ try_wait_for_completion()/completion_done():
The try_wait_for_completion() function will not put the thread on the wait
queue but rather returns false if it would need to enqueue (block) the thread,
else it consumes one posted completion and returns true::
    bool try_wait_for_completion(struct completion *done)
Finally, to check the state of a completion without changing it in any way,
call completion_done(), which returns false if there are no posted
completions that were not yet consumed by waiters (implying that there are
waiters) and true otherwise::
    bool completion_done(struct completion *done)
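The non-blocking semantics of these two helpers can be modelled in userspace (a simplified Python sketch; the real functions operate on the completion's 'done' counter under its spinlock, and waiters are not modelled here):

```python
import threading

class Completion:
    """Minimal model of a completion's 'done' counter (waiters not modelled)."""

    def __init__(self):
        self.done = 0
        self.lock = threading.Lock()

    def complete(self):
        with self.lock:
            self.done += 1                 # post one completion

    def try_wait_for_completion(self):
        with self.lock:
            if self.done == 0:
                return False               # would have had to block
            self.done -= 1                 # consume one posted completion
            return True

    def completion_done(self):
        with self.lock:
            return self.done != 0          # peek without consuming

c = Completion()
assert not c.try_wait_for_completion()     # nothing posted: would block
c.complete()
assert c.completion_done()                 # posted, not yet consumed
assert c.try_wait_for_completion()         # consumes it without blocking
assert not c.completion_done()
```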
...
:orphan:

===============
Linux Scheduler
===============

.. toctree::
    :maxdepth: 1

    completion
    sched-arch
    sched-bwc
    sched-deadline
    sched-design-CFS
    sched-domains
    sched-energy
    sched-nice-design
    sched-rt-group
    sched-stats
    text_files

.. only:: subproject and html

   Indices
   =======

   * :ref:`genindex`
=================================================================
CPU Scheduler implementation hints for architecture specific code
=================================================================
Nick Piggin, 2005
...@@ -35,9 +37,10 @@ Your cpu_idle routines need to obey the following rules:
4. The only time interrupts need to be disabled when checking
need_resched is if we are about to sleep the processor until
the next interrupt (this doesn't provide any protection of
need_resched, it prevents losing an interrupt):
4a. Common problem with this type of sleep appears to be::
    local_irq_disable();
    if (!need_resched()) {
        local_irq_enable();
...@@ -51,10 +54,10 @@ Your cpu_idle routines need to obey the following rules:
although it may be reasonable to do some background work or enter
a low CPU priority.
5a. If TIF_POLLING_NRFLAG is set, and we do decide to enter
an interrupt sleep, it needs to be cleared then a memory
barrier issued (followed by a test of need_resched with
interrupts disabled, as explained in 3).
arch/x86/kernel/process.c has examples of both polling and
sleeping idle functions.
...@@ -71,4 +74,3 @@ sh64 - Is sleeping racy vs interrupts? (See #4a)
sparc - IRQs on at this point(?), change local_irq_save to _disable.
- TODO: needs secondary CPUs to disable preempt (See #1)
=====================
CFS Bandwidth Control
=====================
[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.rst ]
CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.
...@@ -27,7 +28,8 @@ cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
cpu.cfs_period_us: the length of a period (in microseconds)
cpu.stat: exports throttling statistics [explained further below]
The default values are::
    cpu.cfs_period_us=100ms
    cpu.cfs_quota=-1
...@@ -55,7 +57,8 @@ For efficiency run-time is transferred between the global pool and CPU local
on large systems. The amount transferred each time such an update is required
is described as the "slice".
This is tunable via procfs::
    /proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
Larger slice values will reduce transfer overheads, while smaller values allow
...@@ -66,6 +69,7 @@ Statistics
A group's bandwidth statistics are exported via 3 fields in cpu.stat.
cpu.stat:
- nr_periods: Number of enforcement intervals that have elapsed.
- nr_throttled: Number of times the group has been throttled/limited.
- throttled_time: The total time duration (in nanoseconds) for which entities
...@@ -78,12 +82,15 @@ Hierarchical considerations
The interface enforces that an individual entity's bandwidth is always
attainable, that is: max(c_i) <= C. However, over-subscription in the
aggregate case is explicitly allowed to enable work-conserving semantics
within a hierarchy:
    e.g. \Sum (c_i) may exceed C
[ Where C is the parent's bandwidth, and c_i its children ]
There are two ways in which a group may become throttled:
a. it fully consumes its own quota within a period
b. a parent's quota is fully consumed within its period
...@@ -92,7 +99,7 @@ be allowed to until the parent's runtime is refreshed.
Examples
--------
1. Limit a group to 1 CPU worth of runtime::
If period is 250ms and quota is also 250ms, the group will get
1 CPU worth of runtime every 250ms.
...@@ -100,10 +107,10 @@ Examples
    # echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
    # echo 250000 > cpu.cfs_period_us /* period = 250ms */
2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine
With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
runtime every 500ms::
    # echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
    # echo 500000 > cpu.cfs_period_us /* period = 500ms */
...@@ -112,11 +119,10 @@ Examples
3. Limit a group to 20% of 1 CPU.
With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU::
    # echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
    # echo 50000 > cpu.cfs_period_us /* period = 50ms */
By using a small period here we are ensuring a consistent latency
response at the expense of burst capacity.
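The quota/period arithmetic behind the three examples can be checked with a small helper (the name cpus_worth is ours; only the numbers come from the examples above):

```python
def cpus_worth(quota_us, period_us):
    """CPUs' worth of runtime a group receives per period; quota == -1 means unlimited."""
    if quota_us < 0:
        return float("inf")
    return quota_us / period_us

# The three examples above:
assert cpus_worth(250000, 250000) == 1.0    # example 1: one full CPU
assert cpus_worth(1000000, 500000) == 2.0   # example 2: two CPUs' worth
assert cpus_worth(10000, 50000) == 0.2      # example 3: 20% of one CPU
```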
=============
CFS Scheduler
=============
1. OVERVIEW
============
CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the
...@@ -27,6 +28,7 @@ is its actual runtime normalized to the total number of running tasks.
2. FEW IMPLEMENTATION DETAILS
==============================
In CFS the virtual runtime is expressed and tracked via the per-task
p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately
...@@ -49,6 +51,7 @@ algorithm variants to recognize sleepers.
3. THE RBTREE
==============
CFS's design is quite radical: it does not use the old data structures for the
runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
...@@ -84,6 +87,7 @@ picked and the current task is preempted.
4. SOME FEATURES OF CFS
========================
CFS uses nanosecond granularity accounting and does not rely on any jiffies or
other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the
...@@ -113,6 +117,7 @@ result.
5. Scheduling policies
======================
CFS implements three scheduling policies:
...@@ -137,6 +142,7 @@ SCHED_IDLE.
6. SCHEDULING CLASSES
======================
The new CFS scheduler has been designed in such a way to introduce "Scheduling
Classes," an extensible hierarchy of scheduler modules. These modules
...@@ -197,6 +203,7 @@ This is the (partial) list of the hooks:
7. GROUP SCHEDULER EXTENSIONS TO CFS
=====================================
Normally, the scheduler operates on individual tasks and strives to provide
fair CPU time to each task. Sometimes, it may be desirable to group tasks and
...@@ -219,7 +226,7 @@ SCHED_BATCH) tasks.
When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
group created using the pseudo filesystem. See example steps below to create
task groups and modify their CPU share using the "cgroups" pseudo filesystem::
    # mount -t tmpfs cgroup_root /sys/fs/cgroup
    # mkdir /sys/fs/cgroup/cpu
...
=================
Scheduler Domains
=================
Each CPU has a "base" scheduling domain (struct sched_domain). The domain
hierarchy is built from these base domains via the ->parent pointer. ->parent
MUST be NULL terminated, and domain structures should be per-CPU as they are
...@@ -46,7 +50,9 @@ CPU's runqueue and the newly found busiest one and starts moving tasks from it
to our runqueue. The exact number of tasks amounts to an imbalance previously
computed while iterating over this sched domain's groups.
Implementing sched domains
==========================
The "base" domain will "span" the first level of the hierarchy. In the case
of SMT, you'll span all siblings of the physical CPU, with each group being
a single virtual CPU.
...
=======================
Energy Aware Scheduling
=======================
1. Introduction
---------------
...@@ -12,7 +12,7 @@ with a minimal impact on throughput. This document aims at providing an
introduction on how EAS works, what are the main design decisions behind it, and
details what is needed to get it to run.
Before going any further, please note that at the time of writing::
    /!\ EAS does not support platforms with symmetric CPU topologies /!\
...@@ -33,13 +33,13 @@ To make it clear from the start:
- power = energy/time = [joule/second] = [watt]
The goal of EAS is to minimize energy, while still getting the job done. That
is, we want to maximize::
    performance [inst/s]
    --------------------
        power [W]
which is equivalent to minimizing::
    energy [J]
    -----------
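Since performance and power are both rates per unit time, the time term cancels, which is why maximizing performance/power is the same as minimizing energy per instruction. A quick numeric check (illustrative numbers only):

```python
import math

def perf_per_watt(instructions, energy_j, time_s):
    performance = instructions / time_s    # [inst/s]
    power = energy_j / time_s              # [J/s] = [W]
    return performance / power

def inst_per_joule(instructions, energy_j):
    return instructions / energy_j

# The ratio does not depend on how long the run took: the time term cancels.
assert math.isclose(perf_per_watt(1e9, 5.0, 2.0), inst_per_joule(1e9, 5.0))
assert math.isclose(perf_per_watt(1e9, 5.0, 7.0), inst_per_joule(1e9, 5.0))
```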
...@@ -97,7 +97,7 @@ domains can contain duplicate elements.
Example 1.
Let us consider a platform with 12 CPUs, split in 3 performance domains
(pd0, pd4 and pd8), organized as follows::
    CPUs: 0 1 2 3 4 5 6 7 8 9 10 11
    PDs:  |--pd0--|--pd4--|---pd8---|
...@@ -108,6 +108,7 @@ Example 1.
containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
above figure. Since pd4 intersects with both rd1 and rd2, it will be
present in the linked list '->pd' attached to each of them:
* rd1->pd: pd0 -> pd4
* rd2->pd: pd4 -> pd8
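The '->pd' lists above follow from a plain CPU-set intersection; a sketch using this example's layout (the CPU ranges are read off the figure, and only the pd/rd names come from the text):

```python
# Performance domains and root domains from Example 1 (CPU ranges per the figure).
pds = {"pd0": set(range(0, 4)),
       "pd4": set(range(4, 8)),
       "pd8": set(range(8, 12))}
rds = {"rd1": set(range(0, 6)),
       "rd2": set(range(6, 12))}

def pd_list(rd_name):
    # A performance domain appears in rd->pd iff it shares at least one CPU with rd.
    return [name for name in sorted(pds) if pds[name] & rds[rd_name]]

assert pd_list("rd1") == ["pd0", "pd4"]   # pd4 straddles the rd1/rd2 boundary
assert pd_list("rd2") == ["pd4", "pd8"]   # so it shows up in both lists
```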
...@@ -159,9 +160,9 @@ Example 2.
Each performance domain has three Operating Performance Points (OPPs).
The CPU capacity and power cost associated with each OPP is listed in
the Energy Model table. The util_avg of P is shown on the figures
below as 'PP'::
    CPU util.
    1024 - - - - - - -          Energy Model
                        +-----------+-------------+
                        |  Little   |     Big     |
...@@ -188,8 +189,7 @@ Example 2.
(which is coherent with the behaviour of the schedutil CPUFreq
governor, see Section 6. for more details on this topic).
**Case 1. P is migrated to CPU1**::
    1024 - - - - - - -
...@@ -207,8 +207,7 @@ Example 2.
    CPU0   CPU1   CPU2   CPU3
**Case 2. P is migrated to CPU3**::
    1024 - - - - - - -
...@@ -226,8 +225,7 @@ Example 2.
    CPU0   CPU1   CPU2   CPU3
**Case 3. P stays on prev_cpu / CPU 0**::
    1024 - - - - - - -
...@@ -324,7 +322,9 @@ hardware properties and on other features of the kernel being enabled. This
section lists these dependencies and provides hints as to how they can be met.
6.1 - Asymmetric CPU topology
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned in the introduction, EAS is only supported on platforms with
asymmetric CPU topologies for now. This requirement is checked at run-time by
...@@ -347,7 +347,8 @@ significant savings on SMP platforms have been observed yet. This restriction
could be amended in the future if proven otherwise.
6.2 - Energy Model presence
^^^^^^^^^^^^^^^^^^^^^^^^^^^
EAS uses the EM of a platform to estimate the impact of scheduling decisions on
energy. So, your platform must provide power cost tables to the EM framework in
...@@ -358,7 +359,8 @@ Please also note that the scheduling domains need to be re-built after the
EM has been registered in order to start EAS.
6.3 - Energy Model complexity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The task wake-up path is very latency-sensitive. When the EM of a platform is
too complex (too many CPUs, too many performance domains, too many performance
...@@ -388,7 +390,8 @@ two possible options:
hence enabling it to cope with larger EMs in reasonable time.
6.4 - Schedutil governor
^^^^^^^^^^^^^^^^^^^^^^^^
EAS tries to predict at which OPP will the CPUs be running in the close future
in order to estimate their energy consumption. To do so, it is assumed that OPPs
...@@ -405,7 +408,8 @@ frequency requests and energy predictions. ...@@ -405,7 +408,8 @@ frequency requests and energy predictions.
Using EAS with any other governor than schedutil is not supported. Using EAS with any other governor than schedutil is not supported.
6.5 Scale-invariant utilization signals 6.5 Scale-invariant utilization signals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In order to make accurate prediction across CPUs and for all performance In order to make accurate prediction across CPUs and for all performance
states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
...@@ -416,7 +420,8 @@ Using EAS on a platform that doesn't implement these two callbacks is not ...@@ -416,7 +420,8 @@ Using EAS on a platform that doesn't implement these two callbacks is not
supported. supported.
6.6 Multithreading (SMT) 6.6 Multithreading (SMT)
^^^^^^^^^^^^^^^^^^^^^^^^
EAS in its current form is SMT unaware and is not able to leverage EAS in its current form is SMT unaware and is not able to leverage
multithreaded hardware to save energy. EAS considers threads as independent multithreaded hardware to save energy. EAS considers threads as independent
......
=====================
Scheduler Nice Design
=====================
This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.
...@@ -14,7 +18,7 @@ much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!)::

   A
......
==========================
Real-Time group scheduling
==========================

.. CONTENTS

   0. WARNING
   1. Overview
     1.1 The problem
     1.2 The solution
   2. The interface
     2.1 System-wide settings
     2.2 Default behaviour
     2.3 Basis for grouping tasks
   3. Future plans

0. WARNING
...@@ -159,9 +159,11 @@ Consider two sibling groups A and B; both have 50% bandwidth, but A's
period is twice the length of B's.

* group A: period=100000us, runtime=50000us

    - this runs for 0.05s once every 0.1s

* group B: period= 50000us, runtime=25000us

    - this runs for 0.025s twice every 0.1s (or once every 0.05 sec).
This means that currently a while (1) loop in A will run for the full period of
......
====================
Scheduler Statistics
====================
Version 15 of schedstats dropped counters for some sched_yield:
yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
identical to version 14.
...@@ -35,19 +39,23 @@ CPU statistics
cpu<N> 1 2 3 4 5 6 7 8 9

First field is a sched_yield() statistic:

     1) # of times sched_yield() was called

Next three are schedule() statistics:

     2) This field is a legacy array expiration count field used in the O(1)
        scheduler. We kept it for ABI compatibility, but it is always set to zero.
     3) # of times schedule() was called
     4) # of times schedule() left the processor idle

Next two are try_to_wake_up() statistics:

     5) # of times try_to_wake_up() was called
     6) # of times try_to_wake_up() was called to wake up the local cpu

Next three are statistics describing scheduling latency:

     7) sum of all time spent running by tasks on this processor (in jiffies)
     8) sum of all time spent waiting to run by tasks on this processor (in
        jiffies)
...@@ -67,24 +75,23 @@ The first field is a bit mask indicating what cpus this domain operates over.
The next 24 are a variety of load_balance() statistics grouped into types
of idleness (idle, busy, and newly idle):

    1)  # of times in this domain load_balance() was called when the
        cpu was idle
    2)  # of times in this domain load_balance() checked but found
        the load did not require balancing when the cpu was idle
    3)  # of times in this domain load_balance() tried to move one or
        more tasks and failed, when the cpu was idle
    4)  sum of imbalances discovered (if any) with each call to
        load_balance() in this domain when the cpu was idle
    5)  # of times in this domain pull_task() was called when the cpu
        was idle
    6)  # of times in this domain pull_task() was called even though
        the target task was cache-hot when idle
    7)  # of times in this domain load_balance() was called but did
        not find a busier queue while the cpu was idle
    8)  # of times in this domain a busier queue was found while the
        cpu was idle but no busier group was found

    9)  # of times in this domain load_balance() was called when the
        cpu was busy
    10) # of times in this domain load_balance() checked but found the
        load did not require balancing when busy
...@@ -117,21 +124,25 @@ of idleness (idle, busy, and newly idle):
        was just becoming idle but no busier group was found

Next three are active_load_balance() statistics:

    25) # of times active_load_balance() was called
    26) # of times active_load_balance() tried to move a task and failed
    27) # of times active_load_balance() successfully moved a task

Next three are sched_balance_exec() statistics:

    28) sbe_cnt is not used
    29) sbe_balanced is not used
    30) sbe_pushed is not used

Next three are sched_balance_fork() statistics:

    31) sbf_cnt is not used
    32) sbf_balanced is not used
    33) sbf_pushed is not used

Next three are try_to_wake_up() statistics:

    34) # of times in this domain try_to_wake_up() awoke a task that
        last ran on a different cpu in this domain
    35) # of times in this domain try_to_wake_up() moved a task to the
...@@ -139,10 +150,11 @@ of idleness (idle, busy, and newly idle):
    36) # of times in this domain try_to_wake_up() started passive balancing

/proc/<pid>/schedstat
---------------------

schedstats also adds a new /proc/<pid>/schedstat file to include some of
the same information on a per-process level. There are three fields in
this file correlating for that process to:

     1) time spent on the cpu
     2) time spent waiting on a runqueue
     3) # of timeslices run on this cpu
...@@ -151,4 +163,5 @@ A program could be easily written to make use of these extra fields to
report on how well a particular process or set of processes is faring
under the scheduler's policies. A simple version of such a program is
available at

    http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c
Scheduler pelt c program
------------------------
.. literalinclude:: sched-pelt.c
:language: c
...@@ -99,7 +99,7 @@ Local allocation will tend to keep subsequent access to the allocated memory
as long as the task on whose behalf the kernel allocated some memory does not
later migrate away from that memory. The Linux scheduler is aware of the
NUMA topology of the platform--embodied in the "scheduling domains" data
structures [see Documentation/scheduler/sched-domains.rst]--and the scheduler
attempts to minimize task migration to distant scheduling domains. However,
the scheduler does not take a task's NUMA footprint into account directly.
Thus, under sufficient imbalance, tasks can migrate between nodes, remote
......
...@@ -734,7 +734,7 @@ menuconfig CGROUPS
	  use with process control subsystems such as Cpusets, CFS, memory
	  controls or device isolation.
	  See
		- Documentation/scheduler/sched-design-CFS.rst (CFS)
		- Documentation/cgroup-v1/ (features for grouping, isolation
					  and resource control)
...@@ -835,7 +835,7 @@ config CFS_BANDWIDTH
	  tasks running within the fair group scheduler. Groups with no limit
	  set are considered to be unconstrained and will run with no
	  restriction.
	  See Documentation/scheduler/sched-bwc.rst for more information.

config RT_GROUP_SCHED
	bool "Group scheduling for SCHED_RR/FIFO"
...@@ -846,7 +846,7 @@ config RT_GROUP_SCHED
	  to task groups. If enabled, it will also make it impossible to
	  schedule realtime tasks for non-root users until you allocate
	  realtime bandwidth for them.
	  See Documentation/scheduler/sched-rt-group.rst for more information.

endif #CGROUP_SCHED
......
...@@ -726,7 +726,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
	 * refill the runtime and set the deadline a period in the future,
	 * because keeping the current (absolute) deadline of the task would
	 * result in breaking guarantees promised to other tasks (refer to
	 * Documentation/scheduler/sched-deadline.rst for more information).
	 *
	 * This function returns true if:
	 *
......