Commit 6c292092 authored by Tejun Heo's avatar Tejun Heo

cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation

Now that cgroup v2 is almost out of the door, replace the development
documentation unified-hierarchy.txt with Documentation/cgroup.txt
which is a superset of unified-hierarchy.txt and authoritatively
describes all userland-visible aspects of cgroup.

v2: Updated to include all information from blkio-controller.txt and
    list filesystems which support cgroup writeback as suggested by
    Vivek.
Signed-off-by: default avatarTejun Heo <tj@kernel.org>
Acked-by: default avatarLi Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
parent 0d942766
......@@ -374,82 +374,3 @@ One can experience an overall throughput drop if you have created multiple
groups and put applications in that group which are not driving enough
IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
on individual groups and throughput should improve.
Writeback
=========
Page cache is dirtied through buffered writes and shared mmaps and
written asynchronously to the backing filesystem by the writeback
mechanism. Writeback sits between the memory and IO domains and
regulates the proportion of dirty memory by balancing dirtying and
write IOs.
On traditional cgroup hierarchies, relationships between different
controllers cannot be established making it impossible for writeback
to operate accounting for cgroup resource restrictions and all
writeback IOs are attributed to the root cgroup.
If both the blkio and memory controllers are used on the v2 hierarchy
and the filesystem supports cgroup writeback, writeback operations
correctly follow the resource restrictions imposed by both memory and
blkio controllers.
Writeback examines both system-wide and per-cgroup dirty memory status
and enforces the more restrictive of the two. Also, writeback control
parameters which are absolute values - vm.dirty_bytes and
vm.dirty_background_bytes - are distributed across cgroups according
to their current writeback bandwidth.
There's a peculiarity stemming from the discrepancy in ownership
granularity between memory controller and writeback. While memory
controller tracks ownership per page, writeback operates on inode
basis. cgroup writeback bridges the gap by tracking ownership by
inode but migrating ownership if too many foreign pages, pages which
don't match the current inode ownership, have been encountered while
writing back the inode.
This is a conscious design choice as writeback operations are
inherently tied to inodes making strictly following page ownership
complicated and inefficient. The only use case which suffers from
this compromise is multiple cgroups concurrently dirtying disjoint
regions of the same inode, which is an unlikely use case and decided
to be unsupported. Note that as memory controller assigns page
ownership on the first use and doesn't update it until the page is
released, even if cgroup writeback strictly follows page ownership,
multiple cgroups dirtying overlapping areas wouldn't work as expected.
In general, write-sharing an inode across multiple cgroups is not well
supported.
Filesystem support for cgroup writeback
---------------------------------------
A filesystem can make writeback IOs cgroup-aware by updating
address_space_operations->writepage[s]() to annotate bio's using the
following two functions.
* wbc_init_bio(@wbc, @bio)
Should be called for each bio carrying writeback data and associates
the bio with the inode's owner cgroup. Can be called anytime
between bio allocation and submission.
* wbc_account_io(@wbc, @page, @bytes)
Should be called for each data segment being written out. While
this function doesn't care exactly when it's called during the
writeback session, it's the easiest and most natural to call it as
data segments are added to a bio.
With writeback bio's annotated, cgroup support can be enabled per
super_block by setting MS_CGROUPWB in ->s_flags. This allows for
selective disabling of cgroup writeback support which is helpful when
certain filesystem features, e.g. journaled data mode, are
incompatible.
wbc_init_bio() binds the specified bio to its cgroup. Depending on
the configuration, the bio may be executed at a lower priority and if
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion. There is no one easy solution
for the problem. Filesystems can try to work around specific problem
cases by skipping wbc_init_bio() or using bio_associate_blkcg()
directly.
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment