Commit c1859213 authored by Andrew Morton, committed by Jaroslav Kysela

[PATCH] Add /proc/sys/vm/lower_zone_protection

This allows us to control the aggressiveness of the lower-zone defense
algorithm: the `incremental min'.  For workloads which use a serious
amount of mlocked memory, a few megabytes of protection is not enough.

So the `lower_zone_protection' tunable allows the administrator to
increase the amount of protection which lower zones receive against
allocations which _could_ use higher zones.

The default value of lower_zone_protection is zero, giving unchanged
behaviour.  We should not normally make large amounts of memory
unavailable for pagecache just in case someone mlocks many hundreds of
megabytes.
parent 20b96b52
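
To make the `incremental min' concrete, here is a minimal userspace sketch
of the defense algorithm as this patch shapes it.  The zone names mirror the
kernel's fallback order, but the free-page and watermark numbers are invented
for illustration; the authoritative logic is in the mm/page_alloc.c hunks
further down.

    #include <stdio.h>

    /* Invented numbers - real zones size their watermarks at boot. */
    struct zone {
        const char *name;
        long free_pages;    /* pages currently free in this zone */
        long pages_low;     /* the zone's low watermark */
    };

    /* Fallback order for an allocation which could use highmem. */
    static struct zone zonelist[] = {
        { "HighMem", 100,  1000 },
        { "Normal",  5000, 500  },
        { "DMA",     3000, 100  },
    };

    static long sysctl_lower_zone_protection = 4;    /* the new tunable */

    static struct zone *alloc_page_sim(void)
    {
        long min = 0;
        size_t i;

        for (i = 0; i < sizeof(zonelist) / sizeof(zonelist[0]); i++) {
            struct zone *z = &zonelist[i];

            /* The existing incremental min: each zone we fall back
             * through adds its own low watermark to the bar. */
            min += z->pages_low;
            if (z->free_pages > min) {
                z->free_pages--;
                return z;
            }
            /* The patch: every fallback raises the bar further, so an
             * allocation which _could_ have used a higher zone only
             * takes lowmem when lowmem is comfortably free. */
            min += z->pages_low * sysctl_lower_zone_protection;
        }
        return NULL;    /* the kernel would enter the reclaim slow path */
    }

    int main(void)
    {
        struct zone *z = alloc_page_sim();

        printf("satisfied from: %s\n", z ? z->name : "(nothing - reclaim)");
        return 0;
    }

With the tunable at its default of zero the extra terms vanish and this
degrades to the old behaviour; in the run above, a protection of 4 sends the
request to the reclaim path rather than letting it pin the DMA zone.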
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -989,42 +989,58 @@ for writeout by the pdflush daemons.  It is expressed in 100'ths of a second.
 Data which has been dirty in-memory for longer than this interval will be
 written out next time a pdflush daemon wakes up.
 
-kswapd
-------
+lower_zone_protection
+---------------------
 
-Kswapd is the kernel swap out daemon.  That is, kswapd is the piece of the
-kernel that frees memory when it gets fragmented or full.  Since every system
-is different, you'll probably want some control over this piece of the system.
+For some specialised workloads on highmem machines it is dangerous for
+the kernel to allow process memory to be allocated from the "lowmem"
+zone.  This is because that memory could then be pinned via the mlock()
+system call, or by unavailability of swapspace.
 
-The file contains three numbers:
+On large highmem machines this lack of reclaimable lowmem can be fatal.
 
-tries_base
-----------
+So the Linux page allocator has a mechanism which prevents allocations
+which _could_ use highmem from using too much lowmem.  This means that
+a certain amount of lowmem is defended from the possibility of being
+captured into pinned user memory.
 
-The maximum number of pages kswapd tries to free in one round is calculated
-from this number.  Usually this number will be divided by 4 or 8 (see
-mm/vmscan.c), so it isn't as big as it looks.
+(The same argument applies to the old 16 megabyte ISA DMA region.  This
+mechanism will also defend that region from allocations which could use
+highmem or lowmem.)
 
-When you need to increase the bandwidth to/from swap, you'll want to increase
-this number.
+The `lower_zone_protection' tunable determines how aggressively the kernel
+defends these lower zones.  The default value is zero - no protection at all.
 
-tries_min
----------
+If you have a machine which uses highmem or ISA DMA and your applications
+are using mlock(), or if you are running with no swap, then you probably
+should increase the lower_zone_protection setting.
 
-This is the minimum number of times kswapd tries to free a page each time it
-is called.  Basically it's just there to make sure that kswapd frees some
-pages even when it's being called with minimum priority.
+The units of this tunable are fairly vague.  It is approximately equal to
+"megabytes", so setting lower_zone_protection=100 will protect around 100
+megabytes of the lowmem zone from user allocations.  It will also make
+those 100 megabytes unavailable for use by applications and by pagecache,
+so there is a cost.
+
+The effects of this tunable may be observed by monitoring
+/proc/meminfo:LowFree.  Write a single huge file and observe the point at
+which LowFree ceases to fall.
+
+A reasonable value for lower_zone_protection is 100.
 
-swap_cluster
+page-cluster
 ------------
 
-This is probably the greatest influence on system performance.
+page-cluster controls the number of pages which are written to swap in a
+single attempt - the swap I/O size.
 
-swap_cluster is the number of pages kswapd writes in one turn.  You'll want
-this value to be large so that kswapd does its I/O in large chunks and the
-disk doesn't have to seek as often, but you don't want it to be too large
-since that would flood the request queue.
+It is a logarithmic value: setting it to zero means "1 page", setting it
+to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+
+The default value is three (eight pages at a time).  There may be some
+small benefit in tuning this to a different value if your workload is
+swap-intensive.
 
 overcommit_memory
 -----------------
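The new documentation suggests observing /proc/meminfo:LowFree while writing
a single huge file.  A tiny watcher such as the following sketch makes that
experiment easy; note that the LowFree field is only reported by kernels
built with highmem support.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[128];
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f) {
            perror("/proc/meminfo");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "LowFree:", 8) == 0) {
                fputs(line, stdout);    /* e.g. "LowFree:   123456 kB" */
                break;
            }
        }
        fclose(f);
        return 0;
    }

Run it repeatedly during the file write; the point at which the printed value
stops falling is the floor that lower_zone_protection has established.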
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -154,6 +154,7 @@ enum
 	VM_PAGEBUF=17,		/* struct: Control pagebuf parameters */
 	VM_HUGETLB_PAGES=18,	/* int: Number of available Huge Pages */
 	VM_SWAPPINESS=19,	/* Tendency to steal mapped memory */
+	VM_LOWER_ZONE_PROTECTION=20,/* Amount of protection of lower zones */
 };
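
The enum value doubles as the binary sysctl number, so besides the procfs
file the tunable becomes addressable as {CTL_VM, VM_LOWER_ZONE_PROTECTION}
through the old-style sysctl(2) interface.  A sketch using the glibc wrapper
of that era (the wrapper is long deprecated, and the constant is defined
locally here in case the installed headers predate this patch):

    #include <stdio.h>
    #include <sys/sysctl.h>    /* glibc's (since removed) sysctl() wrapper */

    #ifndef VM_LOWER_ZONE_PROTECTION
    #define VM_LOWER_ZONE_PROTECTION 20    /* the value this patch adds */
    #endif

    int main(void)
    {
        int name[] = { CTL_VM, VM_LOWER_ZONE_PROTECTION };
        int val = 100;

        /* Write the new value; the "old value" slots are passed as NULL. */
        if (sysctl(name, 2, NULL, NULL, &val, sizeof(val)) < 0) {
            perror("sysctl");
            return 1;
        }
        return 0;
    }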
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -53,6 +53,7 @@ extern int core_uses_pid;
 extern char core_pattern[];
 extern int cad_pid;
 extern int pid_max;
+extern int sysctl_lower_zone_protection;
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -310,8 +311,13 @@ static ctl_table vm_table[] = {
 	 0644, NULL, &proc_dointvec_minmax, &sysctl_intvec, NULL, &zero,
 	 &one_hundred },
 #ifdef CONFIG_HUGETLB_PAGE
-	{VM_HUGETLB_PAGES, "nr_hugepages", &htlbpage_max, sizeof(int), 0644, NULL, &hugetlb_sysctl_handler},
+	{VM_HUGETLB_PAGES, "nr_hugepages", &htlbpage_max, sizeof(int), 0644,
+	 NULL, &hugetlb_sysctl_handler},
 #endif
+	{VM_LOWER_ZONE_PROTECTION, "lower_zone_protection",
+	 &sysctl_lower_zone_protection, sizeof(sysctl_lower_zone_protection),
+	 0644, NULL, &proc_dointvec_minmax, &sysctl_intvec, NULL, &zero,
+	 NULL, },
 	{0}
 };
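
Once the table entry is registered, setting the tunable is just a write of an
integer to the new file - the equivalent of
`echo 100 > /proc/sys/vm/lower_zone_protection'.  Because extra1 is &zero and
extra2 is NULL, proc_dointvec_minmax rejects negative values but imposes no
upper bound.  A sketch (root, and a kernel carrying this patch, assumed):

    #include <stdio.h>

    int main(void)
    {
        /* 100 is the "reasonable value" the documentation above suggests. */
        FILE *f = fopen("/proc/sys/vm/lower_zone_protection", "w");

        if (!f) {
            perror("/proc/sys/vm/lower_zone_protection");
            return 1;
        }
        fprintf(f, "%d\n", 100);
        fclose(f);
        return 0;
    }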
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -38,7 +38,7 @@ unsigned long totalram_pages;
 unsigned long totalhigh_pages;
 int nr_swap_pages;
 int numnodes = 1;
+int sysctl_lower_zone_protection = 0;
 
 /*
  * Used by page_zone() to look up the address of the struct zone whose
@@ -470,6 +470,7 @@ __alloc_pages(unsigned int gfp_mask, unsigned int order,
 			if (page)
 				return page;
 		}
+		min += z->pages_low * sysctl_lower_zone_protection;
 	}
 
 	/* we're somewhat low on memory, failed to find what we needed */
@@ -492,6 +493,7 @@ __alloc_pages(unsigned int gfp_mask, unsigned int order,
 			if (page)
 				return page;
 		}
+		min += local_min * sysctl_lower_zone_protection;
 	}
 
 	/* here we're in the low on memory slow path */
@@ -529,6 +531,7 @@ __alloc_pages(unsigned int gfp_mask, unsigned int order,
 			if (page)
 				return page;
 		}
+		min += z->pages_low * sysctl_lower_zone_protection;
 	}
 
 	/*