Commit c1859213 authored by Andrew Morton, committed by Jaroslav Kysela

[PATCH] Add /proc/sys/vm/lower_zone_protection

This allows us to control the aggressiveness of the lower-zone defense
algorithm: the `incremental min'.  For workloads which use a serious
amount of mlocked memory, a few megabytes of protection is not enough.

So the `lower_zone_protection' tunable allows the administrator to
increase the amount of protection which lower zones receive against
allocations which _could_ use higher zones.

The default value of lower_zone_protection is zero, giving unchanged
behaviour.  We should not normally make large amounts of memory
unavailable for pagecache just in case someone mlocks many hundreds of
megabytes.
parent 20b96b52
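
To make the `incremental min' concrete, here is a minimal userspace sketch
of the defense algorithm as this patch shapes it.  The zone names mirror the
kernel's fallback order, but the free-page and watermark numbers are invented
for illustration; the authoritative logic is in the mm/page_alloc.c hunks
further down.

    #include <stdio.h>

    /* Invented numbers - real zones size their watermarks at boot. */
    struct zone {
        const char *name;
        long free_pages;    /* pages currently free in this zone */
        long pages_low;     /* the zone's low watermark */
    };

    /* Fallback order for an allocation which could use highmem. */
    static struct zone zonelist[] = {
        { "HighMem", 100,  1000 },
        { "Normal",  5000, 500  },
        { "DMA",     3000, 100  },
    };

    static long sysctl_lower_zone_protection = 4;    /* the new tunable */

    static struct zone *alloc_page_sim(void)
    {
        long min = 0;
        size_t i;

        for (i = 0; i < sizeof(zonelist) / sizeof(zonelist[0]); i++) {
            struct zone *z = &zonelist[i];

            /* The existing incremental min: each zone we fall back
             * through adds its own low watermark to the bar. */
            min += z->pages_low;
            if (z->free_pages > min) {
                z->free_pages--;
                return z;
            }
            /* The patch: every fallback raises the bar further, so an
             * allocation which _could_ have used a higher zone only
             * takes lowmem when lowmem is comfortably free. */
            min += z->pages_low * sysctl_lower_zone_protection;
        }
        return NULL;    /* the kernel would enter the reclaim slow path */
    }

    int main(void)
    {
        struct zone *z = alloc_page_sim();

        printf("satisfied from: %s\n", z ? z->name : "(nothing - reclaim)");
        return 0;
    }

With the tunable at its default of zero the extra terms vanish and this
degrades to the old behaviour; in the run above, a protection of 4 sends the
request to the reclaim path rather than letting it pin the DMA zone.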
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -989,42 +989,58 @@ for writeout by the pdflush daemons.  It is expressed in 100'ths of a second.
 Data which has been dirty in-memory for longer than this interval will be
 written out next time a pdflush daemon wakes up.
 
-kswapd
-------
+lower_zone_protection
+---------------------
 
-Kswapd is the kernel swap out daemon.  That is, kswapd is the piece of the
-kernel that frees memory when it gets fragmented or full.  Since every system
-is different, you'll probably want some control over this piece of the system.
+For some specialised workloads on highmem machines it is dangerous for
+the kernel to allow process memory to be allocated from the "lowmem"
+zone.  This is because that memory could then be pinned via the mlock()
+system call, or by unavailability of swapspace.
 
-The file contains three numbers:
+On large highmem machines this lack of reclaimable lowmem can be fatal.
 
-tries_base
-----------
+So the Linux page allocator has a mechanism which prevents allocations
+which _could_ use highmem from using too much lowmem.  This means that
+a certain amount of lowmem is defended from the possibility of being
+captured into pinned user memory.
 
-The maximum number of pages kswapd tries to free in one round is calculated
-from this number.  Usually this number will be divided by 4 or 8 (see
-mm/vmscan.c), so it isn't as big as it looks.
+(The same argument applies to the old 16 megabyte ISA DMA region.  This
+mechanism will also defend that region from allocations which could use
+highmem or lowmem.)
 
-When you need to increase the bandwidth to/from swap, you'll want to increase
-this number.
+The `lower_zone_protection' tunable determines how aggressively the kernel
+defends these lower zones.  The default value is zero - no protection at all.
 
-tries_min
----------
+If you have a machine which uses highmem or ISA DMA and your applications
+are using mlock(), or if you are running with no swap, then you probably
+should increase the lower_zone_protection setting.
 
-This is the minimum number of times kswapd tries to free a page each time it
-is called.  Basically it's just there to make sure that kswapd frees some
-pages even when it's being called with minimum priority.
+The units of this tunable are fairly vague.  It is approximately equal to
+"megabytes", so setting lower_zone_protection=100 will protect around 100
+megabytes of the lowmem zone from user allocations.  It will also make
+those 100 megabytes unavailable for use by applications and by pagecache,
+so there is a cost.
+
+The effects of this tunable may be observed by monitoring
+/proc/meminfo:LowFree.  Write a single huge file and observe the point at
+which LowFree ceases to fall.
+
+A reasonable value for lower_zone_protection is 100.
 
-swap_cluster
+page-cluster
 ------------
 
-This is probably the greatest influence on system performance.
+page-cluster controls the number of pages which are written to swap in a
+single attempt - the swap I/O size.
 
-swap_cluster is the number of pages kswapd writes in one turn.  You'll want
-this value to be large so that kswapd does its I/O in large chunks and the
-disk doesn't have to seek as often, but you don't want it to be too large
-since that would flood the request queue.
+It is a logarithmic value: setting it to zero means "1 page", setting it
+to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+
+The default value is three (eight pages at a time).  There may be some
+small benefit in tuning this to a different value if your workload is
+swap-intensive.
 
 overcommit_memory
 -----------------
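The new documentation suggests observing /proc/meminfo:LowFree while writing
a single huge file.  A tiny watcher such as the following sketch makes that
experiment easy; note that the LowFree field is only reported by kernels
built with highmem support.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[128];
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f) {
            perror("/proc/meminfo");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "LowFree:", 8) == 0) {
                fputs(line, stdout);    /* e.g. "LowFree:   123456 kB" */
                break;
            }
        }
        fclose(f);
        return 0;
    }

Run it repeatedly during the file write; the point at which the printed value
stops falling is the floor that lower_zone_protection has established.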
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -154,6 +154,7 @@ enum
 	VM_PAGEBUF=17,		/* struct: Control pagebuf parameters */
 	VM_HUGETLB_PAGES=18,	/* int: Number of available Huge Pages */
 	VM_SWAPPINESS=19,	/* Tendency to steal mapped memory */
+	VM_LOWER_ZONE_PROTECTION=20,/* Amount of protection of lower zones */
 };
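
The enum value doubles as the binary sysctl number, so besides the procfs
file the tunable becomes addressable as {CTL_VM, VM_LOWER_ZONE_PROTECTION}
through the old-style sysctl(2) interface.  A sketch using the glibc wrapper
of that era (the wrapper is long deprecated, and the constant is defined
locally here in case the installed headers predate this patch):

    #include <stdio.h>
    #include <sys/sysctl.h>    /* glibc's (since removed) sysctl() wrapper */

    #ifndef VM_LOWER_ZONE_PROTECTION
    #define VM_LOWER_ZONE_PROTECTION 20    /* the value this patch adds */
    #endif

    int main(void)
    {
        int name[] = { CTL_VM, VM_LOWER_ZONE_PROTECTION };
        int val = 100;

        /* Write the new value; the "old value" slots are passed as NULL. */
        if (sysctl(name, 2, NULL, NULL, &val, sizeof(val)) < 0) {
            perror("sysctl");
            return 1;
        }
        return 0;
    }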
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -53,6 +53,7 @@ extern int core_uses_pid;
 extern char core_pattern[];
 extern int cad_pid;
 extern int pid_max;
+extern int sysctl_lower_zone_protection;
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -310,8 +311,13 @@ static ctl_table vm_table[] = {
 	 0644, NULL, &proc_dointvec_minmax, &sysctl_intvec, NULL, &zero,
 	 &one_hundred },
 #ifdef CONFIG_HUGETLB_PAGE
-	{VM_HUGETLB_PAGES, "nr_hugepages", &htlbpage_max, sizeof(int), 0644, NULL, &hugetlb_sysctl_handler},
+	{VM_HUGETLB_PAGES, "nr_hugepages", &htlbpage_max, sizeof(int), 0644,
+	 NULL, &hugetlb_sysctl_handler},
 #endif
+	{VM_LOWER_ZONE_PROTECTION, "lower_zone_protection",
+	 &sysctl_lower_zone_protection, sizeof(sysctl_lower_zone_protection),
+	 0644, NULL, &proc_dointvec_minmax, &sysctl_intvec, NULL, &zero,
+	 NULL, },
 	{0}
 };
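
Once the table entry is registered, setting the tunable is just a write of an
integer to the new file - the equivalent of
`echo 100 > /proc/sys/vm/lower_zone_protection'.  Because extra1 is &zero and
extra2 is NULL, proc_dointvec_minmax rejects negative values but imposes no
upper bound.  A sketch (root, and a kernel carrying this patch, assumed):

    #include <stdio.h>

    int main(void)
    {
        /* 100 is the "reasonable value" the documentation above suggests. */
        FILE *f = fopen("/proc/sys/vm/lower_zone_protection", "w");

        if (!f) {
            perror("/proc/sys/vm/lower_zone_protection");
            return 1;
        }
        fprintf(f, "%d\n", 100);
        fclose(f);
        return 0;
    }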
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -38,7 +38,7 @@ unsigned long totalram_pages;
 unsigned long totalhigh_pages;
 int nr_swap_pages;
 int numnodes = 1;
+int sysctl_lower_zone_protection = 0;
 
 /*
  * Used by page_zone() to look up the address of the struct zone whose
@@ -470,6 +470,7 @@ __alloc_pages(unsigned int gfp_mask, unsigned int order,
 			if (page)
 				return page;
 		}
+		min += z->pages_low * sysctl_lower_zone_protection;
 	}
 
 	/* we're somewhat low on memory, failed to find what we needed */
@@ -492,6 +493,7 @@ __alloc_pages(unsigned int gfp_mask, unsigned int order,
 			if (page)
 				return page;
 		}
+		min += local_min * sysctl_lower_zone_protection;
 	}
 
 	/* here we're in the low on memory slow path */
@@ -529,6 +531,7 @@ __alloc_pages(unsigned int gfp_mask, unsigned int order,
 			if (page)
 				return page;
 		}
+		min += z->pages_low * sysctl_lower_zone_protection;
 	}
 
 	/*