  1. 31 Aug, 2003 1 commit
    • [PATCH] vmscan: zone pressure calculation fix · b25bb608
      Andrew Morton authored
      Off-by-one in balance_pgdat(): `priority' can never go negative.  It causes
      the scanning priority thresholds to be quite wrong and kswapd tends to go
      berserk when there is a lot of mapped memory around.
  2. 21 Aug, 2003 1 commit
  3. 20 Aug, 2003 1 commit
    • [PATCH] vmscan: give dirty referenced pages another pass · d55158b5
      Andrew Morton authored
      In a further attempt to prevent dirty pages from being written out from the
      LRU, don't write them if they were referenced.  This gives those pages
      another trip around the inactive list.  So more of them are written via
      balance_dirty_pages().
      
      It speeds up an untar-of-five-kernel trees by 5% on a 256M box, presumably
      because balance_dirty_pages() has better IO patterns.
      
      It largely fixes the problem which Gerrit talked about at the kernel summit:
      the individual writepage()s of dirty pages coming off the tail of the LRU are
      reduced by 83% in their database workload.
      
      I'm a bit worried that it increases scanning and OOM possibilities under
      nutty VM stress cases, but nothing untoward has been noted during its four
      weeks in -mm, so...
  4. 19 Aug, 2003 2 commits
    • [PATCH] async write errors: use flags in address space · fcad2b42
      Andrew Morton authored
      From: Oliver Xymoron <oxymoron@waste.org>
      
      This patch just saves a few bytes in the inode by turning mapping->gfp_mask
      into an unsigned long mapping->flags.
      
      The mapping's gfp mask is placed in the 16 high bits of mapping->flags and
      two of the remaining 16 bits are used for tracking EIO and ENOSPC errors.
      
      This leaves 14 bits in the mapping for future use.  They should be accessed
      with the atomic bitops.
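      The packing described above can be sketched in plain C. The names below (AS_EIO, AS_ENOSPC, the shift) are illustrative stand-ins rather than the patch's exact identifiers, and the kernel would manipulate these bits with atomic bitops:

```c
#include <assert.h>

/* Sketch: gfp mask in the 16 high bits of mapping->flags, error state
 * in two of the low bits.  The kernel would use atomic bitops (set_bit,
 * test_and_clear_bit); plain operations keep the sketch short. */
#define AS_EIO        0                 /* async EIO seen on this mapping */
#define AS_ENOSPC     1                 /* async ENOSPC seen */
#define AS_GFP_SHIFT 16                 /* gfp mask occupies bits 16..31 */

static unsigned long pack_mapping_flags(unsigned int gfp_mask)
{
    return (unsigned long)gfp_mask << AS_GFP_SHIFT;
}

static unsigned int mapping_gfp_mask(unsigned long flags)
{
    return (unsigned int)(flags >> AS_GFP_SHIFT) & 0xffffu;
}

static void mapping_set_error_bit(unsigned long *flags, int bit)
{
    *flags |= 1UL << bit;               /* atomic set_bit() in the kernel */
}
```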
    • [PATCH] async write errors: report truncate and io errors on · fe7e689f
      Andrew Morton authored
      From: Oliver Xymoron <oxymoron@waste.org>
      
      These patches add the infrastructure for reporting asynchronous write errors
      to block devices to userspace.  Errors which are detected during pdflush or VM
      writeout are reported at the next fsync, fdatasync, or msync on the given
      file, and on close if the error occurs in time.
      
      We do this by propagating any errors into page->mapping->error when they are
      detected.  In fsync(), msync(), fdatasync() and close() we return that error
      and zero it out.
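      The report-and-clear behaviour can be sketched as follows (the struct and helper names are stand-ins for illustration, not the patch's actual identifiers):

```c
#include <assert.h>
#include <errno.h>

/* Sketch: pdflush/VM writeout records an async write error on the
 * mapping; the next fsync/fdatasync/msync/close returns it and zeroes
 * it, so the error is reported exactly once. */
struct address_space_sketch {
    int error;                          /* 0, -EIO or -ENOSPC */
};

static void writeout_failed(struct address_space_sketch *mapping, int err)
{
    mapping->error = err;               /* detected at IO completion */
}

static int fsync_check_error(struct address_space_sketch *mapping)
{
    int err = mapping->error;
    mapping->error = 0;                 /* report once, then clear */
    return err;
}
```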
      
      
      The Open Group say close() _may_ fail if an I/O error occurred while reading
      from or writing to the file system.  Well, in this implementation close() can
      return -EIO or -ENOSPC.  And in that case it will succeed, not fail - perhaps
      that is what they meant.
      
      
      There are three patches in this series and testing has only been performed
      with all three applied.
  5. 18 Aug, 2003 1 commit
    • [PATCH] cpumask_t: allow more than BITS_PER_LONG CPUs · bf8cb61f
      Andrew Morton authored
      From: William Lee Irwin III <wli@holomorphy.com>
      
      Contributions from:
      	Jan Dittmer <jdittmer@sfhq.hn.org>
      	Arnd Bergmann <arnd@arndb.de>
      	"Bryan O'Sullivan" <bos@serpentine.com>
      	"David S. Miller" <davem@redhat.com>
      	Badari Pulavarty <pbadari@us.ibm.com>
      	"Martin J. Bligh" <mbligh@aracnet.com>
      	Zwane Mwaikambo <zwane@linuxpower.ca>
      
      It has been tested on x86, sparc64, x86_64, ia64 (I think), ppc and ppc64.
      
      cpumask_t enables systems with NR_CPUS > BITS_PER_LONG to utilize all their
      cpus by creating an abstract data type dedicated to representing cpu
      bitmasks, similar to fd sets from userspace, and sweeping the appropriate
      code to update callers to the access API.  The fd set-like structure is
      according to Linus' own suggestion; the macro calling convention to ambiguate
      representations with minimal code impact is my own invention.
      
      Specifically, a new set of inline functions for manipulating arbitrary-width
      bitmaps is introduced with a relatively simple implementation, in tandem with
      a new data type representing bitmaps of width NR_CPUS, cpumask_t, whose
      accessor functions are defined in terms of the bitmap manipulation inlines.
      This bitmap ADT found an additional use in i386 arch code handling sparse
      physical APIC ID's, which was convenient to use in this case as the
      accounting structure was required to be wider to accommodate the physids
      consumed by larger numbers of cpus.
      
      For the sake of simplicity and low code impact, these cpu bitmasks are passed
      primarily by value; however, an additional set of accessors along with an
      auxiliary data type with const call-by-reference semantics is provided to
      address performance concerns raised in connection with very large systems,
      such as SGI's larger models, where copying and call-by-value overhead would
      be prohibitive.  Few (if any) users of the call-by-reference API are
      immediately introduced.
      
      Also, in order to avoid calling convention overhead on architectures where
      structures are required to be passed by value, NR_CPUS <= BITS_PER_LONG is
      special-cased so that cpumask_t falls back to an unsigned long and the
      accessors perform the usual bit twiddling on unsigned longs as opposed to
      arrays thereof.  Audits were done with the structure overhead in-place,
      restoring this special-casing only afterward so as to ensure a more complete
      API conversion while undergoing the majority of its end-user exposure in -mm.
      More -mm's were shipped after its restoration to be sure that was tested,
      too.
      
      The immediate users of this functionality are Sun sparc64 systems, SGI mips64
      and ia64 systems, and IBM ia32, ppc64, and s390 systems.  Of these, only the
      ppc64 machines needing the functionality have yet to be released; all others
      have had systems requiring it for full functionality for at least 6 months,
      and in some cases, since the initial Linux port to the affected architecture.
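      The width special case can be sketched like this. It is a simplified shape, not the kernel's exact macros, and NR_CPUS and BITS_PER_LONG are fixed here purely for illustration:

```c
#include <assert.h>

#define NR_CPUS       4     /* assumption for this sketch */
#define BITS_PER_LONG 64    /* assumption: 64-bit architecture */

#if NR_CPUS <= BITS_PER_LONG
/* small systems: the mask degenerates to a bare unsigned long and the
 * accessors are ordinary bit twiddling, with no structure overhead */
typedef unsigned long cpumask_t;
#define cpu_set(cpu, mask)   ((mask) |= 1UL << (cpu))
#define cpu_isset(cpu, mask) (((mask) >> (cpu)) & 1UL)
#else
/* large systems: an fd_set-like array of longs wide enough for NR_CPUS */
typedef struct {
    unsigned long bits[(NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG];
} cpumask_t;
#define cpu_set(cpu, mask) \
    ((mask).bits[(cpu) / BITS_PER_LONG] |= 1UL << ((cpu) % BITS_PER_LONG))
#define cpu_isset(cpu, mask) \
    (((mask).bits[(cpu) / BITS_PER_LONG] >> ((cpu) % BITS_PER_LONG)) & 1UL)
#endif
```

Either representation answers the same accessor calls, which is what lets the special case be restored without disturbing callers.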
  6. 01 Aug, 2003 4 commits
    • [PATCH] vmscan: use zone_pressure for page unmapping · 14d927a3
      Andrew Morton authored
      From: Nikita Danilov <Nikita@Namesys.COM>
      
      Use zone->pressure (rather than scanning priority) to determine when to
      start reclaiming mapped pages in refill_inactive_zone().  When using
      priority every call to try_to_free_pages() starts with scanning parts of
      active list and skipping mapped pages (because reclaim_mapped evaluates to
      0 on low priorities) no matter how high memory pressure is.
    • [PATCH] vmscan: decaying average of zone pressure · ecbeb4b2
      Andrew Morton authored
      From: Nikita Danilov <Nikita@Namesys.COM>
      
      The vmscan logic at present will scan the inactive list with increasing
      priority until a threshold is triggered.  At that threshold we start
      unmapping pages from pagetables.
      
      The problem is that each time someone calls into this code, the priority is
      initially low, so some mapped pages will be refiled even though we really
      should be unmapping them now.
      
      Nikita's patch adds the `pressure' field to struct zone.  It is a decaying
      average of the zone's memory pressure and allows us to start unmapping pages
      immediately on entry to page reclaim, based on measurements which were made
      in earlier reclaim attempts.
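      The commit doesn't spell out the formula, but a decaying average of this kind is typically an exponential moving average; a hedged sketch, with a weighting chosen for illustration:

```c
#include <assert.h>

/* Sketch: each reclaim pass folds its observed pressure into the zone's
 * running value, so the next entry to page reclaim can start unmapping
 * immediately when earlier passes found the zone under pressure.  The
 * 7/8 old, 1/8 new weighting is illustrative, not taken from the patch. */
static unsigned int update_zone_pressure(unsigned int old_pressure,
                                         unsigned int sample)
{
    return (old_pressure * 7 + sample) / 8;
}
```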
    • [PATCH] fix kswapd throttling · 00401a44
      Andrew Morton authored
      kswapd currently takes a throttling nap even if it freed all the pages it
      was asked to free.
      
      Change it so we only throttle if reclaim is not being sufficiently
      successful.
    • [PATCH] kswapd can free too much memory · f76a4338
      Andrew Morton authored
      We need to subtract the number of freed slab pages from the number of pages
      to free, not add it.
  7. 12 Jul, 2003 1 commit
    • [PATCH] asm-generic/div64.h breakage · ed08e6df
      Bernardo Innocenti authored
       - __div64_32(): remove __attribute_pure__ qualifier from the prototype
         since this function obviously clobbers memory through &(n);
      
       - do_div(): add a check to ensure (n) is type-compatible with uint64_t;
      
       - as_update_iohist(): Use sector_div() instead of do_div().
         (Whether the result of the addition should always be stored in 64bits
         regardless of CONFIG_LBD is still being discussed, therefore it's
         unaddressed here);
      
       - Fix all places where do_div() was being called with a bad divisor argument.
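      A userspace sketch of the do_div() contract the patch tightens. The compile-time width check shown here is one way to express it; the kernel's actual check is typeof-based, and this sketch relies on GCC statement expressions just as asm-generic/div64.h does:

```c
#include <assert.h>
#include <stdint.h>

/* do_div(n, base): (n) must be a 64-bit lvalue; it is divided in place
 * and the macro evaluates to the 32-bit remainder.  The sizeof check
 * rejects a wrongly-typed (n) at compile time, in the spirit of the fix. */
#define do_div(n, base) ({                                    \
        uint32_t __base = (base);                             \
        uint32_t __rem;                                       \
        (void)sizeof(char[sizeof(n) == 8 ? 1 : -1]);          \
        __rem = (uint32_t)((n) % __base);                     \
        (n) /= __base;                                        \
        __rem;                                                \
    })
```

Passing a 32-bit variable as (n) now fails to compile instead of silently truncating, which is the class of bad-caller bug the last bullet refers to.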
  8. 14 Jun, 2003 1 commit
    • [PATCH] NUMA fixes · 1d292c60
      Andrew Morton authored
      From: Anton Blanchard <anton@samba.org>
      
      
      Anton has been testing odd setups:
      
      /* node 0 - no cpus, no memory */
      /* node 1 - 1 cpu, no memory */
      /* node 2 - 0 cpus, 1GB memory */
      /* node 3 - 3 cpus, 3GB memory */
      
      Two things tripped so far.  Firstly the ppc64 debug check for invalid cpus
      in cpu_to_node().  Fix that in kernel/sched.c:node_nr_running_init().
      
      The other problem concerned nodes with memory but no cpus.  kswapd tries to
      set_cpus_allowed(0) and bad things happen.  So we only set cpu affinity
      for kswapd if there are cpus in the node.
  9. 06 Jun, 2003 1 commit
    • [PATCH] Don't let processes be scheduled on CPU-less nodes (1/3) · 2eb57dd2
      Andrew Morton authored
      From: Matthew Dobson <colpatch@us.ibm.com>
      
      sched_best_cpu schedules processes on nodes based on node_nr_running.  For
      CPU-less nodes, this is always 0, and thus sched_best_cpu tends to migrate
      tasks to these nodes, which eventually get remigrated elsewhere.
      
      This patch adds include/linux/topology.h, and modifies all includes of
      asm/topology.h to linux/topology.h.  A subsequent patch in this series adds
      helper functions to linux/topology.h to ensure processes are only migrated
      to nodes with CPUs.
      
      Test compiled and booted by Andrew Theurer (habanero@us.ibm.com) on both
      x440 and ppc64.
  10. 22 May, 2003 1 commit
  11. 07 May, 2003 1 commit
    • [PATCH] account for slab reclaim in try_to_free_pages() · f31fd780
      Andrew Morton authored
      try_to_free_pages() currently fails to notice that it successfully freed slab
      pages via shrink_slab().  So it can keep looping and eventually call
      out_of_memory(), even though there's a lot of memory now free.
      
      And even if it doesn't do that, it can free too much memory.
      
      The patch changes try_to_free_pages() so that it will notice freed slab pages
      and will return when enough memory has been freed via shrink_slab().
      
      Many options were considered, but most of them were unacceptably inaccurate,
      intrusive or sleazy.  I ended up putting the accounting into a stack-local
      structure which is pointed to by current->reclaim_state.
      
      One reason for this is that we can cleanly resurrect the current->local_pages
      pool by putting it into struct reclaim_state.
      
      (current->local_pages was removed because the per-cpu page pools in the page
      allocator largely duplicate its function.  But it is still possible for
      interrupt-time allocations to steal just-freed pages, so we might want to put
      it back some time.)
  12. 30 Apr, 2003 1 commit
    • [PATCH] zone accounting race fix · 98605ba9
      Andrew Morton authored
      Fix a bug identified by Nikita Danilov: refill_inactive_zone() is deferring
      the update of zone->nr_inactive and zone->nr_active for too long - it needs
      to be consistent whenever zone->lock is not held.
  13. 20 Apr, 2003 3 commits
    • [PATCH] don't shrink slab for highmem allocations · 5a08774a
      Andrew Morton authored
      From: William Lee Irwin III <wli@holomorphy.com>
      
      If one's goal is to free highmem pages, shrink_slab() is an ineffective
      method of recovering them, as slab pages are all ZONE_NORMAL or ZONE_DMA.
      Hence, this "FIXME: do not do for zone highmem".  Presumably this is a
      question of policy, as highmem allocations may be satisfied by reaping slab
      pages and handing them back; but the FIXME says what we should do.
    • [PATCH] implement __GFP_REPEAT, __GFP_NOFAIL, __GFP_NORETRY · 75908778
      Andrew Morton authored
      This is a cleanup patch.
      
      There are quite a lot of places in the kernel which will infinitely retry a
      memory allocation.
      
      Generally, they get it wrong.  Some do yield(), the semantics of which have
      changed over time.  Some do schedule(), which can lock up if the caller is
      SCHED_FIFO/RR.  Some do schedule_timeout(), etc.
      
      And often it is unnecessary, because the page allocator will do the retry
      internally anyway.  But we cannot rely on that - this behaviour may change
      (-aa and -rmap kernels do not do this, for instance).
      
      So it is good to formalise and to centralise this operation.  If an
      allocation specifies __GFP_REPEAT then the page allocator must infinitely
      retry the allocation.
      
      The semantics of __GFP_REPEAT are "try harder".  The allocation _may_ fail
      (the 2.4 -aa and -rmap VM's do not retry infinitely by default).
      
      The semantics of __GFP_NOFAIL are "cannot fail".  It is a no-op in this VM,
      but needs to be honoured (or fix up the callers) if the VM is changed to not
      retry infinitely by default.
      
      The semantics of __GFP_NORETRY are "try once, don't loop".  This isn't used
      at present (although perhaps it should be, in swapoff).  It is mainly for
      completeness.
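      The three retry policies can be sketched as a hedged userspace model. The flag values, the allocator stub, and the retry bounds are all illustrative, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

#define __GFP_REPEAT  0x1u   /* try harder: retry many times, may fail */
#define __GFP_NOFAIL  0x2u   /* cannot fail: loop until success */
#define __GFP_NORETRY 0x4u   /* try once, don't loop */

/* stubbed single attempt so the policy loop can be exercised */
static int attempts_until_success;
static void *try_alloc_once(void)
{
    return (--attempts_until_success <= 0) ? (void *)0x1 : NULL;
}

static void *alloc_pages_sketch(unsigned int gfp_mask)
{
    int tries = 0;
    for (;;) {
        void *page = try_alloc_once();
        if (page)
            return page;
        if (gfp_mask & __GFP_NORETRY)
            return NULL;                    /* one shot only */
        if (gfp_mask & __GFP_NOFAIL)
            continue;                       /* must not fail: keep going */
        ++tries;
        if (!(gfp_mask & __GFP_REPEAT) && tries >= 3)
            return NULL;                    /* default: bounded retries */
        if ((gfp_mask & __GFP_REPEAT) && tries >= 100)
            return NULL;                    /* "try harder" may still fail */
    }
}
```

Centralising the loop here is the point of the cleanup: callers stop open-coding yield()/schedule() retry loops and just state their policy in the gfp mask.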
    • [PATCH] Clean up various buffer-head dependencies · cda55f33
      Andrew Morton authored
      From: William Lee Irwin III <wli@holomorphy.com>
      
      Remove page_has_buffers() from various functions, document the dependencies
      on buffer_head.h from other files besides filemap.c, and s/this file/core VM/
      in filemap.c
  14. 09 Apr, 2003 1 commit
    • [PATCH] Replace the radix-tree rwlock with a spinlock · 8e98702b
      Andrew Morton authored
      Spinlocks don't have a buslocked unlock and are faster.
      
      On a P4, time to write a 4M file with 4M one-byte-write()s:
      
      Before:
      	0.72s user 5.47s system 99% cpu 6.227 total
      	0.76s user 5.40s system 100% cpu 6.154 total
      	0.77s user 5.38s system 100% cpu 6.146 total
      
      After:
      	1.09s user 4.92s system 99% cpu 6.014 total
      	0.74s user 5.28s system 99% cpu 6.023 total
      	1.03s user 4.97s system 100% cpu 5.991 total
  15. 28 Mar, 2003 2 commits
    • [PATCH] permit page unmapping if !CONFIG_SWAP · 09efe93d
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Raised #endif CONFIG_SWAP in shrink_list, it was excluding
      try_to_unmap of file pages.  Suspect !CONFIG_MMU relied on
      that to suppress try_to_unmap, added SWAP_FAIL stub for it.
    • [PATCH] remove SWAP_ERROR · 255373b8
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Delete unused SWAP_ERROR and non-existent page_over_rsslimit().
  16. 15 Feb, 2003 1 commit
    • [PATCH] blk_congestion_wait tuning and lockup fix · ecc3f712
      Andrew Morton authored
      blk_congestion_wait() will currently not wait if there are no write requests
      in flight.  Which is a potential problem if all the dirty data is against NFS
      filesystems.
      
      For write(2) traffic against NFS, things work nicely, because writers
      throttle in nfs_wait_on_requests().  But for MAP_SHARED dirtyings we need to
      avoid spinning in balance_dirty_pages().  So allow callers to fall through to
      the explicit sleep in that case.
      
      This will also fix a weird lockup which the reiser4 developers report.  In
      that case they have managed to have _all_ inodes against a superblock in
      locked state, yet there are no write requests in flight.  Taking a nap in
      blk_congestion_wait() in this case will yield the CPU to the threads which
      are trying to write out pages.
      
      Also tune up the sleep durations in various callers - 250 milliseconds seems
      rather long.
  17. 12 Feb, 2003 1 commit
  18. 11 Feb, 2003 1 commit
    • Sanitize kernel daemon signal handling and process naming. · 43fea1be
      Linus Torvalds authored
      Add a name argument to daemonize() (va_arg) to avoid all the
      kernel threads having to duplicate the name setting over and
      over again.
      
      Make daemonize() disable all signals by default, and add an
      "allow_signal()" function to let daemons say they explicitly
      want to support a signal.
      
      Make flush_signal() take the signal lock, so that callers do
      not need to.
  19. 06 Feb, 2003 1 commit
  20. 04 Feb, 2003 2 commits
    • [PATCH] Remove __ from topology macros · 8c4ea5db
      Andrew Morton authored
      Patch from Matthew Dobson <colpatch@us.ibm.com>
      
      When I originally wrote the patches implementing the in-kernel topology
      macros, they were meant to be called as a second layer of functions,
      sans underbars.  This additional layer was deemed unnecessary and
      summarily dropped.  As such, carrying around (and typing!) all these
      extra underbars is quite pointless.  Here's a patch to nip this in the
      (sorta) bud.  The macros only appear in 16 files so far, most of them
      being the definitions themselves.
    • [PATCH] remove __GFP_HIGHIO · 3ac8c845
      Andrew Morton authored
      Patch From: Hugh Dickins <hugh@veritas.com>
      
      Recently noticed that __GFP_HIGHIO has played no real part since bounce
      buffering was converted to mempool in 2.5.12: so this patch (over 2.5.58-mm1)
      removes it and GFP_NOHIGHIO and SLAB_NOHIGHIO.
      
      Also removes GFP_KSWAPD, in 2.5 same as GFP_KERNEL; leaves GFP_USER, which
      can be a useful comment, even though in 2.5 same as GFP_KERNEL.
      
      One anomaly needs comment: strictly, if there's no __GFP_HIGHIO, then
      GFP_NOHIGHIO translates to GFP_NOFS; but GFP_NOFS looks wrong in the block
      layer, and if you follow them down, you find that GFP_NOFS and GFP_NOIO
      behave the same way in mempool_alloc - so I've used the less surprising
      GFP_NOIO to replace GFP_NOHIGHIO.
  21. 21 Dec, 2002 2 commits
    • [PATCH] Give kswapd writeback higher priority than pdflush · e386771c
      Andrew Morton authored
      The `low latency page reclaim' design works by preventing page
      allocators from blocking on request queues (and by preventing them from
      blocking against writeback of individual pages, but that is immaterial
      here).
      
      This has a problem under some situations.  pdflush (or a write(2)
      caller) could be saturating the queue with highmem pages.  This
      prevents anyone from writing back ZONE_NORMAL pages.  We end up doing
      enormous amounts of scanning.
      
      A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory,
      then kill the mmapping applications.  The machine instantly goes from
      0% of memory dirty to 95% or more.  pdflush kicks in and starts writing
      the least-recently-dirtied pages, which are all highmem.  The queue is
      congested so nobody will write back ZONE_NORMAL pages.  kswapd chews
      50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim
      efficiency (pages_reclaimed/pages_scanned) falls to 2%.
      
      So this patch changes the policy for kswapd.  kswapd may use all of a
      request queue, and is prepared to block on request queues.
      
      What will now happen in the above scenario is:
      
      1: The page allocator scans some pages, fails to reclaim enough
         memory and takes a nap in blk_congestion_wait().
      
      2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
         back pages.  (These pages will be rotated to the tail of the
         inactive list at IO-completion interrupt time).
      
         This writeback will saturate the queue with ZONE_NORMAL pages.
         Conveniently, pdflush will avoid the congested queues.  So we end up
         writing the correct pages.
      
      In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim
      efficiency rises from 2% to 40% and things are generally a lot happier.
      
      
      The downside is that kswapd may now do a lot less page reclaim,
      increasing page allocation latency, causing more direct reclaim,
      increasing lock contention in the VM, etc.  But I have not been able to
      demonstrate that in testing.
      
      
      The other problem is that there is only one kswapd, and there are lots
      of disks.  That is a generic problem - without being able to co-opt
      user processes we don't have enough threads to keep lots of disks saturated.
      
      One fix for this would be to add an additional "really congested"
      threshold in the request queues, so kswapd can still perform
      nonblocking writeout.  This gives kswapd priority over pdflush while
      allowing kswapd to feed many disk queues.  I doubt if this will be
      called for.
    • [PATCH] fix a page dirtying race in vmscan.c · 985babe8
      Andrew Morton authored
      There's a small window in which another CPU could dirty the page after
      we've cleaned it, and before we've moved it to mapping->dirty_pages.
      The end result is a dirty page on mapping->locked_pages, which is
      wrong.
      
      So take mapping->page_lock before clearing the dirty bit.
  22. 14 Dec, 2002 5 commits
    • [PATCH] remove a vm debug check · d8259d09
      Andrew Morton authored
      This ad-hoc assertion is no longer true.  If all zones are in the `all
      unreclaimable' state it can trigger when testing with a tiny amount
      of physical memory.
    • [PATCH] remove PF_SYNC · 577c516f
      Andrew Morton authored
      current->flags:PF_SYNC was a hack I added because I didn't want to
      change all ->writepage implementations.
      
      It's foul.  And it means that if someone happens to run direct page
      reclaim within the context of (say) sys_sync, the writepage invocations
      from the VM will be treated as "data integrity" operations, not "memory
      cleansing" operations, which would cause latency.
      
      So the patch removes PF_SYNC and adds an extra arg to a_ops->writepage.
      It is the `writeback_control' structure which contains the full context
      information about why writepage was called.
      
      The initial version of this patch just passed in a bare `int sync', but
      the XFS team need more info so they can perform writearound from within
      page reclaim.
      
      The patch also adds writeback_control.for_reclaim, so writepage
      implementations can inspect that to work out the call context rather
      than peeking at current->flags:PF_MEMALLOC.
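      The call-context plumbing might look roughly like this (field names beyond for_reclaim, and the return values, are illustrative):

```c
#include <assert.h>

/* Sketch: writepage implementations receive a writeback_control and can
 * ask "why was I called?" instead of peeking at current->flags. */
struct writeback_control {
    int sync_mode;      /* data-integrity sync vs. background cleansing */
    int for_reclaim;    /* set when invoked from page reclaim */
};

static int example_writepage(struct writeback_control *wbc)
{
    if (wbc->for_reclaim)
        return 1;       /* reclaim context: e.g. perform writearound */
    if (wbc->sync_mode)
        return 2;       /* data-integrity writeback: must not skip */
    return 0;           /* background memory-cleansing writeback */
}
```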
    • [PATCH] vm accounting fixes and addition · c720c50a
      Andrew Morton authored
      - /proc/vmstat:pageoutrun and /proc/vmstat:allocstall are always
        identical.  Rework this so that
      
        - "allocstall" is the number of times a page allocator ran direct reclaim
      
        - "pageoutrun" is the number of times kswapd ran page reclaim
      
      - Add a new stat: "pgrotated".  The number of pages which were
        rotated to the tail of the LRU for immediate reclaim by
        rotate_reclaimable_page().
      
      - Document things a bit.
    • [PATCH] Remove fail_writepage, redux · 3e9afe4c
      Andrew Morton authored
      fail_writepage() does not work.  Its activate_page() call cannot
      activate the page because it is not on the LRU.
      
      So perform that function (more efficiently) in the VM.  Remove
      fail_writepage() and, if the filesystem does not implement
      ->writepage() then activate the page from shrink_list().
      
      A special case is tmpfs, which does have a writepage, but which
      sometimes wants to activate the pages anyway.  The most important case
      is when there is no swap online and we don't want to keep all those
      pages on the inactive list.  So just as a tmpfs special-case, allow
      writepage() to return WRITEPAGE_ACTIVATE, and handle that in the VM.
      
      Also, the whole idea of allowing ->writepage() to return -EAGAIN, and
      handling that in the caller has been reverted.  If a writepage()
      implementation wants to back out and not write the page, it must
      redirty the page, unlock it and return zero.  (This is Hugh's preferred
      way).
      
      And remove the now-unneeded shmem_writepages() - shmem inodes are
      marked as `memory backed' so it will not be called.
      
      And remove the test for non-null ->writepage() in generic_file_mmap().
      Memory-backed files _are_ mmappable, and they do not have a
      writepage().  It just isn't called.
      
      So the locking rules for writepage() are unchanged.  They are:
      
      - Called with the page locked
      - Returns with the page unlocked
      - Must redirty the page itself if it wasn't all written.
      
      But there is a new, special, hidden, undocumented, secret hack for
      tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM to move
      the page to the active list.  The page must be kept locked in this one
      case.
    • [PATCH] Fix rmap locking for CONFIG_SWAP=n · c7d7f43a
      Andrew Morton authored
      The pte_chain_unlock() needs to be outside the ifdef.
  23. 03 Dec, 2002 4 commits
    • [PATCH] Move unreleasable pages onto the active list · 1c0f3462
      Andrew Morton authored
      With some workloads a large number of pages coming off the LRU are
      pinned blockdev pagecache - things like ext2 group descriptors, pages
      which have buffers in the per-cpu buffer LRUs, etc.
      
      They keep churning around the inactive list, reducing the overall page
      reclaim effectiveness.
      
      So move these pages onto the active list.
    • [PATCH] Special-case fail_writepage() in page reclaim · 32b51ef2
      Andrew Morton authored
      Pages from memory-backed filesystems are supposed to be moved up onto
      the active list, but that's not working because fail_writepage() is
      called when the page is not on the LRU.
      
      So look for this case in page reclaim and handle it there.
      
      And it's more efficient: the VM knows more about what is going on, and
      it later leads to the removal of fail_writepage().
    • [PATCH] Move reclaimable pages to the tail of the inactive list on · 3b0db538
      Andrew Morton authored
      The patch addresses some search complexity failures which occur when
      there is a large amount of dirty data on the inactive list.
      
      Normally we attempt to write out those pages and then move them to the
      head of the inactive list.  But this goes against page aging, and means
      that the page has to traverse the entire list again before it can be
      reclaimed.
      
      But the VM really wants to reclaim that page - it has reached the tail
      of the LRU.
      
      So what we do in this patch is to mark the page as needing reclamation,
      and then start I/O.  In the IO completion handler we check to see if
      the page is still probably reclaimable and if so, move it to the tail of
      the inactive list, where it can be reclaimed immediately.
      
      Under really heavy swap-intensive loads this increases the page reclaim
      efficiency (pages reclaimed/pages scanned) from 10% to 25%.  Which is
      OK for that sort of load.  Not great, but OK.
      
      This code path takes the LRU lock once per page.  I didn't bother
      playing games with batching up the locking work - it's a rare code
      path, and the machine has plenty of CPU to spare when this is
      happening.
    • [PATCH] Remove the final per-page throttling site in the VM · 3139a3ec
      Andrew Morton authored
      This removes the last remnant of the 2.4 way of throttling page
      allocators: the wait_on_page_writeback() against mapped-or-swapcache
      pages.
      
      I did this because:
      
      a) It's not used much.
      b) It's already causing big latencies
      c) With Jens' large-queue stuff, it can cause huuuuuuuuge latencies.
         Like: ninety seconds.
      
      So kill it, and rely on blk_congestion_wait() to slow the allocator
      down to match the rate at which the IO system can retire writes.
  24. 26 Nov, 2002 1 commit
    • [PATCH] reduced latency in dentry and inode cache shrinking · 23e77b64
      Andrew Morton authored
      Shrinking a huge number of dentries or inodes can hold dcache_lock or
      inode_lock for a long time.  Not only does this hold off preemption -
      holding those locks basically shuts down the whole VFS.
      
      A neat fix for all such caches is to chunk the work up at the
      shrink_slab() level.
      
      I made the chunksize pretty small, for scalability reasons - avoid
      holding the lock for too long so another CPU can come in, acquire it
      and go off to do some work.