- 12 Apr, 2004 40 commits
-
Andrew Morton authored
Remove the hardwired pagefault readaround distance in filemap_nopage() and use the per-file readahead setting. The main reason for this is in fact laptop-mode. If you want to prevent the disk from spinning up then you want all of your application's pages to be pulled into memory in one hit. Otherwise the disk will spin up each time you use a new part of whatever application(s) you are running.
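A minimal userspace sketch of the idea, assuming a hypothetical readaround() helper: the window pulled in at fault time is sized from the file's readahead setting instead of a hardwired constant. The kernel's filemap_nopage() works on struct file_ra_state and real pagecache pages; everything below is illustrative only.

    /*
     * Illustrative model: derive the fault readaround window from the
     * per-file readahead setting (ra_pages) rather than a fixed constant.
     * The helper and types are made up for this sketch.
     */
    #include <stdio.h>

    struct window { unsigned long start, end; };   /* [start, end) in pages */

    static struct window readaround(unsigned long fault_index,
                                    unsigned long file_pages,
                                    unsigned long ra_pages)
    {
        struct window w;
        unsigned long half = ra_pages / 2;

        /* Centre the window on the faulting page, clamped to the file. */
        w.start = (fault_index > half) ? fault_index - half : 0;
        w.end = w.start + ra_pages;
        if (w.end > file_pages)
            w.end = file_pages;
        return w;
    }

    int main(void)
    {
        /* A large per-file readahead pulls the application in "in one hit"
         * instead of a handful of pages per fault. */
        struct window w = readaround(100, 1000, 256);
        printf("fault at 100: read pages [%lu, %lu)\n", w.start, w.end);
        return 0;
    }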
-
Andrew Morton authored
From: Bart Samwel <bart@samwel.tk> Add support for the value "0" to ext3's "commit" option. When this value is given, ext3 replaces it with the default commit interval. Introduce a constant JBD_DEFAULT_MAX_COMMIT_AGE for this.
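A minimal sketch of the substitution, assuming the JBD default commit age is 5 seconds; the helper name is made up, only the constant's name mirrors the patch.

    /* Model of the option handling: commit=0 means "use the default". */
    #include <stdio.h>

    #define JBD_DEFAULT_MAX_COMMIT_AGE 5   /* assumed default, in seconds */

    static unsigned int parse_commit_option(unsigned int requested)
    {
        return requested ? requested : JBD_DEFAULT_MAX_COMMIT_AGE;
    }

    int main(void)
    {
        printf("commit=0  -> %u seconds\n", parse_commit_option(0));
        printf("commit=30 -> %u seconds\n", parse_commit_option(30));
        return 0;
    }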
-
Andrew Morton authored
From: Bart Samwel <bart@samwel.tk>

Adds /proc/sys/vm/laptop-mode: a special knob which says "this is a laptop". In this mode the kernel will attempt to avoid spinning disks up.

Algorithm: the idea is to hold dirty data in memory for a long time, but to flush everything which has been accumulated if the disk happens to spin up for other reasons.

- Whenever a disk request completes (read or write), schedule a timer a few seconds hence. If the timer was already pending, reset it to a few seconds hence.
- When the timer expires, write back the whole world. We use sync_filesystems() for this because it will force ext3 journal commits as well.
- In balance_dirty_pages(), kick off background writeback when we hit the high threshold (dirty_ratio), not when we hit the low threshold. This has the effect of causing "lumpy" writeback which is something I spent a year fixing, but in laptop mode, it is desirable.
- In try_to_free_pages(), only kick pdflush if the VM is getting into distress: we want to keep scanning for clean pages, deferring writeback.
- In page reclaim, avoid writing back the odd random dirty page off the LRU: only start I/O if the scanning is working harder.

The effect is to perform a sync() a few seconds after all I/O has ceased. The value which was written into /proc/sys/vm/laptop-mode determines, in seconds, the delay between the final I/O and the flush.

Additionally, the patch adds tools which help answer the question "why the heck does my disk spin up all the time?". The user may set /proc/sys/vm/block_dump to a non-zero value and the kernel will print out information which will identify the process which is performing disk reads or which is dirtying pagecache. The user should probably disable syslogd before setting block_dump.
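A self-contained userspace model of the re-arming flush timer described above; the function names are illustrative, and in the kernel the expiry path uses a real timer and sync_filesystems().

    #include <stdio.h>
    #include <time.h>

    static int laptop_mode_secs = 5;   /* value written to /proc/sys/vm/laptop-mode */
    static time_t flush_deadline = 0;  /* 0 means no flush pending */

    /* Called whenever a (simulated) disk request completes, read or write. */
    static void disk_io_completed(time_t now)
    {
        /* Re-arm: flush a few seconds after the *last* I/O, not the first. */
        flush_deadline = now + laptop_mode_secs;
    }

    /* Stand-in for sync_filesystems(): write back the whole world. */
    static void flush_everything(time_t now)
    {
        printf("t=%ld: flushing all dirty data (journal commits included)\n",
               (long)now);
        flush_deadline = 0;
    }

    int main(void)
    {
        /* Simulated timeline: I/O completions at t=1, t=2, t=4; nothing after. */
        time_t io_events[] = { 1, 2, 4 };
        size_t next_io = 0;

        for (time_t now = 0; now <= 15; now++) {
            if (next_io < sizeof(io_events) / sizeof(io_events[0]) &&
                io_events[next_io] == now)
                disk_io_completed(io_events[next_io++]);

            if (flush_deadline && now >= flush_deadline)
                flush_everything(now);   /* fires once, 5s after the last I/O */
        }
        return 0;
    }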
-
Andrew Morton authored
This is always equal to constant zero.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> rmap's try_to_unmap_one comments on find_vma failure, that a page may temporarily be absent from a vma during mremap: no longer, though it is still possible for this find_vma to fail, while unmap_vmas drops page_table_lock (but that is no problem for file truncation).
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> mremap's move_vma should think ahead to lessen the chance of failure during its rewind on failure: running out of memory is always possible, but it's silly for it to embark when it's near the map_count limit.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Subtle point from Rajesh Venkatasubramanian: when mremap's move_vma fails and so rewinds, before moving the file-based ptes back, we must move new_vma before old vma in the i_mmap or i_mmap_shared list, so that when racing against vmtruncate we cannot propagate pages to be truncated back from new_vma into the just cleaned old_vma.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Partial rewrite of mremap's move_vma. Rajesh Venkatasubramanian has pointed out that vmtruncate could miss ptes, leaving orphaned pages, because move_vma only made the new vma visible after filling it. We see no good reason for that, and it's time to make move_vma more robust. Removed all its vma merging decisions, leaving them to mmap.c's vma_merge, with copy_vma added. Removed duplicated is_mergeable_vma test from vma_merge, and duplicated validate_mm from insert_vm_struct. move_vma now moves from old to new and then unmaps old; on error it moves back from new to old and unmaps new. Don't unwind within move_page_tables; let move_vma call it explicitly to unwind, with the right source vma. Get the VM_ACCOUNTing right even when the final do_munmap fails.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Clean up mremap move's copy_one_pte: - get_one_pte_map_nested already weeded out the pte_none case, now don't even call copy_one_pte if it has nothing to do. - check pfn_valid before passing page to page_remove_rmap.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> First of six patches against 2.6.5-rc3, cleaning up mremap's move_vma, and fixing truncation orphan issues raised by Rajesh Venkatasubramanian. Originally done as part of the anonymous objrmap work on mremap move, but useful fixes now extracted for mainline. The mremap changes need some exposure in the -mm tree first, but the first (fork one-liner) is safe enough to go straight into 2.6.5.

From: Rajesh Venkatasubramanian. Despite the comment that child vma should be inserted just after parent vma, 2.5.6 did exactly the reverse: thus a racing vmtruncate may free the child's ptes, then advance to the parent, and meanwhile copy_page_range has propagated more ptes from the parent to the child, leaving file pages still mapped after truncation.
-
Andrew Morton authored
The compound page logic is a little fragile - it relies on additional metadata in the pageframes which some other kernel code likes to stomp on (xfs was doing this). Also, because we're treating all higher-order pages as compound pages it is no longer possible to free individual lower-order pages from the middle of higher-order pages. At least one ARM driver insists on doing this. We only really need the compound page logic for higher-order pages which can be mapped into user pagetables and placed under direct-io. This covers hugetlb pages and, conceivably, soundcard DMA buffers which were allocated with a higher-order allocation but which weren't marked PageReserved. The patch arranges for the hugetlb implementation to allocate its pages with compound page metadata, and all other higher-order allocations go back to the old way. (Andrea supplied the GFP_LEVEL_MASK fix)
-
Andrew Morton authored
Rework the code layout a bit. No logic change.
-
Andrew Morton authored
From: Jens Axboe <axboe@suse.de> Takashi did some nice latency testing of the current kernel (with -mm writeback changes), and the biggest offender in general core is mpage_writepages().
-
Andrew Morton authored
The radix-tree walk for writeback has a couple of problems:

a) It always scans a file from its first dirty page, so if someone is repeatedly dirtying the front part of a file, pages near the end may be starved of writeout. (Well, not completely: the `kupdate' function will write an entire file once the file's dirty timestamp has expired).

b) When the disk queues are huge (10000 requests), there can be a very large number of locked pages. Scanning past these in writeback consumes quite some CPU time.

So in each address_space we record the index at which the last batch of writeout terminated and start the next batch of writeback from that point.
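A toy model of that cursor, with a flat array of dirty flags standing in for the radix tree; the writeback_index field here mirrors the per-address_space index the patch records, but the code is purely illustrative.

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_PAGES 16

    struct mapping_model {
        bool dirty[NR_PAGES];
        unsigned long writeback_index;   /* where the last batch ended */
    };

    /* Write back at most 'batch' dirty pages, starting at writeback_index. */
    static void writeback_batch(struct mapping_model *m, int batch)
    {
        unsigned long idx = m->writeback_index;

        for (int scanned = 0; scanned < NR_PAGES && batch > 0; scanned++) {
            if (m->dirty[idx]) {
                printf("writing page %lu\n", idx);
                m->dirty[idx] = false;
                batch--;
            }
            idx = (idx + 1) % NR_PAGES;   /* wrap to the start of the file */
        }
        m->writeback_index = idx;         /* next batch starts here, not at 0 */
    }

    int main(void)
    {
        struct mapping_model m = { .writeback_index = 0 };

        /* Someone keeps redirtying the front of the file... */
        for (int i = 0; i < 6; i++)
            m.dirty[i] = true;
        m.dirty[14] = true;               /* ...but the tail is dirty too. */

        writeback_batch(&m, 4);
        m.dirty[0] = m.dirty[1] = true;   /* front redirtied between batches */
        writeback_batch(&m, 4);           /* tail page 14 still gets written */
        return 0;
    }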
-
Andrew Morton authored
If pdflush hits a locked-and-clean buffer in __block_write_full_page() it will just pass over the buffer. Typically the buffer is an ext3 data=ordered buffer which is being written by kjournald, but a similar thing can happen with blockdev buffers and ll_rw_block(). This is bad because the buffer is still under I/O and a subsequent fsync's fdatawait() needs to know about it. It is not practical to tag the page for writeback - only the submitter of the I/O can do that, because the submitter has control of the end_io handler. So instead, redirty the page so a subsequent fsync's fdatawrite() will wait on the underway I/O. There is a risk that pdflush::background_writeout() will lock up, repeatedly trying and failing to write the same page. This is prevented by ensuring that background_writeout() always throttles when it made no progress.
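A toy model of that no-progress guard; all names are made up, and the point is only that a pass which writes nothing must throttle rather than retry immediately.

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_PAGES 8

    static bool dirty[NR_PAGES] = { true, true, true };
    static bool busy[NR_PAGES]  = { false, true, false };  /* page 1 is locked elsewhere */

    /* One background writeback pass: returns how many pages it actually wrote. */
    static int writeback_pass(void)
    {
        int written = 0;

        for (int i = 0; i < NR_PAGES; i++) {
            if (!dirty[i])
                continue;
            if (busy[i])
                continue;       /* can't write it now; leave it dirty for fsync */
            dirty[i] = false;
            written++;
        }
        return written;
    }

    int main(void)
    {
        for (int pass = 0; pass < 5; pass++) {
            int written = writeback_pass();
            printf("pass %d wrote %d page(s)\n", pass, written);
            if (written == 0) {
                /* No progress: throttle (sleep/wait) instead of spinning. */
                printf("no progress, throttling before retry\n");
                break;
            }
        }
        return 0;
    }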
-
Andrew Morton authored
fdatasync can fail to wait on some pages due to a race. If some task (eg pdflush) is flushing the same mapping it can remove a page's dirty tag but not then mark that page as being under writeback, because pdflush hit a locked buffer in __block_write_full_page(). This will happen because kjournald is writing the buffer. In this situation __block_write_full_page() will redirty the page so that fsync notices it, but there is a window where the page eludes the radix tree dirty page walk. Consequently a concurrent fsync will fail to notice the page when walking the radix tree's dirty pages. The approach taken by this patch is to leave the page marked as dirty in the radix tree while ->writepage is working out what to do with it. This ensures that a concurrent write-for-sync will successfully locate the page and will then block in lock_page() until the non-write-for-sync code has finished altering the page state.
-
Andrew Morton authored
Remove the now-unneeded page.list field.
-
Andrew Morton authored
Switch the m68k pointer-table code over to page->lru.
-
Andrew Morton authored
Switch the ARM `small_page' code over to page->lru.
-
Andrew Morton authored
The compound page logic is using page->lru, and these fields will get scribbled on in various places, so switch the compound page logic over to using ->mapping and ->private.
-
Andrew Morton authored
The address_space.readpages() function currently takes a list of pages, strung together via page->list. Switch it to using page->lru. This changes the API into filesystems.
-
Andrew Morton authored
Switch it to ->lru
-
Andrew Morton authored
Switch them over to page.lru
-
Andrew Morton authored
Switch the page allocator over to using page.lru for the buddy lists.
-
Andrew Morton authored
slab.c is using page->list. Switch it over to using page->lru so we can remove page.list.
-
Andrew Morton authored
This code is playing with page->lru from pages which came from slab. But to remove page->list we need to convert slab over to using page->lru. So we cannot allow the i386 pagetable code to go scribbling on the ->lru field of active slab pages. This optimisation was pretty thin, and it is more important to shrink the pageframe (on all architectures).
-
Andrew Morton authored
Remove remaining references to address_space.clean_pages.
-
Andrew Morton authored
Instead, use a radix-tree walk of the pages which are tagged as being under writeback. The new function wait_on_page_writeback_range() was generalised out of filemap_fdatawait(). We can later use this to provide concurrent fsync of just a section of a file.
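A sketch of the generalisation, with an array of flags standing in for pages tagged as under writeback and a stub standing in for the real per-page wait; the function names mirror the commit, the bodies do not.

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_PAGES 32

    static bool under_writeback[NR_PAGES];

    static void wait_on_page(unsigned long index)
    {
        /* Stand-in for blocking until the page's writeback completes. */
        printf("waiting on page %lu\n", index);
        under_writeback[index] = false;
    }

    static void wait_on_page_writeback_range(unsigned long start, unsigned long end)
    {
        for (unsigned long i = start; i <= end && i < NR_PAGES; i++)
            if (under_writeback[i])
                wait_on_page(i);
    }

    /* The whole-file wait is just the range wait over every index. */
    static void filemap_fdatawait(void)
    {
        wait_on_page_writeback_range(0, NR_PAGES - 1);
    }

    int main(void)
    {
        under_writeback[3] = under_writeback[20] = true;

        /* A future part-file fsync could wait on just a section... */
        wait_on_page_writeback_range(0, 7);
        /* ...while a full fdatawait covers the rest of the file. */
        filemap_fdatawait();
        return 0;
    }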
-
Andrew Morton authored
Now remove address_space.io_pages.
-
Andrew Morton authored
Juggle dirty pages and dirty inodes and dirty superblocks and various different writeback modes and livelock avoidance and fairness to recover from the loss of mapping->io_pages.
-
Andrew Morton authored
Move everything over to walking the radix tree via the PAGECACHE_TAG_DIRTY tag. Remove address_space.dirty_pages.
-
Andrew Morton authored
Arrange for under-writeback pages to be marked thus in their pagecache radix tree.
-
Andrew Morton authored
Arrange for all dirty pagecache pages to be tagged as dirty within their radix tree.
-
Andrew Morton authored
Intro to these patches:

- Major surgery against the pagecache, radix-tree and writeback code. This work is to address the O_DIRECT-vs-buffered data exposure horrors which we've been struggling with for months. As a side-effect, 32 bytes are saved from struct inode and eight bytes are removed from struct page. At a cost of approximately 2.5 bits per page in the radix tree nodes on 4k pagesize, assuming the pagecache is densely populated. Not all pages are pagecache; other pages gain the full 8 byte saving. This change will break any arch code which is using page->list and will also break any arch code which is using page->lru of memory which was obtained from slab.

The basic problem which we (mainly Daniel McNeil) have been struggling with is in getting a really reliable fsync() across the page lists while other processes are performing writeback against the same file. It's like juggling four bars of wet soap with your eyes shut while someone is whacking you with a baseball bat. Daniel pretty much has the problem plugged but I suspect that's just because we don't have testcases to trigger the remaining problems. The complexity and additional locking which those patches add is worrisome.

So the approach taken here is to remove the page lists altogether and replace the list-based writeback and wait operations with in-order radix-tree walks.

The radix-tree code has been enhanced to support "tagging" of pages, for later searches for pages which have a particular tag set. This means that we can ask the radix tree code "find me the next 16 dirty pages starting at pagecache index N" and it will do that in O(log64(N)) time.

This affects I/O scheduling potentially quite significantly. It is no longer the case that the kernel will submit pages for I/O in the order in which the application dirtied them. We instead submit them in file-offset order all the time. This is likely to be advantageous when applications are seeking all over a large file randomly writing small amounts of data. I haven't performed much benchmarking, but tiobench random write throughput seems to be increased by 30%. Other tests appear to be unaltered. dbench may have got 10-20% quicker, but it's variable.

There is one large file which everyone seeks all over randomly writing small amounts of data: the blockdev mapping which caches filesystem metadata. The kernel's IO submission patterns for this are now ideal.

Because writeback and wait-for-writeback use a tree walk instead of a list walk they are no longer livelockable. This probably means that we no longer need to hold i_sem across O_SYNC writes and perhaps fsync() and fdatasync(). This may be beneficial for databases: multiple processes writing and syncing different parts of the same file at the same time can now all submit and wait upon writes to just their own little bit of the file, so we can get a lot more data into the queues.

It is trivial to implement a part-file-fdatasync() as well, so applications can say "sync the file from byte N to byte M", and multiple applications can do this concurrently. This is easy for ext2 filesystems, but probably needs lots of work for data-journalled filesystems and XFS, and it probably doesn't offer much benefit over an i_sem-less O_SYNC write.

These patches can end up making ext3 (even) slower:

    for i in 1 2 3 4
    do
        dd if=/dev/zero of=$i bs=1M count=2000 &
    done

runs awfully slow on SMP. This is, yet again, because all the file blocks are jumbled up and the per-file linear writeout causes tons of seeking.

The above test runs sweetly on UP because on UP we don't allocate blocks to different files in parallel. Mingming and Badari are working on getting block reservation working for ext3 (preallocation on steroids). That should fix ext3 up.

This patch:

- Later, we'll need to access the radix trees from inside disk I/O completion handlers. So make mapping->page_lock irq-safe. And rename it to tree_lock to reliably break any missed conversions.
-
Andrew Morton authored
Add radix-tree tagging so we can look up dirty or writeback pages in O(log64(n)) time. Each radix-tree node gains two bits for each slot: one for page dirtiness and one for page writebackness. If a tag bit is set on a leaf node, it indicates that item at the corresponding slot is tagged (say, a dirty page). If a tag bit is set in a non-leaf node it indicates that the same tag bit is set in the subtree which lies under the corresponding slot. ie: "there is a dirty page under here somewhere, but you need to search down further to find it". A gang lookup function is provided which can walk the radix tree in logarithmic time looking for items which are tagged, starting from a specified offset. We use this for in-order searches for dirty or writeback pages. There is a userspace test harness for this code at http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz
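A simplified, self-contained model of the tagging scheme: two levels instead of a full 64-way tree, one tag instead of two, but the same trick of propagating a summary bit upward so a search can skip untagged subtrees. Tag clearing (which the real code also propagates upward) is omitted here; the real patch's radix_tree_tag_set() and gang lookup operate on per-slot tag bits in every node, as described above.

    #include <stdbool.h>
    #include <stdio.h>

    #define SLOTS_PER_NODE 64
    #define NR_NODES       8
    #define NR_ITEMS       (SLOTS_PER_NODE * NR_NODES)

    static bool leaf_tag[NR_ITEMS];   /* tag bit per item (e.g. "dirty page")      */
    static bool node_tag[NR_NODES];   /* "a tagged item lies somewhere below here" */

    static void tag_set(unsigned long index)
    {
        leaf_tag[index] = true;
        node_tag[index / SLOTS_PER_NODE] = true;   /* propagate up the tree */
    }

    /* Find the first tagged item at or after 'start'; returns -1 if none. */
    static long find_next_tagged(unsigned long start)
    {
        for (unsigned long node = start / SLOTS_PER_NODE; node < NR_NODES; node++) {
            if (!node_tag[node])
                continue;                           /* skip untagged subtree */
            unsigned long from = (node == start / SLOTS_PER_NODE)
                                     ? start : node * SLOTS_PER_NODE;
            for (unsigned long i = from; i < (node + 1) * SLOTS_PER_NODE; i++)
                if (leaf_tag[i])
                    return (long)i;
        }
        return -1;
    }

    int main(void)
    {
        tag_set(3);
        tag_set(200);
        tag_set(450);

        /* In-order search for tagged items, starting from index 0. */
        for (long idx = 0; (idx = find_next_tagged((unsigned long)idx)) >= 0; idx++)
            printf("tagged item at index %ld\n", idx);
        return 0;
    }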
-
Andrew Morton authored
This function is setting page->mapping = swapper_space, but isn't actually adding the page to swapcache. This triggers soon-to-be-added BUGs in the radix tree code. So temporarily add these pages to swapcache for real. Also, make rw_swap_page_sync() go away if it has no callers.
-
Andrew Morton authored
From: Suparna Bhattacharya <suparna@in.ibm.com>, Daniel McNeil <daniel@osdl.org>

This patch ensures that when the DIO code falls back to buffered i/o after having submitted part of the i/o, then buffered i/o is issued only for the remaining part of the request (i.e. the part not already covered by DIO), rather than redoing the entire i/o. Now, instead of returning written == -ENOTBLK, generic_file_direct_IO returns the number of bytes already handled by DIO, so that the caller knows how much of the I/O is left to be handled via fallback to buffered write.

We need to be careful not to access dio fields if it's possible that the dio could already have been freed asynchronously during i/o completion. A tricky part of this involves plugging the window between the decrement of bio_count and accessing dio->waiter during i/o completion where the dio could get freed by the submission path. This potential "bio_count race" was tackled (by Daniel) by changing bio_list_lock into bio_lock and using that for all the bio fields. Now bio_count and bios_in_flight have been converted from atomics into ints and are both protected by the bio_lock. The race in finished_one_bio() could thus be fixed by leaving the bio_count at 1 until after the dio_complete() and then doing the bio_count decrement and wakeup holding the bio_lock. It appears that shifting to the spin_lock instead of atomic_inc/decs is OK performance-wise as well.

Update: An AIO O_DIRECT request was extending the file, so it was done synchronously. However, the request got an EFAULT and direct_io_worker() was calling aio_complete() on the iocb and returning the EFAULT. When io_submit_one() got the EFAULT return, it assumed it had to call aio_complete() since the i/o never got queued. The fix is for direct_io_worker() to only call aio_complete() when the upper layer is going to return -EIOCBQUEUED, and not when getting errors that are being returned to the submit path.
-
Andrew Morton authored
From: Suparna Bhattacharya <suparna@in.ibm.com>

Fixes the following remaining issues with the DIO code:

1. During DIO file extends, intermediate writes could extend i_size, exposing unwritten blocks to intermediate reads. (Soln: Don't drop i_sem for file extends)

2. AIO-DIO file extends may update i_size before I/O completes, exposing unwritten blocks to intermediate reads. (Soln: Force AIO-DIO file extends to be synchronous)

3. AIO-DIO writes to holes call aio_complete() before falling back to buffered I/O! (Soln: Avoid calling aio_complete() if -ENOTBLK)

4. AIO-DIO writes to an allocated region followed by a hole fall back to buffered i/o without waiting for already submitted i/o to complete; they might return to user-space, which could overwrite the buffer contents while they are still being written out by the kernel. (Soln: Always wait for submitted i/o to complete before falling back to buffered i/o)
-
Andrew Morton authored
From: Badari Pulavarty <pbadari@us.ibm.com> 1) blkdev_direct_IO() calls blockdev_direct_IO() instead of blockdev_direct_IO_no_locking(). 2) writev entry point is generic_file_writev() which grabs i_sem. It should use generic_file_write_nolock() instead.
-
Andrew Morton authored
Fix a race which was identified by Daniel McNeil <daniel@osdl.org> If a buffer_head is under I/O due to JBD's ordered data writeout (which uses ll_rw_block()) then either filemap_fdatawrite() or filemap_fdatawait() need to wait on the buffer's existing I/O. Presently neither will do so, because __block_write_full_page() will not actually submit any I/O and will hence not mark the page as being under writeback. The best-performing fix would be to somehow mark the page as being under writeback and defer waiting for the ll_rw_block-initiated I/O until filemap_fdatawait()-time. But this is hard, because in __block_write_full_page() we do not have control of the buffer_head's end_io handler. Possibly we could make JBD call into end_buffer_async_write(), but that gets nasty. This patch makes __block_write_full_page() wait for any buffer_head I/O to complete before inspecting the buffer_head state. It only does this in the case where __block_write_full_page() was called for a "data-integrity" write: (wbc->sync_mode != WB_SYNC_NONE). Probably it doesn't matter, because kjournald is currently submitting (or has already submitted) all dirty buffers anyway.
-