Commits · 5409c2b52ffea911ba1e47b5cbf8d911efb5d0c6 · Kirill Smelkov / linux

19 May, 2002 25 commits

[PATCH] fix ext3 race with writeback · 5409c2b5

Andrew Morton authored May 19, 2002

The ext3-no-steal patch has exposed a long-standing race in ext3. It
has been there all the time in 2.4, but never triggered until some
timing change in the ext3-no-steal patch exposed it. The race was not
present in 2.2 because 2.2's bdflush runs inside lock_kernel().

The problem is that when ext3 is shuffling a buffer between journalling
lists there is a small window where the buffer is marked BH_dirty.
Another CPU can grab it, mark it clean and write it out. Then ext3
puts the buffer onto a list of buffers which are expected to be dirty,
and gets confused later on when the buffer turns out to be clean.

The patch from Stephen records the expected dirtiness of the buffer in
a local variable, so BH_dirty is not transiently set while ext3
shuffles.
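
The race and the fix can be modelled in userspace roughly as follows. This is
only a sketch, not the ext3 code: `toy_buffer`, `toy_writeback()` and both
`toy_refile_*()` helpers are hypothetical names, and the `window` callback
stands in for another CPU running at the unlucky moment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer_head; not the kernel structure. */
struct toy_buffer {
	bool dirty;     /* models BH_dirty, which generic writeback looks at */
	bool jbd_dirty; /* models dirtiness that only the journal knows about */
	int list;       /* which journalling list the buffer is on */
};

enum { TOY_LIST_NONE, TOY_LIST_METADATA, TOY_LIST_FORGET };

/* Models writeback on another CPU: writes out any buffer it sees dirty. */
static void toy_writeback(struct toy_buffer *bh)
{
	if (bh->dirty)
		bh->dirty = false; /* buffer written and marked clean */
}

/* Racy variant: dirtiness is parked in the shared flag during the move.
 * `window` runs at the point where another CPU could get in. */
static void toy_refile_racy(struct toy_buffer *bh, int new_list,
			    void (*window)(struct toy_buffer *))
{
	assert(bh->list != TOY_LIST_NONE);
	if (bh->jbd_dirty) {
		bh->jbd_dirty = false;
		bh->dirty = true; /* transiently visible to writeback */
	}
	bh->list = TOY_LIST_NONE;
	if (window != NULL)
		window(bh);
	bh->list = new_list;
	if (bh->dirty) {
		bh->dirty = false;
		bh->jbd_dirty = true;
	}
}

/* Fixed variant: the expected dirtiness lives in a local variable. */
static void toy_refile_fixed(struct toy_buffer *bh, int new_list,
			     void (*window)(struct toy_buffer *))
{
	bool was_dirty = bh->jbd_dirty;

	assert(bh->list != TOY_LIST_NONE);
	bh->jbd_dirty = false;
	bh->list = TOY_LIST_NONE;
	if (window != NULL)
		window(bh);
	bh->list = new_list;
	bh->jbd_dirty = was_dirty;
}
```

With `toy_writeback` passed as the window, the racy variant ends up with a
clean buffer on a list that expects dirty ones, while the fixed variant keeps
the journal's view of dirtiness intact.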

[PATCH] fix ext3 buffer-stealing · d9ae0cee

Andrew Morton authored May 19, 2002

Patch from sct fixes a long-standing (I did it!) and rather complex
problem with ext3.

The problem is to do with buffers which are continually being dirtied
by an external agent.  I had code in there (for easily-triggerable
livelock avoidance) which steals the buffer from checkpoint mode and
reattaches it to the running transaction.  This violates ext3 ordering
requirements - it can permit journal space to be reclaimed before the
relevant data has really been written out.

Also, we do have to reliably get a lock on the buffer when moving it
between lists and inspecting its internal state.  Otherwise a competing
read from the underlying block device can trigger an assertion failure,
and a competing write to the underlying block device can confuse ext3
journalling state completely.

[PATCH] improved I/O scheduling for indirect blocks · 799391cc

Andrew Morton authored May 19, 2002

Fixes a performance problem with many-small-file writeout.

At present, files are written out via their mapping and their indirect
blocks are written out via the blockdev mapping.  As we know that
indirects are disk-adjacent to the data it is better to start I/O
against the indirects at the same time as the data.

The delalloc paths have code in ext2_writepage() which recognises when
the target page->index was at an indirect boundary and does an explicit
hunt-and-write against the neighbouring indirect block.  Which is
ideal.  (Unless the file was dirtied seekily and the page which is next
to the indirect was not dirtied).

This patch does it the other way: when we start writeback against a
mapping, also start writeback against any dirty buffers which are
attached to mapping->private_list.  Let the elevator take care of the
rest.

The patch makes a number of tuning changes to the writeback path in
fs-writeback.c.  This is very fiddly code: getting the throughput
tuned, getting the data-integrity "sync" operations right, avoiding
most of the livelock opportunities, getting the `kupdate' function
working efficiently, keeping it all at least somewhat comprehensible.

An important intent here is to ensure that metadata blocks for inodes
are marked dirty before writeback starts working the blockdev mapping,
so all the inode blocks are efficiently written back.

The patch removes try_to_writeback_unused_inodes(), which became
unreferenced in vm-writeback.patch.

The patch has a tweak in ext2_put_inode() to prevent ext2 from
incorrectly dropping its preallocation window in response to a random
iput().


Generally, many-small-file writeout is a lot faster than 2.5.7 (which
is linux-before-I-futzed-with-it).  The workload which was optimised was

	tar xfz /nfs/mountpoint/linux-2.4.18.tar.gz ; sync

on mem=128M and mem=2048M.

With these patches, 2.5.15 is completing in about 2/3 of the time of
2.5.7.  But it is only a shade faster than 2.4.19-pre7.  Why is 2.5.7
so much slower than 2.4.19?  Not sure yet.

Heavy dbench loads (dbench 32 on mem=128M) are slightly faster than
2.5.7 and significantly slower than 2.4.19.  It appears that the cause
is poor read throughput at the later stages of the run, because there
are background writeback threads operating at the same time.

The 2.4.19-pre8 write scheduling manages to stop writeback during the
latter stages of the dbench run in a way which I haven't been able to
sanely emulate yet.  It may not be desirable to do this anyway - it's
optimising for the case where the files are about to be deleted.  But
it would be good to find a way of "pausing" the writeback for a few
seconds to allow readers to get an interval of decent bandwidth.

tiobench throughput is basically the same across all recent kernels.
CPU load on writes is down maybe 30% in 2.5.15.

[PATCH] ext2: preread inode backing blocks · a9f525e6

Andrew Morton authored May 19, 2002

When ext2 creates a new inode, perform an asynchronous preread against
its backing block.

Without this patch, many-file writeout gets stalled by having to read
many individual inode table blocks in the middle of writeback.

It's worth about a 20% gain in writeback bandwidth for the many-file
writeback case.

ext3 already reads the inode's backing block in
ext3_new_inode->ext3_mark_inode_dirty, so no change is needed there.

A backport to 2.4 would make sense.

[PATCH] writeback tuning · acb5f6f9

Andrew Morton authored May 19, 2002

Tune up the VM-based writeback a bit.

- Always use the multipage clustered-writeback function from within
  shrink_cache(), even if the page's mapping has a NULL ->vm_writeback().  So
  clustered writeback is turned on for all address_spaces, not just ext2.

  Subtle effect of this change: it is now the case that *all* writeback
  proceeds along the mapping->dirty_pages list.  The orderedness of the page
  LRUs no longer has an impact on disk scheduling.  So we only have one list
  to keep well-sorted rather than two, and churning pages around on the LRU
  will no longer damage write bandwidth - it's all up to the filesystem.

- Decrease the clustered writeback from 1024 pages(!) to 32 pages.

  (1024 was a leftover from when this code was always dispatching writeback
  to a pdflush thread).

- Fix wakeup_bdflush() so that it actually does write something (duh).

  do_wp_page() needs to call balance_dirty_pages_ratelimited(), so we
  throttle mmap page-dirtiers in the same way as write(2) page-dirtiers.
  This may make wakeup_bdflush() obsolete, but it doesn't hurt.

- Converts generic_vm_writeback() to directly call ->writeback_mapping(),
  rather than going through writeback_single_inode().  This prevents memory
  allocators from blocking on the inode's I_LOCK.  But it does mean that two
  processes can be writing pages from the same mapping at the same time.  If
  filesystems care about this (for layout reasons) then they should serialise
  in their ->writeback_mapping a_op.

  This means that memory-allocators will writeback only pages, not pages
  and inodes.  There are no locks in that writeback path (except for request
  queue exhaustion).  Reduces memory allocation latency.

- Implement new background_writeback function, which when kicked off
  will perform writeback until dirty memory falls below the background
  threshold.

- Put written-back pages onto the remote end of the page LRU.  It
  does this in the slow-and-stupid way at present.  pagemap_lru_lock
  stress-relief is planned...

- Remove the funny writeback_unused_inodes() stuff from prune_icache().
  Writeback from wakeup_bdflush() and the `kupdate' function now just
  naturally cleanses the oldest inodes so we don't need to do anything
  there.

- Dirty memory balancing is still using magic numbers: "after you
  dirtied your 1,000th page, go write 1,500".  Obviously, this needs
  more work.
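
The "magic numbers" scheme in the last item can be sketched like this.  It is
a toy model only: `toy_dirtier`, the two constants and the function name are
hypothetical, and the real balance_dirty_pages_ratelimited() is not shown in
this log.

```c
#include <assert.h>

#define TOY_RATELIMIT_PAGES 1000UL /* "after you dirtied your 1,000th page" */
#define TOY_WRITE_CHUNK     1500UL /* "go write 1,500" */

/* Hypothetical per-dirtier counter. */
struct toy_dirtier {
	unsigned long dirtied_since_check;
};

/* Called once per dirtied page.  Returns how many pages the caller
 * should write back right now; this is 0 for most calls. */
static unsigned long toy_balance_dirty_pages_ratelimited(struct toy_dirtier *d)
{
	assert(d->dirtied_since_check < TOY_RATELIMIT_PAGES);
	if (++d->dirtied_since_check < TOY_RATELIMIT_PAGES)
		return 0;
	d->dirtied_since_check = 0;
	return TOY_WRITE_CHUNK;
}
```

Because the check is per call, the same helper would throttle write(2) and
mmap dirtiers alike, provided both call it.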

[PATCH] pdflush exclusion · 17a74e88

Andrew Morton authored May 19, 2002

Use the pdflush exclusion infrastructure to ensure that only one
pdflush thread is ever performing writeback against a particular
request_queue.

This works rather well.  It requires a lot of activity against a lot of
disks to cause more pdflush threads to start up.  Possibly the
thread-creation logic is a little weak: it starts more threads when a
pdflush thread goes back to sleep.  It may be better to start new
threads within pdflush_operation().

All non-request_queue-backed address_spaces share the global
default_backing_dev_info structure.  So at present only a single
pdflush instance will be available for background writeback of *all*
NFS filesystems (for example).

If there is benefit in concurrent background writeback for multiple NFS
mounts then NFS would need to create per-mount backing_dev_info
structures and install those into new inode's address_spaces in some
manner.

[PATCH] pdflush exclusion infrastructure · 1f6acea0

Andrew Morton authored May 19, 2002

Collision avoidance for pdflush threads.

Turns the request_queue-based `unsigned long ra_pages' into a structure
which contains ra_pages as well as a longword.

That longword is used to record the fact that a pdflush thread is
currently writing something back against this request_queue.

Avoids the situation where several pdflush threads are sleeping on the
same request_queue.

This patch provides only the infrastructure for the pdflush exclusion.
This infrastructure gets used in pdflush-single.patch
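
A userspace sketch of the idea, using C11 atomics in place of the kernel's bit
operations.  The structure layout, the bit value and the `toy_` names are
assumptions made for illustration; only "ra_pages plus a longword" comes from
the description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: ra_pages plus one longword of state. */
struct toy_backing_dev_info {
	unsigned long ra_pages; /* readahead setting, as before */
	atomic_ulong state;     /* bit 0: a pdflush thread is writing back */
};

#define TOY_BDI_PDFLUSH 1UL

/* Try to claim the queue.  Returns true if the caller is now the only
 * writeback thread working against it, false if someone else got there. */
static bool toy_writeback_acquire(struct toy_backing_dev_info *bdi)
{
	unsigned long old = atomic_fetch_or(&bdi->state, TOY_BDI_PDFLUSH);

	return (old & TOY_BDI_PDFLUSH) == 0;
}

/* Give the queue back.  Must only be called by the current owner. */
static void toy_writeback_release(struct toy_backing_dev_info *bdi)
{
	unsigned long old = atomic_fetch_and(&bdi->state, ~TOY_BDI_PDFLUSH);

	assert(old & TOY_BDI_PDFLUSH);
}
```

A thread which fails to acquire simply moves on to another queue instead of
sleeping behind the current writer.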

[PATCH] dirty inode management · 610c5ab8

Andrew Morton authored May 19, 2002

Fix the "race with umount" in __sync_list().  __sync_list() no longer
puts inodes onto a local list while writing them out.

The super_block.s_dirty list is kept time-ordered.  Mappings which
have the "oldest" ->dirtied_when are kept at sb->s_dirty.prev.

So the time-based writeback (kupdate) can just bale out when it
encounters a not-old-enough mapping, rather than walking the entire
list.

dirtied_when is set on the *first* dirtying of a mapping.  So once the
mapping is marked dirty it strictly retains its place on s_dirty until
it reaches the oldest end and is written out.  So frequently-dirtied
mappings don't stay dirty at the head of the list for all time.

That local inode list was there for livelock avoidance.  Livelock is
instead avoided by looking at each mapping's ->dirtied_when.  If we
encounter one which was dirtied after this invocation of __sync_list(),
then just bale out - the sync functions are only required to write out
data which was dirty at the time when they were called.

Keeping the s_dirty list in time-order is the right thing to do anyway
- so all the various writeback callers always work against the oldest
data.
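
The early bale-out can be sketched with a plain array standing in for the
s_dirty list, oldest entry first.  `toy_mapping` and `toy_sync_list()` are
hypothetical; the same loop serves both cases depending on the cutoff: "now
minus the kupdate age" for time-based writeback, or "the time sync was called"
for livelock avoidance.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a dirty mapping. */
struct toy_mapping {
	unsigned long dirtied_when; /* set on the first dirtying only */
	int written;
};

/* `list` is sorted oldest first.  Writes back every mapping dirtied at
 * or before `cutoff` and stops at the first younger one, so the rest of
 * the list is never walked.  Returns the number written. */
static size_t toy_sync_list(struct toy_mapping *list, size_t n,
			    unsigned long cutoff)
{
	size_t done = 0;

	for (size_t i = 0; i < n; i++) {
		/* the list must really be time-ordered */
		assert(i == 0 ||
		       list[i - 1].dirtied_when <= list[i].dirtied_when);
		if (list[i].dirtied_when > cutoff)
			break; /* not old enough: bale out */
		list[i].written = 1;
		done++;
	}
	return done;
}
```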

[PATCH] larger b_size, and misc fixlets · 2d8f24d0

Andrew Morton authored May 19, 2002

Miscellany.

- make the printk in buffer_io_error() sector_t-aware.

- Some buffer.c cleanups from AntonA: remove a couple of !uptodate
  checks, and set a new buffer's b_blocknr to -1 in a more sensible
  place.

- Make buffer_head.b_size a 32-bit quantity.  Needed for 64k pagesize
  on ia64.  Does not increase sizeof(struct buffer_head).
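
A small illustration of why a wider b_size is needed, assuming the old field
was 16 bits wide (the text above implies this but does not say so).  The two
helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Models storing a block size into a 16-bit field: 64k does not fit. */
static uint16_t toy_store_b_size16(uint32_t size)
{
	return (uint16_t)size; /* 65536 wraps to 0 */
}

/* Models storing the same size into a 32-bit field. */
static uint32_t toy_store_b_size32(uint32_t size)
{
	return size;
}
```

A 64k buffer (65536 bytes) is one more than the largest value a 16-bit field
can hold, so it would read back as zero.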

[PATCH] reiserfs locking fix · 943acef9

Andrew Morton authored May 19, 2002

reiserfs is using b_inode_buffers and fsync_buffers_list() for
attaching dependent buffers to its journal, for writeout prior to
commit.

This worked OK when a global lock was used everywhere, but the locking
is currently incorrect - try_to_free_buffers() is taking a different
lock when detaching buffers from their "foreign" inode.  So list_head
corruption could occur on SMP.

The patch implements a reiserfs_releasepage() which holds the
journal-wide buffer lock while it runs try_to_free_buffers(), so all
those list_heads are protected.  The lock is held across the
try_to_free_buffers() call as well, so nobody will attach one of this
page's buffers to a list while try_to_free_buffers() is running.

[PATCH] fix dirty page management · 0f9268b8

Andrew Morton authored May 19, 2002

This fixes a bug in ext3 - when ext3 decides that it wants to fail its
writepage(), it is running SetPageDirty().  But ->writepage has just put
the page on ->clean_pages.  The page ends up dirty, on ->clean_pages
and the normal writeback paths don't know about it any more.

So run set_page_dirty() instead, to place the page back on the dirty
list.

And in move_from_swap_cache(), shuffle the page across to ->dirty_pages
so that it's eligible for writeout.  ___add_to_page_cache() forgets to
look at the page state when deciding which list to attach it to.

All SetPageDirty() callers otherwise look OK.
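
The difference between the two calls can be modelled as below.  This is a toy:
the `toy_` names are hypothetical and the list is reduced to an enum, but it
shows why a page dirtied with the flag-only call becomes invisible to
writeback.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_list { TOY_CLEAN_PAGES, TOY_DIRTY_PAGES };

/* Hypothetical model of a page and the mapping list it sits on. */
struct toy_page {
	bool dirty;
	enum toy_list on_list;
};

/* Models SetPageDirty(): flips the flag and nothing else. */
static void toy_SetPageDirty(struct toy_page *p)
{
	p->dirty = true;
}

/* Models set_page_dirty(): sets the flag and moves the page onto the
 * dirty list so that writeback can find it. */
static void toy_set_page_dirty(struct toy_page *p)
{
	p->dirty = true;
	p->on_list = TOY_DIRTY_PAGES;
}

/* Writeback only walks the dirty list. */
static bool toy_writeback_sees(const struct toy_page *p)
{
	return p->on_list == TOY_DIRTY_PAGES;
}
```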

[PATCH] i_dirty_buffers locking fix · 43152186

Andrew Morton authored May 19, 2002

This fixes a race between try_to_free_buffers' call to
__remove_inode_queue() and other users of b_inode_buffers
(fsync_inode_buffers and mark_buffer_dirty_inode()).  They are
presently taking different locks.

The patch relocates and redefines and clarifies(?) the role of
inode.i_dirty_buffers.

The 2.4 definition of i_dirty_buffers is "a list of random buffers
which is protected by a kernel-wide lock".  This definition needs to be
narrowed in the 2.5 context.  It is now

"a list of buffers from a different mapping, protected by a lock within
that mapping".  This list of buffers is specifically for fsync().

As this is a "data plane" operation, all the structures have been moved
out of the inode and into the address_space.  So address_space now has:

list_head private_list;

     A list, available to the address_space for any purpose.  If
     that address_space chooses to use the helper functions
     mark_buffer_dirty_inode and sync_mapping_buffers() then this list
     will contain buffer_heads, attached via
     buffer_head.b_assoc_buffers.

     If the address_space does not call those helper functions
     then the list is free for other usage.  The only requirement is
     that the list be list_empty() at destroy_inode() time.

     At least, this is the objective.  At present,
     generic_file_write() will call generic_osync_inode(), which
     expects that list to contain buffer_heads.  So private_list isn't
     useful for anything else yet.

spinlock_t private_lock;

     A spinlock, available to the address_space.

     If the address_space is using try_to_free_buffers(),
     mark_inode_dirty_buffers() and fsync_inode_buffers() then this
     lock is used to protect the private_list of *other* mappings which
     have listed buffers from *this* mapping onto themselves.

     That is: for buffer_heads, mapping_A->private_lock does not
     protect mapping_A->private_list!  It protects the b_assoc_buffers
     list from buffers which are backed by mapping_A and it protects
     mapping_B->private_list, mapping_C->private_list, ...

     So what we have here is a cross-mapping association.  S_ISREG
     mappings maintain a list of buffers from the blockdev's
     address_space which they need to know about for a successful
     fsync().  The locking follows the buffers: the lock is in the
     blockdev's mapping, not in the S_ISREG file's mapping.

     For address_spaces which use try_to_free_buffers,
     private_lock is also (and quite unrelatedly) used for protection
     of the buffer ring at page->private.  Exclusion between
     try_to_free_buffers(), __get_hash_table() and
     __set_page_dirty_buffers().  This is in fact its major use.

address_space *assoc_mapping

    Sigh.  This is the address of the mapping which backs the
    buffers which are attached to private_list.  It's here so that
    generic_osync_inode() can locate the lock which protects this
    mapping's private_list.  Will probably go away.


A consequence of all the above is that:

    a) All the buffers at a mapping_A's ->private_list must come
       from the same mapping, mapping_B.  There is no requirement that
       mapping_B be a blockdev mapping, but that's how it's used.

       There is a BUG() check in mark_buffer_dirty_inode() for this.

    b) blockdev mappings never have any buffers on ->private_list.
       It just never happens, and doesn't make a lot of sense.
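
Rule (a) can be sketched as follows.  Locking is left out on purpose (the
real code would hold the backing mapping's private_lock), a counter stands in
for private_list, and every `toy_` name is hypothetical.  Where the kernel has
a BUG() check, the sketch just returns an error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an address_space. */
struct toy_mapping {
	struct toy_mapping *assoc_mapping; /* backs the buffers we list */
	size_t nr_private;                 /* stands in for private_list */
};

/* Hypothetical model of a buffer_head and the mapping which backs it. */
struct toy_buffer {
	struct toy_mapping *backing;
};

/* Attach `bh` to `file`'s private list.  Returns 0 on success, or -1 if
 * `bh` is backed by a different mapping than the buffers already there. */
static int toy_mark_buffer_dirty_inode(struct toy_buffer *bh,
				       struct toy_mapping *file)
{
	assert(bh->backing != NULL);
	if (file->assoc_mapping == NULL)
		file->assoc_mapping = bh->backing; /* first buffer decides */
	else if (file->assoc_mapping != bh->backing)
		return -1; /* mixed backing mappings are not allowed */
	file->nr_private++;
	return 0;
}
```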

reiserfs is using b_inode_buffers for attaching dependent buffers to its
journal and that caused a few problems.  Fixed in reiserfs_releasepage.patch

[PATCH] check for dirtying of non-uptodate buffers · 6b9f3b41

Andrew Morton authored May 19, 2002

- Add a debug check to catch people who are marking non-uptodate
  buffers as dirty.

  This is either a source of data corruption, or sloppy programming.

- Fix sloppy programming in ext3 ;)
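
The kind of check meant here can be sketched as below.  How the real check
reports the problem is not stated in this log; the toy version simply refuses
and returns an error.  `toy_buffer` and `toy_mark_buffer_dirty()` are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two buffer state bits involved. */
struct toy_buffer {
	bool uptodate; /* contents are valid */
	bool dirty;    /* contents need writing to disk */
};

/* Returns 0 on success, -1 if the caller tried to dirty a buffer whose
 * contents are not valid.  Writing such a buffer out would put garbage
 * on disk. */
static int toy_mark_buffer_dirty(struct toy_buffer *bh)
{
	assert(bh != NULL);
	if (!bh->uptodate)
		return -1; /* the debug check fires here */
	bh->dirty = true;
	return 0;
}
```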

[PATCH] reduce lock contention in do_pagecache_readahead · cd016d80

Andrew Morton authored May 19, 2002

Anton Blanchard has a workload (the SDET benchmark) which is showing some
moderate lock contention in do_pagecache_readahead().

Seems that SDET has many threads performing seeky reads against a
cached file. The average number of pagecache probes in a single
do_pagecache_readahead() is six, which seems reasonable.

The patch (from Anton) flips the locking around to optimise for the
fast case (page was present). So the kernel takes the lock less often,
and does more work once it has been acquired.

- sound/{core,pci}/*.c · f5737b71
Arnaldo Carvalho de Melo authored May 19, 2002
```
	- fix copy_{to,from}_user error handling (thanks to Rusty for pointing this out)
```
Merge http://linux-isdn.bkbits.net/linux-2.5.make · a252cfb4
Linus Torvalds authored May 18, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```

kbuild: Use $(CURDIR) · 8f3ee280

Kai Germaschewski authored May 19, 2002

Not a big change, but make provides the current directory,
so why not use it ;-)

kbuild: Suppress printing of '$(MAKE) -C command' line · 8202c057

Kai Germaschewski authored May 19, 2002

Don't print the actual command to call make in a subdir, make will
print 'Entering directory <foo>' anyway, so we don't lose that
information.

Small fix for net/irda/Makefile · 56de0f3c

Kai Germaschewski authored May 19, 2002

This Makefile would add irlan/irlan.o to $(obj-m) when selected as
modular, which is wrong. The module will get compiled just fine after
descending into that subdirectory anyway (whereas in the current
directory we have no idea how to build it).

kbuild: Fix object-specific CFLAGS_foo.o · 18a8b891

Kai Germaschewski authored May 19, 2002

Make CFLAGS_foo.o work also when generating preprocessed (.i) and
assembler (.s) files.
  
Same for AFLAGS_foo.o.


Merge http://linux-isdn.bkbits.net/linux-2.5.make · d852a144
Kai Germaschewski authored May 19, 2002
```
into tp1.ruhr-uni-bochum.de:/home/kai/kernel/v2.5/linux-2.5.make
```
Merge http://kernel-acme.bkbits.net:8080/usb-copy_tofrom_user-2.5 · e1b37299
Linus Torvalds authored May 18, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
Merge http://kernel-acme.bkbits.net:8080/isdn-copy_tofrom_user-2.5 · 96f1f1d5
Linus Torvalds authored May 18, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```

drivers/usb/*.c · a465121e

Arnaldo Carvalho de Melo authored May 18, 2002

	- fix copy_{to,from}_user error handling (thanks to Rusty for pointing this out)

drivers/isdn/*.c · 7ccde684

Arnaldo Carvalho de Melo authored May 18, 2002

	- fix copy_{to,from}_user error handling (thanks to Rusty for pointing this out)
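
The message does not spell out what was wrong.  One well-known pitfall with
these helpers, assumed here to be the one in question, is that
copy_to_user() and copy_from_user() return the number of bytes which could
NOT be copied, not an errno, so passing the result straight back to the
caller is a bug.  A userspace sketch, with `toy_copy_to_user()` and both
`toy_ioctl_*()` functions being hypothetical stand-ins:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Userspace stand-in for copy_to_user(): copies at most `ok` bytes and
 * returns the number of bytes NOT copied (0 means full success). */
static unsigned long toy_copy_to_user(void *to, const void *from,
				      unsigned long n, unsigned long ok)
{
	unsigned long done = n < ok ? n : ok;

	memcpy(to, from, done);
	return n - done;
}

/* Wrong: hands a positive byte count back as if it were a status. */
static long toy_ioctl_wrong(void *ubuf, const void *kbuf,
			    unsigned long n, unsigned long ok)
{
	return (long)toy_copy_to_user(ubuf, kbuf, n, ok);
}

/* Right: any shortfall is turned into -EFAULT. */
static long toy_ioctl_right(void *ubuf, const void *kbuf,
			    unsigned long n, unsigned long ok)
{
	if (toy_copy_to_user(ubuf, kbuf, n, ok) != 0)
		return -EFAULT;
	return 0;
}
```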

18 May, 2002 3 commits
- Update /BitKeeper/etc/ignore · 570d5042
  Kai Germaschewski authored May 18, 2002
```
.<object>.flags are gone, but we have .<object>.cmd instead now and
surely don't want to add them to the repository.
```
- Makefile: fix merge · 9abb0c54
  Kai Germaschewski authored May 18, 2002
  
- Merge http://kernel-acme.bkbits.net:8080/oss-copy_tofrom_user-2.5 · 291884c9
  Linus Torvalds authored May 18, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
19 May, 2002 1 commit
- Merge kernel-acme@kernel-acme.bkbits.net:oss-copy_tofrom_user-2.5 · 4724cb02
  Arnaldo Carvalho de Melo authored May 18, 2002
```
into conectiva.com.br:/home/acme/bk/oss-copy_tofrom_user-2.5
```
18 May, 2002 11 commits
- drivers/sound/*.c · 353e8377
  Arnaldo Carvalho de Melo authored May 18, 2002
```
	- fix copy_{to,from}_user error handling (thanks to Rusty for pointing this out)
```
- fs/intermezzo/ext_attr.c · 280067b0
  Arnaldo Carvalho de Melo authored May 18, 2002
```
fs/intermezzo/kml.c
fs/intermezzo/psdev.c

	- fix copy_{to,from}_user error handling (thanks to Rusty for pointing this out)
```
- Merge http://linux-isdn.bkbits.net/linux-2.5.isdn · 3e12a6dc
  Linus Torvalds authored May 18, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
- Manual merge · a06ae2d9
  Linus Torvalds authored May 18, 2002
  
- Merge http://linux-isdn.bkbits.net/linux-2.5.make · 20c6b26d
  Linus Torvalds authored May 18, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
- Merge http://linux-isdn.bkbits.net/linux-2.5.make-next · 67a670bd
  Linus Torvalds authored May 18, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
- Merge http://kernel-acme.bkbits.net:8080/intermezzo-copy_tofrom_user-2.5 · 66ebd50b
  Linus Torvalds authored May 18, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
- Merge tp1.ruhr-uni-bochum.de:/home/kai/kernel/v2.5/linux-2.5 · 977e3db5
  Kai Germaschewski authored May 18, 2002
```
into tp1.ruhr-uni-bochum.de:/home/kai/kernel/v2.5/linux-2.5.make-as
```
- Merge tp1.ruhr-uni-bochum.de:/home/kai/kernel/v2.5/linux-2.5 · dd7728a8
  Kai Germaschewski authored May 18, 2002
```
into tp1.ruhr-uni-bochum.de:/home/kai/kernel/v2.5/linux-2.5.isdn
```
- Merge tp1.ruhr-uni-bochum.de:/home/kai/kernel/v2.5/linux-2.5 · 8876c643
  Kai Germaschewski authored May 18, 2002
```
into tp1.ruhr-uni-bochum.de:/home/kai/kernel/v2.5/linux-2.5.make.next
```
- Merge tp1.ruhr-uni-bochum.de:/home/kai/kernel/v2.5/linux-2.5 · 9ba364a1
  Kai Germaschewski authored May 18, 2002
```
into tp1.ruhr-uni-bochum.de:/home/kai/kernel/v2.5/linux-2.5.make
```