Commits · 7176142ac11b365db1d118412849cbf0a29a02a5 · Kirill Smelkov / linux

12 Apr, 2004 1 commit

[PATCH] speed up ext2 fsync() and fdatasync() · 7176142a

Andrew Morton authored 20 years ago

ext2_sync_file() forgets to clear the inode's dirty bits, so we write the
inode on every fsync(), even if it hasn't changed.

Fix that up via the new sync_file() API which correctly manages the inode
state bits and the superblock inode lists.

When performing file overwrite on IDE with and without writeback caching
enabled this patch approximately doubles fsync() speed, bringing it into line
with O_SYNC writes.

Also, fix up the return value handling in ext2_sync_file().

Credit due to Jeffrey Siegal <jbs@quiotix.com> who noticed the performance
discrepancy and wrote a test app.

7176142a

19 Jan, 2004 1 commit

[PATCH] bdev: switch to f_mapping · 32d66678

Andrew Morton authored 21 years ago

From: viro@parcelfarce.linux.theplanet.co.uk <viro@parcelfarce.linux.theplanet.co.uk>

A lot of places used to use ->f_dentry->d_inode->i_mapping all over the
place. Replaced with use of ->f_mapping. For now - just the places where we
literally could do search-and-replace.

32d66678

01 Oct, 2003 1 commit

[PATCH] dev_t forward compatibility fix · 1885b3f1

Andrew Morton authored 21 years ago

From: Andries.Brouwer@cwi.nl

ext2 used a 32-bit field for dev_t, with possibly undefined storage
following; thus, no action was required to go to 32-bit dev_t, but going to
64-bit dev_t required some subtlety: 0 was written in the first word and
the 64 bits in the following two.  Al truncated my 64-bit stuff to 32 bits
but did not understand why there was this split, and wrote 0 followed by a
single word.  We should at least zero the word following to have
well-defined storage later.
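
The storage layout discussed above can be modelled in userspace roughly as follows. This is only a sketch: `struct raw_inode_model`, `struct dev_model`, `store_rdev()` and `load_rdev()` are hypothetical names, and the bit layout used for the 32-bit encoding is an assumption made for illustration, not a quote of the kernel's helpers.

```c
#include <stdint.h>

/* Hypothetical model of the first three words of an ext2 inode's i_data[]. */
struct raw_inode_model {
    uint32_t i_block[3];
};

/* Hypothetical device number with separate major and minor parts. */
struct dev_model {
    unsigned major;
    unsigned minor;
};

/* A device number fits the old 16-bit format if both parts are below 256. */
static int fits_old_format(struct dev_model d)
{
    return d.major < 256 && d.minor < 256;
}

/* Old format: 8-bit major, 8-bit minor. */
static uint32_t encode_old(struct dev_model d)
{
    return (d.major << 8) | d.minor;
}

/* Assumed 32-bit layout: low minor byte, then major, then the high minor bits. */
static uint32_t encode_new(struct dev_model d)
{
    return (d.minor & 0xff) | (d.major << 8) | ((d.minor & ~0xffu) << 12);
}

static struct dev_model decode_new(uint32_t v)
{
    struct dev_model d;
    d.major = (v & 0xfff00) >> 8;
    d.minor = (v & 0xff) | ((v >> 12) & 0xfff00);
    return d;
}

/*
 * Small numbers go in word 0. Large numbers put 0 in word 0 and the value
 * in word 1; word 2 is zeroed so the storage after the value is well defined.
 */
static void store_rdev(struct raw_inode_model *raw, struct dev_model d)
{
    if (fits_old_format(d)) {
        raw->i_block[0] = encode_old(d);
        raw->i_block[1] = 0;
    } else {
        raw->i_block[0] = 0;
        raw->i_block[1] = encode_new(d);
        raw->i_block[2] = 0;
    }
}

/* A non-zero word 0 means the old format; otherwise decode word 1. */
static struct dev_model load_rdev(const struct raw_inode_model *raw)
{
    struct dev_model d;
    if (raw->i_block[0]) {
        d.major = (raw->i_block[0] >> 8) & 0xff;
        d.minor = raw->i_block[0] & 0xff;
        return d;
    }
    return decode_new(raw->i_block[1]);
}
```

In this model a reader that only understands the old format sees 0 in the first word for any large device number, which is the split the message describes.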

1885b3f1

23 Sep, 2003 1 commit

[PATCH] 32-bit dev_t: switch-over · 1c2c2a8f

Alexander Viro authored 21 years ago

Real conversion to 32bit dev_t.  Expansion to:
	* mknod() - 32
	* newstat() - 32 on 64bit platforms
	* stat64() - 32 on mips, 64 on everything else (mips has weird struct
stat64 and can't get more than 32 bits).  Note that right now the difference
is purely theoretical - we don't have internal values above 32 bits, so
huge_... vs. new_... only marks the places where 64bit conversion will need
extra work.
	* arch-dependent stat variants - depending on width available.
	* ustat et.al. - 32
	* filesystems that can handle 32 bits right now - 32
	* ext2 and ext3 - 32, with large dev_t inodes having 0 in the first
element of i_data[] (where we store dev_t value for small device numbers) and
keeping the value in the second element.
	* nfsd - 32; it can be driven to 64, but we'll get several issues with
NFSv2 support.
	* RAID - 32
	* devmapper - with v1 it's still 16 (nothing to do here), with v4 it's
64.
	* loop - 64
	* initramfs - 32
	* do_mounts code - 32.  Parts that scan devfs tree are using newstat()
on 64bit platforms and stat64() on the rest (IOW, the latest stat variant on
given platform).
	* old_valid_dev()/new_valid_dev() added where needed (stat variants,
mostly - we fail with -EOVERFLOW if values do not fit).

1c2c2a8f

05 Sep, 2003 2 commits

[PATCH] large dev_t - second series (15/15) · a1f6ff21

Alexander Viro authored 21 years ago

old_decode_dev()/old_encode_dev() added where needed in other
filesystems. Parts in different filesystems are independent, but IMO
it's not worth splitting into a dozen half-kilobyte patches.

a1f6ff21

[PATCH] large dev_t - second series (7/15) · ad1da81a

Alexander Viro authored 21 years ago

The last kdev_t object is gone; ->i_rdev switched to dev_t.

ad1da81a

01 Aug, 2003 2 commits

[PATCH] don't init statics to 0 (fs/) · 9cf89014

Randy Dunlap authored 21 years ago

From: Leann Ogasawara <ogasawara@osdl.org>

Remove the explicit zero initializers from static variables so that they are
placed in .bss instead of .data.

9cf89014

[PATCH] direct-io support for XFS unwritten extents · 359a5de1

Andrew Morton authored 21 years ago

From: Nathan Scott <nathans@sgi.com>

This patch adds a mechanism by which a filesystem can register an interest in
the completion of direct I/O. The completion routine will be given the
inode, an offset and a length, and an optional filesystem-private field.

We have extended the use of the buffer_head-based interface (i.e.
get_block_t) for direct I/O such that the b_private field is now utilised.
It is defined to be initially zero at the start of I/O, and will be passed
into the filesystem unmodified by the VFS with each map request, while
setting up the direct I/O. Once I/O has completed the final value of this
pointer will be passed into a filesystem's I/O completion handler. This
mechanism can be used to keep track of all of the mapping requests which
encompass an individual direct I/O request.

This has been implemented specifically for XFS, but is done so as to be as
generic as possible. XFS uses this mechanism to provide support for
unwritten extents - these are file extents which have been pre-allocated
on-disk, but not yet written to (once written, these become regular file
extents, but only once I/O is complete).

359a5de1

25 Jul, 2003 1 commit

Add an IO completion handler to the direct_IO path to allow the initiator · 48d86a41

Stephen Lord authored 21 years ago

to take an action at completion time. XFS uses this to

48d86a41

03 Apr, 2003 1 commit

[PATCH] handle bad inodes in put_inode · 68fa8120

Andrew Morton authored 21 years ago

From: "J. Bruce Fields" <bfields@fieldses.org>

If the NFS daemon is presented with a filehandle for a file that has
been deleted, it does an iget() in fs/exportfs/expfs.c:export_iget() and
gets a bad inode back. When it subsequently iput()s the inode, the
result is:

Mar 27 12:53:40 snoopy kernel: EXT2-fs error (device ide0(3,3)): ext2_free_blocks: Freeing blocks not in datazone - block = 1802201963, count = 27499
Mar 27 12:53:40 snoopy kernel: Remounting filesystem read-only

The same can happen if ext2_get_inode() returns an error - ext2_read_inode()
will return an uninitialised inode and ext2_put_inode() is not allowed to go
looking inside the bad inode.

68fa8120

16 Mar, 2003 1 commit

[PATCH] Ext2/3 noatime and dirsync fixes · 3bdfab20

Andrew Morton authored 21 years ago

Patch from "Theodore Ts'o" <tytso@mit.edu>

I recently noticed a bug in ext2/3; newly created inodes which inherit
the noatime flag from their containing directory do not respect noatime
until the inode is flushed from the inode cache and then re-read later.
This is because the code which checks the ext2 no-atime attribute and
then sets the S_NOATIME in inode->i_flags is present in
ext2_read_inode(), but not in ext2_new_inode().

I fixed this in 2.4, and then found an even worse bug in the 2.5 code;
the DIRSYNC flag is completely ignored *except* in the case where a
directory is newly created using mkdir and its parent directory has the
DIRSYNC flag.  S_DIRSYNC doesn't get set in the ext2_new_inode() or the
ext2_ioctl() paths (which is used by chattr).

This patch centralizes the code which translates the ext2 flags in the
raw ext2 inode to the appropriate flag values in inode->i_flags in a
single location.  This fixes the bug, makes things cleaner, and also
removes 30 lines of code and 128 bytes of compiled x86 text in the
bargain.
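
As a rough illustration of the centralization described above, the following userspace sketch routes every path that changes an inode's on-disk flags through one translation function. All names and flag values here (`DISK_*`, `MEM_*`, `struct inode_model`, `set_inode_flags()`) are hypothetical stand-ins, not the ext2 definitions.

```c
/* Illustrative on-disk flag bits (hypothetical values, not ext2's). */
#define DISK_SYNC_FL    0x01u
#define DISK_NOATIME_FL 0x02u
#define DISK_DIRSYNC_FL 0x04u

/* Illustrative in-memory flag bits (hypothetical values). */
#define MEM_SYNC    0x10u
#define MEM_NOATIME 0x20u
#define MEM_DIRSYNC 0x40u

struct inode_model {
    unsigned disk_flags; /* as stored in the raw inode */
    unsigned i_flags;    /* as consulted by the generic code */
};

/* The single place where on-disk flags become in-memory flags. */
static void set_inode_flags(struct inode_model *inode)
{
    inode->i_flags &= ~(MEM_SYNC | MEM_NOATIME | MEM_DIRSYNC);
    if (inode->disk_flags & DISK_SYNC_FL)
        inode->i_flags |= MEM_SYNC;
    if (inode->disk_flags & DISK_NOATIME_FL)
        inode->i_flags |= MEM_NOATIME;
    if (inode->disk_flags & DISK_DIRSYNC_FL)
        inode->i_flags |= MEM_DIRSYNC;
}

/* Reading an inode from disk. */
static void read_inode_model(struct inode_model *inode, unsigned raw_flags)
{
    inode->disk_flags = raw_flags;
    inode->i_flags = 0;
    set_inode_flags(inode);
}

/* Creating an inode that inherits its parent's flags. */
static void new_inode_model(struct inode_model *inode,
                            const struct inode_model *parent)
{
    inode->disk_flags = parent->disk_flags;
    inode->i_flags = 0;
    set_inode_flags(inode);
}

/* Changing flags at runtime, as chattr would via an ioctl. */
static void ioctl_setflags_model(struct inode_model *inode, unsigned new_flags)
{
    inode->disk_flags = new_flags;
    set_inode_flags(inode);
}
```

Because all three paths share `set_inode_flags()`, a newly created inode in this model honours inherited flags immediately rather than only after being re-read.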

3bdfab20

10 Feb, 2003 1 commit

[PATCH] Fix synchronous writers to wait properly for the result · 8d49bf3f

Andrew Morton authored 22 years ago

Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> points out a bug in
ll_rw_block() usage.

Typical usage is:

	mark_buffer_dirty(bh);
	ll_rw_block(WRITE, 1, &bh);
	wait_on_buffer(bh);

The problem is that if the buffer was locked on entry to this code sequence
(due to in-progress I/O), ll_rw_block() will not wait, and start new I/O.  So
this code will wait on the _old_ I/O, and will then continue execution,
leaving the buffer dirty.

It turns out that all callers were only writing one buffer, and they were all
waiting on that writeout.  So I added a new sync_dirty_buffer() function:

	void sync_dirty_buffer(struct buffer_head *bh)
	{
		lock_buffer(bh);
		if (test_clear_buffer_dirty(bh)) {
			get_bh(bh);
			bh->b_end_io = end_buffer_io_sync;
			submit_bh(WRITE, bh);
		} else {
			unlock_buffer(bh);
		}
	}

which allowed a fair amount of code to be removed, while adding the desired
data-integrity guarantees.

UFS has its own wrappers around ll_rw_block() which got in the way, so this
operation was open-coded in that case.

8d49bf3f

02 Feb, 2003 1 commit

[PATCH] quota semaphore fix · df38988c

Andrew Morton authored 22 years ago

The second quota locking fix.  Sorry, I seem to have misplaced the
changelog.

df38988c

08 Jan, 2003 1 commit

[PATCH] AIO support for raw/O_DIRECT · 08e6749e

Andrew Morton authored 22 years ago

Patch from Badari Pulavarty <pbadari@us.ibm.com> and myself

This patch adds the infrastructure for performing asynchronous (AIO) blockdev
direct-IO.

- Adds generic_file_aio_write_nolock() and make other
  generic_file_*_write() to use it.

- Modify generic_file_direct_IO() and ->direct_IO() functions to take
  "kiocb *" instead of "file *".

- Renames generic_direct_IO() to blockdev_direct_IO().

- Move generic_file_direct_IO() to mm/filemap.c (it is not
  blockdev-specific, whereas the rest of fs/direct-io.c is).

- Add AIO read/write support to the raw driver.

08e6749e

21 Dec, 2002 1 commit

[PATCH] ext2: smarter block allocation startup · 7dcaa802

Andrew Morton authored 22 years ago

The same thing, for ext2.

7dcaa802

14 Dec, 2002 2 commits

[PATCH] ext2 synchronous mount fix · 7cc9ee3d

Andrew Morton authored 22 years ago

The optimisation for synchronous mounts was only correct for S_ISREG
files. Directories do not pass through generic_osync_inode() and we
still need to synchronously write out their indirect blocks.

7cc9ee3d

[PATCH] remove PF_SYNC · 577c516f

Andrew Morton authored 22 years ago

current->flags:PF_SYNC was a hack I added because I didn't want to
change all ->writepage implementations.

It's foul.  And it means that if someone happens to run direct page
reclaim within the context of (say) sys_sync, the writepage invocations
from the VM will be treated as "data integrity" operations, not "memory
cleansing" operations, which would cause latency.

So the patch removes PF_SYNC and adds an extra arg to a_ops->writepage.
It is the `writeback_control' structure which contains the full context
information about why writepage was called.

The initial version of this patch just passed in a bare `int sync', but
the XFS team need more info so they can perform writearound from within
page reclaim.

The patch also adds writeback_control.for_reclaim, so writepage
implementations can inspect that to work out the call context rather
than peeking at current->flags:PF_MEMALLOC.

577c516f

22 Nov, 2002 2 commits

[PATCH] no-buffer-head ext2 option · b1ad1f4e

Andrew Morton authored 22 years ago

Implements a new set of block address_space_operations which will never
attach buffer_heads to file pagecache.  These can be turned on for ext2
with the `nobh' mount option.

During write-intensive testing on a 7G machine, total buffer_head
storage remained below 0.3 megabytes.  And those buffer_heads are
against ZONE_NORMAL pagecache and will be reclaimed by ZONE_NORMAL
memory pressure.

This work is, of course, a special for the huge highmem machines.
Possibly it obsoletes the buffer_heads_over_limit stuff (which doesn't
work terribly well), but that code is simple, and will provide relief
for other filesystems.


It should be noted that the nobh_prepare_write() function and the
PageMappedToDisk() infrastructure is what is needed to solve the
problem of user data corruption when the filesystem which backs a
sparse MAP_SHARED mapping runs out of space.  We can use this code in
filemap_nopage() to ensure that all mapped pages have space allocated
on-disk.  Deliver SIGBUS on ENOSPC.

This will require a new address_space op, I expect.

b1ad1f4e

[PATCH] Remove mapping->vm_writeback · 53bf7bef

Andrew Morton authored 22 years ago

The vm_writeback address_space operation was designed to provide the VM
with a "clustered writeout" capability.  It allowed the filesystem to
perform more intelligent writearound decisions when the VM was trying
to clean a particular page.

I can't say I ever saw any real benefit from this - not much writeout
actually happens on that path - quite a lot of work has gone into
minimising it actually.

The default ->vm_writeback a_op which I provided wrote back the pages
in ->dirty_pages order.  But there is one scenario in which this causes
problems - writing a single 4G file with mem=4G.  We end up with all of
ZONE_NORMAL full of dirty pages, but all writeback effort is against
highmem pages.  (Because there is about 1.5G of dirty memory total).

Net effect: the machine stalls ZONE_NORMAL allocation attempts until
the ->dirty_pages writeback advances onto ZONE_NORMAL pages.

This can be fixed most sweetly with additional radix-tree
infrastructure which will be quite complex.  Later.


So this patch dumps it all, and goes back to using writepage
against individual pages as they come off the LRU.

53bf7bef

17 Nov, 2002 1 commit

[PATCH] nanosecond stat timefields · 5d62665d

Andi Kleen authored 22 years ago

stat64 has been changed to return jiffies granularity as nsec in previously
unused fields. This allows make to make better decisions on when
to recompile a file. Loosely follows the Solaris API.
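
A minimal userspace sketch of how a make-like tool could use the extra precision, assuming a libc that exposes the nanosecond field as `st_mtim` in `struct stat`. The helpers `timespec_cmp()` and `needs_rebuild()` are hypothetical names introduced for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>
#include <time.h>

/* Return <0, 0 or >0 if timestamp a is older than, equal to or newer than b. */
static int timespec_cmp(struct timespec a, struct timespec b)
{
    if (a.tv_sec != b.tv_sec)
        return a.tv_sec < b.tv_sec ? -1 : 1;
    if (a.tv_nsec != b.tv_nsec)
        return a.tv_nsec < b.tv_nsec ? -1 : 1;
    return 0;
}

/*
 * Hypothetical make-style check: returns 1 if the target is missing or
 * older than the source, 0 if it is up to date, -1 if the source cannot
 * be examined.
 */
static int needs_rebuild(const char *source, const char *target)
{
    struct stat src, tgt;

    if (stat(source, &src) != 0)
        return -1;
    if (stat(target, &tgt) != 0)
        return 1;
    return timespec_cmp(src.st_mtim, tgt.st_mtim) > 0;
}
```

With second-only timestamps, a source edited within the same second as the last build compares equal to its target; comparing `tv_nsec` as well lets the check tell the two apart when the filesystem stores it.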

CURRENT_TIME has been redefined to return struct timespec. The users
who don't use it in a inode/attr context have been changed to use a new
get_seconds() function. CURRENT_TIME is implemented by an out-of-line
function.

There is a small performance penalty in this patch. The previous
filemap code had an optimization to flush atime only once a second.
This is currently gone, which will increase flushes a bit. I believe
the correct solution if it should be a problem is to have per super
block fields that give an arbitrary atime flush granularity - so that you
can set it to be only flushed once an hour if you prefer that. I will
work on that later in separate patches if the need should arise.

struct inode and the attr struct have been changed to store struct
timespec instead of time_t for [cma]time. Not all file systems support
this granularity, but some like XFS, NFSv3, CIFS and JFS do. The others will
currently truncate the nsec part on flushing to disk. There was some
discussion on this rounding on l-k previously. I went for simple
truncation because there is not much evidence IMHO that the more
complicated roundings have any advantages. In practice applications will
be rather unlikely to notice the rounding anyway - they can only see a
difference when an inode is flushed from memory and reloaded in less than
a second, which is rather unlikely.

5d62665d

05 Nov, 2002 2 commits

[PATCH] Make ->readpages palatable to NFS · b729e488

Trond Myklebust authored 22 years ago

The following patch makes the ->readpages() address_space_operation
take a struct file argument just like ->readpage().

b729e488

[PATCH] `event' removal: ext2 · 9aefc010

Andrew Morton authored 22 years ago

Patch from Manfred Spraul

Use a local counter instead of the global 'event' variable for the
readdir() optimization.

Depends on patch-event-II

Background:
  The only user of i_version and f_version in ext2 is
  ext2_readdir(). As an optimization, ext2 performs the
  validation of the start position for readdir() only if
        filp->f_version != inode->i_version.
  If there was no llseek and no directory change since the
  last readdir() call, then f_pos can be trusted.
  f_version is set to 0 in get_empty_filp and during llseek.
  Right now, i_version is set to ++event during ext2_read_inode
  and commit_chunk, i.e. at inode creation and if a directory
  is changed.
  Initializing i_version to 1, and updating with i_version++
  achieves the same effect, without the need of a global variable.
  Global uniqueness is not required, there are no other uses
  of [if]_version in ext2.

Change relative to the patch you have right now:
i_version is initialized to 1 instead of 0. For ext2 it doesn't
matter [there is always a valid 'len' value at the beginning of a
directory data block], but it's cleaner.
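
The version-counter scheme in the background notes above can be sketched as a small userspace model. The struct and function names (`dir_model`, `file_model`, `readdir_model()` and so on) are hypothetical; only the comparison of `f_version` against `i_version` comes from the description.

```c
#include <stdbool.h>

/* Stands in for the directory inode. */
struct dir_model {
    unsigned long i_version;
};

/* Stands in for an open file on that directory. */
struct file_model {
    unsigned long f_version;
    long f_pos;
};

/* Per-inode counter starts at 1, so it never matches a fresh file's 0. */
static void dir_init(struct dir_model *d)
{
    d->i_version = 1;
}

/* Any directory change bumps the local counter; no global is needed. */
static void dir_modified(struct dir_model *d)
{
    d->i_version++;
}

static void file_open(struct file_model *f)
{
    f->f_version = 0;
    f->f_pos = 0;
}

/* Seeking invalidates the cached position check. */
static void file_llseek(struct file_model *f, long pos)
{
    f->f_pos = pos;
    f->f_version = 0;
}

/* Returns true if this call had to revalidate f_pos. */
static bool readdir_model(struct file_model *f, const struct dir_model *d)
{
    bool revalidated = false;

    if (f->f_version != d->i_version) {
        /* Real code would walk the block to find a valid entry boundary. */
        revalidated = true;
        f->f_version = d->i_version;
    }
    return revalidated;
}
```

In this model, back-to-back `readdir_model()` calls skip the validation, while a seek or a directory change forces it on the next call.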

9aefc010

30 Oct, 2002 3 commits

Port of (bugfixed) 0.8.50 acl-ext2 to 2.5 · 12538ad0

Theodore Y. Ts'o authored 22 years ago

This patch adds ACL support to the ext2 filesystem.

12538ad0

Port of (bugfixed) 0.8.50 xattr-ext2 to 2.5 (w/ hch cleanups. mbcache API) · 8603affb

Theodore Y. Ts'o authored 22 years ago

This patch adds extended attribute support to the ext2 filesystem. This
uses the generic extended attribute patch which was developed by Andreas
Gruenbacher and the XFS team. As a result, the user space utilities
which work for XFS will also work with these patches.

8603affb

Ext2/3 forward compatibility: inode size · 216114b9

Theodore Y. Ts'o authored 22 years ago

This patch allows filesystems with expanded inodes to be mounted.
(compatibility feature flags will be used to control whether or not the
filesystem should be mounted in case the new inode fields will result in
compatibility issues).  This allows for future compatibility with newer
versions of ext2fs.

216114b9

29 Oct, 2002 1 commit

[PATCH] permit direct IO with finer-than-fs-blocksize alignments · 4a4c6811

Andrew Morton authored 22 years ago

Mainly from Badari Pulavarty

Traditionally we have only supported O_DIRECT I/O at an alignment and
granularity which matches the underlying filesystem.  That typically
means that all IO must be 4k-aligned and a multiple of 4k in size.

Here, we relax that so that direct I/O happens with (typically)
512-byte alignment and multiple-of-512-byte size.

The tricky part is when a write starts and/or ends partway through a
filesystem block which has just been added.  We need to zero out the
parts of that block which lie outside the written region.

We handle that by putting appropriately-sized parts of the ZERO_PAGE
into separate BIOs.

The generic_direct_IO() function has been changed so that the
filesystem must pass in the address of the block_device against which
the IO is to be performed.  I'd have preferred to not do this, but we
do need that info at that time so that alignment checks can be
performed.

If the filesystem passes in a NULL block_device pointer then we fall
back to the old behaviour - must align with the fs blocksize.

There is no trivial way for userspace to know what the minimum
alignment is - it depends on what bdev_hardsect_size() says about the
device.  It is _usually_ 512 bytes, but not always.  This introduces
the risk that someone will develop and test applications which work
fine on their hardware, but will fail on someone else's hardware.

It is possible to query the hardsect size using the BLKSSZGET ioctl
against the backing block device.  This can be performed at runtime or
at application installation time.
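
A minimal sketch of the runtime query mentioned above, assuming a Linux system where `BLKSSZGET` is available from `<linux/fs.h>`. The helper names `query_sector_size()` and `direct_io_aligned()` are hypothetical, and the device path is supplied by the caller.

```c
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns the sector size in bytes reported for the block device, or -1. */
static int query_sector_size(const char *blockdev_path)
{
    int size = 0;
    int fd = open(blockdev_path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (ioctl(fd, BLKSSZGET, &size) != 0)
        size = -1;
    close(fd);
    return size;
}

/* True if the file offset and length are both multiples of the sector size. */
static int direct_io_aligned(long long offset, unsigned long length,
                             int sector_size)
{
    if (sector_size <= 0)
        return 0;
    return offset % sector_size == 0 && length % sector_size == 0;
}
```

An application could call `query_sector_size()` once for the backing device and check each request with `direct_io_aligned()` before issuing it. The user buffer address generally needs similar alignment, which this sketch does not check.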

4a4c6811

12 Oct, 2002 1 commit

Fix warnings of the form · 2a022093

Richard Henderson authored 22 years ago

  warning: long int format, different type arg (arg 5)
by casting ino_t arguments to unsigned long for printf formats.
In some instances, change %ld to %lu.

2a022093

09 Oct, 2002 1 commit

[PATCH] 64-bit sector_t - filesystems · 763fb9a3

Andrew Morton authored 22 years ago

From Peter Chubb

Filesystem migration to possibly 64-bit sector_t:
 - bmap() now takes and returns a sector_t to allow filesystems
   (e.g., JFS, XFS) that are 64-bit clean to deal with large files
 - buffer handling now 64-bit clean

Enable 64-bit sector_t on IA32 and PPC.

kiobufs takes sector_t array, not array of long.
Fix blkmtd.c to deal in such an array.

Miscellaneous fixes for 64-bit sector_t.
 - missed printk formats
 - ide_floppy_do_request had incorrect signature
 - in blkmtd.c there was a pointer used to
   manipulate an array to be used by kiobuf --
   it was unsigned long, needed to be sector_t

763fb9a3

07 Oct, 2002 1 commit

[PATCH] add struct file* to ->direct_IO addr space op · 3a453bd4

Chuck Lever authored 22 years ago

This makes file credentials available to the ->direct_IO address space
operation by replacing its struct inode* argument with a struct file*
argument.  This patch is a prerequisite for NFS direct I/O support.  It
breaks the raw device driver.

3a453bd4

05 Oct, 2002 1 commit

[PATCH] remove write_mapping_buffers() · 4ac833da

Andrew Morton authored 22 years ago

When the global buffer LRU was present, dirty ext2 indirect blocks were
automatically scheduled for writeback alongside their data.

I added write_mapping_buffers() to replace this - the idea was to
schedule the indirects close in time to the scheduling of their data.

It works OK for small-to-medium sized files but for large, linear writes
it doesn't work: the request queue is completely full of file data and
when we later come to scheduling the indirects, their neighbouring data
has already been written.

So writeback of really huge files tends to be a bit seeky.

So. Kill it. Will fix this problem by other means.

4ac833da

19 Sep, 2002 1 commit

[PATCH] clean up argument passing in writeback paths · 967e6864

Andrew Morton authored 22 years ago

The writeback code paths which walk the superblocks and inodes are
getting an increasing number of arguments passed to them.

The patch wraps those args into the new `struct writeback_control',
and uses that instead.  There is no functional change.

The new writeback_control structure is passed down through the
writeback paths in the place where the old `nr_to_write' pointer used
to be.

writeback_control will be used to pass new information up and down the
writeback paths.  Such as whether the writeback should be non-blocking,
and whether queue congestion was encountered.
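
The idea of bundling the arguments into one control structure that carries information in both directions can be sketched as follows. This is a toy model under stated assumptions: `struct wbc_model`, its field names and `writeback_pass()` are hypothetical and only mirror the kinds of information the message mentions (a page budget, a non-blocking request, and a congestion report).

```c
/* Hypothetical stand-in for a writeback control structure. */
struct wbc_model {
    long nr_to_write;           /* in: page budget; out: budget remaining */
    int nonblocking;            /* in: give up rather than wait on a full queue */
    int encountered_congestion; /* out: set if a full queue was hit */
};

/*
 * Toy writeback pass over 'dirty' pages against a request queue that has
 * room for 'queue_room' pages. Returns the number of pages written.
 */
static long writeback_pass(struct wbc_model *wbc, long dirty, long queue_room)
{
    long written = 0;

    while (dirty > 0 && wbc->nr_to_write > 0) {
        if (queue_room == 0) {
            wbc->encountered_congestion = 1;
            if (wbc->nonblocking)
                break;
            queue_room = 1; /* blocking caller: model waiting for a slot */
        }
        queue_room--;
        dirty--;
        written++;
        wbc->nr_to_write--;
    }
    return written;
}
```

Passing one structure down the call chain means a new input or output only needs a new field, rather than a changed signature at every level.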

967e6864

13 Sep, 2002 1 commit

[PATCH] readv/writev speedup · a83638a4

Andrew Morton authored 22 years ago

This is Janet Morgan's patch which converts the readv/writev code
to submit all segments for IO before waiting on them, rather than
submitting each segment separately.

This is a critical performance fix for O_DIRECT reads and writes.
Prior to this change, O_DIRECT vectored IO was forced to wait for
completion against each segment of the iovec rather than submitting all
segments and waiting on the lot.  ie: for ten segments, this code will
be ten times faster.
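
For readers unfamiliar with vectored I/O, here is a small userspace sketch of a multi-segment write issued as one `writev()` call, which is the kind of request this change lets the kernel submit as a whole. The function name `write_three_segments()` and the buffer contents are made up for illustration.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Write three separate buffers with a single writev() call.
 * Returns the number of bytes written, or -1 on error. */
static ssize_t write_three_segments(int fd)
{
    static char a[] = "header ";
    static char b[] = "body ";
    static char c[] = "trailer\n";
    struct iovec iov[3] = {
        { .iov_base = a, .iov_len = strlen(a) },
        { .iov_base = b, .iov_len = strlen(b) },
        { .iov_base = c, .iov_len = strlen(c) },
    };

    return writev(fd, iov, 3);
}
```

From the caller's side nothing changes with the patch; the difference is whether the kernel waits after each of the three segments or only once for all of them.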

There will also be moderate improvements for buffered IO - smaller code
paths, plus writev() only takes i_sem once.

The patch ended up quite large unfortunately - turned out that the only
sane way to implement this without duplicating significant amounts of
code (the generic_file_write() bounds checking, all the O_DIRECT
handling, etc) was to redo generic_file_read() and generic_file_write()
to take an iovec/nr_segs pair rather than `buf, count'.

New exported functions generic_file_readv() and generic_file_writev()
have been added:

ssize_t generic_file_readv(struct file *filp, const struct iovec *iov,
                          unsigned long nr_segs, loff_t *ppos);
ssize_t generic_file_writev(struct file *file, const struct iovec *iov,
                          unsigned long nr_segs, loff_t * ppos);

If a driver does not use these in their file_operations then they will
continue to use the old readv/writev code, which sits in a loop
calling fops->read() or fops->write().

ext2, ext3, JFS and the blockdev driver are currently using this
capability.

Some coding cleanups were made in fs/read_write.c.  Mainly:

- pass "READ" or "WRITE" around to indicate the direction of the
  operation, rather than the (confusing, inverted)
  VERIFY_READ/VERIFY_WRITE.

- Use the identifier `nr_segs' everywhere to indicate the iovec
  length rather than `count', which is often used to indicate the
  number of bytes in the syscall.  It was confusing the heck out of me.

- Some cleanups to the raw driver.

- Some additional generality in fs/direct_io.c: the core `struct dio'
  used to be a "populate-and-go" thing.  Janet has broken that up so
  you can initialise a struct dio once, then loop around feeding it
  more file segments, then wait on completion against everything.

- In a couple of places we needed to handle the situation where we
  knew, a-priori, that the user was going to get a short read or write.
  File size limit exceeded, read past i_size, etc.  We handled that by
  shortening the iovec in-place with iov_shorten().  Which is not
  particularly pretty, but neither were the alternatives.

a83638a4

13 Aug, 2002 1 commit

[PATCH] designated initialisers for ext2 · d001e08a

Andrew Morton authored 22 years ago

Convert ext2 initialisers to c99 format.  From Art Haas.

d001e08a

28 Jul, 2002 1 commit

[PATCH] direct IO updates · 0d85f8bf

Andrew Morton authored 22 years ago

This patch is a performance and correctness update to the direct-IO
code: O_DIRECT and the raw driver.  It mainly affects IO against
blockdevs.

The direct_io code was returning -EINVAL for a filesystem hole.  Change
it to clear the userspace page instead.

There were a few restrictions and weirdnesses wrt blocksize and
alignments.  The code has been reworked so we now lay out maximum-sized
BIOs at any sector alignment.

Because of this, the raw driver has been altered to set the blockdev's
soft blocksize to the minimum possible at open() time.  Typically, 512
bytes.  There are now no performance disadvantages to using small
blocksizes, and this gives the finest possible alignment.

There is no API here for setting or querying the soft blocksize of the
raw driver (there never was, really), which could conceivably be a
problem.  If it is, we can permit BLKBSZSET and BLKBSZGET against the
fd which /dev/raw/rawN returned, but that would require that
blk_ioctl() be exported to modules again.

This code is wickedly quick.  Here's an oprofile of a single 500MHz
PIII reading from four (old) scsi disks (two aic7xxx controllers) via
the raw driver.  Aggregate throughput is 72 megabytes/second:

c013363c 24       0.0896492   __set_page_dirty_buffers
c021b8cc 24       0.0896492   ahc_linux_isr
c012b5dc 25       0.0933846   kmem_cache_free
c014d894 26       0.09712     dio_bio_complete
c01cc78c 26       0.09712     number
c0123bd4 40       0.149415    follow_page
c01eed8c 46       0.171828    end_that_request_first
c01ed410 49       0.183034    blk_recount_segments
c01ed574 65       0.2428      blk_rq_map_sg
c014db38 85       0.317508    do_direct_IO
c021b090 90       0.336185    ahc_linux_run_device_queue
c010bb78 236      0.881551    timer_interrupt
c01052d8 25354    94.707      poll_idle

A testament to the efficiency of the 2.5 block layer.

And against four IDE disks on an HPT374 controller.  Throughput is 120
megabytes/sec:

c01eed8c 80       0.292462    end_that_request_first
c01fe850 87       0.318052    hpt3xx_intrproc
c01ed574 123      0.44966     blk_rq_map_sg
c01f8f10 141      0.515464    ata_select
c014db38 153      0.559333    do_direct_IO
c010bb78 235      0.859107    timer_interrupt
c01f9144 281      1.02727     ata_irq_enable
c01ff990 290      1.06017     udma_pci_init
c01fe878 308      1.12598     hpt3xx_maskproc
c02006f8 379      1.38554     idedisk_do_request
c02356a0 609      2.22637     pci_conf1_read
c01ff8dc 611      2.23368     udma_pci_start
c01ff950 922      3.37062     udma_pci_irq_status
c01f8fac 1002     3.66308     ata_status
c01ff26c 1059     3.87146     ata_start_dma
c01feb70 1141     4.17124     hpt374_udma_stop
c01f9228 3072     11.2305     ata_out_regfile
c01052d8 15193    55.5422     poll_idle

Not so good.

One problem which has been identified with O_DIRECT is the cost of
repeated calls into the mapping's get_block() callback.  Not a big
problem with ext2 but other filesystems have more complex get_block
implementations.

So what I have done is to require that callers of generic_direct_IO()
implement the new `get_blocks()' interface.  This is a small extension
to get_block().  It gets passed another argument which indicates the
maximum number of blocks which should be mapped, and it returns the
number of blocks which it did map in bh_result->b_size.  This allows
the fs to map up to 4G of disk (or of hole) in a single get_block()
invocation.
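
To make the saving concrete, here is a toy userspace model of a mapping callback that can return a run of contiguous blocks per call. It is a sketch under assumptions: `struct extent_model`, `get_blocks_model()` and `count_mapping_calls()` are hypothetical, and the real interface reports the mapped size through `bh_result->b_size` rather than a return value.

```c
/* Hypothetical description of one contiguous run of file blocks on disk. */
struct extent_model {
    unsigned long file_start; /* first file block covered */
    unsigned long disk_start; /* disk block backing file_start */
    unsigned long len;        /* number of blocks in the run */
};

/*
 * Map up to max_blocks starting at file_block. Returns how many contiguous
 * blocks were mapped (0 if unmapped) and stores the first disk block.
 */
static unsigned long get_blocks_model(const struct extent_model *ext, int n_ext,
                                      unsigned long file_block,
                                      unsigned long max_blocks,
                                      unsigned long *disk_block)
{
    for (int i = 0; i < n_ext; i++) {
        const struct extent_model *e = &ext[i];

        if (file_block >= e->file_start &&
            file_block < e->file_start + e->len) {
            unsigned long off = file_block - e->file_start;
            unsigned long avail = e->len - off;

            *disk_block = e->disk_start + off;
            return avail < max_blocks ? avail : max_blocks;
        }
    }
    return 0;
}

/* Count the mapping calls needed to cover 'count' blocks from 'start'. */
static unsigned long count_mapping_calls(const struct extent_model *ext,
                                         int n_ext, unsigned long start,
                                         unsigned long count,
                                         unsigned long per_call_limit)
{
    unsigned long calls = 0;

    while (count > 0) {
        unsigned long disk;
        unsigned long want = count < per_call_limit ? count : per_call_limit;
        unsigned long got = get_blocks_model(ext, n_ext, start, want, &disk);

        calls++;
        if (got == 0)
            break;
        start += got;
        count -= got;
    }
    return calls;
}
```

With a per-call limit of one block the model behaves like a block-at-a-time callback; raising the limit lets each call cover a whole linear chunk, so the call count drops to the number of runs.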

There are some other caveats and requirements of get_blocks() which are
documented in the comment block over fs/direct_io.c:get_more_blocks().

Possibly, get_blocks() will be the 2.6 kernel's way of doing gang block
mapping.  It certainly allows good speedups.  But it doesn't allow the
fs to return a scatter list of blocks - it only understands linear
chunks of disk.  I think that's really all it _should_ do.

I'll let get_blocks() sit for a while and wait for some feedback.  If
it is sufficient and nobody objects too much, I shall convert all
get_block() instances in the kernel to be get_blocks() instances.  And
I'll teach readahead (at least) to use the get_blocks() extension.

Delayed allocate writeback could use get_blocks().  As could
block_prepare_write() for blocksize < PAGE_CACHE_SIZE.  There's no
mileage using it in mpage_writepages() because all our filesystems are
syncalloc, and nobody uses MAP_SHARED for much.

It will be tricky to use get_blocks() for writes, because if a ton of
blocks have been mapped into the file and then something goes wrong,
the kernel needs to either remove those blocks from the file or zero
them out.  The direct_io code zeroes them out.

btw, some time ago you mentioned that some drivers and/or hardware may
get upset if there are multiple simultaneous IOs in progress against
the same block.  Well, the raw driver has always allowed that to
happen.  O_DIRECT writes to blockdevs do as well now.

todo:

1) The driver will probably explode if someone runs BLKBSZSET while
   IO is in progress.  Need to use bdclaim() somewhere.

2) readv() and writev() need to become direct_io-aware.  At present
   we're doing stop-and-wait for each segment when performing
   readv/writev against the raw driver and O_DIRECT blockdevs.

0d85f8bf

15 Jul, 2002 1 commit

[PATCH] 2.5 i_size_high fixup · 884e7cce

Andreas Dilger authored 22 years ago

This patch is a minor fixup to ext2/inode.c to avoid displaying the
high 32 bits of the size for anything other than regular files. For
sockets, pipes, symlinks, etc it doesn't make sense to have a value
larger than 2GB, and this has already been fixed in ext3 and e2fsprogs.

884e7cce

14 Jul, 2002 1 commit

[PATCH] direct-to-BIO for O_DIRECT · 42ec8bc1

Andrew Morton authored 22 years ago

Here's a patch which converts O_DIRECT to go direct-to-BIO, bypassing
the kiovec layer.  It's followed by a patch which converts the raw
driver to use the O_DIRECT engine.

CPU utilisation is about the same as the kiovec-based implementation.
Read and write bandwidth are the same too, for 128k chunks.   But with
one megabyte chunks, this implementation is 20% faster at writing.

I assume this is because the kiobuf-based implementation has to stop
and wait for each 128k chunk, whereas this code streams the entire
request, regardless of its size.

This is with a single (oldish) scsi disk on aic7xxx.  I'd expect the
margin to widen on higher-end hardware which likes to have more
requests in flight.

Question is: what do we want to do with this sucker?  These are the
remaining users of kiovecs:

	drivers/md/lvm-snap.c
	drivers/media/video/video-buf.c
	drivers/mtd/devices/blkmtd.c
	drivers/scsi/sg.c

The video and mtd drivers seem to be fairly easy to de-kiobufize.
I'm aware of one proprietary driver which uses kiobufs.  XFS uses
kiobufs a little bit - just to map the pages.

So with a bit of effort and maintainer-irritation, we can extract
the kiobuf layer from the kernel.

42ec8bc1

12 Jun, 2002 1 commit

[PATCH] ext2_put_inode race fix · 63959896

Andrew Morton authored 22 years ago

Removes the put_inode optimisation.  It's racy, as
Chris pointed out.

63959896

27 May, 2002 3 commits

[PATCH] dirsync · bb772c58

Andrew Morton authored 22 years ago

An implementation of directory-synchronous mounts.

I sent this out some months ago and it didn't generate a lot of
interest.  Later we had one of the usual cheery exchanges with Wietse
Venema (postfix development) and he agreed that directory synchronous
mounts were something that he could use, and that there was benefit in
implementing them in Linux.  If you choose to apply this I'll push the
2.4 patch.



Patch against e2fsprogs-1.26:
        http://www.zip.com.au/~akpm/linux/dirsync/e2fsprogs-1.26.patch

Patch against util-linux-2.11n:
        http://www.zip.com.au/~akpm/linux/dirsync/util-linux-2.11n.patch


The kernel patch includes implementations for ext2 and ext3. It's
pretty simple.

- When dirsync is in operation against a directory, the following operations
  are synchronous within that directory:  create, link, unlink, symlink,
  mkdir, rmdir, mknod, rename (synchronous if either the source or dest
  directory is dirsync).

- dirsync is a subset of sync.  So `mount -o sync' or `chattr +S'
  give you everything which `mount -o dirsync' or `chattr +D' gives,
  plus synchronous file writes.

- ext2's inode.i_attr_flags is unused, and is removed.

- mount /dev/foo /mnt/bar -o dirsync  works as expected.

- An ext2 or ext3 directory tree can be set dirsync with `chattr +D -R'.

- dirsync is maintained as new directories are created under
  a `chattr +D' directory.  Like `chattr +S'.

- Other filesystems can trivially be taught about dirsync.  It's just
  a matter of replacing `IS_SYNC(inode)' with `IS_DIRSYNC(inode)' in
  the directory update functions.  IS_SYNC will still be honoured when
  IS_DIRSYNC is used.

- Non-directory files do not have their dirsync flag propagated.  So
  an S_ISREG file which is created inside a dirsync directory will not
  have its dirsync bit set.  chattr needs to do this as well.

- There was a bit of version skew between e2fsprogs' idea of the
  inode flags and the kernel's.  That is sorted out here.

- `lsattr' shows the dirsync flag as "D".  The letter "D" was
  previously being used for Compressed_Dirty_File.  I changed
  Compressed_Dirty_File to use "Z".  Is that OK?
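
The relationship between the two checks in the list above can be sketched with a small model. The flag values and names (`MODEL_SYNC`, `MODEL_DIRSYNC`, `model_is_sync()`, `model_is_dirsync()`) are hypothetical; the only property taken from the description is that the dirsync test also honours sync.

```c
/* Hypothetical flag bits for the model. */
#define MODEL_SYNC    0x1u /* all updates are synchronous */
#define MODEL_DIRSYNC 0x2u /* directory updates are synchronous */

struct node_model {
    unsigned flags;
    int sync_writes;
    int async_writes;
};

static int model_is_sync(const struct node_model *n)
{
    return (n->flags & MODEL_SYNC) != 0;
}

/* dirsync is a subset of sync: either flag makes directory updates synchronous. */
static int model_is_dirsync(const struct node_model *n)
{
    return (n->flags & (MODEL_SYNC | MODEL_DIRSYNC)) != 0;
}

/* A directory update such as unlink consults the dirsync test. */
static void model_unlink(struct node_model *dir)
{
    if (model_is_dirsync(dir))
        dir->sync_writes++;
    else
        dir->async_writes++;
}

/* A regular file write still consults only the sync test. */
static void model_file_write(struct node_model *file)
{
    if (model_is_sync(file))
        file->sync_writes++;
    else
        file->async_writes++;
}
```

In this model a dirsync-only node gets synchronous directory updates but asynchronous file writes, while a sync node gets both.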

The mount(2) manpage needs to be taught about MS_DIRSYNC.

bb772c58

[PATCH] rename writeback_mapping to writepages · 7d608fac

Andrew Morton authored 22 years ago

Spot the difference:

aops.readpage
aops.readpages
aops.writepage
aops.writeback_mapping

The patch renames `writeback_mapping' to `writepages'

7d608fac

[PATCH] direct-to-BIO writeback · ab9e8941

Andrew Morton authored 22 years ago

Multipage BIO writeout from the pagecache.

It's pretty much the same as multipage reads.  It falls back to buffers
if things got complex.

The write case is a little more complex because it handles pages which
have buffers and pages which do not.  If the page didn't have buffers
this code does not add them.

ab9e8941