- 12 Apr, 2004 1 commit
-
-
Andrew Morton authored
ext2_sync_file() forgets to clear the inode's dirty bits, so we write the inode on every fsync(), even if it hasn't changed. Fix that up via the new sync_file() API which correctly manages the inode state bits and the superblock inode lists. When performing file overwrite on IDE with and without writeback caching enabled this patch approximately doubles fsync() speed, bringing it into line with O_SYNC writes. Also, fix up the return value handling in ext2_sync_file(). Credit due to Jeffrey Siegal <jbs@quiotix.com> who noticed the performance discrepancy and wrote a test app.
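The effect of the missing "clear dirty bits" step can be illustrated with a small userspace model. This is only a sketch: `toy_inode`, `toy_write_inode`, `toy_sync_buggy` and `toy_sync_fixed` are hypothetical names invented for illustration and are not the kernel's structures or the sync_file() API itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an in-core inode; not the kernel's struct inode. */
struct toy_inode {
    bool dirty;        /* set when the inode has changes not yet on disk */
    unsigned writes;   /* counts how many times the inode was written out */
};

/* Pretend to write the inode to disk. */
static void toy_write_inode(struct toy_inode *inode)
{
    inode->writes++;
}

/* Buggy pattern: the inode is written but the dirty bit is never cleared,
 * so every later sync writes it again even though nothing changed. */
static void toy_sync_buggy(struct toy_inode *inode)
{
    assert(inode != NULL);
    if (inode->dirty)
        toy_write_inode(inode);
}

/* Fixed pattern: clear the dirty bit before writing, so an unchanged
 * inode costs nothing on the next sync. */
static void toy_sync_fixed(struct toy_inode *inode)
{
    assert(inode != NULL);
    if (inode->dirty) {
        inode->dirty = false;
        toy_write_inode(inode);
    }
}
```

Calling the buggy variant twice on a dirty inode writes it twice; the fixed variant writes it once.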
-
- 19 Jan, 2004 1 commit
-
-
Andrew Morton authored
From: viro@parcelfarce.linux.theplanet.co.uk <viro@parcelfarce.linux.theplanet.co.uk> A lot of places used to use ->f_dentry->d_inode->i_mapping all over the place. Replaced with use of ->f_mapping. For now - just the places where we literally could do search-and-replace.
-
- 01 Oct, 2003 1 commit
-
-
Andrew Morton authored
From: Andries.Brouwer@cwi.nl ext2 used a 32-bit field for dev_t, with possibly undefined storage following; thus, no action was required to go to 32-bit dev_t, but going to 64-bit dev_t required some subtlety: 0 was written in the first word and the 64 bits in the following two. Al truncated my 64-bit stuff to 32 bits but did not understand why there was this split, and wrote 0 followed by a single word. We should at least zero the word following to have well-defined storage later.
-
- 23 Sep, 2003 1 commit
-
-
Alexander Viro authored
Real conversion to 32bit dev_t. Expansion to:
* mknod() - 32
* newstat() - 32 on 64bit platforms
* stat64() - 32 on mips, 64 on everything else (mips has weird struct stat64 and can't get more than 32 bits). Note that right now the difference is purely theoretical - we don't have internal values above 32 bits, so huge_... vs. new_... only marks the places where 64bit conversion will need extra work.
* arch-dependent stat variants - depending on width available.
* ustat et al. - 32
* filesystems that can handle 32 bits right now - 32
* ext2 and ext3 - 32, with large dev_t inodes having 0 in the first element of i_data[] (where we store dev_t value for small device numbers) and keeping the value in the second element.
* nfsd - 32; it can be driven to 64, but we'll get several issues with NFSv2 support.
* RAID - 32
* devmapper - with v1 it's still 16 (nothing to do here), with v4 it's 64.
* loop - 64
* initramfs - 32
* do_mounts code - 32. Parts that scan devfs tree are using newstat() on 64bit platforms and stat64() on the rest (IOW, the latest stat variant on given platform).
* old_valid_dev()/new_valid_dev() added where needed (stat variants, mostly - we fail with -EOVERFLOW if values do not fit).
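The ext2/ext3 storage rule above (small device numbers in the first element of i_data[], large ones in the second with a 0 in the first) can be sketched in userspace C. The bit layouts used by `sketch_old_encode` and `sketch_new_encode` are assumptions of this sketch, as are all the `sketch_` names; zeroing the third word follows the 01 Oct, 2003 entry about well-defined storage.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed "old" layout: 8-bit major, 8-bit minor. */
static uint32_t sketch_old_encode(unsigned major, unsigned minor)
{
    return (major << 8) | minor;
}

/* Assumed "new" 32-bit layout: 12-bit major, 20-bit minor split around it. */
static uint32_t sketch_new_encode(unsigned major, unsigned minor)
{
    return (minor & 0xff) | (major << 8) | ((minor & ~0xffu) << 12);
}

static int sketch_fits_old(unsigned major, unsigned minor)
{
    return major < 256 && minor < 256;
}

/* Store a device number in the first words of an i_data[]-like array. */
static void sketch_store_dev(uint32_t i_data[3], unsigned major, unsigned minor)
{
    assert(major < 4096 && minor < (1u << 20));
    if (sketch_fits_old(major, minor)) {
        i_data[0] = sketch_old_encode(major, minor);
        i_data[1] = 0;
    } else {
        i_data[0] = 0;                               /* marks "large dev_t" */
        i_data[1] = sketch_new_encode(major, minor);
        i_data[2] = 0;                               /* keep following word defined */
    }
}

/* Read the device number back, choosing the layout by the first word. */
static void sketch_load_dev(const uint32_t i_data[3], unsigned *major, unsigned *minor)
{
    if (i_data[0]) {
        *major = (i_data[0] >> 8) & 0xff;
        *minor = i_data[0] & 0xff;
    } else {
        uint32_t dev = i_data[1];
        *major = (dev & 0xfff00) >> 8;
        *minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
    }
}
```

A small device such as (8, 1) lands in the first word; a device such as (300, 70000) leaves the first word 0 and round-trips through the second.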
-
- 05 Sep, 2003 2 commits
-
-
Alexander Viro authored
old_decode_dev()/old_encode_dev() added where needed in other filesystems. Parts in different filesystems are independent, but IMO it's not worth splitting into a dozen half-kilobyte patches.
-
Alexander Viro authored
the last kdev_t object is gone; ->i_rdev switched to dev_t.
-
- 01 Aug, 2003 2 commits
-
-
Randy Dunlap authored
From: Leann Ogasawara <ogasawara@osdl.org> Uninitialize static variables initialized to 0 so they are pushed to the .bss instead of .data.
-
Andrew Morton authored
From: Nathan Scott <nathans@sgi.com> This patch adds a mechanism by which a filesystem can register an interest in the completion of direct I/O. The completion routine will be given the inode, an offset and a length, and an optional filesystem-private field. We have extended the use of the buffer_head-based interface (i.e. get_block_t) for direct I/O such that the b_private field is now utilised. It is defined to be initially zero at the start of I/O, and will be passed into the filesystem unmodified by the VFS with each map request, while setting up the direct I/O. Once I/O has completed the final value of this pointer will be passed into a filesystem's I/O completion handler. This mechanism can be used to keep track of all of the mapping requests which encompass an individual direct I/O request. This has been implemented specifically for XFS, but is done so as to be as generic as possible. XFS uses this mechanism to provide support for unwritten extents - these are file extents which have been pre-allocated on-disk, but not yet written to (once written, these become regular file extents, but only once I/O is complete).
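The flow described above (a private pointer that starts at zero, is handed to the filesystem on every map request, and is finally passed to the completion routine together with inode, offset and length) might look roughly like this in a userspace model. Every name here (`sketch_dio`, `sketch_map_request`, `sketch_direct_io`, and the toy filesystem callbacks) is hypothetical; the real kernel interfaces are not reproduced.

```c
#include <assert.h>
#include <stddef.h>

/* Completion callback: inode, offset, length, filesystem-private pointer. */
typedef void (*sketch_end_io_t)(void *inode, long long offset, size_t len,
                                void *private_data);

/* Stand-in for the buffer_head handed to the filesystem per map request. */
struct sketch_map_request {
    void *b_private;
};

typedef void (*sketch_get_block_t)(long long block, struct sketch_map_request *req);

struct sketch_dio {
    void *inode;
    long long offset;
    size_t len;
    sketch_end_io_t end_io;
};

/* Set up a direct I/O: b_private starts out NULL and is passed to the
 * filesystem unmodified on each map request; its final value is handed
 * to the completion routine. */
static void sketch_direct_io(struct sketch_dio *dio, long long first_block,
                             unsigned nblocks, sketch_get_block_t get_block)
{
    struct sketch_map_request req = { NULL };
    unsigned i;

    assert(dio != NULL && get_block != NULL);
    for (i = 0; i < nblocks; i++)
        get_block(first_block + i, &req);   /* fs may update req.b_private */

    /* The I/O "completes" here in this model. */
    if (dio->end_io)
        dio->end_io(dio->inode, dio->offset, dio->len, req.b_private);
}

/* Toy filesystem state tracking all map requests of one direct I/O. */
struct sketch_fs_state {
    unsigned mapped;           /* number of map requests seen */
    unsigned seen_at_end_io;   /* value of 'mapped' observed at completion */
};

static struct sketch_fs_state sketch_state;

static void sketch_fs_get_block(long long block, struct sketch_map_request *req)
{
    (void)block;
    if (req->b_private == NULL)          /* first request of this I/O */
        req->b_private = &sketch_state;
    ((struct sketch_fs_state *)req->b_private)->mapped++;
}

static void sketch_fs_end_io(void *inode, long long offset, size_t len,
                             void *private_data)
{
    struct sketch_fs_state *st = private_data;
    (void)inode; (void)offset; (void)len;
    if (st)
        st->seen_at_end_io = st->mapped;
}
```

In this model a four-block request results in four map calls, and the completion handler sees the state that those calls accumulated.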
-
- 25 Jul, 2003 1 commit
-
-
Stephen Lord authored
to take an action at completion time. XFS uses this to
-
- 03 Apr, 2003 1 commit
-
-
Andrew Morton authored
From: "J. Bruce Fields" <bfields@fieldses.org>
If the NFS daemon is presented with a filehandle for a file that has been deleted, it does an iget() in fs/exportfs/expfs.c:export_iget() and gets a bad inode back. When it subsequently iput()s the inode, the result is:
  Mar 27 12:53:40 snoopy kernel: EXT2-fs error (device ide0(3,3)): ext2_free_blocks: Freeing blocks not in datazone - block = 1802201963, count = 27499
  Mar 27 12:53:40 snoopy kernel: Remounting filesystem read-only
The same can happen if ext2_get_inode() returns an error - ext2_read_inode() will return an uninitialised inode and ext2_put_inode() is not allowed to go looking inside the bad inode.
-
- 16 Mar, 2003 1 commit
-
-
Andrew Morton authored
Patch from "Theodore Ts'o" <tytso@mit.edu> I recently noticed a bug in ext2/3; newly created inodes which inherit the noatime flag from their containing directory do not respect noatime until the inode is flushed from the inode cache and then re-read later. This is because the code which checks the ext2 no-atime attribute and then sets the S_NOATIME in inode->i_flags is present in ext2_read_inode(), but not in ext2_new_inode(). I fixed this in 2.4, and then found an even worse bug in the 2.5 code; the DIRSYNC flag is completely ignored *except* in the case where a directory is newly created using mkdir and its parent directory has the DIRSYNC flag. S_DIRSYNC doesn't get set in the ext2_new_inode() or the ext2_ioctl() paths (which is used by chattr). This patch centralizes the code which translates the ext2 flags in the raw ext2 inode to the appropriate flag values in inode->i_flags in a single location. This fixes the bug, makes things cleaner, and also removes 30 lines of code and 128 bytes of compiled x86 text in the bargain.
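The idea of translating the on-disk flags to in-core flags in one place, and calling that from every path that creates, reads or modifies an inode, can be sketched as follows. The `SK_` constants and `sketch_set_inode_flags` are illustrative names, and the numeric values are assumptions of this sketch rather than values taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* On-disk (raw inode) flag bits; values are illustrative. */
#define SK_EXT2_SYNC_FL      0x00000008u
#define SK_EXT2_IMMUTABLE_FL 0x00000010u
#define SK_EXT2_APPEND_FL    0x00000020u
#define SK_EXT2_NOATIME_FL   0x00000080u
#define SK_EXT2_DIRSYNC_FL   0x00010000u

/* In-core inode->i_flags bits; values are illustrative. */
#define SK_S_SYNC      0x01u
#define SK_S_NOATIME   0x02u
#define SK_S_APPEND    0x04u
#define SK_S_IMMUTABLE 0x08u
#define SK_S_DIRSYNC   0x40u

/* Single translation point: called on read, on create and after an ioctl
 * changes the on-disk flags, so no path can forget a flag. */
static uint32_t sketch_set_inode_flags(uint32_t i_flags, uint32_t disk_flags)
{
    i_flags &= ~(SK_S_SYNC | SK_S_APPEND | SK_S_IMMUTABLE |
                 SK_S_NOATIME | SK_S_DIRSYNC);
    if (disk_flags & SK_EXT2_SYNC_FL)
        i_flags |= SK_S_SYNC;
    if (disk_flags & SK_EXT2_APPEND_FL)
        i_flags |= SK_S_APPEND;
    if (disk_flags & SK_EXT2_IMMUTABLE_FL)
        i_flags |= SK_S_IMMUTABLE;
    if (disk_flags & SK_EXT2_NOATIME_FL)
        i_flags |= SK_S_NOATIME;
    if (disk_flags & SK_EXT2_DIRSYNC_FL)
        i_flags |= SK_S_DIRSYNC;

    /* Sanity check: the dirsync bits must agree after translation. */
    assert(!(disk_flags & SK_EXT2_DIRSYNC_FL) == !(i_flags & SK_S_DIRSYNC));
    return i_flags;
}
```

Because stale bits are cleared first, the same helper also handles the case where chattr removes a flag.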
-
- 10 Feb, 2003 1 commit
-
-
Andrew Morton authored
Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> points out a bug in ll_rw_block() usage. Typical usage is:

	mark_buffer_dirty(bh);
	ll_rw_block(WRITE, 1, &bh);
	wait_on_buffer(bh);

the problem is that if the buffer was locked on entry to this code sequence (due to in-progress I/O), ll_rw_block() will not wait, and start new I/O. So this code will wait on the _old_ I/O, and will then continue execution, leaving the buffer dirty. It turns out that all callers were only writing one buffer, and they were all waiting on that writeout. So I added a new sync_dirty_buffer() function:

	void sync_dirty_buffer(struct buffer_head *bh)
	{
		lock_buffer(bh);
		if (test_clear_buffer_dirty(bh)) {
			get_bh(bh);
			bh->b_end_io = end_buffer_io_sync;
			submit_bh(WRITE, bh);
		} else {
			unlock_buffer(bh);
		}
	}

which allowed a fair amount of code to be removed, while adding the desired data-integrity guarantees. UFS has its own wrappers around ll_rw_block() which got in the way, so this operation was open-coded in that case.
-
- 02 Feb, 2003 1 commit
-
-
Andrew Morton authored
The second quota locking fix. Sorry, I seem to have misplaced the changelog.
-
- 08 Jan, 2003 1 commit
-
-
Andrew Morton authored
Patch from Badari Pulavarty <pbadari@us.ibm.com> and myself
This patch adds the infrastructure for performing asynchronous (AIO) blockdev direct-IO.
- Adds generic_file_aio_write_nolock() and makes the other generic_file_*_write() functions use it.
- Modify generic_file_direct_IO() and ->direct_IO() functions to take "kiocb *" instead of "file *".
- Renames generic_direct_IO() to blockdev_direct_IO().
- Move generic_file_direct_IO() to mm/filemap.c (it is not blockdev-specific, whereas the rest of fs/direct-io.c is).
- Add AIO read/write support to the raw driver.
-
- 21 Dec, 2002 1 commit
-
-
Andrew Morton authored
The same thing, for ext2.
-
- 14 Dec, 2002 2 commits
-
-
Andrew Morton authored
The optimisation for synchronous mounts was only correct for S_ISREG files. Directories do not pass through generic_osync_inode() and we still need to synchronously write out their indirect blocks.
-
Andrew Morton authored
current->flags:PF_SYNC was a hack I added because I didn't want to change all ->writepage implementations. It's foul. And it means that if someone happens to run direct page reclaim within the context of (say) sys_sync, the writepage invocations from the VM will be treated as "data integrity" operations, not "memory cleansing" operations, which would cause latency. So the patch removes PF_SYNC and adds an extra arg to a_ops->writepage. It is the `writeback_control' structure which contains the full context information about why writepage was called. The initial version of this patch just passed in a bare `int sync', but the XFS team need more info so they can perform writearound from within page reclaim. The patch also adds writeback_control.for_reclaim, so writepage implementations can inspect that to work out the call context rather than peeking at current->flags:PF_MEMALLOC.
-
- 22 Nov, 2002 2 commits
-
-
Andrew Morton authored
Implements a new set of block address_space_operations which will never attach buffer_heads to file pagecache. These can be turned on for ext2 with the `nobh' mount option. During write-intensive testing on a 7G machine, total buffer_head storage remained below 0.3 megabytes. And those buffer_heads are against ZONE_NORMAL pagecache and will be reclaimed by ZONE_NORMAL memory pressure. This work is, of course, a special for the huge highmem machines. Possibly it obsoletes the buffer_heads_over_limit stuff (which doesn't work terribly well), but that code is simple, and will provide relief for other filesystems. It should be noted that the nobh_prepare_write() function and the PageMappedToDisk() infrastructure is what is needed to solve the problem of user data corruption when the filesystem which backs a sparse MAP_SHARED mapping runs out of space. We can use this code in filemap_nopage() to ensure that all mapped pages have space allocated on-disk. Deliver SIGBUS on ENOSPC. This will require a new address_space op, I expect.
-
Andrew Morton authored
The vm_writeback address_space operation was designed to provide the VM with a "clustered writeout" capability. It allowed the filesystem to perform more intelligent writearound decisions when the VM was trying to clean a particular page. I can't say I ever saw any real benefit from this - not much writeout actually happens on that path - quite a lot of work has gone into minimising it actually. The default ->vm_writeback a_op which I provided wrote back the pages in ->dirty_pages order. But there is one scenario in which this causes problems - writing a single 4G file with mem=4G. We end up with all of ZONE_NORMAL full of dirty pages, but all writeback effort is against highmem pages. (Because there is about 1.5G of dirty memory total). Net effect: the machine stalls ZONE_NORMAL allocation attempts until the ->dirty_pages writeback advances onto ZONE_NORMAL pages. This can be fixed most sweetly with additional radix-tree infrastructure which will be quite complex. Later. So this patch dumps it all, and goes back to using writepage against individual pages as they come off the LRU.
-
- 17 Nov, 2002 1 commit
-
-
Andi Kleen authored
stat64 has been changed to return jiffies granularity as nsec in previously unused fields. This allows make to make better decisions on when to recompile a file. Follows loosely the Solaris API. CURRENT_TIME has been redefined to return struct timespec. The users who don't use it in an inode/attr context have been changed to use a new get_seconds() function. CURRENT_TIME is implemented by an out-of-line function. There is a small performance penalty in this patch. The previous filemap code had an optimization to flush atime only once a second. This is currently gone, which will increase flushes a bit. I believe the correct solution if it should be a problem is to have per super block fields that give an arbitrary atime flush granularity - so that you can set it to be only flushed once an hour if you prefer that. I will work on that later in separate patches if the need should arise. struct inode and the attr struct have been changed to store struct timespec instead of time_t for [cma]time. Not all file systems support this granularity, but some like XFS, NFSv3, CIFS, JFS do. The others will currently truncate the nsec part on flushing to disk. There was some discussion on this rounding on l-k previously. I went for simple truncation because there is not much evidence IMHO that the more complicated roundings have any advantages. In practice applications will be rather unlikely to notice the rounding anyway - they can only see a difference when an inode is flushed from memory and reloaded in less than a second, which is rather unlikely.
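The truncation behaviour described above can be shown with a toy model of a filesystem whose on-disk inode only stores whole seconds. The types and function names (`sketch_timespec`, `sketch_disk_inode`, `sketch_flush`, `sketch_reload`) are hypothetical and stand in for whatever a given filesystem actually does.

```c
#include <assert.h>
#include <stddef.h>

/* In-core timestamp with nanosecond resolution (stand-in for struct timespec). */
struct sketch_timespec {
    long long tv_sec;
    long tv_nsec;
};

/* On-disk inode of a filesystem that only has room for whole seconds. */
struct sketch_disk_inode {
    long long mtime_sec;
};

/* Flushing drops the nanosecond part by simple truncation (no rounding). */
static void sketch_flush(const struct sketch_timespec *mtime,
                         struct sketch_disk_inode *disk)
{
    assert(mtime != NULL && disk != NULL);
    disk->mtime_sec = mtime->tv_sec;
}

/* Reloading from disk yields a timestamp with tv_nsec == 0. */
static struct sketch_timespec sketch_reload(const struct sketch_disk_inode *disk)
{
    struct sketch_timespec ts;
    ts.tv_sec = disk->mtime_sec;
    ts.tv_nsec = 0;
    return ts;
}
```

An application only observes the difference if the inode is evicted and reloaded between two stat calls, which is the case the commit message argues is rare.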
-
- 05 Nov, 2002 2 commits
-
-
Trond Myklebust authored
The following patch makes the ->readpages() address_space_operation take a struct file argument just like ->readpage().
-
Andrew Morton authored
Patch from Manfred Spraul Use a local counter instead of the global 'event' variable for the readdir() optimization. Depends on patch-event-II Background: The only user of i_version and f_version in ext2 is ext2_readdir(). As an optimization, ext2 performs the validation of the start position for readdir() only if filp->f_version != inode->i_version. If there was no llseek and no directory change since the last readdir() call, then f_pos can be trusted. f_version is set to 0 in get_empty_filp and during llseek. Right now, i_version is set to ++event during ext2_read_inode and commit_chunk, i.e. at inode creation and if a directory is changed. Initializing i_version to 1, and updating with i_version++ achieves the same effect, without the need of a global variable. Global uniqueness is not required, there are no other uses of [if]_version in ext2. Change relative to the patch you have right now: i_version is initialized to 1 instead of 0. For ext2 it doesn't matter [there is always a valid 'len' value at the beginning of a directory data block], but it's cleaner.
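The version-counter scheme can be modelled in a few lines. This is a sketch with hypothetical names (`sketch_dir`, `sketch_file`, `sketch_readdir` and friends); the actual validation work that ext2_readdir() performs is reduced to a flag.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_dir {
    unsigned long i_version;   /* bumped on every directory change */
};

struct sketch_file {
    unsigned long f_version;   /* 0 means "f_pos has not been validated" */
    long f_pos;
};

/* Start at 1 so a fresh file (f_version == 0) never matches by accident. */
static void sketch_dir_init(struct sketch_dir *dir)
{
    dir->i_version = 1;
}

static void sketch_dir_change(struct sketch_dir *dir)
{
    dir->i_version++;
}

static void sketch_llseek(struct sketch_file *filp, long pos)
{
    filp->f_pos = pos;
    filp->f_version = 0;       /* the new position must be revalidated */
}

/* Returns 1 if the start position had to be validated, 0 if f_pos was trusted. */
static int sketch_readdir(struct sketch_file *filp, const struct sketch_dir *dir)
{
    int validated = 0;

    assert(filp != NULL && dir != NULL);
    if (filp->f_version != dir->i_version) {
        /* A real implementation would walk the block to a valid entry here. */
        validated = 1;
        filp->f_version = dir->i_version;
    }
    return validated;
}
```

The first readdir() validates, a second one with no intervening change or seek does not, and both a directory change and an llseek force validation again.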
-
- 30 Oct, 2002 3 commits
-
-
Theodore Y. Ts'o authored
This patch adds ACL support to the ext2 filesystem.
-
Theodore Y. Ts'o authored
This patch adds extended attribute support to the ext2 filesystem. This uses the generic extended attribute patch which was developed by Andreas Gruenbacher and the XFS team. As a result, the user space utilities which work for XFS will also work with these patches.
-
Theodore Y. Ts'o authored
This patch allows filesystems with expanded inodes to be mounted. (compatibility feature flags will be used to control whether or not the filesystem should be mounted in case the new inode fields will result in compatibility issues). This allows for future compatibility with newer versions of ext2fs.
-
- 29 Oct, 2002 1 commit
-
-
Andrew Morton authored
Mainly from Badari Pulavarty Traditionally we have only supported O_DIRECT I/O at an alignment and granularity which matches the underlying filesystem. That typically means that all IO must be 4k-aligned and a multiple of 4k in size. Here, we relax that so that direct I/O happens with (typically) 512-byte alignment and multiple-of-512-byte size. The tricky part is when a write starts and/or ends partway through a filesystem block which has just been added. We need to zero out the parts of that block which lie outside the written region. We handle that by putting appropriately-sized parts of the ZERO_PAGE into separate BIOs. The generic_direct_IO() function has been changed so that the filesystem must pass in the address of the block_device against which the IO is to be performed. I'd have preferred to not do this, but we do need that info at that time so that alignment checks can be performed. If the filesystem passes in a NULL block_device pointer then we fall back to the old behaviour - must align with the fs blocksize. There is no trivial way for userspace to know what the minimum alignment is - it depends on what bdev_hardsect_size() says about the device. It is _usually_ 512 bytes, but not always. This introduces the risk that someone will develop and test applications which work fine on their hardware, but will fail on someone else's hardware. It is possible to query the hardsect size using the BLKSSZGET ioctl against the backing block device. This can be performed at runtime or at application installation time.
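From the application side, the advice above amounts to querying the sector size and checking the request against it before issuing direct I/O. The following is a sketch: `sketch_dio_aligned` and `sketch_query_sector_size` are hypothetical helper names, only file offset and length are checked (a real application would also have to consider the alignment of the memory buffer), and the ioctl part is Linux-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns non-zero if both offset and length are multiples of sector_size.
 * sector_size must be a non-zero power of two. */
static int sketch_dio_aligned(uint64_t offset, size_t len, unsigned sector_size)
{
    assert(sector_size != 0 && (sector_size & (sector_size - 1)) == 0);
    return ((offset | (uint64_t)len) & (uint64_t)(sector_size - 1)) == 0;
}

#ifdef __linux__
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>    /* BLKSSZGET */
#include <unistd.h>

/* Ask the block device for its sector size; returns -1 on any failure.
 * The path must name the backing block device, not a file on it. */
static int sketch_query_sector_size(const char *blockdev_path)
{
    int size = -1;
    int fd = open(blockdev_path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (ioctl(fd, BLKSSZGET, &size) < 0)
        size = -1;
    close(fd);
    return size;
}
#endif
```

A request that is fine on a 512-byte device, such as offset 512 and length 1024, fails the check on a device reporting 4096, which is the portability risk the commit message warns about.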
-
- 12 Oct, 2002 1 commit
-
-
Richard Henderson authored
warning: long int format, different type arg (arg 5) by casting ino_t arguments to unsigned long for printf formats. In some instances, change %ld to %lu.
-
- 09 Oct, 2002 1 commit
-
-
Andrew Morton authored
From Peter Chubb
Filesystem migration to possibly 64-bit sector_t:
- bmap() now takes and returns a sector_t to allow filesystems (e.g., JFS, XFS) that are 64-bit clean to deal with large files
- buffer handling now 64-bit clean
Enable 64-bit sector_t on IA32 and PPC. kiobufs takes sector_t array, not array of long. Fix blkmtd.c to deal in such an array.
Miscellaneous fixes for 64-bit sector_t.
- missed printk formats
- ide_floppy_do_request had incorrect signature
- in blkmtd.c there was a pointer used to manipulate an array to be used by kiobuf -- it was unsigned long, needed to be sector_t
-
- 07 Oct, 2002 1 commit
-
-
Chuck Lever authored
This makes file credentials available to the ->direct_IO address space operation by replacing its struct inode* argument with a struct file* argument. This patch is a prerequisite for NFS direct I/O support. It breaks the raw device driver.
-
- 05 Oct, 2002 1 commit
-
-
Andrew Morton authored
When the global buffer LRU was present, dirty ext2 indirect blocks were automatically scheduled for writeback alongside their data. I added write_mapping_buffers() to replace this - the idea was to schedule the indirects close in time to the scheduling of their data. It works OK for small-to-medium sized files but for large, linear writes it doesn't work: the request queue is completely full of file data and when we later come to scheduling the indirects, their neighbouring data has already been written. So writeback of really huge files tends to be a bit seeky. So. Kill it. Will fix this problem by other means.
-
- 19 Sep, 2002 1 commit
-
-
Andrew Morton authored
The writeback code paths which walk the superblocks and inodes are getting an increasing number of arguments passed to them. The patch wraps those args into the new `struct writeback_control', and uses that instead. There is no functional change. The new writeback_control structure is passed down through the writeback paths in the place where the old `nr_to_write' pointer used to be. writeback_control will be used to pass new information up and down the writeback paths, such as whether the writeback should be non-blocking, and whether queue congestion was encountered.
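The pattern of bundling the writeback arguments into one structure that carries information both down (how much to write, whether to block) and back up (whether congestion was hit) can be sketched as below. `sketch_writeback_control` and `sketch_writeback` are hypothetical, and the queue model (a simple count of free request slots) is an assumption made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One structure instead of a growing argument list. */
struct sketch_writeback_control {
    long nr_to_write;            /* in/out: pages still allowed to be written */
    int nonblocking;             /* in: give up rather than wait on a full queue */
    int encountered_congestion;  /* out: set if the queue was found full */
};

/* Write up to wbc->nr_to_write of 'dirty_pages' pages into a queue with
 * 'queue_room' free slots. Returns the number of pages written. */
static long sketch_writeback(struct sketch_writeback_control *wbc,
                             long dirty_pages, long queue_room)
{
    long written = 0;

    assert(wbc != NULL);
    while (dirty_pages > 0 && wbc->nr_to_write > 0) {
        if (queue_room == 0) {
            wbc->encountered_congestion = 1;
            if (wbc->nonblocking)
                break;           /* report congestion to the caller */
            queue_room = 1;      /* blocking caller waits for one free slot */
        }
        queue_room--;
        dirty_pages--;
        wbc->nr_to_write--;
        written++;
    }
    return written;
}
```

A non-blocking caller stops at the first full queue and can read `encountered_congestion` afterwards; a blocking caller carries on until its quota is used up.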
-
- 13 Sep, 2002 1 commit
-
-
Andrew Morton authored
This is Janet Morgan's patch which converts the readv/writev code to submit all segments for IO before waiting on them, rather than submitting each segment separately. This is a critical performance fix for O_DIRECT reads and writes. Prior to this change, O_DIRECT vectored IO was forced to wait for completion against each segment of the iovec rather than submitting all segments and waiting on the lot. ie: for ten segments, this code will be ten times faster. There will also be moderate improvements for buffered IO - smaller code paths, plus writev() only takes i_sem once.
The patch ended up quite large unfortunately - turned out that the only sane way to implement this without duplicating significant amounts of code (the generic_file_write() bounds checking, all the O_DIRECT handling, etc) was to redo generic_file_read() and generic_file_write() to take an iovec/nr_segs pair rather than `buf, count'.
New exported functions generic_file_readv() and generic_file_writev() have been added:

	ssize_t generic_file_readv(struct file *filp, const struct iovec *iov, unsigned long nr_segs, loff_t *ppos);
	ssize_t generic_file_writev(struct file *file, const struct iovec *iov, unsigned long nr_segs, loff_t * ppos);

If a driver does not use these in their file_operations then they will continue to use the old readv/writev code, which sits in a loop calling fops->read() or fops->write(). ext2, ext3, JFS and the blockdev driver are currently using this capability.
Some coding cleanups were made in fs/read_write.c. Mainly:
- pass "READ" or "WRITE" around to indicate the direction of the operation, rather than the (confusing, inverted) VERIFY_READ/VERIFY_WRITE.
- Use the identifier `nr_segs' everywhere to indicate the iovec length rather than `count', which is often used to indicate the number of bytes in the syscall. It was confusing the heck out of me.
- Some cleanups to the raw driver.
- Some additional generality in fs/direct_io.c: the core `struct dio' used to be a "populate-and-go" thing. Janet has broken that up so you can initialise a struct dio once, then loop around feeding it more file segments, then wait on completion against everything.
- In a couple of places we needed to handle the situation where we knew, a-priori, that the user was going to get a short read or write. File size limit exceeded, read past i_size, etc. We handled that by shortening the iovec in-place with iov_shorten(). Which is not particularly pretty, but neither were the alternatives.
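Shortening an iovec in place, as mentioned in the last item, can be illustrated in userspace with the POSIX `struct iovec`. `sketch_iov_shorten` is a hypothetical re-creation written for this example; it is not claimed to match the kernel's iov_shorten() in signature or detail.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>   /* struct iovec */

/* Trim an iovec array in place so that it describes at most 'to' bytes.
 * Returns the number of segments that remain in use. */
static unsigned long sketch_iov_shorten(struct iovec *iov,
                                        unsigned long nr_segs, size_t to)
{
    unsigned long seg = 0;
    size_t len = 0;

    assert(iov != NULL || nr_segs == 0);
    while (seg < nr_segs) {
        seg++;
        if (len + iov->iov_len >= to) {
            iov->iov_len = to - len;   /* cut the segment that crosses 'to' */
            break;
        }
        len += iov->iov_len;
        iov++;
    }
    return seg;
}
```

For segments of 100, 200 and 300 bytes and a limit of 250 bytes, two segments remain and the second is cut to 150 bytes; a limit larger than the total leaves the array unchanged.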
-
- 13 Aug, 2002 1 commit
-
-
Andrew Morton authored
Convert ext2 initialisers to c99 format. From Art Haas.
-
- 28 Jul, 2002 1 commit
-
-
Andrew Morton authored
This patch is a performance and correctness update to the direct-IO code: O_DIRECT and the raw driver. It mainly affects IO against blockdevs. The direct_io code was returning -EINVAL for a filesystem hole. Change it to clear the userspace page instead. There were a few restrictions and weirdnesses wrt blocksize and alignments. The code has been reworked so we now lay out maximum-sized BIOs at any sector alignment. Because of this, the raw driver has been altered to set the blockdev's soft blocksize to the minimum possible at open() time. Typically, 512 bytes. There are now no performance disadvantages to using small blocksizes, and this gives the finest possible alignment. There is no API here for setting or querying the soft blocksize of the raw driver (there never was, really), which could conceivably be a problem. If it is, we can permit BLKBSZSET and BLKBSZGET against the fd which /dev/raw/rawN returned, but that would require that blk_ioctl() be exported to modules again. This code is wickedly quick. Here's an oprofile of a single 500MHz PIII reading from four (old) scsi disks (two aic7xxx controllers) via the raw driver. Aggregate throughput is 72 megabytes/second:

	c013363c 24       0.0896492   __set_page_dirty_buffers
	c021b8cc 24       0.0896492   ahc_linux_isr
	c012b5dc 25       0.0933846   kmem_cache_free
	c014d894 26       0.09712     dio_bio_complete
	c01cc78c 26       0.09712     number
	c0123bd4 40       0.149415    follow_page
	c01eed8c 46       0.171828    end_that_request_first
	c01ed410 49       0.183034    blk_recount_segments
	c01ed574 65       0.2428      blk_rq_map_sg
	c014db38 85       0.317508    do_direct_IO
	c021b090 90       0.336185    ahc_linux_run_device_queue
	c010bb78 236      0.881551    timer_interrupt
	c01052d8 25354    94.707      poll_idle

A testament to the efficiency of the 2.5 block layer. And against four IDE disks on an HPT374 controller.
Throughput is 120 megabytes/sec:

	c01eed8c 80       0.292462    end_that_request_first
	c01fe850 87       0.318052    hpt3xx_intrproc
	c01ed574 123      0.44966     blk_rq_map_sg
	c01f8f10 141      0.515464    ata_select
	c014db38 153      0.559333    do_direct_IO
	c010bb78 235      0.859107    timer_interrupt
	c01f9144 281      1.02727     ata_irq_enable
	c01ff990 290      1.06017     udma_pci_init
	c01fe878 308      1.12598     hpt3xx_maskproc
	c02006f8 379      1.38554     idedisk_do_request
	c02356a0 609      2.22637     pci_conf1_read
	c01ff8dc 611      2.23368     udma_pci_start
	c01ff950 922      3.37062     udma_pci_irq_status
	c01f8fac 1002     3.66308     ata_status
	c01ff26c 1059     3.87146     ata_start_dma
	c01feb70 1141     4.17124     hpt374_udma_stop
	c01f9228 3072     11.2305     ata_out_regfile
	c01052d8 15193    55.5422     poll_idle

Not so good. One problem which has been identified with O_DIRECT is the cost of repeated calls into the mapping's get_block() callback. Not a big problem with ext2 but other filesystems have more complex get_block implementations. So what I have done is to require that callers of generic_direct_IO() implement the new `get_blocks()' interface. This is a small extension to get_block(). It gets passed another argument which indicates the maximum number of blocks which should be mapped, and it returns the number of blocks which it did map in bh_result->b_size. This allows the fs to map up to 4G of disk (or of hole) in a single get_block() invocation. There are some other caveats and requirements of get_blocks() which are documented in the comment block over fs/direct_io.c:get_more_blocks(). Possibly, get_blocks() will be the 2.6 kernel's way of doing gang block mapping. It certainly allows good speedups. But it doesn't allow the fs to return a scatter list of blocks - it only understands linear chunks of disk. I think that's really all it _should_ do. I'll let get_blocks() sit for a while and wait for some feedback. If it is sufficient and nobody objects too much, I shall convert all get_block() instances in the kernel to be get_blocks() instances.
And I'll teach readahead (at least) to use the get_blocks() extension. Delayed allocate writeback could use get_blocks(). As could block_prepare_write() for blocksize < PAGE_CACHE_SIZE. There's no mileage using it in mpage_writepages() because all our filesystems are syncalloc, and nobody uses MAP_SHARED for much. It will be tricky to use get_blocks() for writes, because if a ton of blocks have been mapped into the file and then something goes wrong, the kernel needs to either remove those blocks from the file or zero them out. The direct_io code zeroes them out. btw, some time ago you mentioned that some drivers and/or hardware may get upset if there are multiple simultaneous IOs in progress against the same block. Well, the raw driver has always allowed that to happen. O_DIRECT writes to blockdevs do as well now. todo: 1) The driver will probably explode if someone runs BLKBSZSET while IO is in progress. Need to use bdclaim() somewhere. 2) readv() and writev() need to become direct_io-aware. At present we're doing stop-and-wait for each segment when performing readv/writev against the raw driver and O_DIRECT blockdevs.
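The get_blocks() idea described in this entry (map one linear run of up to a caller-given maximum number of blocks per call, whether that run is disk or hole) can be sketched against a toy block map. The array-based file layout, the `-1` hole marker, and the names `sketch_mapping` and `sketch_get_blocks` are assumptions of this sketch; the real interface reports its result through a buffer_head.

```c
#include <assert.h>
#include <stddef.h>

/* Result of one mapping call: a linear run of blocks. */
struct sketch_mapping {
    long disk_block;          /* first disk block of the run, or -1 for a hole */
    unsigned long nr_blocks;  /* number of file blocks covered by the run */
};

/* block_map[i] holds the disk block backing file block i, or -1 for a hole.
 * Maps at most max_blocks file blocks starting at iblock, stopping where
 * the run stops being contiguous. */
static struct sketch_mapping sketch_get_blocks(const long *block_map,
                                               unsigned long file_blocks,
                                               unsigned long iblock,
                                               unsigned long max_blocks)
{
    struct sketch_mapping m;

    assert(block_map != NULL && iblock < file_blocks && max_blocks > 0);
    m.disk_block = block_map[iblock];
    m.nr_blocks = 1;

    while (m.nr_blocks < max_blocks && iblock + m.nr_blocks < file_blocks) {
        long next = block_map[iblock + m.nr_blocks];
        int contiguous = (m.disk_block < 0)
            ? (next < 0)                                   /* hole continues */
            : (next == m.disk_block + (long)m.nr_blocks);  /* disk run continues */
        if (!contiguous)
            break;
        m.nr_blocks++;
    }
    return m;
}
```

With a file laid out as disk blocks 100, 101, 102, then two holes, then block 200, a single call at block 0 returns a run of three, where a get_block()-style interface would have needed three calls.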
-
- 15 Jul, 2002 1 commit
-
-
Andreas Dilger authored
This patch is a minor fixup to ext2/inode.c to avoid displaying the high 32 bits of the size for anything other than regular files. For sockets, pipes, symlinks, etc it doesn't make sense to have a value larger than 2GB, and this has already been fixed in ext3 and e2fsprogs.
-
- 14 Jul, 2002 1 commit
-
-
Andrew Morton authored
Here's a patch which converts O_DIRECT to go direct-to-BIO, bypassing the kiovec layer. It's followed by a patch which converts the raw driver to use the O_DIRECT engine. CPU utilisation is about the same as the kiovec-based implementation. Read and write bandwidth are the same too, for 128k chunks. But with one megabyte chunks, this implementation is 20% faster at writing. I assume this is because the kiobuf-based implementation has to stop and wait for each 128k chunk, whereas this code streams the entire request, regardless of its size. This is with a single (oldish) scsi disk on aic7xxx. I'd expect the margin to widen on higher-end hardware which likes to have more requests in flight. Question is: what do we want to do with this sucker? These are the remaining users of kiovecs:

	drivers/md/lvm-snap.c
	drivers/media/video/video-buf.c
	drivers/mtd/devices/blkmtd.c
	drivers/scsi/sg.c

The video and mtd drivers seem to be fairly easy to de-kiobufize. I'm aware of one proprietary driver which uses kiobufs. XFS uses kiobufs a little bit - just to map the pages. So with a bit of effort and maintainer-irritation, we can extract the kiobuf layer from the kernel.
-
- 12 Jun, 2002 1 commit
-
-
Andrew Morton authored
Removes the put_inode optimisation. It's racy, as Chris pointed out.
-
- 27 May, 2002 3 commits
-
-
Andrew Morton authored
An implementation of directory-synchronous mounts. I sent this out some months ago and it didn't generate a lot of interest. Later we had one of the usual cheery exchanges with Wietse Venema (postfix development) and he agreed that directory synchronous mounts were something that he could use, and that there was benefit in implementing them in Linux. If you choose to apply this I'll push the 2.4 patch.
Patch against e2fsprogs-1.26: http://www.zip.com.au/~akpm/linux/dirsync/e2fsprogs-1.26.patch
Patch against util-linux-2.11n: http://www.zip.com.au/~akpm/linux/dirsync/util-linux-2.11n.patch
The kernel patch includes implementations for ext2 and ext3. It's pretty simple.
- When dirsync is in operation against a directory, the following operations are synchronous within that directory: create, link, unlink, symlink, mkdir, rmdir, mknod, rename (synchronous if either the source or dest directory is dirsync).
- dirsync is a subset of sync. So `mount -o sync' or `chattr +S' give you everything which `mount -o dirsync' or `chattr +D' gives, plus synchronous file writes.
- ext2's inode.i_attr_flags is unused, and is removed.
- mount /dev/foo /mnt/bar -o dirsync works as expected.
- An ext2 or ext3 directory tree can be set dirsync with `chattr +D -R'.
- dirsync is maintained as new directories are created under a `chattr +D' directory. Like `chattr +S'.
- Other filesystems can trivially be taught about dirsync. It's just a matter of replacing `IS_SYNC(inode)' with `IS_DIRSYNC(inode)' in the directory update functions. IS_SYNC will still be honoured when IS_DIRSYNC is used.
- Non-directory files do not have their dirsync flag propagated. So an S_ISREG file which is created inside a dirsync directory will not have its dirsync bit set. chattr needs to do this as well.
- There was a bit of version skew between e2fsprogs' idea of the inode flags and the kernel's. That is sorted out here.
- `lsattr' shows the dirsync flag as "D". The letter "D" was previously being used for Compressed_Dirty_File. I changed Compressed_Dirty_File to use "Z". Is that OK?
The mount(2) manpage needs to be taught about MS_DIRSYNC.
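The relationship between the two checks ("dirsync is a subset of sync", and IS_SYNC still being honoured where IS_DIRSYNC is tested) plus the inheritance rule for new inodes can be sketched as below. The flag values and the `sketch_` helpers are hypothetical; in particular, having the sync flag inherited by every new inode is an assumption of this sketch based on the "Like chattr +S" remark.

```c
#include <assert.h>

/* Illustrative in-core flag bits. */
#define SK_S_SYNC    0x1u
#define SK_S_DIRSYNC 0x2u

/* Synchronous file writes are wanted only when the sync flag is set. */
static int sketch_is_sync(unsigned flags)
{
    return (flags & SK_S_SYNC) != 0;
}

/* Directory updates are synchronous if either flag is set, so sync
 * implies dirsync but not the other way round. */
static int sketch_is_dirsync(unsigned flags)
{
    return (flags & (SK_S_SYNC | SK_S_DIRSYNC)) != 0;
}

/* Flags for a new inode created inside a parent directory:
 * dirsync propagates to new directories only, never to regular files. */
static unsigned sketch_inherit_flags(unsigned parent_flags, int child_is_dir)
{
    unsigned flags = parent_flags & SK_S_SYNC;   /* assumed: sync is inherited */

    if (child_is_dir)
        flags |= parent_flags & SK_S_DIRSYNC;
    assert((flags & ~(SK_S_SYNC | SK_S_DIRSYNC)) == 0);
    return flags;
}
```

A directory update function in this model would test `sketch_is_dirsync()` where it previously tested `sketch_is_sync()`, and would keep behaving synchronously on sync mounts.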
-
Andrew Morton authored
Spot the difference:

	aops.readpage
	aops.readpages
	aops.writepage
	aops.writeback_mapping

The patch renames `writeback_mapping' to `writepages'.
-
Andrew Morton authored
Multipage BIO writeout from the pagecache. It's pretty much the same as multipage reads. It falls back to buffers if things got complex. The write case is a little more complex because it handles pages which have buffers and pages which do not. If the page didn't have buffers this code does not add them.
-